Tuesday, June 30, 2020

xtabs

The xtabs command lets you build contingency tables (also called two-way tables, cross tabulations, or crosstabs). It is similar to table. Both are depictions of spread that display the frequency distribution of two or more categorical variables; the example further down uses xtabs with three. In the STAT 3473 videos, Kern asserts that xtabs is the easiest way to make a table.

See the table topic for an example of a crosstab.

For two variables, the xtabs command provides an initial output nearly identical to the output of table. It takes several parameters. Of these, you'll nearly always supply the formula and data parameters.

The xtabs command is one of the first ones I encountered with the wacky formula syntax. You introduce columns from a dataset with a tilde (~), and separate columns with the plus sign (+). That's the formula parameter. The data parameter then names the data.frame that contains the columns. Here's the form:

xtabs(~[column1] + [column2] + … + [columnN], data = [data.frame])
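
As a minimal sketch of that form with just two columns, the snippet below builds the same education-by-induced table from the built-in infert dataset with both xtabs and table. The names by_formula and by_table are only illustrative.

```r
# Two-way table via the formula interface
by_formula <- xtabs(~ education + induced, data = infert)

# The same table via table(), naming the dimensions with dnn
by_table <- table(infert$education, infert$induced,
                  dnn = c("education", "induced"))

print(by_formula)

# The cell counts agree; only attributes such as the stored call differ
all(as.vector(by_formula) == as.vector(by_table))
```

The formula version picks up the dimension names from the column names, which is a large part of why it is convenient.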

R Example

This example uses the standard infert dataset. The infert dataset shows infertility data after abortion across a number of characteristics, like education.

xtabs(~education + induced + spontaneous, data = infert)
, , spontaneous = 0

         induced
education  0  1  2
  0-5yrs   2  1  6
  6-11yrs 46 15 10
  12+ yrs 19 29 13

, , spontaneous = 1

         induced
education  0  1  2
  0-5yrs   1  0  0
  6-11yrs 19  9  5
  12+ yrs 27  7  3

, , spontaneous = 2

         induced
education  0  1  2
  0-5yrs   1  1  0
  6-11yrs 13  3  0
  12+ yrs 15  3  0


Separate tables are produced for each value of spontaneous. You can put these together into one table with ftable (flatten table), which I don't go into further in this document.

> tt = xtabs(~education + induced + spontaneous, data = infert)
> ftable(tt)

                  spontaneous  0  1  2
education induced
0-5yrs    0                    2  1  1
          1                    1  0  1
          2                    6  0  0
6-11yrs   0                   46 19 13
          1                   15  9  3
          2                   10  5  0
12+ yrs   0                   19 27 15
          1                   29  7  3
          2                   13  3  0


You would think that xtabs would do the flattening part (and indeed it may). This is one of those commands I'm going to have to look into further.
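
xtabs does not appear to offer a flattening option of its own, but ftable lets you choose which variables run down the side and which run across the top through its row.vars and col.vars arguments. A minimal sketch, rebuilding the three-way table from the example (the names tt and flat are illustrative):

```r
tt <- xtabs(~ education + induced + spontaneous, data = infert)

# Put education down the side and both abortion counts across the top
flat <- ftable(tt, row.vars = "education",
               col.vars = c("induced", "spontaneous"))
print(flat)
```

This gives one row per education level and one column per induced/spontaneous combination, which can be easier to scan than the default layout.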

Sunday, June 28, 2020

table


The table command lets you build contingency tables (also called two-way tables, cross tabulations, or crosstabs). It is similar to xtabs. Both are depictions of spread that display the frequency distribution of two or more categorical variables.

Wikipedia provides this example for a contingency table. Dominant handedness is broken down by gender:

          Handedness
Sex       Right handed   Left handed   Total
Male                43             9      52
Female              44             4      48
Total               87            13     100

The values in the Total row and the Total column are the column and row margin totals.
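
To reproduce those margin totals in R, one option is addmargins, which appends row and column sums to a table. The sketch below types in the four inner counts from the Wikipedia example by hand; the names hand and with_totals are illustrative, and R labels the totals "Sum" rather than "Total".

```r
# Inner counts from the handedness example, entered row by row
hand <- as.table(matrix(
  c(43, 9,
    44, 4),
  nrow = 2, byrow = TRUE,
  dimnames = list(Sex = c("Male", "Female"),
                  Handedness = c("Right handed", "Left handed"))
))

# Append a "Sum" row and a "Sum" column holding the margin totals
with_totals <- addmargins(hand)
print(with_totals)
```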

You see contingency tables everywhere. One of my stats classes focused very heavily on them, pumping nearly all the homework and class examples through Excel.

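The table command takes the objects to cross-classify as its leading arguments, plus a few options (exclude, useNA, dnn, and deparse.level). The row.names and responseName parameters listed on the same help page belong to the as.data.frame method for tables, not to table itself. The example below passes cut(Temp, quantile(Temp)) as the row grouping and Month as the column grouping.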

R Example

This example from the R docs uses the built-in dataset airquality. It takes the Temp column from airquality and creates a contingency table with a row grouping by quartile of temperature values and a column grouping by month number.

with(airquality, table(cut(Temp, quantile(Temp)), Month))

         Month
           5  6  7  8  9
  (56,72] 24  3  0  1 10
  (72,79]  5 15  2  9 10
  (79,85]  1  7 19  7  5
  (85,97]  0  5 10 14  5



The table reveals, for example, that when the temperature is within the 1st quartile of temp values (above 56, up to and including 72), the counts of how often those temperatures occurred in each month from May through September are:

Month       # of times the temperature was between 56 and 72
May         24
June        3
July        0
August      1
September   10
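
Rather than reading that row off the printout, you can index the table by its row label. A small sketch, assuming the bin label prints as "(56,72]" as it does in the output above (the names tab and coolest are illustrative):

```r
tab <- with(airquality, table(cut(Temp, quantile(Temp)), Month))

# Pull out the row for the lowest temperature bin; names are the month numbers
coolest <- tab["(56,72]", ]
print(coolest)

# Total number of days counted in that bin
sum(coolest)
```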

Sunday, June 14, 2020

cor

Correlation coefficient. A measure of the linear relationship between two variables. This is a unitless value.

In the R docs, cor is listed on the same page as var (variance) and cov (covariance).

When you want to gauge how closely a y value tracks an x value, you calculate the correlation (usually represented with 'r').

The value of r ranges from -1 to 1. When r is greater than zero, it indicates a positive correlation, and when less than zero, it indicates a negative correlation. An r of zero means no linear correlation.

Correlation is a linear measure. Therefore it is only meaningful when a plot of the x/y relationship looks roughly linear.
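
A quick sketch of why the linearity caveat matters, using made-up x values rather than real data: straight-line relationships give an r of +1 or -1, while a perfect quadratic relationship over a symmetric range gives an r of essentially zero.

```r
x <- seq(-5, 5, by = 0.5)

# Straight-line relationships: r is +1 for a rising line, -1 for a falling one
cor(x, 2 * x + 1)
cor(x, -3 * x)

# A perfect but non-linear (quadratic) relationship: r is essentially 0
cor(x, x^2)
```

So an r near zero rules out a linear relationship, not a relationship of any kind.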

R Example

> duration = faithful$eruptions   # eruption durations 
> waiting = faithful$waiting      # the waiting period 
> cor(duration, waiting)          # apply the cor function 

[1] 0.90081 

The value of r is nearly 1 (one), a positive correlation.

Friday, June 12, 2020

kurtosis

A measure of shape. Provides a number to represent the relative flatness of a distribution. The kurtosis function in the e1071 package lets you select one of three algorithms through its type parameter (type 3 is the default).


The kurtosis for a normal distribution is 3.0. Several terms are used to define different shape tendencies:

  • leptokurtic - steep spike and heavy tails (g2 positive)
  • platykurtic - flat shape and thin tails (g2 negative)
  • mesokurtic - rounded peak and moderate tails (normal dist) (g2 ~0)

The number that kurtosis returns is the raw kurtosis value minus 3 (that is, a certain amount away from Normal), so a sample from a normal distribution scores close to 0.
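
To make the "minus 3" step concrete, here is a base R sketch that computes kurtosis by hand from the second and fourth central moments. It is meant as an illustration of the moment-based definition (which I understand corresponds to type = 1 in e1071), not as a replacement for the package function; the variable names are illustrative.

```r
set.seed(42)          # fixed seed so the run is repeatable
x <- rnorm(1000)

# Second and fourth central moments of the sample
m2 <- mean((x - mean(x))^2)
m4 <- mean((x - mean(x))^4)

raw_kurtosis    <- m4 / m2^2        # close to 3 for normal data
excess_kurtosis <- raw_kurtosis - 3 # close to 0 for normal data

raw_kurtosis
excess_kurtosis
```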

R Example

> library(e1071)
> x <- rnorm(100)
> kurtosis(x)
[1] -0.7979941

Tuesday, June 9, 2020

skewness

A measure of shape. Provides a number to represent the relative positive or negative skew of a distribution. Perfectly symmetric distributions, such as the normal distribution, have a skewness of 0. Negative numbers indicate left skew, and positive numbers indicate right skew. A skew number less than -1 or greater than 1 is generally considered highly skewed.

I was hoping for a table of values which showed .5, 1.0, 1.5, etc and the degree of extremeness of skew, but didn't find one.
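
The sign convention can be checked with a small base R sketch that computes skewness from the second and third central moments. The helper name skew_by_hand and the sample values are made up for illustration; this is the plain moment-based formula, which I understand corresponds to type = 1 in e1071.

```r
# Moment-based skewness: third central moment over the second to the power 3/2
skew_by_hand <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^(3 / 2)
}

right_tail <- c(1, 2, 2, 3, 3, 3, 10)  # one large value pulls the tail right

skew_by_hand(right_tail)    # positive: right skew
skew_by_hand(-right_tail)   # same size, negative: left skew
skew_by_hand(1:5)           # symmetric values: 0
```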

R Example

> library(e1071)
> x <- rnorm(100)
> skewness(x)
[1] 0.1277998

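rnorm generates random draws from a normal distribution, so your output will differ from the value shown here on each run unless you call set.seed first.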