I'm Learning R: 2020

Saturday, November 21, 2020

coplot

Conditioning plot, AKA shingle plot. Why shingle? It has to do with overlapping results: the ranges of the conditioning variable overlap like shingles on a roof. In this case, 'conditioning' is, apparently, 'under several conditions.' Think of a mosaic, where you have a series of related plots all together in one plot display. You print them separately because, I guess, if you put them on one graph it would be a jumble of data.

I don't know much about this type of plot just yet. Tableau builds these pretty well -- you are taking an xyplot and redrawing it for each range of a third variable. coplot has a ton of options, but you only need the dataset and the three variables (the separate plots are technically extending an x/y relationship into a third dimension).

R Example

It's probably easier to understand what this thing is for if you look at what it does. This plot depicts the locations (latitude against longitude) of earthquakes near Fiji, with one panel for each range of depth.

> coplot(lat ~ long | depth, data = quakes)

Cute, right? What does it mean? Well, if I read this thing right, Fijian quakes near 185, -10 generally occur deeper than do quakes near 165, -10. Or, probably a better way to look at it: given a depth of, say, 300 km (the quakes dataset records depth in km), where are we likely to see a quake? 170, -10?

I have no idea. I'll have to research this for better examples.

In the STAT 3743 video, Kern explains it this way: in the top 'header' (my term) chart, six zones are outlined. Each zone is depicted in one of the shingle plots below. But that still doesn't answer it fully for me. Stay tuned.
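
One way to see where the six zones come from is to ask R for the depth intervals directly. This is only a sketch: it assumes coplot's default settings (six intervals with 50% overlap) and uses co.intervals from the graphics package. The name depth_zones is mine.

```r
# Depth intervals for the conditioning variable
# (assumes coplot's defaults: number = 6, overlap = 0.5).
depth_zones <- co.intervals(quakes$depth, number = 6, overlap = 0.5)
depth_zones

# Count the quakes that fall inside each (overlapping) depth zone.
apply(depth_zones, 1, function(z) {
  sum(quakes$depth >= z[1] & quakes$depth <= z[2])
})
```

Each row is one zone. Consecutive rows overlap, which is the 'shingle' part, so some quakes show up in two neighboring panels.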

Thursday, November 19, 2020

xyplot

An X-Y plot is also known as a scatter plot. You have your basic splash of points that you hope will exhibit some sort of pattern, which you can then use to predict something. You may need to create a fit line as a guide to a probable value of Y given X (and vice versa). Either way, the one you pick is the independent variable, and the one you look up based on the one you picked is the dependent variable (you nearly always have X as the independent variable).

Of course since I noted that it's usually between two variables, I read down in the Wikipedia article where it talks about multivariate scatter plots. Go figure.

You can use scatter plots whenever you have data. Remember though, that extrapolating outside of the cloud of data is fraught with danger. You cannot predict with any accuracy the value of a dependent variable based on an independent variable value which is either less than the smallest, or greater than the greatest value in the plot. Period.

R Example

Use the lattice::xyplot function to create scatter plots. The R docs have xyplot in the same topic as dotplot, barchart, and more. Plenty of options, but you really only need a formula (y ~ x) and the name of a qualifying dataset.

This example creates a simple graph that plots height (as X) against weight (as Y) using the women dataset.

> library(lattice)
> xyplot(weight ~ height, data = women)

There is no men dataset. I guess that would be pointless.
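
To make the fit line and the extrapolation warning concrete, here is a small sketch using base R's lm on the same women dataset. The heights I try (60, 66, and 80 inches) and the variable names are my own picks for illustration.

```r
# Fit a straight line: weight as a function of height.
fit <- lm(weight ~ height, data = women)

# The heights actually covered by the data.
height_range <- range(women$height)

# Candidate heights to predict for (chosen for illustration only).
new_heights <- data.frame(height = c(60, 66, 80))
inside <- new_heights$height >= height_range[1] &
  new_heights$height <= height_range[2]

# Predictions, flagged by whether they sit inside the cloud of data.
predictions <- cbind(
  new_heights,
  predicted_weight = round(predict(fit, newdata = new_heights), 1),
  inside_data = inside
)
predictions
```

The model happily returns a number for 80 inches, but the inside_data flag is FALSE for that row: nothing in the data backs that prediction up.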

Thursday, November 12, 2020

spineplot

They say that a spine plot is a special case of the mosaic, and a generalization of a stacked bar graph. Each vertical bar (I'll call it a 1st order category) is segmented along its length according to the relative proportions of 2nd order categories. In addition, each bar has a width which represents the proportion of its category among the several 1st order categories being examined. These are all similar to the mosaic. Unlike the mosaics I've seen, spine plots also feature a numeric scale on the y axis.

Even though there are some scholarly papers out there that show how to use a spine plot for more than just two variables, I'm going to go out here on a limb and say use a spine for just two.

When to use a spine plot instead of a bar graph? I think it depends on how you want to make your case. A bar graph is usually used to compare relative quantities or proportions among similar or related categories. I often have a dozen categories along the x axis, with each bar segmented along the y axis with random segments. That is, each bar doesn't need to have the same number of segments. Spine plots will involve few 'bars,' and each will have the same number of divisions. (At least, this is what I'm going with now until I find out different later.)

R Example

This example uses the UCBAdmissions dataset to show the distribution of the genders as applicants for admission to UC Berkeley.

spineplot(xtabs(Freq ~ Admit + Gender, data = UCBAdmissions))

The plot shows:

more people were rejected than accepted -- the Rejected bar is wider than the Admitted. We don't know how much, but it eyeballs to about 2:3, or 60% rejected.
of those rejected, a somewhat higher percentage were male than female (about 55% to 45%).
of those accepted, a higher percentage were male than female.

Okay, that's nice. The chart makes it seem that men are overly represented. But notice that we have no information on individual totals.

If we take the same data, and plot it with barplot, we can influence people in a completely different way:

barplot(xtabs(Freq ~ Admit + Gender, data = UCBAdmissions))

Okay, that's quite different. Though many more males are admitted, many more males applied. Further, in these pooled totals a greater proportion of the males who applied were admitted (roughly 45%, against roughly 30% of the females). (Please don't try to read into this data why more males applied to Berkeley.)

If you wanted to show how skewed the admittance membership is, you would use a spine plot. If you wanted to show how skewed the admittance preferences are, you would use a stacked bar chart.
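
Since neither plot prints the actual totals, it can help to pull the numbers behind both of them. This is a sketch using base R's margin.table and prop.table; the variable names are mine.

```r
# Collapse the 3-way UCBAdmissions table over Dept: Admit x Gender counts.
admit_gender <- margin.table(UCBAdmissions, c(1, 2))
addmargins(admit_gender)

# Share of each gender within Admitted and within Rejected
# (the segments inside each spine plot bar).
share_within_admit <- prop.table(admit_gender, margin = 1)
round(share_within_admit, 3)

# Admission rate within each gender
# (the segments inside each stacked bar).
rate_within_gender <- prop.table(admit_gender, margin = 2)
round(rate_within_gender, 3)
```

The first proportion table answers 'who makes up the admitted group?', and the second answers 'what fraction of each gender got in?'. Same counts, two different stories.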

Welcome to descriptive statistics.

Thursday, October 29, 2020

mosaicplot

Mosaic plots let you view the relationship between two or more categorical variables. That's the official, universal definition. I have a lot of confusion between mosaicplot and spineplot. The docs stress that while mosaic plots let you see more than two variables, spine plots are limited to just two.

If you research mosaic plots on the web, you will invariably find the Titanic survival analysis at the top of the heap. Mosaic plots are all about proportional boxes representing related categories, so when you read a mosaic plot, you find yourself drilling down into it, reading it X-Y-X-Y to discover what you want. You'll see what I mean in the R example below.

R Example

Let's just get on with it. This example creates the default mosaic plot for the HairEyeColor dataset.

> mosaicplot(HairEyeColor)

It's all about Hair color (x axis) and Eye color (y axis). Eye color is subcategorized into four colors. Hair is also subcategorized into four colors, and also further subdivided by gender. The width of a segment (in any direction) indicates its relative proportion of the total (in the same direction).

As you look at the plot, you can see that people with blond hair are more likely to have blue eyes, and that a difference in gender does not affect a tendency towards eye color when the hair color is the same. So in this example, there are three variables being compared: eye color, hair color, and gender.

I see two challenges with a mosaic plot. The first is that your audience needs some experience interpreting one before you present it. You'll be spending a lot of time explaining it otherwise. The other issue is that two people can come to different and equally valid conclusions, depending on the mosaic distribution (this hair/eye example doesn't really show that -- you just need to remember to examine all angles of the graph for contradictions if you plan on using it to support something it seems to show).
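
One way to double-check what a mosaic seems to show is to look at the proportions behind the boxes. This is a sketch with base R's margin.table and prop.table on the same HairEyeColor data; the variable names are mine.

```r
# Collapse over Sex to get a Hair x Eye table of counts.
hair_eye <- margin.table(HairEyeColor, c(1, 2))

# Within each hair color, the share of each eye color.
eye_share <- prop.table(hair_eye, margin = 1)
round(eye_share, 2)

# Most common eye color for each hair color.
most_common_eye <- apply(hair_eye, 1, function(row) names(which.max(row)))
most_common_eye
```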

Friday, October 16, 2020

bwplot

Available with the lattice package, bwplot (box and whisker plot) is one of what lattice calls its common bivariate trellis plots. It displays multiple box plots based on a category in the dataset. Lattice describes bwplot on the same page as several other plots, so you have to go down the page to find info on it. It has over 20 options, but you only need a formula (the numeric variable and the grouping variable) and the dataset name.

Plot comparisons are low in detail, but allow you to compare many items at a glance. In the case of bwplot, you are comparing the IQR of a dataset grouped by some category.

R Example

This example depicts the distribution of chicken weights, grouped by the type of feed. A comparison lets you judge which feed might be optimal to provide your hens.

> library(lattice)
> bwplot(~weight | feed, data = chickwts)

Looking at this, we might want to use sunflower, because it has the most predictable value. Or we might choose casein, because you may want to produce chickens of greater variance in size and weight.
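
To put numbers next to the eyeball comparison, here is a sketch that computes the median and IQR of weight for each feed with base R. The names by_feed and feed_summary are mine.

```r
# Split the weights into one vector per feed type.
by_feed <- split(chickwts$weight, chickwts$feed)

# Median and interquartile range for each feed.
feed_summary <- data.frame(
  median = sapply(by_feed, median),
  iqr    = sapply(by_feed, IQR)
)

# Tightest spread first.
feed_summary[order(feed_summary$iqr), ]
```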

Thursday, October 8, 2020

boxplot

Boxplots are a fast way to judge spread, shape, and center. Also called box and whisker plots, they depict the five number summary in a structured shape.

Use a boxplot when you want a visual summary of the data set, and you don't have (or don't want to use) many data points. I think the main negative thing about boxplots is that you may need to educate people what they are seeing, as opposed to a distribution or bar chart, which are ubiquitous in public culture.

The boxplot function has a pile of options, but you only need a formula (here, count by spray) and the data to show something.

> boxplot(count ~ spray, data = InsectSprays, horizontal = TRUE)

Values are represented by the lines around each box.

• min - leftmost whisker. Insecticides A and B have min values about 7.

• IQR - the box shape. Insecticide F has the widest IQR, and insecticide D has the narrowest.

• median - the heavy line within the box

• max - the rightmost whisker, which marks the largest value that is not an outlier. Insecticide D has a whisker that ends just past the upper edge of its box

• outliers - circles outside a box (C and D in this plot both have an outlier)
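
Reading values off the plot is approximate, so here is a sketch that asks R for the numbers it drew. It uses boxplot.stats from base R (grDevices); the row labels and variable names are mine.

```r
# One vector of counts per insecticide.
by_spray <- split(InsectSprays$count, InsectSprays$spray)

# The five values drawn for each box: whisker ends, box edges, median.
five_num <- sapply(by_spray, function(x) boxplot.stats(x)$stats)
rownames(five_num) <- c("lower whisker", "lower hinge", "median",
                        "upper hinge", "upper whisker")
five_num

# Points drawn as circles (beyond 1.5 box lengths from the box).
outliers <- lapply(by_spray, function(x) boxplot.stats(x)$out)
outliers
```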

Monday, September 14, 2020

stem

A stem and leaf plot is similar to a histogram. It displays numeric values in an ordered distribution, so you can gauge its shape. It lies rotated 90° from a typical histogram. A vertical line separates the tens (or hundreds) place from the units. The tens (or hundreds) ascend from top to bottom, and the units ascend from left to right.

Unlike a histogram, a stem and leaf plot preserves the value of each data point within the graph.

Note: there is also a stemplot, which involves plotting a matrix of y values along an x axis. I don't know much about stemplots at this time.

R has the built in stem command to create stem and leaf. It only requires a vector of values. It has a couple of options, the most useful one being scale, which controls the plot length. The aplpack package also provides stem.leaf, which has a pile of options that can help you trim outliers, create back-to-back stemplots, and more. Worth a look if you are going to be using stem and leaf a lot.

I remember going through many steps to create stem and leafs in my first statistics class. R makes it trivial.

R Example

I'm basing this on the topic in Kern's IPSUR book. It uses the rivers dataset.

> stem(rivers)

The decimal point is 2 digit(s) to the right of the |

0 | 4
2 | 011223334555566667778888899900001111223333344455555666688888999
4 | 111222333445566779001233344567
6 | 000112233578012234468
8 | 045790018
10 | 04507
12 | 1471
14 | 56
16 | 7
18 | 9
20 |
22 | 25
24 | 3
26 |
28 |
30 |
32 |
34 |
36 | 1

This plot is right skewed.

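Read a data point by combining the stem on the left side of the vertical line with one digit from the right side. Each digit on the right is a separate data point. Here the decimal point is two digits to the right of the line, so the stems count hundreds and the leaves count tens (the values are rounded to fit). Because the stems go up by two, each row covers two hundreds: the 2 row holds values from 200 to 399. There is one value below 200, shown as 0 | 4, which is the shortest river. The row 22 | 25 holds two rivers between 2200 and 2399 miles, and 36 | 1 is the single longest river, somewhere between 3600 and 3799 miles. And so on.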
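
To check a reading of the plot against the real values, you can look at the raw numbers and stretch the plot out with the scale option. A sketch, with variable names of my own:

```r
# The actual shortest and longest river lengths.
river_range <- range(rivers)
river_range

# The five shortest and five longest rivers, for comparison with the plot.
head(sort(rivers), 5)
tail(sort(rivers), 5)

# A longer plot: scale = 2 roughly doubles the number of rows.
stem(rivers, scale = 2)
```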

Friday, September 11, 2020

pareto

Pronounced pa RAY toe. A pareto diagram is similar to a bar chart, in that it depicts categorical data. Unlike a bar chart, it orders the bars in decreasing size. This scheme not only makes it easy for you to see which categories are most important, it also lets you compare their relative sizes quickly.

Often pareto diagrams include an ogive line, which tracks the cumulative number of values across the bars. With the addition of an ogive, you can quickly estimate when a given percentage of the total data has been allocated to the major categories.

You need to include the qcc library to get pareto chart functionality.

R Example

This example is similar to the barplot example. It shows the number of states within each of several regions, along with the cumulative count. Along with the chart, the table of values is emitted by RStudio.

library(qcc)
pareto.chart(table(state.division), ylab = "Frequency")

Pareto chart analysis for table(state.division)

                     Frequency Cum.Freq. Percentage Cum.Percent.
  South Atlantic             8         8         16           16
  Mountain                   8        16         16           32
  West North Central         7        23         14           46
  New England                6        29         12           58
  East North Central         5        34         10           68
  Pacific                    5        39         10           78
  East South Central         4        43          8           86
  West South Central         4        47          8           94
  Middle Atlantic            3        50          6          100

Cumulative number of states within several regions
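
If you don't have qcc installed, the table part is easy to rebuild with base R. This is a sketch of the same numbers, not the qcc implementation; the column and variable names are mine.

```r
# Count states per division, largest first.
counts <- sort(table(state.division), decreasing = TRUE)
freq <- as.vector(counts)

# Frequency, running total, and the same two columns as percentages.
pareto_table <- data.frame(
  Frequency   = freq,
  Cum.Freq    = cumsum(freq),
  Percentage  = 100 * freq / sum(freq),
  Cum.Percent = 100 * cumsum(freq) / sum(freq),
  row.names   = names(counts)
)
pareto_table
```

The Cum.Percent column is what the ogive line traces across the bars.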

Tuesday, September 8, 2020

dotchart

The dotchart command provides a dot plot, a simple representation of frequency. It is part of the graphics package.

As with any dot plot, advantages include a graph you can quickly make by hand. Outliers are easy to identify, and you can gauge clusters, gaps, and histogram-like shapes. Most notably, dot plots preserve the numeric value of each data point (something lost with a bar chart, for example). See Dot plot (statistics) in Wikipedia.

Its use is limited to data sets of relatively few data points. Too many points and the graph becomes unreadable.

dotchart has 20 parameters, but the only one you must have is x -- a vector or matrix (table).

dotchart is a Variation on Bar Chart

After poking around with this a bit, I discovered that dotchart is a specialized dot plot function. The R documentation lists this as Cleveland's Dot Plot. In Wikipedia, I learn that a Cleveland Dot Plot is "…an alternative to the bar chart, in which dots are used to depict the quantitative values (e.g. counts) associated with categorical variables." These are supposed to be easier to read and interpret than bar charts.

R Example

This first example plots passenger miles flown on US airlines, one dot per year. The airmiles dataset ships with base R, so no extra library is needed.

> dotchart(airmiles)

Passenger miles on US airlines between 1937 and 1960

This example shows just one dot per category. The airmiles dataset lists passenger miles on US airlines between 1937 and 1960. In the structure, though, the date years are represented as Time-Series [1:24], so there's no hard-wired column of years.

I was initially confused with dotchart, because the examples didn't show me how to label the dots. (The plot function ekes the years out automatically.) I found that labeling the points in dotchart requires creating a separate vector of labels to pair with the airmiles data. There are also dotchart2 and dotchart3 (in the Hmisc package), which may handle time series years more elegantly. So I'm going with just another example:

dotchart(mtcars$mpg,labels=row.names(mtcars),cex=.7,
main="Gas Mileage for Car Models",
xlab="Miles Per Gallon")

MPG for Common Cars

I think this version is easier to read than a barchart, for sure. If you want labels on the points of a regular plot instead, the calibrate::textxy function allows you to do this.
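
Going back to the airmiles labeling problem: one way to build the missing labels is to pull the years out of the time series with time(). This is a sketch of that idea; the years variable and the title text are mine.

```r
# Extract the year for each observation from the time series.
years <- as.vector(time(airmiles))

# Plot the values as plain numbers, labeled with their years.
dotchart(as.numeric(airmiles), labels = years, cex = 0.7,
         main = "Passenger Miles on US Airlines",
         xlab = "Passenger miles")
```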

Friday, September 4, 2020

barplot

Bar graphs or bar charts are created using barplot, and are frequency comparisons for categorical data (rather than numeric occurrences of a single variable as with histograms). Heights of bars are proportional to the frequency of occurrence of each category.

Use a bar graph when you want to compare the incidence of different but similar things. Examples: survival of different mammals on a flooded island, or the number of paper subscriptions in several city neighborhoods.

Bar graphs can be single bars per category. They can also be "doubled-up" -- two or more bars in the same category turning on a split in the data. An example might be paper subscriptions in city neighborhoods, split by gender. Bar graphs may also be "stacked," with each bar divided into a number of differently colored segments representing random characteristics (that is, characteristics which may not appear in every bar in the chart). For example, paper subscriptions per neighborhood, with segments for morning, evening, and Sunday only subscriptions. In some neighborhoods there may be no Sunday only subscriptions, while in another neighborhood, everybody gets the morning edition.

R Example

This example plots the number of states which exist in the four main regions of the U.S.

barplot(table(state.region))

Number of States in Four U.S. Regions
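
The "doubled-up" and "stacked" variations described above both come from handing barplot a two-way table instead of a one-way table. Here is a sketch using the built-in Titanic data (my choice of dataset, since it has a natural split); the variable names are mine.

```r
# Survived (No/Yes) by passenger Class, collapsed over Sex and Age.
survived_by_class <- margin.table(Titanic, c(4, 1))
survived_by_class

# Stacked: one bar per class, segmented by survival.
mid_stacked <- barplot(survived_by_class, legend.text = TRUE,
                       main = "Stacked")

# Doubled-up: two side-by-side bars per class.
mid_grouped <- barplot(survived_by_class, beside = TRUE,
                       legend.text = TRUE, main = "Side by side")
```

barplot returns the bar midpoints, which is handy if you later want to add text above the bars.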

Sunday, August 23, 2020

plot (Index plot)

This is a two dimensional plot where the index (such as time) is the x variable, and the measured value is the y variable. Use an index plot when you want to plot ordered data against an increasing scale. Kern assumes that the increasing scale will be time.

You can use the generic plot function to create an index plot (RcmdrMisc::indexplot is also available). The plot function offers 9 ways to display the data, from points to steps.

R Example

This example shows the changes in the level of Lake Huron from 1875 to 1972, using a single line ("l") display.

plot(LakeHuron, type = "l")

Reference

STAT 3743, Kern

Thursday, August 20, 2020

hist

Create a histogram. The hist function is in the graphics package, which ships with base R. With hist, the only param you need (out of about 20) is a numeric vector.

There are other histogram functions available from other packages (lattice::histogram, for example). Regardless of which you choose, the advantages and disadvantages are the same.

Use a histogram when you want to view the frequency of occurrence of something across a set of sequential bins. Another way of saying it: a histogram shows the distribution of a single variable. For example, silt depth from shore to 3', from 3' to 6', 6' to 9', and so on. It is distinguished from a bar graph, which compares counts across categories.

Histogram Weakness

You must take care determining the bin size for your histogram. A bin size that is less than optimal can provide a misleading depiction. With hist, experiment with the breaks parameter to adjust the graph for the best information display for your purposes.

R Example

This example shows precipitation for each of 70 U.S. + Puerto Rico cities.

hist(precip, breaks = 25)

The breaks parameter asks hist for about 25 bins. Note that if you count the bins, you don't get 25: a single number for breaks is only a suggestion, and R picks nearby 'pretty' break points instead. Anyway, 4 cities fall in the lowest bin (under 10 inches of rainfall), and one city had rainfall over 65 inches.

You can have hist show frequency or density (freq = FALSE). It just changes the left hand scale.
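
To see what hist actually did with breaks = 25, you can run it with plot = FALSE and inspect the result. The second call below is a sketch of how to force exact bins by passing a vector of break points; the 5-inch bin width and the variable names are my choices.

```r
# Ask for "about 25" bins, but don't draw anything.
h <- hist(precip, breaks = 25, plot = FALSE)
length(h$counts)   # the number of bins R actually used
h$breaks           # the break points it picked

# Force exact 5-inch bins from 0 to 70 by giving the break points.
h_exact <- hist(precip, breaks = seq(0, 70, by = 5), plot = FALSE)
bin_table <- data.frame(
  from   = head(h_exact$breaks, -1),
  to     = h_exact$breaks[-1],
  cities = h_exact$counts
)
bin_table
```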

Monday, August 17, 2020

stripchart (dot plot)

Also known as strip chart and strip plot. Call this function in the form:

stripchart(quantity ~ category, method = "stack", data = [data.frame])

stripchart produces a one-dimensional scatter plot on discrete, continuous, and univariate data. Data points are plotted above a single numeric scale. You can create dot plots by hand. Use them when the data set is small, and you want to identify outliers and clusters of values.

While the only required parameter for stripchart is the data to plot (a numeric vector, a list of vectors, or a formula with a data frame), optional parameters let you set horizontal or vertical orientation, group data in categories, and add "jitter", to help keep values from superimposing.

R Example

This example plots the values from the standard rivers dataset, the lengths of 141 major rivers in North America.

stripchart(rivers, method = "jitter", xlab = "length")

Dot plot of rivers dataset

Thursday, July 23, 2020

On to Basic Graphs

In the previous posts, I've focused on the typical commands you use to report and analyze numeric and tabular data. For the next dozen or so posts I'm going to review graphing functions.

Because now we are getting to the meat of the deal. Visualizations. I've been on a mission to understand how to interpret graphs for many years -- it's why I stepped up to study stats in 2018. The reasons why people need to understand what graphs tell us are broadcast by anyone who knows anything about statistics. To our detriment, until we look into it more deeply we simply nod and ignore.

I imagine that most of what people can be made to understand can be done using one or more of the graphs in this section.

Wednesday, July 22, 2020

colPercents

With colPercents you can get column percents for a contingency table. Its sibling functions rowPercents and totPercents give the row and total percents (the marginal values I write about in the table topic).

The colPercents command is part of the RcmdrMisc package that I installed as part of STAT 3743. From the R community, it looks like you can get the same value using prop.table and cbind, though that's a multi-step process.

Just two parameters, the table of frequency counts, and optional digits for % display.

R Example

This example uses the carData package Mroz data, which contains labor force participation for married women. This example shows labor force participation (lfp) against whether the wife attended college (wc).

> library(RcmdrMisc)
> library(carData)
> colPercents(xtabs(~ lfp + wc, data=Mroz))
       wc
lfp        no   yes
  no     47.5  32.1
  yes    52.5  67.9
  Total 100.0 100.0
  Count 541.0 212.0

This table reports that 47.5% of married women who did not attend college do not work.
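
Here is a sketch of the multi-step base R route with prop.table and rbind. It is not the RcmdrMisc implementation, and I've swapped in the built-in mtcars data (cylinders against transmission type) so it runs without extra packages. The variable names are mine.

```r
# Contingency table: number of cylinders by transmission (0 = auto, 1 = manual).
tab <- xtabs(~ cyl + am, data = mtcars)

# Column percents: each column sums to 100.
pct <- 100 * prop.table(tab, margin = 2)

# Add Total and Count rows, in the style of the output above.
col_pct_table <- rbind(pct, Total = colSums(pct), Count = colSums(tab))
round(col_pct_table, 1)
```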

Tuesday, June 30, 2020

xtabs

The xtabs command lets you build contingency tables (two-way table, cross tabulation, or crosstab). It is similar to table. Both are depictions of spread, which display multivariate frequency distribution of two variables, though from the R documentation, it looks like you can use xtabs for more than two variables. In the STAT 3743 videos, Kern asserts that xtabs is the easiest way to make a table.

See the table topic for an example of a crosstab.

For two variables, the xtabs command provides an initial output nearly identical to the output of table. It has 10 parameters. Of these, you'll always need the formula and data parameters.

The xtabs command is one of the first ones I encountered with the wacky formula. You have to introduce columns from a dataset with a tilde (~), and separate columns with the plus sign (+). That's the formula parameter. Then the name of the data.frame that contains the columns. Here's the form:

xtabs(~[column1] + [column2] + … [columnN], data = [data.frame])

R Example

This example uses the standard infert dataset. The infert dataset shows infertility data after abortion across a number of characteristics, like education.

xtabs(~education + induced + spontaneous, data = infert)
, , spontaneous = 0

         induced
education  0  1  2
  0-5yrs   2  1  6
  6-11yrs 46 15 10
  12+ yrs 19 29 13

, , spontaneous = 1

         induced
education  0  1  2
  0-5yrs   1  0  0
  6-11yrs 19  9  5
  12+ yrs 27  7  3

, , spontaneous = 2

         induced
education  0  1  2
  0-5yrs   1  1  0
  6-11yrs 13  3  0
  12+ yrs 15  3  0

Separate tables are produced for each value of spontaneous. You can put these together into one table with ftable (flatten table), which I don't go into further in this document.

> tt = xtabs(~education + induced + spontaneous, data = infert)
> ftable(tt)

                  spontaneous  0  1  2
education induced
0-5yrs    0                    2  1  1
          1                    1  0  1
          2                    6  0  0
6-11yrs   0                   46 19 13
          1                   15  9  3
          2                   10  5  0
12+ yrs   0                   19 27 15
          1                   29  7  3
          2                   13  3  0

You would think that xtabs would do the flattening part (and indeed it may). This is one of those commands I'm going to have to look into further.
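
A couple of quick checks I find useful on a three-way xtabs result, sketched with base R only. The row.vars choice and the variable names are mine.

```r
# Three-way table of counts.
tt <- xtabs(~ education + induced + spontaneous, data = infert)

# Every row of infert lands in exactly one cell.
total_cells <- sum(tt)
total_cells

# Flatten, naming which variables go down the rows.
flat <- ftable(tt, row.vars = c("education", "induced"))
flat

# Collapse over spontaneous to get back to a plain two-way table.
edu_induced <- margin.table(tt, c(1, 2))
edu_induced
```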

Sunday, June 28, 2020

table

The table command lets you build contingency tables (two-way table, cross tabulation, or crosstab). It is similar to xtabs. Both are depictions of spread, which display multivariate frequency distribution of two variables, though from the R documentation, it looks like you can use xtabs for more than two variables.

Wikipedia provides this example for a contingency table. Dominant handedness is broken down by gender:

Sex       Right handed   Left handed   Total
Male                43             9      52
Female              44             4      48
Total               87            13     100

The values in the Total row and Total column are the column and row margin totals.

You see contingency tables everywhere. One of my stats classes focused very heavily on them, pumping nearly all the homework and class examples through Excel.

The table command takes one or more factors (or things that can be read as factors) as its main arguments, plus a few options such as exclude, useNA, and dnn. The example below passes the quartile bins of Temp as the row factor, and Month as the column factor.

R Example

This example from the R docs uses the built-in dataset airquality, taking the Temp column from airquality, and creating a contingency table with a row grouping by quartile of temperature values, and a column grouping by month number.

with(airquality, table(cut(Temp, quantile(Temp)),Month))

          Month
            5  6  7  8  9
  (56,72]  24  3  0  1 10
  (72,79]   5 15  2  9 10
  (79,85]   1  7 19  7  5
  (85,97]   0  5 10 14  5

The table reveals, for example, that when the value of the temperature is within the first quartile of temp values (between 56 and 72), the counts of when those temperatures occurred between May and September are:

Month	# of times the temperature was between 56 and 72
May	24
June	3
July	0
August	1
September	10
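
To get the margin totals that contingency tables usually carry, base R has addmargins. One thing to watch for, sketched below: the first bin is written (56,72], which leaves out a temperature of exactly 56, so I pass include.lowest = TRUE to cut. The variable names are mine.

```r
# Quartile bins of temperature; include.lowest keeps the minimum value.
temp_bins <- cut(airquality$Temp, quantile(airquality$Temp),
                 include.lowest = TRUE)

# Contingency table of temperature quartile by month.
temp_by_month <- table(Temp = temp_bins, Month = airquality$Month)

# Same table with row and column totals added.
with_totals <- addmargins(temp_by_month)
with_totals
```

The bottom-right cell of with_totals should match the number of rows in airquality.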

Sunday, June 14, 2020

cor

Correlation coefficient. A measure of the linear relationship between two variables. This is a unitless value.

In the R docs, cor is listed on the same page as var (variance) and cov (covariance).

When you want to try and predict the effect an x value has on a y value, you calculate correlation (usually represented with 'r').

When r is greater than zero, it indicates a positive correlation, and when less than zero, it indicates a negative correlation. An r of zero means no correlation.

Correlation is a linear measure. Therefore you can only use it when a plot of the x/y relationship is more linear than otherwise.

R Example

> duration = faithful$eruptions   # eruption durations 
> waiting = faithful$waiting      # the waiting period 
> cor(duration, waiting)          # apply the cor function 

[1] 0.90081

The value of r is nearly 1 (one), a positive correlation.
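
Two small checks that helped me here, sketched in base R. The first rebuilds r from covariance and the standard deviations, which is why it comes out unitless. The second uses a made-up parabola (my own toy data) to show why r only speaks to linear relationships.

```r
duration <- faithful$eruptions
waiting  <- faithful$waiting

# r is the covariance scaled by both standard deviations.
r_builtin <- cor(duration, waiting)
r_by_hand <- cov(duration, waiting) / (sd(duration) * sd(waiting))
c(r_builtin, r_by_hand)

# Toy data: y depends entirely on x, but not in a straight line.
x <- -10:10
y <- x^2
r_parabola <- cor(x, y)
r_parabola
```

The parabola gives an r of essentially zero even though y is completely determined by x, so always look at the plot first.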

Friday, June 12, 2020

kurtosis

A measure of shape. Provides a number to represent the relative flatness of a distribution. You can select one of three algorithms for computing kurtosis (in the e1071 package, type 3 is the default).

The kurtosis for a normal distribution is 3.0. Several terms are used to define different shape tendencies:

leptokurtic - steep spike and heavy tails (g2 positive)
platykurtic - flat shape and thin tails (g2 negative)
mesokurtic - rounded peak and moderate tails (normal dist) (g2 ~0)

The kurtosis number reported here is the raw value minus 3 (that is, a certain amount away from Normal), also known as excess kurtosis.

R Example

> library(e1071)

> x <- rnorm(100)

> kurtosis(x)

[1] -0.7979941
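
To see the 'minus 3' in action without any package, here is a sketch of the moment-based formula g2 = m4 / m2^2 - 3, which is what the e1071 docs describe as type 1. The function name excess_kurtosis is mine, not part of any library.

```r
# Excess kurtosis from the second and fourth central moments.
# (Hypothetical helper for illustration; moment-based "type 1" formula.)
excess_kurtosis <- function(x) {
  centered <- x - mean(x)
  m2 <- mean(centered^2)   # second central moment
  m4 <- mean(centered^4)   # fourth central moment
  m4 / m2^2 - 3
}

# A flat, evenly spread sample: platykurtic, so the result is negative.
flat_sample <- excess_kurtosis(1:5)
flat_sample

# A large normal sample should land near zero (it varies run to run).
set.seed(1)
normal_sample <- excess_kurtosis(rnorm(10000))
normal_sample
```

For 1:5 the moments work out to m2 = 2 and m4 = 6.8, so the result is 6.8 / 4 - 3 = -1.3.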