I'm Learning R

Saturday, November 21, 2020

coplot

Conditioning plot, AKA shingle plot. Why shingle? Because the ranges of the conditioning variable overlap, like shingles on a roof. In this case, 'conditioning' is, apparently, 'under several conditions.' Think of a mosaic, where you have a series of related plots all together in one plot display. You print them separately because, I guess, if you put them on one graph it would be a jumble of data.

I don't know much about this type of plot just yet. Tableau builds these pretty well -- you are taking an xyplot and redrawing it once for each range of a third variable. It has a ton of options, but you only need the dataset and a formula naming the three variables (the separate plots are technically extending an x/y relationship into a third dimension).

R Example

It's probably easier to understand what this thing is for if you look at what it does. This plot shows where earthquakes near Fiji occurred (latitude against longitude), with one panel for each range of depth.

> coplot(lat ~ long | depth, data = quakes)

Cute, right? What does it mean? Well, if I read this thing right, Fijian quakes near 185, -10 generally occur deeper than do quakes near 165, -10. Or, probably a better way to look at it: given a depth of, say, 300 km (the quakes dataset records depth in kilometers), where are we likely to see a quake? 170, -10?

I have no idea. I'll have to research this for better examples.

In the 3473 Video, Kern explains it as: using the top 'header' (my term) chart, there are six zones outlined. Each zone is depicted in one of the shingle plots below. But that still doesn't answer it fully for me. Stay tuned.
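The six zones can also be printed as numbers. This is a minimal sketch, assuming the plot above used coplot's defaults (six intervals, half overlap): co.intervals from the graphics package returns the depth range behind each panel, and the overlap between neighboring ranges is the 'shingle' part.

```r
# Depth ranges ("shingles") for the conditioning variable.
# number = 6 and overlap = 0.5 are the documented defaults; that they match
# the plot above is an assumption.
depth_ranges <- co.intervals(quakes$depth, number = 6, overlap = 0.5)
print(depth_ranges)

# Pass the same intervals explicitly so the panels and the table agree.
coplot(lat ~ long | depth, data = quakes, given.values = depth_ranges)
```

Each row is one zone: the first column is the shallow end of the range and the second is the deep end.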

Thursday, November 19, 2020

xyplot

An X-Y plot is also known as a scatter plot. You have your basic splash of points that you hope will exhibit some sort of pattern, which you can then use to predict something. You may need to create a fit line as a guide to a probable value of Y given X (and vice versa). Either way, the one you pick is the independent variable, and the one you look up based on the one you picked is the dependent variable (you nearly always have X as the independent variable).

Of course, right after noting that a scatter plot is usually between two variables, I read further down the Wikipedia article to where it talks about multivariate scatter plots. Go figure.

You can use scatter plots whenever you have data. Remember though, that extrapolating outside of the cloud of data is fraught with danger. You cannot predict with any accuracy the value of a dependent variable based on an independent variable value which is either less than the smallest, or greater than the greatest value in the plot. Period.

R Example

Use the lattice::xyplot function to create scatter plots. The R docs have xyplot in the same topic as dotplot, barchart, and more. Plenty of options, but you really only need a formula and the name of a qualifying dataset.

This example creates a simple graph that plots height (as X) against weight (as Y) using the women dataset.

> library(lattice)
> xyplot(weight ~ height, data = women)

There is no men dataset. I guess that would be pointless.
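The fit line and the extrapolation warning can be sketched with the same women dataset. This is only an illustration: predict_weight is a hypothetical helper I made up, not part of lattice or base R, and a straight line is assumed to be a reasonable fit.

```r
library(lattice)

# type = c("p", "r") draws the points plus a least-squares regression line.
fit_plot <- xyplot(weight ~ height, data = women, type = c("p", "r"))
print(fit_plot)

# The same line, fitted explicitly so it can be used for prediction.
fit <- lm(weight ~ height, data = women)

# Hypothetical helper: refuse to predict outside the observed heights.
predict_weight <- function(height) {
  observed <- range(women$height)
  if (height < observed[1] || height > observed[2]) {
    return(NA_real_)  # extrapolation: no trustworthy answer
  }
  unname(predict(fit, newdata = data.frame(height = height)))
}

predict_weight(65)  # inside the data cloud
predict_weight(80)  # outside it, returns NA
```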

Thursday, November 12, 2020

spineplot

They say that a spine plot is a special case of the mosaic, and a generalization of a stacked bar graph. Each vertical bar (I'll call it a 1st order category) is segmented along its length according to the relative proportions of 2nd order categories. In addition, each bar has a width which represents the proportion of its category among the several 1st order categories being examined. These are all similar to the mosaic. Unlike the mosaics I've seen, spine plots also feature a numeric scale on the y axis.

Even though there are some scholarly papers out there that show how to use a spine plot for more than just two variables, I'm going to go out here on a limb and say use a spine for just two.

When to use a spine plot instead of a bar graph? I think it depends on how you want to make your case. A bar graph is usually used to compare relative quantities or proportions among similar or related categories. I often have a dozen categories along the x axis, with each bar segmented along the y axis with random segments. That is, each bar doesn't need to have the same number of segments. Spine plots will involve few 'bars,' and each will have the same number of divisions. (At least, this is what I'm going with now until I find out different later.)

R Example

This example uses the UCBAdmissions dataset to show the distribution of the genders as applicants for admission to UC Berkeley.

spineplot(xtabs(Freq ~ Admit + Gender, data = UCBAdmissions))

The plot shows:

• more people were rejected than accepted -- the Rejected bar is wider than the Admitted. We don't know how much, but it eyeballs to about 2:3, or 60% rejected.
• of those rejected, a higher percentage were male than female (about 55% to 45%).
• of those accepted, a higher percentage were male than female.

Okay, that's nice. The chart makes it seem that men are overly represented. But notice that we have no information on individual totals.

If we take the same data, and plot it with barplot, we can influence people in a completely different way:

barplot(xtabs(Freq ~ Admit + Gender, data = UCBAdmissions))

Okay, that's quite different. Though many more males are admitted, many more males applied. Further, a smaller proportion of the females who applied were admitted (about 30%, against about 45% of the males). (Please don't try to read into this data why more males applied to Berkeley.)

If you wanted to show how skewed the admittance membership is, you would use a spine plot. If you wanted to show how skewed the admittance preferences are, you would use a stacked bar chart.
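Since both charts leave you eyeballing, it can help to print the numbers behind them. A small sketch using base R only; the variable names are mine.

```r
# Collapse the Dept dimension, as in the plots above.
tab <- xtabs(Freq ~ Admit + Gender, data = UCBAdmissions)

# Raw counts, including the totals that the spine plot hides.
addmargins(tab)

# Share of each gender within Admitted and within Rejected (rows sum to 1).
# These are the segment heights in the spine plot.
by_outcome <- prop.table(tab, margin = 1)

# Share admitted and rejected within each gender (columns sum to 1).
# This is the comparison the stacked bar chart invites.
by_gender <- prop.table(tab, margin = 2)

round(by_outcome, 2)
round(by_gender, 2)
```

The two tables come from the same counts; only the direction of the percentages changes, which is the whole difference between the two charts.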

Welcome to descriptive statistics.

Thursday, October 29, 2020

mosaicplot

Mosaic plots let you view the relationship between two or more categorical variables. That's the official, universal definition. I have a lot of confusion between mosaicplot and spineplot. The docs stress that while mosaic plots let you see more than two variables, spine plots are limited to just two.

If you research mosaic plots on the web, you will invariably find the Titanic survival analysis at the top of the heap. Mosaic plots are all about proportional boxes representing related categories, so when you read a mosaic plot, you find yourself drilling down into it, reading it X-Y-X-Y to discover what you want. You'll see what I mean in the R example below.

R Example

Let's just get on with it. This example creates the default mosaic plot for the HairEyeColor dataset.

> mosaicplot(HairEyeColor)

It's all about Hair color (x axis) and Eye color (y axis). Eye color is subcategorized into four colors. Hair is also subcategorized into four colors, and also further subdivided by gender. The width of a segment (in any direction) indicates its relative proportion of the total (in the same direction).

As you look at the plot, you can see that people with blond hair are more likely to have blue eyes, and that a difference in gender does not affect a tendency towards eye color when the hair color is the same. So in this example, there are three variables being compared: eye color, hair color, and gender.

I see two challenges with a mosaic plot. The first is that your audience needs some experience interpreting one before you present it. You'll be spending a lot of time explaining it otherwise. The other issue is that two people can come to different and equally valid conclusions, depending on the mosaic distribution (this hair/eye example doesn't really show that -- you just need to remember to examine all angles of the graph for contradictions if you plan on using it to support something it seems to show).
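One way to examine the graph from another angle is to look at the counts behind it. A minimal sketch, assuming it is fine to ignore gender for a moment and collapse the table to hair and eye color only:

```r
# Collapse over Sex to get a Hair x Eye table of counts.
hair_eye <- margin.table(HairEyeColor, margin = c(1, 2))

# Eye color shares within each hair color (rows sum to 1).
round(prop.table(hair_eye, margin = 1), 2)

# A simpler two-variable mosaic of the same table.
mosaicplot(hair_eye, main = "Hair and eye color", color = TRUE)
```

If the proportions in the table disagree with what the boxes seem to say, trust the table.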

Friday, October 16, 2020

bwplot

Available with the lattice package, bwplot (short for box-and-whisker plot) displays multiple box plots based on a category in the dataset. Lattice describes bwplot on the same page as several other plots (the page is titled Common Bivariate Trellis Plots), so you have to go down the page to find info on it. It has over 20 options, but you only need the dataset name, the variable to summarize, and the grouping variable.

Plot comparisons are low in detail, but allow you to compare many items at a glance. In the case of bwplot, you are comparing the IQR of a dataset grouped by some category.

R Example

This example depicts the distribution of chicken weights, grouped by the type of feed. A comparison lets you judge which feed might be optimal to provide your hens.

> library(lattice)
> bwplot(~weight | feed, data = chickwts)

Looking at this, we might want to use sunflower, because it has the most predictable value. Or we might choose casein, because you may want to produce chickens of greater variance in size and weight.
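To back up a judgment like that, here is a sketch that puts all six feeds in one panel and prints the spread as numbers. The variable names are mine, and note that IQR() works from quantiles, so it can differ slightly from the box edges drawn in the plot.

```r
library(lattice)

# With feed on the y axis, all boxes share one panel and one weight scale,
# which makes side-by-side comparison easier.
one_panel <- bwplot(feed ~ weight, data = chickwts)
print(one_panel)

# Numeric check of spread and center for each feed.
iqr_by_feed <- tapply(chickwts$weight, chickwts$feed, IQR)
median_by_feed <- tapply(chickwts$weight, chickwts$feed, median)
sort(iqr_by_feed)
sort(median_by_feed)
```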

Thursday, October 8, 2020

boxplot

Boxplots are a fast way to judge spread, shape, and center. Also called box and whisker plots, they depict the five number summary in a structured shape.

Use a boxplot when you want a visual summary of the data set, and you don't have (or don't want to use) many data points. I think the main negative thing about boxplots is that you may need to educate people about what they are seeing, as opposed to a distribution or bar chart, which are ubiquitous in public culture.

The boxplot function has a pile of options, but you only need a formula (here, count grouped by spray) and the data to show something.

> boxplot(count ~ spray, data = InsectSprays, horizontal = TRUE)

Values are represented by the lines around each box.

• min - leftmost whisker. Insecticides A and B have min values of about 7.

• IQR - the box shape. Insecticide F has the widest IQR, and insecticide D has the narrowest.

• median - the heavy line within the box. Insecticide D has a median equal to the upper value of its IQR, so the line sits on the box edge

• max - the rightmost whisker, which ends at the largest value that is not an outlier

• outliers - circles outside a box (C and D in this plot both have an outlier)
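Rather than reading these values off the picture, you can ask boxplot for them. A short sketch; plot = FALSE is a documented argument, and the row labels are my own wording.

```r
# plot = FALSE returns the numbers instead of drawing.
b <- boxplot(count ~ spray, data = InsectSprays, plot = FALSE)

# One column per spray; rows run from the lower whisker to the upper whisker.
summary_table <- b$stats
colnames(summary_table) <- b$names
rownames(summary_table) <- c("lower whisker", "lower hinge", "median",
                             "upper hinge", "upper whisker")
summary_table

# Outlying values and the spray each one belongs to.
data.frame(spray = b$names[b$group], count = b$out)
```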

Monday, September 14, 2020

stem

A stem and leaf plot is similar to a histogram. It displays numeric values in an ordered distribution, so you can gauge its shape. It lies rotated 90° from a typical histogram. A vertical line separates the tens (or hundreds) place from the units. The tens (or hundreds) ascend from top to bottom, and the units ascend from left to right.

Unlike a histogram, a stem and leaf plot preserves the value of each data point within the graph.

Note: there is also a stemplot, which involves plotting a matrix of y values along an x axis. I don't know much about stemplots at this time.

R has the built in stem command to create stem and leaf. It only requires a vector of values. It has a couple of options, the most useful one being scale, which controls the plot length. The aplpack package also provides stem.leaf, which has a pile of options that can help you trim outliers, create back-to-back stemplots, and more. Worth a look if you are going to be using stem and leaf a lot.

I remember going through many steps to create stem and leafs in my first statistics class. R makes it trivial.

R Example

I'm basing this on the topic in Kern's IPSUR book. It uses the rivers dataset.

> stem(rivers)

The decimal point is 2 digit(s) to the right of the |

0 | 4
2 | 011223334555566667778888899900001111223333344455555666688888999
4 | 111222333445566779001233344567
6 | 000112233578012234468
8 | 045790018
10 | 04507
12 | 1471
14 | 56
16 | 7
18 | 9
20 |
22 | 25
24 | 3
26 |
28 |
30 |
32 |
34 |
36 | 1

This plot is right skewed.

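Read a data point by taking the stem from the left side of the vertical line and one leaf digit from the right side. The header says the decimal point is 2 digits to the right of the |, so the stems count hundreds and each leaf is a tens digit (values are rounded to the nearest 10). Each stem line also covers two hundreds, which is why the leaves on a line can start over at 0. There is one value below 200, "0 | 4", which is about 40 or 140 (the minimum of rivers is 135). The line "22 | 25" holds two values somewhere between 2200 and 2399, and "36 | 1" is a single value of about 3610 or 3710 (the maximum of rivers is 3710). And so on.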
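If a reading looks doubtful, compare it against the data itself. A minimal sketch using the scale option mentioned earlier; the variable names are mine.

```r
# scale = 2 makes the plot roughly twice as long, so each stem line
# covers a narrower range of values.
stem(rivers, scale = 2)

# Compare the plot against the actual smallest and largest values.
shortest <- min(rivers)
longest <- max(rivers)
head(sort(rivers))
tail(sort(rivers))
```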