Saturday, November 21, 2020

coplot

Conditioning plot, AKA shingle plot. Why shingle? It has to do with overlap: the ranges of the conditioning variable overlap one another, like shingles on a roof. In this case, 'conditioning' is, apparently, 'under several conditions.' Think of a mosaic, where you have a series of related plots all together in one plot display. You print them separately because, I guess, if you put them on one graph it would be a jumble of data.

I don't know much about this type of plot just yet. Tableau builds these pretty well -- you are taking an xyplot and redrawing it for each range of a third variable. coplot has a ton of options, but you only need the dataset and the three variables (the separate plots are technically extending an x/y relationship into a third dimension).

R Example

It's probably easier to understand what this thing is for if you look at what it does. This plot shows where earthquakes near Fiji occurred, latitude against longitude, with a separate panel for each range of depth.

> coplot(lat ~ long | depth, data = quakes)



Cute, right? What does it mean? Well, if I read this thing right, Fijian quakes near 185, -10 generally occur deeper than do quakes near 165, -10. Or, probably a better way to look at it: given a depth of, say, 300 km, where are we likely to see a quake? 170, -10?

I have no idea. I'll have to research this for better examples.

In the 3473 Video, Kern explains it as: the top 'header' (my term) chart outlines six zones, each one a range of depth. Each zone is depicted in one of the shingle plots below. But that still doesn't answer it fully for me. Stay tuned.
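Here is a small sketch that helped me with the 'zones.' It assumes coplot's defaults (six ranges, each sharing roughly half its points with its neighbor) and uses co.intervals, the base R helper that computes such ranges. The names depth_ranges and counts are my own.

```r
# Compute six overlapping depth ranges (the shingles), matching coplot's defaults
depth_ranges <- co.intervals(quakes$depth, number = 6, overlap = 0.5)
print(depth_ranges)   # one row per range: lower bound, upper bound (km)

# Count the quakes that fall in each range
counts <- apply(depth_ranges, 1, function(r) {
  sum(quakes$depth >= r[1] & quakes$depth <= r[2])
})
print(counts)

# The counts add up to more than the number of quakes because the ranges overlap
print(sum(counts) > nrow(quakes))

# Hand the same ranges to coplot explicitly
coplot(lat ~ long | depth, data = quakes, given.values = depth_ranges)
```

Because the ranges overlap, a single quake can show up in two neighboring panels.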

Thursday, November 19, 2020

xyplot

An X-Y plot is also known as a scatter plot. You have your basic splash of points that you hope will exhibit some sort of pattern, which you can then use to predict something. You may need to create a fit line as a guide to a probable value of Y given X (and vice versa). Either way, the one you pick is the independent variable, and the one you look up based on the one you picked is the dependent variable (you nearly always have X as the independent variable).

Of course, just as I had it in my head that a scatter plot is between two variables, I read further down in the Wikipedia article, where it talks about multivariate scatter plots. Go figure.

You can use scatter plots whenever you have data. Remember though, that extrapolating outside of the cloud of data is fraught with danger. You cannot predict with any accuracy the value of a dependent variable based on an independent variable value which is either less than the smallest, or greater than the greatest value in the plot. Period.
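To make the warning concrete, here is a rough sketch using R's built-in women dataset and a straight-line fit from lm. The two heights I ask about (65 and 80 inches) are my own picks for illustration.

```r
# Fit a straight line: weight as a function of height
fit <- lm(weight ~ height, data = women)

# The heights actually observed in the data
height_range <- range(women$height)
print(height_range)

# Inside the cloud of data: a reasonable interpolation
inside <- predict(fit, newdata = data.frame(height = 65))

# Outside the cloud: R returns a number just as readily,
# but nothing in the data backs it up
outside <- predict(fit, newdata = data.frame(height = 80))

print(c(inside = unname(inside), outside = unname(outside)))
```

R will not warn you about the second prediction. Checking the range yourself is the only safeguard.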

R Example

Use the lattice::xyplot function to create scatter plots. The R docs have xyplot in the same topic as dotplot, barchart, and more. Plenty of options, but you really only need a formula (y ~ x) and the name of a dataset that contains both variables.

This example creates a simple graph that plots height (as X) against weight (as Y) using the women dataset. 

> library(lattice)
> xyplot(weight ~ height, data = women)

There is no men dataset. I guess that would be pointless.
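The fit line mentioned earlier takes one extra argument. This is a minimal sketch that relies on lattice's type argument, where "p" draws the points and "r" adds a least-squares regression line. The name p is my own.

```r
library(lattice)

# Same scatter plot, with a fitted regression line drawn over the points
p <- xyplot(weight ~ height, data = women, type = c("p", "r"))

# lattice plots are objects; print() draws them (needed inside scripts and functions)
print(p)
```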

Thursday, November 12, 2020

spineplot

They say that a spine plot is a special case of the mosaic, and a generalization of a stacked bar graph. Each vertical bar (I'll call it a 1st order category) is segmented along its length according to the relative proportions of 2nd order categories. In addition, each bar has a width which represents the proportion of its category among the several 1st order categories being examined. These are all similar to the mosaic. Unlike the mosaics I've seen, spine plots also feature a numeric scale on the y axis.

Even though there are some scholarly papers out there that show how to use a spine plot for more than just two variables, I'm going to go out here on a limb and say use a spine for just two. 

When to use a spine plot instead of a bar graph? I think it depends on how you want to make your case. A bar graph is usually used to compare relative quantities or proportions among similar or related categories. I often have a dozen categories along the x axis, with each bar split along the y axis into a varying set of segments. That is, each bar doesn't need to have the same number of segments. Spine plots will involve few 'bars,' and each will have the same number of divisions. (At least, this is what I'm going with now until I find out different later.)

R Example

This example uses the UCBAdmissions dataset to show how applicants for admission to UC Berkeley break down by gender.

spineplot(xtabs(Freq ~ Admit + Gender, data = UCBAdmissions))


[depiction of UCBAdmissions dataset]

The plot shows:
  • more people were rejected than accepted -- the Rejected bar is wider than the Admitted. We don't know by how much, but it eyeballs to about 2:3, or roughly 60% rejected.
  • of those rejected, a higher percentage were male than female (about 55% to 45%).
  • of those accepted, a higher percentage were male than female.
Okay, that's nice. The chart makes it seem that men are overrepresented. But notice that we have no information on individual totals.
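Eyeballing only goes so far. Here is a sketch for pulling the actual proportions out of the same table with base R's prop.table. The names tab, widths, and segments are my own.

```r
# Collapse UCBAdmissions over department: rows = Admit, columns = Gender
tab <- xtabs(Freq ~ Admit + Gender, data = as.data.frame(UCBAdmissions))
print(tab)

# Bar widths in the spine plot: share of all applicants admitted vs. rejected
widths <- prop.table(margin.table(tab, 1))
print(round(widths, 3))

# Segments within each bar: gender split among the admitted and among the rejected
segments <- prop.table(tab, 1)
print(round(segments, 3))
```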

If we take the same data, and plot it with barplot, we can influence people in a completely different way:

barplot(xtabs(Freq ~ Admit + Gender, data = UCBAdmissions))


Okay, that's quite different. Though many more males are admitted, many more males applied. Further, a noticeably smaller proportion of the females who applied were admitted (about 30%, against about 45% of the males). (Please don't try to read into this data why more males applied to Berkeley.)
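The rates behind that comparison can be computed directly instead of read off the bars. A sketch, again with names of my own choosing (tab, applicants, rates):

```r
# Collapse UCBAdmissions over department: rows = Admit, columns = Gender
tab <- xtabs(Freq ~ Admit + Gender, data = as.data.frame(UCBAdmissions))

# Applicants per gender (the total height of each bar)
applicants <- margin.table(tab, 2)
print(applicants)

# Proportion admitted and rejected within each gender (each column sums to 1)
rates <- prop.table(tab, 2)
print(round(rates, 3))

# Same bar chart with a legend, so the segments are identifiable
barplot(tab, legend.text = TRUE)
```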

If you wanted to show how skewed the admittance membership is, you would use a spine plot. If you wanted to show how skewed the admittance preferences are, you would use a stacked bar chart.

Welcome to descriptive statistics.