I'm Learning R

Thursday, October 29, 2020

mosaicplot

Mosaic plots let you view the relationship between two or more categorical variables. That's the official, universal definition. I still confuse mosaicplot with spineplot. The docs stress that mosaic plots can show more than two variables, while spine plots are limited to just two.

If you research mosaic plots on the web, you will invariably find the Titanic survival analysis at the top of the heap. Mosaic plots are all about proportional boxes representing related categories, so when you read a mosaic plot, you find yourself drilling down into it, reading it X-Y-X-Y to discover what you want. You'll see what I mean in the R example below.

R Example

Let's just get on with it. This example creates the default mosaic plot for the HairEyeColor dataset.

> mosaicplot(HairEyeColor)

It's all about Hair color (x axis) and Eye color (y axis). Eye color is subcategorized into four colors. Hair is also subcategorized into four colors, and also further subdivided by gender. The width of a segment (in any direction) indicates its relative proportion of the total (in the same direction).

As you look at the plot, you can see that people with blond hair are more likely to have blue eyes, and that a difference in gender does not seem to affect the tendency towards an eye color when the hair color is the same. So in this example, there are three variables being compared: eye color, hair color, and gender.
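Here's a minimal sketch for checking that reading against the numbers behind the plot. It only uses base R (margin.table and prop.table); the variable names hair_eye, hair_eye_share, and by_sex are my own.

```r
# Collapse the three-way table to Hair x Eye by summing over Sex
hair_eye <- margin.table(HairEyeColor, c(1, 2))

# Row proportions: within each hair color, the share of each eye color
hair_eye_share <- prop.table(hair_eye, 1)
print(round(hair_eye_share, 2))

# Same idea, but kept separate by sex: proportions within each
# Hair x Sex combination, so the two groups can be compared
by_sex <- prop.table(HairEyeColor, c(1, 3))
print(round(by_sex["Blond", , ], 2))
```

Each row of hair_eye_share adds up to 1, so the biggest number in a row is the widest box in that column of the mosaic.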

I see two challenges with a mosaic plot. First, your audience needs some experience interpreting one before you present it; otherwise you'll be spending a lot of time explaining it. Second, two people can come to different and equally valid conclusions, depending on the mosaic distribution. This hair/eye example doesn't really show that, but remember to examine all angles of the graph for contradictions if you plan on using it to support something it seems to show.
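On the mosaicplot versus spineplot question, a side-by-side sketch may help. This assumes you collapse HairEyeColor to two variables first, since spineplot only takes a 2-way table; hair_eye is my own name for that collapsed table.

```r
# Collapse to two variables (Hair x Eye) so both functions can draw it
hair_eye <- margin.table(HairEyeColor, c(1, 2))

op <- par(mfrow = c(1, 2))   # two plots side by side
mosaicplot(hair_eye, main = "mosaicplot")
spineplot(hair_eye, main = "spineplot")
par(op)                      # restore the previous layout
```

Passing the full three-way HairEyeColor table to spineplot stops with an error, which is the two-variable limit in practice.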

Friday, October 16, 2020

bwplot

Available with the lattice package, bwplot (box and whisker plot) displays multiple box plots based on a category in the dataset. Lattice documents bwplot on its "Common Bivariate Trellis Plots" help page alongside several other plots, so you have to go down the page to find info on it. It has over 20 options, but you only need a formula (the variable to plot and the grouping variable) and the dataset name.

Plot comparisons are low in detail, but allow you to compare many items at a glance. In the case of bwplot, you are comparing the IQR of a dataset grouped by some category.

R Example

This example depicts the distribution of chicken weights, grouped by the type of feed. A comparison lets you judge which feed might be optimal to provide your hens.

> library(lattice)
> bwplot(~weight | feed, data = chickwts)

Looking at this, we might want to use sunflower, because it has the most predictable value. Or we might choose casein, because you may want to produce chickens of greater variance in size and weight.
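To put numbers on "predictable," here's a rough sketch that computes the spread per feed with base R. The names iqr_by_feed and sd_by_feed are mine. The last part is an optional one-panel variant of the plot, and it only runs if lattice is installed.

```r
# Spread of chick weights within each feed group (base R only)
iqr_by_feed <- tapply(chickwts$weight, chickwts$feed, IQR)
sd_by_feed  <- tapply(chickwts$weight, chickwts$feed, sd)
print(round(sort(iqr_by_feed), 1))
print(round(sort(sd_by_feed), 1))

# One-panel alternative: put the category on the left of the formula,
# so all six boxes share a single weight axis
if (requireNamespace("lattice", quietly = TRUE)) {
  print(lattice::bwplot(feed ~ weight, data = chickwts))
}
```

With all the boxes in one panel, it's easier to compare them against the same axis than across six separate panels.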

Thursday, October 8, 2020

boxplot

Boxplots are a fast way to judge spread, shape, and center. Also called box and whisker plots, they depict the five number summary in a structured shape.

Use a boxplot when you want a visual summary of the data set, and you don't have (or don't want to use) many data points. I think the main negative thing about boxplots is that you may need to educate people about what they are seeing, as opposed to a histogram or bar chart, which are ubiquitous in public culture.

The boxplot function has a pile of options, but you only need a formula (here count ~ spray: the value to summarize, then the grouping variable) and the data option to show something.

> boxplot(count ~ spray, data = InsectSprays, horizontal = TRUE)

Values are represented by the lines around each box.

• min - leftmost whisker. Insecticides A and B have min values about 7.

• IQR - the box shape. Insecticide F has the widest IQR, and insecticide D has the narrowest.

• median - the heavy line within the box

• max - the rightmost whisker. When a box has an outlier, the whisker stops at the largest value that is not an outlier. Insecticide D has a median equal to the upper value of its IQR

• outliers - circles outside a box (C and D in this plot both have an outlier)
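If you want the values behind the picture, boxplot can return them instead of drawing. This is a small sketch using plot = FALSE; the row labels and the names b, five, and outliers are my own additions.

```r
# Ask boxplot for its numbers without drawing anything
b <- boxplot(count ~ spray, data = InsectSprays, plot = FALSE)

# b$stats has one column per spray: whisker ends, box edges (hinges), median
five <- b$stats
dimnames(five) <- list(
  c("lower whisker", "lower hinge", "median", "upper hinge", "upper whisker"),
  b$names
)
print(five)

# Points drawn as circles, matched to the spray they belong to
outliers <- data.frame(spray = b$names[b$group], count = b$out)
print(outliers)
```

Note the whisker rows are the whisker ends, not necessarily the true min and max: any outliers sit beyond them.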

Monday, September 14, 2020

stem

A stem and leaf plot is similar to a histogram. It displays numeric values in an ordered distribution, so you can gauge its shape. It lies rotated 90° from a typical histogram. A vertical line separates the tens (or hundreds) place from the units. The tens (or hundreds) ascend from top to bottom, and the units ascend from left to right.

Unlike a histogram, a stem and leaf plot preserves the value of each data point within the graph.

Note: there is also a stemplot, which involves plotting a matrix of y values along an x axis. I don't know much about stemplots at this time.

R has the built-in stem command to create stem and leaf plots. It only requires a vector of values. It has a couple of options, the most useful one being scale, which controls the plot length. The aplpack package also provides stem.leaf, which has a pile of options that can help you trim outliers, create back-to-back stemplots, and more. Worth a look if you are going to be using stem and leaf a lot.

I remember going through many steps to create stem and leafs in my first statistics class. R makes it trivial.

R Example

I'm basing this on the topic in Kerns's IPSUR book. It uses the rivers dataset.

> stem(rivers)

  The decimal point is 2 digit(s) to the right of the |

   0 | 4
   2 | 011223334555566667778888899900001111223333344455555666688888999
   4 | 111222333445566779001233344567
   6 | 000112233578012234468
   8 | 045790018
  10 | 04507
  12 | 1471
  14 | 56
  16 | 7
  18 | 9
  20 |
  22 | 25
  24 | 3
  26 |
  28 |
  30 |
  32 |
  34 |
  36 | 1

This plot is right skewed.

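Read a data point by combining the stem on the left side of the vertical line with one leaf digit from the right side. The header says the decimal point is two digits to the right of the |, so the stems count hundreds and each leaf is a tens digit (values are rounded to the nearest ten). The stems step by two, so each row covers two hundreds. The first row, "0 | 4", holds a single river of about 140 miles (the shortest in the dataset, at 135 miles). The last row, "36 | 1", holds the longest, at about 3,710 miles. And so on.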
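A reading of the plot can be checked against the raw values, and the scale option mentioned earlier can stretch the plot out. A quick sketch (shortest and longest are my own names; I picked scale = 2 just to show the effect):

```r
# The raw values behind the first and last rows of the plot
shortest <- min(rivers)
longest  <- max(rivers)
print(c(shortest, longest))
print(head(sort(rivers), 5))   # the five shortest rivers

# scale = 2 makes the plot roughly twice as long, so rows hold fewer leaves
stem(rivers, scale = 2)
```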

Friday, September 11, 2020

pareto

Pronounced pa RAY toe. A pareto diagram is similar to a bar chart, in that it depicts categorical data. Unlike a bar chart, it orders the bars in decreasing size. This scheme not only makes it easy for you to see which categories are most important, it also lets you compare their relative sizes quickly.

Often pareto diagrams include an ogive line, which tracks the cumulative number of values across the bars. With the addition of an ogive, you can quickly estimate when a given percentage of the total data has been allocated to the major categories.

You need to load the qcc package to get pareto chart functionality.

R Example

This example is similar to the barplot example. It shows the number of states in each of the nine U.S. census divisions, along with their running total. Along with the chart, RStudio prints the table of values to the console.

> library(qcc)
> pareto.chart(table(state.division), ylab = "Frequency")

Pareto chart analysis for table(state.division)

                    Frequency Cum.Freq. Percentage Cum.Percent.
South Atlantic              8         8         16           16
Mountain                    8        16         16           32
West North Central          7        23         14           46
New England                 6        29         12           58
East North Central          5        34         10           68
Pacific                     5        39         10           78
East South Central          4        43          8           86
West South Central          4        47          8           94
Middle Atlantic             3        50          6          100

Cumulative number of states within several regions
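To see what the ogive is made of, here's a sketch that rebuilds the cumulative columns with base R only (no qcc). The names counts, cum_freq, cum_pct, and pareto_tbl are mine, and the plot is a bare-bones approximation, not what pareto.chart draws.

```r
# Sort the categories from most to least frequent
counts <- sort(table(state.division), decreasing = TRUE)

# Running total and running percentage: these are the ogive's values
cum_freq <- cumsum(counts)
cum_pct  <- 100 * cum_freq / sum(counts)

pareto_tbl <- data.frame(
  Frequency   = as.vector(counts),
  Cum.Freq    = as.vector(cum_freq),
  Cum.Percent = as.vector(cum_pct),
  row.names   = names(counts)
)
print(pareto_tbl)

# Bars in decreasing order, with the cumulative line drawn over them
mids <- barplot(counts, las = 2, ylim = c(0, sum(counts)), ylab = "Frequency")
lines(mids, cum_freq, type = "b")
```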

Tuesday, September 8, 2020

dotchart

The dotchart command provides a dot plot, a simple representation of frequency. It is part of the graphics package.

As with any dot plot, advantages include a graph you can quickly make by hand. Outliers are easy to identify, and you can gauge clusters, gaps, and histogram-like shapes. Most notably, dot plots preserve the numeric value of each data point (something lost with a bar chart, for example). See Dot plot (statistics) in Wikipedia.

Its use is limited to data sets of relatively few data points. Too many points and the graph becomes unreadable.

dotchart has 20 parameters, but the only one you must have is x -- a vector or matrix (table).

dotchart is a Variation on the Bar Chart

After poking around with this a bit, I discovered that dotchart is a specialized dot plot function. The R documentation lists this as Cleveland's Dot Plot. In Wikipedia, I learned that a Cleveland Dot Plot is "…an alternative to the bar chart, in which dots are used to depict the quantitative values (e.g. counts) associated with categorical variables." These are supposed to be easier to read and interpret than bar charts.

R Example

This first example plots passenger miles flown on US airlines. The airmiles dataset ships with base R, so no extra package is needed.

> dotchart(airmiles)

Passenger miles on US airlines between 1937 and 1960

This example shows just one dot per category. The airmiles dataset lists passenger miles on US airlines between 1937 and 1960. In the structure, though, the date years are represented as Time-Series [1:24], so there's no hard-wired column of years.

I was initially confused by dotchart, because the examples didn't show me how to label the dots. (The plot function ekes the years out automatically.) I found that labeling the points in dotchart requires creating a second vector of labels to pair with the airmiles data. There are also dotchart2 and dotchart3 (in the Hmisc package), which may handle time series years more elegantly. So I'm going with just another example:

> dotchart(mtcars$mpg, labels = row.names(mtcars), cex = .7,
+   main = "Gas Mileage for Car Models",
+   xlab = "Miles Per Gallon")

MPG for Common Cars

I think this version is easier to read than a bar chart, for sure. The calibrate::textxy function is another way to label plotted points.
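Coming back to the airmiles labeling problem, here's one possible sketch using only base R. It assumes the years can be pulled out of the time series with time(); the name years and the title text are my own.

```r
# time() returns the time index of a ts object: here, one year per value
years <- as.character(as.vector(time(airmiles)))

# Strip the ts attributes so dotchart gets a plain numeric vector,
# and pair it with the year labels
dotchart(as.vector(airmiles), labels = years, cex = 0.7,
         main = "Passenger Miles on US Airlines",
         xlab = "Passenger miles")
```

Converting with as.vector also avoids the warning dotchart gives when it is handed a time series directly.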

Friday, September 4, 2020

barplot

Bar graphs or bar charts are created using barplot, and are frequency comparisons for categorical data (rather than numeric occurrences of a single variable as with histograms). Heights of bars are proportional to the frequency of occurrence of each category.

Use a bar graph when you want to compare the incidence of different but similar things: for example, the survival of different mammals on a flooded island, or the number of paper subscriptions in several city neighborhoods.

Bar graphs can be single bars per category. They can also be "doubled-up" -- two or more bars in the same category turning on a split in the data. An example might be paper subscriptions in city neighborhoods, split by gender. Bar graphs may also be "stacked," with each bar divided into a number of differently colored segments representing subcategories (which may not appear in every bar in the chart). For example, paper subscriptions per neighborhood, with segments for morning, evening, and Sunday only subscriptions. In some neighborhoods there may be no Sunday only subscriptions, while in another neighborhood, everybody gets the morning edition.

R Example

This example plots the number of states which exist in the four main regions of the U.S.

barplot(table(state.region))

Number of States in Four U.S. Regions
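The "doubled-up" and "stacked" styles described earlier can be sketched with a two-way table. I don't have newspaper data, so this borrows HairEyeColor and splits hair color by sex instead; hair_by_sex is my own name.

```r
# Sex x Hair counts: one row per sex, one column per hair color
hair_by_sex <- margin.table(HairEyeColor, c(3, 1))

op <- par(mfrow = c(1, 2))   # two plots side by side
# "Doubled-up": one bar per sex, side by side within each hair color
barplot(hair_by_sex, beside = TRUE, legend.text = TRUE,
        main = "Grouped", xlab = "Hair color", ylab = "Count")
# "Stacked": one bar per hair color, divided into a segment per sex
barplot(hair_by_sex, legend.text = TRUE,
        main = "Stacked", xlab = "Hair color", ylab = "Count")
par(op)                      # restore the previous layout
```

The only difference between the two calls is beside = TRUE; barplot stacks the rows of a matrix by default.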