Monday, September 14, 2020

stem

A stem and leaf plot is similar to a histogram. It displays numeric values in an ordered distribution, so you can gauge its shape. It lies rotated 90° from a typical histogram. A vertical line separates the tens (or hundreds) place from the units. The tens (or hundreds) ascend from top to bottom, and the units ascend from left to right.

Unlike a histogram, a stem and leaf plot preserves the value of each data point within the graph.

Note: there is also a stemplot, which involves plotting a matrix of y values along an x axis. I don't know much about stemplots at this time.

R has the built-in stem command to create stem and leaf plots. It requires only a vector of values. It has a couple of options, the most useful one being scale, which controls the plot length. The aplpack package also provides stem.leaf, which has a pile of options that can help you trim outliers, create back-to-back plots, and more. Worth a look if you are going to be using stem and leaf plots a lot.
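As a rough illustration of the scale option, the sketch below prints the same data at two lengths. It uses the built-in rivers dataset; the variable names rows_default and rows_scaled are my own.

```r
# scale controls the plot length; the default is 1.
stem(rivers)             # default length
stem(rivers, scale = 2)  # roughly twice as long as the default

# capture.output() grabs the printed plot as lines of text, which makes it
# easy to compare how many lines each setting produces.
rows_default <- length(capture.output(stem(rivers)))
rows_scaled  <- length(capture.output(stem(rivers, scale = 2)))
c(default = rows_default, scaled = rows_scaled)
```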

I remember going through many steps to create stem and leafs in my first statistics class. R makes it trivial.

R Example

I'm basing this on the topic in Kerns's IPSUR book. It uses the rivers dataset.

> stem(rivers)

The decimal point is 2 digit(s) to the right of the |

 0 | 4
 2 | 011223334555566667778888899900001111223333344455555666688888999
 4 | 111222333445566779001233344567
 6 | 000112233578012234468
 8 | 045790018
10 | 04507
12 | 1471
14 | 56
16 | 7
18 | 9
20 |
22 | 25
24 | 3
26 |
28 |
30 |
32 |
34 |
36 | 1




This plot is right skewed.

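Read a data point by joining a stem from the left of the vertical line to a single leaf digit from the right. The note above the plot says the decimal point sits two digits to the right of the |, so the stem is in hundreds and each leaf is a tens digit, with values rounded to the nearest ten. Because the stems step by two, each row holds two hundreds: the leaves on row 2 run from 0 to 9 for the 200s, then start over at 0 for the 300s. So there is one value below 200, the lone "4" on row 0, which stands for a river of about 140 miles. Row 22 holds two values between 2,200 and 2,399, row 24 holds one, and the single leaf on row 36 is the longest river, at about 3,710 miles. And so on.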
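One way to check a reading is to compare the plot against the raw values. This is a minimal sketch using only base R functions; the name sorted_rivers is my own.

```r
# Sort the river lengths so the smallest and largest values are easy to find.
sorted_rivers <- sort(rivers)

head(sorted_rivers)   # the shortest rivers, which land on the top rows
tail(sorted_rivers)   # the longest rivers, which land on the bottom rows

range(rivers)         # 135 and 3710 miles
length(rivers)        # 141 values, one leaf each
```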

Friday, September 11, 2020

pareto

Pronounced pa RAY toe. A pareto diagram is similar to a bar chart, in that it depicts categorical data. Unlike a bar chart, it orders the bars in decreasing size. This scheme not only makes it easy for you to see which categories are most important, it also lets you compare their relative sizes quickly.

Often pareto diagrams include an ogive line, which tracks the cumulative number of values across the bars. With the addition of an ogive, you can quickly estimate when a given percentage of the total data has been allocated to the major categories.

You need to load the qcc package to get pareto chart functionality.

R Example

This example is similar to the barplot example. It shows the cumulative number of states within several regions. Along with the chart, R prints the table of values to the console.

library(qcc)
pareto.chart(table(state.division), ylab = "Frequency")

Pareto chart analysis for table(state.division)

                   Frequency Cum.Freq. Percentage Cum.Percent.
South Atlantic             8         8         16           16
Mountain                   8        16         16           32
West North Central         7        23         14           46
New England                6        29         12           58
East North Central         5        34         10           68
Pacific                    5        39         10           78
East South Central         4        43          8           86
West South Central         4        47          8           94
Middle Atlantic            3        50          6          100

Cumulative number of states within several regions
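
The columns of that table, and therefore the heights of the ogive, can be reproduced in base R. This is a sketch that assumes nothing beyond the built-in state.division factor; the variable names are illustrative.

```r
# Count states per division and order the counts from largest to smallest,
# which is the ordering a pareto chart uses for its bars.
counts <- sort(table(state.division), decreasing = TRUE)

# Running totals give the height of the ogive at each bar.
cum_counts <- cumsum(counts)

# Express both as percentages of all 50 states.
pct     <- 100 * counts / sum(counts)
cum_pct <- 100 * cum_counts / sum(counts)

data.frame(Frequency   = as.vector(counts),
           Cum.Freq    = as.vector(cum_counts),
           Percentage  = as.vector(pct),
           Cum.Percent = as.vector(cum_pct),
           row.names   = names(counts))
```

Reading down the Cum.Percent column shows how quickly the largest categories add up: the four largest divisions account for 58 percent of the states.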

Tuesday, September 8, 2020

dotchart

The dotchart command provides a dot plot, a simple representation of frequency. It is part of the graphics package.

As with any dot plot, the advantages include a graph you can quickly make by hand. Outliers are easy to identify, and you can gauge clusters, gaps, and histogram-like shapes. Most notably, dot plots preserve the numeric value of each data point (something lost with a bar chart, for example). See Dot plot (statistics) in Wikipedia.

Its use is limited to data sets of relatively few data points. Too many points and the graph becomes unreadable.

dotchart has 20 parameters, but the only one you must have is x -- a vector or matrix (table).
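
To illustrate the two kinds of input, here is a minimal sketch using the built-in VADeaths matrix (my choice of dataset, not one used elsewhere in this post). A vector gives one dot per value; a matrix gives a group of dots per column.

```r
# A named numeric vector: one dot per element, labeled by the names.
rural_male <- VADeaths[, "Rural Male"]
dotchart(rural_male, xlab = "Deaths per 1000")

# A matrix: rows become dots, and each column becomes a labeled group.
dotchart(VADeaths, xlab = "Deaths per 1000")
```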

dotchart is a Variation on Bar Chart

After poking around with this a bit, I discovered that dotchart is a specialized dot plot function. The R documentation lists this as Cleveland's Dot Plot. In Wikipedia, I learned that a Cleveland Dot Plot is "…an alternative to the bar chart, in which dots are used to depict the quantitative values (e.g. counts) associated with categorical variables." These are supposed to be easier to read and interpret than bar charts.

R Example

This first example plots the passenger miles flown on US airlines. The airmiles dataset ships with base R, so no library call is needed.

> dotchart(airmiles)

Passenger miles on US airlines between 1937 and 1960

This example shows just one dot per category. The airmiles dataset lists passenger miles on US airlines between 1937 and 1960. In its structure, though, the years are stored as part of a Time-Series [1:24] object, so there's no hard-wired column of years.

I was initially confused by dotchart, because the examples didn't show me how to label the dots. (The plot function ekes the years out automatically.) I found that labeling the points in dotchart requires creating a separate set of labels to pair with the airmiles data. There are also dotchart2 and dotchart3 (in the Hmisc package), which may handle time series years more elegantly. So I'm going with another example:


dotchart(mtcars$mpg, labels = row.names(mtcars), cex = .7,
         main = "Gas Mileage for Car Models",
         xlab = "Miles Per Gallon")



MPG for Common Cars
I think this version is easier to read than a bar chart, for sure. The textxy function in the calibrate package (calibrate::textxy) also lets you label points on a plot.
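
For the airmiles data, one way to pair a label with each dot is to pull the years out of the time series with time(). This is a sketch under that assumption rather than the only approach; the name years is my own.

```r
# time() recovers the year for each observation in the time series.
years <- as.integer(time(airmiles))

# as.numeric() drops the time-series attributes so dotchart gets a plain vector.
dotchart(as.numeric(airmiles),
         labels = as.character(years),
         cex = 0.7,
         xlab = "Passenger miles")
```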

Friday, September 4, 2020

barplot

Bar graphs or bar charts are created using barplot, and are frequency comparisons for categorical data (rather than numeric occurrences of a single variable as with histograms). Heights of bars are proportional to the frequency of occurrence of each category.

Use a bar graph when you want to compare the incidence of different but similar things: the survival of different mammals on a flooded island, say, or the number of paper subscriptions in several city neighborhoods.

Bar graphs can be single bars per category. They can also be "doubled-up" -- two or more bars in the same category, turning on a split in the data. An example might be paper subscriptions in city neighborhoods, split by gender. Bar graphs may also be "stacked," with each bar divided into a number of differently colored segments representing characteristics that vary from bar to bar (that is, characteristics which may not appear in every bar in the chart). For example, paper subscriptions per neighborhood, with segments for morning, evening, and Sunday-only subscriptions. In some neighborhoods there may be no Sunday-only subscriptions, while in another neighborhood, everybody gets the morning edition.

R Example

This example plots the number of states which exist in the four main regions of the U.S.

barplot(table(state.region))

Number of States in Four U.S. Regions
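
The doubled-up and stacked variants described earlier can be sketched with a two-way table. This uses the built-in mtcars data rather than newspaper subscriptions, and the variable names are illustrative.

```r
# Cross-tabulate two categorical variables: gears (rows) by cylinders (columns).
counts <- table(gears = mtcars$gear, cylinders = mtcars$cyl)

# Stacked bars (the default): one bar per cylinder count, split into gear segments.
barplot(counts, legend.text = TRUE,
        xlab = "Cylinders", ylab = "Number of cars")

# Doubled-up bars: beside = TRUE draws the segments side by side instead.
barplot(counts, beside = TRUE, legend.text = TRUE,
        xlab = "Cylinders", ylab = "Number of cars")
```

No 8-cylinder car in mtcars has 4 gears, so that segment is simply absent from its bar, which is the situation the stacked description anticipates.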