I'm Learning R: April 2020

Saturday, April 25, 2020

mad

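A measure of spread. Median Absolute Deviation: the median of the absolute deviations from the median. By default, R's mad function multiplies that value by a constant (1.4826) so that, for normally distributed data, the result is comparable to the standard deviation. It is not a way of calculating the IQR; it's a separate measure of spread. Both are robust, which means they are resistant to outliers, and mad tolerates more contamination (up to half the data) than IQR does (up to a quarter).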

I'm not sure yet what the advantages and use cases are of selecting IQR over mad and vice versa.

R Example

> mad(rivers)

[1] 214.977

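If you plot the rivers dataset you will see several high outliers. Note that the 370 IQR and the 215 mad are not directly comparable: mad is scaled to estimate the standard deviation, while the IQR of normal data is about 1.349 standard deviations. Dividing the IQR by 1.349 gives about 274, so the two robust estimates are closer than the raw 155 gap suggests. Experience will show you when mad needs to be the choice.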
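Here's a minimal sketch of what mad computes on rivers, using only base R. The variable names (raw_mad, scaled_mad, iqr_as_sd) are my own, and the 1.349 divisor assumes roughly normal data, which rivers is not, so treat the comparison as illustrative only.

```r
x <- rivers                               # built-in dataset from the datasets package

raw_mad    <- median(abs(x - median(x)))  # median absolute deviation, unscaled
scaled_mad <- raw_mad * 1.4826            # default scaling constant used by mad()

raw_mad                                   # 145
all.equal(scaled_mad, mad(x))             # TRUE
mad(x, constant = 1)                      # 145, the unscaled value again

# Put IQR on the same "estimate of sd" scale before comparing it with mad
iqr_as_sd <- IQR(x) / 1.349
c(sd = sd(x), mad = mad(x), iqr_as_sd = iqr_as_sd)
```

The last line prints the three spread estimates side by side, which makes it easier to see how much the long rivers pull on sd compared with the robust measures.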

Wednesday, April 22, 2020

IQR

A measure of spread. Inter-quartile range, or midspread. This covers the middle 50% of a distribution: the range between the 25th and 75th percentiles, which it calculates using the quantile function. You can opt to remove NA values (na.rm = TRUE).

R Example

Get the IQR of the rivers dataset (part of the datasets package).

> IQR(rivers)

[1] 370

In the rivers dataset, the 75% value minus the 25% value is 370. You can check this using fivenum, whose second and fourth values (the hinges) coincide with the quartiles for this dataset:

> fivenum(rivers)

[1]  135  310  425  680 3710

680 - 310 == 370.
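A short sketch of the same check done directly with quantile, plus the NA option. The names q, iqr_manual and x_na are just ones I picked for illustration.

```r
# IQR is the 75th percentile minus the 25th percentile
q <- quantile(rivers, c(0.25, 0.75))
iqr_manual <- unname(diff(q))   # 370, same as IQR(rivers)

# With a missing value present, IQR() stops with an error unless na.rm = TRUE
x_na <- c(rivers, NA)
iqr_no_na <- IQR(x_na, na.rm = TRUE)   # 370
```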

Monday, April 20, 2020

quantile

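A measure of position, and the building block for spread measures like IQR. Use quantile to get the value at a given percentile of an observation variable.

You need x (a numeric vector, such as one column of a data frame) and usually probs (a vector of probabilities between 0 and 1). If you omit probs, you get the quartiles. You will use this frequently when you want the value that sits at a given position in the ordered distribution.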

R Example

This example, which I found on r-tutor, extracts the 32nd, 57th, and 98th percentiles for eruption duration in the faithful dataset.

> duration = faithful$eruptions     # the eruption durations 
> quantile(duration, c(.32, .57, .98)) 

   32%    57%    98% 
2.3952 4.1330 4.9330 
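Two related things I found useful, sketched with base R only: what quantile returns when probs is left out, and how to ask the reverse question (what share of the data falls at or below a given value) with ecdf. The 4-minute cutoff and the variable names are arbitrary choices of mine.

```r
duration <- faithful$eruptions

# With no probs argument, quantile() returns the minimum, quartiles and maximum
quartiles <- quantile(duration)
names(quartiles)                 # "0%" "25%" "50%" "75%" "100%"

# Reverse lookup: proportion of eruptions lasting 4 minutes or less
share_le_4 <- ecdf(duration)(4)
all.equal(share_le_4, mean(duration <= 4))   # TRUE
```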

Friday, April 17, 2020

sd (Standard Deviation)

Return the standard deviation of a vector. A measure of spread. It's the square root of the variance, where the variance is the expected squared deviation of a random variable from its mean.

You can opt to remove NA values that exist in the data. According to the R documentation, this uses n-1 in the denominator.

R Example

> sd(1:2)

[1] 0.7071068

> sd(1:2) ^ 2

[1] 0.5
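To see the n-1 denominator at work, here is a minimal sketch that rebuilds sd by hand. The sample vector and the name manual_sd are made up for the example.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

# Sample standard deviation: squared deviations from the mean, divided by n - 1
manual_sd <- sqrt(sum((x - mean(x))^2) / (n - 1))

all.equal(manual_sd, sd(x))   # TRUE
all.equal(sd(x)^2, var(x))    # TRUE: squaring sd gives the variance

# Missing values make sd() return NA unless they are removed
sd_no_na <- sd(c(x, NA), na.rm = TRUE)
```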

Wednesday, April 15, 2020

range

A measure of spread. Given one or more sets of values, return the minimum and maximum. This function has two options, na.rm and finite, which let you omit NA and non-finite values.

R Example

> lst <- 8:25

> range(lst)

[1]  8 25
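A quick sketch of the two options, using a made-up vector that contains both an NA and an infinite value:

```r
x <- c(8:25, NA, Inf)

r_default <- range(x)                  # NA NA: a single NA poisons the result
r_na_rm   <- range(x, na.rm = TRUE)    # 8 Inf: NA dropped, Inf kept
r_finite  <- range(x, finite = TRUE)   # 8 25: NA and Inf both dropped

# range() returns the two endpoints; diff() turns them into a single width
width <- diff(r_finite)                # 17
```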

Monday, April 13, 2020

mode

A measure of center. In a data set, the value that occurs most frequently.

R has no native method for determining the statistical mode (the built-in mode function reports an object's storage mode instead). You need to either create a function (probably using which.max and table), or install the modeest package.
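A minimal sketch of such a function, built on table. The name stat_mode is hypothetical (it is not part of base R or any package I know of), and I've chosen to return every tied value rather than only the first one, which is what which.max would give.

```r
# stat_mode: hypothetical helper, returns the most frequent value(s) as character
stat_mode <- function(x) {
  counts <- table(x)                                  # frequency of each distinct value
  names(counts)[as.vector(counts) == max(counts)]     # keep all values tied for the top count
}

variable <- c(3, 7, 7, 2, 7, 3)
stat_mode(variable)                 # "7"
max(table(variable))                # 3: how often the mode occurs
stat_mode(c("a", "b", "a", "b"))    # "a" "b": ties are all returned
```

Because table stores the values as names, the result comes back as character; wrap it in as.numeric if you need a number.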

R Example

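I'm not going to include it here on first writing. If I ever need to calculate it, I'll add the code I use at that time. One workaround: create a histogram.

Update: during my study I found a workaround using two functions, which.max and maxFreq. Note that maxFreq is not a base R function, so it needs a package or your own definition; with base R alone, max(table(variable)) returns how often the most frequent value occurs.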

Briefly:

> which.max(table(variable))
"Most Frequent Value"

> maxFreq(variable)
"How often"

Friday, April 10, 2020

median

A measure of center. The middle value of a set of ordered numbers. 50% of the set lies on either side of the median. Like the mean, the median does not have to be a member of the set: if the set has an even number of elements, the median is halfway between the two middle members.

R Example

> lst <- c(1:34, 35, -22)

> lst

 [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19

[20]  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35 -22

> median(lst)

[1] 17.5
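The odd/even rule is easier to see on tiny vectors. A small sketch with made-up numbers:

```r
odd  <- c(5, 1, 9)       # sorted: 1 5 9
even <- c(5, 1, 9, 3)    # sorted: 1 3 5 9

med_odd  <- median(odd)    # 5, the middle member
med_even <- median(even)   # 4, halfway between 3 and 5 and not a member of the set

# As with mean, NA values must be removed explicitly
med_no_na <- median(c(even, NA), na.rm = TRUE)   # 4
```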

Wednesday, April 8, 2020

mean

A measure of center. Arithmetic mean, or average. The sum of all values in a vector divided by the number of values present.

In R, mean has a couple of options. You can elect to exclude NA values (na.rm = TRUE), which can occur in data you get from any source. You can also 'trim' a fraction of the values off each end of the sorted data to reduce the influence of outliers (the trimmed values are dropped, not replaced by the nearest limit value).

R Example

> lst <- c(1,3,5)

> mean(lst)

[1] 3
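A small sketch of both options on a made-up vector with one obvious outlier. With trim = 0.2 and five values, one observation is dropped from each end before averaging.

```r
x <- c(1, 3, 5, 7, 100)

m_plain   <- mean(x)               # 23.2, pulled up by the 100
m_trimmed <- mean(x, trim = 0.2)   # 5, the mean of 3, 5 and 7

m_with_na <- mean(c(x, NA))                  # NA
m_no_na   <- mean(c(x, NA), na.rm = TRUE)    # 23.2
```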

Saturday, April 4, 2020

R You Ready

An Elemental Journey

There's nothing special here, folks. I've embarked on a short journey to understand the use of R a bit better. I've got a bit of stats education and practical experience, so I'm not starting out completely blind.

Follow along if you like. Please know that this series will eventually end. Because I'm interested in the practical use of R in everyday analytics (yes, I think it's going to be a reality for all of us), I want it to be a somewhat elementary treatment of statistics, providing the R recipes to accomplish basic analysis.