Introduction

R is a powerful, open-source language for statistical computing. It is quickly gaining traction across academia, research organizations, and businesses, and it is often the tool of choice for statisticians, data scientists, quantitative financial analysts, and many other professions. It is widely used for research in graduate schools and by companies such as Facebook, Google, the NY Times, and Wall Street financial organizations.

R is freely available to download. You can put base R on any government computer and use it as-is to write and run R scripts. That being said, RStudio provides a very useful “front-end” for R that is generally easier to use (R is still the “engine”; you can’t run RStudio without R). We will primarily use RStudio in SE350. Some DoD organizations, however, will not allow you to install RStudio. Remember, though, that everything we teach you can still be run from base R.

In R you run programs and calculations from a command line (the console). While you will sometimes type directly into the command line (to explore data or run simple calculations), it is usually best to write your code in a *.R file. This file serves as “scratch-paper” for you: it is much easier to edit and adjust your code when it is contained on the “scratch-paper”. To create a new *.R file in RStudio, select File -> New File -> R Script. Now you have blank “scratch-paper”.


Installation

  1. Install Base R by going to http://cran.r-project.org/bin/windows/base/
  2. Install RStudio by going to http://www.rstudio.com/products/rstudio/download/

R Environment and Workspace

R is always pointing to a certain folder on your computer. This is called your working directory. Unless you give a full path, R reads files from and writes files to this directory. You can see your working directory by typing

getwd()
## [1] "C:/Users/david.beskow/Google Drive/DataAnalysisLessons"

If you want to change your working directory, you can do this in two ways. If you are using RStudio, you can go to Session -> Set Working Directory. If you want to change your working directory using a command (especially if you’re using base R), then you can type

setwd("C:/Users/david.beskow/Google Drive/DataAnalysisLessons")  ###Make sure you use Forward Slashes

If you want to see the names of files in your working directory without opening Windows Explorer, you can use the command

dir()
##  [1] "bacon.csv"                 "baseball.csv"             
##  [3] "births.csv"                "DataAnalysisLessons.Rproj"
##  [5] "figure"                    "Lesson1.html"             
##  [7] "Lesson1.Rmd"               "Lesson1Slides-figure"     
##  [9] "Lesson1Slides.md"          "Lesson1Slides.Rpres"      
## [11] "Lesson2.html"              "Lesson2.Rmd"              
## [13] "Lesson2_cache"             "Lesson2_files"            
## [15] "mortality.csv"             "R Tutorials"

Note that this gives the names of the files in your working directory, which saves you the time of opening Windows Explorer to remind yourself what you named your data file.
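
If the folder holds many files, it can help to list only some of them. dir() is the same function as list.files(), and both accept a pattern argument. The sketch below is only an illustration: it builds a temporary folder with made-up file names so that it runs on any computer.

```r
# Create a throwaway folder with a few empty files (hypothetical names)
demo_dir <- file.path(tempdir(), "dir_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("births.csv", "bacon.csv", "Lesson1.Rmd")))

# List only the files whose names end in .csv
csv_files <- list.files(demo_dir, pattern = "\\.csv$")
csv_files
```

In your own work you would normally leave out the folder argument, for example dir(pattern = "\\.csv$"), so that R searches the working directory.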

Types and Shape of Data

Before we get into data, I first want to show you that your command line can operate like a calculator

5+4+7*7
## [1] 58

or

pi*7.2^2
## [1] 162.9

Note that in both of these examples, the answer is printed to the screen, but not stored in memory. In other words, I cannot access that answer without redoing the calculation. If I want to store it in memory, I assign the result of the computation to a name. We use the symbol <- to mean “assign”. In other words, the result of the computation on the right of the symbol is assigned to the name on the left of the symbol. For example:

x<-4*4

I have now assigned the result of my computation to the name x. If I want to see this value of x in the future, I can just type it in the console.

x
## [1] 16

And I can also use it in future computations:

y<-x/2

x is now stored in your Global Environment. Think of this as your “workbench” that contains all of the data and values that you are working on. In RStudio, you can usually see what is in your Global Environment in the top right part of the RStudio window. If you’re using base R, you can list the variables that are in your Global Environment by typing

ls()
## [1] "metadata" "x"        "y"

When you close either RStudio or base R, it will ask you if you want to save your workspace. It is essentially asking you if you want to save what is on your workbench. If you choose “yes”, it will save an *.RData file containing everything in your workspace to your working directory. If you restart R from this working directory, it will load all of these items back into your workspace. Generally it is not a good idea to save your workspace, as long as you have all of the code needed to quickly recreate the items in it. However, if you have code that takes a long time to run, it is best to save those results so that you don’t have to wait hours or days a second time to recreate them. For example, I wrote some R code to “clean” Afghanistan Blue Force Tracker data, and it took approximately 11 days to run. In this case, I would want to save my results so I don’t have to wait 11 days again. In general, however, R code takes seconds to run, and it is best not to save your workspace as long as you have clean, easy-to-run code.
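
When only one or two objects are expensive to recreate, one option is to save just those objects rather than the whole workspace. Here is a minimal sketch using saveRDS and readRDS from base R. The object cleaned and the file name are made up for illustration, and the file goes in a temporary folder so the example runs anywhere.

```r
# Stand-in for the result of a long-running computation (made-up data)
cleaned <- data.frame(id = 1:3, speed = c(10.5, 22.1, 0))

# Save just this one object to a file
rds_path <- file.path(tempdir(), "cleaned.rds")
saveRDS(cleaned, rds_path)

# Later, possibly in a new R session, read it back under any name you like
restored <- readRDS(rds_path)
head(restored)
```

In practice you would likely save to your working directory, for example saveRDS(cleaned, "cleaned.rds").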

Input/Output Data

Now that we have all of that done, let’s learn how to read and write data. To do this, we will read in the birth data that was provided to you. This is a 10% random sample of births in the United States in 2006.

A key for the variables is given below:

Name       Description
DOB_MM     Month of date of birth
DOB_WK     Day of week of birth
MAGER      Mother’s age
TBO_REC    Total birth order
WTGAIN     Weight gain by mother
SEX        A factor with levels F M, representing the sex of the child
APGAR5     APGAR score
DMEDUC     Mother’s education level
UPREVIS    Number of prenatal visits
ESTGEST    Estimated weeks of gestation
DMETH_REC  Delivery method
DPLURAL    “Plural Births;” levels include 1 Single, 2 Twin, 3 Triplet, 4 Quadruplet, and 5 Quintuplet or higher
DBWT       Birth weight, in grams

We use the command read.csv to read in data. We also make sure to assign this to a name

births<-read.csv("births.csv",as.is=TRUE)

Now that we’ve read the file in, let’s check on its size.

dim(births)
## [1] 427323     14

This command gives us the number of rows and columns in the data set. We see that there are 427,323 records of 14 variables. I usually also use the command head to let me see the first 6 rows. This gives the titles of the variables (columns) as well as a feel for the data:

head(births)
##         X DOB_MM DOB_WK MAGER TBO_REC WTGAIN SEX APGAR5
## 1  591430      9      1    25       2     NA   F     NA
## 2 1827276      2      6    28       2     26   M      9
## 3 1705673      2      2    18       2     25   F      9
## 4 3368269     10      5    21       2      6   M      9
## 5 2990253      7      7    25       1     36   M     10
## 6  966967      3      3    28       3     35   M      8
##                   DMEDUC UPREVIS ESTGEST DMETH_REC  DPLURAL DBWT
## 1                   NULL      10      99   Vaginal 1 Single 3800
## 2     2 years of college      10      37   Vaginal 1 Single 3625
## 3                   NULL      14      38   Vaginal 1 Single 3650
## 4                   NULL      22      38   Vaginal 1 Single 3045
## 5 2 years of high school      15      40   Vaginal 1 Single 3827
## 6                   NULL      18      39   Vaginal 1 Single 3090

Finally, if we want to get a comprehensive summary of all of the variables, we can use the summary command

summary(births)
##        X               DOB_MM         DOB_WK         MAGER     
##  Min.   :     15   Min.   : 1.0   Min.   :1.00   Min.   :12.0  
##  1st Qu.:1065612   1st Qu.: 4.0   1st Qu.:3.00   1st Qu.:23.0  
##  Median :2136041   Median : 7.0   Median :4.00   Median :27.0  
##  Mean   :2134894   Mean   : 6.6   Mean   :4.07   Mean   :27.4  
##  3rd Qu.:3203484   3rd Qu.:10.0   3rd Qu.:6.00   3rd Qu.:32.0  
##  Max.   :4273212   Max.   :12.0   Max.   :7.00   Max.   :50.0  
##                                                                
##     TBO_REC         WTGAIN          SEX                APGAR5     
##  Min.   :1.0    Min.   : 0      Length:427323      Min.   : 0     
##  1st Qu.:1.0    1st Qu.:21      Class :character   1st Qu.: 9     
##  Median :2.0    Median :30      Mode  :character   Median : 9     
##  Mean   :2.4    Mean   :31                         Mean   : 9     
##  3rd Qu.:3.0    3rd Qu.:40                         3rd Qu.: 9     
##  Max.   :8.0    Max.   :98                         Max.   :10     
##  NA's   :3134   NA's   :75856                      NA's   :58231  
##     DMEDUC             UPREVIS      ESTGEST      DMETH_REC        
##  Length:427323      Min.   : 0   Min.   :12.0   Length:427323     
##  Class :character   1st Qu.: 9   1st Qu.:38.0   Class :character  
##  Mode  :character   Median :12   Median :39.0   Mode  :character  
##                     Mean   :14   Mean   :46.6                     
##                     3rd Qu.:14   3rd Qu.:40.0                     
##                     Max.   :99   Max.   :99.0                     
##                                                                   
##    DPLURAL               DBWT     
##  Length:427323      Min.   : 227  
##  Class :character   1st Qu.:2972  
##  Mode  :character   Median :3310  
##                     Mean   :3265  
##                     3rd Qu.:3629  
##                     Max.   :8165  
##                     NA's   :434

This gives us the min, max, median, and mean, as well as the 1st and 3rd quartiles (remember that the median is the 2nd quartile). For character columns such as SEX, it reports only the length and class. For example, looking at the MAGER variable, we see that the youngest woman to give birth was 12 years old and the oldest was 50 years old, with a mean age of about 27.4.
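
If you only need the quartiles of a single variable, the quantile command returns them directly. The short sketch below uses a made-up vector of ages rather than the birth data, so the values are for illustration only.

```r
# Made-up ages, for illustration only
ages <- c(18, 21, 25, 25, 28, 28, 31, 40)

# 1st, 2nd, and 3rd quartiles
q <- quantile(ages, probs = c(0.25, 0.5, 0.75))
q

# The 2nd quartile is the same number as the median
median(ages)
```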

We will play with the birth data later. Now that we understand how to get data into R, you may need to push data out of R. We usually do this by writing it to a comma separated value (csv) file. These are raw data files that Microsoft Excel can read. You can write to a csv by using the command write.csv. To learn more about this command, see help(write.csv). Note that you can use help() on any command to learn more about it or to remember what the arguments are.
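
As a small sketch of write.csv, the code below writes a made-up data frame to a temporary file and reads it back in. The data frame scores and the file name are assumptions for illustration. Setting row.names = FALSE keeps R from writing the row numbers as an extra first column.

```r
# A small made-up data frame
scores <- data.frame(name = c("Ann", "Bob"), score = c(90, 85))

# Write it to a csv file (a temporary location here)
out_path <- file.path(tempdir(), "scores.csv")
write.csv(scores, out_path, row.names = FALSE)

# Read it back in to confirm what was written
back <- read.csv(out_path, as.is = TRUE)
back
```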

Data can belong to different classes. The basic building blocks are the integer, numeric, character, logical (boolean), Date, and factor classes. The first four should be self-explanatory, and examples of each are below:

x<-4L                  #integer (the L suffix makes it an integer; a plain 4 is stored as numeric)
x<-4.56                #numeric
x<-TRUE                #logical (boolean)
x<-"Start the Corps!"  #character

Use the class command to find out what type of data you have. Note that because we used x for all four assignments, each line overwrote the previous value of x. After running these four lines of code, x holds the value from the last line: the character string “Start the Corps!”

class(x)
## [1] "character"
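
One detail is easy to miss: a whole number typed without a suffix is stored as numeric, not integer. A short sketch of the difference (the variable names are just for illustration):

```r
whole   <- 4      # no suffix: stored as numeric (double)
counted <- 4L     # L suffix: stored as integer

class(whole)      # "numeric"
class(counted)    # "integer"
class(TRUE)       # "logical"

# Integers still count as numeric for arithmetic purposes
is.numeric(counted)   # TRUE
```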

R does not automatically recognize date data. When you read date data into R, it initially comes in as character data. If you want R to recognize it as a date, you need to explicitly convert it:

x<-"2014-01-01"
x<-as.Date(x)
class(x)
## [1] "Date"
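
The conversion above works without extra arguments because the text is already in year-month-day order. If your dates are written another way, as.Date takes a format argument that describes the layout. A minimal sketch, assuming a made-up date written month/day/year:

```r
us_style <- "01/15/2014"                      # month/day/year text
d <- as.Date(us_style, format = "%m/%d/%Y")   # tell R how to read it
d                                             # "2014-01-15"

# Once it is a Date, you can do date arithmetic
later <- d + 30                               # 30 days later
later                                         # "2014-02-14"
```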

There is also a type of data called factor. This is categorical data in which each distinct value (a “level”) is also stored as an integer code, which certain types of models require. Character data is often coerced to the factor class when you have nominal data. Think of a list of data that is either “male” or “female”. If I change this into a factor, it will still be displayed as “male” and “female”, but it will also be represented numerically. You need to be careful when using factors, since some functions in R treat factors differently from character data. You can see the use of factor data below:

y<-c("male","male","female","male","female")

This is character data. If I tried to plot y right now, R would show an error, since you can’t plot character data. Let’s convert this to a factor now:

y<-as.factor(y)
y
## [1] male   male   female male   female
## Levels: female male

Now watch when I try to plot this:

plot(y)

[Figure: bar chart of the number of “female” and “male” entries in y]

It plots a bar chart because R recognizes this as a factor and has a numeric value associated with each of the “levels” in the factor.
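
To see the numeric values that sit behind a factor, you can ask for its levels and its integer codes. A short sketch using the same male/female data (stored here under the name sex, which is just for illustration):

```r
sex <- factor(c("male", "male", "female", "male", "female"))

lv <- levels(sex)         # the distinct values, sorted: "female" "male"
codes <- as.integer(sex)  # the code for each entry: 2 2 1 2 1

lv
codes
```

The levels are sorted alphabetically by default, which is why “female” gets code 1 and “male” gets code 2.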

There are also different dimensions of data. So far we’ve been using scalars, in which our variable x is a single value. Data can have 1, 2, or many dimensions, however. A one-dimensional collection of data is known as a vector. An example of a vector is given below:

x<-c(1,6,3,9,8,2)

If you need to create a vector of sequential integers, you can use a colon:

x<-c(1:10)
x
##  [1]  1  2  3  4  5  6  7  8  9 10

If you need to create a vector that repeats the same number, you can use the rep command:

rep(1,10)  # Repeat 1 ten times
##  [1] 1 1 1 1 1 1 1 1 1 1
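
Two related variations come up often: sequences that do not step by 1, and repeating a whole vector. A brief sketch using seq and the times and each arguments of rep:

```r
quarters  <- seq(0, 1, by = 0.25)           # 0.00 0.25 0.50 0.75 1.00
countdown <- seq(10, 1, by = -3)            # 10 7 4 1

cycled  <- rep(c("a", "b"), times = 2)      # "a" "b" "a" "b"
doubled <- rep(c("a", "b"), each = 2)       # "a" "a" "b" "b"
```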

I can refer to certain elements of a vector using subscripts or a boolean vector. Using subscripts it looks like

a <- c(1,2,5.3,6,-2,4) # numeric vector
a[c(2,4)] # 2nd and 4th elements of vector
## [1] 2 6

To use a boolean vector, it would look like this:

a <- c(1,2,5.3,6,-2,4) # numeric vector
c <- c(TRUE,TRUE,TRUE,FALSE,TRUE,FALSE) #logical vector
a[c]
## [1]  1.0  2.0  5.3 -2.0

Notice that this will produce a vector of only those elements of “a” that had TRUE in their respective location in the “c” vector.
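
In practice you rarely type the TRUE/FALSE values by hand. Instead, you let a comparison create the boolean vector for you. A short sketch using the same vector a (the name big is just for illustration):

```r
a <- c(1, 2, 5.3, 6, -2, 4)   # numeric vector

big <- a > 2                  # FALSE FALSE TRUE TRUE FALSE TRUE
a[big]                        # 5.3 6.0 4.0

# The comparison can also go directly inside the brackets
mid <- a[a > 2 & a < 6]       # 5.3 4.0

# Summing a boolean vector counts the TRUE values
n_big <- sum(a > 2)           # 3
```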

We can also work with two-dimensional data. This is the most common way that data comes. You can think of an Excel sheet that comes with rows and columns. There are several ways of storing two-dimensional data in R; we will primarily focus on data frames. Data frames are the most common method of pulling data into R. In a data frame, each column must hold a single type of data (all numeric, all character, etc.), but different columns can hold different types. I can therefore have a column with date data, a column with numeric data, and a column with character data. All columns in a data frame must have the same length.

d <- c(1,2,3,4)
e <- c("red", "white", "red", NA)
f <- c(TRUE,TRUE,TRUE,FALSE)
mydata <- data.frame(ID=d,Color=e,Passed=f)

Use the names command to access and/or change the names of each column in a data.frame

names(mydata)[3]<-"Passed1"      #Changes the name of the third column
names(mydata)       #Lists the names of the data frame
## [1] "ID"      "Color"   "Passed1"

In order to access and use a single column just like we would a vector, we use the “$”. For example, to access the Color column, we use

mydata$Color
## [1] red   white red   <NA> 
## Levels: red white
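
The “Levels” line in this output shows that the Color column was stored as a factor. Whether that happens depends on your version of R: starting with R 4.0.0, data.frame keeps text columns as character unless you ask otherwise. The sketch below sets stringsAsFactors explicitly so the behavior is the same on any version, then converts the column by hand.

```r
mydata <- data.frame(ID = c(1, 2, 3, 4),
                     Color = c("red", "white", "red", NA),
                     Passed = c(TRUE, TRUE, TRUE, FALSE),
                     stringsAsFactors = FALSE)   # keep text as character

str(mydata)             # compact overview of each column and its type
class(mydata$Color)     # "character"

# Convert the column to a factor only when you need one
mydata$Color <- factor(mydata$Color)
levels(mydata$Color)    # "red" "white"
```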

Note that when you bring data into R with read.csv, it automatically comes in as a data frame. We can check this by looking at the class for the birth data.

class(births)
## [1] "data.frame"

Subset Data

We will now return to the birth data and learn how to subset data. Remember that the birth data had 14 variables. Let’s pretend that we really don’t need all of the variables, but are really only concerned with the Month of Birth, Day of Week, Age of mother, Sex, Apgar Score and Estimated Gestation. We can use one of the following two commands to subset our data:

births.sub1<-births[,c(2,3,4,7,8,11)]

Note in this example that the comma separates numbers relating to rows and numbers relating to columns. Since the numeric vector is on the right side of the comma, it is referring to columns. As we will see in a little bit, if it were on the left side of the comma, it would refer to the row numbers.

Another way to do this is

myvars <- names(births) %in% c("DOB_MM","DOB_WK","MAGER","SEX","APGAR5","ESTGEST")
births.sub2 <- births[,myvars]

Note that if we print the first 6 rows of this data set, we see that it is pared down:

head(births.sub1)
##   DOB_MM DOB_WK MAGER SEX APGAR5 ESTGEST
## 1      9      1    25   F     NA      99
## 2      2      6    28   M      9      37
## 3      2      2    18   F      9      38
## 4     10      5    21   M      9      38
## 5      7      7    25   M     10      40
## 6      3      3    28   M      8      39
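
A third option is to put the column names themselves inside the brackets, which is often easier to read than column numbers. The sketch below uses a tiny made-up data frame called toy (not the birth data) that borrows a few of the same column names.

```r
# Tiny made-up data frame, for illustration only
toy <- data.frame(DOB_MM = c(9, 2), DOB_WK = c(1, 6),
                  MAGER = c(25, 28), SEX = c("F", "M"))

# Keep columns by name
keep <- toy[, c("MAGER", "SEX")]

# Drop a column by name
dropped <- toy[, !(names(toy) %in% "DOB_WK")]

# Selecting one column returns a vector unless you add drop = FALSE
one_col <- toy[, "MAGER", drop = FALSE]
```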

Now that we’ve learned to subset the column data, let’s subset row data. The first way is to use numbers, just like we did with columns. Remember that the numeric vector on the left side of the comma refers to row numbers. So in the following, we have selected the first 100 rows of data:

births.sub3<-births[1:100,]

Now let’s use births.sub1 and select only those children that are female. To do this we use the following code:

births.F<-subset(births.sub1,SEX=="F")

You can look at your Global Environment and see that there are 208,653 observations in this subset (meaning there were 208,653 females in this random sample).

Note that you can combine conditions with the AND/OR operators (& and |). Let’s say we want all female children born to women who are 18 years or younger. We can do this with this command:

births.F18<-subset(births.sub1,SEX=="F" & MAGER<=18)

Once again, looking at our Global Environment will show us that there were 13,118 records in this subset.
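
If you are working in base R without the Global Environment pane, nrow gives the same count. The sketch below uses a small made-up data frame called toy to show this, and also shows that the same filter can be written with brackets, with the condition on the left side of the comma.

```r
# Small made-up data frame, for illustration only
toy <- data.frame(SEX = c("F", "M", "F", "M", "F"),
                  MAGER = c(17, 18, 25, 30, 18))

# Both conditions must hold (AND)
young_f <- subset(toy, SEX == "F" & MAGER <= 18)
nrow(young_f)      # 2

# The same filter written with brackets
young_f2 <- toy[toy$SEX == "F" & toy$MAGER <= 18, ]

# Either condition may hold (OR)
either <- subset(toy, SEX == "M" | MAGER <= 18)
nrow(either)       # 4
```

One difference to be aware of: when the condition evaluates to NA for a row, subset drops that row, while the bracket form returns a row of NA values.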

Table Data

Another common way to store information is in a table. Here we look at how to define both one-way and two-way tables.

The first example is a one-way table. One-way tables are not the most interesting example, but they are a good place to start. One way to create a table is with the table command. It takes a vector of factors as its argument and counts how often each level occurs. Here is an example of how to create a one-way table:

a <- factor(c("A","A","B","A","B","B","C","A","C"))
table(a)
## a
## A B C 
## 4 3 2

A table is usually the input to the barplot command

barplot(table(a))

[Figure: bar plot of the counts in table(a)]

or a pie chart

pie(table(a))

[Figure: pie chart of the counts in table(a)]

We will talk more about plots and graphs next lesson.

If you want to add a second dimension to your table, just add another vector to the arguments of the table command. In the table below we see the responses to two questions in a survey:

a <- c("Sometimes","Sometimes","Never","Always","Always","Sometimes","Sometimes","Never")
b <- c("Maybe","Maybe","Yes","Maybe","Maybe","No","Yes","No")
table(a,b)
##            b
## a           Maybe No Yes
##   Always        2  0   0
##   Never         0  1   1
##   Sometimes     2  1   1

The table command allows us to do a very quick calculation, and we can immediately see that two people who said “Sometimes” to the first question also said “Maybe” to the second question.
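
Counts are often easier to interpret alongside totals or proportions. As a short sketch with the same survey answers, addmargins appends row and column totals, and prop.table converts counts to proportions (margin = 1 means each row sums to 1):

```r
a <- c("Sometimes","Sometimes","Never","Always","Always","Sometimes","Sometimes","Never")
b <- c("Maybe","Maybe","Yes","Maybe","Maybe","No","Yes","No")
tab <- table(a, b)

addmargins(tab)                      # adds a "Sum" row and column
row_props <- prop.table(tab, margin = 1)
row_props                            # proportions within each row of a
```

For example, half of the people who answered “Sometimes” to the first question answered “Maybe” to the second.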

A two-way table is also the basic input of a stacked barplot:

barplot(table(a,b))

[Figure: stacked bar plot of table(a,b)]

Note that the table command is very useful in exploring data. Let’s use it to explore the birth data. If we wanted to quickly see the number of males and females in the data set, we could use the command

table(births$SEX)
## 
##      F      M 
## 208653 218670

If we wanted to see the education options and numbers for the data set, we could use the following:

table(births$DMEDUC)
## 
##            1 year of college        1 year of high school 
##                         7850                        12672 
## 1 Years of elementary school           2 years of college 
##                          550                        36713 
## 2 Years of elementary school       2 years of high school 
##                          515                        65462 
##           3 years of college 3 Years of elementary school 
##                        24600                          583 
##       3 years of high school 4 Years of elementary school 
##                        15942                         4279 
##       4 years of high school 5 Years of elementary school 
##                        23863                          915 
## 6 Years of elementary school 7 Years of elementary school 
##                         3672                         8686 
## 8 Years of elementary school          No formal education 
##                         9466                          353 
##           Not on certificate                         NULL 
##                          143                       211059

This would quickly show us that close to half of our data has NULL as the value for education.
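
Because “NULL” here is just a piece of text, R does not treat it as missing. One way to handle this, sketched below on a short made-up vector rather than the full birth data, is to measure the share of “NULL” entries and then recode them as NA so that R counts them as missing.

```r
# Made-up education values, for illustration only
educ <- c("NULL", "2 years of college", "NULL",
          "2 years of high school", "NULL", "NULL")

# Share of entries equal to the text "NULL"
share_null <- mean(educ == "NULL")
share_null

# Recode the text "NULL" as a true missing value
educ[educ == "NULL"] <- NA
table(educ, useNA = "ifany")   # the missing values appear under <NA>
n_missing <- sum(is.na(educ))  # 4
```

Alternatively, read.csv has an na.strings argument, so passing na.strings = "NULL" when reading the file should convert these entries to NA as the data comes in.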