Learning Objectives
- Familiarize participants with R syntax
- Understand the concepts of objects and assignment
- Understand what functions and packages are, and how to obtain them.
- Understand data frames, vectors, and data types
Start by showing an example of a script. Point out <- for assignment, = for arguments, # for comments and how they are used to document a function and its content, and the $ operator.
library(dplyr)
library(tidyr)
library(readr)
#' Convert a zone-to-zone record of trucks into a trip table.
#'
#' @param x A data frame of truck plans, one row per truck, with origin and
#'   destination TAZ columns and a config column giving the truck configuration.
#'
#' @return A data frame with origin, destination, and truck volume by class
#'   (MU, SU).
#'
#' @import dplyr
#' @import tidyr
#'
sum_to_taz <- function(x){
  x <- x %>%
    # determine if truck is MU or SU
    mutate(class = ifelse(grepl("SU", config), "SU", "MU")) %>%
    # Add up to i, j, by class
    group_by(origin, destination, class) %>%
    summarise(n = n()) %>%
    # spread across types
    spread(class, n, fill = 0)
  return(x)
}
trucks <- read_csv("county_plans.csv", col_types = "ccccc") %>%
  sum_to_taz()
write_csv(trucks, "county_od_config.csv")
You can get output from R simply by typing math in the console:
3 + 5
12/7
However, to do useful and interesting things, we need to assign values to objects. To create an object, we need to give it a name followed by the assignment operator <-, and the value we want to give it:
weight_kg <- 55
Objects can be given any name such as x, current_temperature, or subject_id. You want your object names to be explicit and not too long. They cannot start with a number (2x is not valid, but x2 is). R is case sensitive (e.g., weight_kg is different from Weight_kg). There are some names that cannot be used because they are reserved words in R (e.g., if, else, for; see ?Reserved for a complete list). In general, even if it’s allowed, it’s best not to reuse the names of existing functions and objects (e.g., c, T, mean, data, df, weights). If in doubt, check the help to see if the name is already in use. It’s also best to avoid dots (.) within a variable name, as in my.dataset. There are many functions in R with dots in their names for historical reasons, but because dots have a special meaning in R (for methods) and in other programming languages, it’s best to avoid them. It is also recommended to use nouns for variable names and verbs for function names. It’s important to be consistent in the styling of your code (where you put spaces, how you name variables, etc.). In R, two popular style guides are Hadley Wickham’s and Google’s.
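The naming rules above can be tried out directly in the console. The following is a minimal sketch using only base R; the assigned values are arbitrary, and make.names() is used here only to illustrate how R would repair a name that is not syntactically valid.

```r
# R is case sensitive: these are two different objects
weight_kg <- 55
Weight_kg <- 60
weight_kg == Weight_kg
# → FALSE

# A name cannot start with a number. make.names() shows how R would
# rewrite such a name to make it valid.
make.names("2x")
# → "X2x"
make.names("x2")
# → "x2"
```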
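When assigning a value to an object, R does not print anything. You can force R to print the value by wrapping the assignment in parentheses or by typing the name of the object:
weight_kg <- 55    # doesn't print anything
(weight_kg <- 55)  # but putting parentheses around the call prints the value of weight_kg
weight_kg          # and so does typing the name of the object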
Now that R has weight_kg in memory, we can do arithmetic with it. For instance, we may want to convert this weight to pounds (weight in pounds is 2.2 times the weight in kg):
2.2 * weight_kg
We can also change a variable’s value by assigning it a new one:
weight_kg <- 57.5
2.2 * weight_kg
Assigning a value to one variable does not change the values of other variables. For example, let’s store the animal’s weight in pounds in a new variable, weight_lb:
weight_lb <- 2.2 * weight_kg
and then change weight_kg to 100.
weight_kg <- 100
What do you think is the current content of the object weight_lb? 126.5 or 220?
R objects can refer to more than single values. Other common objects are functions, vectors, lists, data frames, and many others. In today’s lesson we will talk mostly about vectors and data frames.
A vector is a set of values of a common type. You can create a vector with the c() function. You can examine the structure of an object with the str() function.
a <- c(100, 80, 90)
a
str(a)
If you mix types, then R coerces every element of the vector to a common type. Question: Why are there quotes around the values?
a_character <- c(100, 80, "90")
a_character
str(a_character)
If you have a vector of data that you want to be numeric but it is read in as characters, you can change it to numeric.
as.numeric(a_character)
If you have a value that cannot be turned into a number, it becomes the missing value NA. One of R’s advantages over many other programming languages is that it has built-in support for missing values.
as.numeric(c(100, 80, "ninety"))
## Warning: NAs introduced by coercion
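After a conversion like this, it is often useful to find out which entries failed. The following is a small sketch using base R only; the vector raw is made up for illustration.

```r
# Hypothetical raw values, as they might arrive from a text file
raw <- c("100", "80", "ninety")

# Convert, silencing the coercion warning shown above
converted <- suppressWarnings(as.numeric(raw))

# is.na() marks the entries that could not be converted
is.na(converted)
# → FALSE FALSE  TRUE

# Use that logical vector to look up the offending raw values
raw[is.na(converted)]
# → "ninety"
```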
Most statistical functions in R apply to whole vectors, like sum() and length():
sum(a)
length(a)
We can write our own functions, too.
average <- function(x){
sum(x) / length(x)
}
average(a)
Of course, there’s a mean() function already defined that we should use instead of writing our own:
mean(a)
Note that if there is an NA value in your vector, then mean() will by default return NA. You can exclude these values by setting na.rm = TRUE:
mean(c(5, 7, 4, NA))
mean(c(5, 7, 4, NA), na.rm = TRUE)
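To see how an argument like na.rm might work, here is a sketch that extends the average() function written earlier. This is an illustration only, not how mean() is actually implemented.

```r
# A version of average() with an optional na.rm argument (default FALSE),
# mirroring the behavior of mean()
average <- function(x, na.rm = FALSE){
  if (na.rm) {
    x <- x[!is.na(x)]  # drop missing values before averaging
  }
  sum(x) / length(x)
}

average(c(5, 7, 4, NA))                # NA, like mean()
average(c(5, 7, 4, NA), na.rm = TRUE)  # same result as mean(..., na.rm = TRUE)
```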
A data frame is a list of vectors that must be of the same length. This object works like tabular data you have used before.
d <- data.frame(
  name = c(1, 2, 3),
  value = c(72, 34, 85)
)
d
str(d)
You can extract a single column of a data frame as a vector with the $ selection operator, and then do operations on it.
mean(d$value)
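Because a data frame is a list of equal-length vectors, the usual list tools work on it. The sketch below rebuilds the same d so that it runs on its own; value_doubled is a made-up column name used only for illustration.

```r
d <- data.frame(
  name = c(1, 2, 3),
  value = c(72, 34, 85)
)

is.list(d)  # a data frame is a list of columns
nrow(d)     # number of rows
ncol(d)     # number of columns

# $ and [[ ]] both return the column as a plain vector
identical(d$value, d[["value"]])

# Assigning to a new name with $ adds a column
d$value_doubled <- d$value * 2
d
```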
Most data frames aren’t created by a user, but are read in from tabular data. To explore data frames more effectively with some real-world data examples, we’ll want to load a couple of packages first.
Packages in R are basically sets of additional functions that let you do more stuff. The functions we’ve been using so far, like str() or mean(), come built into R; packages give you access to thousands of functions that have been written by programmers and scientists around the world. Before you use a package for the first time you need to install it on your machine. You should have already installed the tidyverse collection of packages; if you have not, you can do so with the following command:
install.packages("tidyverse")
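If you are not sure whether a package is already installed, you can check before installing. The sketch below uses base R's requireNamespace(); is_installed() is a hypothetical helper written for this example, not a function from R or the tidyverse.

```r
# is_installed() is a hypothetical helper: TRUE if the package can be loaded
is_installed <- function(pkg){
  isTRUE(requireNamespace(pkg, quietly = TRUE))
}

# The stats package ships with R, so this is TRUE
is_installed("stats")

# Only prompt for installation when the package is missing
if (!is_installed("tidyverse")) {
  message("tidyverse is not installed; run install.packages(\"tidyverse\")")
}
```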
You might get asked to choose a CRAN mirror; this is basically asking you to choose a site to download the package from. The choice doesn’t matter too much; we recommend the RStudio mirror. The tidyverse actually contains a number of packages, including the ones we will use in this lab:
- readr contains functions to read tabular data (like .csv files) into your working environment as a data frame.
- dplyr contains functions for manipulating data frames.
R already contains functions to do both of these things in the base package, which we’ve been exploring so far. These packages have several improvements over base functions, however.
In order to use functions in a package, you need to load it from your library.
library(tidyverse)
Download the data for this lesson: nhts_per.csv. Place this csv file in your project’s data/ folder. Then read it into your workspace with read_csv(). These are real NHTS person records, but with only the first 200 individuals for expediency.
nhts_per <- read_csv("data/nhts_per.csv")
## Parsed with column specification:
## cols(
## .default = col_character(),
## HOUSEID = col_integer(),
## VARSTRAT = col_integer(),
## WTPERFIN = col_double(),
## SFWGT = col_double(),
## DRVRCNT = col_integer(),
## HHSIZE = col_integer(),
## HHVEHCNT = col_integer(),
## NUMADLT = col_integer(),
## WRKCOUNT = col_integer(),
## CNTTDTR = col_integer(),
## CARRODE = col_integer(),
## CDIVMSAR = col_integer(),
## DELIVER = col_integer(),
## FMSCSIZE = col_integer(),
## FXDWKPL = col_integer(),
## GCDWORK = col_double(),
## LSTTRDAY = col_integer(),
## MCUSED = col_integer(),
## NBIKETRP = col_integer(),
## NWALKTRP = col_integer()
## # ... with 18 more columns
## )
## See spec(...) for full column specifications.
nhts_per
When you print a data frame that you loaded with read_csv() and dplyr, it will show you how many rows are in the data, and the names and types of all the variables. Additionally, RStudio is equipped with a function that will allow you to explore a data frame interactively.
View(nhts_per)
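Going back to read_csv() for a moment: rather than letting it guess column types, you can declare them with the col_types argument, as the script at the top of this lesson does. The sketch below assumes readr is installed and writes a tiny made-up file so it can run on its own; the values are invented and are not real NHTS records.

```r
library(readr)

# Write a tiny made-up CSV to a temporary file
path <- tempfile(fileext = ".csv")
writeLines(
  c("HOUSEID,USEPUBTR,HHSIZE",
    "20000017,01,2",
    "20000231,02,4"),
  path
)

# Declare the type of every column instead of letting read_csv() guess
toy <- read_csv(
  path,
  col_types = cols(
    HOUSEID = col_character(),
    USEPUBTR = col_character(),
    HHSIZE = col_integer()
  )
)

toy$USEPUBTR  # codes keep their leading zero because they are character
toy$HHSIZE    # integers, so arithmetic works
```

Reading coded fields such as USEPUBTR as character is what makes a comparison like USEPUBTR == "01" behave as expected later on.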
Now that we have a data frame in our workspace, we’re going to learn some of the most common dplyr functions: select(), filter(), mutate(), group_by(), and summarize(). To select columns of a data frame, use select(). The first argument to this function is the data frame (nhts_per), and the subsequent arguments are the columns to keep.
select(nhts_per, HOUSEID, HHSIZE, HHVEHCNT, USEPUBTR)
To choose rows, use filter():
filter(nhts_per, USEPUBTR == "01")
How can you tell that this changed?
But what if you wanted to select and filter at the same time? There are three ways to do this: use intermediate steps, nested functions, or pipes. With intermediate steps, you essentially create a temporary data frame and use that as input to the next function. This can clutter up your workspace with lots of objects. You can also nest functions (i.e., one function inside of another). This is handy, but can be difficult to read if too many functions are nested, as R processes them from the inside out. The last option, pipes, are a fairly recent addition to R. Pipes let you take the output of one function and send it directly to the next, which is useful when you need to do many things to the same data set. Pipes in R look like %>% and are made available via the magrittr package, installed as part of dplyr.
nhts_per %>%
  filter(USEPUBTR == "01") %>%
  select(HOUSEID, PERSONID, HHSIZE, HHVEHCNT, USEPUBTR)
In the above we use the pipe to send the nhts_per data set first through filter(), to keep rows where USEPUBTR was 01, and then through select() to keep the household and person IDs, household size, vehicles, and public transit columns. When the data frame is being passed to the filter() and select() functions through a pipe, we don’t need to include it as an argument to these functions anymore.
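To compare the three approaches side by side, here is a sketch on a small made-up data frame (toy and its values are invented for illustration; only the column names follow the NHTS data). It assumes dplyr is installed.

```r
library(dplyr)

# Made-up records for illustration
toy <- data.frame(
  HOUSEID  = c("A", "A", "B", "C"),
  HHSIZE   = c(2L, 2L, 1L, 3L),
  USEPUBTR = c("01", "02", "02", "01")
)

# 1. Intermediate steps: an extra object is left in the workspace
transit_rows <- filter(toy, USEPUBTR == "01")
by_steps <- select(transit_rows, HOUSEID, HHSIZE)

# 2. Nested functions: read from the inside out
nested <- select(filter(toy, USEPUBTR == "01"), HOUSEID, HHSIZE)

# 3. Pipes: read from top to bottom
piped <- toy %>%
  filter(USEPUBTR == "01") %>%
  select(HOUSEID, HHSIZE)

# All three give the same result
identical(by_steps, nested) && identical(nested, piped)
```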
If we wanted to create a new object with this smaller version of the data we could do so by assigning it a new name:
use_transit <- nhts_per %>%
  filter(USEPUBTR == "01") %>%
  select(HOUSEID, PERSONID, HHSIZE, HHVEHCNT, USEPUBTR)
Note that the name of the final data frame, use_transit, is the leftmost part of this expression.
Challenge
Using pipes, subset the data to include people 18 and older who live in households where the number of vehicles is less than the number of adults. (R_AGE is the age of the individual.)
## Answer
nhts_per %>%
  filter(R_AGE >= 18) %>%
  filter(HHVEHCNT < NUMADLT)
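The two filter() calls in the answer can also be written as one. The sketch below, on made-up records, shows that conditions separated by commas inside a single filter() are combined with "and"; it assumes dplyr is installed, and the values in toy are invented.

```r
library(dplyr)

# Made-up person records for illustration
toy <- data.frame(
  R_AGE    = c(45, 16, 30, 70),
  HHVEHCNT = c(1, 1, 2, 0),
  NUMADLT  = c(2, 2, 2, 1)
)

# Two filters in sequence, as in the answer above
two_filters <- toy %>%
  filter(R_AGE >= 18) %>%
  filter(HHVEHCNT < NUMADLT)

# One filter with both conditions
one_filter <- toy %>%
  filter(R_AGE >= 18, HHVEHCNT < NUMADLT)

identical(two_filters, one_filter)
```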
Frequently you’ll want to create new columns based on the values in existing columns, for example to do unit conversions, or find the ratio of values in two columns. For this use mutate().
To create a new logical column that is TRUE if the person used public transit:
nhts_per %>%
  mutate(use_transit = ifelse(USEPUBTR == "01", TRUE, FALSE)) %>%
  select(HOUSEID, PERSONID, use_transit)
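The ratio of two columns mentioned above works the same way. Here is a sketch on made-up household records, assuming dplyr is installed; veh_per_adult is a hypothetical column name chosen for this example.

```r
library(dplyr)

# Made-up household records for illustration
toy <- data.frame(
  HOUSEID  = c("A", "B", "C"),
  HHVEHCNT = c(1, 2, 0),
  NUMADLT  = c(2, 2, 1)
)

# mutate() computes the new column row by row from existing columns
with_ratio <- toy %>%
  mutate(veh_per_adult = HHVEHCNT / NUMADLT)

with_ratio
```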
Because most of these are FALSE, we can use filter() to zero in on the TRUE values. Note that because use_transit is logical, we don’t need to be explicit about what we are filtering.
nhts_per %>%
  mutate(use_transit = ifelse(USEPUBTR == "01", TRUE, FALSE)) %>%
  filter(use_transit) %>%
  select(HOUSEID, PERSONID, use_transit)
You can negate any logical value or expression with the ! operator. If we wanted non-transit users, we could do this:
nhts_per %>%
  mutate(use_transit = ifelse(USEPUBTR == "01", TRUE, FALSE)) %>%
  filter(!use_transit) %>%
  select(HOUSEID, PERSONID, use_transit)
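The ! operator works on any logical vector, not only inside filter(). A short base R sketch with made-up values:

```r
# Negating a logical vector flips every element
used_transit <- c(TRUE, FALSE, FALSE)
!used_transit
# → FALSE  TRUE  TRUE

# It is often combined with is.na() to keep the non-missing values
ages <- c(34, NA, 61)
!is.na(ages)
# → TRUE FALSE  TRUE

mean(ages[!is.na(ages)])
# → 47.5
```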
Many data analysis tasks can be approached using the “split-apply-combine” paradigm: split the data into groups, apply some analysis to each group, and then combine the results. dplyr makes this very easy through the use of the group_by() function. group_by() splits the data into groups upon which some operations can be run. For example, if we wanted to find how many people were in each household and their average age:
nhts_per %>%
  group_by(HOUSEID) %>%
  summarize(
    mean_age = mean(R_AGE),
    number_people = n()
  )
You can group by multiple columns too:
nhts_per %>%
  group_by(HHSIZE, HHVEHCNT) %>%
  summarise(
    n = n(),
    weighted_n = sum(WTPERFIN)
  )
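To make the split-apply-combine idea concrete, the sketch below does the same grouped mean twice on made-up records: once with dplyr and once by hand with base R's split() and sapply(). The data in toy are invented, and dplyr is assumed to be installed.

```r
library(dplyr)

# Made-up person records: two people in household A, one in household B
toy <- data.frame(
  HOUSEID = c("A", "A", "B"),
  R_AGE   = c(40, 10, 30)
)

# dplyr: group_by() splits, summarize() applies and combines
by_dplyr <- toy %>%
  group_by(HOUSEID) %>%
  summarize(
    mean_age = mean(R_AGE),
    number_people = n()
  )

# Base R, step by step
ages_by_house <- split(toy$R_AGE, toy$HOUSEID)  # split into one vector per household
by_base <- sapply(ages_by_house, mean)          # apply mean() and combine the results

by_dplyr
by_base
```

Both approaches give a mean age of 25 for household A and 30 for household B; the dplyr version also keeps the result in a data frame, which makes it easier to pipe into further steps.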
Much of this lesson was copied or adapted from Jeff Hollister’s materials