Chapter 2 R basics
2.1 Overview
This chapter covers some basic tools and characteristics of the R language that we’ll use in the workshop. While this obviously can’t be a comprehensive introduction to R, we’ll demonstrate some essential R notions and features that are commonly used in social science data analysis, including (egocentric) social network analysis.
The script covers the following topics:
- Starting R, getting help with R.
- Creating and saving R objects.
- Vectors and matrices, data frames and tibbles.
- Arithmetic and relational operations.
- Subsetting vectors, matrices, and data frames.
2.2 Starting R and loading packages
- Before starting any work in R, you normally want to do two things:
- Make sure your R session is pointing to the correct working directory.
- Install and/or load the packages you are going to use.
- Working directory. By default, R will look for files and save new files in this directory.
- Type
getwd()
in the console to view your current working directory. - If you opened RStudio by double-clicking on a project (
.Rproj
) file, then the working directory is the folder where that file is located. - You can always use
setwd()
to manually change your working directory to any path, but it’s usually more convenient to work with R projects and their default working directory instead. - In RStudio, you can also check the current working directory by clicking on the
Files
panel.
- Type
- R packages. There are two steps to using a package in R:
- Install the package. You do this just once. Use
install.packages("package_name")
or the appropriate RStudio menu (Tools > Install Packages...
). Once you install a package, the package files are in your system R folder and R will be able to always find the package there. - Load the package in your current session. Use
library(package_name)
(no quotation marks around the package name). You do this in each R session in which you need the package, that is, every time you start R and you need the package.
- Install the package. You do this just once. Use
- An R package is just a collection of functions. You can only use an R function if that function is included in a package you loaded in the current session.
- Sometimes two different functions from two different packages have the same name. For example, both the
igraph
package and thesna
package have a function calleddegree
. If both packages are loaded, typing justdegree
might give you unexpected results, because R will pick one of the two functions (the one in the package that was loaded most recently), which might not be the function you meant.- To avoid this problem, you can use the
package::function()
notation:igraph::degree()
will always call thedegree
function from theigraph
package, whilesna::degree()
will call thedegree
function from thesna
package.
- To avoid this problem, you can use the
- Tip: To check the package that a function comes from, just go to that function’s manual page. The package will be indicated in the first line of the page. E.g., type
?degree
to see where thedegree
function comes from.- If no currently loaded package has a function called
degree
, then typing?degree
will cause a warning (No documentation for 'degree'
). - If multiple, currently loaded packages have a function called
degree
, then typing?degree
will bring up a page with the list of all those packages.
- If no currently loaded package has a function called
- This workshop will use a number of packages, listed here.
2.2.1 Console vs scripts
- When you open RStudio, you typically see two separate windows: the script editor and the console. You can write R code in either of them.
- Console. Here you write R code line by line. Once you type a line, you press
ENTER
to execute it. By pressingARROW UP
you go back to the last line you ran. By continuing to pressARROW UP
, you can navigate through all the lines of code you previously executed. This is called the “commands history” (all the lines of code executed in the current session). You will lose all this code (all the history) when you quit R, unless you explicitly save the history to a file (which is not what you typically do, you should just write the code in a script).
- Script editor. Here you write a script. This is the most common way of working with R. A script is simply a plain text file where all your R code is saved. If your work is in a script, it is reproducible.
- Both the R standard GUI and RStudio have a script editor with several helpful tools. Among other things, these allow you to run a script while you write it. By pressing
CTRL+ENTER
(Windows) orCMD+ENTER
(Mac), you run the script line your cursor is on (or the selected script region).- Note that with RStudio you can run the single script line where your cursor is; a whole highlighted region of code; the region of code from the beginning of the script up to the line where your cursor is; the region of code from the line where your cursor is up to the end of the script. See the Code menu and its keyboard shortcuts.
- The script editor also allows you to save your script. In RStudio, see
File > Save
and its keyboard shortcut. R script files commonly have a.R
extension (e.g. “myscript.R
”). But note that a script file is just a text file (like any.txt
file), which you can open and edit in any text editor, or in Microsoft Word and the likes. - You can also run a whole script altogether — this is called sourcing a script. By running
source("myscript.R")
, you source the script filemyscript.R
(assuming the file is in your working directory, otherwise you’ll have to enter the whole file path). In RStudio: see Code > Source and its keyboard shortcut. - In both the console and the script editor, any line that starts by
#
is called a comment. R disregards comments — it just prints them as they are in the console (does not parse and execute them as programming code). Remember to always use comments to document what your code is doing (this is good for yourself and for others). - In RStudio you can navigate the script headings in your script with a drop-down menu in the bottom-left of the script editor. Any line that starts by
#
and ends by####
,----
, or====
is read as a heading by RStudio.
2.2.2 Getting help
- Getting help is one of the most common things you do when using R. As a beginner, you’ll constantly need to get help (for example, read manual pages) about R functions. Also as an experienced user, you’ll often need to go back to the manual pages of particular functions or other R help resources. At any experience level, using R involves constantly using its documentation and help resources.
- The following are a few help tools in R:
help(...)
or?...
are the most common ways of getting help: they send you to the R manual page for a specific function. E.g.help(sum)
or?sum
(they are equivalent).help.start()
(or RStudio:Help > R Help
) gives you general help pages in html (introduction to R, references to all functions in all installed packages, etc.).demo()
gives you demos on specific topics. Rundemo()
to see all available topics.example()
gives you example code on specific functions, e.g.example(sum)
for the functionsum
.
help.search(...)
or??...
search for a specific string in the manual pages, e.g.??histogram
.
- In addition to built-in help facilities within R, there are plenty of ways to get R help online. Certain popular R packages have their own website, e.g. ggplot2, igraph, and statnet. Other websites for general R help include rdocumentation.org and stackoverflow.com. See the workshop slides or talk to me for more information.
# What's the current working directory?
# getwd()
# Un-comment to check your actual working directory.
# Change the working directory.
# setwd("/my/working/directory")
# (Delete the leading "#" and type in your actual working directory's path
# instead of "/my/working/directory")
# You should use R projects (.Rproj) to point to a working directory instead of
# manually changing it.
# Suppose that we want to use the package "igraph" in the following code.
library(igraph)
# Note that we can only load a package if we have it installed. In this case, I
# have igraph already installed. Had this not been the case, I would have have
# needed to install it:
# install.packages("igraph").
# (Packages can also be installed through an RStudio menu item).
# Let's load another suite of packages we'll use in the rest of this script.
library(tidyverse)
# Note that whatever is typed after the "#" sign has no effect but to be printed
# as is in the console: it is a comment.
2.3 Objects in R
In R, everything that exists is an object. Everything that happens is a function call. - John Chambers
- R is an object-oriented programming language. Everything is contained in an object, including data, analysis tools and analysis results. Things such as datasets, commands (called “functions” in R), regression results, descriptive statistics, etc., are all objects.
- An object has a name and a value. You create an object by assigning a value to a name.
- You assign with
<-
or with=
. - R is case-sensitive: the object named
mydata
is different from the object namedMydata
.
- You assign with
- Whenever you run an operation or execute a function in R, you need to assign the result to an object if you want to save it and re-use it later. Assign it or lose it: anything that is not assigned to an object is just printed to the console and lost.
- Objects have a size (bytes, megabytes, etc.) and a type (technically, a class, a type and a mode — more on this later).
- A function is a particular type of object. Functions take other objects as arguments (input) and return more objects as a result (output). R functions are what other data analysis programs call “commands”. See Section 8.1.2 for more about functions.
2.3.1 The workspace
- During your R session, objects (data, results) are located in the computer’s main memory. They make up your workspace: the set of all the objects currently in memory. They will disappear when you quit R, unless you save them to files on disk.
- What’s in your current workspace?
- The function
ls()
shows you a full list of the objects currently in the workspace. - Alternatively, in RStudio open the Environment panel to get a clickable list of objects currently in the workspace (if you don’t see your Environment panel, check
Preferences... > Pane Layout
).
- The function
2.3.2 Saving and removing objects
- Two main functions to save R objects to files:
save()
(saves specific objects, its arguments);save.image()
(saves all the current workspace). - Unless you specify a different path, all files you save from R are put in your current working directory.
- The most common file extensions for files that store R objects are
.rda
and.RData
. - If you have a file with R objects, say
objects.rda
, you can load it in you current R session using theload()
function:load(file= "objects.rda")
. This assumesobjects.rda
is in your current working directory (otherwise you’ll have to specify the whole file path). - The function
rm()
removes specific objects from the workspace. You can use it to clear the workspace from all existing objects by typingrm(list=ls())
(remember thatls()
returns a character vector with the names of all the objects in the current workspace).
# Create the object a: assign the value 50 to the name "a"
a <- 50
# Display ("print") the object a.
a
## [1] 50
## [1] "Mark"
## [1] 10
## [1] 53
## [1] "a" "b" "obj" "result"
## [1] 53
## [1] 106
## Error in eval(expr, envir, enclos): object 'reSult' not found
## character(0)
2.3.3 Vector and matrix objects
- Vectors are the most basic objects you use in R. Vectors can be numeric (numerical data), logical (TRUE/FALSE data), or character (string data).
- The basic function to create a vector is
c
(concatenate). - Other useful functions to create vectors:
rep
andseq
.- Another function we’ll use to create vectors later in the workshop is
seq_along
.seq_along(x)
creates a vector consisting of a sequence of integers from 1 tolength(x)
in steps of 1. - Also keep in mind the
:
shortcut:c(1, 2, 3, 4)
is the same as1:4
.
- Another function we’ll use to create vectors later in the workshop is
- The length (number of elements) is a basic property of vectors:
length(x)
returns the length of vectorx
. - When we
print
vectors, the numbers in square brackets indicate the positions of vector elements. - To create a matrix:
matrix
. Its main arguments are the cell values (withinc()
), number of rows (nrow
) and number of columns (ncol
). Values are arranged in anrow
xncol
matrix by column. See?matrix
. - When we
print
matrices, the numbers in square brackets indicate the row and column numbers.
2.3.4 Data frames
- “Data frame” is R’s name for dataset. A dataset is a collection of cases (rows), and variables (columns) which are measured on those cases.
- When
print
ed in R, data frames look like matrices. However, unlike matrix columns, data frame columns can be of different types, e.g. a numeric variable and a character variable. - On the other hand, just like matrix columns, data frame columns (variables) must all have the same
length
(number of cases). You can’t put together variables (vectors) of different length in the same data frame. - Although data frames look like matrices, in R’s mind they are a specific kind of list (more about lists in Section 5.3). In fact, the
class
of a data frame isdata.frame
, but thetype
of a data frame islist
. The list elements for a data frame are its variables (columns). - Tibbles. The tidyverse packages, which we use in this workshop, rely on a more efficient form of data frame, called tibble.
- A tibble has class
tbl_df
anddata.frame
. This means that, to R, a tibble is also a data frame, and any function that works on data frames normally also works on tibbles. - A tibble has a number of advantages over a traditional data frame, some of which we’ll see in this workshop.
- One of the advantages is the clearer and more informative way in which tibbles are printed. When we print a tibble data frame we can immediately see its number of rows, number of columns, names of variables, and type of each variable (numeric, integer, character, etc.).
- To convert an existing data frame to tibble:
as_tibble
. To create a tibble from scratch (similar to thedata.frame
function in base R):tibble
.
- A tibble has class
- While data frames can be created manually in R (with the functions
data.frame
in base R andtibble
intidyverse
), data are most commonly imported into R from external sources, like a csv or txt file. - We’ll import data from csv files using the
read_csv()
function fromtidyverse
.read_csv()
reads csv files (values separated by “,” or “;”).read_delim()
reads files in which values are separated by any delimiter.- These functions have many arguments that make them very flexible and allow users to import basically any kind of table stored in a text file. Check out
?read_delim
. - In base R, the corresponding functions are
read.csv()
andread.table()
.
- Data can also be imported into R from most external file formats (SAS, SPSS, Stata, Excel, etc.) using the tidyverse packages
readxl
andhaven
, or theforeign
package in traditional R. - Note that you can click on a data frame’s name in RStudio’s Environment pane. That will open the data frame in a window, similar to SPSS’s data view.
## [1] 1 2 3 4
## [1] 1 2 3 4
## [1] 4
# Note that when we print vectors, numbers in square brackets indicate positions
# of the vector elements.
# Create a simple matrix.
adj <- matrix(c(0,1,0, 1,0,0, 1,1,0), nrow= 3, ncol=3)
# This is what our matrix looks like:
adj
## [,1] [,2] [,3]
## [1,] 0 1 1
## [2,] 1 0 1
## [3,] 0 0 0
# Notice the row and column numbers in square brackets.
# Normally we create data frames by importing data from external files, for
# example csv files.
ego.df <- read_csv("./Data/raw_data/ego_data.csv")
## Rows: 102 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): ego.sex, ego.edu, ego.empl.bin, ego.age.cat
## dbl (5): ego_ID, ego.age, ego.arr, ego.inc, empl
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 102 × 9
## ego_ID ego.sex ego.age ego.arr ego.edu ego.inc empl ego.empl.bin ego.age.cat
## <dbl> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <chr> <chr>
## 1 28 Male 61 2008 Second… 350 3 Yes 60+
## 2 29 Male 38 2000 Primary 900 4 Yes 36-40
## 3 33 Male 30 2010 Primary 200 3 Yes 26-30
## 4 35 Male 25 2009 Second… 1000 3 Yes 18-25
## 5 39 Male 29 2007 Primary 0 1 No 26-30
## 6 40 Male 56 2008 Second… 950 4 Yes 51-60
## 7 45 Male 52 1975 Primary 1600 3 Yes 51-60
## 8 46 Male 35 2002 Second… 1200 4 Yes 31-35
## 9 47 Male 22 2010 Second… 700 4 Yes 18-25
## 10 48 Male 51 2007 Primary 950 4 Yes 51-60
## # ℹ 92 more rows
2.4 Arithmetic, statistical, and relational operations
2.4.1 Arithmetic operations and recycling
- R can work as a normal calculator.
- Addition/subtraction:
7+3
- Multiplication:
7*3
- Negative:
-7
- Division:
7/3
- Integer division:
7%/%3
- Integer remainder:
7%%3
- Exponentiation:
7^3
- Addition/subtraction:
- Many operations involving vectors in R are performed element-wise, i.e., separately on each element of the vector (see examples below).
- Most operations on vectors use the recycling rule: if a vector is too short, its values are re-used the number of times needed to match the desired length (see examples below).
- Examples of vector operations and recycling:
[1 2 3 4] + [1 2 3 4] = [1+1 2+2 3+3 4+4]
(element-wise addition.)[1 2 3 4] + 1 = [1+1 2+1 3+1 4+1]
(1
is recycled 3 times to match thelength
of the first vector.)[1 2 3 4] + [1 2] = [1+1 2+2 3+1 4+2]
([1 2]
is recycled once.)[1 2 3 4] + [1 2 3] = [1+1 2+2 3+3 4+1]
([1 2 3]
is recycled one third of a time: R will warn that the length of longer vector is not a multiple of the length of shorter vector.)
2.4.2 Relational operations and logical vectors
- Relational operators:
>
,<
,<=
,>=
. Equal is==
(NOT=
). Not equal is!=
.- Note: equal is
==
, whereas=
has a different meaning.=
is used to assign function arguments (e.g.matrix(x, nrow = 3, ncol = 4)
), or to assign objects (x <- 2
is the same asx = 2
).
- Note: equal is
- Relational operations result in logical vectors: vectors of
TRUE
/FALSE
values. - Like arithmetic operations, relational ones are performed element-wise on vectors, and recycling applies.
- Logical operators:
&
for AND,|
for OR. - Negation (i.e. opposite) of a logical vector:
!
. - Is value x in vector y?
x %in% y
. - R can convert logical vectors to numeric (
as.numeric()
,as.integer()
). In this conversion,TRUE
becomes1
andFALSE
becomes0
. Conversely, if converted to logical (as.logical()
),1/0
areTRUE/FALSE
.- Therefore, if
x
is a logical vector,sum(x)
gives the count ofTRUE
s inx
(sum of1
s in the vector). mean(x)
gives the proportion of1
s inx
(mean of a binary vector: sum of1
s in the vector divided by number of elements in the vector).
- Therefore, if
2.4.3 Examples of arithmetic and statistical functions
- Arithmetic vector functions are performed element-wise, and return a vector. Examples:
- Exponential:
exp(x)
. - Logarithm:
log(x)
(base e) orlog10(x)
(base 10).
- Exponential:
- Statistical scalar functions are executed on the set of all vector elements taken together, and return a scalar. Examples:
mean(x)
andmedian(x)
.- Standard deviation and variance:
sd(x)
,var(x)
. - Minimum and maximum:
min(x)
,max(x)
. sum(x)
: sum of all elements inx
.
table(x)
is another basic statistical function (but it’s not scalar):table(x)
returns atable
object with the absolute frequencies of values inx
.
- Functions such as
sum()
,mean()
andtable()
are very useful when programming in R (for example, when writing your own functions). However, if you just need descriptive statistics for the data, there are more convenient tools you can use. Some of these are the following functions, which work well withtidyverse
(we’ll see them in Ch. 4):- The
skim
function (from theskimr
package) for descriptive statistics of continuous/quantitative variables. - The
tabyl
function (from thejanitor
package) for frequencies of categorical variables.
- The
2.4.4 Missing and infinite values
- Missing values in R are represented by
NA
(Not Available).- If your data has a different code for missing values (e.g., -99), you’ll have to recode that to NA for R to properly handle missing values in your data.
- Infinity may result from arithmetic operations:
Inf
and-Inf
(e.g.3/0
).NaN
also may result, meaning Not a Number (e.g.0/0
).- While
NA
s can appear in any type of object,Inf
,-Inf
andNaN
can only appear in numeric objects.
- While
is.na(x)
checks if each element ofx
is NA and returns TRUE if that’s the case, FALSE otherwise. It’s a vector function (its value has the samelength
asx
).
# Just a few arithmetic operations between vectors to demonstrate element-wise
# calculations and the recycling rule.
(v1 <- 1:4)
## [1] 1 2 3 4
## [1] 1 2 3 4
## [1] 2 4 6 8
## [1] 2 3 4 5
## [1] 1 2 3 4
## [1] 1 2
## [1] 2 4 4 6
## Warning in 1:4 + 1:3: longer object length is not a multiple of shorter object
## length
## [1] 2 4 6 5
# Relational operations.
# Let's take a single variable from the ego attribute data: ego's age (in years)
# for the first 10 respondents. Let's put the result in a separate vector. This
# code involves indexing, we'll explain it better below.
age <- ego.df$ego.age[1:10]
age
## [1] 61 38 30 25 29 56 52 35 22 51
# Note how the following comparisons are performed element-wise, and the value
# to which age is compared (30) is recycled.
# Is age equal to 30?
age==30
## [1] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# The resulting logical vector is TRUE for those elements (i.e., respondents)
# who meet the condition.
# Which respondent's age is greater than 40?
age > 40
## [1] TRUE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE TRUE
## [1] TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE FALSE
## [1] FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
# Notice the difference between OR (|) and AND (&).
# Is 30 in "age"? I.e., is one of the respondents of 30 years of age?
30 %in% age
## [1] TRUE
## [1] FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
## [1] TRUE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE TRUE
## [1] 1 0 0 0 0 1 1 0 0 1
# This allows us to use the sum() and the mean() functions to get the count and
# proportion of TRUE's in a logical vector.
# Count of TRUE's: Number of respondents (elements of "age") that are older than
# 40.
sum(age > 45)
## [1] 4
## [1] 6
# Proportion of TRUE's: What's the proportion of "age" elements (respondents)
# that are greater than 50?
mean(age > 50)
## [1] 0.4
# ***** EXERCISES
#
# (1) Obtain a logical vector indicating which elements of "age" are smaller than 30
# OR greater than 50. Then obtain a logical vector indicating which elements of
# "age" are smaller than 30 AND greater than 50. Why all elements are FALSE in
# the latter vector?
#
# (2) Using the : shortcut, create a vector that goes from 1 to 100 in steps of 1.
# Obtain a logical vector that is TRUE for the first 10 elements and the last 10
# elements of the vector.
#
# (3) Use the age vector with relational operators and sum/mean to answer these
# questions: How many respondents are younger than 50? What percentage of
# respondents is between 30 and 40 years of age, including 30 and 40? Is there
# any respondent who is younger than 20 OR older than 70 (use any())?
#
# (4) How many elements of the vector 1:100 are greater than the length of that
# vector divided by 2? Use sum().
#
# *****
2.5 Subsetting
- Subsetting is crucial in R. It means extracting one (or more) of an object’s elements or components, typically by appending an index (or subscript) to that object (this is also called “indexing” or “subscripting”).
- Subsetting can be used to extract (view, query) the component of an object, or to replace it (assign a different value to that element).
- The basic notation for subsetting in R is
[ ]
:x[i]
gives you the i-th element of objectx
. - Numeric subsetting uses integers in square brackets
[ ]
: e.g.x[3]
. Note that you can use negative integers to index (select) everything but that element: e.g.x[-3]
,x[c(-2,-4)]
. - Logical subsetting uses logical vectors in square brackets
[ ]
. It’s used to subset objects based on a condition, e.g., to index all values inx
that are greater than 3 (see example code below). - Name subsetting uses element names. Elements in a vector, and rows or columns in a matrix can have names.
- Names can be displayed and assigned using the
names
function in base R, orset_names
(just to assign names) in tidyverse.
- Names can be displayed and assigned using the
- When subsetting you must take into account the number of dimensions of an object. For example, vectors have one dimension, matrices have two. Arrays can be defined with three dimensions or more (e.g. three-way tables).
- Square brackets typically contain a slot for each dimension of the object, separated by a comma:
x[i]
indexes the i-th element of the one-dimensional objectx
;x[i,j]
indexes the i,j-th element of the two-dimensional objectx
(e.g.x
is a matrix, i refers to a row and j refers to a column);x[i,j,k]
indexes the i,j,k-th element of the three-dimensional objectx
, etc.
- Notice that a dimension’s slot may be empty, meaning that we index all elements in that dimension. So, if
x
is a matrix,x[3,]
will index the whole 3rd row of the matrix – i.e.[
row 3, all columns]
. - If
x
has more than one dimension (e.g. it’s a matrix), thenx[3]
(no comma, just one slot) is still valid, but it might give you unexpected results.
- Square brackets typically contain a slot for each dimension of the object, separated by a comma:
- Matrices have special functions that can be used for subsetting, e.g.
diagonal()
,upper.tri()
,lower.tri()
. These can be useful for manipulating adjacency matrices. - Particular subsetting rules may apply to particular
class
es of objects, for example lists and data frames (see next Section 2.5.1).
2.5.1 Subsetting data frames
- List notations. Data frames are a special class of lists (see Section 5.3 for more about lists). Just like any list, data frames can be subset in the following three ways:
[ ]
notation, e.g.df[3]
ordf["variable.name"]
. This returns another data frame that only includes the indexed element(s), e.g. only the 3rd element. Note:- This notation preserves the
data.frame
class: the result is still a data frame. - This notation can be used to index multiple elements of a data frame into a new data frame, e.g.
df[c(1,3,5)]
ordf[c("sex", "age")]
- This notation preserves the
[[ ]]
notation, e.g.df[[3]]
ordf[["variable.name"]]
. This returns the specific element (column) selected, not as a data frame but as a vector with its own type and class, e.g. the numeric vector within the 3rd element ofdf
. Note two differences from the[ ]
notation:[[ ]]
does not preserve thedata.frame
class. The result is not a data frame.- Consistently,
[[ ]]
can only be used to index a single element (column) of the data frame, not multiple elements.
- The
$
notation. If variable.name is the name of a specific variable (column) indf
, thendf$variable.name
indexes that variable. This is the same as the[[ ]]
notation:df$variable.name
is the same asdf[["variable.name"]]
, and it’s also the same asdf[[i]]
(wherei
is the position of the variable called variable.name in the data frame).
- Matrix notation. Data frames can also be subset like a matrix, with the
[ , ]
notation:df[2,3]
,df[2, ]
,df[ ,3]
.df[,"age"]
,df[,c("sex", "age")]
,df[5,"age"]
- Keep in mind the difference between the following:
- Extracting a data frame’s variable (column) in itself, as a vector (numeric, character, etc.) — The single pepper packet by itself in the figure below (panel C). This is given by
df[[i]]
,df[["variable.name"]]
,df$variable.name
. - Extracting another data frame of just one variable (column) – The single pepper packet within the pepper shaker in the figure below (panel B). This is given by
df[i]
,df["variable.name"]
.
- Extracting a data frame’s variable (column) in itself, as a vector (numeric, character, etc.) — The single pepper packet by itself in the figure below (panel C). This is given by
- Subsetting verbs in tidyverse. In addition to subsetting data frames via the base indexing syntax described above, we can also use the subsetting functions introduced by the
dplyr
package in tidyverse (see below).
# Numeric subsetting ----
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# Let's use our vector of ego age values again.
age
## [1] 61 38 30 25 29 56 52 35 22 51
## [1] 38
## [1] 38 25 29
## [1] 29 56 52
## [1] 61 45 30 25 29 56 52 35 22 51
## [,1] [,2] [,3]
## [1,] 0 1 1
## [2,] 1 0 1
## [3,] 0 0 0
## [1] 1
## [1] 1 0 0
## [,1] [,2] [,3]
## [1,] 1 0 1
## [2,] 0 0 0
# Logical subsetting ----
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# Which values of "age" are between 40 and 60?
# Let's create a logical index that flags these values.
(ind <- age > 40 & age < 60)
## [1] FALSE TRUE FALSE FALSE FALSE TRUE TRUE FALSE FALSE TRUE
## [1] 45 56 52 51
## [1] 45 56 52 51
# Subsetting data frames ----
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# We'll use our ego-level data frame.
ego.df
## # A tibble: 102 × 9
## ego_ID ego.sex ego.age ego.arr ego.edu ego.inc empl ego.empl.bin ego.age.cat
## <dbl> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <chr> <chr>
## 1 28 Male 61 2008 Second… 350 3 Yes 60+
## 2 29 Male 38 2000 Primary 900 4 Yes 36-40
## 3 33 Male 30 2010 Primary 200 3 Yes 26-30
## 4 35 Male 25 2009 Second… 1000 3 Yes 18-25
## 5 39 Male 29 2007 Primary 0 1 No 26-30
## 6 40 Male 56 2008 Second… 950 4 Yes 51-60
## 7 45 Male 52 1975 Primary 1600 3 Yes 51-60
## 8 46 Male 35 2002 Second… 1200 4 Yes 31-35
## 9 47 Male 22 2010 Second… 700 4 Yes 18-25
## 10 48 Male 51 2007 Primary 950 4 Yes 51-60
## # ℹ 92 more rows
# Numeric subsetting works on data frames too: it allows you to index variables.
# The 3rd variable.
ego.df[3]
## # A tibble: 102 × 1
## ego.age
## <dbl>
## 1 61
## 2 38
## 3 30
## 4 25
## 5 29
## 6 56
## 7 52
## 8 35
## 9 22
## 10 51
## # ℹ 92 more rows
## [1] 61 38 30 25 29 56 52 35 22 51 50 45 51 32 57 42 32 55 44 25 28 60 54 30 37
## [26] 41 47 33 37 29 39 31 52 30 57 44 47 42 44 31 31 56 36 47 47 35 43 22 55 32
## [51] NA 60 51 53 25 33 38 37 27 34 32 26 36 53 42 45 36 54 49 35 27 32 58 51 55
## [76] 59 27 35 24 52 27 42 47 43 53 54 35 50 51 40 49 33 33 36 32 35 55 61 34 44
## [101] 50 28
## [1] "tbl_df" "tbl" "data.frame"
## [1] "numeric"
# The [[ ]] notation extracts the actual column as a vector, while [ ] keeps
# the data frame class.
# We can also subset data frames as matrices.
# The second and third columns.
ego.df[,2:3]
## # A tibble: 102 × 2
## ego.sex ego.age
## <chr> <dbl>
## 1 Male 61
## 2 Male 38
## 3 Male 30
## 4 Male 25
## 5 Male 29
## 6 Male 56
## 7 Male 52
## 8 Male 35
## 9 Male 22
## 10 Male 51
## # ℹ 92 more rows
## # A tibble: 3 × 9
## ego_ID ego.sex ego.age ego.arr ego.edu ego.inc empl ego.empl.bin ego.age.cat
## <dbl> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <chr> <chr>
## 1 28 Male 61 2008 Seconda… 350 3 Yes 60+
## 2 29 Male 38 2000 Primary 900 4 Yes 36-40
## 3 33 Male 30 2010 Primary 200 3 Yes 26-30
## # A tibble: 102 × 1
## ego.age
## <dbl>
## 1 61
## 2 38
## 3 30
## 4 25
## 5 29
## 6 56
## 7 52
## 8 35
## 9 22
## 10 51
## # ℹ 92 more rows
## [1] 61 38 30 25 29 56 52 35 22 51 50 45 51 32 57 42 32 55 44 25 28 60 54 30 37
## [26] 41 47 33 37 29 39 31 52 30 57 44 47 42 44 31 31 56 36 47 47 35 43 22 55 32
## [51] NA 60 51 53 25 33 38 37 27 34 32 26 36 53 42 45 36 54 49 35 27 32 58 51 55
## [76] 59 27 35 24 52 27 42 47 43 53 54 35 50 51 40 49 33 33 36 32 35 55 61 34 44
## [101] 50 28
## [1] 61 38 30 25 29 56 52 35 22 51 50 45 51 32 57 42 32 55 44 25 28 60 54 30 37
## [26] 41 47 33 37 29 39 31 52 30 57 44 47 42 44 31 31 56 36 47 47 35 43 22 55 32
## [51] NA 60 51 53 25 33 38 37 27 34 32 26 36 53 42 45 36 54 49 35 27 32 58 51 55
## [76] 59 27 35 24 52 27 42 47 43 53 54 35 50 51 40 49 33 33 36 32 35 55 61 34 44
## [101] 50 28
## [1] TRUE
# With tidyverse, this type of subsetting syntax is replaced by new "verbs"
# (see section below):
# * Index data frame rows: filter() instead of []
# * Index data frame columns: select() instead of []
# * Extract data frame variable as a vector: pull() instead of [[]] or $
# ***** EXERCISES
#
# (1) What's the mean age of respondents whose education is NOT "Primary"? Use
# the ego.age and ego.edu variables. Remember that the "different from"
# comparison operator is !=.
#
# (2) What is the modal education level (i.e., the most common education level)
# of respondents who are older than 40? Use table() or tabyl().
#
# (3) Create the fictitious variable var <- c(1:30, rep(NA, 3), 34:50). Use is.na()
# to index all the NA values in the variable. Then use is.na() to index all
# values that are *not* NA. Hint: Remember the operator used to negate a logical
# vector. Finally, use this indexing to remove all NA values from var.
#
# (4) Use the $ notation to extract the "ego.arr" variable in ego.df. Recode all
# values equal to 2008 as 99. Hint: Index all values equal to 2008, then replace
# them with 99 via the assignment operator.
#
# *****
2.6 The tidyverse
syntax
2.6.1 Pipes and the |>
operator
- The original pipe operator,
%>%
, was introduced by themagrittr
package in 2014. It quickly gained popularity in the R community and was adopted byigraph
andtidyverse
(among other packages), which we use in this workshop. In 2021, R incorporated the pipe idea with a new, similar (but not identical) operator:|>
. See this page for an overview of the differences between|>
and%>%
. - The idea behind pipes is in essence very simple:
f(g(x))
becomesx |> g() |> f()
.- For example:
mean(table(x))
becomesx |> table() |> mean()
.
- So
|>
pipes the output of the previous function (e.g.,table()
) into the input of the following function (e.g.,mean()
). This turns inside-to-outside code into left-to-right code. Because left to right is the direction most of us are used to read in (at least in English and other Western languages), pipes make R code easier to read and follow. - You may also see pipes concatenating multiple lines of code. That’s possible and a common coding style. Instead of
x |> table() |> mean()
you can write
x |>
table() |>
mean()
2.6.2 Subsetting data frames in tidyverse
- Tidyverse includes the
dplyr
package for data frame manipulation. This is a very powerful package for all kinds of data wrangling. To learn more, see the package cheatsheet and vignettes. - Subsetting data frames with
dplyr
:dplyr::filter()
is used to subset rows (cases) of a data frame based on one or multiple conditions (for example, respondents with certain values on one or more variables). This preserves the data frame class, similar to[ ]
indexing.dplyr::select()
is used to subset columns (variables) of a data frame. You can use full variable names or select variables in many other ways (see examples in theselect
manual page). This preserves the data frame class, similar to[ ]
indexing.dplyr::pull()
is used to extract a column as a vector. This does not preserve the data frame class, similar to[[ ]]
or$
.
# The pipe operator ----
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# Ego education variable
(edu <- ego.df$ego.edu)
## [1] "Secondary" "Primary" "Primary" "Secondary" "Primary"
## [6] "Secondary" "Primary" "Secondary" "Secondary" "Primary"
## [11] "Secondary" "Secondary" "University" "Primary" "Secondary"
## [16] "Primary" "Primary" "Secondary" "Secondary" "Primary"
## [21] "University" "University" "Primary" "Secondary" "Primary"
## [26] "Primary" "Secondary" "Primary" "Primary" "Primary"
## [31] "Secondary" "Secondary" "Secondary" "Primary" "Primary"
## [36] "Secondary" "Secondary" "Primary" "Secondary" "Primary"
## [41] "Primary" "Primary" "Primary" "Secondary" "Primary"
## [46] "Primary" "Primary" "University" "Primary" "Secondary"
## [51] "University" "Primary" "University" "Secondary" "Secondary"
## [56] "University" "Primary" "Primary" "Secondary" "Secondary"
## [61] "Secondary" "University" "University" "University" "Secondary"
## [66] "Secondary" "Secondary" "University" "Primary" "Secondary"
## [71] "Secondary" "Secondary" "University" "Primary" "Secondary"
## [76] "Secondary" "Secondary" "Secondary" "Secondary" "Secondary"
## [81] "Secondary" "Primary" "Primary" "University" "Primary"
## [86] "Primary" "Secondary" "Secondary" "Secondary" "University"
## [91] "Secondary" "Secondary" "Primary" "Primary" "University"
## [96] "Primary" "Primary" "Primary" "Primary" "Secondary"
## [101] "Primary" "Secondary"
## edu
## Primary Secondary University
## 42 45 15
## [1] 34
## [1] 34
# Subsetting data frames with dplyr: filter and select ----
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# Filter to egos who are older than 40
ego.df |>
filter(ego.age > 40)
## # A tibble: 51 × 9
## ego_ID ego.sex ego.age ego.arr ego.edu ego.inc empl ego.empl.bin ego.age.cat
## <dbl> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <chr> <chr>
## 1 28 Male 61 2008 Second… 350 3 Yes 60+
## 2 40 Male 56 2008 Second… 950 4 Yes 51-60
## 3 45 Male 52 1975 Primary 1600 3 Yes 51-60
## 4 48 Male 51 2007 Primary 950 4 Yes 51-60
## 5 49 Male 50 2001 Second… 1300 4 Yes 41-50
## 6 51 Male 45 2011 Second… 480 3 Yes 41-50
## 7 52 Male 51 2002 Univer… 1200 4 Yes 51-60
## 8 55 Male 57 1979 Second… 470 7 No 51-60
## 9 56 Male 42 1992 Primary 1100 4 Yes 41-50
## 10 58 Male 55 1990 Second… 1450 4 Yes 51-60
## # ℹ 41 more rows
## # A tibble: 42 × 9
## ego_ID ego.sex ego.age ego.arr ego.edu ego.inc empl ego.empl.bin ego.age.cat
## <dbl> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <chr> <chr>
## 1 29 Male 38 2000 Primary 900 4 Yes 36-40
## 2 33 Male 30 2010 Primary 200 3 Yes 26-30
## 3 39 Male 29 2007 Primary 0 1 No 26-30
## 4 45 Male 52 1975 Primary 1600 3 Yes 51-60
## 5 48 Male 51 2007 Primary 950 4 Yes 51-60
## 6 53 Male 32 2003 Primary 1600 1 No 31-35
## 7 56 Male 42 1992 Primary 1100 4 Yes 41-50
## 8 57 Male 32 2000 Primary 1200 4 Yes 31-35
## 9 60 Male 25 2011 Primary 0 1 No 18-25
## 10 64 Male 54 1981 Primary 300 3 Yes 51-60
## # ℹ 32 more rows
## # A tibble: 21 × 9
## ego_ID ego.sex ego.age ego.arr ego.edu ego.inc empl ego.empl.bin ego.age.cat
## <dbl> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <chr> <chr>
## 1 45 Male 52 1975 Primary 1600 3 Yes 51-60
## 2 48 Male 51 2007 Primary 950 4 Yes 51-60
## 3 56 Male 42 1992 Primary 1100 4 Yes 41-50
## 4 64 Male 54 1981 Primary 300 3 Yes 51-60
## 5 68 Male 41 2008 Primary 600 4 Yes 41-50
## 6 82 Male 57 1982 Primary 1190 4 Yes 51-60
## 7 85 Male 42 2008 Primary 400 2 No 41-50
## 8 90 Male 56 2004 Primary 450 4 Yes 51-60
## 9 93 Male 47 2010 Primary 880 4 Yes 41-50
## 10 95 Male 43 1997 Primary 800 4 Yes 41-50
## # ℹ 11 more rows
## # A tibble: 72 × 9
## ego_ID ego.sex ego.age ego.arr ego.edu ego.inc empl ego.empl.bin ego.age.cat
## <dbl> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <chr> <chr>
## 1 28 Male 61 2008 Second… 350 3 Yes 60+
## 2 29 Male 38 2000 Primary 900 4 Yes 36-40
## 3 33 Male 30 2010 Primary 200 3 Yes 26-30
## 4 39 Male 29 2007 Primary 0 1 No 26-30
## 5 40 Male 56 2008 Second… 950 4 Yes 51-60
## 6 45 Male 52 1975 Primary 1600 3 Yes 51-60
## 7 48 Male 51 2007 Primary 950 4 Yes 51-60
## 8 49 Male 50 2001 Second… 1300 4 Yes 41-50
## 9 51 Male 45 2011 Second… 480 3 Yes 41-50
## 10 52 Male 51 2002 Univer… 1200 4 Yes 51-60
## # ℹ 62 more rows
# Note that the object ego.df hasn't changed. To re-use any of the filtered data
# frames above, we have to assign them to an object.
ego.df.40 <- ego.df |>
filter(ego.age > 40)
ego.df.40
## # A tibble: 51 × 9
## ego_ID ego.sex ego.age ego.arr ego.edu ego.inc empl ego.empl.bin ego.age.cat
## <dbl> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <chr> <chr>
## 1 28 Male 61 2008 Second… 350 3 Yes 60+
## 2 40 Male 56 2008 Second… 950 4 Yes 51-60
## 3 45 Male 52 1975 Primary 1600 3 Yes 51-60
## 4 48 Male 51 2007 Primary 950 4 Yes 51-60
## 5 49 Male 50 2001 Second… 1300 4 Yes 41-50
## 6 51 Male 45 2011 Second… 480 3 Yes 41-50
## 7 52 Male 51 2002 Univer… 1200 4 Yes 51-60
## 8 55 Male 57 1979 Second… 470 7 No 51-60
## 9 56 Male 42 1992 Primary 1100 4 Yes 41-50
## 10 58 Male 55 1990 Second… 1450 4 Yes 51-60
## # ℹ 41 more rows
## # A tibble: 51 × 2
## ego.sex ego.age
## <chr> <dbl>
## 1 Male 61
## 2 Male 56
## 3 Male 52
## 4 Male 51
## 5 Male 50
## 6 Male 45
## 7 Male 51
## 8 Male 57
## 9 Male 42
## 10 Male 55
## # ℹ 41 more rows
# As usual, we can re-assign the result to the same data object
ego.df.40 <- ego.df.40 |>
dplyr::select(ego.sex, ego.age)
ego.df.40
## # A tibble: 51 × 2
## ego.sex ego.age
## <chr> <dbl>
## 1 Male 61
## 2 Male 56
## 3 Male 52
## 4 Male 51
## 5 Male 50
## 6 Male 45
## 7 Male 51
## 8 Male 57
## 9 Male 42
## 10 Male 55
## # ℹ 41 more rows
## [1] 61 38 30 25 29 56 52 35 22 51 50 45 51 32 57 42 32 55 44 25 28 60 54 30 37
## [26] 41 47 33 37 29 39 31 52 30 57 44 47 42 44 31 31 56 36 47 47 35 43 22 55 32
## [51] NA 60 51 53 25 33 38 37 27 34 32 26 36 53 42 45 36 54 49 35 27 32 58 51 55
## [76] 59 27 35 24 52 27 42 47 43 53 54 35 50 51 40 49 33 33 36 32 35 55 61 34 44
## [101] 50 28
## [1] 61 38 30 25 29 56 52 35 22 51 50 45 51 32 57 42 32 55 44 25 28 60 54 30 37
## [26] 41 47 33 37 29 39 31 52 30 57 44 47 42 44 31 31 56 36 47 47 35 43 22 55 32
## [51] NA 60 51 53 25 33 38 37 27 34 32 26 36 53 42 45 36 54 49 35 27 32 58 51 55
## [76] 59 27 35 24 52 27 42 47 43 53 54 35 50 51 40 49 33 33 36 32 35 55 61 34 44
## [101] 50 28