In this course, we’ll be using R, a powerful programming language designed specifically for statistical computing and data visualization. While R is not suited for building websites or games, it excels at handling data and performing statistical operations. We will mostly focus on applying R to data analysis. This chapter doesn’t cover all the basics of R and programming, but it provides a foundation for the rest of the course.
You may want to read the chapter Workflow: basics in R4DS as it also covers some basic R.
Before getting started, you’ll need to set up R and RStudio, a popular integrated development environment (IDE) for R. You can download R from https://www.r-project.org/about.html and RStudio from https://posit.co/products/open-source/rstudio/. Once you have these installed, you’re ready to go!
In this chapter we cover how to perform basic calculations and load libraries (packages) into R; what objects (variables), functions, vectors, matrices, and data frames are; and how logical operators and conditionals work.
Let’s start by treating R as a giant calculator. You can perform basic arithmetic operations like addition, subtraction, multiplication, and division:
# Addition
4 + 2
[1] 6
# Subtraction
4 - 2
[1] 2
# Multiplication
4 * 2
[1] 8
# Division
4 / 2
[1] 2
Here we used # to comment our code. Any text following a # on a line will not be executed by the program. Instead, comments are used to explain the code for others (and for future you!), which comes in handy as our code gets more complicated.
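As a small sketch of combining the operators above in one expression: R follows the usual order of operations, so multiplication is carried out before addition unless parentheses say otherwise. The numbers are arbitrary and chosen only for illustration.

```r
# Multiplication happens before addition: 4 + (2 * 3)
4 + 2 * 3      # gives 10

# Parentheses change the order: (4 + 2) * 3
(4 + 2) * 3    # gives 18
```

When in doubt about the order, adding parentheses makes the intent explicit for both R and the reader.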
Instead of writing our own code for everything, we use packages (or libraries). Packages contain pre-written code and functions. One essential package we’ll use is the tidyverse package, a collection of packages designed for data science. It simplifies data loading, transformation, and visualization, making our data analysis tasks much more manageable.
To install and load tidyverse, use:
install.packages("tidyverse")
library(tidyverse)
If you need to install and/or load another package, just replace tidyverse above with the package name.
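A package only needs to be installed once, while library() has to be run in every new R session. The sketch below wraps that pattern in a helper; load_or_install is a hypothetical name made up for this example, not a function from R or tidyverse, and it uses functions and conditionals, which are introduced later in this chapter.

```r
# Hypothetical helper (an assumption for illustration, not part of R itself):
# install a package only if it is missing, then load it.
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  library(pkg, character.only = TRUE)
}

# "stats" ships with R, so this call loads it without installing anything
load_or_install("stats")
```

Calling load_or_install("tidyverse") would work the same way, but it would download the package the first time it is run.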
You will be learning much more about tidyverse and how to use it when you start working in R4DS, written by the creator of tidyverse, Hadley Wickham, among others. There you will also learn all you need about visualizing your data.
Exercise: Install and load the tidyverse package using the code above. You will need it for later exercises in this chapter.
While R can be used for simple calculations, we are often interested in performing more complicated tasks. To achieve this we use objects, also called variables, to store values, such as the results of our computations. Think of objects as labels for your data. You can store values in objects using the assignment operator, <-, like this:
x <- 4 * 2
# Printing the value of x
x
[1] 8
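Once a value is stored in an object, it can be reused in later calculations. A minimal sketch, where the object names price, quantity, and total are made up for this example:

```r
# Store two values in objects
price <- 4
quantity <- 2

# Use the objects in a calculation and store the result in a new object
total <- price * quantity
total

# Assigning to an existing name replaces the old value
total <- total + 1
total
```

The first time total is printed it holds 8, and after the second assignment it holds 9.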
Exercise: If you use = instead of <-, what happens?
Functions are like machines that take inputs and produce outputs. You can think of a function like a vending machine: you input coins and a number for a product, and your desired product comes out. The function transforms your input into output. Inputs are also called arguments or parameters, i.e. the coins and the number are the arguments of the vending machine. Below we create a simple function called multiply4 that takes \(x\) as its argument, multiplies it by 4, and then returns the result.
# Define function
multiply4 <- function(x) {
  x * 4
}
# Use the function two times with different inputs
multiply4(1)
[1] 4
multiply4(10)
[1] 40
You don’t need to create your own functions in this course, but you’ll use plenty of them. Of course, the above function is pretty useless: we can just run 1*4 or 10*4 directly in R instead of over-complicating things with a function. Still, it serves as a good example of how functions work.
We can also save the result of our function in our own variable. Below we take the result of our function, save it in the two_times_four variable, and print the result plus 2:
two_times_four <- multiply4(2)
print(two_times_four + 2)
[1] 10
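Most of the functions used in this course are already built into R or come from packages. As a sketch of calling such functions, the example below uses the built-in round() and sqrt(); the object names rounded and nested are made up for illustration.

```r
# round() is a built-in function; digits is one of its arguments
rounded <- round(3.14159, digits = 2)
print(rounded)

# Function calls can be nested: the result of sqrt(16) is used
# in a calculation whose result becomes the input to round()
nested <- round(sqrt(16) / 3, digits = 1)
print(nested)
```

Naming an argument, as in digits = 2, makes it clear which input is which.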
Exercise: Below is a variant of the multiply4 function above that takes two arguments and multiplies them together. Copy the code and use the function to compute \(36 \cdot 52\).
multiply <- function(x, y) {
  x * y
}
Exercise: Create a function called subtract that subtracts two numbers instead of multiplying them, and then use the function to compute \(36 - 52\).
Exercise: Use the subtract function by calling subtract(y = 52, x = 36). Is the result different?
Exercise: Now call subtract(y <- 52, x <- 36). What happens?
Sometimes we want to work with multiple items of the same type together. To achieve this in R, we can use vectors. A vector is created using the c function (for combine). The arguments to c are the items we want in the vector. For example, if we want to store a sequence of different numbers, we can use:
vec <- c(4, 2, 453)
print(vec)
[1]   4   2 453
We can also store objects in vectors. Here we store three text variables (usually called strings in a programming context) in the same vector:
x <- "string 1"
y <- "string 2"
z <- "string 3"
vec <- c(x, y, z)
print(vec)
[1] "string 1" "string 2" "string 3"
If we want to access a specific element in a vector, we can use square brackets []. The first element in a vector is accessed by vec[1], the second by vec[2], and so on. Below we access each of the three elements of the vector vec in turn:
print(vec[1])
[1] "string 1"
print(vec[2])
[1] "string 2"
print(vec[3])
[1] "string 3"
In other programming languages, e.g. Python, vector indices commonly start from 0. Watch out for this when moving between languages, as it is a common source of mistakes! Starting indices from 0 mirrors how a computer addresses memory (the index is an offset from the first element), while starting from 1 is common in mathematics.
In this course, we’ll often use vectors implicitly, especially when dealing with data frames.
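Square brackets also accept a vector of positions, and many built-in functions take a whole vector as input. A minimal sketch, where temps holds made-up temperature readings:

```r
temps <- c(12, 15, 9, 20)   # hypothetical temperature readings

length(temps)          # number of elements: 4
temps[c(1, 3)]         # first and third elements: 12 and 9
temps[length(temps)]   # last element: 20

# Many functions take a whole vector as input
mean(temps)            # average of the four values: 14
```

Using length(temps) as the index is a common way to reach the last element without knowing the size of the vector in advance.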
Exercise: Run the code below, which tries to add vec1 <- c(1, 2) and vec2 <- c(3, 4), and then vec2 and vec3. What happens?
vec1 <- c(1, 2)
vec2 <- c(3, 4)
vec3 <- c(3, "four")
vec1 + vec2
vec2 + vec3
A vector stores objects in one dimension. Sometimes we encounter problems where we need to store data in two dimensions, for example when we are dealing with correlations between different variables. In this case, we can use matrices. A matrix is a two-dimensional array that stores data in rows and columns. We can create a matrix using the matrix function. The first argument to the function is the data we want to store in the matrix, the second argument is the number of rows, and the third argument is the number of columns. We can also specify whether we want to fill the matrix by row or by column. Below we create a 3x3 matrix filled by row:
# initialize matrix with ones on the diagonal
m <- matrix(
  c(1, 2, 3,
    2, 1, 2,
    4, 2, 1),
  nrow = 3,
  byrow = TRUE)
We can access a specific element in a matrix using square brackets []. The first argument is the row index, and the second argument is the column index. We can also access a whole row or column by leaving the other index blank. Below are some examples.
# Select first element of matrix
m[1, 1]
[1] 1
# Select first row of matrix
m[1, ]
[1] 1 2 3
# Select first column of matrix
m[, 1]
[1] 1 2 4
Later in this chapter we will see that we can also use conditional statements to select elements in a matrix; see Section 1.10.2.
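To see what filling by row versus by column means in practice, the sketch below builds two matrices from the same six values. The names values, by_row, and by_col are made up for this example; byrow = FALSE is the default in matrix().

```r
values <- 1:6

by_row <- matrix(values, nrow = 2, ncol = 3, byrow = TRUE)
by_col <- matrix(values, nrow = 2, ncol = 3)  # byrow = FALSE is the default

by_row[1, ]   # first row when filled by row: 1 2 3
by_col[1, ]   # first row when filled by column: 1 3 5
dim(by_row)   # number of rows and columns: 2 3
```

Both matrices contain the same values; only their arrangement differs.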
Tables are essential in data analysis. A table stores data in a row/column format:
Name | Age | Favorite Color |
---|---|---|
Alice | 25 | Blue |
Bob | 30 | Red |
Charlie | 22 | Green |
where Name, Age, and Favorite Color are the columns and Alice, 25, Blue is the first row. We can create this table in R using the data.frame function:
df <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 30, 22),
  Favorite_Color = c("Blue", "Red", "Green")
)
print(df)
     Name Age Favorite_Color
1   Alice  25           Blue
2     Bob  30            Red
3 Charlie  22          Green
Alternatively, we can use the tibble function. Tibbles are similar to data frames, but with some extended functionality for printing and other things which we will not cover in detail. To create a tibble, we first need to load the tidyverse package:
library(tidyverse)
tib <- tibble(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 30, 22),
  Favorite_Color = c("Blue", "Red", "Green")
)
If you want to access one of the columns in a data frame or tibble, you can use the dollar operator, $, after the data frame or tibble’s name. Below we access the Name column and print out the values in it.
tib$Name
[1] "Alice"   "Bob"     "Charlie"
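A column extracted with $ is an ordinary vector, so it can be passed to functions or indexed with square brackets. The sketch below uses a small base R data frame called people (a made-up name) so that it runs without loading any package; Age_next_year is likewise a hypothetical column name.

```r
people <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 30, 22)
)

# A column extracted with $ behaves like any other vector
mean(people$Age)     # average age
people$Name[2]       # second name: "Bob"

# Assigning to a new name with $ adds a column
people$Age_next_year <- people$Age + 1
people
```

The same $ syntax should work on a tibble such as tib.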
There are a number of useful functions that you can use with data frames and tibbles. The table below lists some of them, with a description of what each does.
Feature | Description |
---|---|
glimpse() | Display a concise overview of the data frame’s structure. |
summary() | Provide summary statistics for numeric objects in the data frame. |
head() | Display the top rows of the data frame. |
tail() | Display the bottom rows of the data frame. |
nrow() | Show the number of rows in the data frame. |
ncol() | Show the number of columns in the data frame. |
colnames() or names() | Display column names (as character objects). |
Here is an example of how to use them and their output for the tibble tib that we defined before.
glimpse(tib)
Rows: 3
Columns: 3
$ Name <chr> "Alice", "Bob", "Charlie"
$ Age <dbl> 25, 30, 22
$ Favorite_Color <chr> "Blue", "Red", "Green"
summary(tib)
Name Age Favorite_Color
Length:3 Min. :22.00 Length:3
Class :character 1st Qu.:23.50 Class :character
Mode :character Median :25.00 Mode :character
Mean :25.67
3rd Qu.:27.50
Max. :30.00
head(tib)
# A tibble: 3 × 3
Name Age Favorite_Color
<chr> <dbl> <chr>
1 Alice 25 Blue
2 Bob 30 Red
3 Charlie 22 Green
tail(tib)
# A tibble: 3 × 3
Name Age Favorite_Color
<chr> <dbl> <chr>
1 Alice 25 Blue
2 Bob 30 Red
3 Charlie 22 Green
nrow(tib)
[1] 3
ncol(tib)
[1] 3
names(tib)
[1] "Name" "Age" "Favorite_Color"
As data frames and tibbles are very similar, we will be using the words data frame and tibble interchangeably.
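Because tib has only three rows, head() and tail() show the same output above. By default they display six rows, and the n argument changes that. A sketch with a slightly larger data frame, where scores and its values are made up for illustration:

```r
scores <- data.frame(
  id = 1:10,
  score = c(55, 72, 90, 64, 81, 77, 68, 95, 59, 84)  # made-up values
)

head(scores, n = 3)   # first three rows
tail(scores, n = 2)   # last two rows
nrow(scores)          # 10
ncol(scores)          # 2
```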
Exercise: Install and load the palmerpenguins package using the code below and look at the output from calling the data penguins. Try running ?penguins in the console to learn more about the dataset. We will be using the penguins data throughout the course, so have a look at it, but there will be more time to study it in detail later.
install.packages("palmerpenguins")
library(palmerpenguins)
penguins
Exercise: Access the flipper_length_mm column of the penguins dataset using the dollar operator, $.
Exercise: Type the name of the data frame (penguins) and $ without any space and then press tab. What do you see?
Exercise: Load the tidyverse package and apply all the above functions on penguins; you can copy the code and replace tib with penguins. Again, don’t spend too much time, but try to understand the functions and what they do.
Logicals and conditionals are used to make decisions in our code. As their names suggest, logicals let us define logical expressions that can be either TRUE or FALSE, while conditionals let us define what should happen in the code depending on whether a logical is TRUE or FALSE. For example, if it is raining when you leave your house you may want to take an umbrella, but if it’s not raining you’d rather leave the umbrella at home. We can express this as a logical and a condition: IF(\(raining\)) THEN take umbrella ELSE leave umbrella. Here IF represents a conditional, and \(raining\) represents a logical.
In R, you can use logical operators to compare values. When a logical condition is true, R returns the value TRUE, and when it is false, FALSE. There are several different operators to compare values (or objects) in R:
Equality (==) and Inequality (!=): Check if two objects are equal or not equal using == and !=, respectively. For example:
4 == 4
[1] TRUE
4 == 2
[1] FALSE
4 != 2
[1] TRUE
Greater Than (>), Less Than (<): Compare values to check if one is greater or less than another:
4 > 2
[1] TRUE
4 < 2
[1] FALSE
Greater Or Equal (>=), Less Or Equal (<=): Compare values to check if one is greater than or equal to, or less than or equal to, another:
4 > 4
[1] FALSE
4 >= 4
[1] TRUE
Logical AND (&) and Logical OR (|): Combine multiple conditions using & for AND and | for OR:
is_raining <- TRUE
has_umbrella <- TRUE
# Testing if is raining and has umbrella
is_raining & has_umbrella
[1] TRUE
has_raincoat <- FALSE
# Testing if has umbrella or raincoat
has_umbrella | has_raincoat
[1] TRUE
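Comparisons and the & and | operators can be combined in a single expression. A minimal sketch, where age and is_working_age are made-up names and the cut-offs 18 and 65 are arbitrary:

```r
age <- 30   # hypothetical value

# TRUE only if both comparisons are TRUE
age >= 18 & age < 65

# TRUE if at least one comparison is TRUE
age < 18 | age >= 65

# The result of a comparison can be stored in an object like any other value
is_working_age <- age >= 18 & age < 65
is_working_age
```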
We can use logical operators to select elements in a vector or matrix. When we test a logical condition on a vector, R returns a logical vector of the same length as the original vector. This logical vector can be used to index the original vector. For example, if we have a vector of numbers and we want to select only the numbers that are greater than 5, we can use the following code:
vec <- c(1, 6, 3, 8, 2, 9)
vec > 5
[1] FALSE  TRUE FALSE  TRUE FALSE  TRUE
vec[vec > 5]
[1] 6 8 9
This can also be applied to matrices:
# initialize matrix with ones on the diagonal
m <- matrix(
  c(1, 2, 3,
    2, 1, 2,
    4, 2, 1),
  nrow = 3,
  byrow = TRUE)
# Test the elementwise condition m is equal to one
m == 1
      [,1]  [,2]  [,3]
[1,]  TRUE FALSE FALSE
[2,] FALSE  TRUE FALSE
[3,] FALSE FALSE  TRUE
# Selecting elements in m based on a condition
m[m > 1]
[1] 2 4 2 2 3 2
# Selecting elements in m based on multiple conditions
m[m > 1 & m < 3]
[1] 2 2 2 2
Testing a condition returns another matrix of the same size, showing where the condition is true. Selecting elements returns the elements matching the statement.
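The same idea carries over to data frames: a logical vector built from one column can be used to pick out rows. The sketch below uses base R indexing on a small made-up data frame called people; the object name older is also an assumption for this example.

```r
people <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 30, 22)
)

older <- people$Age > 24     # logical vector: TRUE TRUE FALSE
people$Name[older]           # names where the condition is TRUE

# Row index before the comma, column index left blank to keep all columns
people[older, ]
```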
Logical operators are useful to compare objects, but the real power of logical operators shows when combined with conditional statements. Conditional statements allow you to execute different parts of the code based on logical conditions. In R, you can use ifelse for conditional assignment:
is_raining <- TRUE
# IF(is_raining) THEN bring umbrella ELSE leave umbrella
ifelse(is_raining, "bring umbrella", "leave umbrella")
[1] "bring umbrella"
is_raining <- FALSE
ifelse(is_raining, "bring umbrella", "leave umbrella")
[1] "leave umbrella"
ifelse is useful when transforming objects, as you will learn more about later in the course.
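One reason ifelse suits transformations is that it works element by element on a whole vector. A minimal sketch, where ages, age_group, and the labels are made up for illustration:

```r
ages <- c(25, 30, 22)   # hypothetical values

# One label per element, chosen by the condition ages >= 25
age_group <- ifelse(ages >= 25, "25 or older", "under 25")
age_group
```

The result is a character vector of the same length as ages, with one label per element.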
In this course we will often talk about functions, inputs, parameters and arguments, and outputs, so having an idea of these concepts is useful, even if you don’t need to write functions yourself (you are of course free to try!).