Data Science with R, Part 2: Beyond Basics

May 21, 2018

Data Science with R, Part 2: Intermediate R

This post is the second part of the Data Science with R series.

Previous post in Data Science with R:

  1. Data Science with R, Part 1: Introduction to R

Conditionals and Control Flow

Relational Operators

Relational operators allow us to compare values in R. By values, I mean logicals, numerics and characters.

  • Equality
    • == - equal to
    • != - not equal to
  • Less and greater than
    • < - less than
    • > - greater than
    • <= - less than or equal to
    • >= - greater than or equal to

Important - a single = is an assignment operator, much like <-; it assigns a value to a variable. To test for equality, always use the double ==.

Now let’s see how we can use these operators.

# Assign value to a variable
a <- 1
b <- 10

# a is equal to b
a == b
## [1] FALSE
# a is not equal to b
a != b
## [1] TRUE

Note that the result of these comparisons is a logical (Boolean) value, TRUE or FALSE.

Let’s see another example using vectors of numerics.

# Create vectors
d <- c(1, 7, 3, 9, 5)
e <- c(6, 2, 8, 4, 10)

# d is greater than e
d > e
## [1] FALSE  TRUE FALSE  TRUE FALSE
# d is less than e
d < e
## [1]  TRUE FALSE  TRUE FALSE  TRUE

The result from this example is a logical vector. This is because the comparison occurs in an element-wise fashion. For example, for ‘d is greater than e’, 1 is greater than 6 is FALSE, 7 is greater than 2 is TRUE, and so on.
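As a small sketch of how such a logical vector can be put to work (the names big and d_wins are made up for this illustration), a comparison against a single number is recycled across the whole vector, and a logical vector can be used to pick out elements:

```r
d <- c(1, 7, 3, 9, 5)
e <- c(6, 2, 8, 4, 10)

# Comparing with a single value: 4 is recycled for every element of d
big <- d > 4
big
## [1] FALSE  TRUE FALSE  TRUE  TRUE

# Use the result of d > e to keep only the elements of d that beat e
d_wins <- d[d > e]
d_wins
## [1] 7 9
```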

Important - Under the hood in R, TRUE is treated as 1 and FALSE as 0 whenever a logical value is used in arithmetic. This characteristic is useful when we want to count something in R.

# Create a character vector
f <- c("a", "a", "a", "a", "a", "a", "a", "a", "b", "b", "b", "b")


# Count how many "a" in vector f
sum(f == "a")
## [1] 8
# Let's see the result of f == "a"
f == "a"
##  [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE
## [12] FALSE

sum() interprets each TRUE as 1 and each FALSE as 0, adds them up and returns the total. We can see that there are 8 TRUEs in the result.
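In the same spirit, and only as an illustrative sketch (n_a and prop_a are names invented here), mean() on a logical vector gives the proportion of TRUE values, because it averages the 1s and 0s:

```r
f <- c("a", "a", "a", "a", "a", "a", "a", "a", "b", "b", "b", "b")

# Count of "a"
n_a <- sum(f == "a")
n_a
## [1] 8

# Proportion of "a": 8 out of 12
prop_a <- mean(f == "a")
prop_a
## [1] 0.6666667
```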

Now, let’s use these relational operators on a dataset. This example uses the ‘quine’ dataset in the MASS package. You can use a question mark to read the description of the dataset, like this: ?quine

# Load MASS package
library(MASS)

# Create a subset: all aboriginal students
abo <- subset(quine, Eth == "A")

# View abo
abo
##    Eth Sex Age Lrn Days
## 1    A   M  F0  SL    2
## 2    A   M  F0  SL   11
## 3    A   M  F0  SL   14
## 4    A   M  F0  AL    5
## 5    A   M  F0  AL    5
## 6    A   M  F0  AL   13
## 7    A   M  F0  AL   20
## 8    A   M  F0  AL   22
## 9    A   M  F1  SL    6
## 10   A   M  F1  SL    6
## 11   A   M  F1  SL   15
## 12   A   M  F1  AL    7
## 13   A   M  F1  AL   14
## 14   A   M  F2  SL    6
## 15   A   M  F2  SL   32
## 16   A   M  F2  SL   53
## 17   A   M  F2  SL   57
## 18   A   M  F2  AL   14
## 19   A   M  F2  AL   16
## 20   A   M  F2  AL   16
## 21   A   M  F2  AL   17
## 22   A   M  F2  AL   40
## 23   A   M  F2  AL   43
## 24   A   M  F2  AL   46
## 25   A   M  F3  AL    8
## 26   A   M  F3  AL   23
## 27   A   M  F3  AL   23
## 28   A   M  F3  AL   28
## 29   A   M  F3  AL   34
## 30   A   M  F3  AL   36
## 31   A   M  F3  AL   38
## 32   A   F  F0  SL    3
## 33   A   F  F0  AL    5
## 34   A   F  F0  AL   11
## 35   A   F  F0  AL   24
## 36   A   F  F0  AL   45
## 37   A   F  F1  SL    5
## 38   A   F  F1  SL    6
## 39   A   F  F1  SL    6
## 40   A   F  F1  SL    9
## 41   A   F  F1  SL   13
## 42   A   F  F1  SL   23
## 43   A   F  F1  SL   25
## 44   A   F  F1  SL   32
## 45   A   F  F1  SL   53
## 46   A   F  F1  SL   54
## 47   A   F  F1  AL    5
## 48   A   F  F1  AL    5
## 49   A   F  F1  AL   11
## 50   A   F  F1  AL   17
## 51   A   F  F1  AL   19
## 52   A   F  F2  SL    8
## 53   A   F  F2  SL   13
## 54   A   F  F2  SL   14
## 55   A   F  F2  SL   20
## 56   A   F  F2  SL   47
## 57   A   F  F2  SL   48
## 58   A   F  F2  SL   60
## 59   A   F  F2  SL   81
## 60   A   F  F2  AL    2
## 61   A   F  F3  AL    0
## 62   A   F  F3  AL    2
## 63   A   F  F3  AL    3
## 64   A   F  F3  AL    5
## 65   A   F  F3  AL   10
## 66   A   F  F3  AL   14
## 67   A   F  F3  AL   21
## 68   A   F  F3  AL   36
## 69   A   F  F3  AL   40

Our abo now contains only Aboriginal students.
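subset() is one way to do this. A rough equivalent, assuming the same quine data, is to index the rows with the logical vector directly; abo2 is just a name chosen for this sketch:

```r
library(MASS)

# Keep the rows where Eth == "A", and all columns
abo2 <- quine[quine$Eth == "A", ]

# Number of Aboriginal students, matching the 69 rows printed above
nrow(abo2)
## [1] 69
```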

Logical Operators

Logical operators are used to perform Boolean operations.

  • AND - &
    • TRUE & TRUE returns TRUE
    • TRUE & FALSE returns FALSE
    • FALSE & FALSE returns FALSE
  • OR - |
    • TRUE | TRUE returns TRUE
    • TRUE | FALSE returns TRUE
    • FALSE | FALSE returns FALSE
  • NOT - ! (negates output)
    • !TRUE returns FALSE
    • !FALSE returns TRUE
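These operators work element-wise on vectors, just like the relational operators. A minimal sketch with two made-up logical vectors p and q:

```r
p <- c(TRUE, TRUE, FALSE)
q <- c(TRUE, FALSE, FALSE)

p & q
## [1]  TRUE FALSE FALSE
p | q
## [1]  TRUE  TRUE FALSE
!p
## [1] FALSE FALSE  TRUE
```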

Let’s see these operators in action.

# Create a subset of quine dataset: Aboriginal, Male students
abom <- subset(quine, Eth == "A" & Sex == "M")

# View abom
abom
##    Eth Sex Age Lrn Days
## 1    A   M  F0  SL    2
## 2    A   M  F0  SL   11
## 3    A   M  F0  SL   14
## 4    A   M  F0  AL    5
## 5    A   M  F0  AL    5
## 6    A   M  F0  AL   13
## 7    A   M  F0  AL   20
## 8    A   M  F0  AL   22
## 9    A   M  F1  SL    6
## 10   A   M  F1  SL    6
## 11   A   M  F1  SL   15
## 12   A   M  F1  AL    7
## 13   A   M  F1  AL   14
## 14   A   M  F2  SL    6
## 15   A   M  F2  SL   32
## 16   A   M  F2  SL   53
## 17   A   M  F2  SL   57
## 18   A   M  F2  AL   14
## 19   A   M  F2  AL   16
## 20   A   M  F2  AL   16
## 21   A   M  F2  AL   17
## 22   A   M  F2  AL   40
## 23   A   M  F2  AL   43
## 24   A   M  F2  AL   46
## 25   A   M  F3  AL    8
## 26   A   M  F3  AL   23
## 27   A   M  F3  AL   23
## 28   A   M  F3  AL   28
## 29   A   M  F3  AL   34
## 30   A   M  F3  AL   36
## 31   A   M  F3  AL   38
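The same pattern works for OR and NOT. The sketch below assumes the quine data from MASS; abo_or_m and not_abo are names made up for the example, and the row counts are deliberately not shown:

```r
library(MASS)

# Students who are Aboriginal OR male (or both)
abo_or_m <- subset(quine, Eth == "A" | Sex == "M")

# Students who are NOT Aboriginal
not_abo <- subset(quine, !(Eth == "A"))

# Every remaining row satisfies the condition that was asked for
all(abo_or_m$Eth == "A" | abo_or_m$Sex == "M")
## [1] TRUE
any(not_abo$Eth == "A")
## [1] FALSE
```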

if, else if and else Statements

Let’s look at the structure of writing a combination of these three statements.
[Figure: structure of if, else if and else statements]
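
Written out as code, the general shape is roughly as follows. This is only a sketch: x and the messages are placeholders chosen for illustration.

```r
x <- 0  # placeholder value

if (x > 0) {                 # condition for if
  msg <- "x is positive"     # statement for if
} else if (x < 0) {          # condition for else if
  msg <- "x is negative"     # statement for else if
} else {
  msg <- "x is zero"         # statement for else
}
msg
## [1] "x is zero"
```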

A few things you need to note:

  • Conditions must be in parentheses ()
  • Statements are enclosed in curly brackets {}
  • else if comes between the if and else statements

Before we create an example, let’s understand how these statements are executed.
[Figure: flow diagram of if, else if and else statements]

Let’s start with our if statement. If the condition for if is satisfied, the statement for if is executed and the output comes from that statement.

If the condition for if is not satisfied, R evaluates the next condition. In this case I have included an else if, so R evaluates the condition in the else if statement. If that condition is satisfied, the statement for else if is executed and the output comes from that statement.

If the condition for else if is not satisfied either, R moves on to the final statement, else. The else statement has no condition of its own: its statement is executed whenever none of the earlier conditions were satisfied.

Important:

  • An if statement can run on its own, without any else if or else statements attached to it, but else if and else cannot be used without a preceding if
  • else if is optional: if there is no else if statement but an else statement is present, R moves straight to the else statement when the if condition is not satisfied
  • You can chain several else if statements
  • Once a condition is satisfied, R ignores the rest of the conditions that come after it
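
To make the last point concrete, here is a small sketch (h and result are hypothetical names) in which both conditions are true, yet only the first branch runs. An if on its own, without else, is also shown.

```r
h <- 5
result <- character(0)

# Both h < 13 and h < 10 are TRUE, but only the first matching branch runs
if (h < 13) {
  result <- c(result, "less than 13")
} else if (h < 10) {
  result <- c(result, "less than 10")
}

# An if statement on its own is perfectly valid
if (h > 100) {
  result <- c(result, "huge")
}
result
## [1] "less than 13"
```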

Let’s see how these statements work based on this example.

g <- 10

# Condition of if statement is satisfied
if (g < 13) {
  print("g is less than 13")
} else if (g == 13) {
  print("g is 13")
} else {
  print("g is greater than 13")
}
## [1] "g is less than 13"

Notice that R has ignored the else if and else statements since the if condition is satisfied.

g <- 13

# Condition of else if statement is satisfied
if (g < 13) {
  print("g is less than 13")
} else if (g == 13) {
  print("g is 13")
} else {
  print("g is greater than 13")
}
## [1] "g is 13"

g <- 14

# Two else if statements, condition of second else if statement is satisfied
if (g < 13) {
  print("g is less than 13")
} else if (g == 13) {
  print("g is 13")
} else if (g == 14) {
  print("Oh it's 14!")
} else {
  print("g is greater than 13")
}
## [1] "Oh it's 14!"

g <- 15

# No conditions are satisfied, so else statement is executed
if (g < 13) {
  print("g is less than 13")
} else if (g == 13) {
  print("g is 13")
} else {
  print("g is greater than 13")
}
## [1] "g is greater than 13"

Loops

In data analysis, you will often find yourself repeating an action, for example, extracting certain information and dumping the result into a new column. This repetitive task can be automated by using loops in R. We will cover two loops in this article: while and for.

While Loop

Similar to the if statement, a while loop has a condition and a statement to be executed. On top of that, the loop needs to be initialised for it to run. Let’s see the structure of a while loop.
[Figure: structure of a basic while loop]
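
In code, a basic while loop has roughly this shape. It is a sketch only; counter and total are placeholder names.

```r
counter <- 1   # initialise
total <- 0

while (counter <= 3) {        # condition
  total <- total + counter    # statement
  counter <- counter + 1      # increment, so the condition eventually fails
}
total
## [1] 6
```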

Since we have learnt the conditional statements, let’s throw an if statement in the while loop for the craic. This time, I will introduce you to the break statement. It will stop the execution of the while loop when the condition is met.
[Figure: structure of a while loop with a break statement]
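
A sketch of the same shape with a break inside an if (again with placeholder names, counter and seen):

```r
counter <- 1          # initialise
seen <- numeric(0)

while (counter <= 10) {      # condition
  if (counter == 4) {        # condition for if
    break                    # leave the loop immediately
  }
  seen <- c(seen, counter)   # statement
  counter <- counter + 1     # increment
}
seen
## [1] 1 2 3
```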

Before we create an example, let’s see how a while loop is executed.
[Figure: flow diagram of a while loop]

From the flow diagram, we can see that a while loop will keep running as long as the condition is satisfied. The condition is re-evaluated after each cycle or iteration, using whatever values the previous iteration produced. Now, let’s play.

# Initialise while loop
dogs <- 1

# while loop (basic)
while (dogs < 8) {
  print(paste("I have", dogs, "doggos!")) # statement 1
  dogs <- dogs + 1 # increment
}
## [1] "I have 1 doggos!"
## [1] "I have 2 doggos!"
## [1] "I have 3 doggos!"
## [1] "I have 4 doggos!"
## [1] "I have 5 doggos!"
## [1] "I have 6 doggos!"
## [1] "I have 7 doggos!"

From the output, we can see that 8 is not included, as we used the < operator instead of <=. We can also see that the first output is grammatically wrong in the doggo world. Let’s fix this with an if statement.

# Initialise while loop
dogs <- 1

# while loop with if and else statements
while (dogs < 8) {
  if (dogs == 1) {
    print(paste("I have", dogs, "doggo!"))  # statement 2
  } else {
    print(paste("I have", dogs, "doggos!")) # statement 1
  }
  dogs <- dogs + 1 # increment
}
## [1] "I have 1 doggo!"
## [1] "I have 2 doggos!"
## [1] "I have 3 doggos!"
## [1] "I have 4 doggos!"
## [1] "I have 5 doggos!"
## [1] "I have 6 doggos!"
## [1] "I have 7 doggos!"

In our second while loop, we have put in an if and also an else. If we don’t put in an else, “I have 1 doggos!” will still be printed as an output and we don’t want that.

Let’s throw in the break statement in our while loop.

# Initialise while loop
dogs <- 1

# while loop with a break statement + if, else if, else statements
while (dogs < 8) {
  if (dogs == 1) {
    print(paste("I have", dogs, "doggo!"))  # statement 2
  } else if (dogs == 5) {
    break                                   # break statement
  } else {
    print(paste("I have", dogs, "doggos!")) # statement 1
  }
  dogs <- dogs + 1 # increment
}
## [1] "I have 1 doggo!"
## [1] "I have 2 doggos!"
## [1] "I have 3 doggos!"
## [1] "I have 4 doggos!"
For Loop

Our next loop is the famous for loop. The structure of this loop is slightly different from the while loop.
[Figure: structure of a for loop]
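
As code, the shape is roughly this (a sketch; item and items are placeholder names):

```r
items <- c("a", "b", "c")   # the sequence

for (item in items) {       # variable in sequence
  print(item)               # statement
}
## [1] "a"
## [1] "b"
## [1] "c"
```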

Let’s see the flow diagram of a for loop.
[Figure: flow diagram of a for loop]

The difference in the for loop:

  • Instead of testing a condition, the for loop takes each element of the sequence in turn and executes the statement once for that element; when there are no elements left, the loop stops
  • Nothing is re-evaluated to keep the for loop running; the number of iterations is set by the length of the sequence

Let’s create a for loop.

# Create vector
doggies <- c("Jodi", "Loki", "Ru", "Pip", "Bear")

# Create a simple for loop
for (dogs in doggies) {                   # variable in sequence
  print(paste("My dog's name is", dogs))  # statement 1
}
## [1] "My dog's name is Jodi"
## [1] "My dog's name is Loki"
## [1] "My dog's name is Ru"
## [1] "My dog's name is Pip"
## [1] "My dog's name is Bear"

Let’s throw in a break statement into our for loop.

# Create a loop with a break statement
for (dogs in doggies) {                   # variable in sequence
  if (nchar(dogs) == 2) {                 # condition for if statement
    break                                 # break statement
  }
  print(paste("My dog's name is", dogs))  # statement 1
}
## [1] "My dog's name is Jodi"
## [1] "My dog's name is Loki"

In this loop, we have specified that the loop should stop when it encounters a dog’s name of two characters, nchar(dogs) == 2. The third name, Ru, meets that condition, so only Jodi and Loki are printed out. They are the best dogs in the world!

But what if we still want the for loop to go through Pip and Bear? We can do that by using a next statement instead of break.

# Create a loop with a next statement
for (dogs in doggies) {                   # variable in sequence
  if (nchar(dogs) == 2) {                 # condition for if statement
    next                                  # next statement
  }
  print(paste("My dog's name is", dogs))  # statement 1
}
## [1] "My dog's name is Jodi"
## [1] "My dog's name is Loki"
## [1] "My dog's name is Pip"
## [1] "My dog's name is Bear"

What if we want to print “Jodi is dog number 1 in the list”? How do we iterate this process using the for loop?

# Create a generalised for loop
for (dogs in 1:length(doggies)) {                                    # variable in sequence
  print(paste(doggies[dogs], "is dog number", dogs, "in the list"))  # statement 1
}
## [1] "Jodi is dog number 1 in the list"
## [1] "Loki is dog number 2 in the list"
## [1] "Ru is dog number 3 in the list"
## [1] "Pip is dog number 4 in the list"
## [1] "Bear is dog number 5 in the list"

Let’s see the result of 1:length(doggies).

# View 1:length(doggies)
1:length(doggies)
## [1] 1 2 3 4 5

So, dogs in 1:length(doggies) will take these values in turn: in the first iteration dogs is 1, in the second iteration dogs is 2, and so on.
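
A common alternative, offered here as a suggestion rather than something the example above requires, is seq_along(). It produces the same indices for a non-empty vector but behaves better for an empty one, where 1:length(x) counts down from 1 to 0 (no_dogs is a hypothetical empty vector):

```r
doggies <- c("Jodi", "Loki", "Ru", "Pip", "Bear")

seq_along(doggies)
## [1] 1 2 3 4 5

no_dogs <- character(0)   # hypothetical empty vector
1:length(no_dogs)         # two iterations, neither of them valid
## [1] 1 0
seq_along(no_dogs)        # zero iterations
## integer(0)
```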

lapply, sapply, vapply

lapply

lapply() is a function in the apply family. It applies a function over a list or a vector. It is important to note that lapply() will always return a list. l for list, easy to remember. In its simplest form, lapply() takes two arguments: lapply(list/vector, function to be applied).

Let’s use our doggies example. Say we want to print the number of characters in the dogs’ names. We can do that using a for loop, but it is easier using lapply(), which applies the nchar() function to every element in our doggies vector.

# Print number of characters using for loop
for (dogs in doggies) {
  print(nchar(dogs))
}
## [1] 4
## [1] 4
## [1] 2
## [1] 3
## [1] 4

# Print number of characters using lapply
lapply(doggies, nchar)
## [[1]]
## [1] 4
##
## [[2]]
## [1] 4
##
## [[3]]
## [1] 2
##
## [[4]]
## [1] 3
##
## [[5]]
## [1] 4

If we don’t want the result in the form of a list, we can wrap it with the unlist() function. The result will be a vector instead of a list.

unlist(lapply(doggies, nchar))
## [1] 4 4 2 3 4
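
Extra arguments after the function are passed on to it for every element. As a sketch, the call below hands start and stop to substr() to take the first two letters of each name (initials2 is a made-up name):

```r
doggies <- c("Jodi", "Loki", "Ru", "Pip", "Bear")

# start = 1 and stop = 2 are forwarded to substr() on every call
initials2 <- unlist(lapply(doggies, substr, start = 1, stop = 2))
initials2
## [1] "Jo" "Lo" "Ru" "Pi" "Be"
```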

sapply

sapply() works like lapply(), but tries to simplify the result. It has two extra arguments that are set by default: simplify = TRUE and USE.NAMES = TRUE.

sapply(doggies, nchar)
## Jodi Loki   Ru  Pip Bear
##    4    4    2    3    4

From the result, we can see that sapply() returns a named vector instead of a list. If we don’t want the names, we can suppress this behaviour like this.

sapply(doggies, nchar, USE.NAMES = FALSE)
## [1] 4 4 2 3 4

Important: If sapply() cannot simplify the result, it will return a list, just like lapply().
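
A sketch of both outcomes, using helper functions invented for this example: when every result has the same length, sapply() builds a matrix; when the lengths differ, it falls back to a list.

```r
doggies <- c("Jodi", "Loki", "Ru", "Pip", "Bear")

# Every call returns two values, so the result simplifies to a 2 x 5 matrix
same_len <- sapply(doggies, function(d) c(nchar(d), nchar(d) * 2))
dim(same_len)
## [1] 2 5

# The number of letters differs per name, so no simplification is possible
diff_len <- sapply(doggies, function(d) strsplit(d, "")[[1]])
is.list(diff_len)
## [1] TRUE
```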

vapply

As mentioned earlier, sapply() can’t simplify everything, and it does not tell you when it gives up. vapply() is a safer option, but it comes with an extra argument that you have to specify. Its signature is vapply(X, FUN, FUN.VALUE, ..., USE.NAMES = TRUE). X, FUN and FUN.VALUE are compulsory.

vapply(doggies, nchar, numeric(1))
## Jodi Loki   Ru  Pip Bear
##    4    4    2    3    4

FUN.VALUE is the template for the return value. numeric(1) means that each call to the function must return a numeric value of length 1.
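
The safety comes from the template being enforced. In this sketch, asking for character(1) when nchar() returns integers makes vapply() stop with an error instead of quietly returning something unexpected; tryCatch() and the message text are used only so the example keeps running.

```r
doggies <- c("Jodi", "Loki", "Ru", "Pip", "Bear")

# nchar() returns integers, so integer(1) is the tightest matching template
vapply(doggies, nchar, integer(1))
## Jodi Loki   Ru  Pip Bear
##    4    4    2    3    4

# A template of the wrong type is rejected
outcome <- tryCatch(
  vapply(doggies, nchar, character(1)),
  error = function(err) "vapply refused the mismatched template"
)
outcome
## [1] "vapply refused the mismatched template"
```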

Recap

Let’s recap. We have three functions in the apply family:

  1. lapply() - lapply(list/vector, function to be applied)
    • Result is returned as list
  2. sapply() - sapply(list/vector, function to be applied, USE.NAMES = TRUE)
    • Result is simplified into a vector (or a matrix) where possible, but it can’t simplify everything
    • By default, it returns a named result; this can be suppressed by specifying USE.NAMES = FALSE
  3. vapply() - vapply(list/vector, function to be applied, template for result, USE.NAMES = TRUE)
    • Need to specify template for result
    • A safer option for simplifying the result than sapply()
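
As a quick check of the recap, and assuming the doggies vector from earlier, the three functions agree once the list is flattened and the names are dropped (via_lapply, via_sapply and via_vapply are names chosen for this sketch):

```r
doggies <- c("Jodi", "Loki", "Ru", "Pip", "Bear")

via_lapply <- unlist(lapply(doggies, nchar))
via_sapply <- sapply(doggies, nchar, USE.NAMES = FALSE)
via_vapply <- vapply(doggies, nchar, integer(1), USE.NAMES = FALSE)

identical(via_lapply, via_sapply)
## [1] TRUE
identical(via_sapply, via_vapply)
## [1] TRUE
```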