Data Science with R, Part 2: Beyond Basics
This post is the second part of the Data Science with R series.
Conditionals and Control Flow
Relational Operators
Relational operators allow us to compare values in R. Values here include logicals, numerics and characters.
- Equality
  - == : equal to
  - != : not equal to
- Less and greater than
  - < : less than
  - > : greater than
  - <= : less than or equal to
  - >= : greater than or equal to
Important: a single = is an assignment operator in R, like <-. It assigns a value to a variable and does not test for equality. Use == to compare.
Now let’s see how we can use these operators.
# Assign value to a variable
a <- 1
b <- 10
# a is equal to b
a == b
## [1] FALSE
# a is not equal to b
a != b
## [1] TRUE
Note that the result of these comparisons is a Boolean (logical) value, TRUE or FALSE.
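The same operators also work on characters and logicals, not only numerics. A minimal sketch, where the variable names are only illustrative:

```r
# Character comparison: are the two strings identical?
same_word <- "dog" == "dog"
same_word
## [1] TRUE

# Character comparison: are the two strings different?
different_word <- "dog" != "cat"
different_word
## [1] TRUE

# Logical comparison: TRUE and FALSE can be compared like any other value
same_logical <- TRUE == FALSE
same_logical
## [1] FALSE
```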
Let’s see another example using vectors of numerics.
# Create vectors
d <- c(1, 7, 3, 9, 5)
e <- c(6, 2, 8, 4, 10)
# d is greater than e
d > e
## [1] FALSE TRUE FALSE TRUE FALSE
# d is less than e
d < e
## [1] TRUE FALSE TRUE FALSE TRUE
This example returns a vector of Booleans because the comparison happens element-wise. For example, for ‘d is greater than e’, 1 is greater than 6 is false, 7 is greater than 2 is true, and so on.
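Element-wise comparison also works when one side is a single value: R compares every element of the vector against it. The sketch below reuses d and shows how the resulting logical vector can pick out elements; above_five is just an illustrative name.

```r
# Recreate d so the example is self-contained
d <- c(1, 7, 3, 9, 5)

# Compare every element of d against a single value
above_five <- d > 5
above_five
## [1] FALSE TRUE FALSE TRUE FALSE

# Use the logical vector to keep only the elements where the result is TRUE
d[above_five]
## [1] 7 9
```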
Important: under the hood in R, TRUE is equal to 1 and FALSE is equal to 0. This is useful when we want to count something in R.
# Create a character vector
f <- c("a", "a", "a", "a", "a", "a", "a", "a", "b", "b", "b", "b")
# Count how many "a" in vector f
sum(f == "a")
## [1] 8
# Let's see the result of f == "a"
f == "a"
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE
## [12] FALSE
sum() interprets each TRUE as 1 and each FALSE as 0, adds them up and returns the total. We can see that there are 8 TRUEs in the result.
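Because TRUE counts as 1, the same trick gives a proportion if we swap sum() for mean(). A small sketch using the same vector f, rebuilt here with rep() to keep it short:

```r
# Same content as f above: eight "a" followed by four "b"
f <- c(rep("a", 8), rep("b", 4))

# Proportion of elements equal to "a": 8 out of 12
prop_a <- mean(f == "a")
prop_a
## [1] 0.6666667

# Count the elements that are NOT "a"
not_a <- sum(f != "a")
not_a
## [1] 4
```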
Now, let’s use these relational operators on a dataset. This example uses the ‘quine’ dataset in the MASS package. Once the package is loaded, you can use a question mark to read the description of the dataset, like this: ?quine
# Load MASS package
library(MASS)
# Create a subset: all aboriginal students
abo <- subset(quine, Eth == "A")
# View abo
abo
## Eth Sex Age Lrn Days
## 1 A M F0 SL 2
## 2 A M F0 SL 11
## 3 A M F0 SL 14
## 4 A M F0 AL 5
## 5 A M F0 AL 5
## 6 A M F0 AL 13
## 7 A M F0 AL 20
## 8 A M F0 AL 22
## 9 A M F1 SL 6
## 10 A M F1 SL 6
## 11 A M F1 SL 15
## 12 A M F1 AL 7
## 13 A M F1 AL 14
## 14 A M F2 SL 6
## 15 A M F2 SL 32
## 16 A M F2 SL 53
## 17 A M F2 SL 57
## 18 A M F2 AL 14
## 19 A M F2 AL 16
## 20 A M F2 AL 16
## 21 A M F2 AL 17
## 22 A M F2 AL 40
## 23 A M F2 AL 43
## 24 A M F2 AL 46
## 25 A M F3 AL 8
## 26 A M F3 AL 23
## 27 A M F3 AL 23
## 28 A M F3 AL 28
## 29 A M F3 AL 34
## 30 A M F3 AL 36
## 31 A M F3 AL 38
## 32 A F F0 SL 3
## 33 A F F0 AL 5
## 34 A F F0 AL 11
## 35 A F F0 AL 24
## 36 A F F0 AL 45
## 37 A F F1 SL 5
## 38 A F F1 SL 6
## 39 A F F1 SL 6
## 40 A F F1 SL 9
## 41 A F F1 SL 13
## 42 A F F1 SL 23
## 43 A F F1 SL 25
## 44 A F F1 SL 32
## 45 A F F1 SL 53
## 46 A F F1 SL 54
## 47 A F F1 AL 5
## 48 A F F1 AL 5
## 49 A F F1 AL 11
## 50 A F F1 AL 17
## 51 A F F1 AL 19
## 52 A F F2 SL 8
## 53 A F F2 SL 13
## 54 A F F2 SL 14
## 55 A F F2 SL 20
## 56 A F F2 SL 47
## 57 A F F2 SL 48
## 58 A F F2 SL 60
## 59 A F F2 SL 81
## 60 A F F2 AL 2
## 61 A F F3 AL 0
## 62 A F F3 AL 2
## 63 A F F3 AL 3
## 64 A F F3 AL 5
## 65 A F F3 AL 10
## 66 A F F3 AL 14
## 67 A F F3 AL 21
## 68 A F F3 AL 36
## 69 A F F3 AL 40
Our abo now contains only Aboriginal students.
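subset() is not the only way to filter rows. The sketch below does the same filtering with square-bracket indexing, which makes the logical vector produced by == explicit. The names abo_rows and abo_bracket are illustrative, and the example assumes the MASS package is installed.

```r
library(MASS)

# One TRUE/FALSE value per row of quine
abo_rows <- quine$Eth == "A"

# Keep the rows where the comparison is TRUE, and all columns
abo_bracket <- quine[abo_rows, ]

# Same number of rows as the subset() result shown above
nrow(abo_bracket)
## [1] 69
```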
Logical Operators
Logical operators are used to do Boolean operations.
- AND: &
  - TRUE & TRUE returns TRUE
  - TRUE & FALSE returns FALSE
  - FALSE & FALSE returns FALSE
- OR: |
  - TRUE | TRUE returns TRUE
  - TRUE | FALSE returns TRUE
  - FALSE | FALSE returns FALSE
- NOT: ! (negates the value)
  - !TRUE returns FALSE
  - !FALSE returns TRUE
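Like the relational operators, &, | and ! work element-wise on vectors. A short sketch with made-up vectors x and y, plus a range check on the numeric vector d from earlier:

```r
x <- c(TRUE, TRUE, FALSE)
y <- c(TRUE, FALSE, FALSE)

both <- x & y     # TRUE only where both are TRUE
both
## [1] TRUE FALSE FALSE

either <- x | y   # TRUE where at least one is TRUE
either
## [1] TRUE TRUE FALSE

negated <- !x     # flips every element
negated
## [1] FALSE FALSE TRUE

# Combine two comparisons: which elements of d lie between 2 and 8?
d <- c(1, 7, 3, 9, 5)
in_range <- d > 2 & d < 8
in_range
## [1] FALSE TRUE TRUE FALSE TRUE
```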
Let’s see these operators in action.
# Create a subset of quine dataset: Aboriginal, Male students
abom <- subset(quine, Eth == "A" & Sex == "M")
# View abom
abom
## Eth Sex Age Lrn Days
## 1 A M F0 SL 2
## 2 A M F0 SL 11
## 3 A M F0 SL 14
## 4 A M F0 AL 5
## 5 A M F0 AL 5
## 6 A M F0 AL 13
## 7 A M F0 AL 20
## 8 A M F0 AL 22
## 9 A M F1 SL 6
## 10 A M F1 SL 6
## 11 A M F1 SL 15
## 12 A M F1 AL 7
## 13 A M F1 AL 14
## 14 A M F2 SL 6
## 15 A M F2 SL 32
## 16 A M F2 SL 53
## 17 A M F2 SL 57
## 18 A M F2 AL 14
## 19 A M F2 AL 16
## 20 A M F2 AL 16
## 21 A M F2 AL 17
## 22 A M F2 AL 40
## 23 A M F2 AL 43
## 24 A M F2 AL 46
## 25 A M F3 AL 8
## 26 A M F3 AL 23
## 27 A M F3 AL 23
## 28 A M F3 AL 28
## 29 A M F3 AL 34
## 30 A M F3 AL 36
## 31 A M F3 AL 38
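The other operators can be used in subset() in the same way. The following is only a sketch: the object names are made up for illustration, and it assumes the MASS package is installed.

```r
library(MASS)

# AND with a numeric comparison: Aboriginal students absent more than 50 days
long_absence <- subset(quine, Eth == "A" & Days > 50)

# OR: students in age group F0 or F1
early_years <- subset(quine, Age == "F0" | Age == "F1")

# NOT: every student who is not in the Aboriginal group
not_abo <- subset(quine, !(Eth == "A"))

# Check how many rows each subset contains
nrow(long_absence)
nrow(early_years)
nrow(not_abo)
```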
if, else if and else Statements
Let’s look at the structure of writing a combination of these three statements.
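As a rough sketch, the general shape looks like this. The variable temperature and the labels are hypothetical and only there to make the skeleton runnable.

```r
temperature <- 18   # hypothetical value to test

if (temperature < 10) {          # condition 1, in parentheses
  label <- "cold"                # runs only if condition 1 is TRUE
} else if (temperature < 25) {   # condition 2, checked only if condition 1 is FALSE
  label <- "mild"
} else {                         # no condition: runs if every condition above is FALSE
  label <- "hot"
}

label
## [1] "mild"
```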
A few things you need to note:
- Conditions must be in parentheses ()
- Statements are enclosed in curly brackets {}
- else if comes between the if and else statements
Before we create an example, let’s understand how these statements are executed.
Let’s start with the if statement. If its condition is satisfied, the statement for if is executed and the output comes from that statement.
If the condition for if is not satisfied, R evaluates the next condition. In this case I have included an else if, so R evaluates its condition. If that condition is satisfied, the statement for else if is executed and the output comes from that statement.
If the condition for else if is not satisfied either, R moves on to the final statement, else. It has no condition of its own, so its statement is executed and the output comes from that statement.
Important:
- An if statement can run on its own, without an else if or else attached to it. The reverse is not true: else if and else must follow an if
- else if is optional. If there is no else if but an else is present, R moves straight to the else statement when the if condition is not satisfied
- You can have several else if statements
- Once a condition is satisfied, R ignores the rest of the conditions that come after it
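The last point is easiest to see when two conditions overlap. In this sketch (h and result are illustrative names) both h < 10 and h < 20 are true, yet only the first branch runs. The final lines show an if statement on its own.

```r
h <- 5

if (h < 10) {
  result <- "h is less than 10"   # first TRUE condition wins
} else if (h < 20) {
  result <- "h is less than 20"   # also true for h = 5, but never reached
} else {
  result <- "h is 20 or more"
}

result
## [1] "h is less than 10"

# An if statement on its own: nothing happens when the condition is FALSE
if (h > 100) {
  result <- "h is huge"
}

result
## [1] "h is less than 10"
```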
Let’s see how these statements work based on this example.
g <- 10
# Condition of if statement is satisfied
if (g < 13) {
  print("g is less than 13")
} else if (g == 13) {
  print("g is 13")
} else {
  print("g is greater than 13")
}
## [1] "g is less than 13"
Notice that R has ignored the else if and else statements since the if condition is satisfied.
g <- 13
# Condition of else if statement is satisfied
if (g < 13) {
  print("g is less than 13")
} else if (g == 13) {
  print("g is 13")
} else {
  print("g is greater than 13")
}
## [1] "g is 13"
g <- 14
# Two else if statements, condition of second else if statement is satisfied
if (g < 13) {
  print("g is less than 13")
} else if (g == 13) {
  print("g is 13")
} else if (g == 14) {
  print("Oh it's 14!")
} else {
  print("g is greater than 13")
}
## [1] "Oh it's 14!"
g <- 15
# No conditions are satisfied, so else statement is executed
if (g < 13) {
  print("g is less than 13")
} else if (g == 13) {
  print("g is 13")
} else {
  print("g is greater than 13")
}
## [1] "g is greater than 13"
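Retyping the whole chain for every value of g gets repetitive. One option, sketched here with a hypothetical helper called compare_to_13(), is to wrap the chain in a function and call it with different values. The function returns the message instead of printing it.

```r
# Hypothetical helper: wraps the if / else if / else chain from above
compare_to_13 <- function(g) {
  if (g < 13) {
    "g is less than 13"
  } else if (g == 13) {
    "g is 13"
  } else {
    "g is greater than 13"
  }
}

compare_to_13(10)
## [1] "g is less than 13"
compare_to_13(13)
## [1] "g is 13"
compare_to_13(15)
## [1] "g is greater than 13"
```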
Loops
In data analysis, you will often find yourself repeating an action, for example, extracting certain information and dumping the result into a new column. This repetitive task can be automated by using loops in R. We will cover two loops in this article: while and for.
While Loop
Similar to the if statement, a while loop has a condition and a statement to be executed. On top of that, the loop needs to be initialised for it to run. Let’s see the structure of a while loop.
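A minimal sketch of that structure, using a made-up counter called count that counts down from 3:

```r
count <- 3            # 1. initialise the loop variable

while (count > 0) {   # 2. condition, checked before every iteration
  print(count)        # 3. statement to be executed
  count <- count - 1  # 4. update, so the condition eventually becomes FALSE
}
## [1] 3
## [1] 2
## [1] 1
```

Without step 4 the condition would stay TRUE forever and the loop would never stop.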
Since we have learnt the conditional statements, let’s throw an if statement in the while loop for the craic. This time, I will introduce you to the break statement. It stops the execution of the while loop when the condition of its if statement is met.
Before we create an example, let’s see how a while loop is executed.
From the flow diagram, we can see that a while loop will keep running as long as the condition is satisfied. The condition is re-evaluated after each cycle, or iteration. Now, let’s play.
# Initialise while loop
dogs <- 1
# while loop (basic)
while (dogs < 8) {
  print(paste("I have", dogs, "doggos!")) # statement 1; paste() already separates the pieces with a space
  dogs <- dogs + 1 # increment
}
## [1] "I have 1 doggos!"
## [1] "I have 2 doggos!"
## [1] "I have 3 doggos!"
## [1] "I have 4 doggos!"
## [1] "I have 5 doggos!"
## [1] "I have 6 doggos!"
## [1] "I have 7 doggos!"
From the output, we can see that 8 is not included because we used the < operator instead of <=. We can also see that the first output is grammatically wrong in the doggo world. Let’s fix this with an if statement.
# Initialise while loop
dogs <- 1
# while loop with an if and else statements
while (dogs < 8) {
  if (dogs == 1) {
    print(paste("I have", dogs, "doggo!")) # statement 2
  } else {
    print(paste("I have", dogs, "doggos!")) # statement 1
  }
  dogs <- dogs + 1 # increment
}
## [1] "I have 1 doggo!"
## [1] "I have 2 doggos!"
## [1] "I have 3 doggos!"
## [1] "I have 4 doggos!"
## [1] "I have 5 doggos!"
## [1] "I have 6 doggos!"
## [1] "I have 7 doggos!"
In our second while loop, we have put an if and also an else. If we don’t put an else, “I have 1 doggos!” will still be printed as an output and we don’t want that.
Let’s throw a break statement into our while loop.
# Initialise while loop
dogs <- 1
# while loop with a break statement + if, else if, else statements
while (dogs < 8) {
  if (dogs == 1) {
    print(paste("I have", dogs, "doggo!")) # statement 2
  } else if (dogs == 5) {
    break # break statement: leave the loop immediately
  } else {
    print(paste("I have", dogs, "doggos!")) # statement 1
  }
  dogs <- dogs + 1 # increment
}
## [1] "I have 1 doggo!"
## [1] "I have 2 doggos!"
## [1] "I have 3 doggos!"
## [1] "I have 4 doggos!"
For Loop
Our next loop is the famous for loop. The structure of this loop is slightly different from that of the while loop.
Let’s see the flow diagram of a for loop.
The differences in the for loop:
- Instead of testing a condition, the for loop takes each value in the sequence in turn, assigns it to the loop variable and executes the statement. When the sequence runs out of values, the loop stops
- The output is not re-evaluated to keep the for loop running
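A minimal sketch of that structure, adding up the values of a short made-up sequence. The names total and value are illustrative.

```r
total <- 0                    # somewhere to collect the result

for (value in c(2, 4, 6)) {   # value takes each element of the sequence in turn
  total <- total + value      # statement executed once per element
}

total
## [1] 12
```

There is no counter to initialise or increment here: the loop ends by itself after the last element of the sequence.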
Let’s create a for loop.
# Create vector
doggies <- c("Jodi", "Loki", "Ru", "Pip", "Bear")
# Create a simple for loop
for (dogs in doggies) { # variable in sequence
  print(paste("My dog's name is", dogs)) # statement 1
}
## [1] "My dog's name is Jodi"
## [1] "My dog's name is Loki"
## [1] "My dog's name is Ru"
## [1] "My dog's name is Pip"
## [1] "My dog's name is Bear"
Let’s throw a break statement into our for loop.
# Create a loop with a break statement
for (dogs in doggies) { # variable in sequence
  if (nchar(dogs) == 2) { # condition for if statement
    break # break statement
  }
  print(paste("My dog's name is", dogs)) # statement 1
}
## [1] "My dog's name is Jodi"
## [1] "My dog's name is Loki"
In this loop, we have specified that the loop should stop when it encounters a dog’s name of two characters, nchar(dogs) == 2. So, only Jodi and Loki are printed out. They are the best dogs in the world!
But what if we still want the for loop to go through Pip and Bear? We can do that by using a next statement instead of break. It skips the rest of the current iteration and moves on to the next value in the sequence.
# Create a loop with a next statement
for (dogs in doggies) { # variable in sequence
  if (nchar(dogs) == 2) { # condition for if statement
    next # next statement
  }
  print(paste("My dog's name is", dogs)) # statement 1
}
## [1] "My dog's name is Jodi"
## [1] "My dog's name is Loki"
## [1] "My dog's name is Pip"
## [1] "My dog's name is Bear"
What if we want to print “Jodi is dog number 1 in the list”? How do we iterate this process using the for loop?
# Create a generalised for loop
for (dogs in 1:length(doggies)) { # variable in sequence
  print(paste(doggies[dogs], "is dog number", dogs, "in the list")) # statement 1
}
## [1] "Jodi is dog number 1 in the list"
## [1] "Loki is dog number 2 in the list"
## [1] "Ru is dog number 3 in the list"
## [1] "Pip is dog number 4 in the list"
## [1] "Bear is dog number 5 in the list"
Let’s see the result of 1:length(doggies).
# View 1:length(doggies)
1:length(doggies)
## [1] 1 2 3 4 5
So, dogs in 1:length(doggies) will take these values in turn: in the first iteration dogs is 1, in the second iteration dogs is 2 and so on.
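One thing to watch out for with 1:length(x): if the vector is empty, it produces 1 0 rather than nothing, so the loop would run twice with invalid indices. seq_along() is a base R alternative that avoids this. A small sketch, where no_dogs is a made-up empty vector:

```r
doggies <- c("Jodi", "Loki", "Ru", "Pip", "Bear")

# For a non-empty vector both approaches give the same indices
idx <- seq_along(doggies)
idx
## [1] 1 2 3 4 5

# For an empty vector they differ
no_dogs <- character(0)

risky <- 1:length(no_dogs)   # counts down from 1 to 0
risky
## [1] 1 0

safe <- seq_along(no_dogs)   # empty, so a for loop over it never runs
safe
## integer(0)
```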
lapply, sapply, vapply
lapply
lapply() is a function in the apply family. It applies a function over a list or a vector. It is important to note that lapply() will always return a list: l for list, easy to remember. lapply() takes two main arguments: lapply(list/vector, function to be applied).
Let’s use our doggies example. Say we want to print the number of characters in each dog’s name. We can do that using a for loop, but it is easier to do this using lapply(). It will apply the nchar() function to every element in our doggies vector.
# Print number of characters using for loop
for (dogs in doggies) {
  print(nchar(dogs))
}
## [1] 4
## [1] 4
## [1] 2
## [1] 3
## [1] 4
# Print number of characters using lapply
lapply(doggies, nchar)
## [[1]]
## [1] 4
##
## [[2]]
## [1] 4
##
## [[3]]
## [1] 2
##
## [[4]]
## [1] 3
##
## [[5]]
## [1] 4
If we don’t want the result in the form of a list, we can wrap it with the unlist() function. The result will be a vector instead of a list.
unlist(lapply(doggies, nchar))
## [1] 4 4 2 3 4
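The function passed to lapply() does not have to be a built-in one. The sketch below passes a small anonymous function, and then shows that extra arguments after the function are handed on to it. The names greetings and first_two are illustrative.

```r
doggies <- c("Jodi", "Loki", "Ru", "Pip", "Bear")

# Apply our own (anonymous) function to every name
greetings <- lapply(doggies, function(name) paste("Hello,", name))
greetings[[1]]
## [1] "Hello, Jodi"

# Extra arguments (start, stop) are passed on to substr() for every element
first_two <- lapply(doggies, substr, start = 1, stop = 2)
unlist(first_two)
## [1] "Jo" "Lo" "Ru" "Pi" "Be"
```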
sapply
sapply() works like lapply(), but tries to simplify the result. It has two arguments with default values: simplify = TRUE and USE.NAMES = TRUE.
sapply(doggies, nchar)
## Jodi Loki Ru Pip Bear
## 4 4 2 3 4
From the result, we can see that sapply() returns a named vector instead of a list. If we don’t want the names, we can suppress this behaviour like this.
sapply(doggies, nchar, USE.NAMES = FALSE)
## [1] 4 4 2 3 4
Important: If sapply() cannot simplify the result, it will return a list, just like lapply().
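To see this fallback, we can apply a function whose result has a different length for each element. The sketch below splits every name into its letters. Since the names have different lengths, there is nothing sensible to simplify to and a list comes back. The object names are illustrative.

```r
doggies <- c("Jodi", "Loki", "Ru", "Pip", "Bear")

# Results have lengths 4, 4, 2, 3 and 4, so sapply() returns a list
letters_per_dog <- sapply(doggies, function(name) strsplit(name, "")[[1]])
is.list(letters_per_dog)
## [1] TRUE
letters_per_dog[["Ru"]]
## [1] "R" "u"

# When every result has the same length, sapply() simplifies to a matrix
four_letter_dogs <- c("Jodi", "Loki", "Bear")
letter_matrix <- sapply(four_letter_dogs, function(name) strsplit(name, "")[[1]])
dim(letter_matrix)
## [1] 4 3
```

So the type of the result depends on the data, which is easy to overlook in longer scripts.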
vapply
As mentioned earlier, sapply() can’t simplify everything, so the type of its result is not guaranteed. vapply() is a safer option, but it comes with an extra argument that you have to specify. Its signature is vapply(X, FUN, FUN.VALUE, ..., USE.NAMES = TRUE), where X, FUN and FUN.VALUE are compulsory.
vapply(doggies, nchar, numeric(1))
## Jodi Loki Ru Pip Bear
## 4 4 2 3 4
FUN.VALUE is the template for the return value. numeric(1) means that you want a numeric value of length 1 to be returned for each element.
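The safety comes from the fact that vapply() stops with an error when a result does not match the template, instead of quietly returning a different type. A sketch of both cases follows. The tryCatch() wrapper and the message text are only there so the example runs to the end.

```r
doggies <- c("Jodi", "Loki", "Ru", "Pip", "Bear")

# nchar() returns integers, so integer(1) is also a valid template
name_lengths <- vapply(doggies, nchar, integer(1))
name_lengths
## Jodi Loki Ru Pip Bear
## 4 4 2 3 4

# Splitting a name gives several letters, which does not fit character(1),
# so vapply() raises an error; tryCatch() catches it and returns our message
outcome <- tryCatch(
  vapply(doggies, function(name) strsplit(name, "")[[1]], character(1)),
  error = function(e) "vapply stopped with an error"
)
outcome
## [1] "vapply stopped with an error"
```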
Recap
Let’s recap. We have three functions in the apply family:
- lapply(): lapply(list/vector, function to be applied)
  - Result is returned as a list
- sapply(): sapply(list/vector, function to be applied, USE.NAMES = TRUE)
  - Result is simplified into a vector or array, but it can’t simplify everything
  - By default, it returns a named result; this can be suppressed by specifying USE.NAMES = FALSE
- vapply(): vapply(list/vector, function to be applied, template for result, USE.NAMES = TRUE)
  - Need to specify a template for the result
  - A safer option for simplifying the result in comparison to sapply()