11 Loops

Loops are R’s method for repeating a task, which makes them a useful tool for programming simulations. This chapter will teach you how to use R’s loop tools.

Let’s use the score function to solve a real-world problem.

Your slot machine is modeled after real machines that were accused of fraud. The machines appeared to pay out 40 cents on the dollar, but the manufacturer claimed that they paid out 92 cents on the dollar. You can calculate the exact payout rate of your machine with the score program. The payout rate will be the expected value of the slot machine’s prize.

11.1 Expected Values

The expected value of a random event is a type of weighted average; it is the sum of each possible outcome of the event, weighted by the probability that each outcome occurs:

\[ E(x) = \sum_{i = 1}^{n}\left( x_{i} \cdot P(x_{i}) \right) \]

You can think of the expected value as the average prize that you would observe if you played the slot machine an infinite number of times. Let’s use the formula to calculate some simple expected values. Then we will apply the formula to your slot machine.

Do you remember the die you created in Project 1: Weighted Dice?

die <- c(1, 2, 3, 4, 5, 6)

Each time you roll the die, it returns a value selected at random (one through six). You can find the expected value of rolling the die with the formula:

\[ E(\text{die}) = \sum_{i = 1}^{n}\left( \text{die}_{i} \cdot P(\text{die}_{i}) \right) \]

The $\text{die}_{i}$s are the possible outcomes of rolling the die: 1, 2, 3, 4, 5, and 6; and the $P(\text{die}_{i})$’s are the probabilities associated with each of the outcomes. If your die is fair, each outcome will occur with the same probability: 1/6. So our equation simplifies to:

\[ \begin{array}{rl} E(\text{die}) & = \sum_{i = 1}^{n}\left( \text{die}_{i} \cdot P(\text{die}_{i}) \right)\\ & = 1 \cdot \frac{1}{6} + 2 \cdot \frac{1}{6} + 3 \cdot \frac{1}{6} + 4 \cdot \frac{1}{6} + 5 \cdot \frac{1}{6} + 6 \cdot \frac{1}{6}\\ & = 3.5\\ \end{array} \]

Hence, the expected value of rolling a fair die is 3.5. You may notice that this is also the average value of the die. The expected value will equal the average if every outcome has the same chance of occurring.

But what if each outcome has a different chance of occurring? For example, we weighted our dice in Packages and Help Pages so that each die rolled 1, 2, 3, 4, and 5 with probability 1/8 and 6 with probability 3/8. You can use the same formula to calculate the expected value in these conditions:

\[ \begin{array}{rl} E(die) & = 1 \cdot \frac{1}{8} + 2 \cdot \frac{1}{8} + 3 \cdot \frac{1}{8} + 4 \cdot \frac{1}{8} + 5 \cdot \frac{1}{8} + 6 \cdot \frac{3}{8}\\ & = 4.125\\ \end{array} \]

Hence, the expected value of a loaded die does not equal the average value of its outcomes. If you rolled a loaded die an infinite number of times, the average outcome would be 4.125, which is higher than what you would expect from a fair die.

Notice that we did the same three things to calculate both of these expected values. We have:

Listed out all of the possible outcomes
Determined the value of each outcome (here just the value of the die)
Calculated the probability that each outcome occurred

The expected value was then just the sum of the values in step 2 multiplied by the probabilities in step 3.

You can use these steps to calculate more sophisticated expected values. For example, you could calculate the expected value of rolling a pair of weighted dice. Let’s do this step by step.

First, list out all of the possible outcomes. A total of 36 different outcomes can appear when you roll two dice. For example, you might roll (1, 1), which notates one on the first die and one on the second die. Or, you may roll (1, 2), one on the first die and two on the second. And so on. Listing out these combinations can be tedious, but R has a function that can help.

11.2 expand.grid

The expand.grid function in R provides a quick way to write out every combination of the elements in n vectors. For example, you can list every combination of two dice. To do so, run expand.grid on two copies of die:

rolls <- expand.grid(die, die)

expand.grid will return a data frame that contains every way to pair an element from the first die vector with an element from the second die vector. This will capture all 36 possible combinations of values:

rolls
##    Var1 Var2
## 1     1    1
## 2     2    1
## 3     3    1
## ...
## 34    4    6
## 35    5    6
## 36    6    6

You can use expand.grid with more than two vectors if you like. For example, you could list every combination of rolling three dice with expand.grid(die, die, die) and every combination of rolling four dice with expand.grid(die, die, die, die), and so on. expand.grid will always return a data frame that contains each possible combination of n elements from the n vectors. Each combination will contain exactly one element from each vector.

You can determine the value of each roll once you’ve made your list of outcomes. This will be the sum of the two dice, which you can calculate using R’s element-wise execution:

rolls$value <- rolls$Var1 + rolls$Var2
head(rolls, 3)
## Var1 Var2 value
##    1    1     2
##    2    1     3
##    3    1     4

R will match up the elements in each vector before adding them together. As a result, each element of value will refer to the elements of Var1 and Var2 that appear in the same row.

Next, you must determine the probability that each combination appears. You can calculate this with a basic rule of probability:

The probability that n independent, random events all occur is equal to the product of the probabilities that each random event occurs.

Or more succinctly:

\[ P(A \& B \& C \& ...) = P(A) \cdot P(B) \cdot P(C) \cdot ... \]

So the probability that we roll a (1, 1) will be equal to the probability that we roll a one on the first die, 1/8, times the probability that we roll a one on the second die, 1/8:

\[ \begin{array}{rl} P(1 \& 1) & = P(1) \cdot P(1) \\ & = \frac{1}{8} \cdot \frac{1}{8}\\ & = \frac{1}{64} \end{array} \]

And the probability that we roll a (1, 2) will be:

\[ \begin{array}{rl} P(1 \& 2) & = P(1) \cdot P(2) \\ & = \frac{1}{8} \cdot \frac{1}{8}\\ & = \frac{1}{64} \end{array} \]

And so on.

Let me suggest a three-step process for calculating these probabilities in R. First, we can look up the probabilities of rolling the values in Var1. We’ll do this with the lookup table that follows:

prob <- c("1" = 1/8, "2" = 1/8, "3" = 1/8, "4" = 1/8, "5" = 1/8, "6" = 3/8)

prob
##     1     2     3     4     5     6 
## 0.125 0.125 0.125 0.125 0.125 0.375

If you subset this table by rolls$Var1, you will get a vector of probabilities perfectly keyed to the values of Var1:

rolls$Var1
## 1 2 3 4 5 6 1 2 3 4 5 6 1 2 3 4 5 6 1 2 3 4 5 6 1 2 3 4 5 6 1 2 3 4 5 6

prob[rolls$Var1]
##     1     2     3     4     5     6     1     2     3     4     5     6 
## 0.125 0.125 0.125 0.125 0.125 0.375 0.125 0.125 0.125 0.125 0.125 0.375 
##     1     2     3     4     5     6     1     2     3     4     5     6 
## 0.125 0.125 0.125 0.125 0.125 0.375 0.125 0.125 0.125 0.125 0.125 0.375 
##     1     2     3     4     5     6     1     2     3     4     5     6 
## 0.125 0.125 0.125 0.125 0.125 0.375 0.125 0.125 0.125 0.125 0.125 0.375 

rolls$prob1 <- prob[rolls$Var1]
head(rolls, 3)
## Var1 Var2 value prob1
##    1    1     2 0.125
##    2    1     3 0.125
##    3    1     4 0.125

Second, we can look up the probabilities of rolling the values in Var2:

rolls$prob2 <- prob[rolls$Var2]

head(rolls, 3)
## Var1 Var2 value prob1 prob2
##    1    1     2 0.125 0.125
##    2    1     3 0.125 0.125
##    3    1     4 0.125 0.125

Third, we can calculate the probability of rolling each combination by multiplying prob1 by prob2:

rolls$prob <- rolls$prob1 * rolls$prob2

head(rolls, 3)
## Var1 Var2 value prob1 prob2     prob
##    1    1     2 0.125 0.125 0.015625
##    2    1     3 0.125 0.125 0.015625
##    3    1     4 0.125 0.125 0.015625

It is easy to calculate the expected value now that we have each outcome, the value of each outcome, and the probability of each outcome. The expected value will be the summation of the dice values multiplied by the dice probabilities:

sum(rolls$value * rolls$prob)
## 8.25

So the expected value of rolling two loaded dice is 8.25. If you rolled a pair of loaded dice an infinite number of times, the average sum would be 8.25. (If you are curious, the expected value of rolling a pair of fair dice is 7, which explains why 7 plays such a large role in dice games like craps.)

Now that you’ve warmed up, let’s use our method to calculate the expected value of the slot machine prize. We will follow the same steps we just took:

We will list out every possible outcome of playing the machine. This will be a list of every combination of three slot symbols.
We will calculate the probability of getting each combination when you play the machine.
We will determine the prize that we would win for each combination.

When we are finished, we will have a data set that looks like this:

## Var1 Var2 Var3 prob1 prob2 prob3     prob prize
##   DD   DD   DD  0.03  0.03  0.03 0.000027   800
##    7   DD   DD  0.03  0.03  0.03 0.000027     0
##  BBB   DD   DD  0.06  0.03  0.03 0.000054     0
## ... and so on.

The expected value will then be the sum of the prizes multiplied by their probability of occuring:

\[ E(\text{prize}) = \sum_{i = 1}^{n}\left( \text{prize}_{i} \cdot P(\text{prize}_{i}) \right) \]

Ready to begin?

Exercise 11.1 (List the Combinations) Use expand.grid to make a data frame that contains every possible combination of three symbols from the wheel vector:

wheel <- c("DD", "7", "BBB", "BB", "B", "C", "0")

Be sure to add the argument stringsAsFactors = FALSE to your expand.grid call; otherwise, expand.grid will save the combinations as factors, an unfortunate choice that will disrupt the score function.

Solution. To create a data frame of each combination of three symbols, you need to run expand.grid and give it three copies of wheel. The result will be a data frame with 343 rows, one for each unique combination of three slot symbols:

combos <- expand.grid(wheel, wheel, wheel, stringsAsFactors = FALSE)

combos
##   Var1 Var2 Var3
## 1   DD   DD   DD
## 2    7   DD   DD
## 3  BBB   DD   DD
## 4   BB   DD   DD
## 5    B   DD   DD
## 6    C   DD   DD
## ...
## 341    B    0    0
## 342    C    0    0
## 343    0    0    0

Now, let’s calculate the probability of getting each combination. You can use the probabilities contained in the prob argument of get_symbols to do this. These probabilities determine how frequently each symbol is chosen when your slot machine generates symbols. They were calculated after observing 345 plays of the Manitoba video lottery terminals. Zeroes have the largest chance of being selected (0.52) and cherries the least (0.01):

get_symbols <- function() {
  wheel <- c("DD", "7", "BBB", "BB", "B", "C", "0")
  sample(wheel, size = 3, replace = TRUE, 
    prob = c(0.03, 0.03, 0.06, 0.1, 0.25, 0.01, 0.52)
}

Exercise 11.2 (Make a Lookup Table) Isolate the previous probabilities in a lookup table. What names will you use in your table?

Solution. Your names should match the input that you want to look up. In this case, the input will be the character strings that appear in Var1, Var2, and Var3. So your lookup table should look like this:

prob <- c("DD" = 0.03, "7" = 0.03, "BBB" = 0.06, 
  "BB" = 0.1, "B" = 0.25, "C" = 0.01, "0" = 0.52)

Now let’s look up our probabilities.

Exercise 11.3 (Lookup the Probabilities) Look up the probabilities of getting the values in Var1. Then add them to combos as a column named prob1. Then do the same for Var2 (prob2) and Var3 (prob3).

Solution. Remember that you use R’s selection notation to look up values in a lookup table. The values that result will be keyed to the index that you use:

combos$prob1 <- prob[combos$Var1]
combos$prob2 <- prob[combos$Var2]
combos$prob3 <- prob[combos$Var3]

head(combos, 3)
## Var1 Var2 Var3 prob1 prob2 prob3
##   DD   DD   DD  0.03  0.03  0.03
##    7   DD   DD  0.03  0.03  0.03
##  BBB   DD   DD  0.06  0.03  0.03

Now how should we calculate the total probability of each combination? Our three slot symbols are all chosen independently, which means that the same rule that governed our dice probabilities governs our symbol probabilities:

\[ P(A \& B \& C \& ...) = P(A) \cdot P(B) \cdot P(C) \cdot ... \]

Exercise 11.4 (Calculate Probabilities for Each Combination) Calculate the overall probabilities for each combination. Save them as a column named prob in combos, then check your work.

You can check that the math worked by summing the probabilities. The probabilities should add up to one, because one of the combinations must appear when you play the slot machine. In other words, a combination will appear, with probability of one.

You can calculate the probabilities of every possible combination in one fell swoop with some element-wise execution:

combos$prob <- combos$prob1 * combos$prob2 * combos$prob3

head(combos, 3)
## Var1 Var2 Var3 prob1 prob2 prob3     prob
##   DD   DD   DD  0.03  0.03  0.03 0.000027
##    7   DD   DD  0.03  0.03  0.03 0.000027
##  BBB   DD   DD  0.06  0.03  0.03 0.000054

The sum of the probabilities is one, which suggests that our math is correct:

sum(combos$prob)
## 1

You only need to do one more thing before you can calculate the expected value: you must determine the prize for each combination in combos. You can calculate the prize with score. For example, we can calculate the prize for the first row of combos like this:

symbols <- c(combos[1, 1], combos[1, 2], combos[1, 3])
## "DD" "DD" "DD"

score(symbols)
## 800

However there are 343 rows, which makes for tedious work if you plan to calculate the scores manually. It will be quicker to automate this task and have R do it for you, which you can do with a for loop.

11.3 for Loops

A for loop repeats a chunk of code many times, once for each element in a set of input. for loops provide a way to tell R, “Do this for every value of that.” In R syntax, this looks like:

for (value in that) {
  this
}

The that object should be a set of objects (often a vector of numbers or character strings). The for loop will run the code in that appears between the braces once for each member of that. For example, the for loop below runs print("one run") once for each element in a vector of character strings:

for (value in c("My", "first", "for", "loop")) {
  print("one run")
}
## "one run"
## "one run"
## "one run"
## "one run"

The value symbol in a for loop acts like an argument in a function. The for loop will create an object named value and assign it a new value on each run of the loop. The code in your loop can access this value by calling the value object.

What values will the for loop assign to value? It will use the elements in the set that you run the loop on. for starts with the first element and then assigns a different element to value on each run of the for loop, until all of the elements have been assigned to value. For example, the for loop below will run print(value) four times and will print out one element of c("My", "second", "for", "loop") each time:

for (value in c("My", "second", "for", "loop")) {
  print(value)
}
## "My"
## "second"
## "for"
## "loop"

On the first run, the for loop substituted "My" for value in print(value). On the second run it substituted "second", and so on until for had run print(value) once with every element in the set:

If you look at value after the loop runs, you will see that it still contains the value of the last element in the set:

value
## "loop"

I’ve been using the symbol value in my for loops, but there is nothing special about it. You can use any symbol you like in your loop to do the same thing as long as the symbol appears before in in the parentheses that follow for. For example, you could rewrite the previous loop with any of the following:

for (word in c("My", "second", "for", "loop")) {
  print(word)
}
for (string in c("My", "second", "for", "loop")) {
  print(string)
}
for (i in c("My", "second", "for", "loop")) {
  print(i)
}

Choose your symbols carefully

R will run your loop in whichever environment you call it from. This is bad news if your loop uses object names that already exist in the environment. Your loop will overwrite the existing objects with the objects that it creates. This applies to the value symbol as well.

For loops run on sets

In many programming languages, for loops are designed to work with integers, not sets. You give the loop a starting value and an ending value, as well as an increment to advance the value by between loops. The for loop then runs until the loop value exceeds the ending value.

You can recreate this effect in R by having a for loop execute on a set of integers, but don’t lose track of the fact that R’s for loops execute on members of a set, not sequences of integers.

for loops are very useful in programming because they help you connect a piece of code with each element in a set. For example, we could use a for loop to run score once for each row in combos. However, R’s for loops have a shortcoming that you’ll want to know about before you start using them: for loops do not return output.

for loops are like Las Vegas: what happens in a for loop stays in a for loop. If you want to use the products of a for loop, you must write the for loop so that it saves its own output as it goes.

Our previous examples appeared to return output, but this was misleading. The examples worked because we called print, which always prints its arguments in the console (even if it is called from a function, a for loop, or anything else). Our for loops won’t return anything if you remove the print call:

for (value in c("My", "third", "for", "loop")) {
  value
}
##

To save output from a for loop, you must write the loop so that it saves its own output as it runs. You can do this by creating an empty vector or list before you run the for loop. Then use the for loop to fill up the vector or list. When the for loop is finished, you’ll be able to access the vector or list, which will now have all of your results.

Let’s see this in action. The following code creates an empty vector of length 4:

chars <- vector(length = 4)

The next loop will fill it with strings:

words <- c("My", "fourth", "for", "loop")

for (i in 1:4) {
  chars[i] <- words[i]
}

chars
## "My"    "fourth" "for"   "loop"

This approach will usually require you to change the sets that you execute your for loop on. Instead of executing on a set of objects, execute on a set of integers that you can use to index both your object and your storage vector. This approach is very common in R. You’ll find in practice that you use for loops not so much to run code, but to fill up vectors and lists with the results of code.

Let’s use a for loop to calculate the prize for each row in combos. To begin, create a new column in combos to store the results of the for loop:

combos$prize <- NA

head(combos, 3)
##  Var1 Var2 Var3 prob1 prob2 prob3     prob prize
##    DD   DD   DD  0.03  0.03  0.03 0.000027    NA
##     7   DD   DD  0.03  0.03  0.03 0.000027    NA
##   BBB   DD   DD  0.06  0.03  0.03 0.000054    NA

The code creates a new column named prize and fills it with NAs. R uses its recycling rules to populate every value of the column with NA.

Exercise 11.5 (Build a Loop) Construct a for loop that will run score on all 343 rows of combos. The loop should run score on the first three entries of the _i_th row of combos and should store the results in the _i_th entry of combos$prize.

Solution. You can score the rows in combos with:

for (i in 1:nrow(combos)) {
  symbols <- c(combos[i, 1], combos[i, 2], combos[i, 3])
  combos$prize[i] <- score(symbols)
}

After you run the for loop, combos$prize will contain the correct prize for each row. This exercise also tests the score function; score appears to work correctly for every possible slot combination:

head(combos, 3)
## Var1 Var2 Var3 prob1 prob2 prob3     prob prize
##   DD   DD   DD  0.03  0.03  0.03 0.000027   800
##    7   DD   DD  0.03  0.03  0.03 0.000027     0
##  BBB   DD   DD  0.06  0.03  0.03 0.000054     0

We’re now ready to calculate the expected value of the prize. The expected value is the sum of combos$prize weighted by combos$prob. This is also the payout rate of the slot machine:

sum(combos$prize * combos$prob)
## 0.538014

Uh oh. The expected prize is about 0.54, which means our slot machine only pays 54 cents on the dollar over the long run. Does this mean that the manufacturer of the Manitoba slot machines was lying?

No, because we ignored an important feature of the slot machine when we wrote score: a diamond is wild. You can treat a DD as any other symbol if it increases your prize, with one exception. You cannot make a DD a C unless you already have another C in your symbols (it’d be too easy if every DD automatically earned you $2).

The best thing about DDs is that their effects are cumulative. For example, consider the combination B, DD, B. Not only does the DD count as a B, which would earn a prize of $10; the DD also doubles the prize to $20.

Adding this behavior to our code is a little tougher than what we have done so far, but it involves all of the same principles. You can decide that your slot machine doesn’t use wilds and keep the code that we have. In that case, your slot machine will have a payout rate of about 54 percent. Or, you could rewrite your code to use wilds. If you do, you will find that your slot machine has a payout rate of 93 percent, one percent higher than the manufacturer’s claim. You can calculate this rate with the same method that we used in this section.

Exercise 11.6 (Challenge) There are many ways to modify score that would count DDs as wild. If you would like to test your skill as an R programmer, try to write your own version of score that correctly handles diamonds.

If you would like a more modest challenge, study the following score code. It accounts for wild diamonds in a way that I find elegant and succinct. See if you can understand each step in the code and how it achieves its result.

Solution. Here is a version of score that handles wild diamonds:

score <- function(symbols) {
  
  diamonds <- sum(symbols == "DD")
  cherries <- sum(symbols == "C")
  
  # identify case
  # since diamonds are wild, only nondiamonds 
  # matter for three of a kind and all bars
  slots <- symbols[symbols != "DD"]
  same <- length(unique(slots)) == 1
  bars <- slots %in% c("B", "BB", "BBB")

  # assign prize
  if (diamonds == 3) {
    prize <- 100
  } else if (same) {
    payouts <- c("7" = 80, "BBB" = 40, "BB" = 25,
      "B" = 10, "C" = 10, "0" = 0)
    prize <- unname(payouts[slots[1]])
  } else if (all(bars)) {
    prize <- 5
  } else if (cherries > 0) {
    # diamonds count as cherries
    # so long as there is one real cherry
    prize <- c(0, 2, 5)[cherries + diamonds + 1]
  } else {
    prize <- 0
  }
  
  # double for each diamond
  prize * 2^diamonds
}

Exercise 11.7 (Calculate the Expected Value) Calculate the expected value of the slot machine when it uses the new score function. You can use the existing combos data frame, but you will need to build a for loop to recalculate combos$prize.

To update the expected value, just update combos$prize:

for (i in 1:nrow(combos)) {
  symbols <- c(combos[i, 1], combos[i, 2], combos[i, 3])
  combos$prize[i] <- score(symbols)
}

Then recompute the expected value:

sum(combos$prize * combos$prob)
## 0.934356

This result vindicates the manufacturer’s claim. If anything, the slot machines seem more generous than the manufacturer stated.

11.4 while Loops

R has two companions to the for loop: the while loop and the repeat loop. A while loop reruns a chunk while a certain condition remains TRUE. To create a while loop, follow while by a condition and a chunk of code, like this:

while (condition) {
  code
}

while will rerun condition, which should be a logical test, at the start of each loop. If condition evaluates to TRUE, while will run the code between its braces. If condition evaluates to FALSE, while will finish the loop.

Why might condition change from TRUE to FALSE? Presumably because the code inside your loop has changed whether the condition is still TRUE. If the code has no relationship to the condition, a while loop will run until you stop it. So be careful. You can stop a while loop by hitting Escape or by clicking on the stop-sign icon at the top of the RStudio console pane. The icon will appear once the loop begins to run.

Like for loops, while loops do not return a result, so you must think about what you want the loop to return and save it to an object during the loop.

You can use while loops to do things that take a varying number of iterations, like calculating how long it takes to go broke playing slots (as follows). However, in practice, while loops are much less common than for loops in R:

plays_till_broke <- function(start_with) {
  cash <- start_with
  n <- 0
  while (cash > 0) {
    cash <- cash - 1 + play()
    n <- n + 1
  }
  n
}

plays_till_broke(100)
## 260

11.5 repeat Loops

repeat loops are even more basic than while loops. They will repeat a chunk of code until you tell them to stop (by hitting Escape) or until they encounter the command break, which will stop the loop.

You can use a repeat loop to recreate plays_till_broke, my function that simulates how long it takes to lose money while playing slots:

plays_till_broke <- function(start_with) {
  cash <- start_with
  n <- 0
  repeat {
    cash <- cash - 1 + play()
    n <- n + 1
    if (cash <= 0) {
      break
    }
  }
  n
}

plays_till_broke(100)
## 237

11.6 Summary

You can repeat tasks in R with for, while, and repeat loops. To use for, give it a chunk of code to run and a set of objects to loop through. for will run the code chunk once for each object. If you wish to save the output of your loop, you can assign it to an object that exists outside of the loop.

Repetition plays an important role in data science. It is the basis for simulation, as well as for estimates of variance and probability. Loops are not the only way to create repetition in R (consider replicate for example), but they are one of the most popular ways.

Unfortunately, loops in R can sometimes be slower than loops in other languages. As a result, R’s loops get a bad rap. This reputation is not entirely deserved, but it does highlight an important issue. Speed is essential to data analysis. When your code runs fast, you can work with bigger data and do more to it before you run out of time or computational power. Speed will teach you how to write fast for loops and fast code in general with R. There, you will learn to write vectorized code, a style of lightning-fast code that takes advantage of all of R’s strengths.