# 2 Subsetting

## 2.1 Data types

**Q**: Fix each of the following common data frame subsetting errors:`mtcars[mtcars$cyl = 4, ] # = -> == mtcars[-1:4, ] # -1:4 -> -(1:4) mtcars[mtcars$cyl <= 5] # "," is missing mtcars[mtcars$cyl == 4 | 6, ] # 6 -> mtcars$cyl == 6`

**Q**: Why does`x <- 1:5; x[NA]`

yield five missing values? (Hint: why is it different from`x[NA_real_]`

?)

**A**:`NA`

is of class logical, so`x[NA]`

becomes recycled to`x[NA, NA, NA, NA, NA]`

. Since subsetting an atomic with`NA`

leads to an`NA`

, you will get 5 of them returned.**Q**: What does`upper.tri()`

return? How does subsetting a matrix with it work? Do we need any additional subsetting rules to describe its behaviour?`x <- outer(1:5, 1:5, FUN = "*") x[upper.tri(x)]`

**A**:`upper.tri()`

has really intuitive source code. It coerces it’s input to a matrix and returns a logical matrix. Hence describing it’s behaviour for the use of subsetting is based on everything that applies to subsetting with logical matrices.**Q**: Why does`mtcars[1:20]`

return an error? How does it differ from the similar`mtcars[1:20, ]`

?

**A**: In the first case`mtcar`

is subsetted with a vector and the statement should return a data.frame of the first 20 columns in`mtcars`

. Since`mtcars`

only has 11 columns, the index is out of bounds, which explains the error. The biggest difference of`mtcars[1:20, ]`

to the former case, is that now`mtcars`

is subsetted with two vectors. In this case you will get returned the first 20 rows and all columns (like subsetting a matrix).**Q**: Implement your own function that extracts the diagonal entries from a matrix (it should behave like`diag(x)`

where`x`

is a matrix).

**A**: First we copy the relevant part of the source code from the`diag()`

function:`function (x = 1, nrow, ncol){ if (is.matrix(x)) { if (nargs() > 1L) # this and the next line will be dropped stop("'nrow' or 'ncol' cannot be specified when 'x' is a matrix") if ((m <- min(dim(x))) == 0L) return(vector(typeof(x), 0L)) y <- x[1 + 0L:(m - 1L) * (dim(x)[1L] + 1)] nms <- dimnames(x) if (is.list(nms) && !any(sapply(nms, is.null)) && identical((nm <- nms[[1L]][seq_len(m)]), nms[[2L]][seq_len(m)])) names(y) <- nm return(y) } }`

In the next step we drop the unncessary

`nrow`

and`ncol`

argument and the related code in the 3rd and 4th line:`diag_v <- function (x = 1) { if (is.matrix(x)) { if ((m <- min(dim(x))) == 0L) return(vector(typeof(x), 0L)) y <- x[1 + 0L:(m - 1L) * (dim(x)[1L] + 1)] # subsetting with a vector nms <- dimnames(x) if (is.list(nms) && !any(sapply(nms, is.null)) && identical((nm <- nms[[1L]][seq_len(m)]), nms[[2L]][seq_len(m)])) names(y) <- nm return(y) } }`

If we look for the idea to capture the diagonal elements, we can see that the input matrix is subsetted with a vector, so we called this function

`diag_v()`

. Of course we can implement our own function`diag_m()`

, where we subset with a matrix.`diag_m <- function (x = 1) { if (is.matrix(x)) { if ((m <- min(dim(x))) == 0L) return(vector(typeof(x), 0L)) y <- x[matrix(c(1:m,1:m), m)] # subsetting with a matrix nms <- dimnames(x) if (is.list(nms) && !any(sapply(nms, is.null)) && identical((nm <- nms[[1L]][seq_len(m)]), nms[[2L]][seq_len(m)])) names(y) <- nm return(y) } }`

Now we can check if we get the same results as the original function and also compare the speed. Therefore we convert the relatively large

`diamonds`

dataset from the`ggplot2`

package into a matrix.`diamonds_m <- as.matrix(ggplot2::diamonds) stopifnot(all.equal(diag(diamonds_m), diag_v(diamonds_m), diag_m(diamonds_m))) # our functions succeed the little test microbenchmark::microbenchmark( diag(diamonds_m), diag_v(diamonds_m), diag_m(diamonds_m) ) #> Unit: microseconds #> expr min lq mean median uq max neval #> diag(diamonds_m) 4.851 5.2590 5.70113 5.7315 5.932 10.378 100 #> diag_v(diamonds_m) 12.666 13.1455 13.31255 13.2730 13.403 14.948 100 #> diag_m(diamonds_m) 14.680 15.1415 94.71131 15.2890 15.472 7935.469 100`

The original function seems to be a little bit faster than the trimmed and our matrix version. Maybe this is due to compiling issues

`diag_c <- compiler::cmpfun(diag) diag_v_c <- compiler::cmpfun(diag_v) diag_m_c <- compiler::cmpfun(diag_m) # Now we can make a fair comparison of the speed: microbenchmark::microbenchmark( diag_c(diamonds_m), diag_v_c(diamonds_m), diag_m_c(diamonds_m) ) #> Unit: microseconds #> expr min lq mean median uq max neval #> diag_c(diamonds_m) 4.743 5.005 5.59129 5.7285 5.8465 12.881 100 #> diag_v_c(diamonds_m) 12.616 12.880 13.09981 13.0080 13.1970 18.044 100 #> diag_m_c(diamonds_m) 14.559 14.984 15.67262 15.1190 15.2835 66.300 100`

We can see that our diag_m version is only a little bit slower than the original version. However the source code of the matrix version could be a bit easier to read.

We could also take an idea from the source code of

`upper.tri()`

and subset with a logical vector (but it turns out to be really slow):`diag_lv <- function (x = 1) { if (is.matrix(x)) { if ((m <- min(dim(x))) == 0L) return(vector(typeof(x), 0L)) y <- x[row(x) == col(x)] nms <- dimnames(x) if (is.list(nms) && !any(sapply(nms, is.null)) && identical((nm <- nms[[1L]][seq_len(m)]), nms[[2L]][seq_len(m)])) names(y) <- nm return(y) } }`

compile it and compare it with the other versions

`diag_lv_c <- compiler::cmpfun(diag_lv) stopifnot(all.equal(diag(diamonds_m), diag_v_c(diamonds_m), diag_m_c(diamonds_m), diag_lv_c(diamonds_m))) # our functions succeed the little test microbenchmark::microbenchmark( diag(diamonds_m), diag_v_c(diamonds_m), diag_m_c(diamonds_m), diag_lv_c(diamonds_m) ) #> Unit: microseconds #> expr min lq mean median uq #> diag(diamonds_m) 5.015 7.2075 8.71495 7.9880 9.9685 #> diag_v_c(diamonds_m) 13.665 16.3980 18.67010 17.0185 18.9435 #> diag_m_c(diamonds_m) 16.911 18.8905 21.05842 19.5075 22.9010 #> diag_lv_c(diamonds_m) 1648.256 2077.4310 2712.09951 2102.5180 3653.8960 #> max neval #> 19.462 100 #> 60.965 100 #> 41.901 100 #> 5030.135 100`

**Q**: What does`df[is.na(df)] <- 0`

do? How does it work?

**A**: It replaces all`NA`

s within`df`

with the value`0`

.`is.na(df)`

returns a logical matrix which is used to subset df. Since you can combine subsetting and assignment, only the matched part of`df`

(the`NA`

s) is replaced with`0`

entries.

## 2.2 Subsetting operators

**Q**: Given a linear model, e.g.,`mod <- lm(mpg ~ wt, data = mtcars)`

, extract the residual degrees of freedom. Extract the R squared from the model summary (`summary(mod)`

)

**A**: Since`mod`

is of type list we can expect several possibilities:`mod$df.residual # preserving output mod$df.r # preserving output with partial matching mod["df.residual"] # list output (without partial matching) mod[["df.residual"]] # preserving output (without partial matching)`

The same states for

`summary(mod)`

, so we can use for example:`summary(mod)$r.squared`

(To get tidy output from r-models in general also the

`broom`

package is a good alternative).

## 2.3 Applications

**Q**: How would you randomly permute the columns of a data frame? (This is an important technique in random forests.) Can you simultaneously permute the rows and columns in one step?

**A**: Combine``[``

with the`sample()`

function:`iris[sample(ncol(iris))] # permute rows iris[sample(nrow(iris)), sample(ncol(iris)), drop = FALSE] # permute both at the same time`

**Q**: How would you select a random sample of`m`

rows from a data frame? What if the sample had to be contiguous (i.e., with an initial row, a final row, and every row in between)?

**A**: For example`m=10 iris[sample(nrow(iris), m),] # Blockversion start <- sample(nrow(iris) - m + 1, 1) end <- start + m - 1 iris[start:end, , drop = FALSE]`

**Q**: How could you put the columns in a data frame in alphabetical order?

**A**: We can sort the names and subset by name:`iris[sort(names(iris))]`