2 Subsetting
2.1 Data types
Q: Fix each of the following common data frame subsetting errors:
mtcars[mtcars$cyl = 4, ] # = -> == mtcars[-1:4, ] # -1:4 -> -(1:4) mtcars[mtcars$cyl <= 5] # "," is missing mtcars[mtcars$cyl == 4 | 6, ] # 6 -> mtcars$cyl == 6
Q: Why does
x <- 1:5; x[NA]
yield five missing values? (Hint: why is it different fromx[NA_real_]
?)
A:NA
is of class logical, sox[NA]
becomes recycled tox[NA, NA, NA, NA, NA]
. Since subsetting an atomic withNA
leads to anNA
, you will get 5 of them returned.Q: What does
upper.tri()
return? How does subsetting a matrix with it work? Do we need any additional subsetting rules to describe its behaviour?x <- outer(1:5, 1:5, FUN = "*") x[upper.tri(x)]
A:
upper.tri()
has really intuitive source code. It coerces it’s input to a matrix and returns a logical matrix. Hence describing it’s behaviour for the use of subsetting is based on everything that applies to subsetting with logical matrices.Q: Why does
mtcars[1:20]
return an error? How does it differ from the similarmtcars[1:20, ]
?
A: In the first casemtcar
is subsetted with a vector and the statement should return a data.frame of the first 20 columns inmtcars
. Sincemtcars
only has 11 columns, the index is out of bounds, which explains the error. The biggest difference ofmtcars[1:20, ]
to the former case, is that nowmtcars
is subsetted with two vectors. In this case you will get returned the first 20 rows and all columns (like subsetting a matrix).Q: Implement your own function that extracts the diagonal entries from a matrix (it should behave like
diag(x)
wherex
is a matrix).
A: First we copy the relevant part of the source code from thediag()
function:function (x = 1, nrow, ncol){ if (is.matrix(x)) { if (nargs() > 1L) # this and the next line will be dropped stop("'nrow' or 'ncol' cannot be specified when 'x' is a matrix") if ((m <- min(dim(x))) == 0L) return(vector(typeof(x), 0L)) y <- x[1 + 0L:(m - 1L) * (dim(x)[1L] + 1)] nms <- dimnames(x) if (is.list(nms) && !any(sapply(nms, is.null)) && identical((nm <- nms[[1L]][seq_len(m)]), nms[[2L]][seq_len(m)])) names(y) <- nm return(y) } }
In the next step we drop the unncessary
nrow
andncol
argument and the related code in the 3rd and 4th line:diag_v <- function (x = 1) { if (is.matrix(x)) { if ((m <- min(dim(x))) == 0L) return(vector(typeof(x), 0L)) y <- x[1 + 0L:(m - 1L) * (dim(x)[1L] + 1)] # subsetting with a vector nms <- dimnames(x) if (is.list(nms) && !any(sapply(nms, is.null)) && identical((nm <- nms[[1L]][seq_len(m)]), nms[[2L]][seq_len(m)])) names(y) <- nm return(y) } }
If we look for the idea to capture the diagonal elements, we can see that the input matrix is subsetted with a vector, so we called this function
diag_v()
. Of course we can implement our own functiondiag_m()
, where we subset with a matrix.diag_m <- function (x = 1) { if (is.matrix(x)) { if ((m <- min(dim(x))) == 0L) return(vector(typeof(x), 0L)) y <- x[matrix(c(1:m,1:m), m)] # subsetting with a matrix nms <- dimnames(x) if (is.list(nms) && !any(sapply(nms, is.null)) && identical((nm <- nms[[1L]][seq_len(m)]), nms[[2L]][seq_len(m)])) names(y) <- nm return(y) } }
Now we can check if we get the same results as the original function and also compare the speed. Therefore we convert the relatively large
diamonds
dataset from theggplot2
package into a matrix.diamonds_m <- as.matrix(ggplot2::diamonds) stopifnot(all.equal(diag(diamonds_m), diag_v(diamonds_m), diag_m(diamonds_m))) # our functions succeed the little test microbenchmark::microbenchmark( diag(diamonds_m), diag_v(diamonds_m), diag_m(diamonds_m) ) #> Unit: microseconds #> expr min lq mean median uq max neval #> diag(diamonds_m) 4.851 5.2590 5.70113 5.7315 5.932 10.378 100 #> diag_v(diamonds_m) 12.666 13.1455 13.31255 13.2730 13.403 14.948 100 #> diag_m(diamonds_m) 14.680 15.1415 94.71131 15.2890 15.472 7935.469 100
The original function seems to be a little bit faster than the trimmed and our matrix version. Maybe this is due to compiling issues
diag_c <- compiler::cmpfun(diag) diag_v_c <- compiler::cmpfun(diag_v) diag_m_c <- compiler::cmpfun(diag_m) # Now we can make a fair comparison of the speed: microbenchmark::microbenchmark( diag_c(diamonds_m), diag_v_c(diamonds_m), diag_m_c(diamonds_m) ) #> Unit: microseconds #> expr min lq mean median uq max neval #> diag_c(diamonds_m) 4.743 5.005 5.59129 5.7285 5.8465 12.881 100 #> diag_v_c(diamonds_m) 12.616 12.880 13.09981 13.0080 13.1970 18.044 100 #> diag_m_c(diamonds_m) 14.559 14.984 15.67262 15.1190 15.2835 66.300 100
We can see that our diag_m version is only a little bit slower than the original version. However the source code of the matrix version could be a bit easier to read.
We could also take an idea from the source code of
upper.tri()
and subset with a logical vector (but it turns out to be really slow):diag_lv <- function (x = 1) { if (is.matrix(x)) { if ((m <- min(dim(x))) == 0L) return(vector(typeof(x), 0L)) y <- x[row(x) == col(x)] nms <- dimnames(x) if (is.list(nms) && !any(sapply(nms, is.null)) && identical((nm <- nms[[1L]][seq_len(m)]), nms[[2L]][seq_len(m)])) names(y) <- nm return(y) } }
compile it and compare it with the other versions
diag_lv_c <- compiler::cmpfun(diag_lv) stopifnot(all.equal(diag(diamonds_m), diag_v_c(diamonds_m), diag_m_c(diamonds_m), diag_lv_c(diamonds_m))) # our functions succeed the little test microbenchmark::microbenchmark( diag(diamonds_m), diag_v_c(diamonds_m), diag_m_c(diamonds_m), diag_lv_c(diamonds_m) ) #> Unit: microseconds #> expr min lq mean median uq #> diag(diamonds_m) 5.015 7.2075 8.71495 7.9880 9.9685 #> diag_v_c(diamonds_m) 13.665 16.3980 18.67010 17.0185 18.9435 #> diag_m_c(diamonds_m) 16.911 18.8905 21.05842 19.5075 22.9010 #> diag_lv_c(diamonds_m) 1648.256 2077.4310 2712.09951 2102.5180 3653.8960 #> max neval #> 19.462 100 #> 60.965 100 #> 41.901 100 #> 5030.135 100
Q: What does
df[is.na(df)] <- 0
do? How does it work?
A: It replaces allNA
s withindf
with the value0
.is.na(df)
returns a logical matrix which is used to subset df. Since you can combine subsetting and assignment, only the matched part ofdf
(theNA
s) is replaced with0
entries.
2.2 Subsetting operators
Q: Given a linear model, e.g.,
mod <- lm(mpg ~ wt, data = mtcars)
, extract the residual degrees of freedom. Extract the R squared from the model summary (summary(mod)
)
A: Sincemod
is of type list we can expect several possibilities:mod$df.residual # preserving output mod$df.r # preserving output with partial matching mod["df.residual"] # list output (without partial matching) mod[["df.residual"]] # preserving output (without partial matching)
The same states for
summary(mod)
, so we can use for example:summary(mod)$r.squared
(To get tidy output from r-models in general also the
broom
package is a good alternative).
2.3 Applications
Q: How would you randomly permute the columns of a data frame? (This is an important technique in random forests.) Can you simultaneously permute the rows and columns in one step?
A: Combine`[`
with thesample()
function:iris[sample(ncol(iris))] # permute rows iris[sample(nrow(iris)), sample(ncol(iris)), drop = FALSE] # permute both at the same time
Q: How would you select a random sample of
m
rows from a data frame? What if the sample had to be contiguous (i.e., with an initial row, a final row, and every row in between)?
A: For examplem=10 iris[sample(nrow(iris), m),] # Blockversion start <- sample(nrow(iris) - m + 1, 1) end <- start + m - 1 iris[start:end, , drop = FALSE]
Q: How could you put the columns in a data frame in alphabetical order?
A: We can sort the names and subset by name:iris[sort(names(iris))]