2 Subsetting
2.1 Data types
Q: Fix each of the following common data frame subsetting errors:
mtcars[mtcars$cyl = 4, ] # = -> == mtcars[-1:4, ] # -1:4 -> -(1:4) mtcars[mtcars$cyl <= 5] # "," is missing mtcars[mtcars$cyl == 4 | 6, ] # 6 -> mtcars$cyl == 6Q: Why does
x <- 1:5; x[NA]yield five missing values? (Hint: why is it different fromx[NA_real_]?)
A:NAis of class logical, sox[NA]becomes recycled tox[NA, NA, NA, NA, NA]. Since subsetting an atomic withNAleads to anNA, you will get 5 of them returned.Q: What does
upper.tri()return? How does subsetting a matrix with it work? Do we need any additional subsetting rules to describe its behaviour?x <- outer(1:5, 1:5, FUN = "*") x[upper.tri(x)]A:
upper.tri()has really intuitive source code. It coerces it’s input to a matrix and returns a logical matrix. Hence describing it’s behaviour for the use of subsetting is based on everything that applies to subsetting with logical matrices.Q: Why does
mtcars[1:20]return an error? How does it differ from the similarmtcars[1:20, ]?
A: In the first casemtcaris subsetted with a vector and the statement should return a data.frame of the first 20 columns inmtcars. Sincemtcarsonly has 11 columns, the index is out of bounds, which explains the error. The biggest difference ofmtcars[1:20, ]to the former case, is that nowmtcarsis subsetted with two vectors. In this case you will get returned the first 20 rows and all columns (like subsetting a matrix).Q: Implement your own function that extracts the diagonal entries from a matrix (it should behave like
diag(x)wherexis a matrix).
A: First we copy the relevant part of the source code from thediag()function:function (x = 1, nrow, ncol){ if (is.matrix(x)) { if (nargs() > 1L) # this and the next line will be dropped stop("'nrow' or 'ncol' cannot be specified when 'x' is a matrix") if ((m <- min(dim(x))) == 0L) return(vector(typeof(x), 0L)) y <- x[1 + 0L:(m - 1L) * (dim(x)[1L] + 1)] nms <- dimnames(x) if (is.list(nms) && !any(sapply(nms, is.null)) && identical((nm <- nms[[1L]][seq_len(m)]), nms[[2L]][seq_len(m)])) names(y) <- nm return(y) } }In the next step we drop the unncessary
nrowandncolargument and the related code in the 3rd and 4th line:diag_v <- function (x = 1) { if (is.matrix(x)) { if ((m <- min(dim(x))) == 0L) return(vector(typeof(x), 0L)) y <- x[1 + 0L:(m - 1L) * (dim(x)[1L] + 1)] # subsetting with a vector nms <- dimnames(x) if (is.list(nms) && !any(sapply(nms, is.null)) && identical((nm <- nms[[1L]][seq_len(m)]), nms[[2L]][seq_len(m)])) names(y) <- nm return(y) } }If we look for the idea to capture the diagonal elements, we can see that the input matrix is subsetted with a vector, so we called this function
diag_v(). Of course we can implement our own functiondiag_m(), where we subset with a matrix.diag_m <- function (x = 1) { if (is.matrix(x)) { if ((m <- min(dim(x))) == 0L) return(vector(typeof(x), 0L)) y <- x[matrix(c(1:m,1:m), m)] # subsetting with a matrix nms <- dimnames(x) if (is.list(nms) && !any(sapply(nms, is.null)) && identical((nm <- nms[[1L]][seq_len(m)]), nms[[2L]][seq_len(m)])) names(y) <- nm return(y) } }Now we can check if we get the same results as the original function and also compare the speed. Therefore we convert the relatively large
diamondsdataset from theggplot2package into a matrix.diamonds_m <- as.matrix(ggplot2::diamonds) stopifnot(all.equal(diag(diamonds_m), diag_v(diamonds_m), diag_m(diamonds_m))) # our functions succeed the little test microbenchmark::microbenchmark( diag(diamonds_m), diag_v(diamonds_m), diag_m(diamonds_m) ) #> Unit: microseconds #> expr min lq mean median uq max neval #> diag(diamonds_m) 4.851 5.2590 5.70113 5.7315 5.932 10.378 100 #> diag_v(diamonds_m) 12.666 13.1455 13.31255 13.2730 13.403 14.948 100 #> diag_m(diamonds_m) 14.680 15.1415 94.71131 15.2890 15.472 7935.469 100The original function seems to be a little bit faster than the trimmed and our matrix version. Maybe this is due to compiling issues
diag_c <- compiler::cmpfun(diag) diag_v_c <- compiler::cmpfun(diag_v) diag_m_c <- compiler::cmpfun(diag_m) # Now we can make a fair comparison of the speed: microbenchmark::microbenchmark( diag_c(diamonds_m), diag_v_c(diamonds_m), diag_m_c(diamonds_m) ) #> Unit: microseconds #> expr min lq mean median uq max neval #> diag_c(diamonds_m) 4.743 5.005 5.59129 5.7285 5.8465 12.881 100 #> diag_v_c(diamonds_m) 12.616 12.880 13.09981 13.0080 13.1970 18.044 100 #> diag_m_c(diamonds_m) 14.559 14.984 15.67262 15.1190 15.2835 66.300 100We can see that our diag_m version is only a little bit slower than the original version. However the source code of the matrix version could be a bit easier to read.
We could also take an idea from the source code of
upper.tri()and subset with a logical vector (but it turns out to be really slow):diag_lv <- function (x = 1) { if (is.matrix(x)) { if ((m <- min(dim(x))) == 0L) return(vector(typeof(x), 0L)) y <- x[row(x) == col(x)] nms <- dimnames(x) if (is.list(nms) && !any(sapply(nms, is.null)) && identical((nm <- nms[[1L]][seq_len(m)]), nms[[2L]][seq_len(m)])) names(y) <- nm return(y) } }compile it and compare it with the other versions
diag_lv_c <- compiler::cmpfun(diag_lv) stopifnot(all.equal(diag(diamonds_m), diag_v_c(diamonds_m), diag_m_c(diamonds_m), diag_lv_c(diamonds_m))) # our functions succeed the little test microbenchmark::microbenchmark( diag(diamonds_m), diag_v_c(diamonds_m), diag_m_c(diamonds_m), diag_lv_c(diamonds_m) ) #> Unit: microseconds #> expr min lq mean median uq #> diag(diamonds_m) 5.015 7.2075 8.71495 7.9880 9.9685 #> diag_v_c(diamonds_m) 13.665 16.3980 18.67010 17.0185 18.9435 #> diag_m_c(diamonds_m) 16.911 18.8905 21.05842 19.5075 22.9010 #> diag_lv_c(diamonds_m) 1648.256 2077.4310 2712.09951 2102.5180 3653.8960 #> max neval #> 19.462 100 #> 60.965 100 #> 41.901 100 #> 5030.135 100Q: What does
df[is.na(df)] <- 0do? How does it work?
A: It replaces allNAs withindfwith the value0.is.na(df)returns a logical matrix which is used to subset df. Since you can combine subsetting and assignment, only the matched part ofdf(theNAs) is replaced with0entries.
2.2 Subsetting operators
Q: Given a linear model, e.g.,
mod <- lm(mpg ~ wt, data = mtcars), extract the residual degrees of freedom. Extract the R squared from the model summary (summary(mod))
A: Sincemodis of type list we can expect several possibilities:mod$df.residual # preserving output mod$df.r # preserving output with partial matching mod["df.residual"] # list output (without partial matching) mod[["df.residual"]] # preserving output (without partial matching)The same states for
summary(mod), so we can use for example:summary(mod)$r.squared(To get tidy output from r-models in general also the
broompackage is a good alternative).
2.3 Applications
Q: How would you randomly permute the columns of a data frame? (This is an important technique in random forests.) Can you simultaneously permute the rows and columns in one step?
A: Combine`[`with thesample()function:iris[sample(ncol(iris))] # permute rows iris[sample(nrow(iris)), sample(ncol(iris)), drop = FALSE] # permute both at the same timeQ: How would you select a random sample of
mrows from a data frame? What if the sample had to be contiguous (i.e., with an initial row, a final row, and every row in between)?
A: For examplem=10 iris[sample(nrow(iris), m),] # Blockversion start <- sample(nrow(iris) - m + 1, 1) end <- start + m - 1 iris[start:end, , drop = FALSE]Q: How could you put the columns in a data frame in alphabetical order?
A: We can sort the names and subset by name:iris[sort(names(iris))]