2 Subsetting

2.1 Data types

  1. Q: Fix each of the following common data frame subsetting errors:

    mtcars[mtcars$cyl = 4, ]       # = -> ==
    mtcars[-1:4, ]                 # -1:4 -> -(1:4)
    mtcars[mtcars$cyl <= 5]        # "," is missing
    mtcars[mtcars$cyl == 4 | 6, ]  # 6 -> mtcars$cyl == 6
  2. Q: Why does x <- 1:5; x[NA] yield five missing values? (Hint: why is it different from x[NA_real_]?)
    A: NA is of class logical, so x[NA] becomes recycled to x[NA, NA, NA, NA, NA]. Since subsetting an atomic with NA leads to an NA, you will get 5 of them returned.

  3. Q: What does upper.tri() return? How does subsetting a matrix with it work? Do we need any additional subsetting rules to describe its behaviour?

    x <- outer(1:5, 1:5, FUN = "*")
    x[upper.tri(x)]

    A: upper.tri() has really intuitive source code. It coerces it’s input to a matrix and returns a logical matrix. Hence describing it’s behaviour for the use of subsetting is based on everything that applies to subsetting with logical matrices.

  4. Q: Why does mtcars[1:20] return an error? How does it differ from the similar mtcars[1:20, ]?
    A: In the first case mtcar is subsetted with a vector and the statement should return a data.frame of the first 20 columns in mtcars. Since mtcars only has 11 columns, the index is out of bounds, which explains the error. The biggest difference of mtcars[1:20, ] to the former case, is that now mtcars is subsetted with two vectors. In this case you will get returned the first 20 rows and all columns (like subsetting a matrix).

  5. Q: Implement your own function that extracts the diagonal entries from a matrix (it should behave like diag(x) where x is a matrix).
    A: First we copy the relevant part of the source code from the diag() function:

    function (x = 1, nrow, ncol){
      if (is.matrix(x)) {
        if (nargs() > 1L) # this and the next line will be dropped
          stop("'nrow' or 'ncol' cannot be specified when 'x' is a matrix") 
        if ((m <- min(dim(x))) == 0L) 
          return(vector(typeof(x), 0L))
        y <- x[1 + 0L:(m - 1L) * (dim(x)[1L] + 1)]
        nms <- dimnames(x)
        if (is.list(nms) && !any(sapply(nms, is.null)) && identical((nm <- nms[[1L]][seq_len(m)]),
                                                                    nms[[2L]][seq_len(m)]))
          names(y) <- nm
        return(y)
      }
      }

    In the next step we drop the unncessary nrow and ncol argument and the related code in the 3rd and 4th line:

    diag_v <- function (x = 1) {
      if (is.matrix(x)) {
        if ((m <- min(dim(x))) == 0L) 
          return(vector(typeof(x), 0L))
        y <- x[1 + 0L:(m - 1L) * (dim(x)[1L] + 1)] # subsetting with a vector
        nms <- dimnames(x)
        if (is.list(nms) && !any(sapply(nms, is.null)) && identical((nm <- nms[[1L]][seq_len(m)]),
                                                                    nms[[2L]][seq_len(m)]))
          names(y) <- nm
        return(y)
      }
      }

    If we look for the idea to capture the diagonal elements, we can see that the input matrix is subsetted with a vector, so we called this function diag_v(). Of course we can implement our own function diag_m(), where we subset with a matrix.

    diag_m <- function (x = 1) {
      if (is.matrix(x)) {
        if ((m <- min(dim(x))) == 0L) 
          return(vector(typeof(x), 0L))
        y <- x[matrix(c(1:m,1:m), m)] # subsetting with a matrix
        nms <- dimnames(x)
        if (is.list(nms) && !any(sapply(nms, is.null)) && identical((nm <- nms[[1L]][seq_len(m)]), 
                                                                        nms[[2L]][seq_len(m)])) 
          names(y) <- nm
        return(y)
      }
    }

    Now we can check if we get the same results as the original function and also compare the speed. Therefore we convert the relatively large diamonds dataset from the ggplot2 package into a matrix.

    diamonds_m <- as.matrix(ggplot2::diamonds)
    
    stopifnot(all.equal(diag(diamonds_m),
                        diag_v(diamonds_m),
                        diag_m(diamonds_m))) # our functions succeed the little test
    
    microbenchmark::microbenchmark(
      diag(diamonds_m),
      diag_v(diamonds_m),
      diag_m(diamonds_m)
    )
    #> Unit: microseconds
    #>                expr    min      lq     mean  median     uq      max neval
    #>    diag(diamonds_m)  4.851  5.2590  5.70113  5.7315  5.932   10.378   100
    #>  diag_v(diamonds_m) 12.666 13.1455 13.31255 13.2730 13.403   14.948   100
    #>  diag_m(diamonds_m) 14.680 15.1415 94.71131 15.2890 15.472 7935.469   100

    The original function seems to be a little bit faster than the trimmed and our matrix version. Maybe this is due to compiling issues

    diag_c <- compiler::cmpfun(diag)
    diag_v_c <- compiler::cmpfun(diag_v)
    diag_m_c <- compiler::cmpfun(diag_m)
    
    # Now we can make a fair comparison of the speed:
    
    microbenchmark::microbenchmark(
      diag_c(diamonds_m),
      diag_v_c(diamonds_m),
      diag_m_c(diamonds_m)
    )
    #> Unit: microseconds
    #>                  expr    min     lq     mean  median      uq    max neval
    #>    diag_c(diamonds_m)  4.743  5.005  5.59129  5.7285  5.8465 12.881   100
    #>  diag_v_c(diamonds_m) 12.616 12.880 13.09981 13.0080 13.1970 18.044   100
    #>  diag_m_c(diamonds_m) 14.559 14.984 15.67262 15.1190 15.2835 66.300   100

    We can see that our diag_m version is only a little bit slower than the original version. However the source code of the matrix version could be a bit easier to read.

    We could also take an idea from the source code of upper.tri() and subset with a logical vector (but it turns out to be really slow):

    diag_lv <- function (x = 1) {
      if (is.matrix(x)) {
        if ((m <- min(dim(x))) == 0L) 
          return(vector(typeof(x), 0L))
        y <- x[row(x) == col(x)]
        nms <- dimnames(x)
        if (is.list(nms) && !any(sapply(nms, is.null)) && identical((nm <- nms[[1L]][seq_len(m)]), 
                                                                nms[[2L]][seq_len(m)])) 
          names(y) <- nm
        return(y)
      }
      }

    compile it and compare it with the other versions

    diag_lv_c <- compiler::cmpfun(diag_lv)
    stopifnot(all.equal(diag(diamonds_m),
                        diag_v_c(diamonds_m),
                        diag_m_c(diamonds_m),
                        diag_lv_c(diamonds_m))) # our functions succeed the little test
    
    microbenchmark::microbenchmark(
      diag(diamonds_m),
      diag_v_c(diamonds_m),
      diag_m_c(diamonds_m),
      diag_lv_c(diamonds_m)
    )
    #> Unit: microseconds
    #>                   expr      min        lq       mean    median        uq
    #>       diag(diamonds_m)    5.015    7.2075    8.71495    7.9880    9.9685
    #>   diag_v_c(diamonds_m)   13.665   16.3980   18.67010   17.0185   18.9435
    #>   diag_m_c(diamonds_m)   16.911   18.8905   21.05842   19.5075   22.9010
    #>  diag_lv_c(diamonds_m) 1648.256 2077.4310 2712.09951 2102.5180 3653.8960
    #>       max neval
    #>    19.462   100
    #>    60.965   100
    #>    41.901   100
    #>  5030.135   100
  6. Q: What does df[is.na(df)] <- 0 do? How does it work?
    A: It replaces all NAs within df with the value 0. is.na(df) returns a logical matrix which is used to subset df. Since you can combine subsetting and assignment, only the matched part of df (the NAs) is replaced with 0 entries.

2.2 Subsetting operators

  1. Q: Given a linear model, e.g., mod <- lm(mpg ~ wt, data = mtcars), extract the residual degrees of freedom. Extract the R squared from the model summary (summary(mod))
    A: Since mod is of type list we can expect several possibilities:

    mod$df.residual       # preserving output
    mod$df.r              # preserving output with partial matching
    mod["df.residual"]    # list output (without partial matching)
    mod[["df.residual"]]  # preserving output (without partial matching)

    The same states for summary(mod), so we can use for example:

    summary(mod)$r.squared

    (To get tidy output from r-models in general also the broom package is a good alternative).

2.3 Applications

  1. Q: How would you randomly permute the columns of a data frame? (This is an important technique in random forests.) Can you simultaneously permute the rows and columns in one step?
    A: Combine `[` with the sample() function:

    iris[sample(ncol(iris))] # permute rows
    iris[sample(nrow(iris)), sample(ncol(iris)), drop = FALSE] # permute both at the same time
  2. Q: How would you select a random sample of m rows from a data frame? What if the sample had to be contiguous (i.e., with an initial row, a final row, and every row in between)?
    A: For example

    m=10
    iris[sample(nrow(iris), m),]
    
    # Blockversion
    start <- sample(nrow(iris) - m + 1, 1)
    end <- start + m - 1
    iris[start:end, , drop = FALSE]
  3. Q: How could you put the columns in a data frame in alphabetical order?
    A: We can sort the names and subset by name:

    iris[sort(names(iris))]