10 Non standard evaluation
10.1 Capturing expressions
Q: One important feature of deparse() to be aware of when programming is that it can return multiple strings if the input is too long. For example, the following call produces a vector of length two:
g <- function(x) deparse(substitute(x))
g(a + b + c + d + e + f + g + h + i + j + k + l + m + n + o + p + q + r + s + t + u + v + w + x + y + z)
Why does this happen? Carefully read the documentation for ?deparse. Can you write a wrapper around deparse() so that it always returns a single string?
A: deparse() has a width.cutoff argument (default: 60 bytes), which according to ?deparse is an:
integer in [20, 500] determining the cutoff (in bytes) at which line-breaking is tried.
Further:
width.cutoff is a lower bound for the line lengths: deparsing a line proceeds until at least width.cutoff bytes have been output and e.g. arg = value expressions will not be split across lines.
You can wrap it, for example, with paste0():
deparse_without_cutoff <- function(x) {
  paste0(deparse(x), collapse = "")
}
This can be enhanced a little with gsub(), here applied inside a function that captures its argument with substitute():
gsub("\\s+", " ", paste0(deparse(substitute(x)), collapse = ""))
This at least normalises every run of whitespace to a single space. Note, however, that it is not possible to recover the exact input in every case:
# spaces are unified
substitute(1 + 1 + 1 + 1)
#> 1 + 1 + 1 + 1
quote(1 + 1 + 1 + 1)
#> 1 + 1 + 1 + 1
# leading zeros in numeric input are trimmed
substitute(01)
#> [1] 1
quote(01)
#> [1] 1
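The two ideas above (a larger cutoff and collapsing the result) can be combined in one helper. The following is only a sketch: the name deparse_one() is made up for illustration, and squeezing whitespace with gsub() would also alter whitespace inside string literals. Since R 4.0.0, base R also offers deparse1(), which collapses its output in a similar way.

```r
# Hypothetical helper (not part of base R): deparse to exactly one string.
deparse_one <- function(expr, width.cutoff = 500L) {
  lines <- deparse(expr, width.cutoff = width.cutoff)
  # Join the pieces, then squeeze the indentation of continuation lines.
  gsub("\\s+", " ", paste(lines, collapse = " "))
}

# NSE front end: capture the argument, then deparse it to one string.
g_one <- function(x) deparse_one(substitute(x))

long_call <- quote(a + b + c + d + e + f + g + h + i + j + k + l + m +
                   n + o + p + q + r + s + t + u + v + w + x + y + z)

length(deparse(long_call))      # more than one string with the default cutoff
length(deparse_one(long_call))  # a single string
g_one(a + b + c)
```

Passing the already captured expression to deparse_one() and doing the capturing in a thin front end keeps the helper usable from other functions.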
Q: Why does as.Date.default() use substitute() and deparse()? Why does pairwise.t.test() use them? Read the source code.
A: as.Date.default() uses them to convert unexpected input expressions (neither dates nor NAs) into a character string that is returned within an error message. pairwise.t.test() uses them to convert the names of its data inputs (response vector x and grouping factor g) into character strings, which are then formatted into part of the output.
Q: pairwise.t.test() assumes that deparse() always returns a length one character vector. Can you construct an input that violates this expectation? What happens?
A: We can pass an expression that exceeds the default cutoff width of deparse() to one of pairwise.t.test()’s data input arguments. The expression is then split into a character vector of length greater than 1. The deparsed data inputs are pasted directly (read the source code!) with “and” as separator, and the result is only used for display in the output. So only the data.name output changes (it includes more than one “and”):
d = 1
pairwise.t.test(2, d+d+d+d+d+d+d+d+d+d+d+d+d+d+d+d+d)
#>
#> Pairwise comparisons using t tests with pooled SD
#>
#> data: 2 and d + d + d + d + d + d + d + d + d + d + d + d + d + d + d + d + 2 and d
#>
#> <0 x 0 matrix>
#>
#> P value adjustment method: holm
Q: f(), defined above, just calls substitute(). Why can’t we use it to define g()? In other words, what will the following code return? First make a prediction. Then run the code and think about the results.
f <- function(x) substitute(x)
g <- function(x) deparse(f(x))
g(1:10) # -> x
g(x) # -> x
g(x + y ^ 2 / z + exp(a * sin(b))) # -> x
A: All return x, because substitute()’s second argument env defaults to the current evaluation environment, environment(). Inside f(), the promise x holds the expression x that g() passed on, not the expression the user typed. If you call substitute() from another function, you may want to set the env argument to parent.frame(), which refers to the calling environment:
f <- function(x) substitute(x, env = parent.frame())
g <- function(x) deparse(f(x))
g(1:10) # -> 1:10
g(x) # -> x
g(x + y ^ 2 / z + exp(a * sin(b))) # -> x + y^2/z + exp(a * sin(b))
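The difference between the two lookups can be tried side by side. This is a small sketch; all function names in it are made up for illustration. The last variant shows an alternative design that is assumed here, not taken from the exercise: capture the expression once in the outermost function and pass the quoted result on.

```r
f_own    <- function(x) substitute(x)                        # looks in f_own()'s frame
f_caller <- function(x) substitute(x, env = parent.frame())  # looks in the caller's frame

g_own    <- function(x) deparse(f_own(x))
g_caller <- function(x) deparse(f_caller(x))

g_own(1:10)     # "x"
g_caller(1:10)  # "1:10"

# Alternative: capture once at the top and hand over the quoted expression.
f_q <- function(expr) deparse(expr)
g_q <- function(x) f_q(substitute(x))
g_q(a + b)      # "a + b"
```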
10.2 Non standard evaluation in subset
Q: Predict the results of the following lines of code:
eval(quote(eval(quote(eval(quote(2 + 2)))))) # -> 4
eval(eval(quote(eval(quote(eval(quote(2 + 2))))))) # -> 4
quote(eval(quote(eval(quote(eval(quote(2 + 2))))))) # -> eval(quote(eval(quote(eval(quote(2 + 2))))))
A: An outer quote() always wins: it returns its argument unevaluated, whereas every eval() call evaluates whatever it is given.
Q: subset2() has a bug if you use it with a single column data frame. What should the following code return? How can you modify subset2() so it returns the correct type of object?
subset2 <- function(x, condition) {
  condition_call <- substitute(condition)
  r <- eval(condition_call, x)
  x[r, ]
}
sample_df2 <- data.frame(x = 1:10)
subset2(sample_df2, x > 8)
#> [1] 9 10
A: Well, what does base::subset() return?
subset(sample_df2, x > 8)
#>     x
#> 9   9
#> 10 10
So the output should always be a data frame and not an atomic vector as above. To always return a data frame, change the last line of subset2() to x[r, , drop = FALSE].
Q: The real subset function (subset.data.frame()) removes missing values in the condition. Modify subset2() to do the same: drop the offending rows.
A: This time change the last line to x[!is.na(r) & r, , drop = FALSE]. Alternatively, you can exclude NAs from the subset by setting them to FALSE with r[is.na(r)] <- FALSE.
Q: What happens if you use quote() instead of substitute() inside of subset2()?
A: quote(condition) always returns the symbol condition. R looks for condition within sample_df but can’t find it, so it continues in the execution environment of subset2(), finds the argument condition there and evaluates it as a >= 4 (as supplied in the input). This promise is evaluated in the calling environment, and since a can’t be found there or in the remaining environments (the global environment and the search path), we get the error “Error in eval(expr, envir, enclos) : object ‘a’ not found”. To understand this in detail, it helps to forget about substitute() for a moment and explore where eval() evaluates its supplied expressions for all kinds of supplied envir and enclos arguments, since a lot of things come together here.
This is in contrast to substitute(), which doesn’t only capture the symbol condition, but the expression slot of the condition promise object. In other words, substitute() notices when a promise is supplied as its first argument and returns the promise’s expression. To be more precise, we quote from the R Language Definition:
A formal argument is really a promise, an object with three slots, one for the expression that defines it, one for the environment in which to evaluate that expression, and one for the value of that expression once evaluated. substitute will recognize a promise variable and substitute the value of its expression slot.
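The answers to the subset2() questions above can be tried out together. The following is a sketch under stated assumptions: the names subset2_fixed() and subset2_quote() and the contents of sample_df are made up for illustration, and the code assumes that no object called a exists in the calling environment.

```r
sample_df  <- data.frame(a = 1:5, b = c(5, NA, 3, 2, 1))
sample_df2 <- data.frame(x = 1:10)

# subset2() with both fixes applied: drop = FALSE and NA handling.
subset2_fixed <- function(x, condition) {
  condition_call <- substitute(condition)
  r <- eval(condition_call, x)
  x[!is.na(r) & r, , drop = FALSE]
}

# Variant using quote(): it always captures the symbol `condition`.
subset2_quote <- function(x, condition) {
  condition_call <- quote(condition)
  r <- eval(condition_call, x)
  x[r, , drop = FALSE]
}

subset2_fixed(sample_df2, x > 8)  # a one-column data frame with rows 9 and 10
subset2_fixed(sample_df, b >= 3)  # the row with NA in b is dropped

# The quote() variant forces the promise in the calling environment,
# where `a` is assumed not to exist, so an error is expected.
res <- tryCatch(subset2_quote(sample_df, a >= 4),
                error = function(e) conditionMessage(e))
res
```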
Q: The second argument in subset() allows you to select variables. It treats variable names as if they were positions. This allows you to do things like subset(mtcars, , -cyl) to drop the cylinder variable, or subset(mtcars, , disp:drat) to select all the variables between disp and drat. How does this work? I’ve made this easier to understand by extracting it out into its own function.
select <- function(df, vars) {
  vars <- substitute(vars)
  var_pos <- setNames(as.list(seq_along(df)), names(df))
  pos <- eval(vars, var_pos)
  df[, pos, drop = FALSE]
}
select(mtcars, -cyl)
A: We can comment on what happens:
select <- function(df, vars) {
  vars <- substitute(vars)
  # Create a list that maps each column name of the data frame to its column number.
  var_pos <- setNames(as.list(seq_along(df)), names(df))
  # Evaluate the supplied expression with this list as the environment, so every
  # column name evaluates to its position within the supplied data frame.
  pos <- eval(vars, var_pos)
  # Now just subset the data frame by column index.
  df[, pos, drop = FALSE]
}
select(mtcars, -cyl)
This also works for ranges, e.g. select(mtcars, cyl:drat), because each name is replaced by its position before : is applied, so cyl:drat becomes 2:5.
Q: What does evalq() do? Use it to reduce the amount of typing for the examples above that use both eval() and quote().
A: From the help of eval():
The evalq form is equivalent to eval(quote(expr), …). eval evaluates its first argument in the current scope before passing it to the evaluator: evalq avoids this.
In other “words”:
identical(eval(quote(x)), evalq(x)) # -> TRUE
The examples above can be written as:
eval(quote(eval(quote(eval(quote(2 + 2)))))) #-> evalq(evalq(evalq(2 + 2)))
eval(eval(quote(eval(quote(eval(quote(2 + 2))))))) #-> eval(evalq(evalq(evalq(2 + 2))))
quote(eval(quote(eval(quote(eval(quote(2 + 2))))))) #-> quote(evalq(evalq(evalq(2 + 2))))
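The difference between eval() and evalq() becomes visible once a second environment is involved. A minimal sketch, with an environment e and a variable x chosen purely for illustration:

```r
e <- new.env()
assign("x", "inside e", envir = e)
x <- quote(toupper(x))  # in the current scope, x holds an expression

eval(x, e)         # x is evaluated first, yielding toupper(x); that call runs in e
evalq(x, e)        # the symbol x itself is looked up in e
eval(quote(x), e)  # same as evalq(x, e)
```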
10.3 Scoping issues
Q: plyr::arrange() works similarly to subset(), but instead of selecting rows, it reorders them. How does it work? What does substitute(order(...)) do? Create a function that does only that and experiment with it.
A: substitute(order(...)) creates a call to order() in which ... is replaced by the expressions supplied by the user, e.g. order(Species, Sepal.Length). Evaluating this call in the context of the supplied data frame returns the row indices that sort the data by the given columns, beginning with the first one.
We can just copy the relevant part of the source code of plyr::arrange() and see if it does what we expect:
arrange_indices <- function(df, ...) {
  stopifnot(is.data.frame(df))
  ord <- eval(substitute(order(...)), df, parent.frame())
  ord
}
identical(arrange_indices(iris, Species, Sepal.Length),
          order(iris$Species, iris$Sepal.Length))
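Building on the index function above, a row-reordering function needs only one more step. This is a sketch of the idea, not the plyr implementation; arrange_sketch() and the example data are made up for illustration.

```r
# Hypothetical helper: reorder the rows of df by the columns given in `...`.
arrange_sketch <- function(df, ...) {
  stopifnot(is.data.frame(df))
  # substitute() only builds the call order(col1, col2, ...);
  # eval() then runs it with the columns of df in scope.
  ord <- eval(substitute(order(...)), df, parent.frame())
  df[ord, , drop = FALSE]
}

df <- data.frame(g = c("b", "a", "b", "a"), v = c(2, 4, 1, 3))
arrange_sketch(df, g, v)  # sorted by g, ties broken by v
arrange_sketch(df, -v)    # sorted by decreasing v
```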
Q: What does transform() do? Read the documentation. How does it work? Read the source code for transform.data.frame(). What does substitute(list(...)) do?
A: As stated in the next question, transform() is similar to plyr::mutate(), but plyr::mutate() applies the transformations sequentially so that a transformation can refer to columns that were just created. The rest of the question can be answered by commenting the source code:
# Using `...` as a function argument allows the user to pass any number of extra
# arguments to the function. In this case we can expect arguments of the form
# new_col1 = foo(col_in_data_argument), new_col2 = foo(col_in_data_argument), ...
> transform.data.frame
function (`_data`, ...) {
  # substitute(list(...)) replaces the dots by the supplied expressions and returns
  # the unevaluated call `list(new_col1 = ..., new_col2 = ...)`. Nothing is evaluated
  # until now (which is important).
  # Evaluation of the expression happens with the `eval()` function.
  # This means: all the names of the arguments in `...` like new_col1, new_col2, ...
  # become names of the list `e`.
  # All functions/variables like foo(column_in_data_argument) are evaluated within
  # the context (environment) of the `_data` argument supplied to the `transform()`
  # function (this is specified by the second argument of the eval() function).
  e <- eval(substitute(list(...)), `_data`, parent.frame())
  # Everything that happens from now on is just about formatting and
  # returning the correct columns:
  # We save the names of the list (the names of the added columns).
  tags <- names(e)
  # We create a numeric vector and check if the tags (names of the added columns)
  # appear in the names of the supplied `_data` argument. If yes, we save the
  # column number, if not we save an NA.
  inx <- match(tags, names(`_data`))
  # We create a logical vector, which tells us if a column name is already in the
  # data frame (TRUE) or really new (FALSE).
  matched <- !is.na(inx)
  # If any new column corresponds to an old column name,
  # the corresponding old columns will be overwritten.
  if (any(matched)) {
    `_data`[inx[matched]] <- e[matched]
    `_data` <- data.frame(`_data`)
  }
  # If there is at least one new column name, all of these new columns will be bound
  # to the old data frame (which might have changed a bit during the first if). Then
  # the transformed `_data` is returned.
  if (!all(matched))
    do.call("data.frame", c(list(`_data`), e[!matched]))
  # Also in case of no new column names the transformed `_data` is returned.
  else `_data`
}
Q: plyr::mutate() is similar to transform() but it applies the transformations sequentially so that transformation can refer to columns that were just created:
df <- data.frame(x = 1:5)
transform(df, x2 = x * x, x3 = x2 * x)
plyr::mutate(df, x2 = x * x, x3 = x2 * x)
How does mutate work? What’s the key difference between mutate() and transform()?
A: The main difference is the possibility of sequential transformations. Another difference is that unnamed added columns will be thrown away. Many ideas of the implementation are the same. However, the key difference is that for the sequential transformations a for loop iterates over a list of expressions and, in every iteration, changes the environment for the evaluation of the next expression (which is the supplied data). This should become clear with some comments on the code:
> mutate
function (.data, ...) {
  stopifnot(is.data.frame(.data) || is.list(.data) || is.environment(.data))
  # We catch everything supplied in `...`. But this time we save it in a list of
  # expressions. Again, the added column names will be the names of this list.
  cols <- as.list(substitute(list(...))[-1])
  # All unnamed arguments in `...` will be thrown away, in contrast to
  # `transform()` above, which just creates new column names.
  cols <- cols[names(cols) != ""]
  # Now a for loop evaluates all added columns iteratively in the context
  # (environment) of the data. We start with the first added column:
  # If the column name is already in the data, the old column will be overwritten.
  # If the column name is new, it will be created.
  # Since the underlying data (the environment for the evaluation) gets
  # "updated" in every iteration of the for loop, it is possible to use the new
  # columns directly in the next iteration (which relates to the next added column).
  for (col in names(cols)) {
    .data[[col]] <- eval(cols[[col]], .data, parent.frame())
  }
  # Afterwards the data gets returned.
  .data
}
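The sequential behaviour can be reproduced without installing plyr. The sketch below copies only the core loop from the commented source above; mutate_sketch() is a made-up name, and the example assumes that no object called x2 exists in the calling environment.

```r
# Hypothetical stand-in for plyr::mutate(), reduced to the core loop.
mutate_sketch <- function(.data, ...) {
  cols <- as.list(substitute(list(...))[-1])
  cols <- cols[names(cols) != ""]
  for (col in names(cols)) {
    # .data already contains the columns created in earlier iterations.
    .data[[col]] <- eval(cols[[col]], .data, parent.frame())
  }
  .data
}

df <- data.frame(x = 1:5)
out <- mutate_sketch(df, x2 = x * x, x3 = x2 * x)
out

# transform() evaluates every expression against the original data,
# so x2 is unknown when x3 is computed (assuming no x2 in the caller).
res <- tryCatch(transform(df, x2 = x * x, x3 = x2 * x),
                error = function(e) "error")
res
```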
Q: What does with() do? How does it work? Read the source code for with.default(). What does within() do? How does it work? Read the source code for within.data.frame(). Why is the code so much more complex than with()?
A: with() is a generic function that allows writing an expression (second argument) that refers to variable names of data (first argument) as if the corresponding variables were objects themselves. with() evaluates the expression via an eval(substitute(expr), data, enclos = parent.frame()) construct in a temporary environment, which has the calling frame as a parent. This also means that variables that aren’t found in data will be looked up in with()’s calling environment. As stated in ?with, this is useful for modelling functions.
In contrast to with(), which returns the value of the evaluated expression, within() returns the modified object. So within() can be used as an alternative to base::transform(). within() first creates an environment that contains the variables of data and has within()’s calling environment as parent. The expression is then evaluated inside this environment and thereby modifies it. The rest of the code converts this environment into a list and writes it back into data, so that new and modified variables are kept and variables removed by the expression are dropped.
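The different return values described above are easy to see on a small example. A minimal sketch; the data frame and the variable offset are made up for illustration:

```r
df <- data.frame(x = 1:3, y = c(10, 20, 30))
offset <- 100  # lives in the calling environment, not in df

# with() returns the value of the expression; offset is found in the caller.
total <- with(df, sum(x * y) + offset)
total

# within() returns a modified copy of df: z is added, y is removed.
res <- within(df, {
  z <- x + y
  rm(y)
})
res
```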
10.4 Calling from another function
- Q: The following R functions all use NSE. For each, describe how it uses NSE, and read the documentation to determine its escape hatch.
- rm()
- library() and require()
- substitute()
- data()
- data.frame()
A: For NSE in rm(), we just look at its first two arguments: ... and list = character(). If we supply expressions to ... (which can also be character vectors), these will be caught by match.call() and become an unevaluated call (in this case a pairlist). However, rm() copies and converts the expressions into a character representation and concatenates these with the character vector supplied to the list argument. Then the removal starts. The escape hatch is to supply the objects to be removed as a character vector to rm()’s list argument.
You can supply the input to the first argument (package) of library() and require() with or without quotes. In the default case (character.only = FALSE) the input to package will be converted via as.character(substitute(package)). To avoid this, just supply a character vector and set character.only = TRUE.
substitute() and eval()/quote() are the basic functions for NSE. To see how it’s done one has to understand parse trees and/or look into the underlying C code. The problematic behaviour of substitute() is pretty obvious. There might be some insights that make it predictable, but since substitute() is written for NSE and only has the arguments expr and env, it seems that no escape hatch exists.
Like rm(), data() has the first arguments ... and list = character(). Again you can supply unquoted or quoted names to .... These will be caught, converted to character via as.character(substitute(list(...))[-1L]) and concatenated with the character input of the list argument. The escape hatch is similar to rm(): use the list argument explicitly.
data.frame()’s first argument, ..., gets caught once via object <- as.list(substitute(list(...)))[-1L] and once via x <- list(...). The first one is used, among other things, to create row names. This can be suppressed by setting the argument row.names, which lets you supply a vector or specify a column of the data frame for the explicit naming of rows. x will be deparsed later and is then used to create the column names. Since this process follows several complex rules in cases of “special namings”, data.frame() provides the check.names argument. One can set check.names = FALSE to ensure that columns will be named however they are supplied to data.frame().
Q: Base functions
match.fun(), page(), and ls() all try to automatically determine whether you want standard or non-standard evaluation. Each uses a different approach. Figure out the essence of each approach then compare and contrast.
A: match.fun() uses NSE if you pass something other than a length-one character vector or a symbol, and does not use NSE otherwise. page() uses NSE if you pass something other than a length-one character vector, so symbols still trigger NSE. ls() falls back to substitute() if it cannot evaluate its name argument, and additionally to deparse() if the substituted result is not a character vector.
The ls() method seems to be the safest of the three approaches, but it is also the least performant.
Q: Add an escape hatch to
plyr::mutate() by splitting it into two functions. One function should capture the unevaluated inputs. The other should take a data frame and list of expressions and perform the computation.
A: We look again at the source code of plyr::mutate():
plyr::mutate
#> function (.data, ...)
#> {
#>     stopifnot(is.data.frame(.data) || is.list(.data) || is.environment(.data))
#>     cols <- as.list(substitute(list(...))[-1])
#>     cols <- cols[names(cols) != ""]
#>     for (col in names(cols)) {
#>         .data[[col]] <- eval(cols[[col]], .data, parent.frame())
#>     }
#>     .data
#> }
#> <bytecode: 0x563577a52190>
#> <environment: namespace:plyr>
What we want is to have the local variable “cols” as an argument of our new (wrapped) escape hatch function (analogous to subset2_q() in the textbook). Therefore we create:
get_cols <- function(...) {
  # capture the expressions in `...` and drop the leading `list` symbol
  ll <- as.list(substitute(list(...)))[-1]
  ll[names(ll) != ""]
}
We also want a function that works with “cols” and performs the computation (the for loop in the original plyr::mutate()). It takes the evaluation environment as an argument, so that variables are looked up in the environment of the original caller:
mutate_cols <- function(df, cols, env = parent.frame()) {
  for (col in names(cols)) {
    df[[col]] <- eval(cols[[col]], df, env)
  }
  df
}
Now we can wrap these with our new mutate function and have a nice interface:
mutate2 <- function(df, ...) {
  mutate_cols(df, get_cols(...), env = parent.frame())
}
# a little test
df <- data.frame(x = 1:5)
identical(
  plyr::mutate(df, x2 = x * x, x3 = x2 * x),
  mutate2(df, x2 = x * x, x3 = x2 * x)
)
#> [1] TRUE
Q: What’s the escape hatch for ggplot2::aes()? What about plyr::.()? What do they have in common? What are the advantages and disadvantages of their differences?
A:
- One can call rename_aes directly.
- plyr::. lets you specify an env in which to evaluate ....
Both capture ... using match.call() and create a structure out of them. plyr::. probably requires less knowledge about internals, but is also less customizable.
Q: The version of subset2_q() I presented is a simplification of real code. Why is the following version better?
subset2_q <- function(x, cond, env = parent.frame()) {
  r <- eval(cond, x, env)
  x[r, ]
}
Rewrite subset2() and subscramble() to use this improved version.
A:
subset2_q_old <- function(x, condition) {
  r <- eval(condition, x, parent.frame())
  x[r, ]
}
subset2_q <- function(x, cond, env = parent.frame()) {
  r <- eval(cond, x, env)
  x[r, ]
}
The modified version of subset2_q() allows you to specify the environment in which the condition is evaluated, which allows you to use subset2_q() in more situations (for example, when it is called from within another function).
subset2 <- function(x, condition, env = parent.frame()) {
  subset2_q(x, substitute(condition), env)
}
scramble <- function(x) x[sample(nrow(x)), ]
subscramble <- function(x, condition, env = parent.frame()) {
  # capture the expression from this function's own promise
  condition <- substitute(condition)
  scramble(subset2_q(x, condition, env))
}
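Why the env argument matters shows up as soon as the condition refers to a variable that only exists inside a calling function. The following sketch repeats the definitions so that it is self-contained (with drop = FALSE added); keep_above(), keep_above_old(), subset2_old() and the data are made up for illustration, and the last call assumes that no object called threshold exists at the top level.

```r
subset2_q <- function(x, cond, env = parent.frame()) {
  r <- eval(cond, x, env)
  x[r, , drop = FALSE]
}
subset2 <- function(x, condition, env = parent.frame()) {
  subset2_q(x, substitute(condition), env)
}

# Old version: always evaluates in the frame of its direct caller.
subset2_q_old <- function(x, condition) {
  r <- eval(condition, x, parent.frame())
  x[r, , drop = FALSE]
}
subset2_old <- function(x, condition) subset2_q_old(x, substitute(condition))

sample_df <- data.frame(a = 1:5, b = 5:1)

# Hypothetical callers: `threshold` exists only inside these functions.
keep_above     <- function(df, threshold) subset2(df, a > threshold)
keep_above_old <- function(df, threshold) subset2_old(df, a > threshold)

keep_above(sample_df, 3)  # rows with a equal to 4 and 5

# The old version looks in subset2_old()'s frame and its parents, not in
# keep_above_old()'s frame, so `threshold` is not found.
old_res <- tryCatch(keep_above_old(sample_df, 3),
                    error = function(e) conditionMessage(e))
old_res
```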
10.5 Substitute
Q: Use pryr::subs() to convert the LHS to the RHS for each of the following pairs:
- a + b + c -> a * b * c
- f(g(a, b), c) -> (a + b) * c
- f(a < b, c, d) -> if (a < b) c else d
A:
subs(a + b + c, list("+" = quote(`*`))) # -> `a * b * c`
subs(f(g(a, b), c), list(g = quote(`+`), f = quote(`*`))) # -> `(a + b) * c`
subs(f(a < b, c, d), list(f = quote(`if`))) # -> `if (a < b) c else d`
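If pryr is not available, the same replacements can be sketched with base R, since substitute() also accepts a list as its second argument and then works at the top level too. The variable names e1 to e3 are chosen only for illustration:

```r
# Replace the function symbol `+` by `*` everywhere in the expression.
e1 <- substitute(a + b + c, list(`+` = quote(`*`)))

# Replace g by `+` and f by `*`.
e2 <- substitute(f(g(a, b), c), list(g = quote(`+`), f = quote(`*`)))

# Replace f by `if`.
e3 <- substitute(f(a < b, c, d), list(f = quote(`if`)))

e1
e2
e3

# The results are ordinary calls and can be evaluated.
eval(e2, list(a = 1, b = 2, c = 3))
eval(e3, list(a = 1, b = 2, c = "yes", d = "no"))
```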
- Q: For each of the following pairs of expressions, describe why you can’t use subs() to convert one to the other.
- a + b + c -> a + b * c
- f(a, b) -> f(a, b, c)
- f(a, b, c) -> f(a, b)
A:
- a + b + c -> a + b * c: You can’t convert one “+” to “+” and the other to “*”, because subs() converts either all instances of “+” or none of them.
- f(a, b) -> f(a, b, c): subs() cannot be used to add new arguments, only to convert existing ones.
- f(a, b, c) -> f(a, b): subs() cannot be used to remove arguments, only to convert existing ones.
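For contrast, the three conversions above are possible when the call is modified directly, because a call can be treated much like a list. A short sketch with illustrative variable names:

```r
# Add an argument: f(a, b) -> f(a, b, c)
cl <- quote(f(a, b))
added <- as.call(c(as.list(cl), list(quote(c))))

# Remove an argument: f(a, b, c) -> f(a, b)
# (element 1 is the function name, so the third argument is element 4)
removed <- quote(f(a, b, c))[-4]

# Change the shape of the tree: build a + (b * c) by hand
reshaped <- call("+", quote(a), call("*", quote(b), quote(c)))

added
removed
reshaped
```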
Q: How does pryr::named_dots() work? Read the source.
A: It captures the dot arguments using pryr::dots() (which is just eval(substitute(alist(...)))), and then gets the names of the arguments, using "" for the arguments without names. If all the arguments are named, it simply returns them. Otherwise, it names each unnamed argument with its deparsed expression and returns the renamed list of arguments.
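The behaviour described above can be sketched in base R. This is an approximation written for illustration, not the pryr source; dots_sketch() and named_dots_sketch() are made-up names.

```r
# Capture the expressions in `...` without evaluating them.
dots_sketch <- function(...) eval(substitute(alist(...)))

named_dots_sketch <- function(...) {
  args <- dots_sketch(...)
  nms <- names(args)
  if (is.null(nms)) nms <- rep("", length(args))
  unnamed <- nms == ""
  # Nothing to do if every argument already has a name.
  if (!any(unnamed)) return(args)
  # Name the unnamed arguments after their deparsed expressions.
  nms[unnamed] <- vapply(
    args[unnamed],
    function(x) paste(deparse(x, 500L), collapse = ""),
    character(1),
    USE.NAMES = FALSE
  )
  names(args) <- nms
  args
}

nd <- named_dots_sketch(1, b = 2, x + y)
names(nd)
```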
10.6 The downsides of non-standard evaluation
Q: What does the following function do? What’s the escape hatch? Do you think that this is an appropriate use of NSE?
nl <- function(...) {
  dots <- pryr::named_dots(...)
  lapply(dots, eval, parent.frame())
}
A: nl() extracts the dots, names them, and then evaluates them in the calling environment (the global environment when nl() is called from the top level). This returns a list whose names are what is literally in the dots and whose values are what the dots evaluate to. For example:
nl(1, 2 + 2, mean(c(3, 5)))
#> $`1`
#> [1] 1
#>
#> $`2 + 2`
#> [1] 4
#>
#> $`mean(c(3, 5))`
#> [1] 4
You can always call the underlying lapply() directly as an escape hatch. However, it is a toy example and we are not really sure what you would gain from actually using this.
Q: Instead of relying on promises, you can use formulas created with ~ to explicitly capture an expression and its environment. What are the advantages and disadvantages of making quoting explicit? How does it impact referential transparency?
A: Using formulas in this manner would allow for referential transparency, but it would make working with NSE much more verbose. In any situation in which it is worth using NSE, it would also be worth not using formulas like this.
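What "explicitly capturing an expression and its environment" means can be seen with a one-sided formula. A minimal sketch; make_formula() is a made-up helper for illustration:

```r
make_formula <- function() {
  x <- 10
  ~ x + 1  # the formula remembers the environment it was created in
}

f <- make_formula()
x <- 1

expr <- f[[2]]         # the captured expression: x + 1
env  <- environment(f) # the captured environment, where x is 10

eval(expr, env)        # uses the x that was visible where the formula was written
```

Because the environment travels with the expression, the result does not depend on where eval() is eventually called, at the price of writing the ~ explicitly.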
Q: Read the standard non-standard evaluation rules found at http://developer.r-project.org/nonstandard-eval.pdf.