Re-writing as.formula

The base R function as.formula parses its input with R’s own rules: it ignores white space but treats operator characters as operators. That causes problems when the string contains names that are not syntactically valid in R. For instance,

as.formula("money ~ brains + looks") # works
## money ~ brains + looks

works fine, but

as.formula("gross-pay ~ min-wage + tips") # doesn't work
## gross - pay ~ min - wage + tips

doesn’t work, because the parser reads each hyphen as a minus sign, so gross-pay becomes gross minus pay.
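
One way to keep as.formula and still use names like these is to wrap them in backticks, which is how R quotes non-syntactic names. A minimal sketch in base R, using the same made-up column names as the example above:

```r
# Backticks mark `gross-pay` and `min-wage` as single names,
# so the parser no longer sees a subtraction.
f <- as.formula("`gross-pay` ~ `min-wage` + tips")

# all.vars() lists the variable names the formula refers to;
# here it should return the three names with their hyphens intact.
all.vars(f)
```

This only helps when you control how the string is built, which is not always the case.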

Let’s pause a minute and think about when we might want to use as.formula. A typical case is when you want to experiment with different formulae on one dataset. For example:

library(purrr)
library(broom)
library(dplyr, warn.conflicts = FALSE)

mtcars %>% names %>%                   # get the names
      setdiff("mpg") %>%               # drop mpg because that's what we'll model
      combn(3, simplify = FALSE) %>%   # get all combinations of 3 columns
      map(~ paste(., collapse = " + ")) %>%   # make the rhs of the formula
      map_chr(~ paste("mpg ~ ", .)) %>%              # add the lhs
      tibble(formula = .,                            
             model = formula %>% map(~lm(as.formula(.), data = mtcars)), # build models
             r_squared = model %>% map_dbl(. %>% glance %>% .$r.squared) # extract r.squared
            ) %>%
      arrange(desc(r_squared))  # find the best formulae
## # A tibble: 120 x 3
##    formula                 model    r_squared
##    <chr>                   <list>       <dbl>
##  1 mpg ~  wt + qsec + am   <S3: lm>     0.850
##  2 mpg ~  cyl + hp + wt    <S3: lm>     0.843
##  3 mpg ~  cyl + wt + carb  <S3: lm>     0.842
##  4 mpg ~  hp + wt + am     <S3: lm>     0.840
##  5 mpg ~  cyl + wt + qsec  <S3: lm>     0.840
##  6 mpg ~  drat + wt + qsec <S3: lm>     0.837
##  7 mpg ~  hp + drat + wt   <S3: lm>     0.837
##  8 mpg ~  hp + wt + gear   <S3: lm>     0.835
##  9 mpg ~  hp + wt + qsec   <S3: lm>     0.835
## 10 mpg ~  wt + qsec + gear <S3: lm>     0.833
## # … with 110 more rows
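
To see what a single row of that table involves, here is one formula built and fitted by hand in base R. It is only a sketch of one iteration of the pipeline, and it reads R squared from summary() rather than broom::glance().

```r
# all three-column combinations of the predictors (everything except mpg)
predictors <- setdiff(names(mtcars), "mpg")
combos <- combn(predictors, 3, simplify = FALSE)
length(combos)   # choose(10, 3) = 120 candidate formulae

# build one formula string, convert it, and fit the model
f <- as.formula(paste("mpg ~", paste(c("wt", "qsec", "am"), collapse = " + ")))
fit <- lm(f, data = mtcars)
r2 <- summary(fit)$r.squared   # about 0.850, matching the top row of the table
```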

This is all well and good, but imagine the column names of mtcars were not so neat and tidy.

library(dplyr)
mtcars1 <- mtcars %>% rename(`q-sec` = qsec) # mess up a name

mtcars1 %>% names %>%
  setdiff("mpg") %>%               # same
      combn(3, simplify = FALSE) %>%   # code 
      map(~ paste(., collapse = " + ")) %>%   # as
      map(~ paste("mpg ~ ", .)) %>%              # before
      unlist %>%
      tibble(formula = ., 
             model = formula %>% map(~lm(as.formula(.), data = mtcars1)) 
            )
## Error in eval(predvars, data, env): object 'sec' not found
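
The error message is a clue to what happened. A quick check with base R’s all.vars() shows which variables a parsed formula actually asks for; the formula string below is one of those the pipeline generates. My reading of why the complaint is about sec rather than q is that, in a fresh session, q resolves to the base function q() (quit), so sec is the first name that cannot be found. Treat that as an assumption about your session rather than a guarantee.

```r
# one of the formula strings generated from the renamed columns
f <- as.formula("mpg ~ cyl + disp + q-sec")

all.vars(f)       # q and sec show up as two separate variables
is.function(q)    # TRUE unless q has been redefined as something else
```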

So, let’s re-write as.formula:

library(stringr)
library(rlang, warn.conflicts = FALSE)

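as_formula <- function(string) {
  # pull out every run of letters, hyphens and backslashes; turn each into a symbol
  parts <- string %>% str_extract_all("[A-Za-z\\-\\\\]+") %>% .[[1]] %>% syms
  target <- parts[[1]]   # the first name is the response
  inputs <- parts[-1]    # the rest are the predictors
  rhs <- inputs %>% reduce(~expr(!!.x + !!.y))   # join the predictors with +
  expr(!!target ~ !!rhs)
}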

as_formula("gross-pay ~ min-wage + tips")
## `gross-pay` ~ `min-wage` + tips
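
It helps to look at what the regular expression picks out before the names become symbols. The sketch below runs the same pattern on two strings. The second string (with x1 and x_2) is a hypothetical input, included to show that digits and underscores are not in the character class, so names containing them get truncated.

```r
library(stringr)

pattern <- "[A-Za-z\\-\\\\]+"   # letters, hyphens and backslashes only

# hyphenated names survive as single tokens; ~ and + are skipped
str_extract_all("gross-pay ~ min-wage + tips", pattern)[[1]]

# digits and underscores end a token, so x1 and x_2 both come back as "x"
str_extract_all("y ~ x1 + x_2", pattern)[[1]]
```

None of the mtcars column names contain digits or underscores, which is why the pattern is enough for the example here.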

Let’s just try it out on our previous example:

mtcars1 %>% names %>%                   # get the names
      setdiff("mpg") %>%               # drop mpg because that's what we'll model
      combn(3, simplify = FALSE) %>%   # get all combinations of 3 columns
      map(~ paste(., collapse = " + ")) %>%   # make the rhs of the formula
      map_chr(~ paste("mpg ~ ", .)) %>%              # add the lhs
      tibble(formula = .,                            
             model = formula %>% map(~lm(as_formula(.), data = mtcars1)), # build models
             r_squared = model %>% map_dbl(. %>% glance %>% .$r.squared) # extract r.squared
            ) %>%
      arrange(desc(r_squared))
## # A tibble: 120 x 3
##    formula                  model    r_squared
##    <chr>                    <list>       <dbl>
##  1 mpg ~  wt + q-sec + am   <S3: lm>     0.850
##  2 mpg ~  cyl + hp + wt     <S3: lm>     0.843
##  3 mpg ~  cyl + wt + carb   <S3: lm>     0.842
##  4 mpg ~  hp + wt + am      <S3: lm>     0.840
##  5 mpg ~  cyl + wt + q-sec  <S3: lm>     0.840
##  6 mpg ~  drat + wt + q-sec <S3: lm>     0.837
##  7 mpg ~  hp + drat + wt    <S3: lm>     0.837
##  8 mpg ~  hp + wt + gear    <S3: lm>     0.835
##  9 mpg ~  hp + wt + q-sec   <S3: lm>     0.835
## 10 mpg ~  wt + q-sec + gear <S3: lm>     0.833
## # … with 110 more rows

OK, I admit it. This version of as.formula isn’t actually better than the original, because it can’t handle interaction terms: it assumes you want ‘+’s between all the terms on the rhs of your formula.

as_formula("bedtime ~ tiredness * coding")
## bedtime ~ tiredness + coding

That clearly isn’t correct. Nonetheless, it might be useful when you have complex names and a simple linear model. Maybe it should be called as_linear_formula.
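
One possible way around that limitation is to leave the operators alone, quote only the awkward names, and hand the result back to as.formula. The sketch below is an illustration, not a drop-in replacement. quote_hyphenated_names is a made-up helper, and it assumes that a hyphen with no spaces around it is part of a name, while a hyphen with spaces is a real minus sign. It will also mangle names that are already wrapped in backticks.

```r
quote_hyphenated_names <- function(string) {
  env <- parent.frame()   # give the formula the caller's environment
  # a name followed by one or more "-part" pieces, with no spaces in between
  pattern <- "([A-Za-z.][A-Za-z0-9._]*(?:-[A-Za-z0-9._]+)+)"
  # wrap each such token in backticks so the parser treats it as one name
  quoted <- gsub(pattern, "`\\1`", string, perl = TRUE)
  as.formula(quoted, env = env)
}

f <- quote_hyphenated_names("gross-pay ~ min-wage * tips")
all.vars(f)    # hyphenated names kept whole
f[[3]][[1]]    # the operator on the rhs is still *
```

Under that assumption, something like "y ~ x-1" would be read as a single name `x-1` rather than as dropping the intercept, so spacing in the input string matters.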