Hutils

hutils package

My name is Hugh. I’ve written some miscellaneous functions that don’t seem to belong in a particular package. I’ve usually put these in R/utils.R when I write a package. Thus, hutils.

This vignette just goes through each exported function.

library(knitr)
suggested_packages <- c("geosphere", "nycflights13", "dplyr", "ggplot2", "microbenchmark")
opts_chunk$set(eval = all(vapply(suggested_packages, requireNamespace, quietly = TRUE, FUN.VALUE = FALSE)))
tryCatch({
  library(geosphere)
  library(nycflights13)
  library(dplyr, warn.conflicts = FALSE)
  library(ggplot2)
  library(microbenchmark)
  library(data.table, warn.conflicts = FALSE)
  library(magrittr)
  library(hutils, warn.conflicts = FALSE)
}, 
# requireNamespace does not detect errors like
# package ‘dplyr’ was installed by an R version with different internals; it needs to be reinstalled for use with this R version
error = function(e) {
  opts_chunk$set(eval = FALSE)
})
options(digits = 4)

Aliases

These are simple additions to magrittr's aliases, including capitalized forms of and and or that invoke && and || (the 'long-form' logical operators), as well as the nor / neither functions.

The main motivation is to make the source code easier to indent. I occasionally find such source code easier to use.

OR(OR(TRUE,
      stop("Never happens")),  ## short-circuits
   AND(FALSE,
       stop("Never happens")))  
## [1] TRUE

nor (or neither, which is identical) returns TRUE if and only if both of its arguments are FALSE.
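
For example (a quick sketch of that definition):

nor(FALSE, FALSE)  # TRUE: both arguments are FALSE
nor(TRUE, FALSE)   # FALSE: at least one argument is TRUE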

coalesce and if_else

These are near drop-in replacements for the equivalent functions from dplyr. They are included here because they are very useful outside the tidyverse, and may be needed in circumstances where importing dplyr (with all of its dependencies) would be inappropriate.

They attempt to be drop-in replacements but:

  1. hutils::if_else only works with logical, integer, double, and character type vectors. Lists and factors won’t work.
  2. hutils::coalesce short-circuits on its first argument; if there are no NAs in x then x is returned, even if the other vectors are the wrong length or type.
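
For example, a minimal sketch of point 2 (the second vector is deliberately the wrong length):

x <- 1:3                  # no NAs, so coalesce short-circuits on x
hutils::coalesce(x, 4:5)  # returns 1:3 unchanged, despite the length mismatch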

In addition, hutils::if_else is generally faster than dplyr::if_else:

my_check <- function(values) {
  all(vapply(values[-1], function(x) identical(values[[1]], x), logical(1)))
}

set.seed(2)
cnd <- sample(c(TRUE, FALSE, NA), size = 100e3, replace = TRUE)
yes <- sample(letters, size = 100e3, replace = TRUE)
no <- sample(letters, size = 100e3, replace = TRUE)
na <- sample(letters, size = 100e3, replace = TRUE)

microbenchmark(dplyr =  dplyr::if_else(cnd, yes, no, na),
               hutils = hutils::if_else(cnd, yes, no, na),
               check = my_check) %>%
  print
## Unit: milliseconds
##    expr   min    lq  mean median    uq   max neval
##   dplyr 4.020 4.084 4.574  4.193 4.281 9.022   100
##  hutils 1.852 1.878 2.196  1.891 1.915 4.766   100
cnd <- sample(c(TRUE, FALSE, NA), size = 100e3, replace = TRUE)
yes <- sample(letters, size = 1, replace = TRUE)
no <- sample(letters, size = 100e3, replace = TRUE)
na <- sample(letters, size = 1, replace = TRUE)

microbenchmark(dplyr =  dplyr::if_else(cnd, yes, no, na),
               hutils = hutils::if_else(cnd, yes, no, na),
               check = my_check) %>%
  print
## Unit: milliseconds
##    expr   min    lq  mean median    uq   max neval
##   dplyr 3.942 3.988 4.212  4.084 4.136 6.675   100
##  hutils 1.102 1.130 1.369  1.138 1.149 3.901   100

This speed advantage also appears to be true of coalesce:

x <- sample(c(letters, NA), size = 100e3, replace = TRUE)
A <- sample(c(letters, NA), size = 100e3, replace = TRUE)
B <- sample(c(letters, NA), size = 100e3, replace = TRUE)
C <- sample(c(letters, NA), size = 100e3, replace = TRUE)

microbenchmark(dplyr =  dplyr::coalesce(x, A, B, C),
               hutils = hutils::coalesce(x, A, B, C),
               check = my_check) %>%
  print
## Unit: microseconds
##    expr    min   lq mean median     uq   max neval
##   dplyr 3679.1 3764 4596   3920 6114.9  8753   100
##  hutils  779.6  797 1799    812  877.4 72403   100

especially during short-circuits:

x <- sample(c(letters), size = 100e3, replace = TRUE)

microbenchmark(dplyr =  dplyr::coalesce(x, A, B, C),
               hutils = hutils::coalesce(x, A, B, C),
               check = my_check) %>%
  print
## Unit: microseconds
##    expr     min      lq    mean  median     uq    max neval
##   dplyr 2864.24 2936.37 3648.17 3127.27 3914.3 6704.7   100
##  hutils   63.54   63.74   66.23   65.41   67.1  106.2   100
x <- sample(c(letters, NA), size = 100e3, replace = TRUE)
A <- sample(c(letters), size = 100e3, replace = TRUE)

microbenchmark(dplyr =  dplyr::coalesce(x, A, B, C),
               hutils = hutils::coalesce(x, A, B, C),
               check = my_check) %>%
  print
## Unit: microseconds
##    expr  min     lq mean median     uq   max neval
##   dplyr 3203 3289.8 3802 3389.7 3549.3  7598   100
##  hutils  447  459.9 1341  468.6  480.9 71401   100

Drop columns

To drop a column from a data.table, you set it to NULL:

DT <- data.table(A = 1:5, B = 1:5, C = 1:5)
DT[, A := NULL]

There’s nothing wrong with this, but I’ve found the following a useful alias, especially in a magrittr pipe.

DT <- data.table(A = 1:5, B = 1:5, C = 1:5)
DT %>%
  drop_col("A") %>%
  drop_col("B")
##        C
##    <int>
## 1:     1
## 2:     2
## 3:     3
## 4:     4
## 5:     5
# or
DT <- data.table(A = 1:5, B = 1:5, C = 1:5)
DT %>%
  drop_cols(c("A", "B"))
##        C
##    <int>
## 1:     1
## 2:     2
## 3:     3
## 4:     4
## 5:     5

These functions simply invoke the canonical form, so they won't be any faster.

Additionally, one can drop columns by a regular expression using drop_colr:

flights <- as.data.table(flights)

flights %>%
  drop_colr("time") %>%
  drop_colr("arr(?!_delay)", perl = TRUE)
##          year month   day dep_delay arr_delay flight tailnum origin   dest
##         <int> <int> <int>     <num>     <num>  <int>  <char> <char> <char>
##      1:  2013     1     1         2        11   1545  N14228    EWR    IAH
##      2:  2013     1     1         4        20   1714  N24211    LGA    IAH
##      3:  2013     1     1         2        33   1141  N619AA    JFK    MIA
##      4:  2013     1     1        -1       -18    725  N804JB    JFK    BQN
##      5:  2013     1     1        -6       -25    461  N668DN    LGA    ATL
##     ---                                                                   
## 336772:  2013     9    30        NA        NA   3393    <NA>    JFK    DCA
## 336773:  2013     9    30        NA        NA   3525    <NA>    LGA    SYR
## 336774:  2013     9    30        NA        NA   3461  N535MQ    LGA    BNA
## 336775:  2013     9    30        NA        NA   3572  N511MQ    LGA    CLE
## 336776:  2013     9    30        NA        NA   3531  N839MQ    LGA    RDU
##         distance  hour minute
##            <num> <num>  <num>
##      1:     1400     5     15
##      2:     1416     5     29
##      3:     1089     5     40
##      4:     1576     5     45
##      5:      762     6      0
##     ---                      
## 336772:      213    14     55
## 336773:      198    22      0
## 336774:      764    12     10
## 336775:      419    11     59
## 336776:      431     8     40

drop_constant_cols

When a table is filtered, some columns may become constant in the result and are then redundant; drop_constant_cols removes them.

flights %>%
  .[origin == "JFK"] %>%
  drop_constant_cols
##         month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##         <int> <int>    <int>          <int>     <num>    <int>          <int>
##      1:     1     1      542            540         2      923            850
##      2:     1     1      544            545        -1     1004           1022
##      3:     1     1      557            600        -3      838            846
##      4:     1     1      558            600        -2      849            851
##      5:     1     1      558            600        -2      853            856
##     ---                                                                      
## 111275:     9    30     2240           2250       -10     2347              7
## 111276:     9    30     2241           2246        -5     2345              1
## 111277:     9    30     2307           2255        12     2359           2358
## 111278:     9    30     2349           2359       -10      325            350
## 111279:     9    30       NA           1455        NA       NA           1634
##         arr_delay carrier flight tailnum   dest air_time distance  hour minute
##             <num>  <char>  <int>  <char> <char>    <num>    <num> <num>  <num>
##      1:        33      AA   1141  N619AA    MIA      160     1089     5     40
##      2:       -18      B6    725  N804JB    BQN      183     1576     5     45
##      3:        -8      B6     79  N593JB    MCO      140      944     6      0
##      4:        -2      B6     49  N793JB    PBI      149     1028     6      0
##      5:        -3      B6     71  N657JB    TPA      158     1005     6      0
##     ---                                                                       
## 111275:       -20      B6   2002  N281JB    BUF       52      301    22     50
## 111276:       -16      B6    486  N346JB    ROC       47      264    22     46
## 111277:         1      B6    718  N565JB    BOS       33      187    22     55
## 111278:       -25      B6    745  N516JB    PSE      196     1617    23     59
## 111279:        NA      9E   3393    <NA>    DCA       NA      213    14     55
##                   time_hour
##                      <POSc>
##      1: 2013-01-01 05:00:00
##      2: 2013-01-01 05:00:00
##      3: 2013-01-01 06:00:00
##      4: 2013-01-01 06:00:00
##      5: 2013-01-01 06:00:00
##     ---                    
## 111275: 2013-09-30 22:00:00
## 111276: 2013-09-30 22:00:00
## 111277: 2013-09-30 22:00:00
## 111278: 2013-09-30 23:00:00
## 111279: 2013-09-30 14:00:00

drop_empty_cols

This function drops columns in which all the values are NA.

planes %>% 
  as.data.table %>% 
  .[!complete.cases(.)]
##       tailnum  year                    type                  manufacturer
##        <char> <int>                  <char>                        <char>
##    1:  N10156  2004 Fixed wing multi engine                       EMBRAER
##    2:  N102UW  1998 Fixed wing multi engine              AIRBUS INDUSTRIE
##    3:  N103US  1999 Fixed wing multi engine              AIRBUS INDUSTRIE
##    4:  N104UW  1999 Fixed wing multi engine              AIRBUS INDUSTRIE
##    5:  N10575  2002 Fixed wing multi engine                       EMBRAER
##   ---                                                                    
## 3295:  N997AT  2002 Fixed wing multi engine                        BOEING
## 3296:  N997DL  1992 Fixed wing multi engine MCDONNELL DOUGLAS AIRCRAFT CO
## 3297:  N998AT  2002 Fixed wing multi engine                        BOEING
## 3298:  N998DL  1992 Fixed wing multi engine MCDONNELL DOUGLAS CORPORATION
## 3299:  N999DN  1992 Fixed wing multi engine MCDONNELL DOUGLAS CORPORATION
##           model engines seats speed    engine
##          <char>   <int> <int> <int>    <char>
##    1: EMB-145XR       2    55    NA Turbo-fan
##    2:  A320-214       2   182    NA Turbo-fan
##    3:  A320-214       2   182    NA Turbo-fan
##    4:  A320-214       2   182    NA Turbo-fan
##    5: EMB-145LR       2    55    NA Turbo-fan
##   ---                                        
## 3295:   717-200       2   100    NA Turbo-fan
## 3296:     MD-88       2   142    NA Turbo-fan
## 3297:   717-200       2   100    NA Turbo-fan
## 3298:     MD-88       2   142    NA Turbo-jet
## 3299:     MD-88       2   142    NA Turbo-jet
planes %>% 
  as.data.table %>% 
  .[!complete.cases(.)] %>% 
  # drops speed
  drop_empty_cols
##       tailnum  year                    type                  manufacturer
##        <char> <int>                  <char>                        <char>
##    1:  N10156  2004 Fixed wing multi engine                       EMBRAER
##    2:  N102UW  1998 Fixed wing multi engine              AIRBUS INDUSTRIE
##    3:  N103US  1999 Fixed wing multi engine              AIRBUS INDUSTRIE
##    4:  N104UW  1999 Fixed wing multi engine              AIRBUS INDUSTRIE
##    5:  N10575  2002 Fixed wing multi engine                       EMBRAER
##   ---                                                                    
## 3295:  N997AT  2002 Fixed wing multi engine                        BOEING
## 3296:  N997DL  1992 Fixed wing multi engine MCDONNELL DOUGLAS AIRCRAFT CO
## 3297:  N998AT  2002 Fixed wing multi engine                        BOEING
## 3298:  N998DL  1992 Fixed wing multi engine MCDONNELL DOUGLAS CORPORATION
## 3299:  N999DN  1992 Fixed wing multi engine MCDONNELL DOUGLAS CORPORATION
##           model engines seats    engine
##          <char>   <int> <int>    <char>
##    1: EMB-145XR       2    55 Turbo-fan
##    2:  A320-214       2   182 Turbo-fan
##    3:  A320-214       2   182 Turbo-fan
##    4:  A320-214       2   182 Turbo-fan
##    5: EMB-145LR       2    55 Turbo-fan
##   ---                                  
## 3295:   717-200       2   100 Turbo-fan
## 3296:     MD-88       2   142 Turbo-fan
## 3297:   717-200       2   100 Turbo-fan
## 3298:     MD-88       2   142 Turbo-jet
## 3299:     MD-88       2   142 Turbo-jet

duplicated_rows

There are many useful functions for detecting duplicates in R. However, in interactive use, I often want not merely to see which values are duplicated, but also to compare them with the originals. This is especially true when I am comparing duplicates across a subset of columns in a data.table.

flights %>%
  # only the 'second' of the duplicates is returned
  .[duplicated(., by = c("origin", "dest"))]  
##          year month   day dep_time sched_dep_time dep_delay arr_time
##         <int> <int> <int>    <int>          <int>     <num>    <int>
##      1:  2013     1     1      600            600         0      837
##      2:  2013     1     1      607            607         0      858
##      3:  2013     1     1      608            600         8      807
##      4:  2013     1     1      623            627        -4      933
##      5:  2013     1     1      624            630        -6      840
##     ---                                                             
## 336548:  2013     9    30       NA           1455        NA       NA
## 336549:  2013     9    30       NA           2200        NA       NA
## 336550:  2013     9    30       NA           1210        NA       NA
## 336551:  2013     9    30       NA           1159        NA       NA
## 336552:  2013     9    30       NA            840        NA       NA
##         sched_arr_time arr_delay carrier flight tailnum origin   dest air_time
##                  <int>     <num>  <char>  <int>  <char> <char> <char>    <num>
##      1:            825        12      MQ   4650  N542MQ    LGA    ATL      134
##      2:            915       -17      UA   1077  N53442    EWR    MIA      157
##      3:            735        32      MQ   3768  N9EAMQ    EWR    ORD      139
##      4:            932         1      UA    496  N459UA    LGA    IAH      229
##      5:            830        10      MQ   4599  N518MQ    LGA    MSP      166
##     ---                                                                       
## 336548:           1634        NA      9E   3393    <NA>    JFK    DCA       NA
## 336549:           2312        NA      9E   3525    <NA>    LGA    SYR       NA
## 336550:           1330        NA      MQ   3461  N535MQ    LGA    BNA       NA
## 336551:           1344        NA      MQ   3572  N511MQ    LGA    CLE       NA
## 336552:           1020        NA      MQ   3531  N839MQ    LGA    RDU       NA
##         distance  hour minute           time_hour
##            <num> <num>  <num>              <POSc>
##      1:      762     6      0 2013-01-01 06:00:00
##      2:     1085     6      7 2013-01-01 06:00:00
##      3:      719     6      0 2013-01-01 06:00:00
##      4:     1416     6     27 2013-01-01 06:00:00
##      5:     1020     6     30 2013-01-01 06:00:00
##     ---                                          
## 336548:      213    14     55 2013-09-30 14:00:00
## 336549:      198    22      0 2013-09-30 22:00:00
## 336550:      764    12     10 2013-09-30 12:00:00
## 336551:      419    11     59 2013-09-30 11:00:00
## 336552:      431     8     40 2013-09-30 08:00:00
flights %>%
  # Both rows are returned and (by default)
  # duplicates are presented adjacently
  duplicated_rows(by = c("origin", "dest"))
##          year month   day dep_time sched_dep_time dep_delay arr_time
##         <int> <int> <int>    <int>          <int>     <num>    <int>
##      1:  2013     1     1     1315           1317        -2     1413
##      2:  2013     1     1     1655           1621        34     1804
##      3:  2013     1     1     2056           2004        52     2156
##      4:  2013     1     2     1332           1327         5     1419
##      5:  2013     1     2     1746           1621        85     1835
##     ---                                                             
## 336767:  2013     9    27     1516           1520        -4     1739
## 336768:  2013     9    29     1754           1755        -1     2019
## 336769:  2013     9    30      719            725        -6      916
## 336770:  2013     9    30     1519           1520        -1     1726
## 336771:  2013     9    30     1747           1755        -8     1941
##         sched_arr_time arr_delay carrier flight tailnum origin   dest air_time
##                  <int>     <num>  <char>  <int>  <char> <char> <char>    <num>
##      1:           1423       -10      EV   4112  N13538    EWR    ALB       33
##      2:           1724        40      EV   3260  N19554    EWR    ALB       36
##      3:           2112        44      EV   4170  N12540    EWR    ALB       31
##      4:           1433       -14      EV   4316  N14153    EWR    ALB       33
##      5:           1724        71      EV   3260  N14153    EWR    ALB       31
##     ---                                                                       
## 336767:           1740        -1      MQ   3532  N724MQ    LGA    XNA      160
## 336768:           2015         4      MQ   3713  N725MQ    LGA    XNA      160
## 336769:            945       -29      MQ   3547  N735MQ    LGA    XNA      150
## 336770:           1740       -14      MQ   3532  N725MQ    LGA    XNA      148
## 336771:           2015       -34      MQ   3713  N720MQ    LGA    XNA      146
##         distance  hour minute           time_hour
##            <num> <num>  <num>              <POSc>
##      1:      143    13     17 2013-01-01 13:00:00
##      2:      143    16     21 2013-01-01 16:00:00
##      3:      143    20      4 2013-01-01 20:00:00
##      4:      143    13     27 2013-01-02 13:00:00
##      5:      143    16     21 2013-01-02 16:00:00
##     ---                                          
## 336767:     1147    15     20 2013-09-27 15:00:00
## 336768:     1147    17     55 2013-09-29 17:00:00
## 336769:     1147     7     25 2013-09-30 07:00:00
## 336770:     1147    15     20 2013-09-30 15:00:00
## 336771:     1147    17     55 2013-09-30 17:00:00

Haversine distance

To emphasize the miscellany of this package, I now present haversine_distance which simply returns the distance between two points on the Earth, given their latitude and longitude.

I prefer this to other packages' implementations. Although the geosphere package can do a lot more than calculate distances between points, I find the interface of distHaversine unfortunate, as it cannot easily be used inside a data.frame. In addition, I find the arguments of hutils::haversine_distance clearer than having to remember whether to set byrow inside the matrix call when constructing inputs for distHaversine.

DT1 <- data.table(lat_orig = runif(1e5, -80, 80),
                  lon_orig = runif(1e5, -179, 179),
                  lat_dest = runif(1e5, -80, 80),
                  lon_dest = runif(1e5, -179, 179))

DT2 <- copy(DT1)

microbenchmark(DT1[, distance := haversine_distance(lat_orig, lon_orig,
                                                    lat_dest, lon_dest)],
               
               DT2[, distance := distHaversine(cbind(lon_orig, lat_orig),
                                               cbind(lon_dest, lat_dest))])
## Unit: milliseconds
##                                                                                             expr
##           DT1[, `:=`(distance, haversine_distance(lat_orig, lon_orig, lat_dest,      lon_dest))]
##  DT2[, `:=`(distance, distHaversine(cbind(lon_orig, lat_orig),      cbind(lon_dest, lat_dest)))]
##    min     lq   mean median     uq   max neval
##   7.96  8.083  8.269  8.193  8.274 12.05   100
##  12.68 12.955 15.498 15.535 15.851 92.04   100
rm(DT1, DT2)

mutate_other

There may be occasions when a categorical variable in a data.table needs to be modified to reduce the number of distinct categories. For example, you may want to plot a chart with a set number of facets, or ensure the smooth operation of randomForest, which accepts no more than 32 levels in a feature.

mutate_other keeps the n most common categories and changes the other categories to Other.

set.seed(1)
DT <- data.table(Fruit = sample(c("apple", "pear", "orange", "tomato", "eggplant"),
                                size = 20,
                                prob = c(0.45, 0.25, 0.15, 0.1, 0.05),
                                replace = TRUE),
                 Price = rpois(20, 10))

kable(mutate_other(DT, "Fruit", n = 3)[])
Fruit Price
apple 14
apple 11
pear 8
Other 2
apple 8
Other 10
Other 7
pear 12
pear 11
apple 12
apple 12
apple 10
pear 3
apple 11
Other 13
pear 7
Other 8
Other 11
apple 14
Other 9

ngrep

This is a ‘dumb’ negation of grep. In recent versions of R, the option invert = TRUE exists. A slight advantage of ngrep is that it’s shorter to type. But if you don’t have arthritis, best use invert = TRUE or !grepl.
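
For comparison, the base R forms mentioned above:

x <- c("apple", "banana", "cherry")
grep("an", x, invert = TRUE, value = TRUE)  # "apple" "cherry"
x[!grepl("an", x)]                          # the same result, via grepl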

%notin%, %ein%, %enotin%, %pin%

These functions provide complementary functionality to %in%:

%notin%

%notin% is the negation of %in%, but it also uses the fastmatch package to increase the speed of the operation.
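
For example:

c("a", "b", "c") %notin% c("a", "b")  # FALSE FALSE TRUE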

%ein% and %enotin%

The functions %ein% and %enotin% are motivated by a different sort of problem. Consider the following statement:

iris <- as.data.table(iris)
iris[Species %in% c("setosa", "versicolour")] %$%
  mean(Sepal.Length / Sepal.Width)
## [1] 1.47

On the face of it, this appears to give the average ratio for Iris setosa and Iris versicolour irises. However, it only gives the average ratio for setosa irises, as the correct spelling is Iris versicolor, not -our. This particular error is easy to make (in fact, when I wrote this vignette, the first Google hit for iris dataset made the same spelling error), and it's easy to imagine similar mistakes, such as getting the capitalization of a value wrong. The functions %ein% and %enotin% strive to reduce the occurrence of this mistake. They operate exactly the same as %in% and %notin%, but error if any value on the right-hand side is not present among the values on the left-hand side:

iris <- as.data.table(iris)
iris[Species %ein% c("setosa", "versicolour")] %$%
  mean(Sepal.Length / Sepal.Width)
## Error in .checkTypos(e, names_x): `rhs` contained versicolour, but this value was not found in `lhs = Species`. All values of `rhs` must be in `lhs`. Ensure you have specified `rhs` correctly.

The e stands for ‘exists’; i.e. they should be read as “exists and in” and “exists and not in”.

%pin%

This performs a partial match (i.e. grepl) but with a possibly more readable or intuitive syntax:

identical(iris[grep("v", Species)],
          iris[Species %pin% "v"])
## [1] TRUE

If the RHS has more than one element, the matching is done on alternation (i.e. OR):

iris[Species %pin% c("ver", "vir")] %>%
  head
##    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
##           <num>       <num>        <num>       <num>     <fctr>
## 1:          7.0         3.2          4.7         1.4 versicolor
## 2:          6.4         3.2          4.5         1.5 versicolor
## 3:          6.9         3.1          4.9         1.5 versicolor
## 4:          5.5         2.3          4.0         1.3 versicolor
## 5:          6.5         2.8          4.6         1.5 versicolor
## 6:          5.7         2.8          4.5         1.3 versicolor

There is an important qualification: if the RHS is NULL, then the result will be TRUE for every element of the LHS, contrary to the behaviour of %in%. This is not entirely unexpected, as NULL could legitimately be interpreted as ε, the empty regular expression, which occurs in every string.
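
For instance:

c("a", "b") %pin% NULL  # TRUE TRUE
c("a", "b") %in% NULL   # FALSE FALSE, by contrast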

provide.dir

This is the same as dir.create, but it first checks whether the target directory exists and does nothing if it does. Motivated by \providecommand in LaTeX, which creates a macro only if it does not already exist.
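
A minimal sketch (the directory name is purely illustrative):

provide.dir("cache")  # creates ./cache if it does not already exist
provide.dir("cache")  # a second call is a harmless no-op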

select_which

This provides a similar role to dplyr::select_if, but it was originally part of package:grattan, so it has a different name. It simply returns the columns whose values return TRUE when Which is applied to them. Additional columns (which may or may not satisfy Which) can be included by using .and.dots. (To remove columns, you can use drop_col.)

DT <- data.table(x = 1:5,
                 y = letters[1:5],
                 AB = c(NA, TRUE, FALSE, TRUE, FALSE))
select_which(DT, anyNA, .and.dots = "y")
##        AB      y
##    <lgcl> <char>
## 1:     NA      a
## 2:   TRUE      b
## 3:  FALSE      c
## 4:   TRUE      d
## 5:  FALSE      e

set_cols_first

Up to and including data.table 1.10.4, one could only reorder the columns by supplying all the columns. You can use set_cols_first and set_cols_last to put columns first or last without supplying all the columns.
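
A small sketch of the intended usage (the expected orderings are given as comments):

DT <- data.table(A = 1:3, B = 1:3, C = 1:3)
names(set_cols_first(DT, "C"))  # expected: "C" "A" "B"
names(set_cols_last(DT, "A"))   # expected: "C" "B" "A"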

Unique keys

In some circumstances, you need to know that the key of a data.table is unique. For example, you may expect a join to be performed later, without specifying mult = 'first' or permitting Cartesian joins. data.table does not require a key to be unique, nor does it supply tools to check the uniqueness of keys. hutils supplies two simple functions: has_unique_key, which, when applied to a data.table, returns TRUE if and only if the table has a key and that key is unique.

set_unique_key does the same as setkey but will error if the resultant key is not unique.
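
A small sketch of the behaviour described above:

DT <- data.table(ID = c(1L, 1L, 2L), value = 1:3)
has_unique_key(DT)   # FALSE: no key has been set
setkey(DT, ID)
has_unique_key(DT)   # FALSE: keyed on ID, but ID contains duplicates
# set_unique_key(DT, ID) would error here for the same reason
DT2 <- data.table(ID = 1:3, value = 1:3, key = "ID")
has_unique_key(DT2)  # TRUE: keyed on ID, and ID is unique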

hutils v1.1.0

auc

The area under the (ROC) curve gives a single value to measure the tradeoff between true positives and false positives.

dt <- data.table(y = !sample(0:1, size = 100, replace = TRUE), 
                 x = runif(100))
dt[, pred := predict(lm(y ~ x, data = .SD), newdata = .SD)]

dt[, auc(y, pred)]
## [1] 0.577

select_grep

To select columns matching a regular expression:

flights %>%
  select_grep("arr")
##         arr_time sched_arr_time arr_delay carrier
##            <int>          <int>     <num>  <char>
##      1:      830            819        11      UA
##      2:      850            830        20      UA
##      3:      923            850        33      AA
##      4:     1004           1022       -18      B6
##      5:      812            837       -25      DL
##     ---                                          
## 336772:       NA           1634        NA      9E
## 336773:       NA           2312        NA      9E
## 336774:       NA           1330        NA      MQ
## 336775:       NA           1344        NA      MQ
## 336776:       NA           1020        NA      MQ

You can use the additional arguments .and and .but.not to override the patterns.

flights %>%
  select_grep("arr", .and = "year", .but.not = "arr_time")
##          year sched_arr_time arr_delay carrier
##         <int>          <int>     <num>  <char>
##      1:  2013            819        11      UA
##      2:  2013            830        20      UA
##      3:  2013            850        33      AA
##      4:  2013           1022       -18      B6
##      5:  2013            837       -25      DL
##     ---                                       
## 336772:  2013           1634        NA      9E
## 336773:  2013           2312        NA      9E
## 336774:  2013           1330        NA      MQ
## 336775:  2013           1344        NA      MQ
## 336776:  2013           1020        NA      MQ

hutils v1.2.0

RQ

This is simply a shorthand to test whether a package needs installing. The package name need not be quoted, for convenience.

RQ(dplyr, "dplyr must be installed")
## NULL
RQ("dplyr", "dplyr needs installing", "dplyr installed.")
## [1] "dplyr installed."

ahull

This locates the biggest rectangle beneath a curve:

if (!identical(Sys.info()[["sysname"]], "Darwin"))
  ggplot(data.table(x = c(0, 1, 2, 3, 4), y = c(0, 1, 2, 0.1, 0))) +
  geom_area(aes(x, y)) +
  geom_rect(data = ahull(, c(0, 1, 2, 3, 4), c(0, 1, 2, 0.1, 0)),
            aes(xmin = xmin,
                xmax = xmax,
                ymin = ymin,
                ymax = ymax),
            color = "red") 

set.seed(101)
ahull_dt <-
  data.table(x = c(0:100) / 100,
             y = cumsum(rnorm(101, 0.05)))
if (!identical(Sys.info()[["sysname"]], "Darwin"))
ggplot(ahull_dt) +
  geom_area(aes(x, y)) + 
  geom_rect(data = ahull(ahull_dt), 
            aes(xmin = xmin, 
                xmax = xmax, 
                ymin = ymin, 
                ymax = ymax), 
            color = "red") + 
  geom_rect(data = ahull(ahull_dt,
                         incl_negative = TRUE), 
            aes(xmin = xmin, 
                xmax = xmax, 
                ymin = ymin, 
                ymax = ymax), 
            color = "blue") + 
  geom_rect(data = ahull(ahull_dt,
                         incl_negative = TRUE,
                         minH = 4), 
            aes(xmin = xmin, 
                xmax = xmax, 
                ymin = ymin, 
                ymax = ymax), 
            color = "green") + 
  geom_rect(data = ahull(ahull_dt,
                         incl_negative = TRUE,
                         minW = 0.25), 
            aes(xmin = xmin, 
                xmax = xmax, 
                ymin = ymin, 
                ymax = ymax), 
            color = "white",
            fill = NA)

hutils v1.3.0

weighted_quantile

Simply a version of quantile supporting weighted values:

x <- 1:10
w <- c(rep(1, 5), rep(2, 5))
quantile(x, prob = c(0.25, 0.75), names = FALSE)
## [1] 3.25 7.75
weighted_quantile(x, w, p = c(0.25, 0.75))
## [1] 4 9

mutate_ntile

To add a column of ntiles (say, for later summarizing):

flights %>%
  as.data.table %>%
  .[, .(year, month, day, origin, dest, distance)] %>%
  mutate_ntile(distance, n = 5L)
##          year month   day origin   dest distance distanceQuintile
##         <int> <int> <int> <char> <char>    <num>            <int>
##      1:  2013     1     1    EWR    IAH     1400                4
##      2:  2013     1     1    LGA    IAH     1416                4
##      3:  2013     1     1    JFK    MIA     1089                4
##      4:  2013     1     1    JFK    BQN     1576                4
##      5:  2013     1     1    LGA    ATL      762                3
##     ---                                                          
## 336772:  2013     9    30    JFK    DCA      213                1
## 336773:  2013     9    30    LGA    SYR      198                1
## 336774:  2013     9    30    LGA    BNA      764                3
## 336775:  2013     9    30    LGA    CLE      419                1
## 336776:  2013     9    30    LGA    RDU      431                2

You can use non-standard evaluation (as above) or you can quote the col argument, as in the next example. Use character.only = TRUE to ensure col is interpreted only as a character vector of column names.

flights %>%
  as.data.table %>%
  .[, .(year, month, day, origin, dest, distance)] %>%
  mutate_ntile("distance", n = 5L)
##          year month   day origin   dest distance distanceQuintile
##         <int> <int> <int> <char> <char>    <num>            <int>
##      1:  2013     1     1    EWR    IAH     1400                4
##      2:  2013     1     1    LGA    IAH     1416                4
##      3:  2013     1     1    JFK    MIA     1089                4
##      4:  2013     1     1    JFK    BQN     1576                4
##      5:  2013     1     1    LGA    ATL      762                3
##     ---                                                          
## 336772:  2013     9    30    JFK    DCA      213                1
## 336773:  2013     9    30    LGA    SYR      198                1
## 336774:  2013     9    30    LGA    BNA      764                3
## 336775:  2013     9    30    LGA    CLE      419                1
## 336776:  2013     9    30    LGA    RDU      431                2
flights %>%
  as.data.table %>%
  mutate_ntile("distance",
               n = 5L,
               character.only = TRUE) %>%
  .[, dep_delay := coalesce(dep_delay, 0)] %>%
  .[, .(avgDelay = mean(dep_delay)), keyby = "distanceQuintile"]
## Key: <distanceQuintile>
##    distanceQuintile avgDelay
##               <int>    <num>
## 1:                1    12.88
## 2:                2    12.97
## 3:                3    13.14
## 4:                4    11.49
## 5:                5    11.16

longest_affix

Trimming common affixes can be useful during data cleaning:

trim_common_affixes(c("CurrentHousingCosts(weekly)",
                      "CurrentFuelCosts(weekly)"))
## [1] "Housing" "Fuel"

hutils 1.4.0

%<->%

Referred to as swap in the documentation. Used to swap the values bound to two names.

a <- 1
b <- 2
a %<->% b
identical(c(a, b), c(2, 1))
## [1] TRUE

average_bearing

Determine the average bearing of vectors. This is slightly trickier than simply averaging modulo 360, since the average within the most acute sector is desired.

average_bearing(0, 270)  # NW
## [1] 315
mean(c(0, 270))          # SE (i.e. wrong)
## [1] 135

dir2

This is a faster version of list.files for Windows only, utilizing the dir command on the command prompt.

Mode

The statistical mode; the most common element.

Mode(c(1, 1, 1, 2, 3))
## [1] 1

replace_pattern_in

A cousin of find_pattern_in, but instead of collecting the results, it replaces the contents sought with the replacement provided.

samp

A safer version of sample. I use it because I found the following behaviour of sample surprising.

DT <- data.table(x = c(5, 2, 3),
                 y = c(5, 3, 4))
DT[, .(Base = sample(.BY[["x"]]:.BY[["y"]])), keyby = .(x, y)]
## Key: <x, y>
##        x     y  Base
##    <num> <num> <int>
## 1:     2     3     2
## 2:     2     3     3
## 3:     3     4     3
## 4:     3     4     4
## 5:     5     5     3
## 6:     5     5     1
## 7:     5     5     5
## 8:     5     5     4
## 9:     5     5     2
DT[, .(Base = samp(.BY[["x"]]:.BY[["y"]])), keyby = .(x, y)]
## `length(x) = 1`, so returning `rep.int(x, 1)`.
## Key: <x, y>
##        x     y  Base
##    <num> <num> <int>
## 1:     2     3     2
## 2:     2     3     3
## 3:     3     4     3
## 4:     3     4     4
## 5:     5     5     5
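
The difference in the (5, 5) group arises because base R treats sample(5:5) as sample(1:5); directly:

sample(5:5)  # a permutation of 1:5, owing to sample()'s convenience behaviour
samp(5:5)    # 5 (with a message): there is only one value to sample from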