---
title: "Hutils"
author: "Hugh Parsonage"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{hutils}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

# `hutils` package

My name is Hugh. I've written some miscellaneous functions that don't seem to belong in a particular package. I've usually put these in `R/utils.R` when I write a package. Thus, `hutils`.

This vignette just goes through each exported function.

```{r knitrOpts}
library(knitr)
suggested_packages <- c("geosphere", "nycflights13", "dplyr", "ggplot2", "microbenchmark")
opts_chunk$set(eval = all(vapply(suggested_packages, requireNamespace,
                                 quietly = TRUE, FUN.VALUE = FALSE)))
```

```{r loadPackages}
tryCatch({
  library(geosphere)
  library(nycflights13)
  library(dplyr, warn.conflicts = FALSE)
  library(ggplot2)
  library(microbenchmark)
  library(data.table, warn.conflicts = FALSE)
  library(magrittr)
  library(hutils, warn.conflicts = FALSE)
},
# requireNamespace does not detect errors like
# package 'dplyr' was installed by an R version with different internals;
# it needs to be reinstalled for use with this R version
error = function(e) {
  opts_chunk$set(eval = FALSE)
})

options(digits = 4)
```

## Aliases

These are simple additions to `magrittr`'s aliases, including: capitalized forms of `and` and `or` that invoke `&&` and `||` (the 'long-form' logical operators) and `nor` / `neither` functions. The main motivation is to make the source code easier to indent. I occasionally find such source code easier to use.

```{r aliases}
OR(OR(TRUE,
      stop("Never happens")),  ## short-circuits
   AND(FALSE,
       stop("Never happens")))
```

`nor` (or `neither`, which is identical) returns `TRUE` if and only if both arguments are `FALSE`.

## `coalesce` and `if_else`

These are near drop-in replacements for the equivalent functions from `dplyr`.
They are included here because they are very useful outside of the tidyverse, but may be required in circumstances where importing `dplyr` (with all of its dependencies) would be inappropriate. They attempt to be drop-in replacements, but:

1. `hutils::if_else` only works with `logical`, `integer`, `double`, and `character` type vectors. Lists and factors won't work.
2. `hutils::coalesce` short-circuits on its first argument; if there are no `NA`s in `x` then `x` is returned, even if the other vectors are the wrong length or type.

In addition, `hutils::if_else` is generally faster than `dplyr::if_else`:

```{r compare_if_else}
my_check <- function(values) {
  all(vapply(values[-1],
             function(x) identical(values[[1]], x),
             logical(1)))
}

set.seed(2)
cnd <- sample(c(TRUE, FALSE, NA), size = 100e3, replace = TRUE)
yes <- sample(letters, size = 100e3, replace = TRUE)
no  <- sample(letters, size = 100e3, replace = TRUE)
na  <- sample(letters, size = 100e3, replace = TRUE)

microbenchmark(dplyr  = dplyr::if_else(cnd, yes, no, na),
               hutils = hutils::if_else(cnd, yes, no, na),
               check = my_check) %>%
  print

cnd <- sample(c(TRUE, FALSE, NA), size = 100e3, replace = TRUE)
yes <- sample(letters, size = 1, replace = TRUE)
no  <- sample(letters, size = 100e3, replace = TRUE)
na  <- sample(letters, size = 1, replace = TRUE)

microbenchmark(dplyr  = dplyr::if_else(cnd, yes, no, na),
               hutils = hutils::if_else(cnd, yes, no, na),
               check = my_check) %>%
  print
```

This speed advantage also appears to be true of `coalesce`:

```{r compare_coalesce}
x <- sample(c(letters, NA), size = 100e3, replace = TRUE)
A <- sample(c(letters, NA), size = 100e3, replace = TRUE)
B <- sample(c(letters, NA), size = 100e3, replace = TRUE)
C <- sample(c(letters, NA), size = 100e3, replace = TRUE)

microbenchmark(dplyr  = dplyr::coalesce(x, A, B, C),
               hutils = hutils::coalesce(x, A, B, C),
               check = my_check) %>%
  print
```

especially during short-circuits:

```{r compare_coalesce_short_circuit_x}
x <- sample(c(letters), size = 100e3, replace = TRUE)

microbenchmark(dplyr  = dplyr::coalesce(x, A, B, C),
               hutils = hutils::coalesce(x, A, B, C),
               check = my_check) %>%
  print
```

```{r compare_coalesce_short_circuit_A}
x <- sample(c(letters, NA), size = 100e3, replace = TRUE)
A <- sample(c(letters), size = 100e3, replace = TRUE)

microbenchmark(dplyr  = dplyr::coalesce(x, A, B, C),
               hutils = hutils::coalesce(x, A, B, C),
               check = my_check) %>%
  print
```

## Drop columns

To drop a column from a `data.table`, you set it to `NULL`:

```{r canonical_drop_DT}
DT <- data.table(A = 1:5, B = 1:5, C = 1:5)
DT[, A := NULL]
```

There's nothing wrong with this, but I've found the following a useful alias, especially in a `magrittr` pipe.

```{r drop_col_hutils}
DT <- data.table(A = 1:5, B = 1:5, C = 1:5)
DT %>%
  drop_col("A") %>%
  drop_col("B")

# or
DT <- data.table(A = 1:5, B = 1:5, C = 1:5)
DT %>%
  drop_cols(c("A", "B"))
```

These functions simply invoke the canonical form, so they won't be any faster. Additionally, one can drop columns by a regular expression using `drop_colr`:

```{r drop_colr}
flights <- as.data.table(flights)

flights %>%
  drop_colr("time") %>%
  drop_colr("arr(?!_delay)", perl = TRUE)
```

## `drop_constant_cols`

When a table is filtered, some columns in the result often become constant, and hence redundant.

```{r drop_constant_cols}
flights %>%
  .[origin == "JFK"] %>%
  drop_constant_cols
```

## `drop_empty_cols`

This function drops columns in which all the values are `NA`.

```{r drop_empty_cols}
planes %>%
  as.data.table %>%
  .[!complete.cases(.)]

planes %>%
  as.data.table %>%
  .[!complete.cases(.)] %>%
  # drops speed
  drop_empty_cols
```

## `duplicated_rows`

There are many useful functions for detecting duplicates in R. However, in interactive use, I often want to not merely see which values are duplicated, but also compare them to the original. This is especially true when I am comparing duplicates across a *subset* of columns in a `data.table`.
```{r duplicated_rows}
flights %>%
  # only the 'second' of the duplicates is returned
  .[duplicated(., by = c("origin", "dest"))]

flights %>%
  # Both rows are returned and (by default)
  # duplicates are presented adjacently
  duplicated_rows(by = c("origin", "dest"))
```

## Haversine distance

To emphasize the miscellany of this package, I now present `haversine_distance`, which simply returns the distance between two points on the Earth, given their latitude and longitude.

I prefer this to other packages' implementations. Although the `geosphere` package can do a lot more than calculate distances between points, I find the interface for `distHaversine` unfortunate, as it cannot be easily used inside a `data.frame`. In addition, I've found the arguments clearer in `hutils::haversine_distance`, rather than trying to remember whether to use `byrow` inside the `matrix` function while passing to `distHaversine`.

```{r haversine_distance}
DT1 <- data.table(lat_orig = runif(1e5, -80, 80),
                  lon_orig = runif(1e5, -179, 179),
                  lat_dest = runif(1e5, -80, 80),
                  lon_dest = runif(1e5, -179, 179))

DT2 <- copy(DT1)

microbenchmark(DT1[, distance := haversine_distance(lat_orig, lon_orig,
                                                    lat_dest, lon_dest)],
               DT2[, distance := distHaversine(cbind(lon_orig, lat_orig),
                                               cbind(lon_dest, lat_dest))])
rm(DT1, DT2)
```

## `mutate_other`

There may be occasions where a categorical variable in a `data.table` needs to be modified to reduce the number of distinct categories. For example, you may want to plot a chart with a set number of facets, or ensure the smooth operation of `randomForest`, which accepts no more than 32 levels in a feature.

`mutate_other` keeps the *n* most common categories and changes the other categories to `Other`.
```{r mutate-other, results='asis'}
set.seed(1)
DT <- data.table(Fruit = sample(c("apple", "pear", "orange", "tomato", "eggplant"),
                                size = 20,
                                prob = c(0.45, 0.25, 0.15, 0.1, 0.05),
                                replace = TRUE),
                 Price = rpois(20, 10))

kable(mutate_other(DT, "Fruit", n = 3)[])
```

## `ngrep`

This is a 'dumb' negation of `grep`. In recent versions of R, the option `invert = TRUE` exists. A slight advantage of `ngrep` is that it's shorter to type. But if you don't have arthritis, best use `invert = TRUE` or `!grepl`.

## `notin` `ein` `enotin` `pin`

These functions provide complementary functionality to `%in%`:

### `%notin%`

`%notin%` is the negation of `%in%`, but also uses the package `fastmatch` to increase the speed of the operation.

### `%ein%` and `%enotin%`

The functions `%ein%` and `%enotin%` are motivated by a different sort of problem. Consider the following statement:

```{r iris-versicolour}
iris <- as.data.table(iris)
iris[Species %in% c("setosa", "versicolour")] %$%
  mean(Sepal.Length / Sepal.Width)
```

On the face of it, this appears to give the average ratio for *Iris setosa* and *Iris versicolour* irises. However, it only gives the average ratio for *setosa* irises, as the correct spelling is *Iris versicolor*, not *-our*. This particular error is easy to make (in fact, when I wrote this vignette, the first hit on Google for `iris dataset` made the same spelling error), and it's easy to imagine similar mistakes, such as mistaking the capitalization of a value.

The functions `%ein%` and `%enotin%` strive to reduce the occurrence of this mistake. They operate exactly the same as `%in%` and `%notin%`, but error if any element of the right-hand side (the table of values to be matched against) is not present in the left-hand side:

```{r iris-versicolor, error=TRUE}
iris <- as.data.table(iris)
iris[Species %ein% c("setosa", "versicolour")] %$%
  mean(Sepal.Length / Sepal.Width)
```

The `e` stands for 'exists'; *i.e.* they should be read as "exists and in" and "exists and not in".
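For comparison, plain `%notin%` performs no such existence check; it simply negates the match. A minimal sketch (assuming `hutils` is attached, as above):

```{r notin-eg}
# "quince" is absent from the table, so %notin% is TRUE for it;
# no error is raised, unlike %enotin%
c("apple", "quince") %notin% c("apple", "pear")
```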
### `%pin%`

This performs a partial match (*i.e.* `grepl`), but with a possibly more readable or intuitive syntax:

```{r pin}
identical(iris[grep("v", Species)],
          iris[Species %pin% "v"])
```

If the RHS has more than one element, the matching is done on alternation (*i.e.* OR):

```{r pin-multi}
iris[Species %pin% c("ver", "vir")] %>%
  head
```

There is an important qualification: if the RHS is `NULL`, then the result will be `TRUE` along the length of `x`, contrary to the behaviour of `%in%`. This is not entirely unexpected, as `NULL` could legitimately be interpreted as \(\varepsilon\), the empty regular expression, which occurs in every string.

### `provide.dir`

This is the same as `dir.create`, but it checks whether the target exists and does nothing if it does. Motivated by `\providecommand` in \(\rm\LaTeX{}\), which creates a macro only if it does not exist already.

### `select_which`

This provides a similar role to `dplyr::select_if`, but was originally part of `package:grattan` so has a different name. It simply returns the columns whose *values* return `TRUE` when `Which` is applied. Additional columns (which may or may not satisfy `Which`) may be included by using `.and.dots`. (To remove columns, you can use `drop_col`.)

```{r}
DT <- data.table(x = 1:5,
                 y = letters[1:5],
                 AB = c(NA, TRUE, FALSE, TRUE, FALSE))
select_which(DT, anyNA, .and.dots = "y")
```

### `set_cols_first`

Up to and including `data.table 1.10.4`, one could only reorder the columns by supplying all the columns. You can use `set_cols_first` and `set_cols_last` to put columns first or last without supplying all the columns.

### Unique keys

In some circumstances, you need to know that the `key` of a `data.table` is unique. For example, you may expect a join to be performed later, without specifying `mult = 'first'` or permitting Cartesian joins. `data.table` does not require a `key` to be unique and does not supply tools to check the uniqueness of keys.
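For instance, `setkey` happily accepts duplicated key values; a short sketch using only `data.table`:

```{r key-not-unique}
DTk <- data.table(id = c(1L, 1L, 2L), v = 1:3)
setkey(DTk, id)
haskey(DTk)                # TRUE, even though...
anyDuplicated(DTk$id) > 0  # ...the key values are duplicated
```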
`hutils` supplies two simple functions: `has_unique_key`, which, when applied to a `data.table`, returns `TRUE` if and only if the table has a key and that key is unique; and `set_unique_key`, which does the same as `setkey` but will error if the resultant key is not unique.

## `hutils` v1.1.0

### `auc`

The area under the [(ROC) curve](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) gives a single value to measure the tradeoff between true positives and false positives.

```{r}
dt <- data.table(y = !sample(0:1, size = 100, replace = TRUE),
                 x = runif(100))
dt[, pred := predict(lm(y ~ x, data = .SD), newdata = .SD)]
dt[, auc(y, pred)]
```

### `select_grep`

To select columns matching a regular expression:

```{r select_grep}
flights %>%
  select_grep("arr")
```

You can use the additional arguments `.and` and `.but.not` to override the patterns.

```{r select_grep-and}
flights %>%
  select_grep("arr", .and = "year", .but.not = "arr_time")
```

## `hutils` v1.2.0

### `RQ`

This is simply a shorthand to test whether a package needs installing. For convenience, the package name need not be quoted.
```{r}
RQ(dplyr, "dplyr must be installed")
RQ("dplyr", "dplyr needs installing", "dplyr installed.")
```

### `ahull`

This locates the biggest rectangle beneath a curve:

```{r ahull-1}
if (!identical(Sys.info()[["sysname"]], "Darwin"))
  ggplot(data.table(x = c(0, 1, 2, 3, 4),
                    y = c(0, 1, 2, 0.1, 0))) +
  geom_area(aes(x, y)) +
  geom_rect(data = ahull(, c(0, 1, 2, 3, 4), c(0, 1, 2, 0.1, 0)),
            aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax),
            color = "red")
```

```{r ahull-demos, fig.width = 8, fig.height = 6}
set.seed(101)
ahull_dt <- data.table(x = c(0:100) / 100,
                       y = cumsum(rnorm(101, 0.05)))

if (!identical(Sys.info()[["sysname"]], "Darwin"))
  ggplot(ahull_dt) +
  geom_area(aes(x, y)) +
  geom_rect(data = ahull(ahull_dt),
            aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax),
            color = "red") +
  geom_rect(data = ahull(ahull_dt, incl_negative = TRUE),
            aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax),
            color = "blue") +
  geom_rect(data = ahull(ahull_dt, incl_negative = TRUE, minH = 4),
            aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax),
            color = "green") +
  geom_rect(data = ahull(ahull_dt, incl_negative = TRUE, minW = 0.25),
            aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax),
            color = "white", fill = NA)
```

## `hutils` v1.3.0

### `weighted_quantile`

Simply a version of `quantile` supporting weighted values:

```{r weighted_quantile-ex}
x <- 1:10
w <- c(rep(1, 5), rep(2, 5))
quantile(x, prob = c(0.25, 0.75), names = FALSE)
weighted_quantile(x, w, p = c(0.25, 0.75))
```

### `mutate_ntile`

To add a column of ntiles (say, for later summarizing):

```{r mutate_ntile-ex}
flights %>%
  as.data.table %>%
  .[, .(year, month, day, origin, dest, distance)] %>%
  mutate_ntile(distance, n = 5L)
```

You can use non-standard evaluation (as above) or you can quote the `col` argument. Use `character.only = TRUE` to ensure the `col` argument is interpreted only as a character vector of column names.
```{r mutate_ntile-ex-charonly}
flights %>%
  as.data.table %>%
  .[, .(year, month, day, origin, dest, distance)] %>%
  mutate_ntile("distance", n = 5L)
```

```{r mutate_ntile-ex-2}
flights %>%
  as.data.table %>%
  mutate_ntile("distance", n = 5L, character.only = TRUE) %>%
  .[, dep_delay := coalesce(dep_delay, 0)] %>%
  .[, .(avgDelay = mean(dep_delay)), keyby = "distanceQuintile"]
```

### `longest_affix`

Trimming common affixes can be useful during data cleaning:

```{r longest-affix}
trim_common_affixes(c("CurrentHousingCosts(weekly)",
                      "CurrentFuelCosts(weekly)"))
```

## `hutils` v1.4.0

### `%<->%`

Referred to as `swap` in the documentation. Used to swap values between object names:

```{r swap}
a <- 1
b <- 2
a %<->% b
identical(c(a, b), c(2, 1))
```

### `average_bearing`

Determine the average bearing of vectors. Slightly more difficult than simply taking the average modulo 360, since the most acute sector between the bearings is desired.

```{r average-bearing}
average_bearing(0, 270)  # NW
mean(c(0, 270))          # SE (i.e. wrong)
```

### `dir2`

This is a faster version of `list.files`, for Windows only, utilizing the `dir` command of the command prompt.

### `Mode`

The statistical mode: the most common element.

```{r Mode-eg}
Mode(c(1, 1, 1, 2, 3))
```

### `replace_pattern_in`

A cousin of `find_pattern_in`, but instead of collecting the results, it replaces the contents sought with the replacement provided.

### `samp`

A safer version of `sample`. I use it because I found the following behaviour of `sample` surprising.

```{r samp-eg}
DT <- data.table(x = c(5, 2, 3),
                 y = c(5, 3, 4))
DT[, .(Base = sample(.BY[["x"]]:.BY[["y"]])), keyby = .(x, y)]
DT[, .(Base = samp(.BY[["x"]]:.BY[["y"]])), keyby = .(x, y)]
```
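The surprise is inherited from base R: when `sample` receives a numeric vector of length one, it samples from `1:x` rather than from `x` itself. A base-R-only sketch of the gotcha that `samp` avoids:

```{r sample-gotcha}
set.seed(1)
sample(5)    # a permutation of 1:5, not the single value 5
sample(5:5)  # same gotcha: 5:5 also has length one
```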