Title: | Miscellaneous R Functions and Aliases |
---|---|
Description: | Provides utility functions for, and drawing on, the 'data.table' package. The package also collates useful miscellaneous functions extending base R not available elsewhere. The name is a portmanteau of 'utils' and the author. |
Authors: | Hugh Parsonage [aut, cre], Michael Frasco [ctb], Ben Hamner [ctb] |
Maintainer: | Hugh Parsonage <[email protected]> |
License: | GPL-3 |
Version: | 1.8.2 |
Built: | 2024-11-01 03:24:41 UTC |
Source: | https://github.com/hughparsonage/hutils |
The package attempts to provide lightweight, fast, and stable functions for common operations.
By lightweight, I mean in terms of dependencies: we import package:data.table and package:fastmatch, which do require compilation, but in C. None of the other dependencies requires compilation.
By fast, I mean essentially as fast as possible without using compilation.
By stable, I mean that unit tests should not change unless the major version also changes. To make this completely transparent, tests include the version of their introduction and are guaranteed to not be modified (not even in the sense of adding extra, independent tests) while the major version is 1. Tests that do not include the version in their filename may be modified from version to version (though this will be avoided).
A common blunder in R programming is to mistype one of a set of filters without realizing. This function will error if any member of the values to be matched against is not present.
lhs %ein% rhs
lhs %enotin% rhs
lhs |
Values to be matched |
rhs |
Values to be matched against. |
Same as %in% and %notin%, unless an element of rhs is not present in lhs, in which case, an error.
# Incorrectly assumed to include two Species
iris[iris$Species %in% c("setosa", "versicolour"), ]
## Not run:
# Error:
iris[iris$Species %ein% c("setosa", "versicolour"), ]
## End(Not run)
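The behaviour described above can be sketched in base R. This is a minimal illustration, not the hutils source: the operator name `%ein_sketch%` and the wording of the error message are assumptions made for this example.

```r
# Hypothetical stand-in for %ein%: behaves like %in%, but first checks
# that every value in `rhs` actually occurs somewhere in `lhs`.
`%ein_sketch%` <- function(lhs, rhs) {
  absent <- unique(rhs[!(rhs %in% lhs)])
  if (length(absent) > 0L) {
    stop("`rhs` contains values not present in `lhs`: ",
         paste(absent, collapse = ", "))
  }
  lhs %in% rhs
}

species <- c("setosa", "versicolor", "virginica")

# Correctly spelled filter: identical to %in%
ok <- species %ein_sketch% c("setosa", "versicolor")

# Misspelled filter: an error rather than a silently smaller result
misspelt <- tryCatch(species %ein_sketch% c("setosa", "versicolour"),
                     error = function(e) "error")
```

The check runs on `rhs` only, so the cost is one extra matching pass before the usual `%in%`.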
Negation of in (character)
x %notchin% y
x |
Values to be matched. |
y |
Values to be matched against. |
If y is NULL, then the result is TRUE for every element of x, for consistency with %in%. If x and y are not both character, the function simply falls back to %in% rather than erroring.
Negation of in
x %notin% y
x |
Values to be matched |
y |
Values to be matched against. |
If y is NULL, then x is TRUE for consistency with %in%. Note that the function uses fmatch internally for performance on large y. Accordingly, y will be modified by adding a .match.hash attribute and thus must not be used in packages where y is a constant, or for things like names of data.table.
Analogue of %in% but indicating partial match of the left operand.
x %pin% Y
x |
The values to be matched. Same as |
Y |
A vector of values (perl regular expressions) to be matched against. |
TRUE for every x for which any grepl is TRUE.
x <- c("Sydney Airport", "Melbourne Airport")
x %pin% c("Syd", "Melb")
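As a rough illustration of the partial-match semantics, the following base-R sketch ORs together one `grepl` call per pattern. The function name `pin_sketch` is hypothetical and the loop is an assumption about how such an operator could be written, not the package's implementation.

```r
# Hypothetical sketch of %pin%: TRUE for each element of x that matches
# at least one of the perl regular expressions in Y.
pin_sketch <- function(x, Y) {
  out <- logical(length(x))
  for (pattern in Y) {
    out <- out | grepl(pattern, x, perl = TRUE)
  }
  out
}

airports <- c("Sydney Airport", "Melbourne Airport", "Perth Airport")
matched <- pin_sketch(airports, c("Syd", "Melb"))
```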
Present since hutils 1.2.0.
ahull( DT, x = DT$x, y = DT$y, minH = 0, minW = 0, maximize = "area", incl_negative = FALSE )
DT , x , y
|
Coordinates of a curve containing a rectangle.
Either as a list, |
minH |
The minimum height of the rectangles. |
minW |
The minimum width of the rectangles. |
maximize |
How the rectangle should be selected. Currently, only |
incl_negative |
Should areas below the x-axis be considered? |
A data.table: The coordinates of a rectangle, from (0, 0), (1, 0), (1, 1), (0, 1), south-west clockwise, that is contained within the area of the chart for positive values only.
ahull(, c(0, 1, 2, 3, 4), c(0, 1, 2, 0, 0))
These simple aliases can be useful to avoid operator precedence ambiguity, or to make use of indents from commas within your text editor. The all-caps versions accept single-length (capable of 'short-circuits') logical conditions only.
Neithers and nors are identical except that they have slightly different short-circuits. NOR uses negation once so may be quicker if the first argument is very, very prompt.
AND(x, y)
OR(x, y)
nor(x, y)
neither(x, y)
NOR(x, y)
NEITHER(x, y)
pow()
XOR(x, y)
x , y
|
Logical conditions. |
Present since hutils 1.2.0.
all_same_sign(x)
x |
A numeric vector. |
TRUE if all elements of x have the same sign. Zero is a separate sign from positive and negative. All vectors of length-1 or length-0 return TRUE, even if x = NA (since although the value is unknown, it must have a unique sign), as does non-numeric x.
all_same_sign(1:10)
all_same_sign(1:10 - 1)
all_same_sign(0)
all_same_sign(NA)
all_same_sign(c(NA, 1))
all_same_sign("surprise?")
all_same_sign(c(0, 0.1 + 0.2 - 0.3))
if (requireNamespace("microbenchmark", quietly = TRUE)) {
  library(microbenchmark)
  microbenchmark(base = length(unique(sign(1:1e5), nmax = 3)) == 1L,
                 all_same_sign(1:1e5))
}
# Unit: microseconds
#                   expr  min   lq mean median   uq  max neval cld
#                   base 2012 2040 2322   2047 2063 9324   100   b
# all_same_sign(1:1e+05)   86   86   94     89   93  290   100  a
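The rules stated above (zero as its own sign, TRUE for length-0, length-1 and non-numeric input) can be sketched in a few lines of base R. The name `all_same_sign_sketch` is hypothetical, and the handling of NA in longer vectors is left as whatever `all()` returns, which may differ from the package.

```r
# Hypothetical sketch: compare every sign against the first one.
all_same_sign_sketch <- function(x) {
  # Non-numeric, empty and length-one inputs are TRUE by convention
  if (!is.numeric(x) || length(x) <= 1L) {
    return(TRUE)
  }
  s <- sign(x)          # -1, 0 or 1, so zero counts as a separate sign
  all(s == s[1L])
}

all_same_sign_sketch(1:10)                    # all positive
all_same_sign_sketch(1:10 - 1)                # starts at zero
all_same_sign_sketch(c(0, 0.1 + 0.2 - 0.3))   # floating-point residue is not zero
```

The last call shows why the documented example includes `0.1 + 0.2 - 0.3`: the result of that arithmetic is a tiny positive number, so its sign differs from that of an exact zero.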
Shortcut for any(grepl(...)), mostly for consistency.
any_grepl( x, pattern, perl = TRUE, ignore.case = FALSE, fixed = FALSE, quiet = FALSE )
x |
A character vector. |
pattern , perl , ignore.case , fixed
|
As in |
quiet |
(logical, default: |
From version v1.4.0, any_grepl(a, bb) will be internally reversed to any_grepl(bb, a) if length(bb) > 1 and length(a) == 1.
any_grepl(c("A_D_E", "K0j"), "[a-z]")
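The argument reversal described above can be illustrated with a small base-R sketch. The name `any_grepl_sketch` is hypothetical; the `quiet` argument is omitted, and whether the package emits a message when it swaps is not assumed here.

```r
# Hypothetical sketch of any(grepl(...)) with the documented swap:
# if `pattern` has several elements and `x` has exactly one,
# treat the single string as the pattern instead.
any_grepl_sketch <- function(x, pattern, perl = TRUE,
                             ignore.case = FALSE, fixed = FALSE) {
  if (length(pattern) > 1L && length(x) == 1L) {
    swapped <- x
    x <- pattern
    pattern <- swapped
  }
  any(grepl(pattern, x,
            perl = perl && !fixed,   # perl and fixed cannot both be TRUE
            ignore.case = ignore.case,
            fixed = fixed))
}

usual_order <- any_grepl_sketch(c("A_D_E", "K0j"), "[a-z]")
reversed    <- any_grepl_sketch("[a-z]", c("A_D_E", "K0j"))
</gr_insert_placeholder>
```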
Returns the area under the curve ("AUC") of a receiver-operating characteristic curve for the given predicted and actual values.
auc(actual, pred)
actual |
Logical vector: |
pred |
Numeric (double) vector the same length as |
Copyright (c) 2012, Ben Hamner
Author: Ben Hamner ([email protected])
All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Source code based on Metrics::auc from Ben Hamner and Michael Frasco and Erin LeDell from the Metrics package.
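For readers unfamiliar with the quantity, one standard way to compute the area under the ROC curve is the rank-sum (Mann-Whitney) formulation, sketched below in base R. The name `auc_sketch` is hypothetical, and this is offered as an illustration of the statistic rather than a copy of the package's code.

```r
# Hypothetical sketch: AUC as the probability that a randomly chosen
# positive case is ranked above a randomly chosen negative case.
auc_sketch <- function(actual, pred) {
  actual <- as.logical(actual)
  r <- rank(pred)                    # ties receive average ranks
  n_pos <- sum(actual)
  n_neg <- length(actual) - n_pos
  (sum(r[actual]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

actual <- c(TRUE, TRUE, FALSE, FALSE)
perfect <- auc_sketch(actual, c(0.9, 0.8, 0.3, 0.1))  # positives all on top
partial <- auc_sketch(actual, c(0.9, 0.2, 0.3, 0.1))  # one positive outranked
```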
Average of bearings
average_bearing(theta1, theta2, average_of_opposite = NULL)
average_bearing_n(thetas)
theta1 , theta2
|
Bearings, expressed in degrees. |
average_of_opposite |
The average of opposing bearings (e.g. average of north
and south) is not well-defined. If |
thetas |
A vector of bearings. |
For 'average_bearing', the bearing bisecting the two bearings.
For 'average_bearing_n', the average bearing of the bearings.
average_bearing(0, 90)
average_bearing(0, 270)
average_bearing(90, 180)
average_bearing(0, 180)
average_bearing(0, 180, average_of_opposite = 3)
average_bearing(0, 180, average_of_opposite = "left")
average_bearing_n(1:179)
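Bearings cannot be averaged arithmetically because they wrap at 360 degrees. A common approach, sketched here under the assumption that it is a reasonable model of the functions above, is to average the unit vectors and convert back to an angle. The name `average_bearing_sketch` is hypothetical, and opposing bearings remain ill-defined in this sketch, just as the argument description notes.

```r
# Hypothetical sketch: circular mean of bearings given in degrees.
average_bearing_sketch <- function(thetas) {
  rad <- thetas * pi / 180
  # Bearings are measured clockwise from north, so sin gives the
  # easterly component and cos the northerly component.
  mean_rad <- atan2(mean(sin(rad)), mean(cos(rad)))
  (mean_rad * 180 / pi) %% 360
}

north_east <- average_bearing_sketch(c(0, 90))
north_west <- average_bearing_sketch(c(0, 270))
</gr_insert_placeholder>
```

Note that the naive `mean(c(0, 270))` would give 135, which points in the opposite quadrant.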
Bearing calculations
bearing(lat_orig, lon_orig, lat_dest, lon_dest)
compass2bearing(compass)
easterly_component(compass)
northerly_component(compass)
lat_orig , lon_orig , lat_dest , lon_dest
|
Latitude and longitude of origin and destination. |
compass |
A character vector of compass rose points, such as
|
bearing
An approximate bearing from _orig to _dest.
compass2bearing
The bearing encoded by the compass input.
easterly_component
The easterly component of a unit vector pointing in the direction provided.
bearing(0, 0, 90, 0)
bearing(-35, 151, 51, 0)
compass2bearing("NW")
easterly_component("E")
easterly_component("NW")
Lightweight version of dplyr::coalesce, with all the vices and virtues that come from such an approach.
Very similar logic (and timings) to dplyr::coalesce, though no ability to use quosures etc.
One exception is that if x does not contain any missing values, it is returned immediately, and ignores .... For example, dplyr::coalesce(1:2, 1:3) is an error, but hutils::coalesce(1:2, 1:3) is not.
coalesce(x, ...)
x |
A vector |
... |
Successive vectors whose values will replace the corresponding values in |
x with missing values replaced by the first non-missing corresponding elements in the ... arguments.
That is, if ... = A, B, C and x[i] is missing, then x[i] is replaced by A[i]. If x[i] is still missing (i.e. A[i] was itself NA), then it is replaced by B[i], C[i] until it is no longer missing or the list has been exhausted.
Original source code but obviously inspired by dplyr::coalesce.
coalesce(c(1, NA, NA, 4), c(1, 2, NA, NA), c(3, 4, 5, NA))
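The replacement rule and the early return can be sketched in base R as follows. The name `coalesce_sketch` is hypothetical, and the silent recycling via `rep_len` is a simplification made for this example; the package may validate lengths differently.

```r
# Hypothetical sketch: fill gaps in x from each later argument in turn.
coalesce_sketch <- function(x, ...) {
  # Early return: with nothing missing, the other arguments are ignored
  if (!anyNA(x)) {
    return(x)
  }
  for (candidate in list(...)) {
    gaps <- is.na(x)
    if (!any(gaps)) break
    x[gaps] <- rep_len(candidate, length(x))[gaps]
  }
  x
}

filled <- coalesce_sketch(c(1, NA, NA, 4), c(1, 2, NA, NA), c(3, 4, 5, NA))
untouched <- coalesce_sketch(1:2, 1:3)   # no NA in x, so no length check occurs
```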
Simply a wrapper around dev.copy2pdf, but without the need to remember that an A4 sheet of paper is 8.27 in by 11.69 in.
dev_copy2a4(filename, ...)
filename |
A string giving the name of the PDF file to write to, must end in |
... |
Other parameters passed to |
As in dev2.
(Windows only) Same as list.files but much faster.
Present since v1.4.0.
dir2( path = ".", file_ext = NULL, full.names = TRUE, recursive = TRUE, pattern = NULL, fixed = FALSE, perl = TRUE && missing(fixed) && !fixed, ignore.case = FALSE, invert = FALSE, .dont_use = FALSE )
path |
A string representing the trunk path to search within. |
file_ext |
A string like '*.txt' or '.csv' to limit the result to files with that extension. |
full.names |
|
recursive |
|
pattern , perl , ignore.case , fixed , invert
|
As in |
.dont_use |
Only used for tests to simulate non-Windows systems. |
The same as list.files, a character vector of files sought.
Drop column or columns
drop_col(DT, var, checkDT = TRUE)
drop_cols(DT, vars, checkDT = TRUE)
DT |
A |
var |
Quoted column to drop. |
checkDT |
Should the function check |
vars |
Character vector of columns to drop. Only the intersection is dropped;
if any |
DT with specified columns removed.
if (requireNamespace("data.table", quietly = TRUE)) {
  library(data.table)
  DT <- data.table(x = 1, y = 2, z = 3)
  drop_col(DT, "x")
}
drop_colr present since hutils 1.0.0.
drop_grep is identical but only present since hutils 1.2.0.
drop_colr(DT, pattern, ..., checkDT = TRUE)
DT |
A |
pattern |
A regular expression as in |
... |
Arguments passed to |
checkDT |
If |
library(data.table)
dt <- data.table(x1 = 1, x2 = 2, y = 3)
drop_grep(dt, "x")
Drops columns that have only one value in a data.table.
drop_constant_cols(DT, copy = FALSE)
DT |
A |
copy |
(logical, default: |
If DT is a data.frame that is not a data.table, constant columns are still dropped, but since DT will be copied, copy should be set to TRUE to avoid a warning. If DT is a data.frame and all but one of the columns are constant, a data.frame will still be returned, as opposed to the values of the sole remaining column, which is the default behaviour of base data.frame.
If all columns are constant, drop_constant_cols returns a Null data table if DT is a data.table, but a data frame with 0 columns and nrow(DT) rows otherwise.
library(data.table)
X <- data.table(x = c(1, 1), y = c(1, 2))
drop_constant_cols(X)
Removes columns from a data.table where all the values are missing.
drop_empty_cols(DT, copy = FALSE)
DT |
A |
copy |
Copies the |
This function differs from duplicated
in that it returns both the duplicate row and the row which has been duplicated.
This may prove useful in combination with the by
argument for determining whether two observations are identical across
more than just the specified columns.
duplicated_rows( DT, by = names(DT), na.rm = FALSE, order = TRUE, copyDT = TRUE, na.last = FALSE )
DT |
A |
by |
Character vector of columns to evaluate duplicates over. |
na.rm |
(logical) Should |
order |
(logical) Should the result be ordered so that duplicate rows are adjacent? (Default |
copyDT |
(logical) Should |
na.last |
(logical) If |
Duplicate rows of DT by by. For interactive use.
if (requireNamespace("data.table", quietly = TRUE)) {
  library(data.table)
  DT <- data.table(x = rep(1:4, 3),
                   y = rep(1:2, 6),
                   z = rep(1:3, 4))
  # No duplicates
  duplicated_rows(DT)
  # x and y have duplicates
  duplicated_rows(DT, by = c("x", "y"), order = FALSE)
  # By default, the duplicate rows are presented adjacent to each other.
  duplicated_rows(DT, by = c("x", "y"))
}
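The difference from base `duplicated` can be seen with a plain data frame: flagging duplicates from both ends marks the original row as well as its repeats. This sketch uses only base R and a `data.frame` rather than a `data.table`; the variable names are illustrative.

```r
df <- data.frame(x = rep(1:4, 3),
                 y = rep(1:2, 6),
                 z = rep(1:3, 4))
by <- c("x", "y")

# base::duplicated flags only the second and later occurrences
later_only <- duplicated(df[by])

# Scanning from both ends also flags the first occurrence
both_ends <- later_only | duplicated(df[by], fromLast = TRUE)

n_later <- sum(later_only)
n_both <- sum(both_ends)
any_full_duplicates <- any(duplicated(df))   # across all three columns
```

Here every (x, y) pair occurs three times, so all twelve rows are flagged when scanning from both ends, while no row is duplicated across all three columns.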
goto_pattern_in present from 1.6.0
find_pattern_in( file_contents, basedir = ".", dir_recursive = TRUE, reader = readLines, include.comments = FALSE, comment.char = NULL, use.OS = FALSE, file_pattern = "\\.(R|r)(nw|md)?$", file_contents_perl = TRUE, file_contents_fixed = FALSE, file_contents_ignore_case = FALSE, file.ext = NULL, which_lines = c("first", "all") )
goto_pattern_in(file_contents, ...)
file_contents |
A perl-regular expression as a search query. |
basedir |
The root of the directory tree in which files will be searched recursively. |
dir_recursive |
(logical, default: |
reader |
A function, akin to |
include.comments |
If |
comment.char |
If |
use.OS |
Use the operating system to determine file list. Only available on Windows. If it fails, a fall-back option
(using |
file_pattern |
A regular expression passed to |
file_contents_perl |
(logical, default: |
file_contents_fixed |
(logical, default: |
file_contents_ignore_case |
(logical, default: |
file.ext |
A file extension passed to the operating system if |
which_lines |
One of |
... |
Arguments passed to |
For convenience, if file_contents appears to be a directory and basedir does not, the arguments are swapped, but with a warning.
A data.table, showing the matches per file.
goto_pattern_in additionally prompts for a row of the returned results. Using the rstudioapi, if available, RStudio will jump to the file and line number.
Utilities for 'fst' files
fst_columns(file.fst)
fst_nrow(file.fst)
file.fst |
Path to file. |
Various outputs:
fst_columns
Returns the names of the columns in file.fst.
fst_nrow
Returns the number of rows in file.fst.
Generate LaTeX manual of installed package
generate_LaTeX_manual(pkg, launch = TRUE)
pkg |
Quoted package name (must be installed). |
launch |
Should the PDF created be launched using the viewer ( |
See system.
Called for its side-effect: creates a PDF in the current working directory. Requires a TeX distribution.
https://stackoverflow.com/a/30608000/1664978
Distance between two points on the Earth
haversine_distance(lat1, lon1, lat2, lon2)
lat1 , lon1 , lat2 , lon2
|
The latitudes and longitudes of the two points. |
This is reasonably accurate for distances in the order of 1 to 1000 km.
The distance in kilometres between the two points.
# Distance from YMEL to YSSY
haversine_distance(-37 - 40/60, 144 + 50/60, -33 - 56/60, 151 + 10/60)
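For reference, the standard haversine formula can be sketched in base R as below. The function name `haversine_sketch` and the mean Earth radius of 6371 km are assumptions made for this example; the package may use a different radius or further approximations.

```r
# Hypothetical sketch of the haversine great-circle distance in km.
# Inputs are in decimal degrees.
haversine_sketch <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  # pmin guards against rounding pushing sqrt(a) slightly above 1
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# Melbourne Airport to Sydney Airport, same coordinates as the example above
mel_syd <- haversine_sketch(-37 - 40/60, 144 + 50/60, -33 - 56/60, 151 + 10/60)

# A quarter of the equator, useful as a sanity check
quarter <- haversine_sketch(0, 0, 0, 90)
```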
Lightweight dplyr::if_else with the virtues and vices that come from such an approach.
Attempts to replicate dplyr::if_else but written in base R for faster compile time. hutils::if_else should be faster than dplyr::if_else ... when it works, but will not work on lists or on factors. Additional attributes may be dropped.
if_else(condition, true, false, missing = NULL)
condition |
Logical vector. |
true , false
|
Where condition is |
missing |
If condition is |
If the result is expected to be a factor then the conditions for type safety are strict and may be made stricter in future.
Where condition is TRUE, the corresponding value in true; where condition is FALSE, the corresponding value in false. Where condition is NA, the corresponding value in missing, unless missing is NULL (the default), in which case the value will be NA (with the same type as true).
Original code but obviously heavily inspired by https://CRAN.R-project.org/package=dplyr.
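The three-way rule for TRUE, FALSE and NA conditions can be sketched in base R as follows. The name `if_else_sketch` is hypothetical, the type check is a deliberately simple assumption, and recycling through `rep_len` is a shortcut for this example rather than the package's validation logic.

```r
# Hypothetical sketch of a type-stable if/else over a logical vector.
if_else_sketch <- function(condition, true, false, missing = NULL) {
  stopifnot(is.logical(condition), typeof(true) == typeof(false))
  n <- length(condition)
  out <- rep_len(false, n)             # start from the FALSE branch
  yes <- which(condition)              # which() drops NA positions
  out[yes] <- rep_len(true, n)[yes]
  na <- which(is.na(condition))
  # Assigning a bare NA keeps the type of `out`
  out[na] <- if (is.null(missing)) NA else rep_len(missing, n)[na]
  out
}

default_na <- if_else_sketch(c(TRUE, FALSE, NA), "a", "b")
explicit_na <- if_else_sketch(c(TRUE, FALSE, NA), "a", "b", missing = "m")
```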
Returns the result of .
implies(x, y)
x %implies% y
x , y
|
Logical vectors of the same length. |
Logical implies: TRUE unless x is TRUE and y is FALSE.
NA in either x or y results in NA if and only if the result is unknown. In particular NA %implies% TRUE is TRUE and FALSE %implies% NA is TRUE.
If x or y are length-one, the function proceeds as if the length-one vector were recycled to the length of the other.
library(data.table)
CJ(x = c(TRUE, FALSE), y = c(TRUE, FALSE))[, ` x => y` := x %implies% y][]
#>        x     y  x => y
#> 1: FALSE FALSE    TRUE
#> 2: FALSE  TRUE    TRUE
#> 3:  TRUE FALSE   FALSE
#> 4:  TRUE  TRUE    TRUE
# NA results:
#> 5:    NA    NA      NA
#> 6:    NA FALSE      NA
#> 7:    NA  TRUE    TRUE
#> 8: FALSE    NA    TRUE
#> 9:  TRUE    NA      NA
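The truth table above coincides with the base-R expression `!x | y`, because R's three-valued logic already returns NA only when the answer is genuinely unknown. The name `implies_sketch` is hypothetical; this is a sketch of the logic, not the package's code.

```r
# Hypothetical sketch: material implication via three-valued logic.
implies_sketch <- function(x, y) !x | y

known <- implies_sketch(c(FALSE, FALSE, TRUE, TRUE),
                        c(FALSE, TRUE, FALSE, TRUE))

na_then_true  <- implies_sketch(NA, TRUE)    # TRUE whatever x is
false_then_na <- implies_sketch(FALSE, NA)   # TRUE whatever y is
true_then_na  <- implies_sketch(TRUE, NA)    # depends on y, so NA
```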
Is a package attached?
isAttached(pkg)
pkg |
Either character or unquoted. |
TRUE if pkg is attached.
Logical assertions
isTrueFalse(x)
x |
An object whose values are to be checked. |
For isTrueFalse, TRUE if and only if x is TRUE or FALSE identically (perhaps with attributes).
Longest common prefix/suffix
trim_common_affixes( x, .x = NULL, na.rm = TRUE, prefixes = TRUE, suffixes = TRUE, warn_if_no_prefix = TRUE, warn_if_no_suffix = TRUE )
longest_suffix(x, .x = NULL, na.rm = TRUE, warn_if_no_suffix = TRUE)
longest_prefix(x, .x = NULL, na.rm = TRUE, warn_if_no_prefix = TRUE)
x |
A character vector. |
.x |
If |
na.rm |
(logical, default: If |
prefixes |
(logical, default: |
suffixes |
(logical, default: |
warn_if_no_prefix , warn_if_no_suffix
|
(logical, default: |
The longest common substring in x either at the start or end of each string.
For trim_common_affixes, x with common prefix and common suffix removed.
longest_prefix(c("totalx", "totaly", "totalz"))
longest_suffix(c("ztotal", "ytotal", "xtotal"))
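One simple way to find a longest common prefix is to shrink a candidate taken from the first string until every string starts with it; the suffix case follows by reversing the strings. The names `longest_prefix_sketch` and `longest_suffix_sketch` are hypothetical, NA values are simply dropped, and no warnings are issued, unlike the documented functions.

```r
# Hypothetical sketch: shrink the candidate prefix until all strings share it.
longest_prefix_sketch <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0L) return("")
  n <- min(nchar(x))
  first <- x[1L]
  while (n > 0L && !all(startsWith(x, substr(first, 1L, n)))) {
    n <- n - 1L
  }
  substr(first, 1L, n)
}

# The suffix is the reversed prefix of the reversed strings.
longest_suffix_sketch <- function(x) {
  reverse <- function(s) {
    vapply(strsplit(s, ""), function(ch) paste(rev(ch), collapse = ""), "")
  }
  x <- x[!is.na(x)]
  reverse(longest_prefix_sketch(reverse(x)))
}

prefix <- longest_prefix_sketch(c("totalx", "totaly", "totalz"))
suffix <- longest_suffix_sketch(c("ztotal", "ytotal", "xtotal"))
```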
Proportion of values that are NA.
mean_na(v)
v |
A vector. |
A double, mean(is.na(v)).
Present since hutils 1.4.0. The most common element.
Mode(x)
x |
A vector for which the mode is desired. |
The most common element of x.
If the mode is not unique, only one of these values is returned, for simplicity.
If x has length zero, Mode(x) = x.
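A compact base-R way to obtain a mode is to tabulate positions in the vector of unique values. The name `mode_sketch` is hypothetical; in this sketch ties resolve to the value encountered first, which is an assumption and may not be the tie-break the package uses.

```r
# Hypothetical sketch: most common element via match() and tabulate().
mode_sketch <- function(x) {
  if (length(x) == 0L) return(x)      # length zero: return x unchanged
  ux <- unique(x)
  counts <- tabulate(match(x, ux), nbins = length(ux))
  ux[which.max(counts)]               # which.max picks the first maximum
}

common <- mode_sketch(c("a", "b", "b", "c"))
tied <- mode_sketch(c(1L, 2L, 2L, 1L))
empty <- mode_sketch(integer(0))
```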
Add a column of ntiles to a data table
mutate_ntile( DT, col, n, weights = NULL, by = NULL, keyby = NULL, new.col = NULL, character.only = FALSE, overwrite = TRUE, check.na = FALSE )
DT |
A |
col |
The column name (quoted or unquoted) for which quantiles are desired. |
n |
A positive integer, the number of groups to split |
weights |
If |
by , keyby
|
Produce a grouped quantile column, as in |
new.col |
If not |
character.only |
(logical, default: |
overwrite |
(logical, default: |
check.na |
(logical, default: |
DT with a new integer column new.col containing the quantiles. If DT is not a data.table its class may be preserved unless keyby is used, where it will always be a data.table.
library(data.table)
DT <- data.table(x = 1:20, y = 2:1)
mutate_ntile(DT, "x", n = 10)
mutate_ntile(DT, "x", n = 5)
mutate_ntile(DT, "x", n = 10, by = "y")
mutate_ntile(DT, "x", n = 10, keyby = "y")
y <- "x"
DT <- data.table(x = 1:20, y = 2:1)
mutate_ntile(DT, y, n = 5) # Use DT$y
mutate_ntile(DT, y, n = 5, character.only = TRUE) # Use DT$x
Useful when you want to constrain the number of unique values in a column by keeping only the most common values.
mutate_other( .data, var, n = 5, count, by = NULL, var.weight = NULL, mass = NULL, copy = TRUE, other.category = "Other" )
.data |
Data containing variable. |
var |
Variable containing infrequent entries, to be collapsed into "Other". |
n |
Threshold for total number of categories above "Other". |
count |
Threshold for total count of observations before "Other". |
by |
Extra variables to group by when calculating |
var.weight |
Variable to act as a weight: |
mass |
Threshold for sum of |
copy |
Should |
other.category |
Value that infrequent entries are to be collapsed into. Defaults to |
.data but with var changed so that infrequent values have the same value (other.category).
library(data.table)
library(magrittr)
DT <- data.table(City = c("A", "A", "B", "B", "C", "D"),
                 value = c(1, 9, 4, 4, 5, 11))
DT %>%
  mutate_other("City", var.weight = "value", mass = 10) %>%
  .[]
It is not simple to negate a regular expression. This function obviates the need by taking the long way round: negating the corresponding grepl call.
ngrep(pattern, x, value = FALSE, ...)
x , value , pattern
|
As in |
... |
Arguments passed to |
If value is FALSE (the default), indices of x which do not match the pattern; if TRUE, the values of x themselves.
grep("[a-h]", letters)
ngrep("[a-h]", letters)
txt <- c("The", "licenses", "for", "most", "software", "are", "designed",
         "to", "take", "away", "your", "freedom", "to", "share", "and",
         "change", "it.", "", "By", "contrast,", "the", "GNU", "General",
         "Public", "License", "is", "intended", "to", "guarantee", "your",
         "freedom", "to", "share", "and", "change", "free", "software",
         "--", "to", "make", "sure", "the", "software", "is", "free",
         "for", "all", "its", "users")
grep("[gu]", txt, value = TRUE)
ngrep("[gu]", txt, value = TRUE)
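The "long way round" amounts to negating `grepl` and then returning either positions or values. The following base-R sketch shows that idea under the hypothetical name `ngrep_sketch`; it is not the package's source.

```r
# Hypothetical sketch: elements of x that do NOT match the pattern.
ngrep_sketch <- function(pattern, x, value = FALSE, ...) {
  keep <- !grepl(pattern, x, ...)
  if (value) x[keep] else which(keep)
}

positions <- ngrep_sketch("[a-h]", letters)
values <- ngrep_sketch("[a-h]", letters, value = TRUE)
```

Base R's `grep(pattern, x, invert = TRUE)` gives the same positions for this input; the sketch simply makes the negation explicit.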
Tests whether all vectors have the same length.
prohibit_unequal_length_vectors(...)
... |
Vectors to test. |
An error message unless all of ... have the same length in which case NULL, invisibly.
Tests (harshly) whether the vectors can be recycled safely.
prohibit_vector_recycling(...)
prohibit_vector_recycling.MAXLENGTH(...)
... |
A list of vectors |
An error message if the vectors are of different length (unless the alternative length is 1).
The functions differ in their return values on success: prohibit_vector_recycling.MAXLENGTH returns the maximum of the lengths whereas prohibit_vector_recycling returns NULL. (Both functions return their values invisibly.)
## Not run:
# Returns nothing because they are of the same length
prohibit_vector_recycling(c(2, 2), c(2, 2))
# Returns nothing also, because the only different length is 1
prohibit_vector_recycling(c(2, 2), 1)
# Returns an error:
prohibit_vector_recycling(c(2, 2), 1, c(3, 3, 3))
## End(Not run)
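The rule being enforced is that every argument must have either length 1 or the common maximum length. A base-R sketch of that check follows, modelled on the MAXLENGTH variant's return value. The name `check_recycling_sketch`, the error wording, and the treatment of a call with no arguments are assumptions made for this example.

```r
# Hypothetical sketch: permit only lengths equal to 1 or to the maximum.
check_recycling_sketch <- function(...) {
  lens <- lengths(list(...))
  if (length(lens) == 0L) return(invisible(0L))
  max_len <- max(lens)
  offending <- which(lens != 1L & lens != max_len)
  if (length(offending) > 0L) {
    stop("Arguments at positions ", paste(offending, collapse = ", "),
         " have lengths that are neither 1 nor ", max_len, ".")
  }
  invisible(max_len)
}

same_length <- check_recycling_sketch(c(2, 2), c(2, 2))
with_scalar <- check_recycling_sketch(c(2, 2), 1)
mismatch <- tryCatch(check_recycling_sketch(c(2, 2), 1, c(3, 3, 3)),
                     error = function(e) "error")
```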
Provide directory. Create directory only if it does not exist.
provide.dir(path, ...)
path |
Path to create. |
... |
Passed to |
path on success, the empty string character(1) on failure.
Present since hutils v1.5.0.
provide.file(path, on_failure = "")
path |
A string. The path to a filename that requires existence. |
on_failure |
The return value on failure. By default, an empty string. |
path for success. Or on_failure if the path cannot be provided.
Replace string pattern in text file
replace_pattern_in( file_contents, replace, basedir = ".", dir_recursive = TRUE, reader = readLines, file_pattern = "\\.(R|r)(nw|md)?$", file_contents_perl = TRUE, file_contents_fixed = FALSE, file_contents_ignore_case = FALSE, writer = writeLines )
file_contents |
Character string containing a regular expression to be matched in the
given character vector. Passed to |
replace |
The replacement, passed to |
basedir |
The root of the directory tree in which files will be searched recursively. |
dir_recursive |
(logical, default: |
reader |
A function, akin to |
file_pattern |
A regular expression passed to |
file_contents_perl |
(logical, default: |
file_contents_fixed |
(logical, default: |
file_contents_ignore_case |
(logical, default: |
writer |
A function that will rewrite the file from the character vector read in. |
Provides a consistent style for errors and warnings.
report_error( faulty_input, error_condition, requirement, context = NULL, advice, hint = NULL, halt = TRUE )
faulty_input |
Unquoted function argument that is the cause of the error condition. |
error_condition |
A sentence explaining the condition that invoked the error. |
requirement |
A sentence that explains what is required. |
context |
(Optional) A sentence that contextualizes the error |
advice |
Advice for the user to avoid the error. |
hint |
If the input can be guessed, |
halt |
(logical, default: |
requireNamespace
Present since hutils v1.2.0. Alias for if (!requireNamespace(pkg, quietly = TRUE)) yes else no.
Typical use-case would be RQ(pkg, install.packages("pkg")).
Default values for yes and no from hutils v1.5.0.
This function is not recommended for use in scripts as it is a bit cryptic; its use-case is for bash scripts and the like where calls like this would otherwise be frequent and cloud the message.
RQ(pkg, yes = NULL, no = NULL)
pkg |
Package to test whether the package is not yet installed. |
yes |
Response if |
no |
(optional) Response if |
## Not run:
RQ("dplyr", "dplyr needs installing")
## End(Not run)
Present since hutils v1.4.0. Same as sample, but avoiding the behaviour when length(x) == 1L.
samp(x, size = length(x), replace = size > length(x), loud = TRUE, prob = NULL)
x |
A vector. |
size |
A non-negative integer, the number of items to return. |
replace |
Should the sampling be done with replacement? Defaults to |
loud |
If |
prob |
As in |
samp(1:5)
sample(1:5)
samp(1:5, size = 10) # no error
tryCatch(sample(1:5, size = 10), error = function(e) print(e$m))
samp(5, size = 3)
sample(5, size = 3)
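The behaviour being avoided is that base R's sample treats a single number x >= 1 as the range 1:x. The following is a minimal base R sketch of sampling by position instead; samp_sketch is a hypothetical name, it ignores the loud and prob arguments, and it is not claimed to be how samp is implemented.

```r
set.seed(1)

# Base R: a single number >= 1 is treated as 1:x,
# so these three draws come from 1:5, not from the value 5 itself.
draw <- sample(5, size = 3)
all(draw %in% 1:5)

# Hypothetical sketch: sample positions, then index into x,
# so a length-one x is never expanded to a range.
samp_sketch <- function(x, size = length(x), replace = size > length(x)) {
  x[sample.int(length(x), size = size, replace = replace)]
}

samp_sketch(5, size = 3)              # 5 5 5
length(samp_sketch(1:5, size = 10))   # 10, sampled with replacement
```

Defaulting replace to size > length(x) mirrors the usage line above and is what lets a request for more items than are available succeed rather than error.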
Select names matching a pattern
select_grep( DT, patterns, .and = NULL, .but.not = NULL, ignore.case = FALSE, perl = TRUE, fixed = FALSE, useBytes = FALSE, invert = FALSE, .warn.fixed.mismatch = TRUE )
DT |
A |
patterns |
Regular expressions to be matched against the names of |
.and |
Character or integer positions of names to select, regardless of whether or not they are matched by |
.but.not |
Character or integer positions of names to drop, regardless of whether or not they are matched by |
ignore.case, perl, fixed, useBytes, invert |
Arguments passed to |
.warn.fixed.mismatch |
(logical, default: |
DT with the selected names.
integer vector of positions
library(data.table)
dt <- data.table(x1 = 1, x2 = 2, y = 0)
select_grep(dt, "x")
select_grep(dt, "x", .and = "y")
select_grep(dt, "x", .and = "y", .but.not = "x2")
Select columns satisfying a condition
select_which(DT, Which, .and.dots = NULL, checkDT = TRUE, .and.grep = NULL)
DT |
A |
Which |
A function that takes a vector and returns |
.and.dots |
Optional extra columns to include. May be a character vector of |
checkDT |
If |
.and.grep |
A character vector of regular expressions to match to the names
of |
DT with the selected variables.
library(data.table)
DT <- data.table(x = 1:5, y = letters[1:5], AB = c(NA, TRUE, FALSE))
select_which(DT, anyNA, .and.dots = "y")
data.table columns
Present since hutils 1.2.0.
selector(DT, ..., cols = NULL, preserve.key = TRUE, shallow = FALSE)
DT |
A |
... |
Unquoted column names. |
cols |
Character vector of column names. |
preserve.key |
(logical, default: |
shallow |
(logical, default: |
DT with the selected columns.
RQ("nycflights13", no = {
  library(nycflights13)
  library(data.table)
  fs <- as.data.table(flights)
  fs1 <- selector(fs, year, month, day, arr_delay)
  fs1[, arr_delay := NA]
})
Generate sequence of row numbers
seq_nrow(x)
x |
An object that admits an |
Equivalent to seq_len(nrow(x))
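A short base R illustration of why the seq_len(nrow(x)) form is usually preferred over 1:nrow(x). It uses only the documented equivalent, not seq_nrow itself, and the data frame empty is made up for the example.

```r
# A data frame with zero rows
empty <- data.frame(x = integer(0))

seq_len(nrow(empty))   # integer(0): a loop over this never runs
1:nrow(empty)          # 1 0: a loop over this runs twice, with invalid indices

# Safe: the body is skipped entirely for a zero-row object.
for (i in seq_len(nrow(empty))) stop("never reached")
```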
Reorder columns of a data.table (via setcolorder) so that particular columns appear first (or last), or in a particular order.
set_cols_first(DT, cols, intersection = TRUE)
set_cols_last(DT, cols, intersection = TRUE)
set_colsuborder(DT, cols, intersection = TRUE)
DT |
A data.table. |
cols |
Character vector of columns to put before (after) all others or, in the case of |
intersection |
Use the intersection of the names of |
In the case of set_colsuborder the group of columns cols occupy the same positions in DT but in a different order. See examples.
library(data.table)
DT <- data.table(y = 1:5, z = 11:15, x = letters[1:5])
set_cols_first(DT, "x")[]
set_cols_last(DT, "x")[]
set_colsuborder(DT, c("x", "y"))[]
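The three reorderings can be illustrated on a plain vector of column names. This is a base R sketch only: the helper names cols_first, cols_last and cols_suborder are hypothetical, every element of cols is assumed to be present in nms, and the suborder helper assumes that the new order within the occupied positions is the order given in cols.

```r
nms <- c("y", "z", "x")

# Named columns first, the rest in their existing order
cols_first <- function(nms, cols) c(cols, setdiff(nms, cols))

# Named columns last, the rest in their existing order
cols_last <- function(nms, cols) c(setdiff(nms, cols), cols)

# Named columns keep the same set of positions,
# but are placed into those positions in the order given by 'cols'
cols_suborder <- function(nms, cols) {
  pos <- which(nms %in% cols)
  nms[pos] <- cols
  nms
}

cols_first(nms, "x")               # "x" "y" "z"
cols_last(nms, "x")                # "y" "z" "x"
cols_suborder(nms, c("x", "y"))    # "x" "z" "y"
```

In the last call, z stays in the middle because it is not among cols, while x and y exchange the first and third positions.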
Swap values simultaneously. Present since hutils 1.4.0.
x %<->% value
x, value |
Objects whose values are to be reassigned by swapping. |
NULL invisibly. Called for its side-effect: the values of x and value are swapped. So x %<->% value is equivalent to
temp <- x
x <- value
value <- temp
rm(temp)
a <- 1
b <- 2
a %<->% b
a
b
Present since hutils 1.2.0. Vectorized version of switch. Used to avoid or make clearer the result of
if_else(Expr == , ..1, if_else(Expr == , ..2, ...))
Switch(Expr, ..., DEFAULT, IF_NA = NULL, MUST_MATCH = FALSE)
Expr |
A character vector. |
... |
As in |
DEFAULT |
A mandatory default value should any name of |
IF_NA |
Optional value to replace missing ( |
MUST_MATCH |
(logical, default: |
For every element of ... whose name matches an element of Expr, that element's value.
Switch(c("a", "b", "c", "a"),
       "a" = 1, "b" = 2, "c" = 3, "4" = 4,
       DEFAULT = 0)
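The idea of a vectorized switch can be sketched in base R as a named lookup. The function switch_sketch is hypothetical; it assumes scalar alternatives of a common type, omits IF_NA and MUST_MATCH, and is not claimed to reflect how Switch is implemented.

```r
# Hypothetical sketch: look each element of 'expr' up among the
# names of the alternatives, falling back to DEFAULT when unmatched.
switch_sketch <- function(expr, ..., DEFAULT) {
  lookup <- c(...)                    # named vector of alternatives
  idx <- match(expr, names(lookup))   # NA where no name matches
  out <- unname(lookup[idx])
  out[is.na(idx)] <- DEFAULT
  out
}

switch_sketch(c("a", "b", "c", "a"), a = 1, b = 2, c = 3, DEFAULT = 0)
# 1 2 3 1

switch_sketch(c("a", "z"), a = 1, b = 2, DEFAULT = 0)
# 1 0
```

Compared with nested if_else calls, each alternative appears once and the mapping reads as a table.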
A data.table's key need not be unique, but there are frequently circumstances where non-unique keys can wreak havoc. has_unique_key reports the existence of a unique key, and set_unique_key both sets and ensures the uniqueness of keys.
has_unique_key(DT)
set_unique_key(DT, ...)
DT |
A data.table |
... |
keys to set |
has_unique_key returns TRUE if DT has a unique key, FALSE otherwise.
set_unique_key runs setkey(DT, ...) then checks whether the key is unique, returning the keyed data.table if the key is unique, or an error message otherwise.
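What "unique key" means can be shown with a base R check on a plain data frame. The helper is_unique_on and the data are hypothetical; this only illustrates the uniqueness test on a set of columns, not the data.table key mechanism or the package's implementation.

```r
df <- data.frame(id   = c(1, 1, 2),
                 year = c(2001, 2002, 2001),
                 v    = 1:3)

# Hypothetical helper: TRUE if no two rows share the same
# combination of values in the candidate key columns.
is_unique_on <- function(df, cols) {
  anyDuplicated(df[, cols, drop = FALSE]) == 0L
}

is_unique_on(df, "id")              # FALSE: id 1 appears twice
is_unique_on(df, c("id", "year"))   # TRUE: each (id, year) pair is distinct
```

A join on id alone would match two rows for id 1, which is the kind of silent row multiplication a non-unique key can cause.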
Present since v1.0.0.
Argument rows.out available since v1.3.0; rows.out < 1 supported since v1.4.0.
Argument discard_weight.var available since v1.3.0.
weight2rows(DT, weight.var, rows.out = NULL, discard_weight.var = FALSE)
DT |
A |
weight.var |
Variable in |
rows.out |
If not Since |
discard_weight.var |
If |
DT but with the number of rows expanded to sum(DT[[weight.var]]) to reflect the weighting.
library(data.table)
DT <- data.table(x = 1:5, y = c(1, 1, 1, 1, 2))
weight2rows(DT, "y")
weight2rows(DT, "y", rows.out = 5)
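The expansion to sum(DT[[weight.var]]) rows can be sketched in base R by repeating each row index as many times as its weight. This is an illustration on a plain data frame under the assumption of whole-number weights; it ignores rows.out and discard_weight.var and is not claimed to be the package's method.

```r
df <- data.frame(x = 1:5, y = c(1, 1, 1, 1, 2))

# Repeat each row index according to its weight in column 'y'
expanded <- df[rep(seq_len(nrow(df)), times = df$y), ]

nrow(expanded)   # 6, which equals sum(df$y)
expanded$x       # 1 2 3 4 5 5
```

After expansion, unweighted summaries of the expanded rows correspond to weighted summaries of the original rows.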
Weighted (ranked) quantiles
weighted_ntile(vector, weights = rep(1, times = length(vector)), n)
vector |
The vector for which quantiles are desired. |
weights |
The weights associated with the vector. None should be |
n |
The number of quantiles desired. |
With a short-length vector, or with weights of a high variance, the results may be unexpected.
A vector of integers corresponding to the ntiles. (As in dplyr::ntile.)
weighted_ntile(1:10, n = 5)
weighted_ntile(1:10, weights = c(rep(4, 5), rep(1, 5)), n = 5)
quantile when the values are weighted
weighted_quantile(v, w = NULL, p = (0:4)/4, v_is_sorted = FALSE)
v |
A vector from which sample quantiles are desired. |
w |
Weights corresponding to each |
p |
Numeric vector of probabilities. Missing values or values outside
|
v_is_sorted |
(logical, default: |
A vector the same length as p, the quantiles corresponding to each element of p.