Title: | Convert Addresses to Standard Inputs |
---|---|
Description: | Efficient tools for parsing and standardizing Australian addresses from textual data. It utilizes optimized algorithms to accurately identify and extract components of addresses, such as street names, types, and postcodes, especially for large batched data in contexts where sending addresses to internet services may be slow or inappropriate. The core functionality is built on fast string processing techniques to handle variations in address formats and abbreviations commonly found in Australian address data. Designed for data scientists, urban planners, and logistics analysts, the package facilitates the cleaning and normalization of address information, supporting better data integration and analysis in urban studies, geography, and related fields. |
Authors: | Hugh Parsonage [aut, cre] |
Maintainer: | Hugh Parsonage |
License: | GPL-2 |
Version: | 0.4.3 |
Built: | 2024-11-13 05:14:32 UTC |
Source: | https://github.com/hughparsonage/healthyaddress |
Extract the n-th digit of a duocentehexaquinquagesimal number
.digit256(x, d)
x |
|
d |
|
For $b = 256$, if $x = \sum_i a_i b^i$ with $0 \le a_i < b$, then .digit256(x, d) = $a_d$.
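As a rough illustration of the definition above, a base-256 digit can be computed in base R with an integer division followed by a remainder. This is only a sketch and not the package's implementation: the helper name `digit256_sketch` is hypothetical, and it assumes that digits are counted from zero at the least significant end, which the text above does not specify.

```r
# Hypothetical helper, not part of healthyAddress.
# Assumption: d = 0 refers to the least significant base-256 digit.
digit256_sketch <- function(x, d) {
  (x %/% 256^d) %% 256
}

# A number whose base-256 digits (a_2, a_1, a_0) are (1, 2, 3)
x <- 1 * 256^2 + 2 * 256 + 3
digit256_sketch(x, 0:2)
```

Because `%/%` and `%%` are vectorized, the same call returns several digits at once when `d` is a vector.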
Street types allowed.
.permitted_street_type_ord()
A character vector, the permitted street codes. In order of (approximate) occurrence; more common street types appear in the head of the vector.
Although lat and lon are usually represented by doubles, this is often wasteful. These functions let you represent each coordinate pair as a single integer, substantially reducing the memory footprint.
compress_latlon(lat, lon, nThread = getOption("healthyAddress.nThread", 1L))
decompress_latlon(x, nThread = getOption("healthyAddress.nThread", 1L))
compress_latlon_general(
  lat,
  lon,
  nThread = getOption("healthyAddress.nThread", 1L)
)
decompress_latlon_general(x, nThread = getOption("healthyAddress.nThread", 1L))
lat, lon |
Coordinates to compress. |
nThread |
Number of threads to use. |
x |
An integer vector formed by one of the compression functions. |
The _general versions of the compression and decompression functions use the observed range of the latitude and longitude to form a grid, while the bare versions use the known limits of Australian address coordinates (including the overseas territories). Since the grid in the latter case will be much less fine, you should expect greater loss of information, possibly exceeding 100 metres.
compress_latlon
An integer vector.
decompress_latlon
The original lat, lon, with some information loss.
compress_latlon_general
An integer vector, with attributes minmaxLat and minmaxLon.
decompress_latlon_general
The original lat, lon, with some information loss.
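The grid idea described above can be sketched in base R. The code below is not the package's implementation: the function names, the choice of 15 bits per axis, and the packing order are all assumptions made for illustration. It snaps each coordinate to the nearest cell of a grid spanning a given range and packs the two cell indices into one 32-bit integer.

```r
# Hypothetical sketch, not the package's implementation.
# Assumption: 15 bits per axis, so the packed value fits in a 32-bit integer.
# Assumption: each range has non-zero width.
compress_sketch <- function(lat, lon, lat_range, lon_range, bits = 15L) {
  n <- 2^bits - 1                       # highest cell index along each axis
  i <- as.integer(round((lat - lat_range[1]) / diff(lat_range) * n))
  j <- as.integer(round((lon - lon_range[1]) / diff(lon_range) * n))
  i * as.integer(2^bits) + j            # pack both indices into one integer
}

decompress_sketch <- function(x, lat_range, lon_range, bits = 15L) {
  n <- 2^bits - 1
  i <- x %/% as.integer(2^bits)         # recover the latitude cell index
  j <- x %% as.integer(2^bits)          # recover the longitude cell index
  list(lat = lat_range[1] + i / n * diff(lat_range),
       lon = lon_range[1] + j / n * diff(lon_range))
}

lat <- c(-37.8136, -33.8688, -12.4634)
lon <- c(144.9631, 151.2093, 130.8456)
z <- compress_sketch(lat, lon, range(lat), range(lon))
back <- decompress_sketch(z, range(lat), range(lon))
max(abs(back$lat - lat))                # bounded by half a grid cell
```

In this sketch the round-trip error is at most half a cell width, so a wider range (as with fixed national limits) gives a coarser grid and a larger error than a range taken from the observed data.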
Download latitude longitude data by address
download_latlon_data(
  .ste = c("NSW", "VIC", "QLD", "SA", "WA", "TAS", "NT", "ACT", "OT"),
  data_dir = getOption("healthyAddress.data_dir"),
  repo = "https://github.com/HughParsonage/PSMA-202311",
  overwrite = NA
)
.ste |
The jurisdiction to download. Default is to download all. |
data_dir |
The directory for |
repo |
The repository from which data will be downloaded. Currently only the default is supported,
and |
overwrite |
|
Called for its side effect (downloading the files), but returns the files downloaded.
Extract the flat number, number first/last from an address
extract_flatNumberFirstLast(address)
address |
A character vector from which the numbers are to be extracted. |
A data.table of three components: the flat number, the number first, and the number last.
Extract the postcode from the suffix of a string
extract_postcode(x)
x |
A character vector. |
An integer vector the same length as x, giving the postcode as it appears in the last 3 or 4 characters in each string. Returns NA_integer_ for other strings.
There is no guarantee made that the postcode is a real postcode.
extract_postcode("3000")
extract_postcode("Melbourne Vic 3000")
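For comparison, the behaviour described above can be approximated in base R with a regular expression. This is a sketch under assumptions, not the package's optimized routine: the helper name `extract_postcode_sketch` is hypothetical, and it treats any trailing run of 3 or 4 digits as a postcode, without checking that the postcode is real.

```r
# Hypothetical base-R approximation; not the package's optimized routine.
extract_postcode_sketch <- function(x) {
  pos <- regexpr("[0-9]{3,4}$", x)      # start of a trailing run of 3 or 4 digits
  out <- rep(NA_integer_, length(x))
  ok <- !is.na(pos) & pos > 0L          # strings where such a run was found
  out[ok] <- as.integer(substring(x[ok], pos[ok]))
  out
}

extract_postcode_sketch(c("Melbourne Vic 3000", "Darwin NT 800", "no postcode here"))
```

Strings without trailing digits, and missing values, come back as `NA_integer_`.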
Hash a street name quickly and accurately
HashStreetName(x)
unHashStreetName(x)
x |
A character vector of uppercase street names (without the street type). |
For HashStreetName, an integer vector the same length as x, a hash of the input; for unHashStreetName, the inverse operation.
If the original x does not contain a recognized street name, the result of unHashStreetName will be NA.
HashStreetName("FLINDERS")
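The documentation of standardize_address describes the street name hash as a DJB2 hash. The classic DJB2 recurrence can be sketched in base R as follows. This is an assumption-laden illustration: `djb2_sketch` is a hypothetical helper, it works modulo 2^32 in doubles, and its values should not be expected to equal those returned by HashStreetName, whose exact variant and integer representation are not documented here.

```r
# Hypothetical sketch of the classic DJB2 recurrence: h <- h * 33 + byte,
# starting from 5381.
# Assumption: arithmetic is done modulo 2^32 in doubles, because R's
# 32-bit integers would overflow to NA.
djb2_sketch <- function(s) {
  h <- 5381
  for (b in as.integer(charToRaw(s))) {
    h <- (h * 33 + b) %% 2^32
  }
  h
}

djb2_sketch("FLINDERS")
```

Each step touches one byte and uses only a multiplication and an addition, which is why hashes of this family are cheap to compute over millions of street names.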
Find the street type within an address
match_StreetType(address)
address |
A character vector, every string an address. |
A list of two elements. The first element contains the indices in .permitted_street_type_ord() of the street types found in the addresses. The second element contains the corresponding string positions of the streets so identified.
cds <- .permitted_street_type_ord()
head(cds)
match_StreetType("712 FLINDERS STREET MELBOURNE 3004")
#                 012345678901234
match_StreetType("712 FLINDERS ST MELBOURNE 3004")
Find word within a sentence
match_word(x, tbl)
x |
A character vector of uppercase sentences. |
tbl |
A table of words. Long vectors are not permitted. |
An integer vector the same length as x, where the i-th entry is the integer position of the first word in tbl detected in x[i]. Non-matches return NA. Words are strings of uppercase letters separated by spaces.
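One possible reading of this description can be sketched in base R: split each sentence on spaces, look each word up in tbl, and return the index in tbl of the first word that matches. The helper `match_word_sketch` is hypothetical, and the interpretation of "position" as an index into tbl is an assumption; the package's function may differ in detail.

```r
# Hypothetical sketch; assumes "position" means the index into tbl.
match_word_sketch <- function(x, tbl) {
  vapply(strsplit(x, " ", fixed = TRUE), function(words) {
    hit <- match(words, tbl)            # index in tbl for each word, NA if absent
    hit <- hit[!is.na(hit)]
    if (length(hit)) hit[1L] else NA_integer_
  }, integer(1))
}

match_word_sketch(c("712 FLINDERS STREET", "NO MATCH HERE"),
                  tbl = c("ROAD", "STREET"))
```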
Add latitude and longitude columns to a standard address
mutate_latlon(DT, data_dir = getOption("healthyAddress.data_dir"))
DT |
A |
data_dir |
The directory in which the latitude longitude data has been
downloaded. (See |
DT, with the columns lat and lon added by reference: the latitude and longitude of the address in each row.
Check that a character vector contains no lowercase characters.
nany_lowercase(x, nThread = getOption("healthyAddress.nThread", 1L))
x |
A character vector, of ASCII elements. |
nThread |
Number of threads to use. |
nany_lowercase
FALSE if any character in x is a lowercase letter.
nany_lowercase("ABC")
nany_lowercase("ABC 123 /--")
nany_lowercase("ABC 123 /-- z")
For most postcodes, the enclosing state is easy to determine (e.g. most postcodes in 2000-2999 are in NSW), but the general case is non-trivial. In particular, some postcodes straddle state borders.
postcode2ste(Postcodes, result = c("integer", "character"))
Postcodes |
An integer vector of postcodes. |
result |
One of |
A vector, the minimal set of states that covers all postcodes given. For example, if all postcodes lie within a single state, a scalar integer/string for that state is returned.
vic_poa <- c(3021L, 3084L, 3013L, 3147L, 3030L, 3123L, 3070L, 3004L, 3250L, 3630L)
postcode2ste(vic_poa)
postcode2ste(vic_poa, result = "character")
postcode2ste(c(vic_poa, 2000L))
postcode2ste(3644L)
Get internal data
read_ste_fst(
  ste = c("ACT", "NSW", "NT", "OT", "QLD", "SA", "TAS", "VIC", "WA"),
  columns = NULL,
  data_env = getOption("healthyAddress.data_env"),
  data_dir = getOption("healthyAddress.data_dir", tempfile()),
  rbind = TRUE
)
ste |
The abbreviated state name. |
columns |
Character vector of columns to select. If |
data_env |
The environment in which objects are cached. Mainly for internal use. |
data_dir |
The file directory into which the downloaded files should be
stored. Defaults to a temporary directory. It is recommended to set the option
|
rbind |
Whether or not to bind the list result should multiple states be requested. |
A data.table containing all the addresses in the given states.
Standardize an address from a free text expression into its components as used in the PSMA (formerly "Public Sector Mapping Agencies") database.
standardize_address(
  Address,
  AddressLine2 = NULL,
  return.type = c("data.table", "integer"),
  integer_StreetType = FALSE,
  hash_StreetName = FALSE,
  check = 1L,
  nThread = getOption("healthyAddress.nThread", 1L)
)
standard_address2(Address, nThread = getOption("healthyAddres.nThread", 1L))
standard_address3(Line1, Line2, Postcode = NULL, KeepStreetName = FALSE)
Address |
A character vector, either a full address or (if |
AddressLine2 |
Either |
return.type |
Either |
integer_StreetType |
Should the street type be returned as an integer vector? |
hash_StreetName |
Should |
check |
An integer, whether the inputs should be checked for possibly invalid addresses or addresses that may not be parsed correctly. |
nThread |
Number of threads to use. |
Line1, Line2, Postcode |
For addresses split by line. |
KeepStreetName |
Should an additional character vector be included in the result of the street name? |
By convention observed in the PSMA, street names such as 'THE ESPLANADE' have a street name of 'THE ESPLANADE' and an absent street type code.
Non-addresses passed have unspecified behaviour, though usually the numbers of the standard address will be 0 or NA. Postcodes may be negative in some circumstances where a postcode is not detected, though this should not be relied on.
For maximum performance, consider setting integer_StreetType and hash_StreetName to TRUE. Joining two tables has been observed to be faster when using the hash of the standardized street name rather than the street name itself, even after accounting for the time spent hashing.
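A call using both options might look like the following. It is a usage sketch only: it assumes the package is installed under the name healthyAddress (inferred from the option names above), it is guarded so that it does nothing otherwise, and no particular output is claimed.

```r
# Usage sketch; assumes the package is installed as "healthyAddress".
addresses <- c("712 FLINDERS STREET MELBOURNE 3004",
               "712 FLINDERS ST MELBOURNE 3004")

# Guarded so the snippet is a no-op when the package is not available.
if (requireNamespace("healthyAddress", quietly = TRUE)) {
  std <- healthyAddress::standardize_address(
    addresses,
    integer_StreetType = TRUE,   # street type returned as an integer code
    hash_StreetName = TRUE       # street name returned as a hash
  )
  print(std)
}
```

Two tables standardized this way can then be joined on integer columns rather than on character street names.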
For performance reasons, addresses with more than 32 words are not supported.
If a postcode-like number exists at the end of an Address but is not in fact a postcode, then every field will be NA except the postcode, which will have the value -1.
A data.table containing columns indicating the components of the standard address:
FLAT_NUMBER
The flat or unit number. This includes things like SHOP number.
NUMBER_FIRST
As used in the PSMA, this identifies the first (or only) number in the address range.
NUMBER_LAST
As used in the PSMA, if an address is marked as having a range of street numbers, the last of the range.
NUMBER_SUFFIX
A raw vector. The suffix observed after the numbers. The PSMA technically has multiple suffixes for each number component.
H0
If hash_StreetName = TRUE, the DJB2 hash (as used in HashStreetName) of the street name. Observed to have performance benefits.
STREET_NAME
The (uppercase) street name. Streets such as 'THE ESPLANADE' or 'THE AVENUE' are treated as entirely made up of a street name and have a STREET_TYPE_CODE of zero.
STREET_TYPE_CODE
An integer, the street type code marking the type of street such as ROAD, STREET, AVENUE, etc. The code corresponds approximately to the rank of the street type's frequency in addresses.
STREET_TYPE
If integer_StreetType = FALSE, then the (uppercase) standard name of the street type.
POSTCODE
An integer vector, the postcode observed.
Uppercase
toupper_basic(x)
x |
A character vector |
The same as toupper(x) for ASCII entries. For implementation reasons, strings wider than 32767 characters (bytes) will be ignored.
Unique postcodes of
unique_Postcodes(x, strict = TRUE)
uniqueN_Postcodes(x, strict = TRUE)
x |
An integer vector of postcodes. |
strict |
(logical, default: |
unique_Postcodes
A (sorted) integer vector of the unique, non-NA values in x.
uniqueN_Postcodes
The number of unique postcodes.