geo_join {fuzzyjoin} | R Documentation |
Join two tables based on a geo distance of longitudes and latitudes
Description
This allows joining based on combinations of longitudes and latitudes. If
you are using a distance metric that is *not* based on latitude and
longitude, use distance_join instead. Distances are calculated based on the
distHaversine, distGeo, distCosine, etc. methods in the geosphere package.
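As a rough illustration of what the default "haversine" method measures, the base R sketch below implements the haversine great-circle formula directly. The helper name haversine_miles and the sample coordinates are hypothetical and are not part of fuzzyjoin or geosphere; the radius of 6378137 m is the default used by distHaversine, and 1609.344 is the number of meters in a mile.

```r
# Hypothetical helper, for illustration only: great-circle distance in miles
# between points given as longitude/latitude in degrees.
haversine_miles <- function(lon1, lat1, lon2, lat2, r = 6378137) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  # pmin() guards against rounding pushing sqrt(a) slightly above 1
  meters <- 2 * r * asin(pmin(1, sqrt(a)))
  meters / 1609.344
}

# Two made-up points; a geo join pairs rows whose distance is within max_dist
d <- haversine_miles(-90, 35, -88, 36)
d <= 200
```

The function is vectorized, so it can be applied to whole columns of coordinates at once.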
Usage
geo_join(
  x,
  y,
  by = NULL,
  max_dist,
  method = c("haversine", "geo", "cosine", "meeus", "vincentysphere",
    "vincentyellipsoid"),
  unit = c("miles", "km"),
  mode = "inner",
  distance_col = NULL,
  ...
)

geo_inner_join(
  x,
  y,
  by = NULL,
  method = "haversine",
  max_dist = 1,
  distance_col = NULL,
  ...
)

geo_left_join(
  x,
  y,
  by = NULL,
  method = "haversine",
  max_dist = 1,
  distance_col = NULL,
  ...
)

geo_right_join(
  x,
  y,
  by = NULL,
  method = "haversine",
  max_dist = 1,
  distance_col = NULL,
  ...
)

geo_full_join(
  x,
  y,
  by = NULL,
  method = "haversine",
  max_dist = 1,
  distance_col = NULL,
  ...
)

geo_semi_join(
  x,
  y,
  by = NULL,
  method = "haversine",
  max_dist = 1,
  distance_col = NULL,
  ...
)

geo_anti_join(
  x,
  y,
  by = NULL,
  method = "haversine",
  max_dist = 1,
  distance_col = NULL,
  ...
)
Arguments
x | A tbl |
y | A tbl |
by | Columns by which to join the two tables |
max_dist | Maximum distance to use for joining |
method | Method to use for computing distance: one of "haversine" (default), "geo", "cosine", "meeus", "vincentysphere", "vincentyellipsoid" |
unit | Unit of distance for threshold (default "miles") |
mode | One of "inner", "left", "right", "full", "semi", or "anti" |
distance_col | If given, will add a column with this name containing the geographical distance between the two matched points |
... | Extra arguments passed on to the distance method |
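The following sketch shows how several of these arguments might be combined in one call. The tables sites and stations and their coordinates are made up for illustration; the join relies on the default by = NULL, which uses the columns the two tables have in common (here longitude and latitude).

```r
library(dplyr)
library(fuzzyjoin)

# Hypothetical coordinates, chosen only so that one site has a nearby station
sites <- tibble(site = c("A", "B"),
                longitude = c(0, 10),
                latitude = c(0, 10))
stations <- tibble(station = c("S1", "S2"),
                   longitude = c(0.01, 50),
                   latitude = c(0, 50))

# Keep every site, attach stations within 5 km, and record the distance
nearby <- geo_join(sites, stations,
                   max_dist = 5,
                   unit = "km",
                   mode = "left",
                   distance_col = "dist_km")
nearby
```

Because mode = "left" keeps unmatched rows of x, site B should appear with missing values in the station columns.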
Details
"Haversine" was chosen as the default because, in some tests, it was approximately the fastest method. By far the slowest method is "vincentyellipsoid"; in fuzzy joins it should be used only when there are very few pairs and accuracy is imperative.
If you need to use a custom geo method, you may want to write it directly
with the multi_by and multi_match_fun arguments to fuzzy_join.
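A minimal sketch of such a custom method is given below, under some assumptions: that multi_match_fun receives the multi_by columns of each candidate pair as two matrices or data frames with one row per pair, and that it may return a logical vector marking the matches. The helper equirect_within, the equirectangular approximation it uses, the 6371 km radius, and the sample tables are all hypothetical choices for illustration, not part of fuzzyjoin.

```r
library(dplyr)
library(fuzzyjoin)

# Hypothetical custom method: equirectangular approximation, in km.
# Returns a function suitable for multi_match_fun; v1 and v2 are assumed to
# hold the longitude (column 1) and latitude (column 2) of each candidate pair.
equirect_within <- function(max_km, r = 6371) {
  function(v1, v2) {
    v1 <- as.matrix(v1)
    v2 <- as.matrix(v2)
    to_rad <- pi / 180
    mean_lat <- (v1[, 2] + v2[, 2]) / 2 * to_rad
    dx <- (v2[, 1] - v1[, 1]) * to_rad * cos(mean_lat)
    dy <- (v2[, 2] - v1[, 2]) * to_rad
    r * sqrt(dx^2 + dy^2) <= max_km
  }
}

# Made-up coordinates for illustration
sites <- tibble(site = c("A", "B"),
                longitude = c(0, 10),
                latitude = c(0, 10))
stations <- tibble(station = c("S1", "S2", "S3"),
                   longitude = c(0.01, 10, 50),
                   latitude = c(0, 10.01, 50))

matched <- fuzzy_join(sites, stations,
                      multi_by = c("longitude", "latitude"),
                      multi_match_fun = equirect_within(5),
                      mode = "inner")
matched
```

Wrapping the threshold in a closure keeps the matching function's signature to the two inputs that fuzzy_join supplies.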
Examples
library(dplyr)
library(fuzzyjoin)
data("state")

# find pairs of US states whose centers are within
# 200 miles of each other
# (tibble() replaces the deprecated data_frame())
states <- tibble(state = state.name,
                 longitude = state.center$x,
                 latitude = state.center$y)

s1 <- rename(states, state1 = state)
s2 <- rename(states, state2 = state)

pairs <- s1 %>%
  geo_inner_join(s2, max_dist = 200) %>%
  filter(state1 != state2)

pairs

# plot them (borders("state") requires the maps package)
library(ggplot2)
ggplot(pairs, aes(x = longitude.x, y = latitude.x,
                  xend = longitude.y, yend = latitude.y)) +
  geom_segment(color = "red") +
  borders("state") +
  theme_void()

# also get distances
s1 %>%
  geo_inner_join(s2, max_dist = 200, distance_col = "distance")