geolocate_text {epitweetr} | R Documentation |
Geolocate text in a data frame given a text column and, optionally, a language column
Description
Extracts geolocation information from the text in a column of the provided data frame and returns a new data frame with that geolocation information.
Usage
geolocate_text(df, text_col = "text", lang_col = NULL, min_score = NULL)
Arguments
df |
A data frame containing at least one character column with text. A column with the language of each text can also be provided to improve geolocation quality |
text_col |
character, name of the column in the data frame containing the text to geolocate, default: "text" |
lang_col |
character, name of the column in the data frame containing the language of the texts, default: NULL |
min_score |
numeric, the minimum score obtained from the Lucene scoring function to accept matches on GeoNames. It has to be set empirically, default: NULL |
Details
This function performs a call to the epitweetr database, which provides geolocation functionality for the languages that have been activated and successfully processed in the Shiny app.
The geolocation process tries to find the best match in the GeoNames database (https://www.geonames.org/), including all local aliases for words.
If no language is associated with the text, all tokens are sent as a query to the indexed GeoNames database.
If a language code is associated with the text and this language has been trained in epitweetr, entity recognition techniques are used to identify the tokens in the text most likely to contain a location, and only these tokens are sent to the GeoNames query.
A custom scoring function gives more weight to cities, with the weight increasing with population, to help disambiguate between candidate matches.
Rules for forcing the geolocation choices of the algorithm, and manual annotations for tuning its performance, can be defined on the geotag tab of the Shiny app.
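The idea of weighting candidates by population can be sketched in base R. This is only an illustration under stated assumptions: the function `weighted_score`, the log-based boost, and all scores and population figures below are hypothetical and are not taken from the epitweetr or Lucene implementation.

```r
# Hypothetical candidates for an ambiguous place name.
# text_score and population are illustrative values, not real GeoNames output.
candidates <- data.frame(
  name = c("Santiago", "Santiago de Compostela", "Santiago de Cuba"),
  country_code = c("CL", "ES", "CU"),
  text_score = c(4.0, 4.2, 4.1),
  population = c(5000000, 100000, 430000),
  stringsAsFactors = FALSE
)

# Assumed weighting: boost the text score by log10 of the population,
# so that larger cities win when text scores are close.
weighted_score <- function(text_score, population) {
  text_score * (1 + log10(pmax(population, 1)))
}

candidates$score <- weighted_score(candidates$text_score, candidates$population)

# Keep the candidate with the highest weighted score
best <- candidates[which.max(candidates$score), ]
print(best[, c("name", "country_code", "score")])
```

With these made-up numbers the largest city is selected even though its raw text score is the lowest, which is the kind of disambiguation a population-aware score aims for.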
A prerequisite to this function is that the tasks download_dependencies, update_geonames and update_languages have been run successfully.
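As a rough sketch, the prerequisite tasks could be run manually from R as shown below. It assumes that each task function accepts the task list returned by `get_tasks()`, that the data folder has already been configured, and that the epitweetr embedded database is running; check the help page of each function before relying on these calls.

```r
if (FALSE) {
  library(epitweetr)
  # setting up the data folder
  message('Please choose the epitweetr data directory')
  setup_config(file.choose())

  # Assumption: each task takes the current task list and returns the updated one
  tasks <- get_tasks()
  tasks <- download_dependencies(tasks)
  tasks <- update_geonames(tasks)
  tasks <- update_languages(tasks)
}
```

In normal use these tasks are scheduled by epitweetr itself, so running them by hand is mainly useful for a first setup or for troubleshooting.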
This function is called from the geolocation evaluation tab of the Shiny app, but it can also be used to evaluate the epitweetr geolocation algorithm manually.
Value
A new data frame containing the following geolocation columns: geo_code, geo_country_code, geo_country, geo_name, tags
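A small base R check can confirm that a result carries the geolocation columns listed here. The helper `missing_geo_columns` is hypothetical and not part of epitweetr, and the mock data frame only stands in for real output (its values are placeholders, and the presence of the original text column is an assumption).

```r
# Column names taken from the Value section
geo_cols <- c("geo_code", "geo_country_code", "geo_country", "geo_name", "tags")

# Hypothetical helper: returns the expected columns that are absent from `result`
missing_geo_columns <- function(result) setdiff(geo_cols, names(result))

# Mock result with placeholder values, standing in for the output of geolocate_text()
mock <- data.frame(
  text = "placeholder text",
  geo_code = NA_character_,
  geo_country_code = NA_character_,
  geo_country = NA_character_,
  geo_name = NA_character_,
  tags = NA_character_,
  stringsAsFactors = FALSE
)

complete <- missing_geo_columns(mock)          # nothing missing
incomplete <- missing_geo_columns(mock["text"]) # all five names missing
```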
See Also
Examples
if(FALSE) {
library(epitweetr)
# setting up the data folder
message('Please choose the epitweetr data directory')
setup_config(file.choose())
# creating a test dataframe
df <- data.frame(text = c("Me gusta Santiago de Chile es una linda ciudad"), lang = c("es"))
geo <- geolocate_text(df = df, text_col = "text", lang_col = "lang")
}