geolocate_text {epitweetr}    R Documentation

Geolocate text in a data frame given a text column and, optionally, a language column

Description

Extracts geolocation information from the text in a column of the provided data frame and returns a new data frame with the geolocation information.

Usage

geolocate_text(df, text_col = "text", lang_col = NULL, min_score = NULL)

Arguments

df

A data frame containing at least one character column with text. A column with the language of each text can be provided to improve geolocation quality.

text_col

character, name of the column in the data frame containing the text to geolocate, default: "text"

lang_col

character, name of the column in the data frame containing the language of the texts, default: NULL

min_score

numeric, the minimum score from the Lucene scoring function required to accept a match in GeoNames. It has to be set empirically, default: NULL

Details

This function performs a call to the epitweetr database, which includes functionality for geolocating text in the languages that have been activated and successfully processed in the Shiny app.

The geolocation process tries to find the best match in the GeoNames database (https://www.geonames.org/), including all local aliases for words.

If no language is associated with the text, all tokens are sent as a query to the indexed GeoNames database.

If a language code is associated with the text and this language has been trained in epitweetr, entity recognition techniques are used to identify the tokens most likely to contain a location, and only these tokens are sent in the GeoNames query.

A custom scoring function gives more weight to cities, increasing with population, to help disambiguate between candidate locations.
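The effect of population weighting, together with the min_score threshold, can be pictured with a small self-contained sketch. Everything in it is an assumption made for illustration: the candidate table, the weighted_score and pick_best helpers, the logarithmic formula and the order of the two steps are hypothetical and are not taken from the epitweetr sources, whose actual scoring may differ.

```r
# Hypothetical candidate matches for the token "Santiago".
# Scores and populations are illustrative placeholders, not GeoNames values.
candidates <- data.frame(
  geo_name   = c("Santiago", "Santiago de Compostela", "Santiago"),
  country    = c("CL", "ES", "PH"),
  score      = c(8.0, 7.5, 8.2),
  population = c(5e6, 1e5, 1e5),
  stringsAsFactors = FALSE
)

# Assumed weighting: multiply the text-match score by a factor that grows
# logarithmically with population. The real formula may differ.
weighted_score <- function(score, population) {
  score * (1 + log10(1 + population))
}

# Assumed pipeline: drop matches below min_score first, then keep the
# candidate with the highest population-weighted score.
pick_best <- function(cands, min_score = NULL) {
  if (!is.null(min_score)) {
    cands <- cands[cands$score >= min_score, , drop = FALSE]
  }
  if (nrow(cands) == 0) {
    return(cands)
  }
  cands$weighted <- weighted_score(cands$score, cands$population)
  cands[which.max(cands$weighted), , drop = FALSE]
}

best <- pick_best(candidates, min_score = 7.8)
print(best)
```

In this sketch the third candidate has the highest raw score, but the larger population of the first one makes it the preferred match, which is the kind of disambiguation the weighting aims for.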

Rules for forcing the geolocation choices of the algorithm and for tuning performance with manual annotations can be defined on the geotag tab of the Shiny app.

A prerequisite for this function is that the tasks download_dependencies, update_geonames and update_languages have been run successfully.

This function is called from the Shiny app on the geolocation evaluation tab, but it can also be used for manually evaluating the epitweetr geolocation algorithm.
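A manual evaluation could, for instance, compare the returned geo_country_code column with hand-annotated country codes. The sketch below is a minimal illustration under stated assumptions: the geo data frame is a hand-written stand-in for the output of geolocate_text (not real output), the expected_country_code column and the evaluate_geolocation helper are hypothetical, and unmatched texts are assumed to yield NA.

```r
# Manually annotated texts with the country code expected for each one
# (hypothetical column name expected_country_code).
annotated <- data.frame(
  text = c("Me gusta Santiago de Chile es una linda ciudad",
           "Paris est magnifique"),
  lang = c("es", "fr"),
  expected_country_code = c("CL", "FR"),
  stringsAsFactors = FALSE
)

# Stand-in for the result of geolocate_text(annotated, "text", "lang").
# These values are placeholders; the second text is assumed unmatched (NA).
geo <- data.frame(
  text = annotated$text,
  geo_country_code = c("CL", NA),
  stringsAsFactors = FALSE
)

# Share of texts that were located at all, and share located in the
# expected country.
evaluate_geolocation <- function(geo, expected) {
  found <- !is.na(geo$geo_country_code)
  correct <- found & geo$geo_country_code == expected
  c(located = mean(found), correct = mean(correct))
}

metrics <- evaluate_geolocation(geo, annotated$expected_country_code)
print(metrics)
```

Repeating such a comparison for several min_score values is one way to set that threshold empirically.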

Value

A new data frame containing the following geolocation columns: geo_code, geo_country_code, geo_country, geo_name, tags

See Also

download_dependencies

update_geonames

detect_loop

Examples

if(FALSE) {
   library(epitweetr)
   # setting up the data folder
   message('Please choose the epitweetr data directory')
   setup_config(file.choose())

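   # creating a test data frame with a text column and a language column
   df <- data.frame(text = c("Me gusta Santiago de Chile es una linda ciudad"), lang = c("es"))
   # geolocating the text, using the language column to improve the match
   geo <- geolocate_text(df = df, text_col = "text", lang_col = "lang")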
}

[Package epitweetr version 2.2.16 Index]