WFO.match.fuzzyjoin {WorldFlora}

R Documentation

Standardize plant names according to World Flora Online taxonomic backbone

Description

An alternative to WFO.match that is typically faster and that supports different methods of calculating the fuzzy distance via stringdist.

Usage

    WFO.match.fuzzyjoin(spec.data = NULL, WFO.file = NULL, WFO.data = NULL,
        no.dates = TRUE,
        spec.name = "spec.name",
        Authorship = "Authorship",
        stringdist.method = "lv", fuzzydist.max = 4,
        Fuzzy.min = TRUE,
        acceptedNameUsageID.match = TRUE,
        squish = TRUE,
        spec.name.tolower = FALSE, spec.name.nonumber = TRUE, spec.name.nobrackets = TRUE,
        spec.name.sub = TRUE,
        sub.pattern=c(" sp[.] A", " sp[.] B", " sp[.] C", " sp[.]", " spp[.]", " pl[.]",
            " indet[.]", " ind[.]", " gen[.]", " g[.]", " fam[.]", " nov[.]", " prox[.]",
            " cf[.]", " aff[.]", " s[.]s[.]", " s[.]l[.]",
            " p[.]p[.]", " p[.] p[.]", "[?]", " inc[.]", " stet[.]", "Ca[.]",
            "nom[.] cons[.]", "nom[.] dub[.]", " nom[.] err[.]", " nom[.] illeg[.]",
            " nom[.] inval[.]", " nom[.] nov[.]", " nom[.] nud[.]", " nom[.] obl[.]",
            " nom[.] prot[.]", " nom[.] rej[.]", " nom[.] supp[.]", " sensu auct[.]"))

Arguments

`spec.data`	A data.frame containing variables with species names. If a character vector is provided, it will be converted to a data.frame.
`WFO.file`	File name of the static copy of the Taxonomic Backbone. If not `NULL`, then data will be reloaded from this file.
`WFO.data`	Data set with the static copy of the Taxonomic Backbone. Ignored if `WFO.file` is not `NULL`.
`no.dates`	If `TRUE`, speed up the loading of the WFO.data by not loading the fields 'created' and 'modified'.
`spec.name`	Name of the column with taxonomic names.
`Authorship`	Name of the column with the naming authorities.
`stringdist.method`	Method used to calculate the fuzzy distance, as used by the internally called `stringdist`.
`fuzzydist.max`	Maximum distance used for joining as in `stringdist_join`.
`Fuzzy.min`	Limit the results of fuzzy matching to those with the smallest distance.
`acceptedNameUsageID.match`	If `TRUE`, obtain the accepted name and other details from the earlier acceptedNameUsageID.
`squish`	If `TRUE`, remove repeated whitespace and white space from the start and end of the submitted full name via str_squish.
`spec.name.tolower`	If `TRUE`, then convert all characters of the `spec.name` to lower case via tolower.
`spec.name.nonumber`	If `TRUE`, then a submitted `spec.name` that contains numbers will be interpreted as a genus name, matching only the first word.
`spec.name.nobrackets`	If `TRUE`, then sections of the submitted `spec.name` after '(' will be removed. Note that this will also remove sections after ')', such as authorities for plant names, which are stored in a separate column of the WFO.
`spec.name.sub`	If `TRUE`, then delete sections of the `spec.name` that match the `sub.pattern`.
`sub.pattern`	Sections of the `spec.name` to be deleted.
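
The name-cleaning arguments can be illustrated with a minimal base R sketch. `clean_name` is a hypothetical helper, not part of WorldFlora, and it uses only a short subset of the default `sub.pattern`; the order in which the package applies these steps internally may differ.

```r
# Hypothetical helper (not part of WorldFlora): approximates the effect of
# spec.name.nobrackets, spec.name.sub and squish with base R only.
clean_name <- function(x,
                       sub.pattern = c(" sp[.]", " cf[.]", " aff[.]"),
                       nobrackets = TRUE, squish = TRUE) {
    if (nobrackets) {
        # Drop everything from the first "(" onwards
        x <- sub("[(].*$", "", x)
    }
    for (p in sub.pattern) {
        # Delete each section that matches a pattern
        x <- gsub(p, "", x)
    }
    if (squish) {
        # Collapse repeated whitespace and trim both ends
        x <- trimws(gsub("[[:space:]]+", " ", x))
    }
    x
}

cleaned <- clean_name(c("  Faidherbia   albida ",
    "Faidherbia cf. albida",
    "Faidherbia albida (Delile) A.Chev.",
    "Acacia sp."))
cleaned
```

The first three inputs reduce to "Faidherbia albida" and the last to "Acacia", which shows why a name with an authority in brackets loses everything after the opening bracket.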

Details

This function matches plant names by using the stringdist_left_join function internally. The results are provided in a similar format to those from WFO.match; therefore the WFO.one function can be used as a next step of the analysis.

For large data sets the function may fail due to memory limits. A solution is to analyse different subsets of large data, as for example shown by Kindt (2023).
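
One way to analyse subsets is sketched below. `match_in_chunks` and its arguments are hypothetical names introduced for illustration, and this is not necessarily the approach taken by Kindt (2023); the matching function is passed in as an argument so that the chunking logic can be tried without the backbone data.

```r
# Hypothetical helper (not part of WorldFlora): split spec.data into chunks
# of at most chunk.size rows, match each chunk, and stack the results.
match_in_chunks <- function(spec.data, chunk.size, match.fun) {
    chunk.id <- ceiling(seq_len(nrow(spec.data)) / chunk.size)
    chunks <- split(spec.data, chunk.id)   # list of data.frames
    results <- lapply(chunks, match.fun)   # match each chunk separately
    out <- do.call(rbind, results)
    rownames(out) <- NULL
    out
}

# Intended usage (not run here; requires WorldFlora and fuzzyjoin):
# library(WorldFlora); library(fuzzyjoin); data(WFO.example)
# matched <- match_in_chunks(spec.test, chunk.size = 2,
#     match.fun = function(d) WFO.match.fuzzyjoin(spec.data = d,
#         WFO.data = WFO.example))
```

A smaller `chunk.size` lowers peak memory use at the cost of more calls to the matching function.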

Column 'Unique' shows whether there was a unique match (or no match) in the WFO.

Column 'Matched' shows whether there was a match in the WFO.

Column 'Fuzzy' shows whether matching was done by the fuzzy method.

Column 'Fuzzy.dist' gives the fuzzy distance calculated between submitted and matched plant names, calculated internally with stringdist_left_join.
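
How `fuzzydist.max` and `Fuzzy.min` act on such distances can be sketched with base R. The code below is an illustration only, not the package's internal code: it uses `adist` as a stand-in for the Levenshtein ("lv") method and a candidate list of just two names.

```r
submitted  <- "Faidherbia albiad"
candidates <- c("Faidherbia albida", "Acacia albida")

# Levenshtein distance from the submitted name to every candidate
fuzzy.dist <- as.vector(utils::adist(submitted, candidates))

hits <- data.frame(candidate = candidates, Fuzzy.dist = fuzzy.dist,
    stringsAsFactors = FALSE)

# Keep candidates within the maximum distance (compare fuzzydist.max = 4)
hits <- hits[hits$Fuzzy.dist <= 4, , drop = FALSE]

# Keep only the closest candidates (compare Fuzzy.min = TRUE)
hits <- hits[hits$Fuzzy.dist == min(hits$Fuzzy.dist), , drop = FALSE]
hits
```

Here the swapped letters in "albiad" count as two edits under the Levenshtein distance, so "Faidherbia albida" is retained with a distance of 2.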

Column 'Auth.dist' gives the Levenshtein distance calculated between submitted and matched authorship names, if the former were provided. This distance is calculated in the same way as for the WFO.match function via adist.
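
As a small illustration of this kind of distance, the sketch below calls `adist` directly on two spellings of the same authority; the submitted spelling is an assumed example input, and the code is not taken from the package.

```r
# Assumed example: the same authority written with and without a space
submitted.auth <- "(Delile) A. Chev."
matched.auth   <- "(Delile) A.Chev."

# Levenshtein distance between the two authorship strings
auth.dist <- as.vector(utils::adist(submitted.auth, matched.auth))
auth.dist
```

The single extra space results in a distance of 1, so small formatting differences in authorship give small but non-zero values.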

Column 'Subseq' gives different numbers for different matches for the same plant name.

Column 'Hybrid' shows whether there was a hybrid character in the scientificName.

Column 'New.accepted' shows whether the species details correspond to the current accepted name.

Column 'Old.status' gives the taxonomic status of the first match with the non-blank acceptedNameUsageID.

Column 'Old.ID' gives the ID of the first match with the non-blank acceptedNameUsageID.

Column 'Old.name' gives the name of the first match with the non-blank acceptedNameUsageID.

Value

The function returns a data set with the matched species details from the WFO.

Author(s)

Roeland Kindt (World Agroforestry, CIFOR-ICRAF)

References

World Flora Online. An Online Flora of All Known Plants. https://www.worldfloraonline.org

Sigovini M, Keppel E, Tagliapietra D. 2016. Open Nomenclature in the biodiversity era. Methods in Ecology and Evolution 7: 1217-1225.

Kindt, R. 2020. WorldFlora: An R package for exact and fuzzy matching of plant names against the World Flora Online taxonomic backbone data. Applications in Plant Sciences 8(9): e11388

Kindt, R. 2023. Standardizing tree species names of GlobalTreeSearch with WorldFlora while testing the faster matching function of WFO.match.fuzzyjoin. https://rpubs.com/Roeland-KINDT/996500

Examples


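## Not run: 
library(WorldFlora)
library(fuzzyjoin)

# Example data set provided with the WorldFlora package
data(WFO.example)

# Names to match, including a misspelling ("Faidherbia albiad")
# and a truncated name ("Pygeum afric")
spec.test <- data.frame(spec.name=c("Faidherbia albida", "Acacia albida",
    "Faidherbia albiad",
    "Omalanthus populneus", "Pygeum afric"))

# Default matching with the Levenshtein distance (stringdist.method="lv")
WFO.match.fuzzyjoin(spec.data=spec.test, WFO.data=WFO.example)

# Using the Damerau-Levenshtein distance
WFO.match.fuzzyjoin(spec.data=spec.test, WFO.data=WFO.example,
    stringdist.method="dl")

## End(Not run)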

[Package WorldFlora version 1.14-3 Index]