WFO.match.fuzzyjoin {WorldFlora}    R Documentation

Standardize plant names according to World Flora Online taxonomic backbone

Description

An alternative to WFO.match that is typically faster and that allows different methods of calculating the fuzzy distance via stringdist.

Usage

    WFO.match.fuzzyjoin(spec.data = NULL, WFO.file = NULL, WFO.data = NULL,
        no.dates = TRUE,
        spec.name = "spec.name",
        Authorship = "Authorship",
        stringdist.method = "lv", fuzzydist.max = 4,
        Fuzzy.min = TRUE,
        acceptedNameUsageID.match = TRUE,
        squish = TRUE,
        spec.name.tolower = FALSE, spec.name.nonumber = TRUE, spec.name.nobrackets = TRUE,
        spec.name.sub = TRUE,
        sub.pattern=c(" sp[.] A", " sp[.] B", " sp[.] C", " sp[.]", " spp[.]", " pl[.]",
            " indet[.]", " ind[.]", " gen[.]", " g[.]", " fam[.]", " nov[.]", " prox[.]",
            " cf[.]", " aff[.]", " s[.]s[.]", " s[.]l[.]",
            " p[.]p[.]", " p[.] p[.]", "[?]", " inc[.]", " stet[.]", "Ca[.]",
            "nom[.] cons[.]", "nom[.] dub[.]", " nom[.] err[.]", " nom[.] illeg[.]",
            " nom[.] inval[.]", " nom[.] nov[.]", " nom[.] nud[.]", " nom[.] obl[.]",
            " nom[.] prot[.]", " nom[.] rej[.]", " nom[.] supp[.]", " sensu auct[.]"))

Arguments

spec.data

A data.frame containing variables with species names. If a character vector is provided, it will be converted to a data.frame.

WFO.file

File name of the static copy of the Taxonomic Backbone. If not NULL, then data will be reloaded from this file.

WFO.data

Data set with the static copy of the Taxonomic Backbone. Ignored if WFO.file is not NULL.

no.dates

If TRUE, speed up the loading of WFO.data by not loading the 'created' and 'modified' fields.

spec.name

Name of the column with taxonomic names.

Authorship

Name of the column with the naming authorities.

stringdist.method

Method used to calculate the fuzzy distance, as used by the internally called stringdist.

fuzzydist.max

Maximum distance used for joining as in stringdist_join.

Fuzzy.min

Limit the results of fuzzy matching to those with the smallest distance.

acceptedNameUsageID.match

If TRUE, obtain the accepted name and other details from the earlier acceptedNameUsageID.

squish

If TRUE, remove repeated whitespace, and whitespace from the start and end, of the submitted full name via str_squish.

spec.name.tolower

If TRUE, then convert all characters of the spec.name to lower case via tolower.

spec.name.nonumber

If TRUE, then a submitted spec.name that contains numbers will be interpreted as a genus, matching only the first word.

spec.name.nobrackets

If TRUE, then sections of the submitted spec.name after '(' will be removed. Note that this will also remove sections after ')', such as authorities for plant names that are in a separate column of WFO.

spec.name.sub

If TRUE, then delete sections of the spec.name that match the sub.pattern.

sub.pattern

Sections of the spec.name to be deleted, given as a character vector of patterns.
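
The name-cleaning arguments (squish, spec.name.sub and sub.pattern) can be illustrated with a minimal base R sketch. The helper clean_name is hypothetical and is not the package's internal code; it assumes that each pattern is removed as a regular expression, and it uses gsub and trimws as a stand-in for str_squish.

```r
# Hypothetical helper, for illustration only (not part of WorldFlora)
clean_name <- function(x, patterns) {
  # Delete every section that matches one of the patterns
  for (p in patterns) {
    x <- gsub(pattern = p, replacement = "", x = x)
  }
  # Trim both ends and collapse repeated whitespace (stand-in for str_squish)
  gsub("\\s+", " ", trimws(x))
}

# A small subset of the default sub.pattern
patterns <- c(" sp[.]", " cf[.]", " aff[.]")

cleaned <- clean_name(c("Acacia cf. albida", "Pygeum sp.",
    "  Faidherbia   albida "), patterns)
cleaned
```

Patterns such as " sp[.]" place the dot inside a character class so that it matches a literal full stop rather than any character.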

Details

This function matches plant names by using the stringdist_left_join function internally. The results are provided in a similar format to those from WFO.match; therefore the WFO.one function can be used in a next step of the analysis.
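
The kind of join performed internally can be sketched on toy data as follows. This is only an illustration of stringdist_left_join from the fuzzyjoin package, not the actual code of WFO.match.fuzzyjoin; the two small data frames are assumptions made for the example, with the values of stringdist.method and fuzzydist.max passed as method and max_dist.

```r
library(fuzzyjoin)

# Toy inputs (assumed for illustration): submitted names and a tiny "backbone"
submitted <- data.frame(spec.name = c("Faidherbia albida", "Faidherbia albiad"))
backbone <- data.frame(scientificName = c("Faidherbia albida", "Acacia albida"))

# Keep all submitted names and join backbone names within the maximum distance
joined <- stringdist_left_join(submitted, backbone,
    by = c("spec.name" = "scientificName"),
    method = "lv",              # corresponds to stringdist.method
    max_dist = 4,               # corresponds to fuzzydist.max
    distance_col = "Fuzzy.dist")
joined
```

Both submitted names are joined to "Faidherbia albida" only, because "Acacia albida" is further away than the maximum distance of 4.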

For large data sets the function may fail due to memory limits. A solution is to analyse different subsets of large data, as for example shown by Kindt (2023).
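
One possible way to analyse subsets is sketched below. The helper match_in_chunks and its arguments are hypothetical and are not part of WorldFlora; the sketch assumes that the results of the separate subsets have the same columns, so that they can be combined with rbind.

```r
# Hypothetical helper, for illustration only (not part of WorldFlora)
match_in_chunks <- function(spec.data, chunk.size = 1000, match.fun) {
  # Assign each row to a chunk of at most chunk.size rows
  chunk.id <- ceiling(seq_len(nrow(spec.data)) / chunk.size)
  chunks <- split(spec.data, chunk.id)
  # Match each subset separately, then stack the results
  results <- lapply(chunks, match.fun)
  out <- do.call(rbind, results)
  rownames(out) <- NULL
  out
}

# Possible usage, assuming the WorldFlora package and WFO.example are loaded:
# match_in_chunks(spec.test, chunk.size = 2,
#     match.fun = function(d) WFO.match.fuzzyjoin(spec.data = d,
#         WFO.data = WFO.example))
```

A suitable chunk.size depends on the available memory and on the size of the backbone data.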

Column 'Unique' shows whether there was a unique match (or no match) in the WFO.

Column 'Matched' shows whether there was a match in the WFO.

Column 'Fuzzy' shows whether matching was done by the fuzzy method.

Column 'Fuzzy.dist' gives the fuzzy distance calculated between submitted and matched plant names, calculated internally with stringdist_left_join.

Column 'Auth.dist' gives the Levenshtein distance calculated between submitted and matched authorship names, if the former were provided. This distance is calculated in the same way as for the WFO.match function via adist.

Column 'Subseq' gives different numbers for different matches for the same plant name.

Column 'Hybrid' shows whether there was a hybrid character in the scientificName.

Column 'New.accepted' shows whether the species details correspond to the current accepted name.

Column 'Old.status' gives the taxonomic status of the first match with the non-blank acceptedNameUsageID.

Column 'Old.ID' gives the ID of the first match with the non-blank acceptedNameUsageID.

Column 'Old.name' gives the name of the first match with the non-blank acceptedNameUsageID.
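
The distances reported in columns 'Fuzzy.dist' and 'Auth.dist' can be explored directly with stringdist (from the stringdist package) and adist (from utils). The sketch below is only an illustration; the authorship strings are assumed for the example.

```r
library(stringdist)

submitted <- "Faidherbia albiad"
matched <- "Faidherbia albida"

# Levenshtein distance: the swapped letters count as two edits
dist.lv <- stringdist(submitted, matched, method = "lv")   # 2
# Damerau-Levenshtein distance: a transposition counts as one edit
dist.dl <- stringdist(submitted, matched, method = "dl")   # 1

# Levenshtein distance between authorship strings via adist
auth.dist <- adist("(Del.) A.Chev.", "(Delile) A.Chev.")   # 3
```

The choice of stringdist.method therefore affects which candidate names fall within fuzzydist.max.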

Value

The function returns a data set with the matched species details from the WFO.

Author(s)

Roeland Kindt (World Agroforestry, CIFOR-ICRAF)

References

World Flora Online. An Online Flora of All Known Plants. https://www.worldfloraonline.org

Sigovini M, Keppel E, Tagliapietra D. 2016. Open Nomenclature in the biodiversity era. Methods in Ecology and Evolution 7: 1217-1225.

Kindt, R. 2020. WorldFlora: An R package for exact and fuzzy matching of plant names against the World Flora Online taxonomic backbone data. Applications in Plant Sciences 8(9): e11388.

Kindt, R. 2023. Standardizing tree species names of GlobalTreeSearch with WorldFlora while testing the faster matching function of WFO.match.fuzzyjoin. https://rpubs.com/Roeland-KINDT/996500

See Also

WFO.match

Examples


## Not run: 
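library(WorldFlora)
library(fuzzyjoin)

# Example subset of the WFO taxonomic backbone that ships with the package
data(WFO.example)

# Submitted names, including a synonym and misspelled or truncated names
spec.test <- data.frame(spec.name=c("Faidherbia albida", "Acacia albida",
    "Faidherbia albiad",
    "Omalanthus populneus", "Pygeum afric"))

# Default matching with the Levenshtein distance
WFO.match.fuzzyjoin(spec.data=spec.test, WFO.data=WFO.example)

# Using the Damerau-Levenshtein distance
WFO.match.fuzzyjoin(spec.data=spec.test, WFO.data=WFO.example,
    stringdist.method="dl")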

## End(Not run)

[Package WorldFlora version 1.14-3 Index]