read.ids {dplR} | R Documentation |
Read Site-Tree-Core IDs
Description
These functions try to read site, tree, and core IDs from a rwl data.frame.
Usage
read.ids(rwl, stc = c(3, 2, 3), ignore.site.case = FALSE,
ignore.case = FALSE, fix.typos = FALSE, typo.ratio = 5,
use.cor = TRUE)
autoread.ids(rwl, ignore.site.case = TRUE, ignore.case = "auto",
fix.typos = TRUE, typo.ratio = 5, use.cor = TRUE)
Arguments
rwl |
a |
stc |
a vector of three integral values or character string
"auto". The numbers indicate the number of characters to split the
site code ( |
use.cor |
a |
The following parameters affect the handling of suspected typing errors. Some have different default values in read.ids and autoread.ids.
ignore.site.case |
a |
ignore.case |
a |
fix.typos |
a |
typo.ratio |
a |
Details
Because dendrochronologists often take more than one core per tree, it is occasionally useful to calculate within vs. between tree variance. The International Tree Ring Data Bank (ITRDB) allows the first eight characters in an rwl file for series IDs but these are often shorter. Typically the creators of rwl files use a logical labeling method that can allow the user to determine the tree and core ID from the label.
Argument stc tells how each series name separates into site, tree, and core IDs. For instance, a series code might be "ABC011", indicating site "ABC", tree 1, core 1. If this format is consistent, then the stc mask would be c(3, 2, 3), allowing up to three characters for the core ID (i.e., pad to the right). If it is not possible to define the scheme (and often it is not possible to machine read IDs), then the output data.frame can be built manually. See Value for format.
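The effect of a fixed stc mask can be illustrated with plain substring extraction. The following is a minimal sketch in base R, not the code used by read.ids; the helper name split_by_stc is hypothetical and introduced only for this illustration.

```r
# Hypothetical helper (not part of dplR): cut each series name into
# site / tree / core substrings according to a fixed stc mask.
split_by_stc <- function(ids, stc = c(3, 2, 3)) {
  stopifnot(length(stc) == 3, all(stc >= 0))
  site_end <- stc[1]
  tree_end <- stc[1] + stc[2]
  core_end <- stc[1] + stc[2] + stc[3]
  data.frame(
    site = substr(ids, 1, site_end),
    tree = substr(ids, site_end + 1, tree_end),
    # substr() stops at the end of the string, so shorter core parts are fine
    core = substr(ids, tree_end + 1, core_end),
    row.names = ids,
    stringsAsFactors = FALSE
  )
}

parts <- split_by_stc(c("ABC011", "ABC012", "ABC021"))
print(parts)
```

With the mask c(3, 2, 3), the name "ABC011" is cut into site "ABC", tree "01" and core "1", which matches the example in the text.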
The function autoread.ids is a wrapper to read.ids with stc="auto", i.e. automatic detection of the site / tree / core scheme, and different default values of some parameters. In automatic mode, the names in the same rwl can even follow different site / tree / core schemes. As there are numerous possible encoding schemes for naming measurement series, the function cannot always produce the correct result.
With stc="auto", the site part can be one of the following.
- In names mostly consisting of numbers, the longest common prefix is the site part.
- Alphanumeric site part ending with a letter, when followed by numbers and letters.
- Alphabetic site part (the actual definition is quite complicated). Setting ignore.case to "auto" allows the function to try to guess when a case change in the middle of a sequence of letters signifies a boundary between the site part and the tree part.
- The characters before the first sequence of space / punctuation characters in a name that contains at least two such sequences.
These descriptions are somewhat general, and the details can be found in regular expressions inside the function. If a name does not match any of the descriptions, it is matched against a previously found site part, starting from the longest.
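As an illustration of the "longest common prefix" rule for mostly numeric names, here is a small sketch in base R. It is a simplified stand-in: the function itself relies on regular expressions and further rules, and the helper name longest_common_prefix is hypothetical.

```r
# Hypothetical helper (not part of dplR): longest prefix shared by all names.
longest_common_prefix <- function(x) {
  if (length(x) == 0) return("")
  prefix_len <- 0
  for (i in seq_len(min(nchar(x)))) {
    # stop at the first position where the names disagree
    if (length(unique(substr(x, i, i))) > 1) break
    prefix_len <- i
  }
  substr(x[1], 1, prefix_len)
}

lcp <- longest_common_prefix(c("120011", "120012", "120103"))
print(lcp)
```

For these made-up names the shared prefix is "120", which would be taken as the site part under this rule.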
The following ID schemes are detected and supported in the tree / core part. The detection is done per site.
- Numbers in tree part, core part starts with something else.
- Letters in tree part, core part starts with something else.
- Letters, either tree part all lower case and core part all upper case or vice versa. For this to work, ignore.case must be set to "auto" or FALSE.
- All digits. In this case, the number of characters belonging to the tree and core parts is detected with one of the following methods.
  - If numeric tree parts were found before, it is assumed that the core part is missing (one core per tree).
  - If the series are numbered continuously, one core per tree is assumed.
  - Otherwise, try to find a core part as the suffix so that the cores are numbered continuously.
  If none of the above fits, the tree / core split of the all-digit names will be decided with the methods described further down the list, or finally with the fallback mechanism.
- The combined tree / core part is empty or one character. In this case, the core part is assumed to be missing.
- Tree and core parts separated by a punctuation or white space character.
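The first scheme in the list (numbers in the tree part, core part starting with something else) can be sketched with base R regular expressions. This is only an illustration under simplifying assumptions, not the patterns used inside the function; the helper name split_tree_core is hypothetical.

```r
# Hypothetical helper (not part of dplR): leading digits form the tree part,
# everything after them forms the core part.
split_tree_core <- function(x) {
  ok <- grepl("^[0-9]", x)  # the scheme only applies to names starting with a digit
  tree <- ifelse(ok, sub("^([0-9]+).*$", "\\1", x), NA_character_)
  core <- ifelse(ok, sub("^[0-9]+", "", x), NA_character_)
  data.frame(tree = tree, core = core, row.names = x,
             stringsAsFactors = FALSE)
}

tc <- split_tree_core(c("01A", "01B", "12A", "7"))
print(tc)
```

A name without a core part, such as "7" here, ends up with an empty core string, which corresponds to the "one core per tree" interpretation.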
If the split of a tree / core part cannot be found with any of the methods described above, the prefix of the string is matched against a previously found tree part, starting from the longest. The fallback mechanism for the still undecided tree / core parts is one of the following. The first one is used if use.cor is TRUE, number two if it is FALSE.
- Pairwise correlation coefficients are computed between all remaining series. Pairs of series with above median correlation are flagged as similar, and the other pairs are flagged as dissimilar. Each possible number of characters (minimum 1) is considered for the share of the tree ID. The corresponding unique would-be tree IDs determine a set of clusterings where one cluster is formed by all the measurement series of a single tree. For each clustering (allocation of characters), an agreement score is computed. The agreement score is defined as the sum of the number of similar pairs with matching cluster number and the number of dissimilar pairs with non-matching cluster number. The number of characters with the maximum agreement is chosen.
- If the majority of the names in the site use k characters for the tree part, that number is chosen. Otherwise, one core per tree is assumed. Parameter typo.ratio has a double meaning as it also defines what is meant by majority here: at least typo.ratio / (typo.ratio + 1) * n.tot, where n.tot is the number of names in the site.
In both fallback mechanisms, the number of characters allocated for the tree part will be increased until all trees have a non-zero ID or there are no more characters.
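The agreement score of the correlation-based fallback can be sketched as follows. This is a minimal illustration of the definition given above, assuming a ready-made correlation matrix with the tree / core strings as column names; it is not the package's implementation, and the helper name agreement_by_length and the toy matrix are made up for the example.

```r
# Hypothetical helper (not part of dplR): agreement score for each possible
# number of characters k allocated to the tree part.
agreement_by_length <- function(r) {
  ids <- colnames(r)
  pairs <- which(upper.tri(r), arr.ind = TRUE)  # each pair of series once
  r_pairs <- r[upper.tri(r)]                    # same (column-major) order
  similar <- r_pairs > median(r_pairs)          # above-median pairs are "similar"
  max_len <- max(nchar(ids))
  scores <- vapply(seq_len(max_len), function(k) {
    tree <- substr(ids, 1, k)                   # would-be tree IDs
    same <- tree[pairs[, 1]] == tree[pairs[, 2]]
    sum(similar & same) + sum(!similar & !same)
  }, numeric(1))
  names(scores) <- seq_len(max_len)
  scores
}

# Toy correlation matrix: two trees with two cores each. In practice such a
# matrix could come from cor(rwl, use = "pairwise.complete.obs").
ids <- c("01A", "01B", "02A", "02B")
r <- matrix(0.1, 4, 4, dimnames = list(ids, ids))
r["01A", "01B"] <- r["01B", "01A"] <- 0.9
r["02A", "02B"] <- r["02B", "02A"] <- 0.9
diag(r) <- 1

scores <- agreement_by_length(r)
best_k <- as.integer(names(scores)[which.max(scores)])
print(scores)
print(best_k)
```

In this toy case the scores are 2, 6 and 4 for one, two and three characters, so two characters would be allocated to the tree part.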
Suspected typing errors will be fixed by the function if fix.typos is TRUE. The parameter typo.ratio affects the eagerness to fix typos, i.e. the number of counterexamples required to declare a typo. The following main typo fixing mechanisms are implemented:
- Site IDs. If a rare site string resembles an alternative that is at least typo.ratio times more frequent, and if fixing it would not create any name collisions, make the fix. The alternative string must be unique, or if there is more than one alternative, it is enough if only one of them is a look-alike string. Any kind of substitution in one character place is allowed if the alternative string has the same length as the original string. The alternative string can be one character longer or one character shorter than the original string, but only if it involves interpreting one digit as the look-alike letter or vice versa. There are requirements on how long a site string must be in order to be eligible for replacement / typo fixing: it cannot be shortened to zero length, and the only character of a site string cannot be changed. The parameters ignore.case and ignore.site.case have some effect on this typo fixing mechanism.
- Tree and core IDs. If all tree / core parts of a site have the same length, each character position is inspected individually. If the characters in the i:th position are predominantly digits (letters), any letters (digits) are changed to the corresponding look-alike digit (letter) if there is one. The look-alike groups are {0, O, o}, {1, I, i}, {5, S, s} and {6, G}. The parameter typo.ratio determines the decision threshold of interpreting the type of each character position as letter (digit): the ratio of letters (digits) to the total number of characters must be at least typo.ratio / (typo.ratio + 1). If a name differs from the majority type in more than one character position, it is not fixed. Also, no fixes are performed if any of them would cause a possible monotonic order of numeric prefixes to break.
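The per-position look-alike rule for tree and core IDs can be sketched in base R. This is a deliberately simplified illustration: it only handles the letter-to-digit direction and omits the checks on multiple differing positions and on monotonic order. The helper name fix_lookalike_digits and the sample names are hypothetical.

```r
# Hypothetical helper (not part of dplR): in positions that are predominantly
# digits, replace look-alike letters by the corresponding digit.
fix_lookalike_digits <- function(x, typo.ratio = 5) {
  stopifnot(length(unique(nchar(x))) == 1)  # all names must have equal length
  to_digit <- c(O = "0", o = "0", I = "1", i = "1",
                S = "5", s = "5", G = "6")
  threshold <- typo.ratio / (typo.ratio + 1)  # 5 / 6 with the default
  chars <- do.call(rbind, strsplit(x, ""))    # one row per name
  for (j in seq_len(ncol(chars))) {
    is_digit <- grepl("^[0-9]$", chars[, j])
    if (mean(is_digit) >= threshold) {
      fixable <- !is_digit & chars[, j] %in% names(to_digit)
      chars[fixable, j] <- to_digit[chars[fixable, j]]
    }
  }
  apply(chars, 1, paste, collapse = "")
}

x <- c("011", "012", "021", "022", "O31", "032", "041")
fixed <- fix_lookalike_digits(x)
print(fixed)
```

Here six of the seven first characters are digits (6/7 is above the threshold 5/6), so the letter "O" in "O31" is read as the digit "0", giving "031".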
The function attempts to convert the tree and core substrings to integral values. When this succeeds, the converted values are copied to the output without modification. When non-integral substrings are observed, each unique tree is assigned a unique integral value. The same applies to cores within a tree, but there are some subtleties with respect to the handling of duplicates. Substrings are sorted before assigning the numeric IDs.
The order of columns in rwl, in most cases, does not affect the tree and core IDs assigned to each series.
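The conversion from substrings to numeric IDs can be approximated as below. This sketch ignores the subtleties with duplicates mentioned above and is not the package's code; the helper name to_integer_ids is hypothetical.

```r
# Hypothetical helper (not part of dplR): keep integral substrings as they
# are, otherwise number the sorted unique substrings 1, 2, 3, ...
to_integer_ids <- function(x) {
  if (all(grepl("^[0-9]+$", x))) {
    as.integer(x)
  } else {
    match(x, sort(unique(x)))
  }
}

numeric_like <- to_integer_ids(c("02", "10", "02"))
alphabetic <- to_integer_ids(c("b", "a", "b", "c"))
print(numeric_like)
print(alphabetic)
```

Because the unique substrings are sorted before numbering, the assigned IDs do not depend on the order in which the names appear, which is consistent with the remark about column order.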
Value
A data.frame with column one named "tree" giving an ID for each tree and column two named "core" giving an ID for each core. The original series IDs are copied from rwl as rownames. The order of the rows in the output matches the order of the series in rwl. If more than one site is detected, an additional third column named "site" will contain a site ID. All columns have integral valued numeric values.
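When the naming scheme cannot be read by machine, a data.frame in this format can be assembled by hand. The following sketch uses made-up series names and IDs purely for illustration.

```r
# Made-up series names; in practice these would be the column names of rwl.
series <- c("ABC011", "ABC012", "ABC021")

# Column one "tree", column two "core", series IDs as row names.
ids <- data.frame(tree = c(1, 1, 2),
                  core = c(1, 2, 1),
                  row.names = series)
print(ids)
```

The rows are given in the same order as the series, as required for the output of read.ids.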
Author(s)
Andy Bunn (original version) and Mikko Korpela (patches, stc="auto", fix.typos, etc.).
See Also
Examples
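library(dplR)   # provides read.ids, autoread.ids and the ca533 data set
library(utils)
data(ca533)
# Fixed scheme: 3 site characters, 2 tree characters, up to 3 core characters
read.ids(ca533, stc = c(3, 2, 3))
# Automatic detection of the site / tree / core scheme
autoread.ids(ca533)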