edit_dist_df {lingdist} | R Documentation |
Compute edit distance between all row pairs of a dataframe
Description
Compute the average edit distance between all row pairs of a dataframe. Empty or NA cells are ignored. If no value in a row is a valid string, every average distance involving that row is set to -1.
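The averaging rule can be illustrated with a minimal base R sketch. It is not the package's implementation: the helper names `sym_dist` and `avg_row_dist` are hypothetical, costs are fixed at 1 per operation, and it assumes a word is skipped whenever either of the two rows has an empty or NA cell for it.

```r
# Levenshtein distance between two symbol vectors with unit costs.
sym_dist <- function(x, y) {
  d <- matrix(0, length(x) + 1, length(y) + 1)
  d[, 1] <- 0:length(x)
  d[1, ] <- 0:length(y)
  for (i in seq_along(x)) {
    for (j in seq_along(y)) {
      sub_cost <- if (x[i] == y[j]) 0 else 1
      d[i + 1, j + 1] <- min(d[i, j + 1] + 1,     # delete x[i]
                             d[i + 1, j] + 1,     # insert y[j]
                             d[i, j] + sub_cost)  # substitute or match
    }
  }
  d[length(x) + 1, length(y) + 1]
}

# Average distance between two rows (assumption: a word is skipped when
# either cell is NA or ""; -1 is returned when no word can be compared).
avg_row_dist <- function(row1, row2, delim = "_") {
  ok <- !is.na(row1) & !is.na(row2) & row1 != "" & row2 != ""
  if (!any(ok)) return(-1)
  dists <- mapply(function(a, b) {
    sym_dist(strsplit(a, delim, fixed = TRUE)[[1]],
             strsplit(b, delim, fixed = TRUE)[[1]])
  }, row1[ok], row2[ok])
  mean(dists)
}

avg_row_dist(c("a_bc_d", "d_bc_a"), c("b_bc_d", "d_bc_a"))  # 0.5
avg_row_dist(c("a_bc_d", NA), c("b_bc_d", "d_bc_a"))        # 1
avg_row_dist(c(NA, ""), c("b_bc_d", "d_bc_a"))              # -1
```

In the first call the two words differ by one substitution and zero substitutions respectively, so the average is 0.5. In the second call only the first word is comparable.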
Usage
edit_dist_df(
data,
cost_mat = NULL,
delim = "",
squareform = FALSE,
symmetric = TRUE,
parallel = FALSE,
n_threads = 2L
)
Arguments
data |
Dataframe with n rows and m columns: one row for each of the n languages or dialects to compare, and one column for each of the at most m words the comparison is based on. The rownames are the language ids. |
cost_mat |
Dataframe in squareform giving the cost of deleting, inserting or substituting a symbol. Rownames and colnames are symbols. 'cost_mat[char1,"_NULL_"]' is the cost of deleting char1 and 'cost_mat["_NULL_",char1]' is the cost of inserting it. When an operation is not defined in 'cost_mat', its cost is set to 0 if the two symbols are the same and to 1 otherwise. |
delim |
The delimiter separating atomic symbols. |
squareform |
Whether to return a dataframe in squareform. |
symmetric |
Whether the result matrix is symmetric. This depends on whether 'cost_mat' is symmetric. |
parallel |
Whether to parallelize the computation. |
n_threads |
The number of threads used to parallelize the computation. Only meaningful if 'parallel' is TRUE. |
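As an illustration of the 'cost_mat' layout described above, the following sketch builds a small squareform cost table in base R. The symbols and cost values are made up for the example. Every entry is filled in explicitly so that the sketch does not rely on how undefined operations are detected.

```r
# Hypothetical symbols; "_NULL_" stands for the empty symbol.
symbols <- c("a", "b", "_NULL_")

# Rows and columns are both indexed by symbol (squareform).
cost_mat <- as.data.frame(
  matrix(c(0,   0.5, 1,
           0.5, 0,   1,
           2,   2,   0),
         nrow = 3, byrow = TRUE,
         dimnames = list(symbols, symbols))
)

cost_mat["a", "b"]        # substitution between "a" and "b": 0.5
cost_mat["a", "_NULL_"]   # deleting "a": 1
cost_mat["_NULL_", "a"]   # inserting "a": 2

# Insertion and deletion costs differ here, so the table is not symmetric.
isSymmetric(as.matrix(cost_mat))  # FALSE
```

With a table like this one, where insertion and deletion costs differ, passing 'symmetric = FALSE' would be the consistent choice, since the distance from one row to another need not equal the distance in the opposite direction.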
Value
A dataframe in long table form if 'squareform' is FALSE, otherwise in squareform. If 'symmetric' is TRUE, the long table form has C(n, 2) = n(n-1)/2 rows; otherwise it has n^2 rows.
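The two row counts can be checked with a short base R sketch. The language ids and the column names 'lang1' and 'lang2' are hypothetical and are not taken from the package's output. The sketch assumes the n^2 count includes each id paired with itself, as that count implies.

```r
ids <- c("a", "b", "c", "d")   # hypothetical language ids
n <- length(ids)

# symmetric = TRUE: one row per unordered pair of distinct languages.
unordered <- t(combn(ids, 2))
nrow(unordered) == choose(n, 2)   # TRUE: 6 rows

# symmetric = FALSE: one row per ordered pair, including each id with itself.
ordered <- expand.grid(lang1 = ids, lang2 = ids, stringsAsFactors = FALSE)
nrow(ordered) == n^2              # TRUE: 16 rows
```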
Examples
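# Two languages (rows "a" and "b") with two words each; symbols are separated by "_".
df <- as.data.frame(rbind(a=c("a_bc_d","d_bc_a"),b=c("b_bc_d","d_bc_a")))
# Empty cost table: no operation is defined, so the default costs (0 or 1) apply.
cost.mat <- data.frame()
# Long table form (the default).
result <- edit_dist_df(df, cost_mat=cost.mat, delim="_")
# Squareform.
result <- edit_dist_df(df, cost_mat=cost.mat, delim="_", squareform=TRUE)
# Parallel computation with 4 threads.
result <- edit_dist_df(df, cost_mat=cost.mat, delim="_", parallel=TRUE, n_threads=4)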