edit_dist_df {lingdist}    R Documentation

Compute edit distance between all row pairs of a dataframe

Description

Compute the average edit distance between all row pairs of a dataframe. Empty or NA cells are ignored. If none of the values in a row are valid strings, all average distances involving that row are set to -1.
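The averaging rule described above can be sketched in base R. This is an illustration only, not lingdist's implementation: the helper names lev_tokens and avg_row_dist are hypothetical, unit costs stand in for 'cost_mat', and it assumes that a column is skipped whenever either of the two cells is NA or empty.

```r
# Illustrative sketch only; not the code used by lingdist.
# Token-level Levenshtein distance with unit costs.
lev_tokens <- function(x, y) {
  n <- length(x)
  m <- length(y)
  d <- matrix(0, n + 1, m + 1)
  d[, 1] <- 0:n   # cost of deleting every token of x
  d[1, ] <- 0:m   # cost of inserting every token of y
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      sub <- if (x[i] == y[j]) 0 else 1
      d[i + 1, j + 1] <- min(d[i, j + 1] + 1,   # deletion
                             d[i + 1, j] + 1,   # insertion
                             d[i, j] + sub)     # substitution or match
    }
  }
  d[n + 1, m + 1]
}

# Average distance between two rows given as character vectors.
# Columns where either cell is NA or empty are skipped (assumed behaviour);
# -1 is returned when no column is usable.
avg_row_dist <- function(row1, row2, delim = "") {
  ok <- !is.na(row1) & !is.na(row2) & nzchar(row1) & nzchar(row2)
  if (!any(ok)) return(-1)
  dists <- mapply(function(a, b) {
    lev_tokens(strsplit(a, delim, fixed = TRUE)[[1]],
               strsplit(b, delim, fixed = TRUE)[[1]])
  }, row1[ok], row2[ok])
  mean(dists)
}

# The second column is ignored because one of its cells is NA.
avg_row_dist(c("a_bc_d", "d_bc_a"), c("b_bc_d", NA), delim = "_")
```

To apply the sketch to rows of a dataframe, convert each row to a character vector first, for example with unlist(df[1, ]).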

Usage

edit_dist_df(
  data,
  cost_mat = NULL,
  delim = "",
  squareform = FALSE,
  symmetric = TRUE,
  parallel = FALSE,
  n_threads = 2L
)

Arguments

data

Dataframe with n rows and m columns, where the n rows are the languages or dialects involved in the calculation and the m columns are the words (at most m per language) on which the comparison is based. The rownames are the language ids.

cost_mat

Dataframe in squareform giving the cost of deleting a symbol, inserting a symbol, or substituting one symbol for another. Rownames and colnames are symbols. 'cost_mat[char1,"_NULL_"]' is the cost of deleting char1 and 'cost_mat["_NULL_",char1]' is the cost of inserting it. When an operation is not defined in 'cost_mat', its cost is set to 0 if the two symbols are the same and to 1 otherwise.
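A cost dataframe of this shape can be built in base R as sketched below. The symbol inventory and the cost values are hypothetical and chosen only to show the layout; as.data.frame() is used on a matrix so that the "_NULL_" name is kept unchanged.

```r
# Hypothetical symbol inventory; "_NULL_" stands for insertion/deletion.
symbols <- c("a", "b", "bc", "d", "_NULL_")

# Start from the default costs: 0 for identical symbols, 1 otherwise.
cost <- matrix(1, nrow = length(symbols), ncol = length(symbols),
               dimnames = list(symbols, symbols))
diag(cost) <- 0

cost["a", "b"] <- 0.5       # substituting a with b
cost["b", "a"] <- 0.5       # substituting b with a (keeps the matrix symmetric)
cost["bc", "_NULL_"] <- 2   # deleting bc
cost["_NULL_", "bc"] <- 2   # inserting bc

cost_mat <- as.data.frame(cost)

# A symmetric cost matrix is what makes 'symmetric = TRUE' appropriate.
isSymmetric(as.matrix(cost_mat))
```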

delim

The delimiter separating the atomic symbols within each string.
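The effect of a delimiter can be pictured with base R's strsplit(). This is only an analogy for how a string breaks into symbols; that the default delim = "" treats every character as its own symbol is an assumption, and the variable names are hypothetical.

```r
# With delim = "_", multi-character symbols such as "bc" stay intact.
tokens_delim <- strsplit("a_bc_d", "_", fixed = TRUE)[[1]]
tokens_delim   # "a" "bc" "d"

# With an empty delimiter, strsplit() splits into single characters.
tokens_char <- strsplit("abcd", "", fixed = TRUE)[[1]]
tokens_char    # "a" "b" "c" "d"
```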

squareform

Whether to return a dataframe in squareform.

symmetric

Whether the result matrix is symmetric. This depends on whether 'cost_mat' is symmetric.

parallel

Whether to parallelize the computation.

n_threads

The number of threads used to parallelize the computation. Only meaningful if 'parallel' is TRUE.

Value

A dataframe in long table form if 'squareform' is FALSE, otherwise in squareform. If 'symmetric' is TRUE, the long table form has C_n^2 = n(n-1)/2 rows (one per unordered pair of rows of 'data'); otherwise it has n^2 rows.
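The two row counts can be checked with a small base R sketch. The language ids are hypothetical, and the enumeration only illustrates the counting; it is not how the package builds its output.

```r
ids <- c("lang1", "lang2", "lang3", "lang4")   # hypothetical row names
n <- length(ids)

# Unordered pairs of distinct rows: choose(n, 2) = n * (n - 1) / 2 of them.
unordered <- t(combn(ids, 2))

# Ordered pairs, including each row paired with itself: n^2 of them.
ordered <- expand.grid(row = ids, col = ids, stringsAsFactors = FALSE)

nrow(unordered) == choose(n, 2)
nrow(ordered) == n^2
```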

Examples

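# Two languages (rows a and b) with two words each; symbols are separated by "_"
df <- as.data.frame(rbind(a=c("a_bc_d","d_bc_a"),b=c("b_bc_d","d_bc_a")))
# Empty cost matrix: no operation is defined, so the default costs apply
cost.mat <- data.frame()
# Long table form
result <- edit_dist_df(df, cost_mat=cost.mat, delim="_")
# Squareform
result <- edit_dist_df(df, cost_mat=cost.mat, delim="_", squareform=TRUE)
# Parallel computation with 4 threads
result <- edit_dist_df(df, cost_mat=cost.mat, delim="_", parallel=TRUE, n_threads=4)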

[Package lingdist version 1.0 Index]