auto_match_seqs {concatipede} | R Documentation |
Build a template table with automatically matched sequence names
Description
The algorithm used to match sequences across fasta files based on their names is outlined below.
Usage
auto_match_seqs(x, method = "lv", xlsx)
Arguments
x |
A table (data frame or tibble) typically produced by
|
method |
Method for string distance calculation. See
|
xlsx |
Optional, a path to use to save the output table as an Excel file. |
Details
Consider N fasta files, where fasta file i contains n_i sequence names. Matching the names as well as possible across the fasta files is similar to identifying homologous proteins across species, e.g. with reciprocal blast.
The algorithm steps are:
1. For each pair of fasta files, identify matching names using a reciprocal match approach: two names match if and only if each is the other's best match.
2. Those matches across fasta files define a graph, with sequence names as nodes and reciprocal best matches as edges.
3. We identify sub-graphs such that (i) they contain at most one sequence name per fasta file and (ii) all nodes in a given sub-graph are fully connected (i.e., every pair of names in the sub-graph is a reciprocal best match between the two corresponding fasta files).
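The steps above can be illustrated with a minimal sketch. It is not the package's internal code. It assumes Levenshtein distance computed with base R's utils::adist() as a stand-in for the string distance selected by method, it handles only three fasta files, and the helper reciprocal_best_matches() and the sequence names are hypothetical. Ties are resolved by taking the first minimum.

```r
# Hypothetical helper (not part of concatipede): reciprocal best matches
# between two character vectors of sequence names.
reciprocal_best_matches <- function(a, b) {
  d <- utils::adist(a, b)                  # Levenshtein distances, length(a) x length(b)
  best_b_for_a <- apply(d, 1, which.min)   # for each a[i], index of its closest name in b
  best_a_for_b <- apply(d, 2, which.min)   # for each b[j], index of its closest name in a
  # a[i] is kept only if the best match of its best match is a[i] itself
  keep <- best_a_for_b[best_b_for_a] == seq_along(a)
  data.frame(a = a[keep], b = b[best_b_for_a[keep]], stringsAsFactors = FALSE)
}

# Made-up sequence names standing in for three fasta files
file1 <- c("Homo_sapiens_COI", "Mus_musculus_COI", "Danio_rerio_COI")
file2 <- c("Mus_musculus_16S", "Homo_sapiens_16S")
file3 <- c("Homo_sapiens_cytb", "Mus_musculus_cytb")

# Step 1: reciprocal best matches for each pair of files
m12 <- setNames(reciprocal_best_matches(file1, file2), c("f1", "f2"))
m13 <- setNames(reciprocal_best_matches(file1, file3), c("f1", "f3"))
m23 <- setNames(reciprocal_best_matches(file2, file3), c("f2", "f3"))

# Steps 2 and 3: keep only the groups in which all three pairs are
# reciprocal best matches (fully connected, one name per file)
triples <- merge(merge(m12, m13, by = "f1"), m23, by = c("f2", "f3"))
triples
```

In this toy input, "Danio_rerio_COI" has no reciprocal best match in the other files, so it does not appear in triples. In the output of auto_match_seqs, such names are appended to the end of the table rather than dropped.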
Value
A table (tibble) with the same columns as x and with sequence
names automatically matched across fasta files. Sequence names which did
not have a best reciprocal match in other fasta files are appended to
the end of the table, so that the output table columns contain all the
unique sequence names present in the corresponding column of the input
table. The first column, "name", contains a suggested name for the row
(not guaranteed to be unique). If a path was provided to the xlsx
argument, an Excel file is saved and the table is returned invisibly.
Examples
xlsx_file <- concatipede_example("sequences-test-matching.xlsx")
xlsx_template <- readxl::read_xlsx(xlsx_file)
auto_match_seqs(xlsx_template)
## Not run:
auto_match_seqs(xlsx_template, xlsx = "my-automatic-output.xlsx")
## End(Not run)