moimport {MLMOI}

R Documentation

Imports molecular data in various formats and transforms them into a standard format.

Description

moimport() imports molecular data from Excel workbooks. The function handles various types of molecular data (e.g. STRs, SNPs), codings (e.g. 4-letter vs. IUPAC format for SNPs), and detects inconsistencies (e.g. typos, incorrect entries). moimport() allows users to import data from single or multiple worksheets.

Usage

moimport(
  file,
  multsheets = FALSE,
  nummtd = 0,
  molecular = "str",
  coding = "integer",
  transposed = FALSE,
  keepmtd = FALSE,
  export = NULL,
  keepwarnings = NULL
)

Arguments

`file`	string; specifying the path to the file to be imported.
`multsheets`	logical; indicating whether data is contained in a single or multiple worksheets. The default value is `multsheets = FALSE`, corresponding to data contained in a single worksheet.
`nummtd`	number or numeric vector; number of metadata columns (e.g. date, sample location, etc.) in the worksheet(s) to be imported (default value `nummtd = 0`). For a multiple-worksheet dataset, a single integer is sufficient if all worksheets have the same number of metadata columns. If the numbers differ, they have to be specified by an integer vector.
`molecular`	string vector or list; specifies the type of molecular data to be imported. STR, SNP, amino acid and codon markers are specified with 'STR', 'SNP', 'amino' and 'codon' values, respectively (default value `molecular = 'str'`). For importing single worksheets, `molecular` is a single string or string vector. When importing multiple worksheets, `molecular` is a string if the data contains only one type of molecular data. Otherwise it is a list, with the k-th element being a string value or a vector describing the data types of the k-th worksheet.
`coding`	string vector or list; specifies the coding of each data variable (marker) depending on its type. Admissible values for `coding` depend on the molecular data type: 'integer', 'nearest', 'ceil' and 'floor' for STRs; '4let' and 'iupac' for SNPs; '3let', '1let' and 'full' for amino acids; and 'triplet' and 'compact' for codons.
`transposed`	logical or logical vector; if markers are entered in rows and samples in columns, set `transposed = TRUE` (default value `transposed = FALSE`). When importing multiple worksheets, `transposed` can be a logical vector specifying for each worksheet whether it is in transposed format.
`keepmtd`	logical; determines whether metadata (e.g., date) should be retained during import (default value `keepmtd = FALSE`).
`export`	string; the path where the imported data is stored in standardized format. Data is not stored if no path is specified (default value `export = NULL`).
`keepwarnings`	string; the path where the warnings are stored. Warnings are not stored if no path is specified (default value `keepwarnings = NULL`).

Details

Each worksheet of the data to be imported must have one of the following formats: i) one row per sample and one column per marker, where cells can have multiple entries separated by a special character (separator), e.g. a punctuation character; ii) one column per marker and multiple rows per sample (standard format); iii) one row per sample and multiple columns per marker. Importantly, formats ii) and iii) cannot be combined within one worksheet (see section Warnings and Errors). Combinations of other formats are permitted but might result in warnings. Additionally, occurrences of different separators are reported (see section Warnings and Errors).
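To make the three layouts concrete, the sketch below builds the same two hypothetical samples in each layout with base R and converts layout i) into layout ii). The sample IDs, the marker name `m1`, the allele values and the separator `/` are illustrative assumptions. The conversion is not moimport()'s internal implementation.

```r
# Layout i): one row per sample; a cell may hold several entries
# (the separator "/" is an assumption for this sketch)
fmt1 <- data.frame(id = c("S1", "S2"), m1 = c("12/14", "13"),
                   stringsAsFactors = FALSE)

# Layout ii): one column per marker, multiple rows per sample (standard format)
fmt2 <- data.frame(id = c("S1", "S1", "S2"), m1 = c("12", "14", "13"),
                   stringsAsFactors = FALSE)

# Layout iii): one row per sample, multiple columns per marker
# (the second column of marker m1 is named m1.1 here only because
# data frames need distinct column names)
fmt3 <- data.frame(id = c("S1", "S2"), m1 = c("12", "13"), m1.1 = c("14", NA),
                   stringsAsFactors = FALSE)

# Convert layout i) to layout ii) by splitting each cell on the separator
alleles <- strsplit(fmt1$m1, "/", fixed = TRUE)
converted <- data.frame(id = rep(fmt1$id, lengths(alleles)),
                        m1 = unlist(alleles),
                        stringsAsFactors = FALSE)
```

In this sketch `converted` holds the same rows as `fmt2`, which is the layout the Details text calls the standard format.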

Users should check the following before data import:

the dataset is placed in the first worksheet of the workbook;
in case of multiple worksheets, all worksheets contain data (additional worksheets need to be removed);
sample IDs are placed in the first column (first row in case of transposed data; see section Exceptions);
marker labels are placed in the first row (first column in case of transposed data; see section Exceptions);
sample IDs and marker labels are unique (a duplicated ID or label is allowed only when the data of that sample or marker occupies consecutive rows or columns);
entries such as sentences (e.g. comments in the worksheet) or meaningless words (e.g. 'missing' for missing data) are removed from data;
metadata columns (rows in case of transposed data) are placed between sample IDs and molecular-marker columns.

If data is contained in multiple worksheets, the above requirements need to be fulfilled for every worksheet in the Excel workbook. Not all sample IDs must occur in every worksheet. The sample ID must not be confused with the patient's ID: the former refers to a particular sample taken from a patient, the latter to a unique patient. Several sample IDs can have the same patient's ID. In case of multiple-worksheet datasets, all marker labels across all worksheets need to be unique.

The option molecular needs to be specified as a vector for single-worksheet data (multsheets = FALSE) containing different types of molecular markers. A list is specified if the data is spread across multiple worksheets and the types of molecular data differ across the worksheets. List elements are vectors or single values, referring to the types of molecular data of the corresponding worksheets. Users do not need to set a vector if all markers are of the same molecular type (single- or multiple-worksheet dataset).

Setting the option coding as a vector or list works in the same way as setting the molecular type with molecular. Every molecular data type has a pre-specified default coding class, which users do not need to specify: 'integer' for STRs, '4let' for SNPs, '3let' for amino acids and 'triplet' for codons.
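The admissible and default codings can be collected in a small lookup, which may help when preparing the molecular and coding arguments. This is a sketch in base R: the helper `is_admissible()` is hypothetical and not part of MLMOI, and the two-worksheet workbook (STR data in the first worksheet, SNP data in the second) is an assumed example.

```r
# Admissible codings per molecular type, as listed under Arguments
admissible <- list(
  str   = c("integer", "nearest", "ceil", "floor"),
  snp   = c("4let", "iupac"),
  amino = c("3let", "1let", "full"),
  codon = c("triplet", "compact")
)

# Default coding per molecular type, as listed in Details
default_coding <- c(str = "integer", snp = "4let",
                    amino = "3let", codon = "triplet")

# Hypothetical helper (not part of MLMOI): is this coding listed for this type?
is_admissible <- function(molecular, coding) {
  coding %in% admissible[[tolower(molecular)]]
}

# Assumed two-worksheet workbook: worksheet 1 holds only STR markers,
# worksheet 2 holds only SNP markers, so each list element is a single string
molecular_multi <- list("str", "snp")
coding_multi    <- list("nearest", "iupac")

# Check every worksheet's coding against its molecular type
ok <- mapply(is_admissible, molecular_multi, coding_multi)
```

The lists would then be passed as `molecular = molecular_multi` and `coding = coding_multi` together with `multsheets = TRUE`.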

Value

Returns a data frame. moimport() imports heterogeneous data formats and converts them into a standard format that is free from typos (e.g. incompatible and unidentified entries) and appropriate for further analyses. Metadata is retained (if keepmtd = TRUE) and, in case of data from multiple worksheets, unified if metadata variables have the same labels across two or more worksheets. If the argument export is set, the result is saved in the first worksheet of the workbook specified by export. The imported/exported dataset is appropriate for other functions of the package.

Warnings and Errors

Warnings are usually generated when data is corrected; they point to suspicious entries in the original data. Users should read warnings carefully, check the respective entries and apply manual corrections if necessary. If an issue prevents the import, an error is raised and the function stops.

Errors usually occur if arguments are not set properly. Other causes of errors are: i) sample IDs in a worksheet are not uniquely defined, i.e., two samples in non-consecutive rows have the same sample ID; ii) the formats 'one column per marker and multiple rows per sample' and 'one row per sample and multiple columns per marker' are mixed.
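The rule on sample IDs (duplicates are acceptable only in consecutive rows) can be checked before import. A minimal sketch in base R follows; the helper `ids_are_contiguous()` is hypothetical, not part of MLMOI, and does not claim to reproduce how moimport() detects the problem.

```r
# Hypothetical helper (not part of MLMOI): TRUE if every sample ID occupies
# one unbroken run of rows, i.e. no ID reappears after a different ID
ids_are_contiguous <- function(ids) {
  runs <- rle(as.character(ids))      # collapse consecutive repeats
  anyDuplicated(runs$values) == 0L    # a repeated run value means a split ID
}

ids_are_contiguous(c("S1", "S1", "S2", "S3"))  # TRUE
ids_are_contiguous(c("S1", "S2", "S1"))        # FALSE
```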

Warnings are issued in several cases, above all when typos (e.g., punctuation characters) are found. Entries that cannot be identified under the molecular type or coding class specified by the user are also reported (e.g., '9' is reported when the marker is of type SNP, or 'L' is reported when the coding class of an amino-acid marker is '3let').

Empty rows and columns are deleted and eventually reported. Samples with ambiguous metadata (within a worksheet or, in case of a multiple-worksheet dataset, across worksheets) or missing metadata are also reported.

The function prints only the first 50 warnings. If there are more than 50 warnings, users are advised to set the argument keepwarnings in order to save the warnings in an Excel file.
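A possible way to keep all warnings is sketched below, using the test dataset shipped with the package. The use of tempdir(), the file name and the .xlsx extension are assumptions made for illustration; only the argument keepwarnings itself comes from the documentation above.

```r
# Destination for the warnings; directory, file name and extension are
# illustrative assumptions
warnfile <- file.path(tempdir(), "moimport_warnings.xlsx")

# Run the import only if MLMOI is installed
if (requireNamespace("MLMOI", quietly = TRUE)) {
  infile <- system.file("extdata", "testDatametadata.xlsx", package = "MLMOI")
  res <- MLMOI::moimport(infile, nummtd = 3, keepmtd = TRUE,
                         keepwarnings = warnfile)
}
```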

Exceptions

Transposed data: data is usually entered with samples in rows and markers in columns. Some users, however, enter data the opposite way; this is the case of transposed data. In that case, set the argument transposed = TRUE, or pass a logical vector in case of multiple worksheets with at least one worksheet being transposed.
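The two orientations, and the vector form of transposed, can be illustrated with base R. The sample IDs, marker names and values are hypothetical, and the three-worksheet workbook is an assumed example.

```r
# Usual orientation: samples in rows, markers in columns
usual <- matrix(c("S1", "12", "7",
                  "S2", "13", "8"),
                nrow = 2, byrow = TRUE,
                dimnames = list(NULL, c("id", "m1", "m2")))

# Transposed orientation: markers in rows, samples in columns
flipped <- t(usual)

# Assumed workbook with three worksheets in which only the second is transposed
transposed_arg <- c(FALSE, TRUE, FALSE)
```

A worksheet laid out like `flipped` would be imported with `transposed = TRUE`; for the assumed three-worksheet workbook, `transposed = transposed_arg` would be passed together with `multsheets = TRUE`.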

Examples

#datasets are provided by the package

#importing dataset with metadata variables:
infile <- system.file("extdata", "testDatametadata.xlsx", package = "MLMOI")
moimport(infile, nummtd = 3, keepmtd = TRUE)


##more examples are included in 'examples' vignette:

#vignette("examples", package = "MLMOI")

[Package MLMOI version 0.1.2 Index]