as_data_object {familiar} | R Documentation |
Creates a valid data object from input data.
Description
Creates a dataObject object from input data. Input data can be a data.frame
or data.table, a path to such tables on a local or network drive, or a path
to tabular data that may be converted to these formats.
In addition, a familiarEnsemble or familiarModel object can be passed
along to check whether the data are formatted correctly, e.g. by checking
the levels of categorical features, whether all expected columns are
present, etc.
Usage
as_data_object(data, ...)
## S4 method for signature 'dataObject'
as_data_object(data, object = NULL, ...)
## S4 method for signature 'data.table'
as_data_object(
  data,
  object = NULL,
  sample_id_column = waiver(),
  batch_id_column = waiver(),
  series_id_column = waiver(),
  development_batch_id = waiver(),
  validation_batch_id = waiver(),
  outcome_name = waiver(),
  outcome_column = waiver(),
  outcome_type = waiver(),
  event_indicator = waiver(),
  censoring_indicator = waiver(),
  competing_risk_indicator = waiver(),
  class_levels = waiver(),
  exclude_features = waiver(),
  include_features = waiver(),
  reference_method = waiver(),
  check_stringency = "strict",
  ...
)
## S4 method for signature 'ANY'
as_data_object(
  data,
  object = NULL,
  sample_id_column = waiver(),
  batch_id_column = waiver(),
  series_id_column = waiver(),
  ...
)
Arguments
data |
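A data.frame or data.table, a path to such tables on a local or network drive, or a path to tabular data that may be converted to these formats. |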
... |
Unused arguments. |
object |
A familiarEnsemble or familiarModel object, used to check whether the data are formatted correctly. |
sample_id_column |
(recommended) Name of the column containing
sample or subject identifiers. See If unset, every row will be identified as a single sample. |
batch_id_column |
(recommended) Name of the column containing batch or cohort identifiers. This parameter is required if more than one dataset is provided, or if external validation is performed. In familiar any row of data is organised by four identifiers:
|
series_id_column |
(optional) Name of the column containing series
identifiers, which distinguish between measurements that are part of a
series for a single sample. See If unset, rows which share the same batch and sample identifiers but have a different outcome are assigned unique series identifiers. |
development_batch_id |
(optional) One or more batch or cohort
identifiers to constitute data sets for development. Defaults to all, or
all minus the identifiers in |
validation_batch_id |
(optional) One or more batch or cohort
identifiers to constitute data sets for external validation. Defaults to
all data sets except those in |
outcome_name |
(optional) Name of the modelled outcome. This name will
be used in figures created by If not set, the column name in |
outcome_column |
(recommended) Name of the column containing the
outcome of interest. May be identified from a formula, if a formula is
provided as an argument. Otherwise an error is raised. Note that |
outcome_type |
(recommended) Type of outcome found in the outcome column. The outcome type determines many aspects of the overall process, e.g. the available feature selection methods and learners, but also the type of assessments that can be conducted to evaluate the resulting models. Implemented outcome types are:
If not provided, the algorithm will attempt to infer outcome_type from the contents of the outcome column. This may lead to unexpected results, so we advise providing this information manually. Note that |
event_indicator |
(recommended) Indicator for events in |
censoring_indicator |
(recommended) Indicator for right-censoring in
|
competing_risk_indicator |
(recommended) Indicator for competing
risks in |
class_levels |
(optional) Class levels for |
exclude_features |
(optional) Feature columns that will be removed
from the data set. Cannot overlap with features in |
include_features |
(optional) Feature columns that are specifically
included in the data set. By default all features are included. Cannot
overlap with |
reference_method |
(optional) Method used to set reference levels for categorical features. There are several options:
|
check_stringency |
Specifies stringency of various checks. This is mostly:
|
Details
You can specify settings for your data manually, e.g. the column for
sample identifiers (sample_id_column). This saves you from having to
change the column name externally. If you provide a familiarModel
or familiarEnsemble for the object argument, any parameters you provide
take precedence over parameters specified by the object.
Value
A dataObject object.
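Examples
The following is a minimal sketch of how the arguments described above might be combined, not an official example from the package. It assumes that the familiar and data.table packages are installed, uses the built-in iris data set as stand-in input, and adds a hypothetical sample_id column purely to illustrate sample_id_column. The outcome type "multinomial" is chosen here because Species has three classes.

```r
# Minimal sketch: assumes the familiar and data.table packages are installed.
library(data.table)
library(familiar)

# Build a small illustrative table from the built-in iris data set.
# The "sample_id" column name is an assumption made for this example.
data <- data.table::as.data.table(iris)
data[, sample_id := seq_len(.N)]

# Convert the table to a dataObject, naming the identifier and outcome
# columns explicitly instead of renaming columns beforehand.
data_object <- as_data_object(
  data = data,
  sample_id_column = "sample_id",
  outcome_column = "Species",
  outcome_type = "multinomial"
)

# Check whether the result inherits from the dataObject class.
methods::is(data_object, "dataObject")
```

Passing outcome_type explicitly follows the advice given for that argument and avoids relying on automatic detection from the outcome column.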