R: Creates One Finalized Table Ready for Statistical Analysis

prep {prepdat}

R Documentation

Creates One Finalized Table Ready for Statistical Analysis

Description

prep() aggregates a single dataset in long format according to any number of grouping variables. This makes prep() suitable for aggregating data from various types of experimental designs, such as between-subjects, within-subjects (i.e., repeated measures), and mixed designs (i.e., experimental designs that include both between- and within-subjects independent variables). prep() returns a data frame with a number of dependent measures for further analysis for each aggregated cell (i.e., experimental cell) according to the provided grouping variables (i.e., independent variables). Dependent measures for each experimental cell include, among others: means before and after rejecting observations according to a flexible standard deviation criterion; the number and proportion of observations rejected according to that criterion; the number of observations before rejection; means after rejecting observations according to the procedures described in Van Selst & Jolicoeur (1994; suitable when measuring reaction times); standard deviations; medians; values of the dependent variable at any percentile (e.g., 0.05, 0.25, 0.75, 0.95); and harmonic means. The data frame prep() returns can also be exported as a txt or csv file for statistical analysis in other statistical programs.
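
The kind of aggregation described above can be illustrated with a minimal base R sketch. It does not use prep() itself: the toy data, the column names (subject, condition, rt) and the choice of aggregate() and reshape() are assumptions made for illustration, and the wide layout (one row per id) is only an interpretation of the finalized table described in the Value section.

```r
# Toy long-format data: one row per trial (all names are hypothetical).
raw_data <- data.frame(
  subject   = c(1, 1, 1, 1, 2, 2, 2, 2),
  condition = c(1, 1, 2, 2, 1, 1, 2, 2),
  rt        = c(500, 540, 610, 650, 480, 520, 700, 660)
)

# One mean per experimental cell (subject x condition), still in long format.
cell_means <- aggregate(rt ~ subject + condition, data = raw_data, FUN = mean)

# Spread the within-subject cells into columns: one row per subject,
# one column per level of `condition` (rt.1, rt.2).
wide <- reshape(
  cell_means,
  idvar     = "subject",
  timevar   = "condition",
  direction = "wide"
)

print(wide)
```

prep() computes many more dependent measures per cell than the single mean shown here; the sketch only shows the long-to-cell aggregation step.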

Usage

prep(
   dataset = NULL
   , file_name = NULL
   , file_path = NULL
   , id = NULL
   , within_vars = c()
   , between_vars = c()
   , dvc = NULL
   , dvd = NULL
   , keep_trials = NULL
   , drop_vars = c()
   , keep_trials_dvc = NULL
   , keep_trials_dvd = NULL
   , id_properties = c()
   , sd_criterion = c(1, 1.5, 2)
   , percentiles = c(0.05, 0.25, 0.75, 0.95)
   , outlier_removal = NULL
   , keep_trials_outlier = NULL
   , decimal_places = 4
   , notification = TRUE
   , dm = c()
   , save_results = TRUE
   , results_name = "results.txt"
   , results_path = NULL
   , save_summary = TRUE
)

Arguments

`dataset`	Name of the data frame in R that contains the long format table after merging the individual data files using `file_merge()`. Either `dataset` or `file_name` must be provided. Default is `NULL`.
`file_name`	A string with the name of a txt or csv file (including the file extension, e.g. `"my_data.txt"`) with the merged table in case the user already merged the individual data files. Either `dataset` or `file_name` must be provided. Default is `NULL`.
`file_path`	A string with the path of the folder in which `file_name` is located. If `file_name` was used, then `file_path` must be provided. Default is `NULL`.
`id`	A string with the name of the column in `file_name` or in `dataset` that contains the variable specifying the case identifier (i.e., the variable upon which the measurement took place; e.g., `"subject_number"`). This should be a unique value per case. Values in this column must be numeric. Argument must be provided. Default is `NULL`.
`within_vars`	String vector with names of grouping variables in `file_name` or in `dataset` that contain independent variables manipulated (or observed) within-ids (i.e., within-subjects, repeated measures). Single or multiple values must be specified as a string (e.g., `c("SOA", "condition")`) according to the hierarchical order you wish. Note that the order of the names in `within_vars` is important because `prep()` aggregates the data for the dependent measures by first dividing them into the levels of the first grouping variable in `within_vars`, and then within each of those levels `prep()` divides the data according to the next variable in `within_vars`, and so forth. Values in these columns must be numeric. Either `within_vars` or `between_vars` (or both) arguments must be provided. Default is `c()`.
`between_vars`	String vector with names of grouping variables in `file_name` or in `dataset` that contain independent variables manipulated (or observed) between-ids (i.e., between-subjects). Single or multiple values must be specified as a string (e.g., `c("order")`). Order of the names in `between_vars` does not matter. Values in these columns must be numeric. Either `between_vars` or `within_vars` (or both) arguments must be provided. Default is `c()`.
`dvc`	A string with the name of the column in `file_name` or in `dataset` that contains the dependent variable (e.g., `"rt"` for reaction-time as a dependent variable). Values in this column must be in an interval or ratio scale. Either `dvc` or `dvd` (or both) arguments must be provided. Default is `NULL`.
`dvd`	A string with the name of the column in `file_name` or in `dataset` that contains the dependent variable (e.g., `"ac"` for accuracy as a dependent variable). Values in this column must be numeric and discrete (e.g., 0 and 1). Either `dvc` or `dvd` (or both) arguments must be provided. Default is `NULL`.
`keep_trials`	A string. Allows deleting unnecessary observations and keeping necessary observations in `file_name` or in `dataset` according to logical conditions specified as a string. For example, if the dataset contains practice trials for each subject, these trials should not be included in the aggregation. The user should remove these trials by specifying how they were coded in the raw data (i.e., data before aggregation). For example, if practice trials are the ones for which the `"block"` column in the raw data tables equals zero, the `keep_trials` argument should be `"raw_data$block != 0"`. `raw_data` is the internal object in `prep()` representing the merged table. All logical conditions in `keep_trials` should be put in the same string and be concatenated by `&` or `|`. Logical conditions for this argument can relate to different columns in the merged table. Note that all further arguments of `prep()` will relate to the remaining observations in the merged table. Default is `NULL`.
`drop_vars`	String vector with names of columns to delete in `file_name` or in `dataset`. Single or multiple values must be specified as a string (e.g., `c("font_size")`). Order of the names in `drop_vars` does not matter. Note that all further arguments of `prep()` will relate to the remaining variables in the merged table. Default is `c()`.
`keep_trials_dvc`	A string. Allows deleting unnecessary observations and keeping necessary observations in `file_name` or in `dataset` for calculations and aggregation of the dependent variable in `dvc` according to logical conditions specified as a string. Logical conditions should be specified as a string as in the `keep_trials` argument (e.g., `"raw_data$rt > 100 & raw_data$rt < 3000 & raw_data$ac == 1"`). All dependent measures for `dvc` except for those specified in `outlier_removal` will be calculated on the remaining observations. Default is `NULL`.
`keep_trials_dvd`	A string. Allows deleting unnecessary observations and keeping necessary observations in `file_name` or in `dataset` for calculations and aggregation of the dependent variable in `dvd` according to logical conditions specified as a string. Logical conditions should be specified as a string as in the `keep_trials` argument (e.g., `"raw_data$rt > 100 & raw_data$rt < 3000"`). All dependent measures for `dvd` (i.e., `"mdvd"` and `"merr"`) will be calculated on the remaining observations. Default is `NULL`.
`id_properties`	String vector with names of columns in `dataset` or in `file_name` that describe the ids (e.g., subjects) in the data and were not manipulated within- or between-ids. For example, if the user also logged the age and the gender of the subject for each observation and each id in an experiment, this argument will be `c("age", "gender")`. Order of the names in `id_properties` does not matter. Single or multiple values must be specified as a string. Values in these columns must be numeric. Default is `c()`.
`sd_criterion`	Numeric vector specifying a number of standard deviation criteria for which `prep()` will calculate the mean `dvc` for each cell in the finalized table after rejecting observations that did not meet the criterion (e.g., rejecting observations that were more than 2 standard deviations above or below the mean of that cell). Values in this vector must be numeric. Default is `c(1, 1.5, 2)`.
`percentiles`	Numeric vector containing wanted percentiles for `dvc`. Values in this vector must be decimal numbers between 0 and 1. Percentiles are calculated according to `type = 7` (see `quantile` for more information). Default is `c(0.05, 0.25, 0.75, 0.95)`.
`outlier_removal`	Numeric. Specifies which outlier removal procedure with moving criterion to calculate for `dvc` according to procedures described by Van Selst & Jolicoeur (1994). If `1`, the non-recursive procedure is calculated; if `2`, the modified recursive procedure is calculated; if `3`, the hybrid recursive procedure is calculated. The moving criterion follows Table 4 in Van Selst & Jolicoeur (1994). If an experimental cell has 4 trials or fewer, the result is `NA`. Default is `NULL`.
`keep_trials_outlier`	A string. Allows deleting unnecessary observations and keeping necessary observations in `file_name` or in `dataset` for calculations and aggregation of the outlier removal procedures by Van Selst & Jolicoeur (1994). Logical conditions should be specified as a string as in the `keep_trials` argument (e.g., `"raw_data$ac == 1"`). The `outlier_removal` procedure will be calculated on the remaining observations. Default is `NULL`.
`decimal_places`	Numeric. Specifies number of decimals to be written in `results_name` for each value of the dependent measures for `dvc`. Value must be numeric. Default is `4`.
`notification`	Logical. If `TRUE`, prints messages about the progress of the function. Default is `TRUE`.
`dm`	String vector with names of dependent measures the function returns. If empty (i.e., `c()`) the function returns a data frame with all possible dependent measures in `prep()`. Values in this vector must be strings from the following list: "mdvc", "sdvc", "meddvc", "tdvc", "ntr", "ndvc", "ptr", "prt", "rminv", "mdvd", "merr". Default is `c()`. See Value section below for more details.
`save_results`	Logical. If `TRUE`, the function creates a txt file containing the returned data frame. Default is `TRUE`.
`results_name`	A string with the name of the file `prep()` returns in case `save_results` is `TRUE`. Extension of the file can be txt or csv and should be included. Default is `"results.txt"`.
`results_path`	A string with the path of the folder in which `results_name` will be saved. Default is the path provided in `file_path`. In case no path was provided in `file_path`, `results_path` must be provided.
`save_summary`	Logical. If `TRUE`, creates a summary file in the same format as `results_name`. Default is `TRUE`.
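
The string conditions accepted by `keep_trials`, `keep_trials_dvc`, `keep_trials_dvd` and `keep_trials_outlier` can be pictured with the following sketch. It is not the code of `prep()`: the toy table is hypothetical, and evaluating the string with `eval(parse(text = ...))` against an object named `raw_data` is an assumption about how such a condition could be applied.

```r
# Hypothetical merged table; `raw_data` mirrors the object name used in
# the condition strings. Block 0 stands for practice trials.
raw_data <- data.frame(
  subject = c(1, 1, 1, 2, 2, 2),
  block   = c(0, 1, 1, 0, 1, 1),
  rt      = c(450, 90, 620, 510, 3500, 580),
  ac      = c(1, 1, 0, 1, 1, 1)
)

keep_trials     <- "raw_data$block != 0"
keep_trials_dvc <- "raw_data$rt > 100 & raw_data$rt < 3000 & raw_data$ac == 1"

# Assumption: each string is parsed and evaluated as an R expression that
# yields one logical value per row of `raw_data`.
keep     <- eval(parse(text = keep_trials))
raw_data <- raw_data[keep, ]        # practice trials are dropped first

# The dvc-specific filter is then applied to the remaining observations.
keep_dvc <- eval(parse(text = keep_trials_dvc))
dvc_data <- raw_data[keep_dvc, ]

print(dvc_data)
```

Because the conditions refer to `raw_data$...` explicitly, a misspelled object name inside the string (for example `raw_dada`) fails at evaluation time rather than silently keeping all rows.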

Value

A data frame with dependent measures for the dependent variables in dvc and dvd by id and grouping variables.

The first column in the finalized table is the id column. In case id_properties was used, the next columns will be the value of each id_properties for each id.

If between_vars was used then the next columns will be the value of each between_vars for each id.

The next columns of the finalized table contain the dependent measures according to the design specified. If within_vars was used, then the data for each dependent measure were first divided according to the levels of the first grouping variable in within_vars, and then within each of those levels prep() divided the data according to the next variable in within_vars, and so forth. The dependent measures in the finalized table are:

mdvc: mean dvc.

sdvc: SD for dvc.

meddvc: median dvc.

tdvc: mean dvc after rejecting observations beyond each standard deviation criterion specified in sd_criterion.

ntr: number of observations rejected for each standard deviation criterion specified in sd_criterion.

ndvc: number of observations before rejection.

ptr: proportion of observations rejected for each standard deviation criterion specified in sd_criterion.

rminv: harmonic mean of dvc.

prt: dvc according to each of the percentiles specified in percentiles.

mdvd: mean dvd.

merr: mean error.

nrmc: mean dvc according to non-recursive procedure with moving criterion.

nnrmc: number of observations rejected for dvc according to non-recursive procedure with moving criterion.

pnrmc: percent of observations rejected for dvc according to non-recursive procedure with moving criterion.

tnrmc: total number of observations upon which the non-recursive procedure with moving criterion was applied.

mrmc: mean dvc according to modified-recursive procedure with moving criterion.

nmrmc: number of observations rejected for dvc according to modified-recursive procedure with moving criterion.

pmrmc: percent of observations rejected for dvc according to modified-recursive procedure with moving criterion.

tmrmc: total number of observations upon which the modified-recursive procedure with moving criterion was applied.

hrmc: mean dvc according to hybrid-recursive procedure with moving criterion.

nhrmc: number of observations rejected for dvc according to hybrid-recursive procedure with moving criterion.

thrmc: total number of observations upon which the hybrid-recursive procedure with moving criterion was applied.
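
Several of the measures listed above can be sketched for a single experimental cell in base R. This is an illustration rather than the implementation in prep(): the data are made up, and computing the cell mean and SD once, using sd() with its n - 1 denominator, and using an inclusive cutoff are all assumptions. The Van Selst & Jolicoeur (1994) procedures are not shown because they depend on the moving criterion in their Table 4.

```r
# Reaction times for one hypothetical experimental cell.
rt <- c(400, 420, 450, 480, 500, 520, 1500)
sd_criterion <- 2   # one value from the sd_criterion vector

cell_mean <- mean(rt)
cell_sd   <- sd(rt)

# Keep observations within `sd_criterion` SDs above or below the cell mean.
keep <- abs(rt - cell_mean) <= sd_criterion * cell_sd

tdvc <- mean(rt[keep])   # mean after rejection
ntr  <- sum(!keep)       # number of rejected observations
ndvc <- length(rt)       # number of observations before rejection
ptr  <- ntr / ndvc       # proportion of rejected observations

# Harmonic mean: reciprocal of the mean of reciprocals.
rminv <- 1 / mean(1 / rt)

# Percentiles with type = 7, the default of quantile().
prt <- quantile(rt, probs = c(0.05, 0.25, 0.75, 0.95), type = 7)

print(c(tdvc = tdvc, ntr = ntr, ndvc = ndvc, ptr = ptr, rminv = rminv))
print(prt)
```

In this toy cell only the 1500 ms observation falls outside the 2 SD band, so the trimmed mean is computed on the remaining six observations.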

References

Grange, J.A. (2015). trimr: An implementation of common response time trimming methods. R Package Version 1.0.1. https://CRAN.R-project.org/package=trimr

Van Selst, M., & Jolicoeur, P. (1994). A solution to the effect of sample size on outlier elimination. The Quarterly Journal of Experimental Psychology, 47(3), 631-650.

Examples

data(stroopdata)
finalized_stroopdata <- prep(
           dataset = stroopdata
           , file_name = NULL
           , file_path = NULL
           , id = "subject"
           , within_vars = c("block", "target_type")
           , between_vars = c("order")
           , dvc = "rt"
           , dvd = "ac"
           , keep_trials = NULL
           , drop_vars = c()
           , keep_trials_dvc = "raw_data$rt > 100 & raw_data$rt < 3000 & raw_data$ac == 1"
           , keep_trials_dvd = "raw_data$rt > 100 & raw_data$rt < 3000"
           , id_properties = c()
           , sd_criterion = c(1, 1.5, 2)
           , percentiles = c(0.05, 0.25, 0.75, 0.95)
           , outlier_removal = 2
           , keep_trials_outlier = "raw_data$ac == 1"
           , decimal_places = 0
           , notification = TRUE
           , dm = c()
           , save_results = FALSE
           , results_name = "results.txt"
           , results_path = NULL
           , save_summary = FALSE
         )

[Package prepdat version 1.0.8 Index]