R: Load data then clean and format it

load_clean {reappraised}

R Documentation

Load data then clean and format it

Description

Loads and cleans data for use with the nine analysis functions in the package

Usage

load_clean(
  import = "yes",
  file.cont = "",
  file.cat = "",
  dir = "",
  file.name = "",
  pval_cont = "no",
  match = "no",
  cohort = "no",
  anova = "no",
  dir.cont = "",
  file.name.cont = "",
  sheet.name.cont = "Sheet1",
  range.name.cont = "",
  format.cont = "wide",
  cat = "no",
  sr = "no",
  cat_all = "no",
  pval_cat = "no",
  cat.names = c("n"),
  dir.cat = "",
  file.name.cat = "",
  sheet.name.cat = "Sheet1",
  range.name.cat = "",
  format.cat = "wide",
  generic = "",
  gen.vars.keep = "",
  gen.vars.del = "",
  verbose = TRUE
)

Arguments

`import`	'yes' indicates that data are imported from an excel file. 'no' indicates that a dataset already loaded into R as a data frame is used
`file.cont`	If import = 'no', name of data frame containing continuous data
`file.cat`	If import = 'no', name of data frame containing categorical data
`dir`	If import = 'yes', path to location of excel file for continuous and categorical data
`file.name`	If import = 'yes', file name of excel file containing continuous and categorical data
`pval_cont`	'yes'/'no' indicating if data will be used for pval_cont_fn. Only data for 1 continuous data function can be loaded with each run of this function.
`match`	'yes'/'no' indicating if data will be used for match_fn. Only data for 1 continuous data function can be loaded with each run of this function.
`cohort`	'yes'/'no' indicating if data will be used for cohort_fn. Only data for 1 continuous data function can be loaded with each run of this function.
`anova`	'yes'/'no' indicating if data will be used for anova_fn. Only data for 1 continuous data function can be loaded with each run of this function.
`dir.cont`	If import = 'yes', path to location of excel file for continuous data
`file.name.cont`	If import = 'yes', file name of excel file containing continuous data
`sheet.name.cont`	Sheet name containing continuous data
`range.name.cont`	Range of cells containing continuous data. Can be in format 'a1:b20' or 'a:b'
`format.cont`	'wide'/'long' indicating continuous data is in wide or long format
`cat`	'yes'/'no' indicating if data will be used for cat_fn. Only data for 1 categorical data function can be loaded with each run of this function.
`sr`	'yes'/'no' indicating if data will be used for sr_fn. Only data for 1 categorical data function can be loaded with each run of this function.
`cat_all`	'yes'/'no' indicating if data will be used for cat_all_fn. Only data for 1 categorical data function can be loaded with each run of this function.
`pval_cat`	'yes'/'no' indicating if data will be used for pval_cat_fn. Only data for 1 categorical data function can be loaded with each run of this function.
`cat.names`	names of variables to be used in cat_fn and sr_fn
`dir.cat`	If import = 'yes', path to location of excel file for categorical data
`file.name.cat`	If import = 'yes', file name of excel file containing categorical data
`sheet.name.cat`	Sheet name containing categorical data
`range.name.cat`	Range of cells containing categorical data. Can be in format 'a1:b20' or 'a:b'
`format.cat`	'wide'/'long' indicating categorical data is in wide or long format
`generic`	'yes'/'no' indicating if data are to be loaded for generic use
`gen.vars.keep`	Vector of variables in data to keep
`gen.vars.del`	Vector of variables in data to delete
`verbose`	TRUE/FALSE. TRUE indicates comments will be printed during loading

Details

Function can load continuous or categorical data. Continuous data can be used for comparison of baseline p-values (pval_cont_fn), matching summary stats within a trial (match_fn), matching summary stats in different cohorts (cohort_fn), or comparing means of baseline p-values (anova_fn). Categorical data can be used for comparisons of observed with expected distributions for a single variable (cat_fn), for group numbers in trials using simple randomisation (sr_fn), for all variables (cat_all_fn), and for comparison of baseline p-values (pval_cat_fn).

There is one function in development that allows assessment of proportion of final digits in summary statistics (final_digit_fn). This function works using summary statistics but could be adapted to use on raw continuous or categorical data.

Only 1 continuous and/or 1 categorical data set is allowed per load, to avoid clashes.

Data can be imported from a file (import = "yes") or taken from an existing data frame (import = "no").

If loading from an existing data frame, use file.cont and file.cat.

If loading from common directory or file, can use dir and file.name rather than more specific dir.cont, dir.cat, file.name.cont, or file.name.cat.

Comments about each indicator:

pval_cont
loads continuous data for pval_cont_fn, outputs as list of 1 containing named data frame pval_cont_data.

format should be study, variable or var, n, m, s, p. Can be in any order. n = sample size, m = mean, s = standard deviation, p = baseline p value (can omit if not reported)

can be in wide or long format
wide: study, var, n1, n2, n3 ..., m1, m2, m3 ..., s1, s2, s3 ..., p
long: study, var, group, m, s, n, p

group or g or grp required for long format
separators (eg n1 n_1 n.1) are stripped and replaced
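
The relationship between the wide and long layouts described above can be sketched in base R. This is only an illustration with made-up values: the data frame and its contents are hypothetical, and the reshaping shown here is not necessarily how load_clean does it internally.

```r
# Hypothetical two-arm data in the wide layout:
# study, var, n1, n2, m1, m2, s1, s2, p
wide <- data.frame(
  study = c(1, 1),
  var   = c("age", "weight"),
  n1 = c(50, 50),     n2 = c(48, 48),
  m1 = c(61.2, 70.4), m2 = c(60.8, 71.1),
  s1 = c(8.1, 12.3),  s2 = c(7.9, 11.8),
  p  = c(0.80, 0.77),
  stringsAsFactors = FALSE
)

# Stack one block of rows per group to get the long layout:
# study, var, group, m, s, n, p
groups <- 1:2
long <- do.call(rbind, lapply(groups, function(g) {
  data.frame(
    study = wide$study,
    var   = wide$var,
    group = g,
    m = wide[[paste0("m", g)]],
    s = wide[[paste0("s", g)]],
    n = wide[[paste0("n", g)]],
    p = wide$p,
    stringsAsFactors = FALSE
  )
}))
```

Either layout carries the same information; the long layout simply needs the extra group column to say which arm each row belongs to.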

match
loads continuous data for match_fn, outputs as list of 1 containing named data frame match_data

remainder is same as for pval_cont above.
only difference between pval_cont and match is that match allows for missing mean or SD whereas pval_cont does not

format should be study, variable or var, n, m, s. Can be in any order. n = sample size, m = mean, s = standard deviation

can be in wide or long format
wide: study, var, n1, n2, m1, m2, s1, s2, p
long: study, var, group, m, s, n

group or g or grp required for long format
separators (eg n1 n_1 n.1) are stripped and replaced
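
As an illustration of the separator handling, the sketch below normalises column names such as n_1 or n.2 by hand. The column names are hypothetical, and this sketch simply removes the separator; the form load_clean uses after replacement is not stated here, so treat this as an assumption.

```r
# Hypothetical column names using mixed separators
raw_names <- c("study", "var", "n_1", "n.2", "m_1", "m.2", "s1", "s2")

# Drop an optional "_" or "." between the letter (n, m or s) and the group number;
# names that do not follow that pattern (study, var) are left unchanged
clean_names <- gsub("^([nms])[._]?([0-9]+)$", "\\1\\2", raw_names)
```

In practice this means a spreadsheet does not need its column names edited before loading, as long as the letter and group number are recognisable.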

cohort
loads continuous data for cohort_fn, outputs as list of 1 containing named data frame cohort_data

same as pval_cont but allows a lookup variable for variable names

format should be study, variable or var, n, m, s, p. Can be in any order. n = sample size, m = mean, s = standard deviation

can be in wide or long format
wide: study, var, n1, n2, n3 ..., m1, m2, m3 ..., s1, s2, s3 ...
long: study, var, group, m, s, n

group or g or grp required for long format
separators (eg n1 n_1 n.1) are stripped and replaced

lookup table is var_name_final, var_name_orig and allows you to specify a list of all variables names (var_name_orig) from all studies and a lookup table of standardised names (var_name_final) allowing different names in different studies to be standardised

has optional variable 'population' which can be used to subset the data if trials in different populations are reported
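
The lookup idea can be sketched in base R with match(). The variable names and study data below are hypothetical, and the column var_std is introduced only for this illustration; how load_clean applies the lookup internally is not described here.

```r
# Hypothetical lookup table: original names from each study and their standardised form
lookup <- data.frame(
  var_name_orig  = c("Age (y)", "age", "BMI", "Body mass index"),
  var_name_final = c("age", "age", "bmi", "bmi"),
  stringsAsFactors = FALSE
)

# Hypothetical cohort data in which two studies label the same variables differently
cohort_raw <- data.frame(
  study = c(1, 1, 2, 2),
  var   = c("Age (y)", "BMI", "age", "Body mass index"),
  stringsAsFactors = FALSE
)

# Replace each original name with its standardised name
cohort_raw$var_std <- lookup$var_name_final[match(cohort_raw$var, lookup$var_name_orig)]
```

Any original name missing from the lookup table would come back as NA here, which is a convenient way to spot names that still need an entry.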

anova
loads continuous data for anova_fn, outputs as list of 1 containing named data frame anova_data

same as for pval_cont above but allows for optional value for decimal place

format should be study, variable or var, n, m, s, p. Can be in any order. n = sample size, m = mean, s = standard deviation, d = decimal places of the mean (if omitted, this is calculated automatically in anova_fn)

can be in wide or long format
wide: study, var, n1, n2, n3 ..., m1, m2, m3 ..., s1, s2, s3 ..., d
long: study, var, group, m, s, n, d

group or g or grp required for long format
separators (eg n1 n_1 n.1) are stripped and replaced
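
If d is supplied by hand, one way to derive it is to count the digits after the decimal point in the means as they were reported. The helper count_decimals below is hypothetical and is not the method used by anova_fn, which is not described here. The means are kept as text because numeric values drop trailing zeros (70.40 becomes 70.4).

```r
# Hypothetical helper: number of digits after the decimal point in a reported value
count_decimals <- function(x) {
  x <- as.character(x)
  has_point <- grepl(".", x, fixed = TRUE)
  # Strip everything up to and including the point, then count what is left
  ifelse(has_point, nchar(sub("^[^.]*\\.", "", x)), 0L)
}

# Means exactly as printed in the (made-up) source tables
means_as_reported <- c("61.2", "70.40", "5", "0.125")
d <- count_decimals(means_as_reported)
```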

cat
loads categorical data for cat_fn, outputs as list of 1 containing named data frame cat_data

format should be study, n, v. Can be in any order. n = group size, v = number with characteristic

can be in wide or long format
wide: study, n1, n2, n3 ..., v1, v2, v3...
long: study, group, n, v

group or g or grp required for long format
use cat.names to name the variables, eg c("n", "v"), c("n", "g") ...
separators (eg n1 n_1 n.1) are stripped and replaced

sr
loads categorical data for sr_fn, outputs as list of 1 containing named data frame sr_data

as for cat but only requires study and n

format should be study, n. n = group size

can be in wide or long format
wide: study, n1, n2, n3 ...
long: study, group, n

group or g or grp required for long format
separators (eg n1 n_1 n.1) are stripped and replaced

cat_all
loads categorical data for cat_all_fn, outputs as list of 1 containing named data frame cat_all_data

format should be study, var or variable, n, N, level, stat, recode, p. Can be in any order. n = number with characteristic, N = group size, p = baseline p value (can omit if not reported). p can be given as "ns" for not significant, or with "<" or ">" to indicate a threshold (eg "<0.05")

optional level: number for level of variable (eg y/n = 1,2; high/med/low = 1,2,3)
optional recode: for variables with >2 levels, to tell how to recode into 2 groups
optional stat: statistical test used for p-value: chisq = Chi-square, chisqc = Chi-square with correction, fisher = Fisher's exact, midp = mid-p (calculated using two different methods), lr = likelihood ratio, mh = Mantel-Haenszel test

can be in wide or long format
wide: study, var, n1, n2, n3 ..., N1, N2, N3 ..., p, stat, level, recode
long: study, var, group, n, N, p, stat, level, recode

group or g or grp required for long format

if variable has 2 levels, only 1 required, other will be calculated.

separators (eg n1 n_1 n.1) are stripped and replaced
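
For a two-level variable, the missing level follows from the group size by subtraction. The sketch below shows that arithmetic in base R on hypothetical long-format data; it is an illustration of the calculation, not the package's own code.

```r
# Hypothetical long-format data: only level 1 of a two-level variable was entered
one_level <- data.frame(
  study = c(1, 1),
  var   = c("sex", "sex"),
  group = c(1, 2),
  level = c(1, 1),
  n     = c(30, 28),   # number with the characteristic
  N     = c(50, 48),   # group size
  stringsAsFactors = FALSE
)

# The other level is whatever remains of each group: N - n
other_level <- one_level
other_level$level <- 2
other_level$n <- one_level$N - one_level$n

both_levels <- rbind(one_level, other_level)
```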

pval_cat
loads categorical data for pval_cat_fn, outputs as list of 1 containing named data frame pval_cat_data

as for cat_all but recode variable is not generated

format should be study, var or variable, n, N, p. Can be in any order. n = number with characteristic, N = group size, p = baseline p value (can omit if not reported). p can be given as "ns" for not significant, or with "<" or ">" to indicate a threshold (eg "<0.05")

optional level: number for level of variable (eg y/n = 1,2; high/med/low = 1,2,3)
optional stat: statistical test used for p-value: chisq = Chi-square, fisher = Fisher's exact

can be in wide or long format
wide: study, var, n1, n2, n3 ..., N1, N2, N3 ..., p, stat, level
long: study, var, group, n, N, p, stat, level

group or g or grp required for long format

if variable has 2 levels, only 1 required, other will be calculated.

separators (eg n1 n_1 n.1) are stripped and replaced
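
The accepted forms of p (an exact value, "ns", or a "<" / ">" threshold) can be told apart with a few string checks. The sketch below is one hypothetical way to classify them in base R; the names p_type and p_value are made up for this example, and how load_clean itself interprets these entries is not stated here.

```r
# Hypothetical p-value entries as they might appear in a spreadsheet column
p_reported <- c("0.34", "<0.05", ">0.2", "ns")

# Classify each entry by the form in which it was reported
p_type <- ifelse(p_reported == "ns", "not significant",
          ifelse(startsWith(p_reported, "<"), "below threshold",
          ifelse(startsWith(p_reported, ">"), "above threshold", "exact")))

# Numeric part of each entry: the value itself or the threshold; NA for "ns"
p_value <- suppressWarnings(as.numeric(sub("^[<>]", "", p_reported)))
```

Because of these text forms, a p column containing them has to be stored as text rather than as numbers in the source data.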

generic
loads data for generic use, outputs as list of 1 containing named data frame generic_data

use cont suffixes for file details: dir.cont (or dir), file.name.cont (or file.name), sheet.name.cont, range.name.cont

format should be study, var or variable, variable names

optional gen.vars.keep = vector of variables to keep
optional gen.vars.del = vector of variables to delete

can be in wide or long format
wide: study, var, a1, a2 ..., b1, b2 ...
long: study, var, group, a, b, ...

group or g or grp required for long format

separators (eg n1 n_1 n.1) are stripped and replaced
no data checking or other transformations take place
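
The effect of keeping or deleting variables can be pictured with plain column subsetting. The data frame and vectors below are hypothetical, and the sketch only shows the intended outcome of gen.vars.keep and gen.vars.del; it does not call load_clean.

```r
# Hypothetical generic data in wide layout with one unwanted column
generic_raw <- data.frame(
  study = c(1, 2),
  var   = c("age", "bmi"),
  a1 = c(10, 12), a2 = c(11, 13),
  b1 = c(0.5, 0.7), b2 = c(0.6, 0.8),
  notes = c("checked", "unchecked"),
  stringsAsFactors = FALSE
)

# Keeping: retain only the listed variables
gen_vars_keep <- c("study", "var", "a1", "a2")
kept <- generic_raw[, names(generic_raw) %in% gen_vars_keep, drop = FALSE]

# Deleting: drop the listed variables and retain everything else
gen_vars_del <- c("notes")
trimmed <- generic_raw[, !(names(generic_raw) %in% gen_vars_del), drop = FALSE]
```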

Value

list containing a named data frame, with the data in a format suitable for the appropriate function as described in Details

Examples

# Examples of loading data for each function are given in the help for the individual functions.
# Here is one, for pval_cont_fn():

pval_cont_data <- load_clean(import= "no", file.cont = "SI_pvals_cont", pval_cont= "yes",
format.cont = "wide")$pval_cont_data


# to import an excel spreadsheet (modify using local path,
# file and sheet name, range, and format):

# get path for example files
path <- system.file("extdata", "reappraised_examples.xlsx", package = "reappraised",
                    mustWork = TRUE)
# keep only the directory part of the path (drops the file name)
path <- dirname(path)

# load data
pval_cont_data <- load_clean(import= "yes", pval_cont = "yes", dir = path,
     file.name.cont = "reappraised_examples.xlsx", sheet.name.cont = "SI_pvals_cont",
     range.name.cont = "A1:O51", format.cont = "wide")$pval_cont_data

[Package reappraised version 0.1.1 Index]