data_cleansing {creditmodel} | R Documentation |
Data Cleaning
Description
The data_cleansing function is a simple wrapper for data cleaning functions. It performs the following steps:
delete variables whose values are all NA;
check the format of dat and target;
delete low-variance variables;
replace null, NULL, or blank values with NA;
encode variables whose rate of NAs and missing values is more than 95%;
encode variables whose rate of unique values is more than 95%;
merge categories of character variables that have more than 10 categories;
convert time variables to date format;
remove duplicated observations;
process outliers;
process NAs.
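A few of the steps listed above can be sketched in base R. This is only an illustration under assumptions, not the package's implementation: the helper name basic_clean and the toy data frame are hypothetical, and only three steps (blank-to-NA replacement, dropping all-NA variables, removing duplicates) are shown.

```r
# Hypothetical helper illustrating three of the listed steps in base R.
# It is NOT the creditmodel implementation.
basic_clean <- function(dat) {
  # Replace "null", "NULL", and blank strings with NA in character columns.
  for (col in names(dat)) {
    if (is.character(dat[[col]])) {
      v <- trimws(dat[[col]])
      v[v %in% c("", "null", "NULL")] <- NA
      dat[[col]] <- v
    }
  }
  # Drop variables whose values are all NA.
  all_na <- vapply(dat, function(x) all(is.na(x)), logical(1))
  dat <- dat[, !all_na, drop = FALSE]
  # Remove duplicated observations.
  dat <- dat[!duplicated(dat), , drop = FALSE]
  rownames(dat) <- NULL
  dat
}

# Toy data (made up for illustration).
toy <- data.frame(
  id    = c(1, 2, 2, 3),
  city  = c("A", "null", "null", " "),
  empty = c(NA, NA, NA, NA),
  stringsAsFactors = FALSE
)
cleaned <- basic_clean(toy)
print(cleaned)
```

The order matters in this sketch: blanks are converted to NA first so that a column made up only of blanks is also caught by the all-NA check.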
Usage
data_cleansing(
  dat,
  target = NULL,
  obs_id = NULL,
  occur_time = NULL,
  pos_flag = NULL,
  x_list = NULL,
  ex_cols = NULL,
  miss_values = NULL,
  remove_dup = TRUE,
  outlier_proc = TRUE,
  missing_proc = "median",
  low_var = 0.999,
  missing_rate = 0.999,
  merge_cat = TRUE,
  note = TRUE,
  parallel = FALSE,
  save_data = FALSE,
  file_name = NULL,
  dir_path = tempdir()
)
Arguments
dat |
A data frame with x and target. |
target |
The name of target variable. |
obs_id |
The name of the ID of observations. Default is NULL. |
occur_time |
The name of the occurrence time of observations. Default is NULL. |
pos_flag |
The value of positive class of target variable, default: "1". |
x_list |
A list of x variables. |
ex_cols |
A list of excluded variables. Default is NULL. |
miss_values |
Other extreme values that might be used to represent missing values, e.g. -9999, -9998. These miss_values will be encoded as -1 or "missing". |
remove_dup |
Logical, if TRUE, remove the duplicated observations. |
outlier_proc |
Logical, process outliers or not. Default is TRUE. |
missing_proc |
If logical, whether to process missing values. If "median", NAs are imputed with the median of k neighbors. If "avg_dist", NAs are imputed with the distance-weighted average of k neighbors. If "default", missing values are assigned -1 or "missing". Otherwise, missing values are processed according to the results of missing analysis. |
low_var |
The maximum percent of unique values (including NAs) for filtering low variance variables. |
missing_rate |
The maximum percent of missing values for recoding values to missing and non_missing. |
merge_cat |
The minimum number of categories for merging categories of character variables. |
note |
Logical. Outputs info. Default is TRUE. |
parallel |
Logical, parallel computing or not. Default is FALSE. |
save_data |
Logical, save the result or not. Default is FALSE. |
file_name |
The name for periodically saved data file. Default is NULL. |
dir_path |
The path for periodically saved data file. Default is tempdir(). |
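The miss_values and missing_rate arguments both depend on how much of a variable is effectively missing. The base R sketch below shows one way such a share could be computed when sentinel values count as missing. The helper name missing_share, the toy data, and the 0.5 threshold are assumptions made for illustration; this is not how the package computes it internally.

```r
# Hypothetical helper: share of values per column that are NA or a sentinel.
# It is NOT part of creditmodel.
missing_share <- function(dat, miss_values = NULL) {
  vapply(dat, function(x) {
    # x %in% NULL is all FALSE, so only NAs count when no sentinels are given.
    mean(is.na(x) | x %in% miss_values)
  }, numeric(1))
}

# Toy data (made up for illustration).
toy <- data.frame(
  income = c(-9999, 5200, -9999, -9999),
  age    = c(31, 45, NA, 28)
)
rates <- missing_share(toy, miss_values = c(-9999, -9998))
print(rates)

# Columns whose missing share exceeds an illustrative threshold of 0.5.
high_missing <- names(rates)[rates > 0.5]
print(high_missing)
```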
Value
A preprocessed data.frame
See Also
remove_duplicated
,
null_blank_na
,
entry_rate_na
,
low_variance_filter
,
process_nas
,
process_outliers
Examples
# Data cleaning on the first 2000 rows of the UCICreditCard data set
library(creditmodel)
dat_cl = data_cleansing(dat = UCICreditCard[1:2000, ],
                        target = "default.payment.next.month",
                        x_list = NULL,
                        obs_id = "ID",
                        occur_time = "apply_date",
                        ex_cols = c("PAY_6|BILL_"),
                        outlier_proc = TRUE,
                        missing_proc = TRUE,
                        low_var = TRUE,
                        save_data = FALSE)