R: Fill missing values in a data frame

fill_missing_values {bulkreadr}

R Documentation

Fill missing values in a data frame

Description

fill_missing_values() is an efficient function that addresses missing values in a data frame. It uses imputation by function, also known as column-based imputation, to impute the missing values. For continuous variables, it supports various methods of imputation, including minimum, maximum, mean, median, harmonic mean, and geometric mean. For categorical variables, missing values are replaced with the mode of the column. This approach ensures accurate and consistent replacements derived from individual columns, resulting in a complete and reliable dataset for improved analysis and decision-making.

Usage

fill_missing_values(
  df,
  selected_variables = NULL,
  method = c("mean", "min", "max", "median", "harmonic", "geometric")
)

Arguments

`df`	A dataframe to process for missing value imputation.
`selected_variables`	An optional character vector of variable names within `df` for which missing values should be imputed. If `NULL` (default), imputation is applied to all variables in the data frame.
`method`	A character string specifying the imputation method for continuous variables. Supported methods are `"min"`, `"max"`, `"mean"`, `"median"`, `"harmonic"`, and `"geometric"`. The default method is `"mean"`. For categorical variables, the `mode` is always used.

Value

A data frame with missing values imputed according to the specified method.

Examples


library(dplyr)

# Assuming 'df' is the dataframe you want to process

df <- tibble::tibble(
Sepal_Length = c(5.2, 5, 5.7, NA, 6.2, 6.7, 5.5),
Petal_Length = c(1.5, 1.4, 4.2, 1.4, NA, 5.8, 3.7),
Petal_Width = c(NA, 0.2, 1.2, 0.2, 1.3, 1.8, NA),
Species = c("setosa", NA, "versicolor", "setosa",
           NA, "virginica", "setosa")
)

# If you do not specify `selected_variables` (i.e., leave it as `NULL`),
# the function will impute missing values for all columns in the dataframe.

result_df_mean <- fill_missing_values(df, method = "mean")

result_df_mean

# If you specify column names, only those columns will be imputed. For
# example, impute for variables `Petal_Length` and `Petal_Width` using
# the geometric mean.

result_df_geomean <- fill_missing_values(df, selected_variables = c
("Petal_Length", "Petal_Width"), method = "geometric")

result_df_geomean

# If you specify column positions, only the columns at those positions will #' # be imputed.

result_df_max <- fill_missing_values(df, selected_variables = c
(2, 3), method = "max")

result_df_max

# Impute missing values (NAs) in a grouped data frame

# You can do that by using the following:

sample_iris <- tibble::tibble(
Sepal_Length = c(5.2, 5, 5.7, NA, 6.2, 6.7, 5.5),
Petal_Length = c(1.5, 1.4, 4.2, 1.4, NA, 5.8, 3.7),
Petal_Width = c(0.3, 0.2, 1.2, 0.2, 1.3, 1.8, NA),
Species = c("setosa", "setosa", "versicolor", "setosa",
           "virginica", "virginica", "setosa")
)

sample_iris %>%
group_by(Species) %>%
group_split() %>%
map_df(fill_missing_values, method = "median")

[Package bulkreadr version 1.1.1 Index]