R: Thyroid disease dataset

thyroid_disease {MLDataR}

R Documentation

Thyroid disease dataset

Description

This dataset is intended for training supervised classification models that predict thyroid disease. It was sourced and adapted from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/index.php.

Usage

thyroid_disease

Format

A data frame with 3772 rows and 28 variables:

ThryroidClass: binary classification label, sick = 1 or negative = 0
patient_age: age of the patient
patient_gender: flag indicating gender of patient - 1=Female and 0=Male
presc_thyroxine: flag to indicate whether thyroxine replacement has been prescribed (1 = thyroxine prescribed)
queried_why_on_thyroxine: flag to indicate query has been actioned
presc_anthyroid_meds: flag to indicate whether anti-thyroid medicine has been prescribed
sick: flag to indicate sickness due to thyroxine depletion or over activity
pregnant: flag to indicate whether the patient is pregnant
thyroid_surgery: flag to indicate whether the patient has had thyroid surgery
radioactive_iodine_therapyI131: indicates whether patient has had radioactive iodine treatment: https://www.nhs.uk/conditions/thyroid-cancer/treatment/
query_hypothyroid: flag to indicate an underactive thyroid query https://www.nhs.uk/conditions/underactive-thyroid-hypothyroidism/
query_hyperthyroid: flag to indicate an overactive thyroid query https://www.nhs.uk/conditions/overactive-thyroid-hyperthyroidism/
lithium: Lithium carbonate administered to decrease the level of thyroid hormones
goitre: flag to indicate swelling of the thyroid gland https://www.nhs.uk/conditions/goitre/
tumor: flag to indicate a tumor
hypopituitarism: flag to indicate a diagnosed underactive pituitary gland
psych_condition: indicates whether a patient has a psychological condition
TSH_measured: flag to indicate whether TSH was measured; a lower-than-normal TSH level usually means there is more than enough thyroid hormone in the body and may indicate hyperthyroidism
TSH_reading: the reading result of the TSH blood test
T3_measured: linked to the TSH reading - when free triiodothyronine rises above normal this indicates hyperthyroidism
T3_reading: the reading result of the T3 blood test looking for above normal levels of free triiodothyronine
T4_measured: free thyroxine, also known as T4, is used with T3 and TSH tests to diagnose hyperthyroidism
T4_reading: the reading result of the T4 test
thyrox_util_rate_T4U_measured: flag indicating whether the thyroxine utilisation rate (T4U) was measured https://pubmed.ncbi.nlm.nih.gov/1685967/
thyrox_util_rate_T4U_reading: the result of the thyroxine utilisation rate test
FTI_measured: flag to indicate measurement on the Free Thyroxine Index (FTI) https://endocrinology.testcatalog.org/show/FRTUP
FTI_reading: the result of the FTI test
ref_src: [nominal] indicating the referral source of the patient
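
Several variables above come in pairs: a *_measured flag and a matching *_reading value. The following is a minimal base R sketch of how those pairs might be summarised. It assumes, without confirmation from this page, that the flags are coded 0/1 and that a reading is NA when its test was not measured. The toy data frame is hypothetical and only borrows two pairs of column names; its values are invented for illustration.

```r
# Hypothetical toy rows mimicking two column pairs from the Format list.
# Values are invented; swap `toy` for thyroid_disease to inspect real data.
toy <- data.frame(
  TSH_measured = c(1, 1, 0, 1),
  TSH_reading  = c(1.3, 4.1, NA, 0.9),
  T3_measured  = c(1, 0, 0, 1),
  T3_reading   = c(2.5, NA, NA, 1.9)
)

# Find each "<test>_measured" column and derive its "<test>_reading" partner.
flag_cols    <- grep("_measured$", names(toy), value = TRUE)
reading_cols <- sub("_measured$", "_reading", flag_cols)

# For each pair, count rows flagged as not measured and rows with a missing reading.
pair_summary <- data.frame(
  flag           = flag_cols,
  reading        = reading_cols,
  n_not_measured = vapply(flag_cols, function(f) sum(toy[[f]] == 0),
                          integer(1), USE.NAMES = FALSE),
  n_missing      = vapply(reading_cols, function(r) sum(is.na(toy[[r]])),
                          integer(1), USE.NAMES = FALSE),
  row.names      = NULL
)
pair_summary
```

If the two counts differ for a pair in the real data, the assumption above does not hold for that test and the flag and reading should be examined separately.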

Source

Prepared and adapted by Gary Hutson hutsons-hacks@outlook.com, Dec-2021 and sourced from the Garavan Institute and J. Ross Quinlan.

References

Thyroid disease records supplied by the Garavan Institute and J. Ross Quinlan.

Examples

library(dplyr)
library(ConfusionTableR)
library(parsnip)
library(rsample)
library(recipes)
library(ranger)
library(workflows)
data("thyroid_disease")
td <- thyroid_disease
# Create a factor of the class label to use in ML model
td$ThryroidClass <- as.factor(td$ThryroidClass)
# Check the structure of the data to make sure factor has been created
str(td)
# Remove missing values, or choose a more advanced imputation option
td <- td[complete.cases(td),]
# Drop the column for referral source
td <- td %>%
  dplyr::select(-ref_src)
# Analyse class imbalance
class_imbalance <- prop.table(table(td$ThryroidClass))
class_imbalance
# Split the data into training and test sets
set.seed(123)
split <- rsample::initial_split(td, prop=3/4)
train_data <- rsample::training(split)
test_data <- rsample::testing(split)
# Create recipe to remove zero-variance predictors and normalise the rest
set.seed(123)
td_recipe <-
  recipe(ThryroidClass ~ ., data=train_data) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_predictors())
# Instantiate the model
set.seed(123)
rf_mod <-
  parsnip::rand_forest() %>%
  set_engine("ranger") %>%
  set_mode("classification")
# Create the model workflow
td_wf <-
  workflow() %>%
  workflows::add_model(rf_mod) %>%
  workflows::add_recipe(td_recipe)
# Fit the workflow to our training data
set.seed(123)
td_rf_fit <-
  td_wf %>%
  fit(data = train_data)
# Extract the fitted parsnip model
td_fitted <- td_rf_fit %>%
  extract_fit_parsnip()
# Predict on the test set to assess model performance
class_pred <- predict(td_rf_fit, test_data)
td_preds <- test_data %>%
  bind_cols(class_pred)
# Convert both to factors
td_preds$.pred_class <- as.factor(td_preds$.pred_class)
td_preds$ThryroidClass <- as.factor(td_preds$ThryroidClass)
# Evaluate the data with ConfusionTableR
cm <- ConfusionTableR::binary_class_cm(td_preds$ThryroidClass,
                                       td_preds$.pred_class,
                                       positive="sick")
# View the confusion matrix
cm$confusion_matrix
# View the record-level output
cm$record_level_cm
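
The example above drops incomplete rows with complete.cases() and mentions imputation as an alternative. The following is a minimal base R sketch of one simple option, median imputation of numeric columns. The helper name impute_median and the toy data frame are hypothetical and are not part of MLDataR. In a real workflow the medians would normally be computed on the training split only, so that no information from the test set leaks into the model.

```r
# Hypothetical helper: replace NA values in numeric columns with the column median.
impute_median <- function(df) {
  for (col in names(df)) {
    if (is.numeric(df[[col]])) {
      med <- median(df[[col]], na.rm = TRUE)
      df[[col]][is.na(df[[col]])] <- med
    }
  }
  df
}

# Invented toy values that borrow two reading column names from the dataset.
toy <- data.frame(
  TSH_reading = c(1.0, NA, 3.0, 5.0),
  T3_reading  = c(2.0, 2.5, NA, 1.5)
)

imputed <- impute_median(toy)
imputed
```

Median imputation keeps every row but shrinks the variance of the imputed columns, so it is worth comparing model performance against the complete-case approach used in the example.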

[Package MLDataR version 1.0.1 Index]