thyroid_disease {MLDataR}R Documentation

Thyroid disease dataset

Description

The dataset is to be used with a supervised classification ML model to classify thyroid disease. The dataset was sourced and adapted from the UCI Machine Learning repository https://archive.ics.uci.edu/ml/index.php.

Usage

thyroid_disease

Format

A data frame with 3772 rows and 28 variables:

ThryroidClass

binary classification label indicating whether sick = 1 or negative=0

patient_age

age of the patient

patient_gender

flag indicating gender of patient - 1=Female and 0=Male

presc_thyroxine

flag to indicate whether thyroxine replacement prescribed 1=Thyroxine prescribed

queried_why_on_thyroxine

flag to indicate query has been actioned

presc_anthyroid_meds

flag to indicate whether anti-thyroid medicine has been prescribed

sick

flag to indicate sickness due to thyroxine depletion or over activity

pregnant

flag to indicate whether the patient is pregnant

thyroid_surgery

flag to indicate whether the patient has had thyroid surgery

radioactive_iodine_therapyI131

indicates whether patient has had radioactive iodine treatment: https://www.nhs.uk/conditions/thyroid-cancer/treatment/

query_hypothyroid

flag to indicate under active thyroid query https://www.nhs.uk/conditions/underactive-thyroid-hypothyroidism/

query_hyperthyroid

flag to indicate over active thyroid query https://www.nhs.uk/conditions/overactive-thyroid-hyperthyroidism/

lithium

Lithium carbonate administered to decrease the level of thyroid hormones

goitre

flag to indicate swelling of the thyroid gland https://www.nhs.uk/conditions/goitre/

tumor

flag to indicate a tumor

hypopituitarism

flag to indicate a diagnosed under active thyroid

psych_condition

indicates whether a patient has a psychological condition

TSH_measured

a TSH level lower than normal indicates there is usually more than enough thyroid hormone in the body and may indicate hyperthyroidism

TSH_reading

the reading result of the TSH blood test

T3_measured

linked to TSH reading - when free triiodothyronine rise above normal this indicates hyperthyroidism

T3_reading

the reading result of the T3 blood test looking for above normal levels of free triiodothyronine

T4_measured

free thyroxine, also known as T4, is used with T3 and TSH tests to diagnose hyperthyroidism

T4_reading

the reading result of th T4 test

thyrox_util_rate_T4U_measured

flag indicating the thyroxine utilisation rate https://pubmed.ncbi.nlm.nih.gov/1685967/

thyrox_util_rate_T4U_reading

the result of the test

FTI_measured

flag to indicate measurement on the Free Thyroxine Index (FTI)https://endocrinology.testcatalog.org/show/FRTUP

FTI_reading

the result of the test mentioned above

ref_src

[nominal] indicating the referral source of the patient

Source

Prepared and adatped by Gary Hutson hutsons-hacks@outlook.com, Dec-2021 and sourced from Garavan Institute and J. Ross Quinlan.

References

Thyroid disease records supplied by the Garavan Institute and J. Ross Quinlan.

Examples

library(dplyr)
library(ConfusionTableR)
library(parsnip)
library(rsample)
library(recipes)
library(ranger)
library(workflows)
data("thyroid_disease")
td <- thyroid_disease
# Create a factor of the class label to use in ML model
td$ThryroidClass <- as.factor(td$ThryroidClass)
# Check the structure of the data to make sure factor has been created
str(td)
# Remove missing values, or choose more advaced imputation option
td <- td[complete.cases(td),]
#Drop the column for referral source
td <- td %>%
 dplyr::select(-ref_src)
# Analyse class imbalance
class_imbalance <- prop.table(table(td$ThryroidClass))
class_imbalance
#Divide the data into a training test split
set.seed(123)
split <- rsample::initial_split(td, prop=3/4)
train_data <- rsample::training(split)
test_data <- rsample::testing(split)
# Create recipe to upsample and normalise
set.seed(123)
td_recipe <-
 recipe(ThryroidClass ~ ., data=train_data) %>%
  step_normalize(all_predictors()) %>%
  step_zv(all_predictors())
# Instantiate the model
set.seed(123)
rf_mod <-
  parsnip::rand_forest() %>%
  set_engine("ranger") %>%
  set_mode("classification")
# Create the model workflow
td_wf <-
  workflow() %>%
  workflows::add_model(rf_mod) %>%
  workflows::add_recipe(td_recipe)
# Fit the workflow to our training data
set.seed(123)
td_rf_fit <-
  td_wf %>%
  fit(data = train_data)
# Extract the fitted data
td_fitted <- td_rf_fit %>%
   extract_fit_parsnip()
# Predict the test set on the training set to see model performance
class_pred <- predict(td_rf_fit, test_data)
td_preds <- test_data %>%
bind_cols(class_pred)
# Convert both to factors
td_preds$.pred_class <- as.factor(td_preds$.pred_class)
td_preds$ThryroidClass <- as.factor(td_preds$ThryroidClass)
# Evaluate the data with ConfusionTableR
cm <- ConfusionTableR::binary_class_cm(td_preds$ThryroidClass ,
                                       td_preds$.pred_class,
                                       positive="sick")
#View Confusion matrix
cm$confusion_matrix
#View record level
cm$record_level_cm

[Package MLDataR version 1.0.1 Index]