data_VOC {cellWise} R Documentation

## VOC dataset

### Description

This dataset contains measurements of volatile organic compounds (VOCs) in the urine of children between 3 and 10 years old. It is composed of publicly available data from the National Health and Nutrition Examination Survey (NHANES) and was analyzed in Raymaekers and Rousseeuw (2020). See below for details and references.

### Usage

data("data_VOC")

### Format

A matrix of dimensions 512 × 19. The first 16 variables are the VOCs, the last 3 are:

• SMD460: number of smokers that live in the same home as the subject

• SMD470: number of people that smoke inside the home of the subject

• RIDAGEYR: age of the subject

Note that the original variable names are kept.
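The column layout described above can be used to separate the VOC measurements from the three covariates. The following is only a minimal sketch: `split_voc` is a hypothetical helper, not part of cellWise, and it relies solely on the column order stated in this section.

```r
# Hypothetical helper (not part of cellWise): split a 19-column object laid
# out like data_VOC into the 16 VOC columns and the 3 covariate columns.
split_voc <- function(X) {
  X <- as.matrix(X)
  stopifnot(ncol(X) == 19)
  list(
    voc        = X[, 1:16, drop = FALSE],   # VOC measurements
    covariates = X[, 17:19, drop = FALSE]   # SMD460, SMD470, RIDAGEYR
  )
}

# Apply it to the real data only if cellWise is installed.
if (requireNamespace("cellWise", quietly = TRUE)) {
  data("data_VOC", package = "cellWise")
  parts <- split_voc(data_VOC)
  print(colnames(parts$covariates))
}
```

Indexing by position keeps the sketch independent of the original NHANES variable names; indexing by name would work equally well.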

### Details

All of the data was obtained from the NHANES website and belongs to the NHANES 2015-2016 survey. This was the most recent survey cycle with complete data at the time of extraction. Three datasets were matched in order to assemble this data:

• UVOC_I: contains the information on the volatile organic compounds in urine

• DEMO_I: contains the demographic information such as age

• SMQFAM_I: contains the data on the smoking habits of family members

The dataset was constructed as follows:

1. Select the relevant VOCs from the UVOC_I data (see column names) and transform by taking the logarithm

2. Match the subjects in the UVOC_I data with their age in the DEMO_I data

3. Select all subjects with age at most 10

4. Match the data on smoking habits with the selected subjects.
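The four construction steps can be sketched in base R on toy data. This is an illustration only, not the authors' extraction script: the three small data frames and the VOC column names `VOC_A` and `VOC_B` are made up, the natural logarithm is assumed, and matching is assumed to be an inner join on the NHANES respondent identifier `SEQN`.

```r
# Toy stand-ins for the three NHANES tables (all values are made up).
uvoc <- data.frame(SEQN  = 1:5,
                   VOC_A = c(1.2, 3.4, 0.8, 2.5, 4.1),
                   VOC_B = c(10, 20, 15, 30, 25))
demo <- data.frame(SEQN = 1:6, RIDAGEYR = c(4, 12, 9, 10, 35, 7))
smq  <- data.frame(SEQN   = c(1, 3, 4, 5),
                   SMD460 = c(0, 1, 2, 0),
                   SMD470 = c(0, 1, 0, 0))

voc_cols <- c("VOC_A", "VOC_B")  # hypothetical VOC column names

# Step 1: keep the selected VOC columns and log-transform them
# (natural logarithm assumed).
step1 <- uvoc[, c("SEQN", voc_cols)]
step1[voc_cols] <- log(step1[voc_cols])

# Step 2: attach each subject's age by matching on the identifier.
step2 <- merge(step1, demo[, c("SEQN", "RIDAGEYR")], by = "SEQN")

# Step 3: keep subjects aged at most 10.
step3 <- step2[step2$RIDAGEYR <= 10, ]

# Step 4: attach the smoking variables for the selected subjects
# (inner join assumed; subjects without a smoking record are dropped).
step4 <- merge(step3, smq, by = "SEQN")

# Arrange the columns as in the Format section: VOCs first, then
# SMD460, SMD470 and RIDAGEYR.
result <- as.matrix(step4[, c(voc_cols, "SMD460", "SMD470", "RIDAGEYR")])
```

Whether unmatched subjects are dropped or kept with missing values is not stated above; passing `all.x = TRUE` to `merge` would keep them.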

### References

J. Raymaekers and P.J. Rousseeuw (2020). Handling cellwise outliers by sparse regression and robust covariance. arXiv:1912.12446.

### Examples

library("cellWise")
data("data_VOC")
# For an analysis of this data, we refer to the vignette:
vignette("DI_examples")


[Package cellWise version 2.2.5 Index]