data_VOC {cellWise}    R Documentation

VOC dataset


This dataset contains data on volatile organic compounds (VOCs) in the urine of children between 3 and 10 years old. It is composed of publicly available data from the National Health and Nutrition Examination Survey (NHANES) and was analyzed in Raymaekers and Rousseeuw (2020). See below for details and references.




A matrix of dimensions 512 × 19. The first 16 variables are the VOCs, the last 3 are:

Note that the original variable names are kept.


All of the data were collected from the NHANES website and are part of the NHANES 2015-2016 survey. This was the most recent survey cycle with complete data at the time of extraction. Three datasets were matched in order to assemble this data:

The dataset was constructed as follows:

  1. Select the relevant VOCs from the UVOC_I data (see column names) and transform them by taking the logarithm.

  2. Match the subjects in the UVOC_I data with their age in the DEMO_I data.

  3. Select all subjects with age at most 10.

  4. Match the data on smoking habits with the selected subjects.
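
The four steps above can be sketched in base R as follows. This is only an illustration on made-up toy tables, not the code used to build data_VOC: the column names (id, voc_a, voc_b, age, smoke_a, smoke_b) are hypothetical placeholders, the tables are assumed to share a subject identifier, and the natural logarithm is assumed since the base is not stated.

```r
# Toy stand-ins for the three NHANES tables; all values are made up.
# 'id' is a hypothetical subject identifier shared by the tables.
uvoc  <- data.frame(id = c(1, 2, 3, 4),
                    voc_a = c(10, 100, 1000, 50),  # placeholder VOC columns
                    voc_b = c(1, 2, 4, 8))
demo  <- data.frame(id = c(1, 2, 3, 4, 5),
                    age = c(4, 9, 35, 10, 7))
smoke <- data.frame(id = c(1, 2, 3, 4),
                    smoke_a = c(0, 1, 2, 0),
                    smoke_b = c(0, 1, 1, 0))

# Step 1: select the VOC columns and log-transform them
# (natural log is an assumption).
voc_cols <- c("voc_a", "voc_b")
voc_log <- uvoc[, c("id", voc_cols)]
voc_log[voc_cols] <- log(voc_log[voc_cols])

# Step 2: attach each subject's age by matching on the identifier.
dat <- merge(voc_log, demo[, c("id", "age")], by = "id")

# Step 3: keep subjects aged at most 10.
dat <- dat[dat$age <= 10, ]

# Step 4: attach the smoking information for the remaining subjects.
dat <- merge(dat, smoke, by = "id")

# Arrange as a numeric matrix: VOC columns first, other variables last.
X <- as.matrix(dat[, c(voc_cols, "smoke_a", "smoke_b", "age")])
rownames(X) <- dat$id
```

In this toy example the subject aged 35 is dropped in step 3, and the subject who appears only in the age table never enters the result because merge() keeps only identifiers present in both tables.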



J. Raymaekers and P.J. Rousseeuw (2020). Handling cellwise outliers by sparse regression and robust covariance. arXiv:1912.12446.


# For an analysis of this data, we refer to the vignette:

[Package cellWise version 2.2.5 Index]