Adult {arules} | R Documentation |
Adult Data Set
Description
The AdultUCI data set contains the questionnaire data of the Adult database (originally called the Census Income Database) formatted as a data.frame. The Adult data set contains the data already prepared and coerced to transactions for use with arules.
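As a rough illustration of what "coerced to transactions" means, the base R sketch below turns each row of a small, made-up data frame into a set of items of the form variable=level. The toy data and the names toy, item_sets, and all_items are hypothetical, and the sketch only mimics the idea; it is not the arules implementation.

```r
# Hypothetical toy data frame; the values are invented for illustration.
toy <- data.frame(
  sex    = factor(c("Male", "Female")),
  income = factor(c("small", "large"), levels = c("small", "large"),
                  ordered = TRUE)
)

# One "transaction" per row: every variable contributes one item
# labelled "variable=level".
item_sets <- lapply(seq_len(nrow(toy)), function(i) {
  values <- vapply(toy[i, , drop = FALSE], as.character, character(1))
  paste0(names(values), "=", values)
})

# The item universe is the set of distinct variable=level pairs.
all_items <- sort(unique(unlist(item_sets)))

print(item_sets)
print(all_items)
```

In this toy case two variables with two levels each give four distinct items.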
Format
Adult is an object of class transactions with 48842 transactions and 115 items. See below for details.
The AdultUCI data set contains a data frame with 48842 observations on the following 15 variables.
- age
a numeric vector.
- workclass
a factor with levels Federal-gov, Local-gov, Never-worked, Private, Self-emp-inc, Self-emp-not-inc, State-gov, and Without-pay.
- education
an ordered factor with levels Preschool < 1st-4th < 5th-6th < 7th-8th < 9th < 10th < 11th < 12th < HS-grad < Prof-school < Assoc-acdm < Assoc-voc < Some-college < Bachelors < Masters < Doctorate.
- education-num
a numeric vector.
- marital-status
a factor with levels Divorced, Married-AF-spouse, Married-civ-spouse, Married-spouse-absent, Never-married, Separated, and Widowed.
- occupation
a factor with levels Adm-clerical, Armed-Forces, Craft-repair, Exec-managerial, Farming-fishing, Handlers-cleaners, Machine-op-inspct, Other-service, Priv-house-serv, Prof-specialty, Protective-serv, Sales, Tech-support, and Transport-moving.
- relationship
a factor with levels Husband, Not-in-family, Other-relative, Own-child, Unmarried, and Wife.
- race
a factor with levels Amer-Indian-Eskimo, Asian-Pac-Islander, Black, Other, and White.
- sex
a factor with levels Female and Male.
- capital-gain
a numeric vector.
- capital-loss
a numeric vector.
- fnlwgt
a numeric vector.
- hours-per-week
a numeric vector.
- native-country
a factor with levels Cambodia, Canada, China, Columbia, Cuba, Dominican-Republic, Ecuador, El-Salvador, England, France, Germany, Greece, Guatemala, Haiti, Holand-Netherlands, Honduras, Hong, Hungary, India, Iran, Ireland, Italy, Jamaica, Japan, Laos, Mexico, Nicaragua, Outlying-US(Guam-USVI-etc), Peru, Philippines, Poland, Portugal, Puerto-Rico, Scotland, South, Taiwan, Thailand, Trinadad&Tobago, United-States, Vietnam, and Yugoslavia.
- income
an ordered factor with levels small < large.
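Because education and income are ordered factors, their levels can be compared directly. The following base R sketch shows this on a made-up vector that reuses the income level order small < large; the values are hypothetical and not taken from AdultUCI.

```r
# Hypothetical toy values; only the level order mirrors the income variable.
income <- factor(c("small", "large", "small"),
                 levels = c("small", "large"), ordered = TRUE)

# Comparison operators respect the level order of an ordered factor.
below_large <- income < "large"

# max() is defined for ordered factors and returns the highest level present.
top_level <- max(income)

print(below_large)
print(top_level)
```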
Details
The Adult database was extracted from the census bureau database found at https://www.census.gov/ in 1994 by Ronny Kohavi and Barry Becker (Data Mining and Visualization, Silicon Graphics). It was originally used to predict whether income exceeds USD 50K/yr based on census data. We added the attribute income with levels small and large (>50K).
We prepared the data set for association mining as shown in the section Examples. We removed the continuous attribute fnlwgt (final weight). We also eliminated education-num because it is just a numeric representation of the attribute education. We mapped the other 4 continuous attributes to ordinal attributes as follows:
- age: cut into levels Young (0-25), Middle-aged (26-45), Senior (46-65), and Old (66+)
- hours-per-week: cut into levels Part-time (0-25), Full-time (25-40), Over-time (40-60), and Workaholic (60+)
- capital-gain and capital-loss: each cut into levels None (0), Low (greater than 0, up to the median of the values greater than zero), and High (above that median)
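The mapping above can be sketched in base R on a few made-up values. The vectors age and gain below are hypothetical and are not taken from AdultUCI. The sketch passes labels and ordered_result directly to cut(), which is an alternative spelling of the ordered(cut(...), labels = ...) form used in the Examples section.

```r
# Hypothetical toy values, chosen to sit on and around the break points.
age  <- c(17, 25, 26, 45, 46, 70)
gain <- c(0, 0, 500, 1000, 7000, 0)

# cut() uses right-closed intervals by default, so 25 falls into (15, 25].
age_grp <- cut(age, breaks = c(15, 25, 45, 65, 100),
               labels = c("Young", "Middle-aged", "Senior", "Old"),
               ordered_result = TRUE)

# The Low/High boundary is the median of the strictly positive values.
pos_median <- median(gain[gain > 0])
gain_grp <- cut(gain, breaks = c(-Inf, 0, pos_median, Inf),
                labels = c("None", "Low", "High"),
                ordered_result = TRUE)

print(data.frame(age, age_grp, gain, gain_grp))
```

A value exactly equal to the median of the positive values lands in Low, because the interval is closed on the right.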
Author(s)
Michael Hahsler
Source
References
A. Asuncion & D. J. Newman (2007): UCI Repository of Machine Learning Databases. Irvine, CA: University of California, Department of Information and Computer Science.
The data set was first cited in Kohavi, R. (1996): Scaling Up the Accuracy of Naive-Bayes Classifiers: a Decision-Tree Hybrid. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining.
Examples
library("arules")
data("AdultUCI")
dim(AdultUCI)
AdultUCI[1:2, ]
## remove attributes
AdultUCI[["fnlwgt"]] <- NULL
AdultUCI[["education-num"]] <- NULL
## map metric attributes
AdultUCI[["age"]] <- ordered(cut(AdultUCI[["age"]], c(15, 25, 45, 65, 100)),
  labels = c("Young", "Middle-aged", "Senior", "Old"))
AdultUCI[["hours-per-week"]] <- ordered(cut(AdultUCI[["hours-per-week"]],
  c(0, 25, 40, 60, 168)),
  labels = c("Part-time", "Full-time", "Over-time", "Workaholic"))
AdultUCI[["capital-gain"]] <- ordered(cut(AdultUCI[["capital-gain"]],
  c(-Inf, 0, median(AdultUCI[["capital-gain"]][AdultUCI[["capital-gain"]] > 0]),
  Inf)), labels = c("None", "Low", "High"))
AdultUCI[["capital-loss"]] <- ordered(cut(AdultUCI[["capital-loss"]],
  c(-Inf, 0, median(AdultUCI[["capital-loss"]][AdultUCI[["capital-loss"]] > 0]),
  Inf)), labels = c("None", "Low", "High"))
## create transactions
Adult <- transactions(AdultUCI)
Adult