classDataGen {CORElearn} | R Documentation |
Artificial data for testing classification algorithms
Description
The generator produces classification data with 2 classes, 7 discrete and 3 numeric attributes.
Usage
classDataGen(noInst, t1=0.7, t2=0.9, t3=0.34, t4=0.32,
p1=0.5, classNoise=0)
Arguments
noInst |
Number of instances to generate. |
t1 , t2 , t3 |
Parameters, which control the hardness of the discrete attributes. |
t4 |
Parameter, which controls the hardness of the numeric attributes.. |
p1 |
Probability of class 1. |
classNoise |
Proportion of noise in the class variable for classification or virtual class variable for regression. |
Details
Class probabilities are p1
and 1 - p1
, respectively. The conditional distribution of attributes
under each of the classes depends on parameters t1, t2, t3, t4
from [0,1].
Attributes a7 and x3 are irrelevant for all values of parameters.
Examples of extreme settings of the parameters.
Setting satisfying t1*t2 = t3 implies no difference between the distributions of individual discrete attributes among the two classes. However, if t1 < 1, then the joint distribution of them is different for the two classes.
Setting t1 = 1 and t2 = t3 implies no difference between the joint distribution of the discrete attributes among the two classes.
Setting t1 = 1, t2 = 1, t3 = 0 implies disjoint supports of the distributions of a1, a2, a4, a5, so this allows exact classification.
Setting t4 = 1 implies no difference between the distribution of x1, x2 between the classes. Setting t4 = 0 allows correct classification with probability one only using x1 and x2.
For class 1 the attributes have distributions
(a1, a2, a3) | D_1(t1, t2) |
a4, a5, a6 | D_2(t3) |
a7 | irrelevant attribute, probabilities of {a,b,c,d} are (1/2, 1/6, 1/6, 1/6) |
x1, x2, x3 | independent normal variables with mean 0 and standard deviation 1, t4, 1 |
x4, x5 | independent uniformly distributed variables on [0,1] |
For class 2 the attributes have distributions
a1, a2, a3 | D_2(t3) |
(a4, a5, a6) | D_1(t1, t2) |
a7 | irrelevant attribute, probabilities of {a,b,c,d} are (1/2, 1/6, 1/6, 1/6) |
x1, x2, x3 | independent normal variables with mean 0 and st. dev. t4, 1, 1 |
x4, x5 | independent uniformly distributed variables on [0,1] |
x3 is irrelevant for classification, since it has the same distribution under both classes.
Attributes in a bracket are mutually dependent. Otherwise, the attributes are conditionally independent for each of the two classes. This means that if we consider groups of the attributes such that the attributes in each of the two brackets form a group and each of the remaining attributes forms a group with one element, then for each class, we have 7 groups, which are conditionally independent for the given class. Note that the splitting into groups differs for class 1 and 2.
Distribution D_1(t1,t2)
consists of three dependent attributes. The
distribution of individual attributes depends only on t1*t2. For a given t1*t2,
the level of dependence decreases with t1 and increases with t2. There are
two extreme settings:
Setting t1 = 1, t2 = t1*t2 has the largest t1 and the smallest t2 and all three
attributes are independent.
Setting t1 = t1*t2, t2 = 1 has the smallest t1 and the largest t2 and also the
largest dependence between attributes.
Distribution D_2(t3)
is equal to D_1(1, t3)
, so it contains three independent
attributes, whose distributions are the same as in D_1(t1,t2)
for every
setting satifying t1*t2 = t3.
In other words, if t3 = t1*t2, then the distributions D_1(t1, t2)
and D_2(t3)
have the same distributions of individual attributes and may differ only
in the dependences. There are no in D_2(t3)
and there are some in D_1(t1, t2)
if t1 < 1.
Hardness of the discrete part
Setting t1 = 1 and t2 = t3 implies no difference between the discrete attributes among the two classes.
Setting satisfying t1*t2 = t3 implies no difference between the distributions of individual discrete attributes among the two classes. However, there may be a difference in dependences.
Setting t1 = 1, t2 = 1, t3 = 0 implies disjoint supports of the distributions of a1, a2, a4, a5, so this allows exact classification.
Hardness of the continuous part
Depends monotonically on t4. Setting t4 = 1 implies no difference between the classes. Setting t4 = 0 allows correct classification with probability one.
Value
The method classDataGen
returns a data.frame
with noInst
rows and 11 columns.
Range of values of the attributes and class are
a1 |
0,1 |
a2 |
0,1 |
a3 |
a,b,c,d |
a4 |
0,1 |
a5 |
0,1 |
a6 |
a,b,c,d |
a7 |
a,b,c,d |
x1 |
numeric |
x2 |
numeric |
x3 |
numeric |
class |
1,2 |
For detailed specification of attributes (columns) see details section below.
Author(s)
Petr Savicky
See Also
regDataGen
, ordDataGen
,CoreModel
.
Examples
#prepare a classification data set
classData <-classDataGen(noInst=200)
# build random forests model with certain parameters
modelRF <- CoreModel(class~., classData, model="rf",
selectionEstimator="MDL", minNodeWeightRF=5,
rfNoTrees=100, maxThreads=1)
print(modelRF)
destroyModels(modelRF) # clean up