dataSimilarity {semiArtificial} | R Documentation |
Evaluate statistical similarity of two data sets
Description
Use the mean, standard deviation, skewness, kurtosis, Hellinger distance and the Kolmogorov-Smirnov (KS) test to compare the similarity of two data sets.
Usage
dataSimilarity(data1, data2, dropDiscrete=NA)
Arguments
data1 |
A data.frame containing the first data set to compare. |
data2 |
A data.frame with the same attributes (columns) as data1. |
dropDiscrete |
A vector of indices of discrete attributes to skip in the comparison. Typically we might skip the class, because its distribution was forced by the user. |
Details
The function compares the data stored in data1 with the data in data2 on a per-attribute basis by computing several statistics: mean, standard deviation, skewness, kurtosis, Hellinger distance and the Kolmogorov-Smirnov (KS) test.
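The per-attribute statistics named above can be sketched in base R as follows. This is only an illustration under stated assumptions, not the package's implementation: the helper names (skewness, kurtosis, numStats, hellinger, relFreq) are hypothetical, the skewness and kurtosis use the simple moment-based estimators, and the Hellinger distance is shown for a discrete attribute only.

```r
# Moment-based skewness and kurtosis (assumed estimators; others exist)
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}
kurtosis <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2
}

# Summary statistics for one numeric attribute
numStats <- function(x) {
  c(mean = mean(x), sd = sd(x), skewness = skewness(x), kurtosis = kurtosis(x))
}

# Hellinger distance between two probability vectors over the same values
hellinger <- function(p, q) {
  sqrt(sum((sqrt(p) - sqrt(q))^2) / 2)
}

# Relative frequencies of a factor over a shared set of levels
relFreq <- function(f, levels) {
  as.numeric(table(factor(f, levels = levels))) / length(f)
}

# Two halves of iris stand in for data1 and data2
set.seed(1)
idx <- sample(nrow(iris), 75)
a <- iris[idx, ]
b <- iris[-idx, ]

# Numeric attribute: summary statistics and two-sample KS test
statsA <- numStats(a$Sepal.Length)
statsB <- numStats(b$Sepal.Length)
# ks.test warns about ties in iris; the warning is suppressed here
ksP <- suppressWarnings(ks.test(a$Sepal.Length, b$Sepal.Length))$p.value

# Discrete attribute: Hellinger distance between value frequencies
lv <- levels(iris$Species)
hd <- hellinger(relFreq(a$Species, lv), relFreq(b$Species, lv))
```

A Hellinger distance of 0 means identical frequencies and 1 means the two distributions share no values, so smaller values of hd indicate more similar attributes.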
Value
The method returns a list of statistics computed on both data sets:
equalInstances |
The number of instances in data2 that are equal to instances in data1. |
stats1num |
A matrix with rows containing statistics (mean, standard deviation, skewness, and kurtosis) computed on numeric attributes of data1. |
stats2num |
A matrix with rows containing statistics (mean, standard deviation, skewness, and kurtosis) computed on numeric attributes of data2. |
ksP |
A vector with p-values of Kolmogorov-Smirnov two-sample tests, performed on matching attributes from data1 and data2. |
freq1 |
A list with value frequencies for discrete attributes in data1. |
freq2 |
A list with value frequencies for discrete attributes in data2. |
dfreq |
A list with differences in frequencies of discrete attributes' values between data1 and data2. |
dstatsNorm |
A matrix with rows containing differences between statistics (mean, standard deviation, skewness, and kurtosis) computed on [0,1] normalized numeric attributes for data1 and data2. |
hellingerDist |
A vector with Hellinger distances between matching attributes from data1 and data2. |
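As a rough illustration of how the returned list might be inspected, the sketch below flags attributes whose distributions appear to differ. The helper flagDifferences and its thresholds are hypothetical and not part of the package. It also assumes that ksP and hellingerDist are vectors named by attribute, which the text above does not confirm, and the values in the stand-in list are made up for demonstration only.

```r
# Hypothetical helper: report attributes with a small KS p-value
# or a large Hellinger distance (thresholds are arbitrary assumptions)
flagDifferences <- function(sim, alpha = 0.05, maxHellinger = 0.2) {
  list(
    ksRejected   = names(sim$ksP)[sim$ksP < alpha],
    farHellinger = names(sim$hellingerDist)[sim$hellingerDist > maxHellinger]
  )
}

# Stand-in for a dataSimilarity() result; names and numbers are invented
sim <- list(
  ksP           = c(Sepal.Length = 0.62, Petal.Width = 0.01),
  hellingerDist = c(Sepal.Length = 0.08, Petal.Width = 0.31)
)

flagged <- flagDifferences(sim)
```

With a real result, sim would be replaced by the value returned from dataSimilarity(data1, data2).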
Author(s)
Marko Robnik-Sikonja
See Also
Examples
# load the package that provides rbfDataGen, newdata and dataSimilarity
library(semiArtificial)
# use iris data set, split into training and testing data
set.seed(12345)
train <- sample(1:nrow(iris), size=nrow(iris)*0.5)
irisTrain <- iris[train,]
irisTest <- iris[-train,]
# create RBF generator
irisGenerator <- rbfDataGen(Species~., irisTrain)
# use the generator to create new data
irisNew <- newdata(irisGenerator, size=100)
# compare statistics of original and new data
dataSimilarity(irisTest, irisNew)