idaSample {ibmdbR} | R Documentation |
Taking a random sample from an IDA data frame
Description
This function draws a random sample from an IDA data frame (that is, an object of the class ida.data.frame).
Usage
idaSample(bdf, n, stratCol=NULL,stratVals=NULL,stratProbs=NULL,
dbPreSamplePercentage=100,fetchFirst=F);
Arguments
bdf |
The IDA data frame from which the sample is to be drawn. |
n |
The number of rows of sample data to be retrieved. |
stratCol |
For stratified sampling, the column that determines the strata. |
stratVals |
For stratified sampling, a vector of values that determine the subset of strata from which samples are to be drawn. |
stratProbs |
For stratified sampling, a vector of explicit sampling probabilities.
Each value corresponds to a value of the vector specified for stratVals. |
dbPreSamplePercentage |
The percentage of the IDA data frame from which the sample is to be drawn (see details). |
fetchFirst |
If TRUE, fetch the first rows instead of drawing a random sample. |
Details
If stratCol is specified, a stratified sample based on the contents of the specified column is taken.
Unless stratVals is also specified, each unique value in the column results in one stratum.
If stratVals is also specified, only the values it specifies result in strata, and only rows that contain one of those values are included in the sample; other rows are ignored.
Unless stratProbs is also specified, the number of rows retrieved for each stratum is proportional to the size of that stratum relative to the overall data.
To undersample or oversample data, use stratProbs to specify, for each value of stratVals, the fraction of the rows of the corresponding stratum that are to be included in the sample.
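As a rough local illustration of how stratVals and stratProbs combine, the sketch below applies the description above to R's built-in iris data frame using base R only. It is not the ibmdbR implementation: the variable names are made up for the example, the probabilities are arbitrary, and n is ignored for simplicity.

```r
# Local sketch on the built-in iris data; NOT the ibmdbR implementation.
# strat_vals and strat_probs are hypothetical stand-ins for the
# stratVals and stratProbs arguments.
strat_vals  <- c("setosa", "virginica")
strat_probs <- c(0.25, 0.5)

# Keep only rows whose stratum value is listed; other rows are ignored.
sub <- iris[iris$Species %in% strat_vals, ]

# Stratum sizes, in the same order as strat_vals.
counts <- table(droplevels(sub$Species))[strat_vals]

# Fraction of each stratum to draw, rounded up to a whole row count.
rows_per_stratum <- ceiling(strat_probs * as.vector(counts))
names(rows_per_stratum) <- strat_vals

# Draw the rows for each stratum and stack the results.
sampled <- do.call(rbind, lapply(strat_vals, function(v) {
  rows <- sub[sub$Species == v, ]
  rows[sample(nrow(rows), rows_per_stratum[[v]]), ]
}))

print(rows_per_stratum)  # setosa 13, virginica 25
print(nrow(sampled))     # 38
```

Against a database table, the analogous request would pass the same two vectors as the stratVals and stratProbs arguments of idaSample, together with stratCol.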
For each stratum, the calculated number of rows is rounded up to the next integer. This ensures that there is at least one sample for each stratum. Consequently, the number of samples that is returned might be higher than the value specified for n.
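The effect of rounding up can be seen with a little arithmetic. The sketch below uses invented stratum sizes and a plain proportional allocation in base R; it only illustrates the rule described above and is not taken from the package's code.

```r
# Hypothetical stratum sizes; not taken from a real table.
strata_sizes <- c(a = 980, b = 15, c = 5)
n <- 10

# Proportional allocation: each stratum's share of n.
raw_alloc <- n * strata_sizes / sum(strata_sizes)  # 9.8, 0.15, 0.05

# Rounding up guarantees at least one row per stratum.
alloc <- ceiling(raw_alloc)                        # 10, 1, 1

print(alloc)
print(sum(alloc))  # 12, which is more than n = 10
```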
The value of dbPreSamplePercentage is a number in the range 0-100 that represents the percentage of the IDA data frame that is to serve as the source of the sample data.
When working with an especially large IDA data frame, specifying a value smaller than 100 improves performance, because less data must be processed.
However, the stratum proportions in the pre-sampled data might differ from those in the complete data, and this would result in a biased sample. It can even happen that entire strata are excluded from the final sample.
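To get a feel for how easily a small stratum can vanish, the sketch below computes the chance that none of its rows survive pre-sampling. It assumes that the pre-sample keeps each row independently with the given probability; that is an assumption made for illustration, since this page does not say how the database performs the pre-sampling. The numbers are invented.

```r
# Assumption: each row is kept independently with probability p_keep.
p_keep       <- 0.10  # corresponds to a pre-sample percentage of 10
stratum_rows <- 5     # hypothetical size of a rare stratum

# Probability that every row of the stratum is dropped.
p_excluded <- (1 - p_keep)^stratum_rows

print(round(p_excluded, 3))  # 0.59
```

Under that assumption, a five-row stratum would be missing from more than half of all 10 percent pre-samples.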
When fetchFirst is set to TRUE, the sample values of each stratum are taken in the order in which they are returned from the database rather than randomly. This is usually much faster than random sampling, but can introduce bias.
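The kind of bias this can introduce is easy to reproduce locally. The sketch below is an analogy in base R on the built-in iris data, not a call to idaSample: it assumes the rows happen to come back sorted by one column, which a database does not guarantee either way.

```r
# Local analogy only: pretend the rows are returned sorted by Sepal.Length.
ordered_rows <- iris[order(iris$Sepal.Length), ]

# Taking the first rows, as opposed to a random draw.
first_rows <- head(ordered_rows, 10)

# The first rows systematically understate the column mean.
mean_first <- mean(first_rows$Sepal.Length)
mean_all   <- mean(iris$Sepal.Length)

print(mean_first < mean_all)  # TRUE
```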
Value
An object of class data.frame that contains the sample.
Examples
## Not run:
idf <- ida.data.frame('IRIS')
# Simple random sampling
df <- idaSample(idf, 10)
# Stratified sample
df <- idaSample(idf, 10, 'Species')
## End(Not run)