R: Taking a random sample from an IDA data frame

idaSample {ibmdbR}

R Documentation

Taking a random sample from an IDA data frame

Description

This function draws a random sample from an IDA data frame (that is, an object of class ida.data.frame).

Usage

idaSample(bdf, n, stratCol=NULL, stratVals=NULL, stratProbs=NULL,
          dbPreSamplePercentage=100, fetchFirst=FALSE)

Arguments

`bdf`	The IDA data frame from which the sample is to be drawn.
`n`	The number of rows of sample data to be retrieved.
`stratCol`	For stratified sampling, the column that determines the strata.
`stratVals`	For stratified sampling, a vector of values that determine the subset of strata from which samples are to be drawn.
`stratProbs`	For stratified sampling, a vector of explicit sampling probabilities. Each value corresponds to a value of the vector specified for `stratVals`.
`dbPreSamplePercentage`	The percentage (0-100) of the IDA data frame from which the sample is to be drawn (see Details).
`fetchFirst`	If TRUE, the first rows returned by the database are fetched instead of a random sample (see Details).

Details

If stratCol is specified, a stratified sample based on the contents of the specified column is taken. Unless stratVals is also specified, each unique value in the column results in one stratum. If stratVals is also specified, only the values it specifies result in strata, and only rows that contain one of those values are included in the sample; other rows are ignored.
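The way strata are derived from the column can be illustrated without a database connection. The following is a minimal sketch in base R that uses the local `iris` data set as a stand-in for the IRIS table; the chosen `strat_vals` are hypothetical, and the code mimics the rule described above rather than reproducing what idaSample runs in the database.

```r
# Local stand-in for the IRIS table; no database connection is needed.
data(iris)

strat_col  <- "Species"
strat_vals <- c("setosa", "virginica")  # hypothetical choice of stratVals

# Without stratVals: every unique value of the column is one stratum.
all_strata <- unique(as.character(iris[[strat_col]]))

# With stratVals: only the listed values form strata; other rows are ignored.
eligible    <- iris[iris[[strat_col]] %in% strat_vals, ]
kept_strata <- unique(as.character(eligible[[strat_col]]))

print(all_strata)      # three strata
print(kept_strata)     # two strata
print(nrow(eligible))  # rows that remain candidates for the sample
```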

Unless stratProbs is also specified, the number of rows retrieved for each stratum is proportional to the size of that stratum relative to the data from which the sample is drawn.

To undersample or oversample data, use stratProbs to specify, for each value of stratVals, the fraction of the rows of the corresponding stratum that are to be included in the sample.
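As a rough illustration of per-stratum fractions, the sketch below computes how many rows each stratum would contribute for a given set of sampling probabilities. The stratum names, sizes, and probabilities are hypothetical, and the arithmetic is an assumption based on the description above, not the function's actual implementation.

```r
# Hypothetical stratum sizes (number of rows per stratum in the source data).
stratum_sizes <- c(common = 800, rare = 100)

strat_vals  <- c("common", "rare")
strat_probs <- c(0.125, 0.5)  # undersample "common", oversample "rare"

# Fraction of each stratum to include, matched to strat_vals by position.
rows_per_stratum <- strat_probs * stratum_sizes[strat_vals]

print(rows_per_stratum)  # common: 100, rare: 50
```

In the source data the ratio of common to rare rows is 8:1; with these probabilities it drops to 2:1 in the sample.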

For each stratum, the calculated number of rows is rounded up to the next integer. This ensures that at least one row is sampled from each stratum. Consequently, the number of rows returned might be higher than the value specified for n.
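A small worked example shows how proportional allocation combined with rounding up can exceed n. The stratum sizes are hypothetical, and the calculation is a sketch of the rule described above rather than the package's own code.

```r
# Hypothetical stratum sizes (number of rows per stratum in the source data).
stratum_sizes <- c(a = 700, b = 250, c = 50)
n <- 10

# Proportional allocation: each stratum's share of n follows its share of rows.
raw_alloc <- n * stratum_sizes / sum(stratum_sizes)  # 7, 2.5, 0.5

# Round each stratum up so that no stratum ends up with zero rows.
alloc <- ceiling(raw_alloc)                          # 7, 3, 1

print(alloc)
print(sum(alloc))  # 11, one more than the requested n = 10
```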

The value of dbPreSamplePercentage is a numeric value in the range 0-100 that represents the percentage of the IDA data frame that is to serve as the source of the sample data. When working with an especially large IDA data frame, specifying a value smaller than 100 improves performance, because less data must be processed. However, the stratum proportions in the pre-sampled data might differ from those in the complete data, which would result in a biased sample. Entire strata can even be missing from the final sample.
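The effect of pre-sampling on stratum proportions can be explored locally. The sketch below is an assumption-laden simulation: it uses a simple random subset in base R as a stand-in for the database pre-sample (the actual mechanism used by the database is not described here), and the data, group names, and percentage are hypothetical.

```r
set.seed(42)  # arbitrary seed, only to make the run repeatable

# Hypothetical source data: one large stratum and one very small stratum.
full <- data.frame(grp = rep(c("large", "tiny"), times = c(9990, 10)))

pre_pct  <- 1  # hypothetical value for dbPreSamplePercentage
pre_size <- nrow(full) * pre_pct / 100

# Stand-in for the database pre-sample: a simple random subset of the rows.
pre <- full[sample(nrow(full), size = pre_size), , drop = FALSE]

# Compare stratum proportions before and after pre-sampling.
# Fixing the factor levels keeps a stratum visible even if its count is zero.
print(prop.table(table(full$grp)))
print(prop.table(table(factor(pre$grp, levels = c("large", "tiny")))))
```

With a stratum this small, repeated runs with different seeds will often show the "tiny" group at zero in the pre-sample, which is the situation in which a stratum drops out of the final sample.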

When fetchFirst is set to TRUE, the rows for each stratum are taken in the order in which they are returned from the database rather than at random. This is usually much faster than random sampling, but can introduce bias, because the order in which the database returns rows is not necessarily random.
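The difference between the two modes can be sketched locally with the `iris` data set as a stand-in for a database table. This is only an analogy in base R: `head` plays the role of fetching the first rows, and the per-stratum row count `k` is a hypothetical value.

```r
data(iris)
k <- 3  # hypothetical number of rows to take per stratum

by_species <- split(iris, iris$Species)

# fetchFirst-like behaviour: take the first k rows of each stratum as stored.
first_rows <- do.call(rbind, lapply(by_species, head, n = k))

# Random behaviour: draw k rows at random from each stratum.
set.seed(1)  # arbitrary seed, only to make the run repeatable
random_rows <- do.call(
  rbind,
  lapply(by_species, function(d) d[sample(nrow(d), k), ])
)

print(first_rows)
print(random_rows)
```

If the stored order is related to any of the measured values (for example, rows inserted in sorted order), the first-rows variant reflects that ordering while the random variant does not.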

Value

An object of class data.frame that contains the sample.

Examples

## Not run: 
idf <- ida.data.frame('IRIS')

# Simple random sampling
df <- idaSample(idf, 10)

# Stratified sample
df <- idaSample(idf, 10, 'Species')

## End(Not run)

[Package ibmdbR version 1.51.0 Index]