DMTL {DMTL} | R Documentation |
Distribution Mapping based Transfer Learning
Description
This function performs distribution mapping based transfer learning (DMTL) regression for given target (primary) and source (secondary) datasets. The data available in the source domain are used to build a predictive model. The target features, whose response values are unknown, are transferred to the source domain via distribution matching, and the corresponding response values in the source domain are then predicted using that model. The predicted response values are finally transferred back to the original target space by applying distribution matching again. Hence, this function needs a pair of target datasets (features and response values) that do not have to be matched, and a matched pair of source datasets.
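The distribution matching step described above can be illustrated with a minimal base-R sketch. This is not the package's implementation: match_distribution() is a hypothetical helper, and it relies on empirical CDFs and quantiles, whereas DMTL() estimates the distributions from histogram counts or kernel densities.

```r
# Hypothetical helper (not part of DMTL): map values `x` that follow the
# distribution of `from` onto the distribution of `to`.
match_distribution <- function(x, from, to) {
  # Probability level of each value under the empirical CDF of `from`
  u <- ecdf(from)(x)
  # Value of `to` at the same probability level (inverse empirical CDF)
  as.numeric(quantile(to, probs = u, names = FALSE, type = 1))
}

set.seed(1)
target_x <- rnorm(500, mean = 0.3, sd = 0.6)  # one target feature
source_x <- rnorm(500, mean = 0, sd = 0.5)    # the same feature in the source

# Target feature expressed in the source space
mapped <- match_distribution(target_x, from = target_x, to = source_x)
summary(mapped)
```

The mapping is monotone, so the ordering of the target samples is preserved while their values take on the source distribution. The same idea, applied in the opposite direction to the predicted responses, would bring predictions back to the target space.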
Usage
DMTL(
target_set,
source_set,
use_density = FALSE,
pred_model = "RF",
model_optimize = FALSE,
sample_size = 1000,
random_seed = NULL,
all_pred = FALSE,
get_verbose = FALSE,
allow_parallel = FALSE
)
Arguments
target_set |
List containing the target datasets. A named list with
components X (features) and y (response values) |
source_set |
List containing the source datasets. A named list with
components X (features) and y (response values) |
use_density |
Flag for using kernel density as the distribution estimate
instead of histogram counts. Defaults to FALSE |
pred_model |
String indicating the underlying predictive model. The currently available options are -
|
model_optimize |
Flag for model parameter tuning. If |
sample_size |
Sample size for estimating the distributions of the target and
source datasets. Defaults to 1000 |
random_seed |
Seed for the random number generator (for reproducible
outcomes). Defaults to NULL |
all_pred |
Flag for returning the prediction values in the source space.
If |
get_verbose |
Flag for displaying the progress when optimizing the
predictive model i.e., |
allow_parallel |
Flag for allowing parallel processing when performing
grid search i.e., |
Value
If all_pred = FALSE, a vector containing the final prediction values.
If all_pred = TRUE, a named list with two components, target and source, i.e., predictions in the original target space and in the source space, respectively.
Note
The datasets in target_set (i.e., X and y) do not need to be matched (i.e., have the same number of rows), since the response values are used only to estimate the distribution for mapping, while the feature values are used for both mapping and final prediction. In contrast, the datasets in source_set (i.e., X and y) must have matched samples.
It is recommended to normalize the response values (y) of both sets so that they lie in the same range. If normalization is not performed, DMTL() uses the range of the target y values as the prediction range.
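One simple way to bring both response vectors into the same range is min-max scaling to [0, 1]. The sketch below uses base R only; rescale01() is a hypothetical helper and not part of the DMTL package, and other normalization schemes may suit a given dataset better.

```r
# Hypothetical helper (not part of DMTL): min-max scaling to [0, 1]
rescale01 <- function(y) {
  rng <- range(y, na.rm = TRUE)
  stopifnot(rng[2] > rng[1])  # a constant vector cannot be rescaled
  (y - rng[1]) / (rng[2] - rng[1])
}

set.seed(8644)
y_target <- rnorm(200, mean = 2, sd = 1.5)   # illustrative target responses
y_source <- rnorm(300, mean = -1, sd = 0.4)  # illustrative source responses

y_target_norm <- rescale01(y_target)
y_source_norm <- rescale01(y_source)

range(y_target_norm)
range(y_source_norm)
```

The normalized vectors can then be supplied as the y components of target_set and source_set.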
Examples
set.seed(8644)
## Generate two datasets with different underlying distributions...
x1 <- matrix(rnorm(3000, 0.3, 0.6), ncol = 3)
dimnames(x1) <- list(paste0("sample", 1:1000), paste0("f", 1:3))
y1 <- 0.3*x1[, 1] + 0.1*x1[, 2] - x1[, 3] + rnorm(1000, 0, 0.05)
x2 <- matrix(rnorm(3000, 0, 0.5), ncol = 3)
dimnames(x2) <- list(paste0("sample", 1:1000), paste0("f", 1:3))
y2 <- -0.2*x2[, 1] + 0.3*x2[, 2] - x2[, 3] + rnorm(1000, 0, 0.05)
## Model datasets using DMTL & compare with a baseline model...
library(DMTL)
target <- list(X = x1, y = y1)
source <- list(X = x2, y = y2)
y1_pred <- DMTL(target_set = target, source_set = source, pred_model = "RF")
y1_pred_bl <- RF_predict(x_train = x2, y_train = y2, x_test = x1)
print(performance(y1, y1_pred, measures = c("MSE", "PCC")))
print(performance(y1, y1_pred_bl, measures = c("MSE", "PCC")))
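For readers who want to see what the two reported measures express, the sketch below computes the mean squared error (MSE) and the Pearson correlation coefficient (PCC) in base R. It assumes that performance() follows these standard definitions; mse() and pcc() are hypothetical helpers, not functions exported by DMTL.

```r
# Hypothetical helpers (not part of DMTL), using the standard definitions
mse <- function(y_obs, y_pred) mean((y_obs - y_pred)^2)
pcc <- function(y_obs, y_pred) cor(y_obs, y_pred, method = "pearson")

# Small illustrative vectors: predictions shifted by a constant 0.5
y_obs  <- c(1, 2, 3, 4)
y_pred <- c(1.5, 2.5, 3.5, 4.5)

print(c(MSE = mse(y_obs, y_pred), PCC = pcc(y_obs, y_pred)))
```

A constant shift leaves the correlation perfect while still producing a nonzero error, which is why the example above reports both measures.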