create.reco.train.test {recometrics} | R Documentation |
Create Train-Test Splits of Implicit-Feedback Data
Description
Creates train-test splits of implicit-feedback data (recorded user-item interactions) for fitting and evaluating models for recommender systems.
These splits choose "test users" and "items for a given user" separately, offering three modes of splitting the data:
Creating training and testing sets for each user in the data (when passing 'split_type="all"').
This is meant for cases in which the number of users is small or the users to test have already been selected (e.g. one typically does not want to create a train-test split which would leave one item for the user in the training data and zero in the test set, or would want to have other minimum criteria for the test set to be usable). Typically, one would want to take only a sub-sample of users for evaluation purposes, as calculating recommendation quality metrics can take a very long time.
Selecting a sub-set of users for testing, for which training and testing data splits will be generated, while leaving the remainder of users with all the data for model fitting (when passing 'split_type="separated"').
This is meant to be used for fitting a model to the remainder of the data, then generating latent factors (assuming a low-rank matrix factorization model) or top-K recommendations for the test users given their training data, and evaluating these recommendations on the test data for each user (which can be done through the function calc.reco.metrics).
Selecting a sub-set of users for testing as above, but adding those users to the training data, in which case they will be the first rows (when passing 'split_type="joined"').
This is meant to be used for fitting a model to all such training data, and then evaluating the produced user factors or top-K recommendations for the test users against the test data.
It is recommended to use the 'separated' mode, as it is more reflective of real scenarios. Some models or libraries, however, cannot produce factors or recommendations for users which were not in the training data, in which case the 'joined' mode comes in handy.
Usage
create.reco.train.test(
X,
split_type = "separated",
users_test_fraction = 0.1,
max_test_users = 10000L,
items_test_fraction = 0.3,
min_items_pool = 2L,
min_pos_test = 1L,
consider_cold_start = FALSE,
seed = 1L
)
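A call in the default 'separated' mode might look like the sketch below. The matrix dimensions and density are made-up illustration values, and the example assumes that the 'Matrix' and 'recometrics' packages are installed. The entries inspected at the end are the ones listed under Value.

```r
library(Matrix)
library(recometrics)

# Random sparse CSR matrix standing in for real interaction data
# (100 users x 50 items; sizes and density are arbitrary).
set.seed(1)
X <- rsparsematrix(100, 50, density = 0.2, repr = "R")

res <- create.reco.train.test(
    X,
    split_type = "separated",
    users_test_fraction = 0.1,
    items_test_fraction = 0.3,
    seed = 1L
)

names(res)         # entries of the returned list
dim(res$X_train)   # one row per selected test user
dim(res$X_rem)     # rows for the users not selected for testing
res$users_test     # row indices of 'X' chosen as test users
```

A model would then be fitted to 'res$X_rem', with 'res$X_train' and 'res$X_test' kept aside for evaluating the test users.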
Arguments
X |
The implicit feedback data to split into training-testing-remainder for evaluating recommender systems. Should be passed as a sparse CSR matrix from the 'Matrix' package (class 'dgRMatrix'). Users should correspond to rows, items to columns, and non-zero values to observed user-item interactions. |
split_type |
Type of data split to generate. Allowed values are: 'all', 'separated', 'joined' (see the function description above for more details). |
users_test_fraction |
Target fraction of the users to set as test (see the function documentation for more details). If the number represented by this fraction exceeds the number set by 'max_test_users', then the actual number will be set to 'max_test_users'. Note however that the end result might end up containing fewer users if there are not enough users in the data meeting the minimum desired criteria. If passing 'NULL', will not take a fraction, but will instead take the number that is passed for 'max_test_users'. Ignored when passing 'split_type="all"'. |
max_test_users |
Maximum number of users to set as test. Note that this will only be applied for choosing the minimum between this and 'nrow(X)*users_test_fraction' (users correspond to rows of 'X'), while the actual number might end up being lower if there are not enough users meeting the desired minimum conditions. If passing 'NULL' for 'users_test_fraction', will interpret this as the number of test users to take. Ignored when passing 'split_type="all"'. |
items_test_fraction |
Target fraction of the data (items) to set for test for each user. Should be a number between zero and one (non-inclusive). The actual number of test entries for each user will be determined as 'round(n_entries_user*items_test_fraction)', thus in a long-tailed distribution (typical for recommender systems), the actual fraction that will be obtained is likely to be slightly lower than what is passed here. Note that items are sampled independently for each user, thus the items that are in the test set for some users might be in the training set for different users. |
min_items_pool |
Minimum number of items (sum of positive and negative items) that a user must have in order to be eligible as test user. |
min_pos_test |
Minimum number of positive entries (non-zero entries in the test set) that users would need to have in order to be eligible as test user. |
consider_cold_start |
Whether to still set users as eligible for test in situations in which some user would have test data but no positive (non-zero) entries in the training data. This will only happen when passing 'items_test_fraction>=0.5'. |
seed |
Seed to use for random number generation. |
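The sizing rules described for 'users_test_fraction', 'max_test_users' and 'items_test_fraction' can be worked through with plain arithmetic. The sketch below uses hypothetical user and interaction counts and only reproduces the formulas stated above. It does not call the package, and it ignores the eligibility filters ('min_items_pool', 'min_pos_test') as well as any rounding the function may apply to the user count.

```r
# Hypothetical inputs, chosen only for illustration.
n_users <- 250000L
users_test_fraction <- 0.1
max_test_users <- 10000L

# Target number of test users: the requested fraction, capped at the maximum.
n_test_target <- min(max_test_users, n_users * users_test_fraction)

# Per-user test size: round(n_entries_user * items_test_fraction).
items_test_fraction <- 0.3
n_entries_user <- c(1, 2, 3, 4, 10, 100)
n_test_items <- round(n_entries_user * items_test_fraction)

data.frame(
    n_entries_user,
    n_test_items,
    realized_fraction = n_test_items / n_entries_user
)
```

Users with very few interactions end up with a realized fraction that differs noticeably from the requested one (a user with a single entry gets no test items at all), which is why the minimum-criteria arguments matter in long-tailed data.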
Value
Will return a list with two to four elements depending on the requested split type:
If passing 'split_type="all"', will have elements 'X_train' and 'X_test', both of which will be sparse CSR matrices (class 'dgRMatrix' from the 'Matrix' package, which can be converted to other types through e.g. 'MatrixExtra::as.csc.matrix') with the same number of rows and columns as the 'X' that was passed as input.
If passing 'split_type="separated"', will have the entries 'X_train' and 'X_test' as above (but with a number of rows corresponding to the number of selected test users instead), plus an entry 'X_rem' which will be a CSR matrix containing the data for the remainder of the users (those which were not selected for testing and on which the recommendation model is meant to be fitted), and an entry 'users_test' which will be an integer vector containing the indices of the users/rows in 'X' which were selected for testing. The selected test users will be in sorted order, and the remaining data will remain in sorted order with the test users removed (e.g. if there are 5 users, with the second and fifth selected for testing, then 'X_train' and 'X_test' will contain rows [2,5] of 'X', while 'X_rem' will contain rows [1,3,4]).
If passing 'split_type="joined"', will not contain the entry 'X_rem'; instead, 'X_train' will be the concatenation of the test users' training data and the remainder data (what 'X_train' and 'X_rem' would be in the 'separated' mode), with the test users' rows coming first (e.g. if there are 5 users, with the second and fifth selected for testing, then 'X_test' will contain rows [2,5] of 'X', while 'X_train' will contain rows [2,5,1,3,4], in that order).
The training and testing items for each user will not intersect, and each item in the original 'X' data for a given test user will be assigned to either the training or the testing sets.
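The row ordering in the 5-user example can be reproduced with plain indexing. The sketch below uses a small dense toy matrix and an assumed selection of test users (the second and fifth). It illustrates only which rows of 'X' end up where, not the per-user split of items into training and test, and it does not call the package.

```r
# Toy 5-user, 3-item matrix; row names identify the original users.
X <- matrix(1:15, nrow = 5, dimnames = list(paste0("u", 1:5), NULL))
users_test <- c(2L, 5L)  # assumed selection, as in the example in the text

rows_test   <- X[users_test, , drop = FALSE]   # users behind 'X_train' / 'X_test'
rows_rem    <- X[-users_test, , drop = FALSE]  # users behind 'X_rem' ('separated')
rows_joined <- rbind(rows_test, rows_rem)      # row order of 'X_train' ('joined')

rownames(rows_rem)     # "u1" "u3" "u4"
rownames(rows_joined)  # "u2" "u5" "u1" "u3" "u4"
```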