SURF {SuRF.vs}    R Documentation

SURF

Description

SuRF is a sparse variable selection method that uses a subsampling approach and LASSO to rank variables before applying forward selection using a permutation test. The function can report results at a range of significance levels simultaneously.

Usage

SURF(
  Xo,
  y,
  X = NULL,
  fold = 10,
  Alpha = 1,
  prop = 0.1,
  weights = FALSE,
  B = 1000,
  C = 200,
  ncores = 1,
  display.progress = TRUE,
  family = stats::binomial(link = "logit"),
  alpha_u = 0.1,
  alpha = 0.05
)

Arguments

Xo

- other types of predictor variables

y

- response variable, a vector for most families. For family="cox", y should be a matrix with the response variable in column 1 and the censoring status in column 2.

X

- count data, which need to be converted to proportions

fold

- number of folds for cross-validation in Lasso

Alpha

- Alpha parameter for elastic net

prop

- proportion of observations left out in subsampling

weights

- whether to use weighted regression, either for unbalanced class sizes (binomial family only) or for a weighted sample (other families). In a binomial model, set weights=TRUE if the weighted version is desired and weights=FALSE otherwise. In other generalized models, set weights to a vector of weights with the same length as the sample size N if the weighted version is desired and weights=FALSE otherwise.

B

- number of subsamples to take

C

- number of permutations used to estimate null distribution

display.progress

- whether SuRF should print a message on completion of each

alpha_u

- the upper bound of the significance level for the permutation test. alpha_u has to be in the range (0,1). The larger this value, the longer the program will run.

alpha

- the alpha value of interest (alpha > 0 and alpha <= alpha_u). It can be a single value or a vector. If missing, it defaults to 0.05.

ncores

- whether SuRF should compute in parallel: 1 means no parallel computation; any value greater than 1 computes in parallel

family

- the distribution family of the response variable
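
As a rough illustration of two of the input conventions above, the following base R sketch row-normalizes a small, made-up count matrix into proportions and checks the stated constraints on alpha and alpha_u. The names counts and props are hypothetical, and whether SURF performs the conversion itself is not stated here; the sketch only shows what converting counts to proportions means.

```r
set.seed(1)

# Hypothetical count table: 5 samples (rows) by 4 features (columns).
# Adding 1 keeps every row total strictly positive.
counts <- matrix(rpois(20, lambda = 20) + 1, nrow = 5,
                 dimnames = list(NULL, paste0("taxon", 1:4)))

# Divide each row by its total so that every row sums to 1.
props <- counts / rowSums(counts)

# Significance levels: alpha_u must lie in (0,1),
# and every alpha of interest must lie in (0, alpha_u].
alpha_u <- 0.1
alpha <- c(0.01, 0.05)
stopifnot(alpha_u > 0, alpha_u < 1, all(alpha > 0), all(alpha <= alpha_u))
```

The stopifnot() call stops with an error if any of the constraints is violated, which is a cheap check to run before a long SURF call.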

Details

SuRF consists of two steps. In the first step, LASSO variable selection is applied to a large number of subsamples of the data set, to provide a list of selected variables for each subsample. This list is used to rank the variables, based on the number of subsamples in which each variable is selected, so that variables that are selected in more subsamples are ranked more highly. In the second step, this list is used as a basis for forward selection, with variables higher on the list tried first. If a highly-ranked variable is not selected, later variables are tried, and after each variable is selected, the variables not yet selected (even previously non-selected variables) are tried in order of the ranking from Step 1. The decision whether to include a variable is based on a permutation test for the deviance statistic.
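
The two steps can be sketched in a few lines of R. This is a simplified illustration under stated assumptions, not the package's implementation: it uses cv.glmnet with lambda.min as the selection rule (an assumption), simulated binomial data, and a hypothetical helper perm_pvalue() that tests a single candidate variable given the variables already selected. The full forward selection loop over the ranking is omitted.

```r
library(glmnet)  # also loaded in the Examples section

set.seed(1)
n <- 100; p <- 20
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("V", 1:p)))
y <- rbinom(n, 1, plogis(1 + 2 * X[, 1] + 1.5 * X[, 2]))

## Step 1 (sketch): count how often LASSO selects each variable
B <- 10       # number of subsamples
prop <- 0.1   # proportion of observations left out of each subsample
counts <- setNames(integer(p), colnames(X))
for (b in seq_len(B)) {
  keep <- sample(n, size = round(n * (1 - prop)))
  cvfit <- cv.glmnet(X[keep, ], y[keep], family = "binomial",
                     nfolds = 5, alpha = 1)
  # Coefficients at lambda.min (assumed rule), intercept dropped
  beta <- as.matrix(coef(cvfit, s = "lambda.min"))[-1, 1]
  counts <- counts + (beta != 0)
}
# Variables selected in more subsamples are ranked higher
ranking <- names(sort(counts, decreasing = TRUE))

## Step 2 (sketch): permutation test on the deviance for one candidate
# Hypothetical helper: p-value for adding `candidate` to `selected`
perm_pvalue <- function(candidate, selected, C = 50) {
  base <- cbind(rep(1, n), X[, selected, drop = FALSE])
  dev_drop <- function(z) {
    glm.fit(base, y, family = binomial())$deviance -
      glm.fit(cbind(base, z), y, family = binomial())$deviance
  }
  observed <- dev_drop(X[, candidate])
  # Null distribution: permuting the candidate breaks its link with y
  null <- replicate(C, dev_drop(sample(X[, candidate])))
  (1 + sum(null >= observed)) / (C + 1)
}

# Test the second-ranked variable given that the first is in the model
pval <- perm_pvalue(ranking[2], selected = ranking[1])
```

In a forward selection built on this sketch, a candidate would be added when its p-value falls below the chosen alpha, and the remaining variables would then be retried in ranking order.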

Full details of the SuRF method are in the paper:

Lihui Liu, Hong Gu, Johan Van Limbergen, Toby Kenney (2020) SuRF: A new method for sparse variable selection, with application in microbiome data analysis Statistics in Medicine 40 897-919

doi: https://onlinelibrary.wiley.com/doi/10.1002/sim.8809

Value

Bmod: sub-sampling results

trdata: data frame including both X and y

ranklist: ranking table

modpath: variable selection path (along the alpha range)

selmod: model results at the selected alpha(s)

family: model family used

Examples

library(survival)
library(glmnet)
library(SuRF.vs)
N=100;p=200
nzc=floor(p/3)# p/3 is not a whole number, so round down explicitly
X=matrix(rnorm(N*p),N,p)
beta=rnorm(nzc)
fx=X[,seq(nzc)]%*%beta/3
hx=exp(fx)
ty=rexp(N,hx)
tcens=rbinom(n=N,prob=.3,size=1)# censoring indicator (1 or 0)
Xo=NULL
B=20
Alpha=1
fold=5
ncores=1
prop=0.1
C=3
alpha_u=0.2
alpha=seq(0.01,0.1,len=20)

#binomial model
XX=X[,1:2]
f=1+XX%*%c(2,1.5)
prob=exp(f)/(1+exp(f))# success probabilities; named prob so that p (number of predictors) is not overwritten
y=rbinom(100,1,prob)
weights=FALSE
family=stats::binomial(link="logit")


surf_binary=SURF(Xo=X,y=y,fold=5,weights=weights,B=10,C=5,family=family,alpha_u=0.1,alpha=alpha)


#linear regression
y=1+XX%*%c(0.1,0.2)
family=stats::gaussian(link="identity")

surf_lm=SURF(Xo=X,y=y,fold=5,weights=weights,B=10,C=5,family=family,alpha_u=0.1,alpha=alpha)


#Cox proportional hazards model
y=cbind(time=ty,status=1-tcens)
weights=rep(1,100)
rseed=floor(runif(20,1,100))
weights[rseed]=2
family=list(family="cox")

surf_cox=SURF(Xo=X,y=y,fold=5,weights=weights,B=10,C=5,family=family,alpha_u=alpha_u,alpha=alpha)


[Package SuRF.vs version 1.1.0.1 Index]