familyRank {FamilyRank} | R Documentation |
Feature Ranking with Family Rank
Description
Ranks features by incorporating graphical knowledge to weight empirical feature scores. This is the main function of the FamilyRank package.
Usage
familyRank(scores, graph, d = 0.5, n.rank = min(length(scores), 1000),
n.families = min(n.rank, 1000), tol = 0.001)
Arguments
scores |
A numeric vector of empirical feature scores. Higher scores should indicate a more predictive feature. |
graph |
A matrix or data frame representation of a graph object. |
d |
Damping factor |
n.rank |
Number of features to rank. |
n.families |
Number of families to grow. |
tol |
Tolerance |
Details
The scores
vector should be generated using an existing statistical method. Higher scores should correspond to more predictive features. It is up to the user to adjust accordingly. For example, if the user wishes to use p-values as the empirical score, the user should first adjust the p-values, perhaps by subtracting all p-values from 1, so that a higher value corresponds to a more predictive feature.
The graph
must be supplied in matrix form, where the first two columns represent graph nodes and the third column represents the edge weights between nodes. The graph nodes must be represented by the index of the feature that corresponds with the index in the score
vector. For example, a node corresponding to the first value of the score
vector should be indicated by a 1 in the graph
object, the second by a 2, etc. It is not necessary that every feature in the score
vector appear in the graph
. Missing pairwise interactions will be considered to have interaction scores of 0.
The damping factor, d
, represents the percentage of weight given to the interaction scores. The damping factor must be between 0 and 1. Higher values give more weight to the interaction score while lower values give more weight to the empirical score.
The value for n.rank
must be less than or equal to the number of scored features. The algorithm will include only the top n.rank
features in the ranking process (e.g. the n.rank
features with the highest values in the score
vector will be used to grow families). Higher values of n.rank
require longer compute times.
The value for n.families
must be less than or equal to the value of n.rank
. This is the number of families the algorithm will grow. If n.families
is less than n.rank
, the algorithm will initate families using the n.families
highest scoring features. Higher values of n.families
require longer compute times.
The tolerance variable, tol
, tells the algorithm when to stop growing a family. Features are added to families until the weighted score is less than the tolerance level, or until all features have been added.
Value
Returns a vector of the weighted feature scores.
Author(s)
Michelle Saul
References
ADD REFERENCE
Examples
# Toy Example
scores <- c(.6, .2, .9)
graph <- cbind(c(1,1), c(2,3), c(.4, .8))
familyRank(scores = scores, graph = graph, d = .5)
# Simulate data set
# 100 samples
# 1000 features
# Features 1 through 15 perfectly define response
# All other features are random noise
simulatedData <- createData(n.case = 50, n.control = 50, mean.upper=13, mean.lower=5,
sd.upper=1, sd.lower=1, n.features = 10000,
subtype1.feats = 1:5, subtype2.feats = 6:10,
subtype3.feats = 11:15)
x <- simulatedData$x
y <- simulatedData$y
graph <- simulatedData$graph
# Score simulated features using absolute difference in group means
scores <- apply(x, 2, function(col){
splt <- split(col, y)
group.means <- unlist(lapply(splt, mean))
score <- abs(diff(group.means))
names(score) <- NULL
return(score)
})
# Display top 15 features using emprical score
order(scores, decreasing = TRUE)[1:15]
# Rank scores using familyRank
scores.fr <- familyRank(scores = scores, graph = graph, d = .5)
# Display top 15 features using emprical scores with Family Rank
order(scores.fr, decreasing = TRUE)[1:15]