harsm {ohenery} | R Documentation |
Friendly interface to softmax regression under Harville model.
Description
A user friendly interface to the softmax regression under the Harville model.
Usage
harsm(formula, data, group = NULL, weights = NULL,
na.action = na.omit)
## S3 method for class 'harsm'
vcov(object, ...)
## S3 method for class 'harsm'
print(x, ...)
Arguments
formula |
an object of class |
data |
an optional data frame, list or environment (or object
coercible by |
group |
the string name of the group variable in the data, or a bare character with the group name. The group indices need not be integers, but that is more efficient. They need not be sorted. |
weights |
an optional vector of weights, or the string or bare name of the
weights in the |
na.action |
How to deal with missing values in the outcomes, groups, weights, etc. |
object |
an object of class |
... |
additional arguments to be passed to the low level regression fitting functions (see below). |
x |
logicals. If |
Details
Performs a softmax regression by groups, via Maximum Likelihood Estimation,
under the Harville model.
We fit \beta
where odds are \eta = x^{\top}\beta
for
independent variables x
.
The probability of taking first place is then \mu=c\exp{\eta}
,
where the c
is chosen so the \mu
sum to one.
Under the Harville model, conditional on the first place finisher
having been observed, the probability model for second
(and successive) places with the probabilities of the remaining
participants renormalized.
The print
method of the harsm
object includes
a display of the R-squared. This measures the improvement
in squared errors of the expected rank from the model
over the null model which posits that all odds are equal.
When the formula includes an offset, a ‘delta R-squared’
is also output. This is the improvement in predicted
ranks over the model based on the offset term.
Note that the expected ranks are only easy to produce
under the Harville model; under the Henery model,
the summary R-squared statistics are not produced.
Note that this computation uses weighted sums of squares,
as the weights contribute to the likelihood term.
However, the square sum computation does not take into
account the standard error of the rank, and so
unlike in linear regression, the softmax regression
does not always give positive R-squareds,
and the statistic is otherwise hard to interpret.
Value
An object of class harsm
, but also of maxLik
with the
fit.
Note
Since version 0.1.0 of this package, the normalization of weights used in this function have changed under the hood. This is to give correct inference in the case where zero weights are used to signify finishing places were not observed. If in doubt, please confirm inference by simulations, taking as example the simulations in the README.
This regression may give odd results when the outcomes are tied, imposing an arbitrary order on the tied outcomes. Moreover, no warning may be issued in this case. In future releases, ties may be dealt with differently, perhaps in analogy to how ties are treated in the Cox Proportional Hazards regression, using the methods of Breslow or Efron.
To avoid incorrect inference when only the top performers are recorded, and all others are effectively tied, one should use weighting. Set the weights to zero for participants who are tied non-winners, and one for the rest So for example, if you observe the Gold, Silver, and Bronze medal winners of an Olympic event that had a starting field of 12 participants, set weights to 1 for the medal winners, and 0 for the others. Note that the weights do not attach to the participants, they attach to the place they took.
Author(s)
Steven E. Pav shabbychef@gmail.com
See Also
Examples
nfeat <- 5
set.seed(1234)
g <- ceiling(seq(0.1,1000,by=0.1))
X <- matrix(rnorm(length(g) * nfeat),ncol=nfeat)
beta <- rnorm(nfeat)
eta <- X %*% beta
y <- rsm(eta,g)
# now the pretty frontend
data <- cbind(data.frame(outcome=y,race=g),as.data.frame(X))
fmla <- outcome ~ V1 + V2 + V3 + V4 + V5
fitm <- harsm(fmla,data,group=race)
eta0 <- rowMeans(X)
data <- cbind(data.frame(outcome=y,race=g,eta0=eta0),as.data.frame(X))
fmla <- outcome ~ offset(eta0) + V1 + V2 + V3 + V4 + V5
fitm <- harsm(fmla,data,group=race)
# with weights
data <- cbind(data.frame(outcome=y,race=g,eta0=eta0),as.data.frame(X))
data$wts <- runif(nrow(data),min=1,max=2)
fmla <- outcome ~ offset(eta0) + V1 + V2 + V3 + V4 + V5
fitm <- harsm(fmla,data,group=race,weights=wts)
# softmax on the Best Picture data
data(best_picture)
df <- best_picture
df$place <- ifelse(df$winner,1,2)
df$weight <- ifelse(df$winner,1,0)
fmla <- place ~ nominated_for_BestDirector + nominated_for_BestActor + Drama
harsm(fmla,data=df,group=year,weights=weight)
# test against logistic regression
if (require(dplyr)) {
nevent <- 10000
set.seed(1234)
adf <- data_frame(eventnum=floor(seq(1,nevent + 0.7,by=0.5))) %>%
mutate(x=rnorm(n()),
program_num=rep(c(1,2),nevent),
intercept=as.numeric(program_num==1),
eta=1.5 * x + 0.3 * intercept,
place=ohenery::rsm(eta,g=eventnum))
# Harville model
modh <- harsm(place ~ intercept + x,data=adf,group=eventnum)
# the collapsed data.frame for glm
ddf <- adf %>%
arrange(eventnum,program_num) %>%
group_by(eventnum) %>%
summarize(resu=as.numeric(first(place)==1),
delx=first(x) - last(x),
deli=first(intercept) - last(intercept)) %>%
ungroup()
# glm logistic fit
modg <- glm(resu ~ delx + 1,data=ddf,family=binomial(link='logit'))
all.equal(as.numeric(coef(modh)),as.numeric(coef(modg)),tolerance=1e-4)
all.equal(as.numeric(vcov(modh)),as.numeric(vcov(modg)),tolerance=1e-4)
}