chain {runexp}

R Documentation

Softball run expectancy using discrete Markov chains

Description

Uses discrete Markov chains to calculate softball run expectancy for a single (half) inning. Calculations depend on specified player probabilities (see Details) and a nine-player lineup. Optionally incorporates attempted steals and "fast" players who are able to stretch bases.

Usage

chain(lineup, stats, cycle = FALSE, max_at_bats = 18)

Arguments

`lineup`	either character vector of player names or numeric vector of player numbers. Must be of length 1 or 9. If lineup is of length 1, the single player will be "copied" nine times to form a complete lineup.
`stats`	data frame of player statistics (see details)
`cycle`	logical indicating whether to calculate run expectancy for each of the 9 possible lead-off batters. Preserves the order of the lineup. By default, only the first player in `lineup` is used as lead-off. Cycling is not relevant when the lineup is made up of a single player.
`max_at_bats`	maximum number of at bats (corresponding to matrix powers) used in calculation. Must be sufficiently large to achieve convergence. Convergence may be checked using `plot` with `type = 1`.

Details

The typical state space for softball involves 25 states defined by the base situation (runners on base) and number of outs. The standard base situations are: (1) bases empty, (2) runner on first, (3) runner on second, (4) runner on third, (5) runners on first and second, (6) runners on second and third, (7) runners on first and third, and (8) bases loaded. These 8 base situations are crossed with each of three out states (0 outs, 1 out, or 2 outs) to form 24 states. The 25th and final state is three outs, which marks the end of the inning.
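As a minimal sketch of this state space, the 25 states can be enumerated by crossing the eight base situations with the three out counts and appending the absorbing three-out state. The state labels used here are hypothetical and need not match the labels used inside the package.

```r
# Eight base situations, in the order listed above (labels are illustrative)
bases <- c("empty", "1", "2", "3", "12", "23", "13", "123")
outs <- 0:2

# Cross base situations with outs; the base situation varies fastest
grid <- expand.grid(bases = bases, outs = outs, stringsAsFactors = FALSE)

# 24 base/out states plus the absorbing end-of-inning state
states <- c(paste0(grid$bases, "/", grid$outs, " out"), "3 outs")

length(states)
head(states)
```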

We expand these 25 states to incorporate "fast" players. We make the following assumptions concerning fast players:

If a fast player is on first and the batter hits a single, the fast player will stretch to third base (leaving the batter on first).
If a fast player is on second and the batter hits a single, the fast player will stretch home (leaving the batter on first and a single run scored).
If a fast player is on first and the batter hits a double, the fast player will stretch home (leaving the batter on second base and a single run scored).
A typical player (not fast) who successfully steals a base will become a fast player for the remainder of that inning (meaning that a player who successfully steals second base will stretch home on a single).

Based on these assumptions, we add base situations that designate runners on first and second base as either typical runners (R) or fast runners (F). The full set of these base situations can be viewed using plot.chain with fast = TRUE. Aside from these fast player assumptions, runners advance bases as expected (a single advances each runner one base, a double advances each runner two bases, etc.).
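The effect of the fast-player assumptions on a single can be sketched with a small helper. This is only an illustration: the function name `single`, the encoding of bases as "" (empty), "R" (typical), or "F" (fast), and the rule that a fast runner on first stops at second when third base is taken by a typical runner from second are all assumptions, not the package's internals.

```r
# bases: character vector c(first, second, third) with "" = empty,
# "R" = typical runner, "F" = fast runner (hypothetical encoding)
single <- function(bases, batter = "R") {
  runs <- 0
  new <- c("", "", "")
  # A runner on third scores on a single
  if (bases[3] != "") runs <- runs + 1
  # Runner on second: a fast runner scores, a typical runner stops at third
  if (bases[2] == "F") {
    runs <- runs + 1
  } else if (bases[2] == "R") {
    new[3] <- "R"
  }
  # Runner on first: a fast runner stretches to third if it is open
  # (assumption); otherwise the runner advances one base
  if (bases[1] == "F" && new[3] == "") {
    new[3] <- "F"
  } else if (bases[1] != "") {
    new[2] <- bases[1]
  }
  # The batter takes first base
  new[1] <- batter
  list(bases = new, runs = runs)
}

single(c("F", "", ""))   # fast runner on first ends up on third
single(c("", "F", ""))   # fast runner on second scores
single(c("R", "R", ""))  # typical runners each move up one base
```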

Each at bat results in a change to the base situation and/or the number of outs. The outcomes of an at-bat are limited to:

batter out (O): base state does not change, outs increase by one
single (S): runners advance accordingly, score may increase, outs do not change
double (D): runners advance accordingly, score may increase, outs do not change
triple (TR): runners advance accordingly, score may increase, outs do not change
home run (HR): bases cleared, score increases accordingly, outs do not change
walk (W): runners advance accordingly, score may increase, outs do not change
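For typical runners only, the six outcomes above might be sketched as a single function that returns the new base situation and the runs scored. The name `advance` and the logical-vector encoding of the bases are hypothetical, and the walk is assumed to move only forced runners; the out count is left to the caller.

```r
# b: logical vector c(first, second, third) marking occupied bases
advance <- function(b, outcome) {
  if (outcome == "O") {
    # Batter out: bases unchanged (the caller adds one out)
    return(list(bases = b, runs = 0))
  }
  if (outcome == "W") {
    # Walk: only forced runners move (assumption)
    new <- c(TRUE, b[1] || b[2], b[3] || (b[1] && b[2]))
    return(list(bases = new, runs = as.numeric(all(b))))
  }
  # Hits: every runner advances as many bases as the batter
  n <- c(S = 1, D = 2, TR = 3, HR = 4)[[outcome]]
  pos <- c(which(b) + n, n)   # runners first, then the batter
  list(bases = 1:3 %in% pos, runs = sum(pos > 3))
}

advance(c(TRUE, FALSE, TRUE), "D")   # runners on first and third, double
advance(c(TRUE, TRUE, TRUE), "W")    # bases loaded, walk
```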

The transitions resulting from these outcomes are stored in "transition matrices." We use separate transition matrices for typical batters and fast batters (in order to keep fast runners designated separately). We additionally incorporate stolen bases. Steals are handled separately from the six at-bat outcomes because they do not change the batter. Following softball norms, we only entertain steals of second base. Steals are considered when there is a runner on first and no runner on second. In this situation, steal possibilities are limited to:

no steal attempt: base situation and outs do not change
successful steal: runner advances to second base
caught steal: runner is removed, outs increase by one

Steal possibilities are implemented in separate transition matrices. All transition matrices are stored as internal RData files.

The stats input must be a data frame containing player probabilities. It must contain columns "O", "S", "D", "TR", "HR", and "W" whose entries are probabilities summing to one, corresponding to the probability of a player's at-bat resulting in each outcome. The data frame must contain either a "NAME" or "NUMBER" column to identify players (these must correspond to the lineup). Extra rows for players not in the lineup will be ignored. This data frame may be generated from player statistics using prob_calc.
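A minimal sketch of a valid stats data frame, with a check that each player's outcome probabilities sum to one. The two players and their probabilities are invented for illustration and do not come from the package data.

```r
# Hypothetical probabilities for two players
stats <- data.frame(
  NAME = c("A", "B"),
  O  = c(0.60, 0.70),
  S  = c(0.20, 0.15),
  D  = c(0.08, 0.05),
  TR = c(0.02, 0.01),
  HR = c(0.04, 0.02),
  W  = c(0.06, 0.07)
)

# Each row of outcome probabilities should sum to one
outcomes <- c("O", "S", "D", "TR", "HR", "W")
row_totals <- rowSums(stats[, outcomes])
stopifnot(all(abs(row_totals - 1) < 1e-8))
```

A tolerance is used in the check because probabilities stored as floating-point numbers rarely sum to exactly one.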

The stats data frame may optionally include an "SBA" (stolen base attempt) column that provides the probability a given player will attempt a steal (provided they are on first base with no runner on second). If "SBA" is specified, the data frame must also include an "SB" (stolen base) column that provides the probability of a given player successfully stealing a base (conditional on them attempting a steal). If these probabilities are not specified, calculations will not involve any steals.
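Because "SB" is conditional on an attempt, the probabilities of the three steal possibilities follow directly from the two columns. The sketch below uses made-up values for one player; the variable names are illustrative only.

```r
sba <- 0.30   # hypothetical probability of attempting a steal
sb  <- 0.80   # hypothetical probability of success, given an attempt

steal_probs <- c(
  no_attempt = 1 - sba,         # base situation and outs do not change
  success    = sba * sb,        # runner advances to second base
  caught     = sba * (1 - sb)   # runner is removed, outs increase by one
)

steal_probs
sum(steal_probs)   # the three possibilities are exhaustive
```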

The stats data frame may also include a logical "FAST" column that indicates whether a player is fast. If this column is not specified, the "FAST" designation will be assigned based on each player's "SBA" probability. Generally, players who are more likely to attempt steals are the fast players.

The cycle parameter is a useful tool for evaluating an entire lineup. Over the course of a game, any of the nine players may lead off an inning. A weighted or unweighted average of these nine expected scores provides a more holistic representation of the lineup than the expected score based on a single lead-off batter.
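As a rough illustration of the two averages, suppose the nine expected scores and the number of innings each lineup slot led off are known. Both vectors below are invented; in practice the scores would come from the score element of a chain fitted with cycle = TRUE, and the weights from whatever lead-off frequencies are available.

```r
# Hypothetical expected scores, one per possible lead-off batter
score <- c(0.52, 0.49, 0.47, 0.44, 0.41, 0.39, 0.40, 0.45, 0.50)

# Hypothetical number of innings each lineup slot led off
lead_offs <- c(30, 12, 10, 11, 9, 10, 8, 9, 11)

mean(score)                       # unweighted average
weighted.mean(score, lead_offs)   # weighted by lead-off frequency
```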

Value

A list of the S3 class "chain" with the following elements:

lineup: copy of input lineup
stats: copy of input stats
score_full: list of matrices containing expected score by each base/out state and the number of at-bats (created by matrix powers). List index corresponds to lead-off batter. Rows of matrix correspond to base/out states. Each column represents an additional matrix power. Used to assess convergence of the chain (through convergence of each row).
score_state: matrix of expected score at the completion of an inning based on starting base/out state. Rows correspond to initial state; columns correspond to lead-off batter. Equal to the final column of score_full.
score: vector of expected score for an entire inning (starting from zero runners and zero outs). Index corresponds to lead-off batter. Equal to the first row of score_state.
time: computation time in seconds
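The relationship between score_full, score_state, and score can be illustrated with a deliberately tiny toy chain. This is not the package's model: it assumes every at-bat is either an out or a home run, so the only state is the number of outs, and the object names merely mirror the elements described above. Expected runs after n at-bats are accumulated as the sum of P^k r for k < n, where r holds the expected runs on the next at-bat.

```r
# Toy model (assumption): each at-bat is an out or a home run
p_out <- 0.7
labels <- paste(0:3, "outs")
P <- matrix(0, 4, 4, dimnames = list(labels, labels))
for (i in 1:3) {
  P[i, i] <- 1 - p_out       # home run: outs unchanged
  P[i, i + 1] <- p_out       # out: one more out
}
P[4, 4] <- 1                 # three outs is absorbing
r <- c(rep(1 - p_out, 3), 0) # expected runs on the next at-bat, by state

max_at_bats <- 18
score_full <- matrix(0, 4, max_at_bats, dimnames = list(labels, NULL))
Pk <- diag(4)                # P^0
total <- numeric(4)
for (n in seq_len(max_at_bats)) {
  total <- total + as.vector(Pk %*% r)   # add the P^(n-1) r term
  Pk <- Pk %*% P
  score_full[, n] <- total
}

score_state <- score_full[, max_at_bats]  # final column of score_full
score <- score_state[1]                   # start of inning: zero outs

# In this toy model the limit is known in closed form: 3 * (1 - p_out) / p_out
c(chain = unname(score), closed_form = 3 * (1 - p_out) / p_out)
```

Each row of score_full is nondecreasing and flattens as the number of at-bats grows, which is the convergence that max_at_bats needs to be large enough to reach.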

References

B. Bukiet, E. R. Harold, and J. L. Palacios, “A Markov Chain Approach to Baseball,” Operations Research 45, 14–23 (1997).

Examples

# Expected score for single batter (termed "offensive potential")
chain1 <- chain("B", wku_probs)
plot(chain1)

# Expected score without cycling
lineup <- wku_probs$name[1:9]
chain2 <- chain(lineup, wku_probs)
plot(chain2)

# Expected score with cycling
chain3 <- chain(lineup, wku_probs, cycle = TRUE)
plot(chain3, type = 1:3)


# GAME SITUATION COMPARISON OF CHAIN AND SIMULATOR

# Select lineup made up of the nine "starters"
lineup <- sample(wku_probs$name[1:9], 9)

# Average chain across lead-off batters
chain_avg <- mean(chain(lineup, wku_probs, cycle = TRUE)$score)

# Simulate full 7-inning game (increasing cores is recommended)
sim_score <- sim(lineup, wku_probs, inn = 7, reps = 50000, cores = 1)

# Split into bins in order to plot averages
sim_grouped <- split(sim_score$score, rep(1:100, times = 50000 / 100))

# Plot results
boxplot(sapply(sim_grouped, mean), ylab = 'Expected Score for Game')
points(1, sim_score$score_avg_game, pch = 16, cex = 2, col = 2)
points(1, chain_avg * 7, pch = 18, cex = 2, col = 3)

[Package runexp version 0.2.1 Index]