Bayesian Workflow of a Situational Random Effects Model

Author

Jim Albert

Published

July 13, 2024

Abstract
This paper describes in detail the Bayesian workflow for fitting a multilevel model to balls-in-play data where there is a bias term to account for a count advantage. All of the steps of Bayesian model fitting are described including specifying a model script, obtaining a matrix of simulated draws from the posterior, summarizing the draws to perform inference, and generating replicated predicted datasets to check the adequacy of the model. Different methods for fitting random effects models such as maximum likelihood, JAGS and Stan are compared.

1 Introduction

A general problem in baseball is understanding the sources of variation of outcomes of balls put into play. This paper describes the application of a Bayesian approach for modeling balls in play with a focus on predicting outcomes in future seasons.

The first step of this Bayesian modeling is determining a suitable measure of the outcome variable. One could focus on a wOBA measure where each outcome on a ball in play is assigned a suitable weight. But the result of a ball in play depends not only on the quality of the ball hit off the bat, measured by the Statcast launch variables, but also on the luck component where hard-hit balls can be outs and softly-hit balls land for hits. A better measure of batter performance is the expected wOBA which provides an estimate of the wOBA given values of the launch angle and exit velocity.

The next step in the construction of a Bayesian regression model is determining the important inputs that explain much of the variation in the outcome variable. What happens on a ball put in play depends primarily on the batter rather than the pitcher, so the model should include batter ability effects. Also, since the outcomes of balls in play can depend on particular situations, it is desirable to include terms in the model that allow for the situation. One of the most important situations determining the outcome on a ball in play is the count, and the regression model will include a term indicating whether the count is neutral (like a 1-1 count) or favors the pitcher (like a 0-2 count).

Here are some goals of this modeling exercise:

  1. Provide good estimates of the batter abilities to be productive on balls put in play. These estimates will shrink or adjust the observed averages towards the overall mean and are better than the observed averages in predicting future performance.

  2. Estimate the situational effect. In this case, we want to measure the advantage that batters have when hitting in a neutral count relative to a pitcher’s count.

  3. Provide predictions of future performance. In addition, by simulating replications of the current data from the fitted model, one can check if the model is a suitable fit to the observed data.

Section 2 describes the Statcast data that is used in this study and Section 3 illustrates the use of a mean-difference plot to see how batters take advantage of a favorable count. Section 4 describes the multilevel model for batter abilities with a bias term accounting for the count. Section 5 provides an overview of the use of one particular MCMC simulation algorithm (JAGS) to simulate from the joint posterior distribution. The output of JAGS is a matrix of simulated draws from the posterior and Sections 6 and 7 illustrate the use of this simulated parameter matrix for inference and prediction. In the concluding Section 8, we talk generally about situational effects in baseball and give an overview of alternative methods for fitting these multilevel models. Although the focus of the study is on the use of JAGS, the Appendix gives details on alternative methods for fitting the model using maximum likelihood and Stan.

All of the work in this paper is reproducible. The Quarto file at https://gist.github.com/bayesball/5877d9c2239f48189b0ed4c1dfe3167a reads the Statcast data from a Github site, fits the model using JAGS, and implements all of the inferential and prediction calculations. R code for reproducing all of the figures in this paper is provided in this file.

2 Data

Statcast data is collected for all games in the 2023 season before July 15. We focus on balls in play and the response variable is the expected wOBA which is the estimate of the wOBA based on the launch angle and exit velocity measurements. We also collect the identity of the batter and the balls and strikes count. From the hitter’s perspective, the count is categorized as “ahead” (2-0, 3-0, 3-1), “neutral” (0-0, 1-0, 1-1, 2-1, 3-2), or “behind” (0-1, 0-2, 1-2, 2-2). Since 94% of the balls in play occur on neutral or behind counts, we remove the ahead-in-count balls in play from the data and focus on the batter’s advantage in a neutral count compared with a behind count.
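As a concrete sketch, the filtering and count coding might be done with dplyr along the following lines. The raw data frame sc and the Statcast column names (type, balls, strikes, batter, estimated_woba_using_speedangle) are assumptions that may need adjusting for a particular download; after restricting to batters with enough balls in play, a data frame like this plays the role of the sc_regular_150a data frame used later.

# A sketch of the data preparation, assuming a Statcast data frame sc
# with columns type, balls, strikes, batter, and
# estimated_woba_using_speedangle (column names may differ).
library(dplyr)

sc_bip <- sc %>%
  filter(type == "X") %>%                # keep balls put into play
  mutate(count = paste(balls, strikes, sep = "-"),
         count_type = case_when(
           count %in% c("2-0", "3-0", "3-1") ~ "ahead",
           count %in% c("0-0", "1-0", "1-1", "2-1", "3-2") ~ "neutral",
           TRUE ~ "behind"),
         root_woba = sqrt(estimated_woba_using_speedangle),
         effect = ifelse(count_type == "neutral", 1/2, -1/2)) %>%
  filter(count_type != "ahead")          # drop ahead-in-count balls in play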

In this paper, we focus on the square root of the expected wOBA \[ rwOBA = \sqrt{E(wOBA)} \]

The root reexpression is taken primarily for modeling considerations. The top panel of Figure 1 displays a histogram of the expected wOBA variable. Note that this variable is heavily right-skewed. The bottom panel displays a histogram of the square root of \(E(wOBA)\). Note that this transformed variable is more bell-shaped which is consistent with the modeling assumption (to be described shortly) that the sampling errors are normally distributed.

Figure 1: Histograms of the expected wOBA and square root of the expected wOBA variable for 2023 Statcast data.
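A minimal sketch of how histograms like those in Figure 1 could be drawn from the sc_bip data frame assumed above (base graphics are used here; the paper's own figures may use a different plotting system).

# Histograms of the expected wOBA and its square root (a sketch)
par(mfrow = c(2, 1))
hist(sc_bip$estimated_woba_using_speedangle, breaks = 40,
     main = "", xlab = "Expected wOBA")
hist(sc_bip$root_woba, breaks = 40,
     main = "", xlab = "Square root of expected wOBA")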

3 A Mean-Difference Plot

We collect data for all batters who had at least 100 balls in play before July 15. Since we are interested in understanding the batter advantage due to the count, we compute for each player the mean root wOBA on a neutral count and a mean root wOBA on a behind count:

\[ \bar Y_n = \frac{\sum rwOBA_{neutral}}{BIP_{neutral}}, \quad \bar Y_b = \frac{\sum rwOBA_{behind}}{BIP_{behind}} \]

From this paired data, we construct a Tukey mean-difference plot (Kozak and Wnuk, 2014) in Figure 2 which graphs the average \(M = (\bar {Y_{n}} + \bar {Y_{b}}) / 2\) against the difference \(D = \bar {Y_{n}} - \bar {Y_{b}}\). The average \(M\) measures the general ability of the batter to produce on a ball in play, and the difference \(D\) measures the batter's advantage in hitting on a neutral count compared to a behind count. Note that there is a wide scatter in the (horizontal) values of \(M\), indicating that there are substantial differences among the players in the quality of the balls put into play. A majority of the points fall above the horizontal line at 0, indicating that players tend to take advantage of the neutral count with a positive difference \(D\). But there is substantial variation in the values of \(D\), ranging from \(-0.1\) to \(0.2\).

Figure 2: A mean-difference plot for seeing the count effect on the expected wOBA on balls in play.
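The paired means and the mean-difference plot of Figure 2 might be computed as follows; this is a sketch that assumes the sc_bip data frame and the batter column from the earlier data preparation.

library(dplyr)
library(tidyr)
library(ggplot2)

md_data <- sc_bip %>%
  group_by(batter, count_type) %>%
  summarize(mean_rwoba = mean(root_woba), n = n(), .groups = "drop") %>%
  group_by(batter) %>%
  filter(sum(n) >= 100) %>%               # batters with at least 100 BIP
  ungroup() %>%
  select(batter, count_type, mean_rwoba) %>%
  pivot_wider(names_from = count_type, values_from = mean_rwoba) %>%
  mutate(M = (neutral + behind) / 2,       # average of the two means
         D = neutral - behind)             # neutral-count advantage

ggplot(md_data, aes(M, D)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "red")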

4 A Multilevel Model for Situational Data

Let \(y_i\) denote the root expected wOBA on the \(i\)th ball in play in our dataset. At the sampling level, we assume \(y_i\) is normally distributed with mean \(\mu_i\) and standard deviation \(\sigma\).

The prior distribution for the means {\(\mu_i\)} has a multilevel structure with two stages. We assume that there exists a normally distributed talent curve representing the population of hitter abilities with mean \(\theta\) and standard deviation \(\tau\). As displayed in Figure 3, the first stage of the prior assumes the \(J\) batters have hitting talents \(\alpha_1, ..., \alpha_J\) that are randomly drawn from this talent curve.

Figure 3: Display showing the batter abilities drawn from a normally distributed talent curve.

The count situation (neutral compared to behind) adds a constant \(\beta\) to the mean of the talent curve. The talent distribution of the batters will be a \(N(\theta + \beta / 2, \tau)\) distribution under a neutral count and a \(N(\theta - \beta / 2, \tau)\) distribution under a behind count. Figure 4 displays two hypothetical talent curves for hitting under this bias model. It can be seen from Figure 4 that the talent curve under a behind count is shifted \(\beta\) units to the right to obtain the talent curve under a neutral count.

Figure 4: Talent curves of hitters under a behind count and under a neutral count in the bias multilevel model.

In summary, the mean \(\mu_i\) of the response on the \(i\)th ball in play has the structure \[ \mu_i = \alpha_{p(i)} + \beta \cdot EFF_i , \] where \(p(i)\) is the player number (a value between 1 and \(J\)) and \(EFF_i\) is the count input equal to \(1/2\) under a neutral count and equal to \(-1/2\) under a behind count.

At the second stage of the prior, the prior parameters \(\theta\), \(\beta\), \(\tau\) and the sampling standard deviation \(\sigma\) are assigned weakly informative distributions. The exact specification of these second-stage priors will be given in the JAGS model description.

5 Model Fitting

An approximate fit of this multilevel model is provided by use of the lmer() function in the lme4 package (Bates et al., 2015). Details of fitting this model in R are described in the appendix. This function provides point estimates of the player abilities \(\{\alpha_j\}\) and the standard deviations \(\sigma\) and \(\tau\).

It is preferable to fit the full Bayesian model to get standard error estimates for all parameters. This is accomplished using an MCMC algorithm to draw simulated samples from the joint posterior distribution of all unknown parameters. The JAGS software (Depaoli et al., 2016) implements a Gibbs sampling/Metropolis algorithm and the Stan software (Stan Development Team, 2024) implements a Hamiltonian Monte Carlo algorithm. We illustrate JAGS for this problem since it provides a relatively quick and accurate fit for this particular multilevel model. (Details on using Stan via the brms package (Bürkner, 2017) are described in the appendix.)

The JAGS software needs to be initially installed on one's computer from https://sourceforge.net/projects/mcmc-jags/files/. The runjags package (Denwood, 2016) provides an R interface to JAGS.

There are three steps in running JAGS:

  1. Write a script to define the Bayesian model.

  2. Collect the data in a list.

  3. Fit the Bayesian model by running the MCMC algorithm using the run.jags() function from the runjags package with the model script and data as inputs.

The following JAGS script defines this multilevel model with the count as a bias term. The normal sampling and prior distributions are indicated using the dnorm syntax. One unusual aspect of JAGS is that the parameters of the dnorm function are the mean and the precision, where the precision is defined as the reciprocal of the square of the standard deviation. In assignment statements in the script, the precisions are defined in terms of the standard deviation parameters sigma and tau. Note that this script specifies priors for the unknown parameters at the last stage. The mean theta and the bias effect beta are assigned normal priors with means 0 and precisions 0.001 (corresponding to standard deviations of 31.6), reflecting vague prior knowledge about the locations of these parameters. The standard deviation parameters sigma and tau are assigned standard t distributions (the dt function) with one degree of freedom that are truncated below at zero.

modelString <-"
model {
## sampling
for (i in 1:N){
y[i] ~ dnorm(mu[i], invsigma2)
mu[i] <- beta * eff[i] + alpha_j[player[i]]
}
## priors
for (j in 1:J){
   alpha_j[j] ~ dnorm(theta, invtau2)
}
theta ~ dnorm(0, 0.001)
beta ~ dnorm(0, 0.001)
invsigma2 <- 1 / (sigma * sigma)
sigma ~ dt(0, 1, 1) T(0, )
invtau2 <- 1 / (tau * tau)
tau ~ dt(0, 1, 1) T(0, )
}
"

The next step is to provide a list of data needed to run the simulation. In the list defined below, y is a vector of response values, eff are the corresponding count values (either -1/2 or 1/2), player is an integer indicating the batter number, and N and J are respectively the number of balls in play and the number of batters.

jags_data <- list(y = sc_regular_150a$root_woba,
                  eff = sc_regular_150a$effect,
                  player = sc_regular_150a$player,
                  N = dim(sc_regular_150a)[1],
                  J = max(sc_regular_150a$player))

The MCMC fitting is implemented using the run.jags() function. The key inputs are modelString, which defines the model, and data, which provides the data used in the model script. The algorithm is run in an adaptive stage of 1000 iterations to fine-tune the MCMC algorithm, 2000 iterations are run in a burn-in phase, and the simulated parameters are collected in the final 5000 iterations. The monitor argument indicates that we will collect simulated draws of \(\theta\), \(\beta\), \(\{\alpha_j\}\), \(\sigma\), and \(\tau\).

posterior <- run.jags(modelString,
                      n.chains = 1,
                      data = jags_data,
                      monitor = c("theta", "beta", "alpha_j",
                                  "sigma", "tau"),
                      adapt = 1000,
                      burnin = 2000,
                      sample = 5000)

The final step is to put all of the simulated draws into a single data frame post2 by use of the as.mcmc() and as.data.frame() functions. The output data frame post2 has 5000 rows and \(J + 4\) columns, corresponding respectively to simulated draws and parameters.

posterior %>% as.mcmc() %>% 
  as.data.frame() -> post2

6 Inference

From a Bayesian perspective, inferences about the model parameters are made by making suitable summaries of the posterior distribution. Since the posterior is obtained by the use of simulation, inference is performed by taking summaries of the posterior data frame of simulated draws.

6.1 Random Effects Distribution

In the model, the batter hitting abilities \(\alpha_1, ..., \alpha_J\) are assumed to have a \(N(\theta, \tau)\) distribution. Figure 5 displays the estimated random effects distribution. Since there is uncertainty in both the mean \(\theta\) and standard deviation \(\tau\), the figure displays normal curves for a sample of 20 draws from the posterior distribution of \((\theta, \tau)\).

Figure 5: Estimated distribution of the player hitting abilities using 20 draws from the posterior distribution.
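A sketch of how Figure 5 might be produced from the data frame post2 of simulated draws; the choice of 20 draws of \((\theta, \tau)\) and the grid of ability values are illustrative.

library(ggplot2)

set.seed(123)
draws <- post2[sample(nrow(post2), 20), c("theta", "tau")]

# Evaluate each sampled N(theta, tau) talent curve on a grid of abilities
curves <- do.call(rbind, lapply(seq_len(nrow(draws)), function(i) {
  x <- seq(0.3, 0.9, length.out = 200)
  data.frame(draw = i, x = x,
             density = dnorm(x, draws$theta[i], draws$tau[i]))
}))

ggplot(curves, aes(x, density, group = draw)) +
  geom_line(alpha = 0.4) +
  labs(x = "Hitting ability (root wOBA scale)", y = "Density")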

6.2 Bias Coefficient

We ask if hitters generally take advantage of the neutral count situation compared to the behind count. This question is addressed by inspecting the posterior of the bias coefficient \(\beta\). Figure 6 displays a density estimate of the simulated posterior draws of \(\beta\) together with the limits of a 90% interval estimate. On the root wOBA scale, this interval estimate is (0.021, 0.032). Since this interval doesn't cover zero, there is a significant count effect on the outcomes of balls in play.

Figure 6: Density estimate of the posterior draws of the bias coefficient.
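The interval estimate and the density in Figure 6 can be obtained directly from the simulated draws of \(\beta\) in post2; here is a brief sketch.

# 90% interval estimate and density of the bias coefficient beta
quantile(post2$beta, c(0.05, 0.95))

library(ggplot2)
ggplot(data.frame(beta = post2$beta), aes(beta)) +
  geom_density() +
  geom_vline(xintercept = quantile(post2$beta, c(0.05, 0.95)),
             linetype = "dashed")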

6.3 Batter Abilities

Estimates of the batter abilities are found by taking posterior means of the \(\{\alpha_j\}\). Figure 7 compares the posterior means with the collection of batter sample means of the root wOBA values using parallel jittered dotplots. Note that the variability of the posterior means is substantially smaller than the variability of the individual ability estimates.

Figure 7: Parallel jittered dotplots of the observed sample means and the posterior means of the ability parameters.
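A sketch of how the two sets of estimates in Figure 7 might be computed, assuming the ability columns of post2 follow the runjags naming convention alpha_j[1], ..., alpha_j[J].

# Posterior means of the ability parameters
alpha_cols <- grep("^alpha_j", names(post2), value = TRUE)
posterior_means <- colMeans(post2[, alpha_cols])

# Observed sample means of root wOBA for each batter
sample_means <- tapply(sc_regular_150a$root_woba,
                       sc_regular_150a$player, mean)

# The multilevel estimates show much less spread than the sample means
c(sd(sample_means), sd(posterior_means))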

These Bayesian multilevel model estimates shrink or adjust the individual sample means towards the average. This can be seen in Figure 8 which plots the individual and multilevel ability estimates for a sample of players. These multilevel estimates are superior to the sample means in predicting future expected wOBA performance.

Figure 8: Shrinkage plot for a sample of players showing that the multilevel estimates shrink the individual mean estimates towards an average value.

7 Prediction

7.1 Posterior Predictive Distribution

One attractive aspect of the Bayesian viewpoint is that there is a straightforward way of predicting future data through the posterior predictive distribution. Letting \(y\) and \(y^*\) denote observed and future data, the objective is to simulate from the predictive density \(f(y^* | y)\) which is represented as the mixture \[ f(y^* | y) = \int f(y^* | \theta) g(\theta | y) d\theta, \] where \(g(\theta | y)\) represents the posterior density of parameters \(\theta\) given the observed data \(y\), and \(f(y^* | \theta)\) is the sampling density of future data \(y^*\) given the parameters \(\theta\). This representation motivates how one simulates a draw \(y^{*S}\) from the posterior predictive density.

  1. One simulates a draw \(\theta^S\) from the posterior density \(g(\theta | y)\).

  2. One simulates a draw \(y^{*S}\) from the sampling density \(f(y^* | \theta^S)\).

By repeating steps 1 and 2, one obtains a sample of values from the predictive distribution.

7.2 Predictive Model Checking

To illustrate this method, we demonstrate the use of posterior predictive model checking for this multilevel modeling example. We ask: do simulations of replicated data from the fitted model resemble the observed data? By replicated data, we mean datasets that resemble the observed data with respect to batter sample sizes and values of the count variable.

We simulate from the posterior predictive distribution in two steps. First we simulate a draw of the means \[ \{\mu_i = \alpha_{p(i)} + \beta \cdot EFF_i\} \] by taking draws of \(\alpha\) and \(\beta\) from the posterior distribution. Since \(y_i\) is \(N(\mu_i, \sigma)\), we next simulate values of \(\{y_i\}\) from normal distributions using the simulated draws of \(\mu_i\) and \(\sigma\) from the posterior.
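A sketch of this two-step simulation for a single replicated dataset, using one row of post2 and the player and effect columns of sc_regular_150a; the helper name one_rep() is ours.

# Simulate one replicated dataset from the posterior predictive
# distribution using a single posterior draw.
one_rep <- function(draw) {
  alpha <- unlist(draw[grep("^alpha_j", names(draw))])   # ability draws
  mu <- alpha[sc_regular_150a$player] +
    draw[["beta"]] * sc_regular_150a$effect              # means mu_i
  rnorm(length(mu), mean = mu, sd = draw[["sigma"]])     # replicated y
}

set.seed(123)
y_rep <- one_rep(post2[sample(nrow(post2), 1), ])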

A Tukey mean-difference plot was initially used to explore the batter root wOBA values and the advantages of the favorable-count situation. To see if posterior predictive simulations resemble the observed data, we use this mean-difference plot to compare the simulated and observed datasets. One graph of this type is displayed in Figure 9 for a single simulated replicated dataset. The two datasets look similar both in the variability of the means and the variability of the difference values.

Figure 9: Mean-difference plots of the observed and one simulated dataset from the posterior predictive distribution.

In Figure 9, we compared only one simulated sample from the posterior predictive distribution with the observed data and noted that the two datasets had similar variability. To see if this statement is true in general, we simulate many samples from the posterior predictive distribution. For each sample, we compute two “test” measures:

  1. \(S_M =\) standard deviation of the averages \(\{(\bar{Y}_{n} + \bar{Y}_{b}) / 2\}\).

  2. \(S_D =\) standard deviation of the differences \(\{\bar{Y}_{n} - \bar{Y}_{b}\}\).

We looked at 500 simulations from the posterior predictive distribution; Figure 10 displays the values of \((S_M, S_D)\) from these simulations. The red dot represents the values of this ordered pair from the observed data. This dot is in the center of the posterior predictive distribution. From a variability perspective, we conclude that simulations from the model resemble the observed data from the 2023 season. One can perform additional posterior predictive checks using alternative test statistics.

Figure 10: Scatterplot of standard deviations from draws from the posterior predictive distribution. The red dot indicates observed values of the standard deviations.
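A sketch of the full predictive check, reusing the one_rep() helper sketched in Section 7.2; the summarizing function and its name are ours.

library(dplyr)
library(tidyr)

# Compute (S_M, S_D) for a vector of responses on the observed design
compute_SM_SD <- function(yvals) {
  sc_regular_150a %>%
    mutate(y = yvals) %>%
    group_by(player, effect) %>%
    summarize(ybar = mean(y), .groups = "drop") %>%
    pivot_wider(names_from = effect, values_from = ybar) %>%
    summarize(S_M = sd((`0.5` + `-0.5`) / 2),
              S_D = sd(`0.5` - `-0.5`))
}

set.seed(123)
reps <- post2[sample(nrow(post2), 500), ]
check <- do.call(rbind, lapply(seq_len(nrow(reps)),
                               function(i) compute_SM_SD(one_rep(reps[i, ]))))
observed <- compute_SM_SD(sc_regular_150a$root_woba)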

8 Closing Comments

Measures of performance are currently computed under a variety of situations such as park (home versus away), time of day (day versus night), count, the pitcher's arm (same or opposite arm as the hitter), and pressure (runners in scoring position or not). Chapter 4 of Albert and Bennett (2003) provides a general discussion of different statistical models for situational hitting data. Generally it is the author's belief that situations tend to act as biases. That is, there is a significant overall effect due to the situation; for example, players tend to hit better during home games than during away games. But the size of the situational effect is the same for all hitters. In other words, there is little evidence to support the belief that some hitters have extra ability to take advantage of the situation. This article demonstrates the existence of a bias effect for hitting in the count situation. Although we observe wide variation in count effects among hitters, there is little evidence that particular hitters are especially good at taking advantage of a favorable count.

In this paper, we have focused on variation on balls in play by the hitters. It is natural to ask why we didn't explore the role of the pitchers in impacting the outcome of the ball in play. A famous article by McCracken (2001) indicates that the pitcher has relatively little control over the result of a ball put into play. This result suggests that we should be more focused on variation between hitters than variation between pitchers. If one were unaware of McCracken's work, one could implement a non-nested multilevel model where the mean is a function of both the hitter's ability and the pitcher's ability. Both the collection of pitcher talents and the collection of hitter talents would be assigned normal densities. The interested reader is welcome to try this random effects model with both hitter and pitcher effects, but one would find that the variation between pitchers is small relative to the variation between hitters, which motivates the analysis here that considers only the hitter variation.

Albert (2020) describes the use of an alternative multilevel model for exploring count hitting effects, where both the intercept and slope depend on the hitter. One takeaway from this work is that the posterior shrinks the individual-regression slopes strongly towards a common slope. In other words, the variation between the true regression slopes appears to be small which motivates the model used in this paper when we assume the regression slope is the same for all players.

The appendix gives a description of two alternative methods for fitting this multilevel model with a bias component. Maximum likelihood (ML) methods for fitting multilevel models, as reflected in the lmer() function in the lme4 package, are attractive in that they are easy to apply and produce estimates of the ability parameters and the bias coefficient. But these ML methods are approximate and only provide point estimates of the abilities and the standard deviation parameters.

Although Bayesian multilevel modeling methods were available when the first edition of Curve Ball was written in 2001, simulation-based methods for fitting these multilevel models were in their infancy. The JAGS software represents the first generation of MCMC simulation algorithms, based on Gibbs sampling and Metropolis sampling. JAGS provides relatively speedy and accurate estimates for this exercise. One of the purposes of this paper was to illustrate the use of JAGS for both Bayesian estimation and prediction purposes.

The latest generation of simulation algorithms represented by Stan uses Hamiltonian Monte Carlo and No-U-Turn sampling to simulate from high-dimensional multilevel distributions. For this exercise, the brms package can be used to simulate the multilevel model with a bias count effect using Stan. This particular interface is attractive as the fitting function resembles the function used in the lme4 package. The Stan fitting takes more time as the code is initially translated to C++ and the MCMC iterations are slower. For this exercise, the JAGS and Stan outputs were similar and gave similar summaries for inferential and predictive results.

9 References

  • Albert, J. (2020). Bayesian Tutorial 4: Multilevel Modeling of Player Count Effects. https://baseballwithr.wordpress.com/2020/08/24/bayesian-tutorial-4-multilevel-modeling-of-player-count-effects/

  • Albert, J. and Bennett, J. (2003), Curve Ball, Copernicus Books.

  • Bates, D., Mächler, M., Bolker, B., and Walker, S. (2015). Fitting Linear Mixed-Effects Models Using lme4. Journal of Statistical Software, 67(1), 1-48.

  • Bürkner, P. (2017). brms: An R Package for Bayesian Multilevel Models Using Stan. Journal of Statistical Software, 80(1), 1-28.

  • Denwood, M. J. (2016). runjags: An R package providing interface utilities, model templates, parallel computing methods and additional distributions for MCMC models in JAGS. Journal of Statistical Software, 71, 1-25.

  • Depaoli, S., Clifton, J. P., & Cobb, P. R. (2016). Just Another Gibbs Sampler (JAGS): Flexible software for MCMC implementation. Journal of Educational and Behavioral Statistics, 41(6), 628-649.

  • JAGS, 2023. Version 4.3.2. https://mcmc-jags.sourceforge.io/

  • Kozak, M., & Wnuk, A. (2014). Including the Tukey mean‐difference (Bland–Altman) plot in a statistics course. Teaching Statistics, 36(3), 83-87.

  • McCracken, V. (2001), Pitching and Defense: How Much Control Do Hurlers Have? Baseball Prospectus. https://www.baseballprospectus.com/news/article/878/pitching-and-defense-how-much-control-do-hurlers-have/

  • Stan Development Team. 2024. Stan Modeling Language Users Guide and Reference Manual, Version 2.3.5. https://mc-stan.org

10 Appendix - Alternative Fitting Methods

This section describes the R implementation of alternative fitting methods of this multilevel model with a bias component for this 2023 Statcast dataset.

10.1 Maximum Likelihood Fit with the lme4 Package

The lmer() function in the lme4 package provides a quick and convenient way of fitting this multilevel model. In the model formula, the “(1 | player)” syntax indicates that the variable player is to be treated as a random effect, and the “effect” variable is a fixed effect.

require(lme4)
lmer_fit <- lmer(root_woba ~ effect + (1 | player),
            data = sc_regular_150a)

From the returned object lmer_fit, one obtains point estimates of the random effects \(\{\alpha_j\}\). In addition, one obtains point estimates of the sampling standard deviation \(\sigma\) and the between-players standard deviation \(\tau\). An estimate and the corresponding standard error are available for the fixed count effect \(\beta\).
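A brief sketch of how these quantities can be extracted from lmer_fit with standard lme4 accessor functions.

fixef(lmer_fit)                  # overall mean and fixed count effect beta
ranef(lmer_fit)$player           # player deviations from the overall mean
VarCorr(lmer_fit)                # estimates of tau (player) and sigma (residual)
summary(lmer_fit)$coefficients   # beta estimate with its standard error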

Comparing the lmer() and JAGS output, one gets similar point estimates for the random effects and the standard deviations. But one gets different results in the posterior predictive checking. In the predictive simulation based on the lmer() fit, one has only the point estimates of the abilities \(\{\alpha_j\}\). In the implementation of the model checking strategy of Section 7.2, one finds that the observed value of \(S_M\) exceeds what would be predicted from the lmer() fit. The conclusion is that, for predictive work, it is preferable to implement a full Bayesian analysis using JAGS.

10.2 Bayesian Fit with Stan using the brms Package

Stan provides a popular method of fitting a Bayesian multilevel model by MCMC. As in JAGS, one can write a Stan script of the Bayesian model and run this script using the R package rstan. The brms package provides a formula interface for many regression models including the multilevel model described in this paper.

The brm() function fits this Bayesian model using the same model formula “root_woba ~ effect + (1 | player)” as used in the lmer() function. In the background, brm() is assigning weakly informative priors on the parameters \(\theta\), \(\beta\), \(\sigma\) and \(\tau\). As in JAGS, one can obtain a matrix of simulated draws from the joint posterior. By default, brm() will run four chains, each collecting 1000 iterations. When the as_draws() function is run, the output variable post2 is a list of four elements where each element is a matrix of simulated draws for a particular chain.

require(brms)
stan_fit <- brm(root_woba ~ effect + (1 | player),
                data = sc_regular_150a)
post2 <- as_draws(stan_fit)
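To see which weakly informative priors brm() chose in the background, one can query the fitted object; prior_summary() and summary() are part of the brms interface.

prior_summary(stan_fit)   # default priors assigned by brm()
summary(stan_fit)         # posterior summaries of the fitted parameters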

For this particular problem, Stan gave similar results to the JAGS fit. The posterior means and posterior standard deviations of the abilities, standard deviations and count effect parameter were similar, and each algorithm gave similar predictions on a replicated dataset.

10.3 Computing Times

The three fitting algorithms (maximum likelihood, JAGS, and Stan) ran at different speeds. The following table displays the times (in minutes) to run our problem on a 1.1 GHz Intel i5 Macintosh laptop.

Fitting Method        R Package   Function      Time (minutes)
Maximum Likelihood    lme4        lmer()        0.004
JAGS (MCMC)           runjags     run.jags()    2.54
Stan (MCMC)           brms        brm()         20.52

The maximum likelihood fit using lmer() is remarkably fast, but it is essentially an approximation to the joint posterior fit since it doesn't provide standard errors for all estimated parameters. JAGS (function run.jags()) and Stan (function brm()) apply different MCMC algorithms with a similar number of iterations, but Stan is approximately eight times slower than JAGS. It should be mentioned that Stan was specifically developed to handle large multilevel models (larger than JAGS can handle) and it uses more information about the posterior distribution in its simulation algorithm.

11 Appendix - Grouping Data for Computational Speed

One reason why the computational speeds for implementing the MCMC fits are slow is that we are working with a large dataset with over 36,000 rows. Since the model inputs are only the player id and the count effect, one can reduce the size of the dataset by the use of sufficient statistics on grouped data.

For a particular player and particular count effect, one observes the response values \(y_1, ..., y_m\) that are a sample from a \(N(\mu, \sigma)\) distribution. Sufficient statistics for this sample are the sample mean \(\bar y\) and sum of squares about the mean \(S\) given by \[ \bar y = \frac{\sum_{j=1}^m y_j}{m}, \, \, S = \sum_{j=1}^m (y_j - \bar y)^2. \]

Moreover, \(\bar y\) and \(S\) are independent, where \[ \bar y \sim N\left(\mu, \frac{\sigma}{\sqrt{m}}\right), \quad S \sim Gamma\left(\frac{m-1}{2}, \frac{1}{2 \sigma^2}\right) \] using the shape and rate parameterization of the Gamma distribution.
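A sketch of how these sufficient statistics might be computed with dplyr from the ungrouped data frame sc_regular_150a.

library(dplyr)

grouped <- sc_regular_150a %>%
  group_by(player, effect) %>%
  summarize(ybar = mean(root_woba),                       # sample mean
            S = sum((root_woba - mean(root_woba))^2),     # sum of squares
            ny = n(),                                     # sample size m
            .groups = "drop")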

For each player and count value, we compute values of \(\bar y\) and \(S\). By using these sufficient statistics, the dataset is reduced from 36,000 rows to 498 rows. We revise the JAGS script to accommodate the use of these sufficient statistics in the sampling component of the model.

modelString <-"
model {
## sampling
for (i in 1:N){
y[i] ~ dnorm(mu[i], ny[i] * invsigma2)
s[i] ~ dgamma((ny[i] - 1) / 2, invsigma2 / 2) 
mu[i] <- beta * eff[i] + alpha_j[player[i]]
}
## priors
for (j in 1:J){
   alpha_j[j] ~ dnorm(theta, invtau2)
}
theta ~ dnorm(0, 0.001)
beta ~ dnorm(0, 0.001)
invsigma2 <- 1 / (sigma * sigma)
sigma ~ dt(0, 1, 1) T(0, )
invtau2 <- 1 / (tau * tau)
tau ~ dt(0, 1, 1) T(0, )
}
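The corresponding data list and run.jags() call for the grouped model might look as follows; this sketch assumes the grouped data frame computed earlier.

jags_data_g <- list(y = grouped$ybar,
                    s = grouped$S,
                    ny = grouped$ny,
                    eff = grouped$effect,
                    player = grouped$player,
                    N = nrow(grouped),
                    J = max(grouped$player))

posterior_g <- run.jags(modelString,
                        n.chains = 1,
                        data = jags_data_g,
                        monitor = c("theta", "beta", "alpha_j",
                                    "sigma", "tau"),
                        adapt = 1000,
                        burnin = 2000,
                        sample = 5000)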

The use of sufficient statistics for the grouped data leads to a large decrease in the JAGS simulation time. For the ungrouped data, it takes 2.54 minutes and with the grouping, the time is decreased to 0.08 minutes.

A similar increase in efficiency can be achieved using Stan if the script uses the sufficient statistics in the sampling description. Indeed, by using the CmdStanR interface (https://mc-stan.org/cmdstanr/index.html), I was able to write a Stan script using this reduced dataset, and the time for the MCMC run was only 0.29 minutes.