Winning and Attendance in MLB

Both Doug Pappas and The Sports Economist link to a study by the Star-Telegram (free registration) on MLB team attendance from 1999-2003. The study’s findings are startling:

High-scoring games and big-money players are more important than wins in drawing fans to major-league baseball games, according to a Star-Telegram analysis of all 30 teams over the past five years.

— Every increase of 100 runs scored brought in 273,160 fans.

— Every $10 million increase in payroll brought in 130,000 fans.

— Every $10 increase in the cost of attending a game brought in 51,372 fans.

Wow! Especially interesting is the apparent violation of the law of demand, about which Skip notes, “Ahem. Higher prices ‘brought in’ fans? I believe we have a specification issue here!” Clearly, higher prices do not increase consumption, so I decided to investigate the study further. Maybe there is something else wrong with it.

Since the article does not list the exact methodology, I tried to reconstruct it on my own as best I could. I gathered five years’ worth of data on attendance, winning percentage, ticket prices, Fan Cost Index (FCI), runs, ERA, HRs, and opening-day payroll by team.* All of these data are available at MLB.com and on Doug Pappas’s and Rodney Fort’s websites. I looked at these variables for all 30 teams from 1999-2003, for 150 total observations (although the observations are reduced to 120 by the empirical technique I use to control for autocorrelation). I estimated the model with dummy variables for each team to capture team-specific factors, and I controlled for detected autocorrelation. If you are familiar with Stata, I used the xtregar command to estimate the model. Here are the regression results.

Variable   Coefficient   T-stat
Win%       1480091        1.48
Payroll    0.01           3.26
Runs       371.10         0.48
ERA        3418.86        0.03
HR         -95.92        -0.06
Ticket     28244.04       1.45
FCI        5411.56        1.37

R-sq       0.37
Obs        120
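
For readers who do not use Stata, here is a minimal Python sketch of the idea behind the team dummies. It is not a reimplementation of xtregar and it does not use the real attendance data: the panel is synthetic, and the names (`team_effect`, `demean_by_group`, the slope of 2.0) are made up for illustration. It shows that demeaning each variable by team, which is equivalent to including a dummy per team, recovers the true slope when the regressor is correlated with a team-specific effect, while a pooled regression does not. The last lines show one plausible reading of the drop from 150 to 120 observations, assuming the autocorrelation correction discards each team's first season.

```python
import numpy as np

rng = np.random.default_rng(0)
n_teams, n_years = 30, 5
true_slope = 2.0  # hypothetical effect we try to recover

# Synthetic panel: each team has a fixed effect (think market size, ballpark).
team_effect = rng.normal(0.0, 5.0, size=n_teams)
team = np.repeat(np.arange(n_teams), n_years)  # team id for each observation

# The regressor is correlated with the team effect, as payroll or price might be.
x = team_effect[team] + rng.normal(0.0, 1.0, size=team.size)
y = true_slope * x + 3.0 * team_effect[team] + rng.normal(0.0, 0.1, size=team.size)


def demean_by_group(values, groups):
    """Subtract each group's mean (the 'within' transformation)."""
    sums = np.bincount(groups, weights=values)
    counts = np.bincount(groups)
    return values - (sums / counts)[groups]


# Fixed-effects (within) estimate: same slope as OLS with one dummy per team.
x_w = demean_by_group(x, team)
y_w = demean_by_group(y, team)
fe_slope = float(x_w @ y_w / (x_w @ x_w))

# Pooled OLS slope that ignores team effects, for comparison.
xc, yc = x - x.mean(), y - y.mean()
pooled_slope = float(xc @ yc / (xc @ xc))

n_obs = team.size                          # 30 teams x 5 seasons = 150
# Assumption: the AR(1) correction drops the first season of every team.
n_obs_after_ar1 = n_teams * (n_years - 1)  # 30 teams x 4 seasons = 120

print(f"fixed-effects slope: {fe_slope:.3f}")
print(f"pooled slope:        {pooled_slope:.3f}")
print(n_obs, n_obs_after_ar1)
```

In this toy panel the pooled slope lands well above 2 because it also picks up the team effect; the within estimate stays close to 2.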

My results differ a little from the Star-Telegram estimates, but small specification differences could explain that. More importantly, the effect of winning percentage is not statistically distinguishable from zero, so I feel like I am on the right track in reconstructing the study. In fact, only team payroll is statistically significant. So I decided to investigate further. Check out the bivariate relationship between attendance and winning percentage.

This seems pretty real to me, but it might not be. What could be wrong? Well, let’s take a look at the original specification. The model includes winning percentage along with runs and ERA. This is not much different from including runs scored and runs allowed, which together determine the Pythagorean win percentage. It is almost like putting win percentage in the regression twice. This creates a problem known as multicollinearity, which inflates the standard errors and lowers the t-statistics. While I am normally hesitant to drop potentially collinear variables, I think it is the right thing to do in this instance. When the redundant variables (runs and ERA) are dropped from the analysis, the win percentage coefficient changes very little and its t-statistic rises. This gives me even more confidence that this is the right thing to do.
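
To make the redundancy concrete, here is a small Python sketch. It assumes the usual Pythagorean formula with an exponent of 2, and the run totals (800 scored, 700 allowed) and the 0.9 correlation are made-up illustrations, not values from my data. The second function uses the textbook two-regressor variance inflation factor, 1 / (1 - r^2), to show how much a standard error grows when two regressors are highly correlated.

```python
import math


def pythagorean_win_pct(runs_scored, runs_allowed, exponent=2.0):
    """Expected win percentage from runs scored and runs allowed."""
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return rs / (rs + ra)


def vif_two_regressors(correlation):
    """Variance inflation factor when two regressors have this correlation."""
    return 1.0 / (1.0 - correlation ** 2)


# Hypothetical team: 800 runs scored, 700 allowed.
expected_win_pct = pythagorean_win_pct(800, 700)

# Hypothetical correlation of 0.9 between two regressors.
vif = vif_two_regressors(0.9)
se_inflation = math.sqrt(vif)  # standard errors scale with sqrt(VIF)

print(f"Pythagorean win pct: {expected_win_pct:.3f}")
print(f"VIF: {vif:.2f}, standard error inflated by a factor of {se_inflation:.2f}")
```

Since runs scored and runs allowed already pin down an expected win percentage, adding them next to actual win percentage leaves the regression little independent variation to work with.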

Variable   Coefficient   T-stat
Win%       1680622       3.16
Payroll    0.0085859     3.3
HR         373.26        0.32
Ticket     27977.48      1.46
FCI        5660.11       1.46

R-sq       0.37
Obs        120

Now, that is a little more like it. To put the winning percentage estimate in perspective, a 0.1 increase in winning percentage (e.g., going from a .500 team to a .600 team) is associated with an increase in season attendance of about 170,000 fans, or about 2,000 fans per game. So, I don’t think it is time to throw out the notion that winning improves attendance just yet.
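
The arithmetic behind those round numbers, as a quick Python check. The coefficient comes from the second table; the 81 home games per season is my assumption about the schedule, not something estimated in the model.

```python
coef_win_pct = 1_680_622  # Win% coefficient from the second regression
delta_win_pct = 0.1       # e.g., a .500 team becoming a .600 team
home_games = 81           # assumption: half of a 162-game season

season_gain = coef_win_pct * delta_win_pct
per_game_gain = season_gain / home_games

print(f"extra fans per season: {season_gain:,.0f}")
print(f"extra fans per home game: {per_game_gain:,.0f}")
```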

*(Note: I would like to point out that, unlike the Star-Telegram study, I did not include a dummy for whether or not a team made the playoffs, because I think this would only exacerbate the multicollinearity problem…plus it would be a pain to gather.)

Update: Skip takes the critique further. The Star-Telegram study is looking even worse. Check it out. He also calls me a “regression maestro.” Thanks Skip, but you can just call me Bob Cobb. ;-)

One Response to “Winning and Attendance in MLB”

  1. John Z. Smith Jr. says:

    You are essentially estimating a demand function, where quantity (season attendance) is a function of own price, demographic variables, and measures of team quality.

    The Ft. Worth article, and your models, are most likely misspecified and suffer from omitted variables problems. You are soaking up fixed effects with team dummies, but you are actually losing some valuable information there: the dummies absorb the effects of market size (population), income, ballpark effects (capacity, age of park, number of games actually played), and cost-of-living differences. For example, teams playing in new parks, especially in the first season, see a large bump in attendance, and also see much higher ticket prices in the new park. The novelty effect of the new park is market-specific, but not time invariant, so team dummies do not adequately control for it. If you include population (MSA population), you will find it to be statistically significant.

    Another problem is that attendance is a function of advance ticket sales and walk-up sales. Advance sales are largely a function of last year’s winning percentage, while walk-up sales are a function of current-year winning percentage. These variables will most likely be significant. Since payrolls and winning percentage are positively correlated, you might be picking up some sort of superstar effect with payrolls.

    I wouldn’t expect production variables such as HRs, team ERA and runs to be relevant in a regression that includes winning percentage.

    Since ticket prices are in the FCI, why include both variables in the same regression? Most recent studies of such demand functions find the ticket price coefficient to be positive and insignificant. One explanation is that attendance is a function of the total cost of game attendance rather than the simple ticket price. If you believe this, use the parking cost and one representative concession price from the Fan Cost Index as separate variables in the regression. (It is not clear that the reported beer and soda prices are for the same size in ounces across different parks.) The other problem with ticket prices is that they are most likely measured with error (promotions such as packages, group ticket discounts, etc.). Finally, you are estimating a pooled regression across geographic markets and different years, which means you have general price inflation across time and cost-of-living differences across markets. You are soaking up the cost-of-living differences across markets with fixed effects, but not the upward trend in prices. Are you at least deflating nominal ticket prices by the CPI?

    At the very least, you need to deflate ticket prices and include population and median household income measures as explanatory variables in the demand equation.

    If you estimate a ln-ln specification, you’ll be able to directly interpret the estimated coefficients as elasticities. You might also want to adjust the units of variables (attendance, payroll, population) to control for scale differences. Finally, an alternative approach to controlling for fixed effects is to estimate a first-difference specification, which will parse out fixed effects.

    In the sports economics literature, there are at least twenty years of work on demand functions for baseball attendance, with a common specification being ln(season attendance) regressed on ln(wtd. avg. ticket price), ln(MSA pop), ln(median household income in MSA), ln(current-yr. winning percent), and ln(previous-season winning percent).
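
A minimal Python sketch of why the ln-ln form suggested in the comment is convenient: the slope of ln(attendance) on ln(price) is the price elasticity itself. The data here are synthetic and noise-free, built with a made-up elasticity of -0.5 and made-up prices, so the example only shows that the regression recovers the number that generated the data; it says nothing about real MLB elasticities.

```python
import numpy as np

true_elasticity = -0.5  # hypothetical value used to build the toy data

# Hypothetical ticket prices and the attendance they imply under
# a constant-elasticity demand curve: attendance = A * price ** elasticity.
price = np.array([10.0, 12.0, 15.0, 18.0, 22.0, 27.0])
attendance = 3_000_000 * price ** true_elasticity

# Regress ln(attendance) on ln(price); the slope is the elasticity.
slope, intercept = np.polyfit(np.log(price), np.log(attendance), 1)

print(f"estimated price elasticity: {slope:.3f}")
```

With real data the other logged regressors from the comment (population, income, current and lagged winning percentage) would enter the same way, each with a coefficient readable as an elasticity.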