## Age and Pitching Performance

After looking into the aging patterns of hitters, the next step is to look at pitchers. How does pitcher performance improve and decline with age? I used the same regression technique I used for hitters to estimate the effect of age on pitching performance, with the following model:

ERA+ or K/BB+ = X(Age) + B1(Lag of ERA+ or K/BB+) + B2 (# Batters Faced by Pitcher) + V (player constants) + e

ERA+ is the pitcher’s season ERA divided by the league ERA for that season, multiplied by 100, so that 100 is a league-average pitcher for that season and lower values are better. K/BB+ is the pitcher’s strikeout-to-walk ratio for the season divided by the league average and multiplied by 100, so higher values are better. The “plus” method is a good way to pull out year-to-year differences in the observations.

ERA is the standard by which most fans judge pitcher performance. I admit I could have used DIPS ERA, but I did not want to calculate it going back to 1980. I think that when you look at performance over time there is not much need to make the correction, so I will not expend the effort. Instead I will use another good metric of pitching performance, the strikeout-to-walk ratio. As Skip discussed the other day, Gerald Scully first noted in 1974 that this metric is an important measure of pitcher quality, and Bill James agrees.

X is a vector of coefficients on the polynomial terms of Age (Age, Age^2, Age^3, etc.). V is a vector of individual player constants that factor out any player characteristics not included in the model (i.e., this is a fixed-effects model). The two control variables are the previous year’s ERA+ or K/BB+, to proxy for pitcher quality, and the number of batters faced by the pitcher in a given season, to proxy for injury.
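For readers who prefer subscripts, the model can be restated as follows. This is only one reading of the equation above: the indices i (pitcher) and t (season), the symbol y for the dependent variable, and the polynomial degree K are notation introduced here for illustration and do not appear in the original.

```latex
y_{it} = \sum_{k=1}^{K} X_k \,\mathrm{Age}_{it}^{\,k}
       + B_1\, y_{i,t-1}
       + B_2\, \mathrm{BFP}_{it}
       + V_i + e_{it}
```

Here y is either ERA+ or K/BB+, BFP is batters faced by the pitcher, and V_i is the player constant.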

For a sample I use individual player-seasons from 1980 to 2003. The data are from the Lahman Database. I include only pitchers who start at least 10 games in a given season of observation. I tried a wider sample of pitchers initially, but including relief pitchers made the model very difficult to estimate. Excluding them is probably a good thing anyway, since starting pitchers and relief pitchers have almost completely distinct roles. It is also important to note that the league averages I use to calculate ERA+ and K/BB+ are the averages of all pitchers starting 10 games or more in that particular season. I estimate the model using the xtregar command in Stata, which fits a panel regression while correcting for first-order serial correlation in the errors.
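The sample rule and the "plus" calculation can be sketched in a few lines of Python. This is a rough illustration, not the code used for the estimates: the field names (`year`, `gs`, `era`, `so`, `bb`) are hypothetical, the example rows are made up, and the league figure is assumed to be a simple mean across qualifying pitchers in a season, which the text does not spell out.

```python
from collections import defaultdict

MIN_STARTS = 10  # sample rule from the text: at least 10 games started in a season


def plus_index(value, league_value):
    """Return 100 * value / league_value, so 100 is league average."""
    return 100.0 * value / league_value


def add_plus_stats(rows):
    """Attach ERA+ and K/BB+ to every qualifying pitcher-season.

    Assumption: the league figure is the unweighted mean over pitchers
    with at least MIN_STARTS starts in that season. Assumes bb > 0.
    """
    by_year = defaultdict(list)
    for row in rows:
        if row["gs"] >= MIN_STARTS:  # drop relievers and spot starters
            by_year[row["year"]].append(row)

    result = []
    for year, group in by_year.items():
        league_era = sum(r["era"] for r in group) / len(group)
        league_kbb = sum(r["so"] / r["bb"] for r in group) / len(group)
        for r in group:
            result.append({
                **r,
                "era_plus": plus_index(r["era"], league_era),
                "kbb_plus": plus_index(r["so"] / r["bb"], league_kbb),
            })
    return result


# Made-up rows for illustration only (not real pitchers).
example_rows = [
    {"player": "a", "year": 1990, "gs": 30, "era": 3.0, "so": 200, "bb": 50},
    {"player": "b", "year": 1990, "gs": 25, "era": 5.0, "so": 100, "bb": 50},
    {"player": "c", "year": 1990, "gs": 0, "era": 2.0, "so": 60, "bb": 20},
]
example_result = add_plus_stats(example_rows)
```

With these rows the reliever `c` is dropped, the league ERA among qualifiers is 4.0, and pitcher `a` gets an ERA+ of 75, which is better than average under the definition used here.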

Here are the fitted plots on three samples of pitchers for both measures of pitching performance: the entire sample of pitchers, those pitchers whose career figure is below 100, and those whose career figure is above 100.

The best fit for ERA+ is quartic, that is, it includes the polynomial terms from Age through Age^4; therefore, the minimums I report are rough visual estimates. Many thanks to an altruistic reader who tried to help me minimize the function by hand, but it was too much of a pain (we need Mathematica). Pitcher ERA+ is minimized at about 28-29 for the good (below 100) pitchers, 26-27 for the entire sample, and seems to be ever rising for the not-so-good (above 100) pitchers.

The best fit for K/BB+ was quadratic, and therefore easy to maximize. Pitcher K/BB+ is maximized at 29.67 for the good (above 100) pitchers, 28.58 for the entire sample, and 25.66 for the not-so-good (below 100) pitchers.
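The quadratic peaks follow from setting the derivative with respect to Age to zero, which gives Age = -b1 / (2 * b2) for coefficients b1 on Age and b2 on Age^2. The short sketch below applies that formula to the K/BB+ coefficients from the regression table in this post; the function and dictionary names are made up for illustration, and the lag and batters-faced terms are treated as constants because they do not depend on Age.

```python
def quadratic_peak_age(b_age, b_age_sq):
    """Age that maximizes b_age * Age + b_age_sq * Age**2.

    Setting the derivative b_age + 2 * b_age_sq * Age to zero gives the
    vertex; it is a maximum only when b_age_sq is negative.
    """
    if b_age_sq >= 0:
        raise ValueError("no interior maximum: Age^2 coefficient must be negative")
    return -b_age / (2.0 * b_age_sq)


# (Age, Age^2) coefficients from the K/BB+ regression table in this post.
KBB_COEFFICIENTS = {
    "all": (7.265849, -0.127119),
    "above_100": (8.569231, -0.1444294),
    "below_100": (5.124458, -0.0998337),
}

for sample, (b_age, b_age_sq) in KBB_COEFFICIENTS.items():
    print(sample, round(quadratic_peak_age(b_age, b_age_sq), 2))
```

Running this reproduces the peak ages quoted above: 28.58, 29.67, and 25.66.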

From this, I think it is safe to say that the best estimate of peak pitching performance is a little past age 28. The peak comes a little later for good pitchers and a little earlier for lower-quality pitchers. This is not surprising, since high-quality pitchers will have more opportunities to pitch as they get older than low-quality pitchers. Below I include the regression tables for those who are interested. I do not report the standard errors. All of the coefficients are statistically significant at the 1% level in the K/BB+ model. For ERA+ the coefficients are statistically significant at the 5% level or better in the entire-sample model; however, when I break the sample up, some of the coefficients on the higher-order terms of Age are not statistically significant. I am still working on this a bit, but I just want to post what I have. Please feel free to lend me your thoughts or suggestions.

| Variable | All | ERA+ < 100 | ERA+ > 100 |
|---|---|---|---|
| Age | 24.22148 | 22.96265 | 32.37705 |
| Age^2 | -1.369466 | -1.25484 | -2.218376 |
| Age^3 | 0.0315969 | 0.0275669 | 0.0604417 |
| Age^4 | -0.0002519 | -0.0002059 | -0.0005627 |
| Lag(ERA+) | -0.2734865 | -0.2851918 | -0.280756 |
| BFP | -0.0430742 | -0.0340235 | -0.0566215 |
| R-sq. | 0.28 | 0.24 | 0.37 |
| Obs. | 1911 | 1251 | 660 |
| Players | 448 | 234 | 214 |
| Peak Age | 26-27 | 28-29 | early 20s |

| Variable | All | K/BB+ > 100 | K/BB+ < 100 |
|---|---|---|---|
| Age | 7.265849 | 8.569231 | 5.124458 |
| Age^2 | -0.127119 | -0.1444294 | -0.0998337 |
| Lag(K/BB+) | -0.1721541 | -0.1464887 | -0.2262057 |
| BFP | 0.041634 | 0.0500489 | 0.0318835 |
| R-sq. | 0.13 | 0.12 | 0.18 |
| Obs. | 1911 | 987 | 924 |
| Players | 448 | 144 | 264 |
| Peak Age | 28.58 | 29.67 | 25.66 |