Archive for Sabermetrics
I have an article up on aging at Baseball Prospectus today.
How Do Baseball Players Age? Investigating the Age-27 Theory
Recently, there’s been a decent amount of chatter regarding how baseball players age, and I have to admit that it’s mostly my fault. In a study that was recently published in Journal of Sports Sciences, I find that players tend to peak around the age of 29; this finding has been met with resistance from some individuals in the sabermetric community, where 27 has long been considered the age when players peak. Will Carroll and Christina Kahrl graciously asked if I would be willing to defend my findings on Baseball Prospectus. I agreed, and I thank Will and Christina for the opportunity to do so.
It seems that I have upset a few people with one of my interview answers at Chop-n-Change, involving a paper by Jahn Hakes and Skip Sauer. Here’s a brief response that covers the criticism the paper has received (cross-posted in the comments).
1) The goal of Hakes and Sauer was to test the Moneyball hypothesis that OBP was undervalued relative to SLG; hence, the title of the paper.
2) This test must include OBP and SLG in the model. The concept can be broken down and tested further, which they did, but the interesting question is whether this central tenet of Moneyball is true. The exercise is not about designing the perfect model for predicting salaries. I vividly recall discussing this with the authors at the time the paper was written, when I asked them about alternate specifications of the model. They responded that they had run them and that the analysis would be part of another paper (which it was), but that for this exercise they were focused on Moneyball itself. This then creates the problem of adjusting for playing time. It could be controlled for in ways other than plate appearances (e.g., interaction terms), but the authors ultimately decided the parsimony of their specification made it the right choice. Accounting for all sectors of the labor market is another tough issue. Ideally, you would like to separate the labor classifications, but they were trying to estimate the market price for the entire labor market—reserved and arbitration-eligible players are part of that market. So, they include dummies as a control. Again, interaction terms or some other correction could have been used, but they felt their final specification was best. And they were able to convince many other economists (colleagues, editors, and referees) at different levels of review that what they produced was the best choice.
3) The goal of the study was to identify if the market was out of whack at the time the book was written. The findings indicate that the pre-Moneyball models don’t predict the post-Moneyball sample as well as we would expect them to. That is a point in favor of the paper, not an objection. Furthermore, in 2001 the labor market was especially out of whack, and I find it odd that it was the specification chosen for close examination. The regression equation was designed to pick up information from real-world data; the values are not something presupposed by the authors. The coefficient on OBP is negative—higher OBP lowers your salary. You don’t need to plug in any values to see that this is counterintuitive. Part of the reason why the salaries remain so stable when Tangotiger adjusts the inputs is that the higher value for OBP cuts into the impact of SLG. As Hakes and Sauer acknowledge in the text, the coefficients on OBP are not even statistically significant—the market appeared to be ignoring the relevance of OBP at the time. That’s their argument.
4) So, the Hakes and Sauer papers may be imperfect, joining the ranks of every other empirical study ever written. If you think you can do better, here is a solution. Take the freely available data and run alternate specifications. As it stands, the critique is that the perfect is the enemy of the good. If further testing reveals the labor market was not out of whack, then we have an argument.
As a follow-up to my little clutch hitting study, I thought it would be interesting to look at clutch pitching using the same methodology. Though I don’t believe there is good reason to expect clutch performance among hitters, I think it’s plausible that pitchers may have some clutch skill. Pitchers have to regulate their effort throughout the game and often change the way they pitch with runners on base (employing the stretch). These factors leave room for pitchers to perform differently when the stakes of the game change. Pitching better with runners in scoring position (RISP) may not be “clutch” in the Platonic sense of rising to the occasion, but it’s a skill worthy of examination.
I looked at individual RISP plate appearances in 1992 and estimated the impact of past clutch performance controlling for the overall pitcher performance in each area (allowed AVG, OBP, SLG, strikeout rate, walk rate, home run rate), the skill of the batter in each area, and the platoon effect (platoon = 1; 0 = otherwise). I used RISP performance in 1989–1991 to proxy clutch ability—if pitchers have clutch skill, past clutch performance should correlate with present clutch performance.
The table below lists the coefficients (reported as marginal effects) and robust z-statistics of regression estimates in seven performance areas. I used the probit method to estimate binary outcomes (outcome = 1 if an event occurred, and 0 otherwise) of individual plate appearances for hits, on-base (hits + walks + hbp), strikeouts, walks, non-intentional walks, and home runs. I used the negative binomial method to estimate the impact of the variables on the number of total bases resulting from a plate appearance.
          Hit      On Base  TB       K        BB       BB-IBB   HR
Overall   1.1801   0.8815   0.9702   0.8695   0.7657   0.6890   0.5808
          [8.61]   [7.20]   [8.06]   [12.34]  [8.78]   [8.66]   [6.31]
RISP      -0.0239  -0.0022  -0.1192  0.0845   0.1305   0.0797   -0.0383
          [0.29]   [0.03]   [-1.69]  [1.39]   [2.16]   [1.43]   [0.73]
Batter    1.0148   1.0737   0.9816   0.9242   1.1108   0.9558   0.5831
          [13.61]  [17.42]  [15.62]  [25.56]  [23.54]  [22.49]  [16.13]
Platoon   0.0165   0.0332   0.0428   -0.0246  0.0266   0.0039   0.0071
          [2.77]   [5.35]   [4.29]   [5.30]   [6.73]   [2.79]   [1.96]
Obs.      21,096   23,872   21,096   23,872   23,872   23,872   23,329
Method    Probit   Probit   Neg Bin  Probit   Probit   Probit   Probit
Past RISP performance is not a statistically significant predictor of 1992 RISP performance. Walk rate appears to be an exception, with pitchers consistently performing worse with runners on base (z-stat > 2), but the higher probability of walks seems to be driven by the increase in intentional walks issued in the hope of turning a double play. When IBBs are removed, pitcher RISP walk performance loses its statistical significance.
The results do hide one thing: pitchers perform better in RISP than non-RISP situations, except when walks are involved. The table below shows the average of outcomes for all events. All differences are statistically significant.
          No RISP   RISP
Hit       0.255     0.249
TB        0.380     0.368
BB        0.075     0.089
On Base   0.315     0.321
K         0.150     0.157
HR        0.021     0.019
The numbers exclude intentional walks; therefore, the worse performance in preventing walks (which also shows up in on-base probability) could represent “intentional unintentional-walks” or pitchers losing a bit of control when runners are in scoring position. But if the latter were true, I would expect the numbers to be worse in the other areas as well. Also, because the numbers in the table above are averages across all outcomes, the better RISP numbers may also reflect better relievers entering the game in such situations.
The main story here is that the regression estimates indicate that, after controlling for several relevant factors, pitchers don’t appear to have any special skill relative to other pitchers in RISP situations. A pitcher’s overall performance level does a fine job of predicting RISP performance, and knowing past clutch performance doesn’t appear to add useful information.
I’ve been getting a few hits for the term “productive outs” lately. I blame TBS (so does Steve Goldman). When the stat came into being five years ago, I did a little study of its impact, and I thought I’d repost my findings.
If there is anything of use in POP, it must be in addition to the impact of OBP and SLG, not as an alternative measure. Olney’s argument ought to be: all else being equal, teams that have a higher percentage of productive outs will score more runs than those that do not. This means that when two teams have identical OPSs, the one with a higher POP will score more runs. So, what happens when I run a regression including both OPS and POP, which allows me to control for the run-scoring abilities of teams due to OBP and SLG, to capture any additional POP effect? Well, not much. Using the 2004 team data provided by ESPN.com, I find that POP has no effect on run-scoring. Though the coefficient is negative, it is not statistically significant.
So, why doesn’t it have an effect? I mean, clearly logic dictates that productive outs are preferable to non-productive outs. The problem lies in the fact that productive-out situations are also productive at-bat situations. While productive outs are preferred to non-productive outs, non-outs are even better. A team that is producing productive outs is still producing outs.
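The team-level test is easy to replicate in spirit. Below is a hedged sketch using fabricated team seasons (not the 2004 ESPN.com data) in which POP truly has no effect once OPS is in the model; the regression recovers the OPS effect and returns a small t-statistic on POP.

```python
# A hedged replication sketch of the POP test: simulated team seasons where
# productive-out percentage (POP) has no true effect on runs once OPS is in
# the model. These are fabricated numbers, not the 2004 ESPN.com data.
import numpy as np

rng = np.random.default_rng(1)
n = 30                                      # one season of team observations
ops = rng.normal(0.750, 0.030, n)           # team OPS
pop = rng.normal(0.300, 0.040, n)           # team POP, unrelated to runs by design
runs = 750 + 2000 * (ops - 0.750) + rng.normal(0, 25, n)

X = np.column_stack([np.ones(n), ops, pop])
beta, *_ = np.linalg.lstsq(X, runs, rcond=None)

# Classical OLS standard errors and the t-statistic on POP
resid = runs - X @ beta
s2 = resid @ resid / (n - X.shape[1])
se = np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))
t_pop = beta[2] / se[2]
print(round(t_pop, 2))   # typically well inside +/-2: no detectable POP effect
```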
Last week’s post on clutch ability got me thinking about another way to identify clutch hitting. Instead of comparing performance in aggregate data, I wanted to look at the probability that a hitter would perform in an individual plate appearance using past performance metrics as predictors. The degree to which past clutch performance predicted actual performance would tell us something about clutch ability, while controlling for other factors.
So, as I watched Nick Punto surpass Lonnie Smith for the most memorable baserunning error in Metrodome history, I pulled up an old data file (via Retrosheet) that Doug Drinen and I had used to study protection. I had a four-year sample of individual plate appearances from 1989–1992. I estimated each player’s performance with runners in scoring position (RISP) from 1989–1991 to see how it predicted 1992 performance in RISP plate appearances. The idea is that if players have any ability to perform with higher stakes, then past performance in this area should affect the probability of success during individual plate appearances. The nice thing about such granular data is that it is possible to control for factors such as pitcher quality and the platoon advantage—effects that are difficult to tease out of aggregate data.
I used probit models to estimate the likelihood that a player would get a hit (1 = hit; 0 = otherwise) or get on base (1 = hit, walk, or hbp; 0 = otherwise), controlling for the player’s seasonal performance in that area (AVG or OBP), his RISP performance in that area from 1989–91, whether the platoon advantage was in effect (1 = platoon; 0 = otherwise), and the pitcher’s ability in that area. To test hitting power, I used the negative binomial count-regression method to estimate the expected number of total bases during the plate appearance, using each player’s 1989–1991 RISP SLG as a proxy for clutch skill in this area.
The table below lists the marginal effect (X) of a change in the explanatory variable on the dependent variable. For example, a one-unit change in the explanatory variable is associated with an X-unit change in the dependent variable. For the probit estimates, this represents a change in probability. For the negative binomial estimates, this represents the expected change in total bases.
Variable       Hit        On Base    Total Bases
Overall        1.04       0.98       0.93
               [9.58]     [11.84]    [10.8]
RISP           -0.06162   0.00018    0.00012
               [1.02]     [3.65]     [1.32]
Pitcher        1.152      1.031      0.983
               [12.94]    [12.51]    [12.83]
Platoon        0.014      0.040      0.039
               [2.41]     [6.74]     [3.82]
Observations   23,197     26,820     23,197
Method         Probit     Probit     Neg. Binomial

Absolute robust z-statistics in brackets.
The bracketed values below the coefficients are the z-statistics, where a statistic of 2 or above generally indicates a statistically meaningful relationship. In samples of this size, statistical significance isn’t difficult to achieve; therefore, it isn’t surprising that in all but two instances the variables are significant. The two that are insignificant are past RISP performance in batting average and in slugging average. Thus, clutch ability doesn’t appear to be strong here.
However, the estimate of a clutch effect is statistically significant for getting on base. Is this evidence for clutch ability? Well, let’s interpret the coefficient. Every one-unit increase in RISP OBP is associated with a 0.00018 increase in the likelihood of getting on base; thus, a player increasing his RISP OBP by 0.010 (10 OBP points) increases his on-base probability by 0.0000018. For practical purposes, there is no effect.
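To make the interpretation concrete, the arithmetic is just the marginal effect multiplied by the change in the explanatory variable:

```python
# Reading the on-base marginal effect from the table: multiply the coefficient
# by the change in the explanatory variable to get the change in probability.
marginal_effect = 0.00018    # on-base marginal effect of past RISP OBP (from the table)
delta_risp_obp = 0.010       # a 10-point improvement in past RISP OBP
change_in_prob = marginal_effect * delta_risp_obp
print(change_in_prob)        # about 0.0000018: practically zero
```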
This study is by no means perfect, but the striking gap in magnitude between the overall and clutch effects (just look at the differences between the Overall and RISP coefficients) in such a large sample shows why it’s best to remain skeptical regarding clutch ability. If players did have clutch skill, I believe it would show up in this test.
This morning, I ran across an article by Allen Barra in the WSJ that reminded me of a blog post that I have been meaning to write for several years. Barra discusses the ability of players to perform in “clutch” moments. In closing, Barra cites Bill James as an agnostic regarding clutch ability, quoting the last line of his article Underestimating the Fog: “Let’s not be too sure that we haven’t been missing something important.”
James’s article caused a bit of a stir when it was first published. Here was James arguing that several common notions among sabermetricians—including that clutch ability is a myth—were not necessarily so. As James metaphorically stated the problem, clutch hitting lies in a fog, beyond a sentry on the lookout for approaching forces. In a thick fog, the enemy may be invisible despite existing in strong numbers. The fog that obscures the view of the guard is like the randomness in baseball that makes it difficult to distinguish ability from chance. While we have methods for disentangling luck from ability, there exists the possibility that clutch ability is real and we just haven’t found a way to see through the fog properly. Therefore, we shouldn’t be too quick to believe an idea, even when the bulk of the evidence we have indicates that it is true. Maybe the truth is just lost in the fog.
No one can deny this. Of course it is possible that clutch ability exists and we just haven’t found a way to measure it properly. But we dismiss lots of other possible events as unlikely every day, with good reason. And the cost of acting on too little evidence must be balanced against the cost of not acting on sufficient evidence. It’s a dilemma familiar to all scientists, captured by the distinction between type I and type II errors.
Let’s begin with the null hypothesis that player performance in clutch situations is identical to performance in non-clutch situations. A type I error occurs when we reject a correct null hypothesis. Studies of clutch hitting find that performance differences in these situations are small and often not statistically meaningful. The null stands, and clutch-hitting skill is seen as a myth. A type II error occurs when we fail to reject an incorrect null hypothesis. When James advocates agnosticism toward clutch hitting as a skill, it is because, despite the studies showing little evidence of clutch hitting, he wants to avoid committing a type II error. The problem is, this choice between type I and type II errors isn’t free. By raising the decision criterion to avoid type II error, you necessarily increase the chance of committing a type I error.
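A quick simulation shows why the null is hard to reject honestly: with clutch samples of ordinary size, sampling noise alone produces large apparent clutch/non-clutch splits. The player counts and plate-appearance totals below are hypothetical.

```python
# Why the fog is thick: even with zero true clutch skill, ordinary sample
# sizes generate big clutch/non-clutch splits by chance alone. Player counts
# and plate-appearance totals are made up for illustration.
import numpy as np

rng = np.random.default_rng(2)
players = 300
true_obp = 0.340                            # identical in clutch and non-clutch PAs
non_clutch = rng.binomial(500, true_obp, players) / 500   # 500 ordinary PAs each
clutch = rng.binomial(120, true_obp, players) / 120       # 120 clutch PAs each
split = clutch - non_clutch

# Spread of observed "clutch" splits produced purely by sampling noise
print(round(split.std(), 3))
```

The standard deviation of the splits is on the order of 40–50 points of OBP even though every simulated player is, by construction, exactly as good in the clutch as otherwise.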
Identifying clutch hitting is a practical problem that requires decisions with real costs. Should a team factor in clutch ability when choosing between free agents? Should it matter for a manager choosing among pinch hitters? Should a historically big-game pitcher start the playoff series over your regular-season ace? Based on the available evidence, if I had to decide between Jeter and A-Rod it’s not even close: Alex Rodriguez is a far superior player to Derek Jeter, and that’s what is relevant. And in cases where the players’ performances are more similar, I wouldn’t consider clutch performance for even a moment. If clutch ability existed, it would show up in bunches using the empirical methods already employed by researchers studying the question.
In my view, the fog is a distraction: something to bring up to keep the argument going. But arguing takes time, which is valuable. Let’s stop it with the fog, already. Of course it’s possible that something exists that just hasn’t been discovered yet (e.g., the Loch Ness Monster, Sasquatch, ergogenic effects of HGH); but the evidence we have says these things don’t exist, and hanging hopes on the merely possible isn’t a very persuasive argument.
“Nate McLouth is still a fourth OF masquerading as a starting CF.”
Season   OPS+   +/-   SB/CS
2007     110    -9    22/1
2008     126    -37   23/3
2009     109    +3    19/6
There is no arguing that McLouth is an above-average hitter, and when you add in his contributions on the basepaths it’s clear that he is a valuable offensive player. His lone deficiency is on defense, where he drew the ire of many saber-minded commentators for winning a Gold Glove while having the worst Plus/Minus in the league. He also was the Pirates’ lone All-Star representative in 2008, because someone had to go. But the justifiable backlash against his mainstream overrating doesn’t justify relegating him to part-time status.
So, let’s tackle the defense. In 2007, he had a poor defensive season with a -9 Plus/Minus that when translated to a full season of work would have been a -16. Not good, but not in -37 territory. In 2009, he seems to have corrected the problem, becoming league average. Maybe it’s a blip, and he hasn’t improved. After watching him for half a season, I don’t really understand how anyone could have awarded him a Gold Glove. Yet, I thought he was adequate and a defensive upgrade over the supposedly solid Jordan Schafer, who posted a -5 Plus/Minus for one-third of the season (Yes, I get it: small sample and he’s young. Just pointing out that the metric that damns McLouth says he was better defensively than Schafer in 2009).
But that -37 may not adequately capture his defensive ability, and the Plus/Minus creator John Dewan seems to agree: “All in all, I no longer think of McLouth as the worst center fielder in baseball. It means something that at least some of the managers and coaches think highly of him.” In addition, Dewan examined McLouth’s performance at a more granular level in The Fielding Bible: Volume II and found McLouth’s biggest weakness: deep balls, especially those near the wall. Does this have something to do with defensive positioning, the park in Pittsburgh, or McLouth’s ability? This is difficult to answer, but the Plus/Minus of McLouth’s replacement in Pittsburgh, Andrew McCutchen, reveals something interesting. In two-thirds of a season he posted a Plus/Minus of -17. I think BIS and Pittsburgh need to get together and see if there is a measurement or coaching problem that needs to be addressed. McLouth and McCutchen might both have been poor fielders in Pittsburgh, but I think there was also something else going on.
Even if you take the Plus/Minus values at face value, the bat more than compensates: his Adjusted Batting Runs per 162 games over the past three seasons is 15 runs above average. His average Plus/Minus over the same span (stretching his 2007 number to his 2008 playing time) is -17, which you can multiply by .56 to get about -10 runs. So, he’s still a player who is five runs above average.
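For what it’s worth, the back-of-the-envelope valuation above is just:

```python
# The back-of-the-envelope valuation from the text, using its rough
# conversion of 0.56 runs per Plus/Minus play.
batting_runs = 15            # Adjusted Batting Runs above average per 162 games
avg_plus_minus = -17         # three-year average Plus/Minus
fielding_runs = avg_plus_minus * 0.56
net_runs = batting_runs + fielding_runs
print(round(net_runs, 1))    # about +5 runs above average overall
```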
In conclusion, I think there is very little evidence to support the claim that McLouth is a fourth outfielder. He may not be an All-Star or a Gold-Glover, but he’s a starting center fielder for most major-league teams.
In fact, in the absence of other stats, Wins is a very good, if not great, indicator of a pitcher’s value. So next time you hear somebody say Wins is a crappy way to evaluate a pitcher, throw a drink in their face and then make them read this post.
As someone who would need a towel if readers followed this advice, I believe a response is in order. Now, the author, Cork Gaines (“The Professor”), does acknowledge that Wins is not the best statistic for evaluating pitchers, but that’s not really news. When will anyone ever face a situation where they must choose between Wins and nothing to value a pitcher? After reading the post, I maintain that Wins is a poor statistic for valuing pitchers. In fact, the statistical evidence used in the article shows the opposite of what the author thinks it shows.
Gaines uses regression estimates of Wins and Win% on ERA+, finding R2 values of 0.51 and 0.54, to justify the usefulness of Wins.* Those values are indeed statistically significant and reveal a real positive correlation between Wins and run prevention. But more than that, they reveal why Wins is such a bad statistic for valuing pitching quality. How is showing that good pitchers get more Wins than bad pitchers busting a myth? Greg Maddux didn’t luck his way to 355 Wins, and no one who pooh-poohs Wins thinks his Win total is a result of randomness unrelated to his ability. It’s the magnitude of the correlation that is important here.
The R2 reveals the percentage of the change in the dependent variable (ERA+ in this example) explained by changes in the independent variable(s) (Wins or Wins%). The remainder is due to explanatory factors not included in the model. Now, R2 can be tricky to interpret and it is sensitive to sample size; but, in general, the results indicate that 50% of the difference in ERA+ across pitchers can be explained by differences in Wins. That’s the problem, not evidence to the contrary. The main knock against Wins is that pitchers have control over only one half of the game: half the game is defense (50%) and the other half is offense (50%). An R2 of close to 0.50 confirms rather than debunks this notion.
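The point is easy to demonstrate with simulated data: when an outcome is built from two equal-variance components and you regress on only one of them, R2 comes out near 0.50.

```python
# If half the variance in outcomes is controlled by the pitcher and half by
# the offense, regressing outcomes on the pitching component alone yields
# an R2 near 0.50. Simulated data, equal variances by construction.
import numpy as np

rng = np.random.default_rng(3)
n = 5_000
pitching = rng.normal(0.0, 1.0, n)          # pitcher-controlled component
offense = rng.normal(0.0, 1.0, n)           # component beyond the pitcher's control
outcome = pitching + offense

r = np.corrcoef(pitching, outcome)[0, 1]
r_squared = r ** 2
print(round(r_squared, 2))   # close to 0.50
```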
When choosing performance metrics, it is important to use three criteria:
1) How well does it correlate with output? — Wins doesn’t do so bad here: Wins are correlated with run prevention. Still, other metrics of pitcher performance are far superior, and the life-boat circumstances when someone might need Wins to value a pitcher don’t happen. Why bring this up? No one has suggested that Wins and ability are uncorrelated.
2) How well does it measure ability? — It measures ability, but it is heavily polluted by outside factors (offense and fielding). This is the criterion used to justify using DIPS over ERA. If you want to know the statistic that most strongly correlates with run prevention for pitchers, it’s ERA by a longshot. It is almost a pure recording of the runs pitchers give up, so of course the correlation will be strong. The problem is that pitchers themselves don’t have much control over a major component of ERA: balls that are put into play. ERA fluctuates significantly from season to season because it is so dependent on balls in play. DIPS measures are preferred over ERA because they more accurately capture actual pitcher contributions to run prevention, not because they correlate more strongly with run prevention. Similarly, Wins captures some aspects of pitcher ability, but a huge chunk of the contribution is determined by factors beyond pitcher control. And the regressions’ explained variance in ERA+ is consistent with Wins reflecting only the roughly half of the game that pitchers control.
3) How well does it match our intuition as to what matters? — This criterion isn’t all that relevant in this case, and is reflected in the analysis in criterion 2. I use this rule in situations where correlations yield counterintuitive values. For example, strikeouts and home-run hitting are positively correlated; however, suggesting that a hitter should strike out more to increase his power would be wrong.
Gaines is right that Wins includes some useful information regarding pitchers, but the polluting impact of outside factors is so large that when Wins deviates from ERA or DIPS performance expectations, it is Wins that contains the misleading information. There is no reason to use Wins to evaluate pitcher ability. It is neither a very good nor a great indicator of a pitcher’s value.
* A footnote to the article states that R2 ranges from -1 to 1, with greater positive (negative) values indicating a stronger correlation. This is incorrect: R2 ranges from 0 to 1. I was curious whether the author was using the correlation coefficient r, which does range from -1 to 1 but has a different interpretation in terms of measuring explained variance. However, the graphs and intuition make it look as though the descriptive footnote is incorrect, not the main text of the analysis.
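A two-line check of the footnote’s claim: fit a strongly negative relationship; r comes out negative, but R2 (which equals r squared in a simple regression) still sits between 0 and 1.

```python
# r ranges over [-1, 1]; R2 = r**2 in a simple regression, so R2 sits in [0, 1].
# A deliberately downward-sloping example with a little fixed noise:
import numpy as np

x = np.arange(10, dtype=float)
noise = np.array([0.5, -0.3, 0.2, -0.1, 0.4, -0.2, 0.1, -0.4, 0.3, 0.0])
y = -2.0 * x + noise                        # strong negative relationship

r = np.corrcoef(x, y)[0, 1]
print(round(r, 3), round(r ** 2, 3))        # r is negative; R2 is positive
```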
In the comments to my previous post regarding Ken Rosenthal’s criticism of sabermetric groupthink (SGT), a thoughtful reader posted Rob Neyer’s response to Rosenthal (not me). Though I responded in the comments, I think it’s worthy of a post of its own.
“In fact, in sabermetrics there’s really no such thing as groupthink. If you’ve spent any real time with sabermetricians, you know exactly what I mean.
Is there a consensus among sabermetricians that Joe Mauer deserves the MVP? Yeah, probably. But “consensus” is not the same as “groupthink.”
Not nearly the same. Groupthink (according to The Big W) is “a type of thought exhibited by group members who try to minimize conflict and reach consensus without critically testing, analyzing, and evaluating ideas.”
That’s the exact opposite of sabermetrics, which at its very heart is nothing but critically testing, analyzing, and evaluating ideas.”
I like Neyer, but “in sabermetrics there’s really no such thing as groupthink”? What sabermetrics is and what it strives to be are two different things. All groups suffer from groupthink, and sabermetrics is no different. Rosenthal isn’t denying the advances made by sabermetrics—he seems to agree that Mauer is his choice for MVP (as is mine)—but objecting to the unnecessarily arrogant tone regarding the correctness of certain tenets pushed by its club members. Flooding the inboxes of sports writers with VORP-laden snarky commentary doesn’t help the movement. Sabermetrics includes some science, but it is not all objective analysis immune from clubbish behavior motivated by social factors.
I think Rosenthal’s message was a polite and important statement that explains why many members of the mainstream media are hostile to sabermetrics. I don’t follow Rosenthal closely, but I have found him to be one of baseball’s more-knowledgeable writers. He may not agree with every tenet of sabermetrics, but he acknowledges the community and its ideas; certainly he has not summarily eschewed sabermetric ideas. I read a lot of dumb things by established baseball writers who deserve to be called out. But when you start inundating people who have lived and breathed baseball for much of their lives—no less than active sabermetricians—with new acronyms that are not in their lexicon, don’t be surprised when they are confused. And getting snooty about it doesn’t help. Baseball already has a language, and there is nothing so complex in sabermetrics that it cannot be explained through terms and statistics understood by little-leaguers.
The honor goes to Ken Rosenthal.
Don’t get me wrong. Sabermetricians have significantly broadened our understanding of baseball — and by “our,” I mean fans, media and club personnel, virtually everyone in the game. Advanced statistics reveal not only tendencies, but also greater truths. Smart teams effectively combine sabermetric principles with scouting orthodoxy. Very few, if any, disregard the numbers entirely.
Here’s the problem: Sabermetricians were ignored for so long, they had to shout to be heard. Now they are getting heard — properly heard in the highest levels of baseball media and front offices. But some continue to shout, dismissing those who disagree as ignorant dolts….
Baseball sparks the liveliest discussions of any sport, invites a myriad of perspectives. Slavishly adhering to sabermetric dogma reduces the level of discourse. We’re talking about an MVP race, not geopolitics. We’re supposed to debate. Good, old-fashioned quarrels are part of what makes the game fun.