On Other Methods for Estimating Aging

My recent posts on my aging study have received a fair amount of criticism. There is no doubt that my study is imperfect, but all empirical researchers must make tradeoffs when selecting samples or choosing estimators. Small samples are sometimes preferred over large samples for expediency. Complicated models are sometimes necessary when simple methods mask biases. The goal of any researcher is to settle on a sample and method with the greatest net benefits, because there is no such thing as a perfect study; and I’m satisfied with what I have found. I didn’t want the peak estimate to be 29 any more than I wanted it to be 27, but that is the outcome of a study that I feel was well designed.

But while I’ve presented and defended my study, I’ve said very little about why I selected the method I did over other methods. So, I want to discuss a few problems with alternative methods commonly employed to study aging that I was trying to avoid. These problems don’t necessarily doom these methods—they too have their own benefits and costs, and it would be wrong to pick on one side of the ledger—but the fact that I could avoid these problems was a plus. Specifically, I’ll cover three methods that are sometimes used to measure aging: the annual-change or “delta” method, the bucket method, and the mode method. I also discuss the importance of controlling for the run environment when estimating age effects.

The Annual-Change Method: This method takes a sample of players and averages how much players tend to change as they age from one season to the next. For example, we look at players who played consecutive seasons and see how their performance changed. For some it will go up, for some it will go down. The data will have significant noise, but averaging all players’ changes reveals a general trend. But this method is subject to a sample-selection bias stemming from who gets to play.
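
To make the mechanics concrete, here is a minimal sketch of the delta calculation. It assumes a hypothetical seasons table with columns player_id, age, and ops; the column names and the choice of OPS are illustrative, not a claim about any particular study’s data.

```python
import pandas as pd

def delta_curve(seasons: pd.DataFrame) -> pd.Series:
    """Average year-over-year OPS change for each age transition."""
    df = seasons.sort_values(["player_id", "age"]).copy()
    # Pair each season with the same player's previous season
    df["prev_age"] = df.groupby("player_id")["age"].shift(1)
    df["prev_ops"] = df.groupby("player_id")["ops"].shift(1)
    # Keep only consecutive-age pairs (e.g., an age-27 season followed by an age-28 season)
    pairs = df[df["age"] == df["prev_age"] + 1]
    # Mean change keyed on the age the player is entering
    return (pairs["ops"] - pairs["prev_ops"]).groupby(pairs["age"]).mean()
```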

Playing time is a function of present performance and past performance. Because of this, past performance affects the sample in a way that highlights declines. Managers are trying to identify the best players to play. A good performance in the past will keep you in the lineup even if you slump in the short term. Bad performance in the past will keep you from playing in the future. To have a two-year sample, you have to reach the playing-time minimum in both seasons. To keep this simple, let’s assume that players can have two types of seasons (good and bad), generating the following combinations of seasons in a two-year sample: good-good, good-bad, bad-good, and bad-bad. We’ll get plenty of the first two types, but the latter two will appear less often. The draws from the Year1 and Year2 talent pools are not random, because the lucky-good can go from good to bad, but the lucky-bad don’t get the opportunity to go from bad to good. I’ll call this phenomenon the survivor effect (Fair (2005) notes something similar).

Imagine we have two players who are both true .750 OPS hitters. PlayerA hits .775 in Year1, and PlayerB hits .725 in Year1, because of normal random fluctuations. PlayerB doesn’t get the opportunity to have a Year2 with a corresponding upward rebound. PlayerA gets to play in Year2, and his performance falls to .725. Possibly in the next round, his Year2-to-Year3 change won’t be recorded because he’s deemed incapable of playing (unless you’re the Braves and you build an advertising campaign around him). Thus, when we average the changes, we will be averaging in more declines than aging alone would produce.
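
A toy simulation makes the survivor effect visible. Every player below has identical, unchanging talent; the .750 talent level, .030 noise, and .740 playing-time cutoff are made-up numbers chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_players, talent, noise_sd, cutoff = 100_000, 0.750, 0.030, 0.740

# Observed OPS = fixed true talent + random noise; no aging at all
year1 = talent + rng.normal(0.0, noise_sd, n_players)
year2 = talent + rng.normal(0.0, noise_sd, n_players)

# Only players with a good-enough Year1 get a Year2
survivors = year1 >= cutoff
observed_change = (year2[survivors] - year1[survivors]).mean()

print(f"True aging effect: 0.000, observed change among survivors: {observed_change:+.3f}")
# Prints roughly -0.02: a spurious "decline" created purely by selection.
```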

So why do we see any positive improvement up to the mid-20s at all (26 is where Nate Silver finds that it ends)? The survivor effect ought to be less relevant when players are younger, because the aging function is steeper at that point (meaning improvements are larger and more likely to overcome bad luck) and managers expect improvement and will be more tolerant of one bad year (“Tough year, kid. Hang in there.”). For older players the effect is the opposite. Being PlayerB at 36 may cost him the chance at a bounce-back year, because teams may read one bad season as a sign that his career is over. [Republished from my comments where it was not particularly visible.]

The Bucket Method: This method involves looking at player performances and organizing them into age “buckets.” After doing so, we can compare which buckets have the highest level of performance. This is the main method used by Bill James in his initial study of aging in his 1982 Abstract.
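
As a minimal sketch, both flavors of the bucket method discussed below might look something like this, again on a hypothetical seasons table (here with an added plate-appearance column, pa, which is also an assumption for illustration).

```python
import pandas as pd

def bucket_totals(seasons: pd.DataFrame) -> pd.Series:
    """Total playing time dropped into each age bucket (the 'totals' variant)."""
    return seasons.groupby("age")["pa"].sum()

def bucket_rates(seasons: pd.DataFrame, min_pa: int = 300) -> pd.Series:
    """Average OPS per age bucket among players who got to play (the 'rates' variant)."""
    qualified = seasons[seasons["pa"] >= min_pa]
    return qualified.groupby("age")["ops"].mean()
```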

The problem with the bucket method depends on what you’re dropping into buckets. If it’s total performance, then the buckets will be filled at younger ages by players who are not necessarily good. If they don’t cut it, they will leave the game and be replaced by more young players. Therefore, the younger buckets are full from sheer numbers, not because players of those ages perform better. And this is biased towards the young, because while many players drop out of baseball, they almost never drop in. A player in his mid-to-late 20s who hasn’t made it will leave the game to start building his human capital for his non-baseball life, or the team may give up on him. Such players rarely stick around for the chance to get a few at-bats in their 30s.

If you are looking at average rate performance by age bucket, so as to avoid the summation problem, you have a new issue. Baseball players have to meet a minimum threshold of performance to play. If you meet that, you normally get to stick around, young or old. 20-year-olds, 30-year-olds, and 40-year-olds who can play, will; thus, their average performances will look quite similar. What’s happening is that only good players enter early and leave late, which props up the averages in the youngest and oldest buckets. In his original aging article, James criticizes Pete Palmer for a study that uses such a method and finds very little aging effect at all. James points out that Palmer can’t measure the “white space”—there is no performance to measure, just white space—where players who have declined don’t get a chance to perform. Aging in the white space is what I am trying to capture by using multiple regression analysis to control for player quality.
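
For contrast, here is an illustrative sketch of a regression along those lines: player fixed effects absorb each hitter’s quality, and a quadratic in age traces the curve. This is not the exact specification from my study, just the general idea, on the same hypothetical seasons table.

```python
import pandas as pd
import statsmodels.formula.api as smf

def fit_age_curve(seasons: pd.DataFrame):
    """seasons: hypothetical columns player_id, age, ops (a rate stat)."""
    # C(player_id) adds a fixed effect per player to control for quality;
    # the quadratic in age traces improvement and decline around a peak.
    result = smf.ols("ops ~ C(player_id) + age + I(age ** 2)", data=seasons).fit()
    b1, b2 = result.params["age"], result.params["I(age ** 2)"]
    peak_age = -b1 / (2 * b2)  # vertex of the fitted quadratic
    return result, peak_age
```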

The Mode Method: This method identifies peaks by finding the most common age at which players have their best season. This method is used not only in sabermetrics but also in academic studies of aging. For example, researchers have looked at the most common age at which track and field athletes set world records. An interesting finding from these studies is that while records tend to fall over time, the age at which athletes break them does not change.
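
In code, the mode method amounts to very little, which is part of its appeal. A minimal sketch on the same hypothetical seasons table:

```python
import pandas as pd

def modal_peak_age(seasons: pd.DataFrame) -> int:
    """Most common age at which players post their best OPS season."""
    best_season_idx = seasons.groupby("player_id")["ops"].idxmax()
    peak_ages = seasons.loc[best_season_idx, "age"]
    return int(peak_ages.mode().iloc[0])
```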

The mode method is imperfect as well, with estimated peak ages biased downward. The reason is that there are two main factors that cause players to decline: aging and random non-aging-related injuries. An example of the former is when a player’s reflexes slow and he can’t get around on a fastball. An example of the latter is when a player blows out his ACL sliding hard into a base and never heals enough to reach his original potential. Players decline and leave the sport for both reasons, but the latter is definitely not aging. When we look at the mode, we are not differentiating between the causes of deterioration. Because of non-aging attrition, more players will have the opportunity to post their peaks earlier rather than later. The thing is, it isn’t predictable who will suffer these injuries (though some injuries are associated with age). The attrition isn’t aging, and players who avoid injuries should improve beyond the modal best season. In my study, I found the mode to be less than the mean and median for all players, even for those who stayed in the sample. The likely reason is unpredictable injury shocks, because plenty of players have good seasons after their 30th birthdays. The mode method is also not very helpful for measuring aging rates: we can find peaks, but we can’t track the path to and from them.
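
A toy simulation illustrates the attrition argument. Every simulated player below shares the same true curve peaking at 29, but each season carries a small chance of a career-ending injury; the 5% hazard, the curve shape, and the noise level are illustrative assumptions, not estimates.

```python
import numpy as np

rng = np.random.default_rng(1)
ages = np.arange(22, 39)
true_curve = -0.001 * (ages - 29) ** 2  # identical curve for everyone, peaking at 29

def observed_peak_ages(hazard: float, n_players: int = 20_000) -> np.ndarray:
    peaks = []
    for _ in range(n_players):
        healthy = rng.random(len(ages)) >= hazard  # False = career-ending injury that year
        career_len = len(ages) if healthy.all() else np.argmin(healthy)
        if career_len == 0:
            continue  # injured before ever playing
        perf = true_curve[:career_len] + rng.normal(0, 0.02, career_len)
        peaks.append(ages[np.argmax(perf)])
    return np.array(peaks)

for hazard in (0.00, 0.05):
    p = observed_peak_ages(hazard)
    print(f"hazard={hazard:.2f}: mean observed peak {p.mean():.1f}, "
          f"share peaking before 29: {(p < 29).mean():.0%}")
# Turning on the injury hazard shifts observed peaks earlier, even though the
# true aging curve never moved.
```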

Another Important Factor: Aside from the problems with methods above, any study of aging using players over time must account for changes in the environment in which players perform. Baseball’s run environment can shift quite dramatically over the course of a player’s career. Colin Wyers provides an example of how not accounting for changes in run environment can lead to erroneous conclusions about aging.

Wyers looks at all players in 2008 who were 29 and sees how their performance differs from 2006, when they were 27. He finds that the players had an OPS 0.014 lower at 29 than at 27, and declares, “That’s not what we should expect to see if the average peak age is in fact 29.” It’s a cherry-picked sample from which no one should draw any conclusions, but even so, the data doesn’t reveal what Wyers thinks it does. In fact, it reveals the exact opposite of what he claims. The run environments were vastly different in 2006 and 2008. In 2006, the league-average OPS was 0.768; in 2008, league-average OPS was 0.749. Thus, on average, if players didn’t age at all, we would expect their OPS to decline by 0.019. That the sample declined by only 0.014 means that, relative to the league, these players improved rather than declined. The improvement is also consistent with the aging estimates listed in my paper (about a 0.65% difference from the peak). Wyers also argues that from 1997–2008 the general trend was a decline from 27 to 29, but run scoring was also declining during this period.
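
Here is that arithmetic spelled out, using the league-average OPS figures quoted above. The simple additive comparison is only a sketch; dividing each season by its league average is another common way to make the adjustment.

```python
league_ops_2006 = 0.768
league_ops_2008 = 0.749
observed_change = -0.014  # the cohort's average OPS change from age 27 to age 29

# If players didn't age at all, their OPS should fall along with the league
expected_change_no_aging = league_ops_2008 - league_ops_2006      # -0.019
relative_to_league = observed_change - expected_change_no_aging   # +0.005

print(f"Expected change with zero aging:   {expected_change_no_aging:+.3f}")
print(f"Observed change relative to league: {relative_to_league:+.3f}")
```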

[Figure: Runs/Game, 1997–2008]

But I would also like to point out that no matter what you think the true peak might be, the real finding here is the flatness of aging. Good players tend to remain good and bad players tend to remain bad over a range of ages, performing only slightly better or worse than expected from their late 20s to early 30s. I find that hitters play within two percent of their peaks from ages 26 to 32; for a player with an .800 OPS peak, that’s a range from about .780 to .800 for about seven years. I find it interesting that James ends the study that started it all with a statement with which I agree:

Good hitters stay around, weak hitters don’t. Most players are declining by age 30; all players are declining by age 33. There are differences in rates of decline, but those differences are far less significant for the assessment of future value than are the differing levels of ability.

It’s best not to get caught up in thinking about exact peaks; instead, focus on the peak range, which is flat over many years. A player hitting his 27th birthday isn’t starting on the downside of his career; he’s approaching his peak and will be there for many years. A guy who’s thirty may still be playing his best baseball, too. But the most important factor to consider when evaluating a player is the innate talent of the player himself.

4 Responses to “On Other Methods for Estimating Aging”

  1. Colin Wyers says:

    I am working on a revised set of figures that includes an accounting for average yearly OPS. (I may also throw in some park factors and present both analyses.)

    I do want to quibble with the characterization of the samples presented as “cherry-picked.” The 2007-to-2009 sample was picked because it was the only data I had in the modern era that wasn’t included in your study – the idea being to test the estimates on out-of-sample data. I hope to have 2009 data added to my database this weekend, and I can add that as well. The other sample was picked because it covered the years I had a full position breakdown handy for, in order to present the position-switch data along with the hitting data.

  2. Colin Wyers says:

    I reran the query on “normalized” OPS, which is OPS divided by the league average that year and multiplied by the average OPS from 1995 to 2008. Instead of finding a dropoff of .006 points, I found a dropoff of .007 points. If you are interested, I can share with you both the results (in an Excel spreadsheet) as well as the SQL code I used to pull the data from the Baseball Databank.

    Also, you have yet to address the other points, about position changing and attrition rates.

  3. JC says:

    I don’t need the spreadsheet. My study is on 86 years of data—also breaking the sample into eras—which normalizes performance and controls for many other factors. I don’t find what you’ve presented to be compelling evidence that my findings are mistaken.

    The paper wasn’t about aging in fielding, and even if that were at issue, I’m not sure what you did says much.
    On attrition, I removed the sample restrictions and found that the peak did not change.

  4. Yaramah says:

    Just out of curiosity, is there any evidence from your study or others that PEDs had any effect on performance as it related to aging? I thought one of the purported effects of PEDs was that they allowed you to maintain your highest level of performance for longer and to overcome the niggling injuries that affected other players. Just curious whether the data set from the ’90s and beyond is any different from those of earlier eras.