## Predicting No-Hitters

On Rob Neyer’s page, Bill James tackles an interesting question (see also the discussion on Primer):

Rob, when we were at Fenway you posed the question, Who was the most likely pitcher to have thrown a no-hitter not to have thrown one. On the plane home I thought of what I assumed had to be the correct answer, which is Roger Clemens. Clemens has never thrown a no-hitter at any level: majors, minors, college, high school, amateur, little league.

But actually, Clemens is not the answer to the question, amazingly enough.

To find the answer to his question, James looks at the likelihood that a pitcher will throw a no-hitter based on his career out percentage [(3 * IP) / ((3 * IP) + hits)] and his number of starts. The simple logic behind this is that the higher a pitcher’s out percentage and the more starts he makes, the more no-hitters he should throw. Using this method, James finds that Don Sutton, not Roger Clemens, is the answer to the question. Clemens is number three, behind Pedro Martinez.
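As a rough sketch of the calculation described above: the post gives the out-percentage formula but not James’s exact no-hitter formula, so the per-start step below is an assumption (every batter is treated as an independent trial, and a no-hitter needs 27 outs before the first hit). The function names and the sample career line are made up for illustration.

```python
def out_percentage(innings_pitched: float, hits: int) -> float:
    """Career out percentage: (3 * IP) / ((3 * IP) + hits).

    `innings_pitched` is a true decimal value (200.333..., not the
    box-score shorthand 200.1).
    """
    outs = 3 * innings_pitched
    return outs / (outs + hits)


def expected_no_hitters(out_pct: float, starts: int, outs_needed: int = 27) -> float:
    """Expected career no-hitters, ASSUMING each start is an independent
    attempt that succeeds when `outs_needed` outs come before the first hit."""
    per_start = out_pct ** outs_needed
    return starts * per_start


# Hypothetical career line (not a real pitcher): 3,000 IP, 2,500 hits, 400 starts.
pct = out_percentage(3000.0, 2500)
print(f"out percentage: {pct:.4f}")
print(f"expected no-hitters: {expected_no_hitters(pct, 400):.2f}")
```

Because the out percentage is raised to the 27th power, small differences in it move the expected count a lot, which is why both it and the number of starts matter in this kind of ranking.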

But I thought of another method to predict no-hitters. No-hit games are simply the product of the out percentage; however, the likelihood of a pitcher throwing a no-hitter may depend on how he tends to get those outs. Pitchers who strike out lots of batters (like Nolan Ryan, Randy Johnson, and Roger Clemens) are less dependent on their fielders to get outs. Might not these guys have a better chance of throwing no-hitters than pitchers with identical out percentages but lower K-rates?

Well, I decided to check it out using a Poisson regression procedure. Using DIPS/FIP as my motivator for pitchers’ abilities to prevent runs, I estimated the number of no-hitters as a function of strikeouts, walks, home runs, and games started. Ks and HRs are obviously indicators of pitcher success in generating outs, with the former generating outs and the latter preventing them. Walks are a bit iffy. I included them largely because they are one of the “big 3” stats in FIP, but there is also a plausible mechanism: pitchers who walk more batters pitch more out of the stretch and open up infield holes. In the end, though, walks did not seem to have much of a relationship with no-hitters.
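For readers unfamiliar with Poisson regression, here is a minimal sketch of the technique in plain Python. It is not the model behind the tables below: the data are invented, only two of the four predictors (strikeouts and games started) are included to keep it short, and the fitting routine is a generic damped Newton's method rather than whatever statistical package was actually used.

```python
import math


def solve(a, b):
    """Solve the linear system a @ x = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def predict(beta, features):
    """Predicted count exp(b0 + b1*x1 + ...) for one pitcher."""
    return math.exp(beta[0] + sum(b * x for b, x in zip(beta[1:], features)))


def fit_poisson(rows, counts, max_iter=100, tol=1e-10):
    """Maximum-likelihood Poisson regression (log link) via Newton's method."""
    design = [[1.0] + list(r) for r in rows]  # prepend an intercept column
    p = len(design[0])
    # Start from the intercept-only fit.
    beta = [math.log(sum(counts) / len(counts))] + [0.0] * (p - 1)
    for _ in range(max_iter):
        mu = [math.exp(sum(b * x for b, x in zip(beta, xi))) for xi in design]
        grad = [sum(xi[j] * (y - m) for xi, y, m in zip(design, counts, mu))
                for j in range(p)]
        info = [[sum(xi[j] * xi[k] * m for xi, m in zip(design, mu))
                 for k in range(p)] for j in range(p)]
        step = solve(info, grad)
        largest = max(abs(s) for s in step)
        if largest > 1.0:  # damp very large steps to keep exp() well-behaved
            step = [s / largest for s in step]
        beta = [b + s for b, s in zip(beta, step)]
        if largest < tol:
            break
    return beta


# INVENTED toy data, not real pitchers:
# (career strikeouts in thousands, games started in hundreds) -> no-hitters
rows = [(0.9, 3.0), (1.5, 2.0), (1.1, 4.5), (2.2, 3.0), (2.0, 5.0),
        (3.0, 4.0), (2.8, 6.0), (4.0, 5.0), (3.6, 3.5), (1.8, 3.8)]
counts = [0, 0, 0, 1, 0, 1, 1, 2, 1, 0]

beta = fit_poisson(rows, counts)
for features, actual in zip(rows, counts):
    print(features, actual, round(predict(beta, features), 2))
```

The "Predicted" column in a table like the ones below is the analogue of `predict(beta, features)`: a fractional expected count that can be compared with the whole number of no-hitters a pitcher actually threw.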

Here is the list of the Top-25 predicted no-hitters:

Rank | First | Last | Debut | Predicted | Actual |
---|---|---|---|---|---|
1 | Nolan | Ryan | 1966 | 6.97 | 7 |
2 | Randy | Johnson | 1988 | 2.06 | 2 |
3 | Roger | Clemens | 1984 | 2.03 | 0 |
4 | Walter | Johnson | 1907 | 1.95 | 1 |
5 | Cy | Young | 1890 | 1.74 | 3 |
6 | Steve | Carlton | 1965 | 1.72 | 0 |
7 | Don | Sutton | 1966 | 1.23 | 0 |
8 | Tom | Seaver | 1967 | 1.21 | 1 |
9 | Bert | Blyleven | 1970 | 1.20 | 1 |
10 | Pedro | Martinez | 1992 | 1.15 | 0 |
11 | Tim | Keefe | 1880 | 1.14 | 0 |
12 | Gaylord | Perry | 1962 | 1.12 | 1 |
13 | Rube | Waddell | 1897 | 1.04 | 0 |
14 | Greg | Maddux | 1986 | 0.98 | 0 |
15 | Christy | Mathewson | 1900 | 0.95 | 2 |
16 | Eddie | Plank | 1901 | 0.95 | 0 |
17 | Sam | McDowell | 1961 | 0.95 | 0 |
18 | Phil | Niekro | 1964 | 0.89 | 1 |
19 | Bob | Gibson | 1959 | 0.89 | 1 |
20 | Pud | Galvin | 1875 | 0.80 | 2 |
21 | Tommy | John | 1963 | 0.78 | 0 |
22 | David | Cone | 1986 | 0.72 | 1 |
23 | Pete | Alexander | 1911 | 0.70 | 0 |
24 | Bob | Feller | 1936 | 0.69 | 3 |
25 | Sandy | Koufax | 1955 | 0.67 | 4 |

And here are the Top-27 pitchers without no-hitters (Why 27? To get Tom Glavine on the list, of course.):

Rank (No No-Hitters) | Overall Rank | First | Last | Debut | Predicted |
---|---|---|---|---|---|
1 | 3 | Roger | Clemens | 1984 | 2.03 |
2 | 6 | Steve | Carlton | 1965 | 1.72 |
3 | 7 | Don | Sutton | 1966 | 1.23 |
4 | 10 | Pedro | Martinez | 1992 | 1.15 |
5 | 11 | Tim | Keefe | 1880 | 1.14 |
6 | 13 | Rube | Waddell | 1897 | 1.04 |
7 | 14 | Greg | Maddux | 1986 | 0.98 |
8 | 16 | Eddie | Plank | 1901 | 0.95 |
9 | 17 | Sam | McDowell | 1961 | 0.95 |
10 | 21 | Tommy | John | 1963 | 0.78 |
11 | 23 | Pete | Alexander | 1911 | 0.70 |
12 | 27 | J.R. | Richard | 1971 | 0.66 |
13 | 28 | Bob | Veale | 1962 | 0.63 |
14 | 30 | Mickey | Welch | 1880 | 0.60 |
15 | 31 | Jerry | Koosman | 1967 | 0.60 |
16 | 33 | John | Smoltz | 1988 | 0.58 |
17 | 34 | Chuck | Finley | 1986 | 0.58 |
18 | 35 | Lefty | Grove | 1925 | 0.57 |
19 | 36 | Mickey | Lolich | 1963 | 0.56 |
20 | 37 | Fergie | Jenkins | 1965 | 0.55 |
21 | 40 | Early | Wynn | 1939 | 0.54 |
22 | 42 | Rick | Reuschel | 1972 | 0.54 |
23 | 44 | Toad | Ramsey | 1885 | 0.53 |
24 | 45 | Tony | Mullane | 1881 | 0.53 |
25 | 46 | Frank | Tanana | 1973 | 0.53 |
26 | 48 | Kid | Nichols | 1890 | 0.53 |
27 | 49 | Tom | Glavine | 1987 | 0.52 |

This method has Clemens not just as the most likely pitcher never to have thrown a no-hitter, but also third on the all-time list of predicted no-hitters. This is a little more supportive of James’s intuition. And I suspect this intuition is nurtured by a belief that Rocket’s pitching style is conducive to no-hit games.

One thing I like about the model is how well it predicts Ryan. Even when I throw Ryan out of the sample when estimating the regression, it still predicts that he should have thrown about 7 no-hitters. My model misses Koufax badly, but so does James’s. What that tells me is not that either model is bad, but that Koufax was really lucky to throw 4 no-hitters.

On a final note, I estimated the regressions based on all pitchers with at least 100 games started, but that ended up kicking out a few guys with no-hitters, which may have affected the results. Also, I have not double-checked my numbers as much as I would have liked; I just don’t have the time right now. I would be happy to share my data with anyone who wants to check what I did. Finally, I did run some of the standard tests used for count regressions and, generally, a straight-up Poisson model seemed the right way to go. The results with a negative binomial regression were not much different.

As always I welcome thoughts and suggestions.

Perhaps the model is off on Koufax because of the enormous disparity between his early-career statistics and his five-year dominant period. He was basically a mediocre pitcher until the early to mid 60s, when he became an entirely different animal altogether.

Figuring out the probability of no-hitters has been a source of interest to me for some time. I had an article published in the Baseball Research Journal back in 1993 that came to the same conclusion about Ryan being near his expected number of no-hitters, but that was because I messed up my calculation of the binomial distribution. (What I actually showed was the probability that Ryan would pitch at least one no-hitter in seven hypothetical careers, which is of some interest, I suppose, but not what I was intending to show.) In any event, my approach to estimating the chance of a single no-hitter in a career was a lot sounder. It was similar to James’s, with two differences. First, I assumed 27 outs. Second, instead of calculating the predicted number of no-hitters, I calculated the probability of getting at least one. I used the same method to compare Clemens with Sutton, and under this method Clemens came out on top: his probability of at least one no-hitter was .505, while Sutton’s was .4975. Using the 26-out assumption, Clemens beat out Sutton by .6134 to .5901. It’s also interesting to note that Clemens’s lack of a no-hitter is not, statistically speaking, much of an upset at all.
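The "at least one" calculation described in this comment can be sketched as follows. The exact formula is not given in the comment, so this assumes independent starts, each succeeding with probability equal to the out percentage raised to the number of outs required; the inputs are hypothetical and are not meant to reproduce the Clemens or Sutton figures.

```python
def prob_at_least_one(out_pct: float, starts: int, outs_needed: int = 27) -> float:
    """P(at least one no-hitter in `starts` starts), ASSUMING independent
    starts that each succeed with probability out_pct ** outs_needed."""
    per_start = out_pct ** outs_needed
    return 1.0 - (1.0 - per_start) ** starts


# Hypothetical pitcher: out percentage .790 over 500 starts.
for outs in (27, 26):
    print(outs, round(prob_at_least_one(0.79, 500, outs), 4))
```

Dropping the requirement from 27 outs to 26 raises the per-start chance, so the career probability rises too, which is the direction of the shift between the two pairs of figures quoted above.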

I have two other observations. Although the probability of Ryan pitching at least 7 no-hitters under a binomial distribution assumption was a long shot (more than 100 to one), he was much, much more likely to do it than anyone else. Although Clemens is a great pitcher with a long career, Ryan was 1,000 times more likely to pitch 7 no-hitters in his career than Clemens would have been. (Moderate differences can get magnified at the extremes.)
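The magnification at the extremes is easy to see with a binomial tail probability. This is only an illustration: the per-start probabilities and the number of starts below are hypothetical, not estimates for Ryan or Clemens.

```python
import math


def binom_pmf(k: int, n: int, p: float) -> float:
    """P(exactly k successes in n independent trials), computed in log space
    so that large n does not overflow. Requires 0 < p < 1."""
    log_term = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                + k * math.log(p) + (n - k) * math.log1p(-p))
    return math.exp(log_term)


def prob_at_least(k: int, n: int, p: float) -> float:
    """P(at least k successes in n independent trials)."""
    return sum(binom_pmf(i, n, p) for i in range(k, n + 1))


# Two hypothetical pitchers with 700 starts each; one is three times as
# likely as the other to throw a no-hitter in any given start.
starts = 700
strong_p, weak_p = 0.003, 0.001

ratio_one = prob_at_least(1, starts, strong_p) / prob_at_least(1, starts, weak_p)
ratio_seven = prob_at_least(7, starts, strong_p) / prob_at_least(7, starts, weak_p)
print(f"advantage for at least 1 no-hitter:  {ratio_one:.1f}x")
print(f"advantage for at least 7 no-hitters: {ratio_seven:.1f}x")
```

A threefold edge per start becomes a far larger edge in the chance of reaching seven, which is the sense in which moderate differences get magnified.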

I also wonder how valid these random approaches are for estimating something like a no-hitter. It seems that no-hitters are more common relative to one-hitters than they should be under probability theory. (I haven’t seen any hard data on the relative frequency of the two; it’s just an observation based on anecdotal evidence.) For an average pitcher (opponents bat .250 against him), a one-hitter is 9 times as likely as a no-hitter. (Any individual combination is 1/3 as likely, but there are 27 times as many ways to get a one-hitter as a no-hitter.) I’d be surprised if one-hitters were nine times as frequent. Even for Ryan, the ratio works out to something like 6.75 to one, and Ryan didn’t have 40-odd one-hitters. Maybe pitchers just give it something extra when they are close to a no-hitter. (That’s a little like the idea that 99-year-olds hang on to make it to 100 more often than actuarial tables say they should.)
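The 9-to-1 figure can be checked with a few lines of code. This follows the simplification the comment appears to use, which is an assumption on my part: a game is treated as exactly 27 independent at-bats, ignoring walks, errors, and the extra batter a hit adds.

```python
from math import comb


def prob_hits(hits: int, opp_avg: float, batters: int = 27) -> float:
    """P(exactly `hits` hits in `batters` independent at-bats)."""
    return comb(batters, hits) * opp_avg ** hits * (1 - opp_avg) ** (batters - hits)


def one_hitter_ratio(opp_avg: float) -> float:
    """How many times more likely a one-hitter is than a no-hitter."""
    return prob_hits(1, opp_avg) / prob_hits(0, opp_avg)


print(round(one_hitter_ratio(0.250), 2))  # average pitcher in the comment
print(round(one_hitter_ratio(0.200), 2))  # a .200 opponents' average
```

The ratio reduces to 27 × avg / (1 − avg), so it is 9 at a .250 opponents' average and 6.75 at .200, the latter matching the figure quoted for Ryan.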
