From The Hardball Times: A Decision Tree Approach to Pitch Prediction

“A Decision Tree Approach to Pitch Prediction” is an article about using decision trees to predict pitch type from PITCHf/x data. The difficulty is that the bar for useful results is very high. For any pitcher, the obvious first step in predicting pitch type is to look at pitch percentages. If a pitcher throws fastballs (two- or four-seamers, not cutters) 65% of the time, then always guessing fastball is right 65% of the time, and that is the accuracy any other method has to beat. There are some good comments at the end of the article that emphasize this point. I left the following comment:

‘This is a very interesting and well-written article. It shows that there is a lot of value in PITCHf/x data. I have a few suggestions for refining your research.

1. I second Matt’s suggestion: “On another note, one possible way to improve the error rate of the model would be to default to fastball unless the alternative had a large enough sample and high enough percentage of non-fastball use. While it looks like you already did that to a certain degree, raising the minimum requirements for a non-fastball guess could improve the accuracy (even if it means the model predicts fastballs 80% of the time even when we know the pitcher only throws fastballs 60% of the time).” Such a threshold will also help with the overfitting problems.

2. You should also consider using a random forests algorithm. This is an ensemble of decision trees. There are free packages for this algorithm available in various computer languages. A web-based implementation is also available (note that it is not free).

3. I suspect that your modeling of batters faced is causing problems. You may want to use something simpler, such as the batter’s place in the order (top third, middle third, bottom third) plus a separate category for pinch hitters, and discard pitchers from the data.

4. For situations in which your model does not meet the threshold mentioned in #1, you should consider using another random forests algorithm with this data and shallow trees and less refined attributes. The goal here is to generate predictions that are statistically relevant yet better than using pitch % data.’
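To make the baseline and suggestion #1 concrete, here is a minimal sketch in Python. The data are hypothetical toy counts, not real PITCHf/x records, and the names `fit_threshold_rule`, `min_n`, and `min_share` are my own assumptions for illustration, not anything from the article. The rule guesses fastball unless a situation has both a large enough sample and a dominant non-fastball pitch.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: (ball-strike count, pitch thrown). Not real PITCHf/x records.
pitches = (
    [("0-0", "FB")] * 14 + [("0-0", "SL")] * 6
    + [("0-2", "SL")] * 9 + [("0-2", "FB")] * 3
    + [("3-1", "CH")] * 2 + [("3-1", "FB")] * 1
)

def baseline_accuracy(data):
    """Accuracy of always guessing the pitcher's most common pitch."""
    counts = Counter(pitch for _, pitch in data)
    return counts.most_common(1)[0][1] / len(data)

def fit_threshold_rule(data, default="FB", min_n=10, min_share=0.7):
    """Map a situation to a non-default pitch only when the situation has
    at least min_n pitches and that pitch makes up at least min_share of them."""
    by_situation = defaultdict(Counter)
    for situation, pitch in data:
        by_situation[situation][pitch] += 1
    rule = {}
    for situation, counts in by_situation.items():
        pitch, n = counts.most_common(1)[0]
        total = sum(counts.values())
        if pitch != default and total >= min_n and n / total >= min_share:
            rule[situation] = pitch
    return rule

def predict(rule, situation, default="FB"):
    """Fall back to the default pitch for any situation not in the rule."""
    return rule.get(situation, default)

rule = fit_threshold_rule(pitches)
hits = sum(predict(rule, s) == p for s, p in pitches)
print("baseline:", baseline_accuracy(pitches))
print("threshold rule (in-sample):", hits / len(pitches))
```

In this toy data the 3-1 count leans toward the change-up, but with only three pitches it fails the sample-size requirement and the rule still guesses fastball. The accuracy printed for the rule is measured on the same data used to fit it, so a real comparison would need held-out pitches.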

For those who know both machine learning algorithms and baseball, there are some obvious methods that may be more promising and that could improve the author’s analysis. I have a project using machine learning for baseball stats that has unfortunately been relegated to the back burner due to other machine learning projects that consume most of my time. However, this article gives me some new ideas to pursue.

Below is an excerpt from the article.

The examples I’ve mentioned suggest that the patterns are there, and that if we look hard enough, I think we’ll find them.

We also have a ton of information that gives us the appropriate context to identify the patterns we’re looking for. The PITCHf/x database contains records for pitch selection, situational information, events preceding a pitch, and pitch outcomes. We can’t really ask for much more than that.

All of this information initially can seem overwhelming. Using it to identify patterns by hand would be time-consuming, and we would end up missing things.

Sometimes the patterns we see are obvious, as was the case with Greinke. But what if a pitcher is extremely predictable in a situation we aren’t prone to notice? How often do you think Greinke throws fastballs in two-strike counts when he’s just thrown a breaking ball and there’s a runner on third?

Instead of doing things by hand, we can use a model to do our pattern recognizing for us. This model should be flexible, allowing us to throw in many different bits of information, and it would use the most important factors we provide it with to make predictions. Once we have this model in place, we’ll show it a bunch of data specific to one pitcher. The model will arrange the data in the way that best predicts the next pitch to be thrown.

After asking around*, I decided to work with a decision tree. Decision trees are great at taking a bunch of data, picking up on trends, and displaying the data in a way that allows its viewer to follow these trends.

One clear benefit to a decision tree, as opposed to other machine learning techniques, is that its mechanics are pretty easy to understand. The data start at the top of the tree and get filtered through the tree’s branches. At each level, the tree sorts the data through various yes/no questions as it refines its prediction. The most important questions are asked at the top of the tree, and the questions asked toward the bottom refine the tree’s initial guesses.

A quick example: Let’s say the first branch of Greinke’s tree is the handedness of the hitter he’s facing. If the hitter is a righty, Greinke’s overall pitch distribution changes a bit. He doesn’t throw his change-up much (about 4.5 percent of the time, vs. 12 percent overall**), and he throws his slider more often. Against lefties, the opposite is true. After filtering the data through this first branch, our guess improves as we move from his overall distribution to his handedness-specific distribution.
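(An aside that is not part of the excerpt.) The handedness branch described above amounts to replacing one overall pitch distribution with two conditional ones. A minimal Python sketch, using hypothetical counts that only mimic the shape of the example and are not Greinke’s actual numbers:

```python
from collections import Counter

# Hypothetical rows: (batter hand, pitch). Not Greinke's real PITCHf/x data.
pitches = (
    [("R", "FB")] * 60 + [("R", "SL")] * 30 + [("R", "CH")] * 5 + [("R", "CU")] * 5
    + [("L", "FB")] * 60 + [("L", "SL")] * 10 + [("L", "CH")] * 20 + [("L", "CU")] * 10
)

def distribution(data):
    """Share of each pitch type among the given rows."""
    counts = Counter(pitch for _, pitch in data)
    total = sum(counts.values())
    return {pitch: n / total for pitch, n in counts.items()}

overall = distribution(pitches)
vs_right = distribution([row for row in pitches if row[0] == "R"])
vs_left = distribution([row for row in pitches if row[0] == "L"])

for name, dist in (("overall", overall), ("vs R", vs_right), ("vs L", vs_left)):
    print(name, {pitch: round(share, 3) for pitch, share in sorted(dist.items())})
```

With these made-up counts the change-up share drops against righties and the slider share rises, which is the pattern the excerpt describes.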

Another advantage to the decision tree is that it doesn’t allow useless information to skew our results. For instance, if I included jersey color as a variable, the model’s suggestions wouldn’t change, and the important patterns still would be doing the predicting.
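(Another aside that is not part of the excerpt.) One way to see why a useless variable gets ignored is to look at how a tree might score candidate splits. The sketch below uses Gini impurity reduction, a common splitting criterion; the article does not say which criterion its tree used, so treat that choice as my assumption. The rows are hypothetical and jersey color is deliberately constructed to carry no information about the pitch.

```python
from collections import Counter

# Hypothetical rows: (batter_hand, jersey_color, pitch).
rows = []
for jersey in ("home", "road"):
    rows += [("R", jersey, "FB")] * 6 + [("R", jersey, "SL")] * 4
    rows += [("L", jersey, "FB")] * 6 + [("L", jersey, "CH")] * 4

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class shares."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def gini_gain(data, feature_index):
    """Impurity before the split minus the size-weighted impurity after it."""
    parent = gini([row[-1] for row in data])
    groups = {}
    for row in data:
        groups.setdefault(row[feature_index], []).append(row[-1])
    children = sum(len(g) / len(data) * gini(g) for g in groups.values())
    return parent - children

gains = {"batter_hand": gini_gain(rows, 0), "jersey_color": gini_gain(rows, 1)}
best = max(gains, key=gains.get)
print(gains, "-> split on", best)
```

Splitting on jersey color leaves the pitch mix in each group unchanged, so its gain is essentially zero and the split on batter handedness wins. With real data an irrelevant variable can still show a small gain by chance, which is one reason the overfitting concerns raised in the comments matter.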

If the model was reliable, you could put it to use right away. A big league coach, with a single sheet of paper in his hand, could follow the game and signal in pitch guesses if the situation calls for it. Pretty cool, right?

The entire article can be read at The Hardball Times.
