Data Policy: public, requires attribution

2018 MLB PITCHf/x Data

PITCHf/x, created and maintained by Sportvision, is a system that tracks the speeds and trajectories of pitched baseballs. This system, which made its debut in the 2006 MLB playoffs, is installed in every MLB stadium. Broadcasters often use its data to show a visual representation of the pitch and whether or not it entered the strike zone. PITCHf/x is also used to determine the type of pitch thrown, such as a fastball, curve, or slider. MLB uses the data from PITCHf/x in its Zone Evaluation System, which is used to grade and provide feedback to umpires. Sabermetric analysts note that umpire accuracy has improved since the technology was introduced to MLB.

This dataset was collected from Baseball Savant and contains every pitch thrown during the 2018 baseball season.

In total, this dataset contains:

Total Games 2,397
Unique Pitchers 799
Unique Batters 990
Total Pitches 721,190
Total At Bats 182,352
Total Strikeouts 41,043

Each pitch contains a number of different attributes including:

The full data dictionary can be found here

Analysis Goals

What made a pitch an outlier last year? In 2018, there were more strikeouts than hits for the first time in baseball history. The season also saw over 1,000 pitches recorded at 100 mph or more, and breaking balls continued to make up a growing share of pitches thrown.

So, hitters are striking out more. Could this be a result of an increase in pitch speed, an unexpected over-reliance on breaking balls, or something else? In this analysis, we will try to determine which physical factors best make a pitch unique. Along the way, we will investigate individual pitchers and highlight notable at-bats.

We will be focusing on the following features:

Data Exploration

Load and Clean

First, load the data from files into a pandas dataframe. We will perform some preprocessing on the raw data:

This dataset was scraped from Baseball Savant, and it does not include pitchers' names. So, the first step after loading the data will be to map pitcher IDs to pitcher names.
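As a minimal sketch of that mapping step, the snippet below attaches names to a toy dataframe. The column name `pitcher`, the lookup dictionary, and every value in it are made up for illustration; the real lookup would come from a separate player-id source that is not shown here.

```python
import pandas as pd

# Toy pitch rows; `pitcher` holds a numeric id (column name is an assumption).
pitches = pd.DataFrame({
    "pitcher": [101, 102, 101, 103],
    "pitch_type": ["FF", "SL", "CH", "CU"],
})

# Hypothetical lookup from pitcher id to display name.
id_to_name = {101: "Pitcher A", 102: "Pitcher B"}

# Series.map yields NaN for ids missing from the lookup, so fill them
# explicitly to keep the gaps visible rather than dropping rows.
pitches["pitcher_name"] = pitches["pitcher"].map(id_to_name).fillna("unknown")

print(pitches)
```

Keeping unmatched ids as an explicit `"unknown"` value makes it easy to check how many pitches could not be attributed before moving on.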

Now, our dataset is ready for analysis.

Exploratory Data Analysis

Now that our data is loaded properly, let's look at overall pitch statistics by pitch type.

Every baseball game features hundreds of pitches from 60 feet, 6 inches, each serving one defined purpose: to defeat a hitter. Of course, the players those pitches are designed to conquer have an entirely different goal in mind.

Virtually every Major League pitcher throws a combination of pitches, with starting pitchers often owning an arsenal of three or more offerings. Relief pitchers, who infrequently face the same batter more than once in a game, have historically succeeded with the help of one or two pitch types that are thrown with maximum effort.

The types of pitches recorded by Statcast are the following:

What makes a pitch hard to hit? Several physical factors matter, including the pitch's movement, speed, and rotation. So do non-physical factors, such as the pitcher's intimidation and the catcher's ability to predict what the batter might be looking for.

First, let's look at these physical factors split by pitch type.
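One way to produce such a split is a pandas `groupby`, sketched here on made-up rows. The columns `pfx_x` and `pfx_z` appear in this notebook; `release_speed`, `release_spin_rate`, and `p_throws` are assumed to follow Statcast naming, and the numbers are illustrative only, not taken from the dataset.

```python
import pandas as pd

# Made-up rows standing in for the real pitch table.
pitches = pd.DataFrame({
    "pitch_type": ["FF", "FF", "CU", "CU"],
    "p_throws": ["R", "L", "R", "L"],           # pitcher handedness (assumed column)
    "release_speed": [95.0, 93.0, 80.0, 78.0],  # mph
    "release_spin_rate": [2300.0, 2200.0, 2600.0, 2500.0],
    "pfx_x": [-0.5, 0.5, 0.8, -0.8],            # horizontal movement
    "pfx_z": [1.4, 1.3, -0.9, -0.8],            # vertical movement
})

features = ["release_speed", "release_spin_rate", "pfx_x", "pfx_z"]

# Mean of each physical feature per pitch type.
by_type = pitches.groupby("pitch_type")[features].mean()

# The same summary, further split by the pitcher's throwing hand.
by_type_and_hand = pitches.groupby(["pitch_type", "p_throws"])[features].mean()

print(by_type)
print(by_type_and_hand)
```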

Interestingly, we see that left-handed pitchers have a lower average release speed than right-handed pitchers. Spin rate averages, pfx_x, and pfx_z appear to be split between lefties and righties. What could cause this relatively large discrepancy in speed? One theory is that left-handed pitchers are highly coveted in baseball, since there are far fewer lefties than righties. This shortage could potentially lead to two things:

  1. Naturally right-handed people may learn to throw with their left hand, since it gives them a better chance of making the big leagues. Throwing with their non-dominant hand could result in decreased performance.
  2. Similarly, since there are fewer left-handed pitchers, the bar for making the bigs could be lower for southpaws.

Let's interpret these charts. A few things to note:

  1. Four-seam fastballs (FF) are the straightest and fastest pitches. By straightest, I mean that they have the highest pfx_z. That is, they have the least drop.
  2. Curveballs (CU, KC), as their name implies, have the most drop.
  3. Breaking pitches (SL, CU, FC, KC) have more horizontal movement than other pitches. That is, they break towards or away from the batter.
  4. Look at the changeup (CH). It looks like a two-seam fastball (FT) in terms of horizontal and vertical break, but it is thrown at a much slower speed. This makes it a very effective pitch when the pitcher can tell that the batter is expecting a fastball, causing him to swing early.

At Bats

Using the data we can replay individual at bats from 2018. An at bat is a single batter's turn to hit against a pitcher. During an at bat, the pitcher throws pitches to the batter, with each pitch being one of the following:

The at bat continues until one of the following occurs:

* A batter cannot strike out on a foul ball.

Chris Sale vs Aaron Judge

Now, let's look at an individual at bat and see what we can get out of the data. Let's look at Aaron Judge's at bat against Chris Sale on June 30th. Judge struck out on Sale's fastball.

The data has the location of the ball as it crosses the batter's box. To define the vertical extent of the "strike zone", we use the mean values of sz_top and sz_bot across the entire year. The data does not contain the horizontal extent of the strike zone, so I have set it to [-1,1]. In the graphic below, we show all pitches of the at-bat, colored by the result of the pitch, where {'red': 'strike', 'green': 'ball', 'blue': 'ball in play'}, and the batter is indicated by the ellipse. Whiffs are also outlined in black. A whiff is a pitch that the batter swung at without making contact.
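The zone definition above can be sketched in a few lines of plain Python. The fields `sz_top` and `sz_bot` come from this notebook; `plate_x` and `plate_z` are assumed names for the pitch location, and the rows are made up for illustration.

```python
from statistics import mean

# Made-up pitches: location at the plate plus the per-pitch zone limits.
pitches = [
    {"plate_x": 0.2, "plate_z": 2.5, "sz_top": 3.5, "sz_bot": 1.5},
    {"plate_x": 1.4, "plate_z": 2.0, "sz_top": 3.3, "sz_bot": 1.7},
    {"plate_x": -0.3, "plate_z": 4.0, "sz_top": 3.4, "sz_bot": 1.6},
]

# Vertical extent: mean of sz_top and sz_bot over all pitches in the sample.
zone_top = mean(p["sz_top"] for p in pitches)
zone_bot = mean(p["sz_bot"] for p in pitches)

# Horizontal extent: fixed by hand to [-1, 1], as described in the text.
ZONE_LEFT, ZONE_RIGHT = -1.0, 1.0


def in_zone(pitch):
    """Return True if the pitch location falls inside the rectangular zone."""
    inside_x = ZONE_LEFT <= pitch["plate_x"] <= ZONE_RIGHT
    inside_z = zone_bot <= pitch["plate_z"] <= zone_top
    return inside_x and inside_z


flags = [in_zone(p) for p in pitches]
print(flags)
```

Using one averaged zone for every batter is a simplification: it ignores differences in batter height, which the per-pitch `sz_top` and `sz_bot` values would otherwise capture.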

Looking at this, we see that Judge didn't swing at any of Sale's sliders, choosing to wait on a fastball. He swung at the changeup, most likely thinking it was a fastball, and then ultimately struck out on a fastball.

Notice that there are four strikes here. That means that Aaron Judge fouled off a pitch with two strikes, which does not count as a third strike.

Blake Snell vs Aaron Judge

We can also look at an individual pitcher's set of pitches as they compare to other pitchers.

Now, let's add whiff rate to our analysis. Whiff% is the number of pitches swung at and missed divided by the total number of swings in a given sample. This is usually a great indicator of the effectiveness of a pitcher's pitch.
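A rough sketch of the Whiff% calculation follows. It assumes each pitch has a `description` column with outcome labels like those below; the label names, the choice to treat foul tips as contact, and the toy rows are all assumptions rather than values confirmed by this dataset.

```python
import pandas as pd

# Assumed outcome labels: swings that miss, and swings that make contact.
WHIFFS = {"swinging_strike", "swinging_strike_blocked"}
CONTACT = {"foul", "foul_tip", "hit_into_play"}

# Made-up rows for illustration.
pitches = pd.DataFrame({
    "pitch_type": ["FF", "FF", "FF", "SL", "SL", "SL", "SL"],
    "description": [
        "swinging_strike", "foul", "ball",
        "swinging_strike", "swinging_strike_blocked",
        "hit_into_play", "called_strike",
    ],
})

# Flag whiffs and swings; taken pitches (balls, called strikes) are neither.
pitches["whiff"] = pitches["description"].isin(WHIFFS)
pitches["swing"] = pitches["description"].isin(WHIFFS | CONTACT)

# Whiff% = whiffs / swings, computed per pitch type.
totals = pitches.groupby("pitch_type")[["whiff", "swing"]].sum()
whiff_pct = 100 * totals["whiff"] / totals["swing"]

print(whiff_pct)
```

Note that the denominator is swings, not pitches thrown, so a pitch that batters rarely offer at can still have a high Whiff%.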

ES Data Frame Analytics

Now, let's index the data into Elasticsearch.

Index Data
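The indexing code itself is not shown here. As a hedged sketch only, the helper below turns pitch rows into action dictionaries in the shape accepted by the `bulk` helper of the Python `elasticsearch` client. The function name `to_bulk_actions`, the index name `mlb-pitches-2018`, and the rows are hypothetical, and no cluster is contacted.

```python
def to_bulk_actions(rows, index_name):
    """Yield one bulk action per pitch row, using the row position as the id."""
    for i, row in enumerate(rows):
        yield {"_index": index_name, "_id": i, "_source": row}


# Made-up rows standing in for records from the pitch dataframe.
rows = [
    {"pitch_type": "FF", "release_speed": 95.0},
    {"pitch_type": "CU", "release_speed": 80.0},
]

# Hypothetical index name.
actions = list(to_bulk_actions(rows, "mlb-pitches-2018"))
print(actions)

# With a running cluster and the `elasticsearch` package installed, these
# actions could then be passed to elasticsearch.helpers.bulk(client, actions).
```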

Create Data Frame and Analytics

Analysis

All Pitches

Let's look at the top outliers for all pitchers' pitches.

We see that there is a good mix of pitches in here, with CH, FS, SL, SI, and CU all appearing in the top 10.

Garrett Richards - Curveball

The most outlying pitch of 2018 is Garrett Richards's curveball, with the highest contributing factors being its vertical movement and its spin rate.

In fact, Garrett Richards's curveball has, on average, more movement than any other pitch in the game.

Pat Venditte


A switch-pitcher -- very rare

Partition Analyses by Pitch Type

Let's create separate dataframes for each pitch type so we can run analytics separately.
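A minimal sketch of that partitioning step is a dictionary of dataframes keyed by pitch type. The rows are made up, and the name `frames_by_type` is only illustrative.

```python
import pandas as pd

# Made-up rows for illustration.
pitches = pd.DataFrame({
    "pitch_type": ["FF", "CU", "FF", "CU", "CU"],
    "release_speed": [95.0, 80.0, 96.0, 79.0, 81.0],
})

# One dataframe per pitch type, each with a fresh 0-based index.
frames_by_type = {
    pitch_type: group.reset_index(drop=True)
    for pitch_type, group in pitches.groupby("pitch_type")
}

for pitch_type, frame in frames_by_type.items():
    print(pitch_type, len(frame))
```

Each entry can then be analyzed on its own, so that, for example, a slider is only compared against other sliders.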

Tyler Glasnow


Mike Fiers

To quote Forbes:

Hands down, the worst qualifying pitch of 2018 was Mike Fiers' slider, which earned a D grade.

Why was this one so bad?

Comparing Mike Fiers and Tyler Glasnow, we see that Glasnow's sliders are thrown with much greater rotation and a much more significant vertical drop.

Wherefore Cy Young Winners?

At the end of the year, MLB hands out two Cy Young Awards, one for the best pitcher in each league. Last year, the awards were won by Jacob deGrom of the Mets and Blake Snell of the Tampa Bay Rays. Let's see how their arsenals compare to the rest of the league.

Jacob deGrom

Jacob deGrom is known for his fastball (FF).

However, his fastballs are perfectly average! His dominance can't be explained with the features we've selected. deGrom's success most likely comes from his impeccable command, his ability to mix up pitches, and his knack for beating batters in the mental game.

Blake Snell

Blake Snell's curveball was known to be one of the most dominant pitches of 2018, but according to our analysis, it doesn't seem all that outlying. This likely has to do with the fact that an outstanding pitch needs to be contextualized. So much of pitching is about control, mixing up pitches, and throwing the right pitch at the right time.

"I think it's just mainly because of how hard he throws his fastball," Blue Jays catcher Luke Maile said. "It comes out of the same slot, he's got that angle and that kind of upshot heater, and that breaking ball kind of starts out as something you need to hit. Then, before you know it, he threw it 55 feet."

Results

In this analysis, we've managed to highlight a few pitches, both good and bad: some that are completely unique, and some thrown by pitchers with rare abilities. However, we learned that the Cy Young winners were not all that different from the rest of the pack when looking at the physical factors of their pitches.

So, there's more to pitching than unusual physical factors of the pitch. To quote Yogi Berra:

Baseball is 90% mental, the other half is physical.

There are also a lot of important situational details that make a pitcher effective. To name a few:

Identifying a truly outstanding pitch or pitcher means taking all of these features into account and looking at how they progress over time. In a subsequent analysis, we can dig deeper into this rich dataset and try to uniquely identify Blake Snell's and Jacob deGrom's award-winning seasons.