import os
import time
import json
import requests
import pandas as pd
from elasticsearch import Elasticsearch, helpers
from process import process_data
from graphing import MLBGrapher
PITCHf/x, created and maintained by Sportvision, is a system that tracks the speeds and trajectories of pitched baseballs. This system, which made its debut in the 2006 MLB playoffs, is installed in every MLB stadium. The data from the system is often used by broadcasters to show a visual representation of the pitch and whether or not a pitch entered the strike-zone. PITCHf/x is also used to determine the type of pitch thrown, such as a fastball, curve, or slider. MLB uses the data from PITCHf/x in its Zone Evaluation System which is used to grade and provide feedback to umpires. Sabermetric analysts note that umpire accuracy has improved after the technology was introduced to MLB.
This dataset was collected from Baseball Savant and contains every pitch thrown during the 2018 baseball season.
In total, this dataset contains:
Statistic | Count |
---|---|
Total Games | 2,397 |
Unique Pitchers | 799 |
Unique Batters | 990 |
Total Pitches | 721,190 |
Total At Bats | 182,352 |
Total Strikeouts | 41,043 |
Each pitch contains a number of different attributes including:
The full data dictionary can be found here
What made a pitch an outlier last year? In 2018, there were more strikeouts than hits for the first time in the history of baseball. 2018 also saw over 1,000 pitches recorded at 100 mph or more, and breaking balls continued to take up an increasing percentage of pitches thrown.
So, hitters are striking out more. Could this be a result of an increase in pitch speed, an unexpected over-reliance on breaking balls, or something else? In this analysis, we will try to determine what physical factors best make a pitch unique. Along the way, we will investigate individual pitchers and highlight individual at-bats.
We will be focusing on the following features:
First, load the data from files into a pandas dataframe. We will perform some preprocessing on the raw data:
float to int
pitch_type to N/A
pitcher is null
This dataset was scraped from Baseball Savant, and it does not include pitchers' names. So, the first step after loading the data will be to map pitcher ids to pitcher names.
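try:
    df = pd.read_csv('data/2018-season.csv')
    # drop duplicated and leftover index columns from the scrape
    df = df.drop(['fielder_2.1', 'Unnamed: 0', 'pitcher.1'], axis=1)
except Exception:
    # loading failed, so fall back to processing the raw data
    # (assumption: process_data() rebuilds data/2018-season.csv; re-run this cell afterwards to load df)
    process_data()
grapher = MLBGrapher()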
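The id-to-name mapping mentioned above is handled inside the processing step, which is not shown here. As a rough sketch of how such a mapping can be done in pandas, a left merge against a lookup table is one option. The lookup table, the ids, the names, and the pitcher_full_name column below are all made up for illustration.

```python
import pandas as pd

# Toy pitch rows; only the numeric pitcher id is present, as in the raw scrape.
pitches = pd.DataFrame({
    "pitcher": [101, 102, 101],
    "release_speed": [95.1, 88.4, 96.0],
})

# Hypothetical id-to-name lookup table (ids and names are invented).
names = pd.DataFrame({
    "pitcher": [101, 102],
    "pitcher_full_name": ["Pitcher A", "Pitcher B"],
})

# A left merge keeps every pitch, even if an id has no match in the lookup.
named = pitches.merge(names, on="pitcher", how="left")
print(named)
```

A left merge is used so that pitches with an unknown id are kept (with a null name) rather than silently dropped.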
Now, our dataset is ready for analysis.
print("Total Games:\t\t", len(df.groupby(['game_date', 'home_team', 'away_team'])))
print("Unique Pitchers:\t", len(df['pitcher'].unique()))
print("Unique Batters:\t\t", len(df['batter'].unique()))
print("Total Pitches:\t\t", len(df))
print("Total At Bats:\t\t", len(df.groupby(['game_date', 'home_team', 'away_team', 'at_bat_number'])))
print("Total Strikeouts:\t", len(df[df['events']=='strikeout']))
Total Games: 2397
Unique Pitchers: 799
Unique Batters: 990
Total Pitches: 721190
Total At Bats: 182352
Total Strikeouts: 41043
Now that our data is loaded properly, let's look at overall pitch statistics by pitch type.
Every baseball game features hundreds of pitches from 60 feet, 6 inches, each serving one defined purpose: to defeat a hitter. Of course, the players those pitches are designed to conquer have an entirely different goal in mind.
Virtually every Major League pitcher throws a combination of pitches, with starting pitchers often owning an arsenal of three or more offerings. Relief pitchers, who infrequently face the same batter more than once in a game, have historically succeeded with the help of one or two pitch types that are thrown with maximum effort.
The types of pitches recorded by Statcast are the following:
What makes a pitch hard to hit? There are several physical factors, including the pitch's movement, its speed, and its rotation, as well as non-physical factors like the pitcher's intimidation and the catcher's ability to predict what the batter might be looking for.
First, let's look at these physical factors split by pitch type.
valid_pitches = ['FF','SL','FT','CH','CU','SI','FC','KC','FS','KN','EP','PO','FO','SC']
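# select the columns with a list (double brackets); tuple-style selection after groupby is not supported in recent pandas
pd.DataFrame(df.groupby(['p_throws', 'pitch_type'])[['pfx_x', 'pfx_z', 'release_speed', 'release_spin_rate']]
    .mean()
    .round(3)).unstack(level=0).reindex(valid_pitches)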
pitch_type | pfx_x (L) | pfx_x (R) | pfx_z (L) | pfx_z (R) | release_speed (L) | release_speed (R) | release_spin_rate (L) | release_spin_rate (R) |
---|---|---|---|---|---|---|---|---|
FF | 0.649 | -0.605 | 1.301 | 1.308 | 92.234 | 93.492 | 2233.405 | 2276.274 |
SL | -0.473 | 0.465 | 0.127 | 0.152 | 83.158 | 84.812 | 2343.282 | 2414.638 |
FT | 1.176 | -1.209 | 0.818 | 0.879 | 90.834 | 93.054 | 2122.852 | 2163.811 |
CH | 1.139 | -1.103 | 0.614 | 0.553 | 83.159 | 84.722 | 1826.391 | 1749.678 |
CU | -0.796 | 0.797 | -0.822 | -0.740 | 76.398 | 78.978 | 2469.768 | 2519.998 |
SI | 1.199 | -1.259 | 0.797 | 0.600 | 91.525 | 91.813 | 2136.795 | 2111.806 |
FC | -0.147 | 0.218 | 0.687 | 0.711 | 87.408 | 89.295 | 2248.436 | 2392.251 |
KC | -0.367 | 0.711 | -0.622 | -0.906 | 79.937 | 81.079 | 2203.125 | 2558.581 |
FS | 1.034 | -0.887 | 0.391 | 0.383 | 83.853 | 85.325 | 1364.367 | 1429.855 |
KN | NaN | -0.046 | NaN | 0.048 | NaN | 75.773 | NaN | 1552.714 |
EP | -0.581 | 1.131 | 0.591 | -1.048 | 63.800 | 67.606 | 1860.000 | 2405.475 |
PO | 0.866 | -0.703 | 1.055 | 1.208 | 86.300 | 88.187 | 1963.500 | 2119.322 |
FO | NaN | -1.153 | NaN | 0.756 | NaN | 86.337 | NaN | 1677.253 |
SC | 1.301 | NaN | 0.569 | NaN | 77.495 | NaN | 1946.730 | NaN |
Interestingly, we see that left-handed pitchers have a lower average release speed than right-handed pitchers for every pitch type thrown by both. Spin rate averages, pfx_x, and pfx_z appear to be split between lefties and righties. What could cause this relatively large discrepancy in speed? One theory is that left-handed pitchers are highly coveted in baseball, since there are far fewer lefties than there are righties. This shortage could potentially lead to two things:
# loop through columns and add grouped box plots
grapher.pitch_boxes(df, 'release_speed');
grapher.pitch_boxes(df, 'release_spin_rate');
grapher.pitch_boxes(df, 'pfx_z');
grapher.pitch_boxes(df, 'pfx_x');
Let's interpret these charts. A few things to note:
FF are the straightest and the fastest pitches. By straightest, I mean that they have the highest pfx_z. That is, they have the least drop.
Curveballs (CU, KC), as their name implies, have the most drop.
Breaking pitches (SL, CU, FC, KC) have more horizontal movement than other pitches. That is, they break towards or away from the batter.
Then there is the changeup (CH). It looks like a two-seam fastball (FT) in terms of horizontal and vertical break, but is thrown at a much slower speed. This makes it a very effective pitch when the pitcher can determine that the batter is expecting a fastball, causing him to be early on his swing.
Using the data, we can replay individual at bats from 2018. An at bat is a single batter's turn to hit against a pitcher. During an at bat, the pitcher throws pitches to the batter, with each pitch being one of the following:
The at bat continues until one of the following occurs:
* a batter cannot strike out on a foul ball
Now, let's look at an individual at bat and see what we can get out of the data. Let's look at Aaron Judge's at bat against Chris Sale on June 30th. Judge struck out on Sale's fastball.
The data has the location of the ball as it crosses the batter's box. To define the vertical extent of the "strike zone" we use the mean values for sz_top and sz_bot across the entire year. The data does not contain the horizontal extent of the strike zone, so I have set it to be [-1,1]. In the graphic below, we show all pitches of the at-bat, colored by the result of the pitch, where {'red': 'strike', 'green': 'ball', 'blue': 'ball in play'}, and the batter is indicated by the ellipse. Also, whiffs are outlined in black. A whiff is a pitch that was swung at by the batter, but no contact was made.
player = 'Aaron Judge'
pitcher = 'Chris Sale'
date = '2018-06-30'
ab_num = 9
at_bat = df[(df['player_name']== player)
& (df['pitch_name']== pitcher)
& (df['game_date']== date)
& (df['at_bat_number']== ab_num)]
grapher.at_bat(at_bat)
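The strike-zone definition described above can be sketched in a few lines. This is only an illustration on toy numbers, not the plotting code inside MLBGrapher. The column names plate_x and plate_z (the ball's horizontal and vertical position at the plate, in feet) are assumptions, since the text does not name them.

```python
import pandas as pd

# Toy pitches. plate_x / plate_z are assumed column names for the ball's
# position as it crosses the plate; the values are invented.
toy = pd.DataFrame({
    "plate_x": [0.2, 1.4, -0.5],
    "plate_z": [2.5, 2.0, 4.1],
    "sz_top": [3.4, 3.6, 3.5],
    "sz_bot": [1.5, 1.7, 1.6],
})

# Vertical extent: means of sz_top and sz_bot over all rows, as in the text.
zone_top = toy["sz_top"].mean()
zone_bot = toy["sz_bot"].mean()
# Horizontal extent: fixed at [-1, 1], as in the text.
zone_left, zone_right = -1.0, 1.0

# A pitch is inside the zone only if it is inside both extents.
toy["in_zone"] = (
    toy["plate_x"].between(zone_left, zone_right)
    & toy["plate_z"].between(zone_bot, zone_top)
)
print(toy[["plate_x", "plate_z", "in_zone"]])
```

Using one season-wide zone keeps the graphic simple, at the cost of ignoring differences in batter height.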
Looking at this, we see that Judge didn't swing at any of Sale's sliders, choosing to wait on a fastball. He swung at the changeup, most likely thinking it was a fastball, and then ultimately struck out on a fastball.
Notice that there are 4 strikes here. That means that Aaron Judge fouled off a pitch.
player = 'Aaron Judge'
pitcher = 'Blake Snell'
date = '2018-06-14'
ab_num = 6
at_bat = df[(df['player_name']== player)
& (df['pitch_name']== pitcher)
& (df['game_date']== date)
& (df['at_bat_number']== ab_num)]
grapher.at_bat(at_bat)
We can also look at an individual pitcher's set of pitches as they compare to other pitchers.
ip = df.groupby(['pitch_name', 'game_date', 'inning']).size()
ip = ip.reset_index().groupby('pitch_name').size()
qualified = ip[ip>100]
pitcher = 'Garrett Richards'
col = 'release_spin_rate'
# NOTE: pitch_pitcher is built in the whiff-rate cell further down; run that cell first
data = pitch_pitcher[pitch_pitcher[col].notnull()]
data = data.merge(df[['pitch_name', 'p_throws']].drop_duplicates().set_index('pitch_name'), left_on='pitch_name', right_index=True)
q = pitch_pitcher[pitch_pitcher[col].notnull()].reset_index().set_index('pitch_name').loc[qualified.index].reset_index().set_index('pitch_type')
# keep only pitch types thrown by more than 10 qualified pitchers
counts = q.groupby(q.index).size()
valid = counts[counts > 10].index
grapher.ridgeplot(data, valid, col, pitcher)
Now, let's add whiff rate to our analysis. Whiff% divides the number of pitches swung at and missed by the total number of swings in a given sample. This is usually a great indicator of the effectiveness of a pitcher's pitch.
outcomes = df['description'].unique()
swings = ['hit_into_play_score', 'foul', 'hit_into_play','swinging_strike', 'hit_into_play_no_out',
'foul_tip', 'swinging_strike_blocked', 'swinging_pitchout', 'foul_pitchout']
missed = ['swinging_strike', 'swinging_strike_blocked', 'swinging_pitchout', 'foul_tip']
ppitch_types_swings = pd.DataFrame(
df[df['description'].isin(swings)]
.groupby(['pitch_type', 'pitch_name']).size()
.sort_values(ascending=False), columns=['swings'])
ppitch_types_miss = pd.DataFrame(
df[df['description'].isin(missed)]
.groupby(['pitch_type', 'pitch_name']).size()
.sort_values(ascending=False), columns=['missed'])
ppitch_types_count = pd.DataFrame(
df.groupby(['pitch_type', 'pitch_name']).size()
.sort_values(ascending=False), columns=['count'])
ppitch_types_pct = pd.DataFrame(
df.groupby(['pitch_type', 'pitch_name']).size()
.sort_values(ascending=False)/len(df), columns=['pct'])
pmovement = pd.DataFrame(
    # select the columns with a list (double brackets) and take the mean per group
    df.groupby(['pitch_type', 'pitch_name'])[['pfx_x', 'pfx_z', 'release_speed', 'release_spin_rate', 'babip_value']]
    .mean())
pitch_pitcher = pd.concat(
[ppitch_types_swings, ppitch_types_miss, ppitch_types_count, ppitch_types_pct, pmovement], axis=1, sort=False)
pitch_pitcher['whiff_pct'] = pitch_pitcher['missed'] / pitch_pitcher['swings']
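To make the Whiff% definition concrete, here is a small sketch on invented rows. It reuses the swings and missed lists from the cell above; the toy descriptions and pitch types are made up, and the grouping is by pitch type only to keep the example short.

```python
import pandas as pd

swings = ['hit_into_play_score', 'foul', 'hit_into_play', 'swinging_strike',
          'hit_into_play_no_out', 'foul_tip', 'swinging_strike_blocked',
          'swinging_pitchout', 'foul_pitchout']
missed = ['swinging_strike', 'swinging_strike_blocked', 'swinging_pitchout', 'foul_tip']

# Invented pitch outcomes for two pitch types.
toy = pd.DataFrame({
    "pitch_type": ["SL", "SL", "SL", "SL", "FF", "FF", "FF"],
    "description": ["swinging_strike", "foul", "ball", "hit_into_play",
                    "called_strike", "swinging_strike", "foul"],
})

is_swing = toy["description"].isin(swings)
is_miss = toy["description"].isin(missed)

# Whiff% = swings and misses / total swings, computed per pitch type.
whiff = is_miss.groupby(toy["pitch_type"]).sum() / is_swing.groupby(toy["pitch_type"]).sum()
print(whiff)
```

Pitches that were not swung at (balls and called strikes) do not enter the ratio at all, which is why Whiff% can be high even for a pitch that is rarely offered at.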
Now, let's index the data into elasticsearch.
host = 'http://localhost:9200'
es = Elasticsearch(hosts=[host])
# index data
def send_pitches():
    # yield one bulk action per pitch, leaving out null fields
    for m in df.reset_index().to_dict(orient='records'):
        yield {
            "_index": "mlb-2018",
            "_source": {
                k: v
                for k, v in m.items() if pd.notnull(v)
            }
        }
helpers.bulk(es, send_pitches())
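The generator above can be checked without a running cluster by collecting its actions into a list. The sketch below does this on a two-row toy frame. The function name build_actions and the index name mlb-2018-toy are made up for illustration.

```python
import pandas as pd

def build_actions(frame, index_name):
    """Yield one bulk action per row, omitting null fields."""
    for row in frame.to_dict(orient="records"):
        yield {
            "_index": index_name,
            "_source": {k: v for k, v in row.items() if pd.notnull(v)},
        }

# Invented rows; the second row has no spin rate recorded.
toy = pd.DataFrame({
    "pitch_type": ["FF", "KN"],
    "release_spin_rate": [2276.0, None],
})

# Nothing is sent anywhere; we only inspect what would be indexed.
actions = list(build_actions(toy, "mlb-2018-toy"))
print(actions)
```

Dropping null fields means a missing measurement is simply absent from the document, so averages computed later skip it instead of treating it as a value.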
# create the aggregator field
url = "/mlb-2018/_update_by_query"
config = {
"script": {
"lang": "painless",
"source": "ctx._source.agger = ctx._source.pitch_type +'-' + ctx._source.pitch_name"
},
"query": {
"match_all": {}
}
}
es.update_by_query(body=config, index='mlb-2018')
# add swings and misses so we can calculate whiff
url = "/mlb-2018/_update_by_query"
config = {
"script": {
"lang": "painless",
"source": "ctx._source.miss = 0"
},
"query": {
"match_all": {}
}
}
es.update_by_query(body=config, index='mlb-2018')
url = "/mlb-2018/_update_by_query"
config = {
"script": {
"lang": "painless",
"source": "ctx._source.swing = 0"
},
"query": {
"match_all": {}
}
}
es.update_by_query(body=config, index='mlb-2018')
# updates swings and misses
url = "/mlb-2018/_update_by_query"
config = {
"script": {
"lang": "painless",
"source": """
if (['hit_into_play_score', 'foul', 'hit_into_play','swinging_strike', 'hit_into_play_no_out', 'foul_tip', 'swinging_strike_blocked', 'swinging_pitchout', 'foul_pitchout'].contains(ctx._source.description))
{ ctx._source.swing += 1}
"""
}
}
es.update_by_query(body=config, index='mlb-2018')
# updates swings and misses
url = "/mlb-2018/_update_by_query"
config = {
"script": {
"lang": "painless",
"source": """
if (['swinging_strike', 'swinging_strike_blocked', 'swinging_pitchout', 'foul_tip'].contains(ctx._source.description))
{ ctx._source.miss += 1}
"""
}
}
es.update_by_query(body=config, index='mlb-2018')
def make_df(pitch, q=None):
    print("{} - creating df".format(pitch))
    host = 'http://localhost:9200'
    url = "/_data_frame/transforms/pitch_pitcher_{}".format(pitch)
    config = {
        "source": {
            "index": "mlb-2018",
            "query": {}
        },
        "dest": {
            "index": "pitch-pitcher-mlb-2018-{}".format(pitch)
        },
        "pivot": {
            "group_by": {
                "agger": {
                    "terms": {
                        "field": "agger.keyword"
                    }
                }
            },
            "aggregations": {
                "mean_pfx_x": {"avg": {"field": "pfx_x"}},
                "mean_pfx_z": {"avg": {"field": "pfx_z"}},
                "release_speed": {"avg": {"field": "release_speed"}},
                "release_spin_rate": {"avg": {"field": "release_spin_rate"}},
                "total_pitches": {"value_count": {"field": "pitch_number"}},
                "swings": {"sum": {"field": "swing"}},
                "misses": {"sum": {"field": "miss"}},
                "babip": {"avg": {"field": "babip_value"}}
            }
        }
    }
    # choose the source query; the elif chain keeps the 'all' cases from
    # being overwritten by the per-pitch-type fallback
    if pitch == 'all' and q:
        config['source']['query'] = {"terms": {"pitch_name.keyword": list(q)}}
    elif pitch == 'all':
        config['source']['query'] = {'match_all': {}}
    elif pitch.startswith('qual-'):
        config['source']['query'] = {
            "bool": {
                "must": [
                    {"match": {"pitch_type": pitch.replace('qual-', '').upper()}},
                    {"terms": {"pitch_name.keyword": list(q)}}
                ]
            }
        }
    else:
        config['source']['query'] = {'match': {"pitch_type": pitch.upper()}}
    print(requests.put(host+url, json=config).json())
def start_df(pitch):
    print("{} - starting df".format(pitch))
    host = 'http://localhost:9200'
    url = "/_data_frame/transforms/pitch_pitcher_{}/_start".format(pitch)
    print(requests.post(host+url).json())
def make_analytics(pitch):
    print("{} - creating analytics".format(pitch))
    host = 'http://localhost:9200'
    url = "/_ml/data_frame/analytics/nasty_pitches_{}".format(pitch)
    config = {
        "source": {
            "index": "pitch-pitcher-mlb-2018-{}".format(pitch),
            # only analyze pitcher/pitch pairs with at least 100 pitches
            "query": {
                "range": {"total_pitches": {"gte": 100}}
            }
        },
        "dest": {
            "index": "pitch-pitcher-mlb-2018-outliers-{}".format(pitch)
        },
        "analysis": {
            "outlier_detection": {}
        },
        "analyzed_fields": {
            "includes": ["mean_pfx_x", "mean_pfx_z", "release_speed", "release_spin_rate"]
        }
    }
    print(requests.put(host+url, json=config).json())
def calculate_whiff(pitch):
    print("{} - calculating whiff".format(pitch))
    host = 'http://localhost:9200'
    es = Elasticsearch(host)
    url = "/pitch-pitcher-mlb-2018-{}/_update_by_query".format(pitch)
    config = {
        "script": {
            "lang": "painless",
            "source": """
            if (ctx._source.misses > 0)
            {ctx._source.whiff = ctx._source.misses / ctx._source.swings}
            """
        }
    }
    es.update_by_query(body=config, index='pitch-pitcher-mlb-2018-{}'.format(pitch))
def start_analytics(pitch):
    print("{} - starting analytics".format(pitch))
    host = 'http://localhost:9200'
    url = "/_ml/data_frame/analytics/nasty_pitches_{}/_start".format(pitch)
    print(requests.post(host+url).json())
def clean_up(pitch):
    """We need to do the following:
    1. delete transforms
    2. delete data frames
    3. delete outlier dataframes
    4. delete data frame analytics
    """
    host = 'http://localhost:9200'
    # delete transforms
    tr_stop = "/_data_frame/transforms/pitch_pitcher_{}/_stop".format(pitch)
    tr_delete = "/_data_frame/transforms/pitch_pitcher_{}".format(pitch)
    print("{} - stopping transform".format(pitch))
    print(requests.post(host+tr_stop).json())
    time.sleep(10)
    print("{} - deleting transform".format(pitch))
    print(requests.delete(host+tr_delete).json())
    time.sleep(10)
    # delete dataframe
    df_delete = "/pitch-pitcher-mlb-2018-{}".format(pitch)
    print("{} - deleting df".format(pitch))
    print(requests.delete(host+df_delete).json())
    time.sleep(10)
    # delete outlier dataframe
    o_delete = "/pitch-pitcher-mlb-2018-outliers-{}".format(pitch)
    print("{} - deleting outlier df".format(pitch))
    print(requests.delete(host+o_delete).json())
    time.sleep(10)
    # delete dataframe analytics
    a_stop = "/_ml/data_frame/analytics/nasty_pitches_{}/_stop".format(pitch)
    a_delete = "/_ml/data_frame/analytics/nasty_pitches_{}".format(pitch)
    print("{} - stopping analytics".format(pitch))
    print(requests.post(host+a_stop).json())
    time.sleep(10)
    print("{} - deleting analytics".format(pitch))
    print(requests.delete(host+a_delete).json())
# create dataframe and analytics
make_df('all')
time.sleep(10)
start_df('all')
time.sleep(10)
calculate_whiff('all')
time.sleep(10)
make_analytics('all')
time.sleep(10)
start_analytics('all')
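For readers without a cluster at hand, the pivot configured in make_df can be approximated in pandas. The sketch below is an assumed equivalent on invented rows, not output from Elasticsearch; it groups by the agger field and computes the same averages, counts, and sums (babip is left out for brevity).

```python
import pandas as pd

# Invented pitches; agger is pitch type plus pitcher, as built earlier.
toy = pd.DataFrame({
    "agger": ["CU-Pitcher A", "CU-Pitcher A", "FF-Pitcher B"],
    "pfx_x": [0.8, 0.6, -0.6],
    "pfx_z": [-0.9, -0.7, 1.3],
    "release_speed": [78.0, 80.0, 94.0],
    "release_spin_rate": [2500.0, 2700.0, 2300.0],
    "pitch_number": [1, 2, 1],
    "swing": [1, 1, 0],
    "miss": [1, 0, 0],
})

# One output row per agger value, mirroring the transform's aggregations.
pivot = toy.groupby("agger").agg(
    mean_pfx_x=("pfx_x", "mean"),
    mean_pfx_z=("pfx_z", "mean"),
    release_speed=("release_speed", "mean"),
    release_spin_rate=("release_spin_rate", "mean"),
    total_pitches=("pitch_number", "count"),
    swings=("swing", "sum"),
    misses=("miss", "sum"),
)
print(pivot)
```

Each row of the result describes one pitcher's version of one pitch type, which is the unit the outlier analysis then scores.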
Let's look at the top outliers for all pitchers' pitches.
res = es.search(index='pitch-pitcher-mlb-2018-outliers-all', sort=["ml.outlier_score:desc"], size=10000)
top = []
for i in res['hits']['hits']:
    pitch = {
        "pitch": i['_source']['agger'],
        "score": i['_source']['ml'],
        "pitches": i['_source']['total_pitches'],
        "babip": i['_source']['babip'],
        "whiff": i['_source']['whiff']
    }
    top.append(pitch)
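def highlight_row_max(s):
    # highlight the largest feature-influence value in each row;
    # the last two columns (whiff, babip) are never highlighted
    features = s.iloc[:-2]
    is_max = features == features.max()
    not_highlighted = pd.Series([False, False], index=['whiff', 'babip'])
    # Series.append was removed in pandas 2.0, so use pd.concat
    is_max = pd.concat([is_max, not_highlighted])
    return ['background-color: #eef44299' if v else '' for v in is_max]
# pd.json_normalize replaces the deprecated pd.io.json.json_normalize
outliers_all = pd.json_normalize(top)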
outliers_all.columns = ['babip', 'pitch', 'pitches', 'fi.mean_pfx_x', 'fi.mean_pfx_z', 'fi.release_speed',
'fi.release_spin_rate', 'outlier_score', 'whiff']
outliers_all = outliers_all[['pitch', 'pitches', 'fi.mean_pfx_x', 'fi.mean_pfx_z', 'fi.release_speed',
'fi.release_spin_rate', 'outlier_score', 'whiff', 'babip']]
outliers_all = outliers_all.round(5)
outliers_all = outliers_all.set_index(['pitch', 'outlier_score', 'pitches'])
outliers_all.head(20).round(4).style.apply(highlight_row_max, axis=1).background_gradient(cmap='magma', subset=['whiff'], axis=0, low=0, high=1)
pitch | outlier_score | pitches | fi.mean_pfx_x | fi.mean_pfx_z | fi.release_speed | fi.release_spin_rate | whiff | babip |
---|---|---|---|---|---|---|---|---|
CU-Garrett Richards | 0.99576 | 139.0 | 0.0111 | 0.5404 | 0.0096 | 0.4389 | 0.3036 | 0.0645 |
CH-Anibal Sanchez | 0.9924 | 126.0 | 0.0346 | 0.1016 | 0.7784 | 0.0854 | 0.3867 | 0.0781 |
SL-Pat Venditte | 0.98705 | 127.0 | 0.1273 | 0.2211 | 0.3387 | 0.3129 | 0.3115 | 0.1724 |
FS-Tony Sipp | 0.95698 | 109.0 | 0.0532 | 0.1026 | 0.4619 | 0.3824 | 0.5161 | 0.1379 |
CU-Blaine Hardy | 0.94055 | 126.0 | 0.1916 | 0.2987 | 0.1437 | 0.3661 | 0.1304 | 0.1111 |
CH-Alex Claudio | 0.93909 | 388.0 | 0.1958 | 0.2475 | 0.4361 | 0.1207 | 0.3264 | 0.2451 |
FS-Blake Parker | 0.92651 | 356.0 | 0.0271 | 0.0897 | 0.3512 | 0.532 | 0.4068 | 0.092 |
FF-Marco Estrada | 0.91759 | 1232.0 | 0.0081 | 0.7035 | 0.1762 | 0.1121 | 0.1688 | 0.194 |
FS-Ryne Stanek | 0.91684 | 140.0 | 0.0325 | 0.0232 | 0.371 | 0.5733 | 0.5735 | 0.0444 |
CU-Jackson Stephens | 0.89795 | 103.0 | 0.2801 | 0.08 | 0.2678 | 0.3721 | 0.3864 | 0.1852 |
CU-Brad Ziegler | 0.85543 | 197.0 | 0.0799 | 0.0131 | 0.8906 | 0.0163 | 0.427 | 0.1207 |
SL-Sergio Romo | 0.84671 | 655.0 | 0.0416 | 0.7034 | 0.1881 | 0.0668 | 0.3468 | 0.1894 |
CU-Seth Lugo | 0.83873 | 513.0 | 0.1208 | 0.3017 | 0.0294 | 0.548 | 0.2533 | 0.1544 |
SL-Tyler Glasnow | 0.81024 | 190.0 | 0.3549 | 0.2871 | 0.2274 | 0.1306 | 0.5652 | 0.0563 |
SL-Adam Morgan | 0.8066 | 354.0 | 0.0965 | 0.1683 | 0.1147 | 0.6205 | 0.3819 | 0.1687 |
CU-Rich Hill | 0.79706 | 758.0 | 0.5148 | 0.0458 | 0.0706 | 0.3688 | 0.2105 | 0.1831 |
CH-Marco Estrada | 0.77417 | 924.0 | 0.1213 | 0.4368 | 0.3731 | 0.0688 | 0.3239 | 0.1956 |
FS-Jake Faria | 0.77202 | 208.0 | 0.0432 | 0.0833 | 0.4788 | 0.3946 | 0.3238 | 0.1774 |
SL-Kyle Crick | 0.75823 | 270.0 | 0.1321 | 0.095 | 0.0711 | 0.7018 | 0.4364 | 0.0781 |
CU-Ryan Pressly | 0.73495 | 306.0 | 0.071 | 0.6242 | 0.0565 | 0.2484 | 0.3542 | 0.1667 |
We see that there is a good mix of pitches in here, with CU, CH, SL, FS, and FF all appearing in the top 10.
The most outlying pitch of all 2018 is Garrett Richards's curveball, with the highest contributing factors being its vertical movement and its spin rate.
In fact, Garrett Richards's curveball has, on average, more movement than any other pitch in the game.
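The "highest contributing factor" for a pitch is simply the largest feature-influence value in its row. As a small sketch, the values below are copied from the first three rows of the table above, and idxmax picks the top feature for each.

```python
import pandas as pd

# Feature-influence values copied from the first three rows of the table above.
influence = pd.DataFrame(
    {
        "fi.mean_pfx_x": [0.0111, 0.0346, 0.1273],
        "fi.mean_pfx_z": [0.5404, 0.1016, 0.2211],
        "fi.release_speed": [0.0096, 0.7784, 0.3387],
        "fi.release_spin_rate": [0.4389, 0.0854, 0.3129],
    },
    index=["CU-Garrett Richards", "CH-Anibal Sanchez", "SL-Pat Venditte"],
)

# For each pitch, the column holding the largest influence value.
top_feature = influence.idxmax(axis=1)
print(top_feature)
```

For Richards's curveball this returns the vertical-movement column, with spin rate as the clear runner-up, matching the reading given above.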
from IPython.display import HTML
HTML('<iframe width="800" height="500" src="https://www.mlb.com/angels/video/statcast-richards-filthy-curve-c2134014183?autoplay=false" frameborder="0"></iframe>')
pitcher = 'Garrett Richards'
p_throws = 'R'
pitch = 'CU'
grapher.hexes(df, pitch_pitcher, p_throws, pitch, pitcher)
Let's create separate dataframes for each pitch type so we can run analytics separately.
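Splitting by pitch type matters because a pitch is then only compared against pitches of the same type. The sketch below illustrates the idea with a simple nearest-neighbour distance score computed per group. It is an illustration only and not the algorithm Elasticsearch runs; the function name knn_distance_score, the choice of k, and all numbers are made up.

```python
import numpy as np
import pandas as pd

def knn_distance_score(features, k=2):
    """Mean distance to the k nearest neighbours, after standardising columns."""
    x = features.to_numpy(dtype=float)
    x = (x - x.mean(axis=0)) / x.std(axis=0)
    # Pairwise Euclidean distances between all rows.
    dists = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)
    # Sort each row; column 0 is the zero distance to itself, so skip it.
    nearest = np.sort(dists, axis=1)[:, 1:k + 1]
    return pd.Series(nearest.mean(axis=1), index=features.index)

# Invented per-pitcher averages for two pitch types.
toy = pd.DataFrame({
    "pitch_type": ["CU"] * 4 + ["FF"] * 4,
    "release_speed": [78.0, 79.0, 78.5, 88.0, 93.0, 94.0, 93.5, 94.5],
    "release_spin_rate": [2500.0, 2520.0, 2480.0, 3100.0,
                          2250.0, 2300.0, 2280.0, 2260.0],
})

# Score each pitch only against pitches of the same type.
scores = pd.concat([
    knn_distance_score(group[["release_speed", "release_spin_rate"]])
    for _, group in toy.groupby("pitch_type")
])
toy["score"] = scores
print(toy)
```

In this toy data the 88 mph curveball stands out within its own group, even though 88 mph would be unremarkable if it were mixed in with the fastballs.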
# get all unique pitches
body = {
"size": 0,
"aggs": {
"unique_pitches": {
"terms": {
"field": "pitch_type.keyword",
"size": 100
}
}
}
}
res = es.search(index="mlb-2018", body=body)
pitches = []
for hit in res['aggregations']['unique_pitches']['buckets']:
    # keep only pitch types with more than 1000 recorded pitches
    if hit['doc_count'] > 1000:
        pitches.append(hit['key'].lower())
# create dataframe and run analytics for each
for pitch in pitches:
    print("{}...analyzing".format(pitch))
    clean_up(pitch)
    time.sleep(10)
    make_df(pitch)
    time.sleep(10)
    start_df(pitch)
    time.sleep(10)
    calculate_whiff(pitch)
    time.sleep(10)
    make_analytics(pitch)
    time.sleep(10)
    start_analytics(pitch)
    time.sleep(10)
# inspect results in pandas dataframes
pitch_outliers = {}
for p in pitches:
    res_top_20 = es.search(index='pitch-pitcher-mlb-2018-outliers-{}'.format(p), sort=["ml.outlier_score:desc"], size=10000)
    top_20 = []
    for i in res_top_20['hits']['hits']:
        pitch = {
            "pitch": i['_source']['agger'],
            "score": i['_source']['ml'],
            "pitches": i['_source']['total_pitches'],
            "babip": i['_source']['babip'],
            "whiff": i['_source']['whiff']
        }
        top_20.append(pitch)
    # pd.json_normalize replaces the deprecated pd.io.json.json_normalize
    pitch_outliers[p] = pd.json_normalize(top_20)
    pitch_outliers[p].columns = ['babip', 'pitch', 'pitches', 'fi.mean_pfx_x', 'fi.mean_pfx_z', 'fi.release_speed',
                                 'fi.release_spin_rate', 'outlier_score', 'whiff']
    pitch_outliers[p] = pitch_outliers[p][['pitch', 'pitches', 'fi.mean_pfx_x', 'fi.mean_pfx_z', 'fi.release_speed',
                                           'fi.release_spin_rate', 'outlier_score', 'whiff', 'babip']]
    pitch_outliers[p] = pitch_outliers[p].round(5)
    pitch_outliers[p] = pitch_outliers[p].set_index(['pitch', 'outlier_score', 'pitches'])
pitch_outliers['sl'].head(15).round(4).style.apply(highlight_row_max, axis=1).background_gradient(cmap='magma', subset=['whiff'], axis=0, low=0, high=0.5)
pitch | outlier_score | pitches | fi.mean_pfx_x | fi.mean_pfx_z | fi.release_speed | fi.release_spin_rate | whiff | babip |
---|---|---|---|---|---|---|---|---|
SL-Pat Venditte | 0.99523 | 127.0 | 0.0551 | 0.0091 | 0.8871 | 0.0488 | 0.3115 | 0.1724 |
SL-Tyler Glasnow | 0.98278 | 190.0 | 0.0316 | 0.589 | 0.0165 | 0.3628 | 0.5652 | 0.0563 |
SL-Joe Smith | 0.89858 | 238.0 | 0.026 | 0.7961 | 0.1756 | 0.0023 | 0.3438 | 0.0702 |
SL-Alex Claudio | 0.83676 | 124.0 | 0.0494 | 0.3912 | 0.4086 | 0.1508 | 0.451 | 0.2857 |
SL-Dellin Betances | 0.76117 | 192.0 | 0.0208 | 0.7181 | 0.234 | 0.0271 | 0.3736 | 0.0682 |
SL-Clayton Kershaw | 0.73304 | 968.0 | 0.3084 | 0.4437 | 0.0584 | 0.1894 | 0.2589 | 0.1492 |
SL-Sergio Romo | 0.71216 | 655.0 | 0.0139 | 0.0374 | 0.0648 | 0.8838 | 0.3468 | 0.1894 |
SL-Kyle Crick | 0.68792 | 270.0 | 0.0396 | 0.0428 | 0.0121 | 0.9055 | 0.4364 | 0.0781 |
SL-Adam Morgan | 0.68696 | 354.0 | 0.3318 | 0.3685 | 0.0505 | 0.2493 | 0.3819 | 0.1687 |
SL-Cory Gearrin | 0.64134 | 346.0 | 0.0242 | 0.5216 | 0.1587 | 0.2955 | 0.3974 | 0.1222 |
SL-Jordan Hicks | 0.62949 | 280.0 | 0.4356 | 0.4129 | 0.1242 | 0.0273 | 0.5182 | 0.0933 |
SL-Mike Fiers | 0.5929 | 465.0 | 0.0265 | 0.6359 | 0.311 | 0.0265 | 0.1351 | 0.2155 |
SL-Noah Syndergaard | 0.58858 | 503.0 | 0.01 | 0.0346 | 0.9367 | 0.0187 | 0.4613 | 0.1942 |
SL-Ray Black | 0.58207 | 126.0 | 0.0828 | 0.2415 | 0.102 | 0.5736 | 0.4348 | 0.0588 |
SL-Brett Anderson | 0.57186 | 303.0 | 0.054 | 0.6515 | 0.1481 | 0.1464 | 0.2981 | 0.2353 |
To quote Forbes:
Hands down, the worst qualifying pitch of 2018 was Mike Fiers' slider, which earned a D grade.
Why was this one so bad?
pitcher = 'Mike Fiers'
pitcher2 = 'Tyler Glasnow'
p_throws = 'R'
pitch = 'SL'
grapher.hexes(df, pitch_pitcher, p_throws, pitch, pitcher, pitcher2)
Comparing Mike Fiers and Tyler Glasnow, we see that Glasnow's sliders are thrown with a much higher spin rate and a much more significant vertical drop.
At the end of the year, MLB hands out two Cy Young awards, one for the best pitcher in each league. Last year the awards were won by Jacob deGrom of the New York Mets and Blake Snell of the Tampa Bay Rays. Let's see how their arsenals compare to the rest of the league.
Jacob deGrom is known for his fastball (FF).
outliers_all.filter(like='Jacob deGrom', axis=0)
pitch | outlier_score | pitches | fi.mean_pfx_x | fi.mean_pfx_z | fi.release_speed | fi.release_spin_rate | whiff | babip |
---|---|---|---|---|---|---|---|---|
SL-Jacob deGrom | 0.04279 | 770.0 | NaN | NaN | NaN | NaN | 0.36090 | 0.17105 |
CU-Jacob deGrom | 0.02486 | 255.0 | NaN | NaN | NaN | NaN | 0.34444 | 0.22917 |
FT-Jacob deGrom | 0.01812 | 295.0 | NaN | NaN | NaN | NaN | 0.12179 | 0.20408 |
FF-Jacob deGrom | 0.01605 | 1374.0 | NaN | NaN | NaN | NaN | 0.31946 | 0.17231 |
CH-Jacob deGrom | 0.00878 | 516.0 | NaN | NaN | NaN | NaN | 0.33948 | 0.11852 |
pitcher = 'Jacob deGrom'
p_throws = 'R'
pitch = 'FF'
grapher.hexes(df, pitch_pitcher, p_throws, pitch, pitcher)
However, his fastballs are perfectly average! His dominance can't be explained with the features we've selected. deGrom's success most likely comes from his impeccable command, his ability to mix up pitches, and his ability to beat batters in the mental game.
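One way to look at how a pitcher mixes his pitches is the share of each pitch type in his arsenal. The sketch below uses the pitch counts from the deGrom table above; note that it only covers the pitch types that appear in that table, so the shares are relative to those five pitches.

```python
import pandas as pd

# Pitch counts for Jacob deGrom, copied from the table above.
counts = pd.Series(
    {"FF": 1374, "SL": 770, "CH": 516, "FT": 295, "CU": 255},
    name="pitches",
)

# Share of each pitch type among the listed pitches.
mix = (counts / counts.sum()).round(3)
print(mix)
```

Even this simple view shows that the fastball makes up well under half of the listed pitches, so a batter cannot simply sit on it.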
outliers_all.filter(like='Blake Snell', axis=0)
pitch | outlier_score | pitches | fi.mean_pfx_x | fi.mean_pfx_z | fi.release_speed | fi.release_spin_rate | whiff | babip |
---|---|---|---|---|---|---|---|---|
FF-Blake Snell | 0.14203 | 1500.0 | 0.20837 | 0.05387 | 0.59692 | 0.14084 | 0.25000 | 0.16199 |
SL-Blake Snell | 0.13572 | 266.0 | 0.48171 | 0.18924 | 0.27722 | 0.05183 | 0.48276 | 0.06897 |
CH-Blake Snell | 0.08952 | 559.0 | NaN | NaN | NaN | NaN | 0.31047 | 0.14483 |
CU-Blake Snell | 0.07638 | 588.0 | NaN | NaN | NaN | NaN | 0.53430 | 0.11644 |
pitcher = 'Blake Snell'
p_throws = 'L'  # Blake Snell throws left-handed
pitch = 'CU'
grapher.hexes(df, pitch_pitcher, p_throws, pitch, pitcher)
Blake Snell's curveball was known to be one of the most dominant pitches of 2018, but according to our analysis, it doesn't seem all that outlying. This likely has to do with the fact that an outstanding pitch needs to be contextualized. So much of pitching is about control, mixing up pitches, and throwing the right pitch at the right time.
"I think it's just mainly because of how hard he throws his fastball," Blue Jays catcher Luke Maile said. "It comes out of the same slot, he's got that angle and that kind of upshot heater, and that breaking ball kind of starts out as something you need to hit. Then, before you know it, he threw it 55 feet."
In this analysis, we've managed to highlight a few pitches, both good and bad: some that are completely unique, and some thrown by pitchers with rare abilities. However, we learned that the Cy Young winners were not all that different from the rest of the pack when looking at the physical factors of their pitches.
So, there's more to pitching than unusual physical factors of the pitch. To quote Yogi Berra:
Baseball is 90% mental, the other half is physical.
There are also a lot of important situational details that make a pitcher effective. To name a few:
Identifying a truly outstanding pitch and pitcher takes all of these features into account and looks at how they progress over time. In a subsequent analysis, we can dig deeper into this rich dataset and try to uniquely identify Blake Snell's and Jacob deGrom's award-winning seasons.