There has been a lot of hype about the new MLBAM StatCast system, a player-tracking/raw data machine. With all of this new data will come a need for more data analysis, and most likely, a better way to store and track data. I have manually compiled every piece of StatCast data currently available to the public through the various videos published on MLB.com, demonstrating some of the impressive capabilities of the new system.

The data was compiled from a few 2013-2014 regular-season games, the 2014 All-Star Game, and the 2014 playoffs. Below I have added links to downloadable spreadsheets demonstrating a few of the key fields that might be collected for each play in a major league baseball game using StatCast. The database that I created for this new StatCast data includes seven tables connected to the Lahman database, which I use to query players’ past statistics. Of those seven tables, four hold information that I predict will become future talking points not only for front offices and statistical baseball writers, but for the casual fan as well. These four tables, which hold all of the fancy new statistics, are the Pitching, Batting, Fielding, and Running tables.

This StatCast database is meant to store every play within each game of a season, using a play ID to connect plays from table to table. Using the player IDs from the Lahman database seemed to me to be the easiest way to implement the new statistics, since it will be helpful in the future to query stats from both the Lahman files and the new StatCast files. This setup will also allow me to write SQL queries for counting and rate statistics, making it easy to summarize a player’s season and career StatCast numbers.
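To make the layout concrete, here is a minimal sketch of one way such a setup could look. It uses Python's built-in sqlite3 module as a stand-in database, and everything specific in it is an assumption for illustration: the `Plays` and `Batting` table layouts, the column names `play_id` and `exit_velocity_mph`, and the placeholder rows are all invented, not taken from the actual database. Only the `playerID`, `nameFirst`, and `nameLast` columns follow the Lahman player table (called `Master` in older Lahman releases).

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical StatCast layout."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        -- Stand-in for the Lahman player table (Master in older releases).
        CREATE TABLE Master (
            playerID  TEXT PRIMARY KEY,
            nameFirst TEXT,
            nameLast  TEXT
        );
        -- One row per play; play_id ties the StatCast tables together.
        CREATE TABLE Plays (
            play_id INTEGER PRIMARY KEY,
            game_id TEXT
        );
        -- Hypothetical batting table keyed by play and Lahman player ID.
        CREATE TABLE Batting (
            play_id           INTEGER REFERENCES Plays(play_id),
            playerID          TEXT REFERENCES Master(playerID),
            exit_velocity_mph REAL
        );
        """
    )
    # Placeholder rows, invented purely for illustration (not real players or data).
    conn.executemany(
        "INSERT INTO Master VALUES (?, ?, ?)",
        [("demo01", "Alex", "Example"), ("demo02", "Sam", "Sample")],
    )
    conn.executemany(
        "INSERT INTO Plays VALUES (?, ?)",
        [(1, "GAME1"), (2, "GAME1"), (3, "GAME2")],
    )
    conn.executemany(
        "INSERT INTO Batting VALUES (?, ?, ?)",
        [(1, "demo01", 100.0), (2, "demo02", 88.0), (3, "demo01", 90.0)],
    )
    return conn


def batting_summary(conn):
    """Return a counting stat (tracked plays) and a rate stat (average exit velocity) per player."""
    return conn.execute(
        """
        SELECT m.nameFirst || ' ' || m.nameLast AS player,
               COUNT(*)                         AS tracked_plays,
               AVG(b.exit_velocity_mph)         AS avg_exit_velocity
        FROM Batting AS b
        JOIN Master  AS m ON m.playerID = b.playerID
        GROUP BY m.playerID, m.nameFirst, m.nameLast
        ORDER BY avg_exit_velocity DESC
        """
    ).fetchall()


if __name__ == "__main__":
    for row in batting_summary(build_demo_db()):
        print(row)
```

Because the StatCast rows carry the same `playerID` as the Lahman tables, the same join pattern should extend to any Lahman table that is keyed by player.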

As you look over the numbers, you will see some stars like Mike Trout, Andrew McCutchen, and Troy Tulowitzki. As I stated before, I was limited to the stats that MLB released from 2013 through 2014, so the data on some of these players is incomplete or nonexistent. This was more of a project about using the data we know can be tracked to create workable tables that can be combined with other databases; in my case, I am merging the new data with the Lahman baseball files. While we have little data to work with now, I will be ready to incorporate lots of play-by-play StatCast stats into my database in the future.

As you can see, there are lots of null values. This is due to the incomplete information available for each play. In theory, all of these fields would be filled if and when StatCast data becomes available to the public.
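Null values like these do not have to break summary queries. The sketch below, again using Python's sqlite3 module as a stand-in, shows that SQL aggregates such as `AVG` and `COUNT(column)` skip nulls, while `COUNT(*)` still counts every row. The `Running` table layout and the `lead_length_ft` column name are assumptions made for the example.

```python
import sqlite3


def null_aware_summary(values):
    """Return (total rows, rows with a value, average of the present values)."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical single-column table; None is stored as SQL NULL.
    conn.execute("CREATE TABLE Running (lead_length_ft REAL)")
    conn.executemany("INSERT INTO Running VALUES (?)", [(v,) for v in values])
    row = conn.execute(
        "SELECT COUNT(*), COUNT(lead_length_ft), AVG(lead_length_ft) FROM Running"
    ).fetchone()
    conn.close()
    return row
```

Comparing the first two numbers in the result gives a quick measure of how complete a field is.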

I suggest that you browse each spreadsheet to get a feel for the data.

 

Batting – Download the full Batting table

Batting_StatCast

 

Fielding – Download the full Fielding table

StatCast_Fielding1

 

Pitching – Download the full Pitching table

Pitching_StatCast

 

Running – Download the full Running table

Running_StatCast

 

OK, now that you have played around with the spreadsheets, you might be thinking of unique ways to use these numbers to help evaluate players. Personally, I keep an ongoing brainstorming journal that lists ways in which teams and their management can use StatCast to test the overall performance of players. It might make a good topic for a future crowdsourcing post.

Just for fun, let’s see who ranks highest in some of these new statistical categories based on the tiny amount of data we have:

 

Batters

Greatest Exit Velocity (off bat): Eric Hosmer, KC, 106.1 mph

Longest Fly Time: Juan Perez, SFN, 5.01 sec

Shortest Fly Time: Kolten Wong, STL, 0.95 sec

 

Fielding

Quickest Acceleration: Anthony Recker, NYN, 4.27 ft/sec²

Greatest Max Speed: Billy Hamilton, CIN and Ruben Tejada, NYN, 23.3 mph

Highest Route Efficiency: Omar Quintanilla, NYN, 100%

Quickest Release: Tony Cruz, STL, 0.37 sec

Fastest Velocity: Andrew McCutchen, PIT, 78.8 mph

Quickest First Step: Travis d’Arnaud, NYN, -1.7 sec

 

Base Running

Quickest First Step: Jhonny Peralta, STL, -1.18 sec

Quickest Acceleration: Omar Infante, KC, 9.99 ft/sec²

Greatest Max Speed: Jarrod Dyson, KC, 22.3 mph

Largest Lead Length: Pablo Sandoval, SFN, 17 ft

Largest Secondary Lead Length: Brandon Crawford, SFN, 21 ft

 

Pitching

Longest Extension: Yusmeiro Petit, SFN, 92 in

Highest Actual Velocity: Kevin Gausman, BAL, 99.6 mph

Highest Perceived Velocity: Kevin Gausman, BAL, 100.7 mph

Largest Difference between Perceived and Actual Velocity: Francisco Rodriguez, MIL, 2.9 mph

Greatest Spin Rate: Sergio Romo, SFN, 3002 rpm

 

These stats really don’t mean much since they’re only taken from a few plays, but imagine what we could come up with if we had every game’s stats. Also, think about how we could correlate some of this data with other metrics. How does a pitcher’s Spin Rate affect his Fly Ball or Ground Ball rate? How does a player’s Lead Length or First Step affect his Stolen Base percentage? Does a batter’s average Exit Velocity or Launch Angle have any correlation with his BABIP or OPS? We would no longer have to just eyeball whether a player is quick out of the box, or whether he consistently takes a good route to the ball. This could also help quantify areas that players need to work on. A batter will now know if he needs to work on his acceleration out of the box, and a pitcher will know if his extension is causing him to throw more balls.
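Questions like these come down to measuring how two columns move together. As a rough sketch, assuming one value per player for each metric, a Pearson correlation could be computed as below. The function name `pearson_r` is made up for this example, and pairs with a missing value are skipped, since so many fields are still null.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation over the pairs where both values are present."""
    # Drop any pair with a missing (None) value on either side.
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    if len(pairs) < 2:
        raise ValueError("need at least two complete pairs")
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    # Sums of cross-products and squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / sqrt(var_x * var_y)
```

A result near 1 or -1 would suggest a strong linear relationship, and a result near 0 a weak one. With only a handful of plays per player, any such number should be read as a curiosity rather than a finding.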

All of these things will be dealt with as soon as we get more data. I am trying to increase my “First Step” rate by creating an Access database to house the new data before it is available. By no means do I think I have hit the nail on the head with this first attempt to store the new stats, but I at least wanted to get the ball rolling.

Stephen writes about Major League Baseball at BP Bronx and Banished To The Pen. He also informs readers about college baseball at the blog Underground Baseball. Follow Stephen on Twitter at @steve21shaw


7 Responses to “A Compilation of Public MLB StatCast Statistics”

  1. Matt Jackson

    Great stuff, Stephen. This is a tremendous resource.

    Any thoughts on how to handle negative values in the first step field? Perhaps they’re fine to leave as is as they reflect anticipation? I’d be interested to hear your thoughts on that and any possible limitations of the data you’ve noticed.

    • Stephen Shaw

      The negative values shouldn’t be too much of a problem for calculating rate statistics. You can run a basic regression with negative values.

      From my understanding, the first step starts when the body starts moving. For example, if a player was tagging up on a fly ball and had a negative first step, it doesn’t necessarily mean he left early, only that his body started moving before the ball was caught.

      I might put together a piece on how I would use the data and the limitations that we might have to deal with.

    • Stephen Shaw

      We do not know for sure how much data will be released to the public. If I had to guess, for the upcoming season we will probably only see more of the same type of videos being periodically released on the MLB.com website.

      With that being said, since all of the teams will have these tracking systems installed in their ballparks, I do not see why they wouldn’t release the data to the public. If they do, it will probably come in some type of XML format, much like the PITCHf/x data.

  2. Matt

    I am excited to get into this. My thinking is the defensive metrics are about to take on an entirely different meaning. Brush up on your Ordinary Differential Equations and head back into the books for Vector Analysis and Vector Calculus. Fortunately I work with this every day, so I am chomping at the bit to get some raw data, even if I have to glean it from video for my own personal use.

