Translating Retrosheet to JSON in Java

Happy Earth Day

Take a look at a sample from a Retrosheet event file:

com,"call was upheld by replay"
play,7,1,donaj001,00,,NP
sub,dysos001,"Sam Dyson",0,0,1
play,7,1,donaj001,20,.BBX,46(1)/FO/P.3-H(UR);2-3
play,7,1,bautj002,11,FBX,HR/78/L.3-H(UR);1-H(UR);B-H(TUR)
com,"ej,buehm001,P,scotd901,Coming onto the field (not on roster)"
com,"$Jose Bautista stood at HP admiring his homer"

Now, if I hadn’t already watched 4 replays of this game, a lot of this CSV file would be unclear to me. The format is also difficult to use in a web API or mobile app, which is why I was surprised that I couldn’t easily find a JSON version of the Retrosheet database.

The JSON

GitHub Link

Download the repository and see the Readme.md file for instructions on how to run the tool. You have to provide the Retrosheet event files yourself; just download them from retrosheet.org and extract them into the JSONRetro/data folder.

Here’s a sample JSON from a different play in that game:

"play" : {
    "inning" : 7,
    "team" : 0,
    "batter" : "choos001",
    "count" : {
        "balls" : 1,
        "strikes" : 2
    },
    "pitches" : "FBFB",
    "event" : "OA/UREV.3-H(UR)(E2/TH)(NR)"
},
"play" : {
    "inning" : 7,
    "team" : 0,
    "batter" : "choos001",
    "count" : {
        "balls" : 2,
        "strikes" : 2
    },
    "pitches" : "FBFB.FS",
    "event" : "K"
}

I still haven’t decided on a way to parse “event” and “pitches”, but that’s why the tool is open source.
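To give a feel for what parsing “pitches” might involve, here is a minimal sketch (not the repo’s code) that derives a ball/strike count from a Retrosheet pitch string. The character codes loosely follow the Retrosheet event-file spec; modifiers like '.', '>', '*', '+', and pickoff throws are skipped, and some rarer codes are simplified.

```java
import java.util.Set;

/** Sketch: derive a ball/strike count from a Retrosheet pitch string.
 *  Not the repo's parser; codes are simplified from the Retrosheet spec. */
public class PitchCount {
    private static final Set<Character> BALLS   = Set.of('B', 'I', 'P', 'V');
    private static final Set<Character> STRIKES = Set.of('C', 'S', 'K', 'M', 'Q');
    private static final Set<Character> FOULS   = Set.of('F', 'T', 'L', 'R');

    public static int[] count(String pitches) {
        int balls = 0, strikes = 0;
        for (char c : pitches.toCharArray()) {
            if (BALLS.contains(c)) balls++;
            else if (STRIKES.contains(c)) strikes++;
            else if (FOULS.contains(c) && strikes < 2) strikes++; // a foul can't be strike three
            // 'X' (ball in play), '.', '>', pickoff digits, etc. don't change the count
        }
        return new int[] { balls, strikes };
    }

    public static void main(String[] args) {
        int[] c = count("FBFB.FS"); // the strikeout sequence from the sample above
        System.out.println(c[0] + "-" + c[1]);
    }
}
```

Running it on “.BBX” from the event file above yields 2 balls and 0 strikes, which matches the “20” count field on that line.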

The Java

I made heavy use of the Apache Commons libraries, simply iterating the CSVRecords in a CSVParser and building a string that gets written to a .json file. There is a 1:1 mapping of event files to JSON files. The output is about 3 times the size of the input, but a single JSON file is still typically under 1 megabyte, so you can easily hold a large number of them in memory at once, even on a mobile device.
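The per-record translation boils down to something like the following sketch. The real tool splits fields with Apache Commons CSV; here the record is already split into an array so the example stays self-contained, and the field order mirrors the “play” lines shown above (play, inning, team, batter, count, pitches, event).

```java
/** Sketch of translating one Retrosheet "play" record into the JSON shape
 *  shown above. The actual tool uses Commons CSV's CSVRecord; a plain
 *  String[] stands in for it here. */
public class PlayToJson {
    public static String toJson(String[] f) {
        String count = f[4]; // two digits: balls then strikes (may be "??" in old files)
        StringBuilder sb = new StringBuilder();
        sb.append("\"play\" : {\n")
          .append("    \"inning\" : ").append(Integer.parseInt(f[1])).append(",\n")
          .append("    \"team\" : ").append(Integer.parseInt(f[2])).append(",\n")
          .append("    \"batter\" : \"").append(f[3]).append("\",\n")
          .append("    \"count\" : {\n")
          .append("        \"balls\" : ").append(count.charAt(0)).append(",\n")
          .append("        \"strikes\" : ").append(count.charAt(1)).append("\n")
          .append("    },\n")
          .append("    \"pitches\" : \"").append(f[5]).append("\",\n")
          .append("    \"event\" : \"").append(f[6]).append("\"\n")
          .append("}");
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] record = "play,7,1,bautj002,11,FBX,HR/78/L".split(",", -1);
        System.out.println(toJson(record));
    }
}
```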

The tool could probably be made much faster with some multithreading, and perhaps by not writing the StringBuilder out to file on every iteration. What it does do is produce 100% compliant JSON, as validated by JSONLint, in about a minute.


I have MLB Gameday data, the Lahman database in JSON, and now a JSON version of Retrosheet. This is the starting point for parsing every MLB game down to the pitch level, with more recent years including details such as the spin rate of every pitch and a description of every play in both Spanish and English.

Mapping MLBAM Player IDs to the Lahman Database

After Downloading Gameday Data, I wanted to make a short post about translating the Lahman database into JSON. The purpose is so that I can compare season stats from Lahman with at-bat outcomes from MLB Gameday.

If you just want to download the JSON translations, check out JSONLahman on GitHub.

The JSON

Here’s an example of what a player’s entry looks like:

{"item":{"KEY_RETRO":"tulot001","KEY_MLBAM":"453064","KEY_BBREF":"tulowtr01","KEY_FANGRAPHS":"3531","THROWS":"R","BATS":"R","NAMEFIRST":"Troy","NAMELAST":"Tulowitzki","KEY_LAHMAN":"tulowtr01","BIRTHYEAR":"1984","BIRTHMONTH":"10","BIRTHDAY":"10"}}

The properties of the form “KEY_XYZ” are the player IDs from a variety of websites. The ones that I’m using are KEY_MLBAM and KEY_LAHMAN, but I’ve included Fangraphs, BBRef, and Retrosheet for the fantasy baseball enthusiasts.

A Bit of Java Code

The first step was to get a list of all players in the MLBAM database. The Chadwick Bureau maintains an exhaustive and reliable list of these players, but unfortunately it doesn’t include the Lahman player IDs. I then wrote a bit of R code that merges this CSV file with the Master.csv file from Lahman, joining on the Retrosheet ID.

# Chadwick Bureau register and the Lahman Master file
people <- read.csv('people.csv')
master <- read.csv('master.csv')
# keep only the columns needed from each source
people <- people[, c("key_mlbam", "key_bbref", "key_retro", "key_fangraphs")]
master <- master[, c("retroID", "throws", "bats", "nameFirst", "nameLast", "key_lahman", "birthYear", "birthMonth", "birthDay")]
# inner join on the shared Retrosheet ID
merged <- merge(x=people, y=master, by.x = "key_retro", by.y = "retroID")
write.csv(merged, file="merged.csv")

I then imported the CSV into Razor SQL to build a Derby database that I can embed into a Java application. I could have used SQLite but I love Apache’s Java libraries.

ResultSet res = stmt.executeQuery("SELECT * FROM PLAYERS");
File file = new File(baseLocalURL + "\\players.json");
writeTableToFile(res, file);
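For each row, a writeTableToFile-style method just needs to emit one of the “item” objects shown earlier. Here is a sketch of that step; the real code would read column names from ResultSetMetaData, while a plain Map stands in for one ResultSet row so the example is self-contained.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: turn one row (column name -> value) into the {"item":{...}}
 *  JSON line shown in the Lahman sample. Not the repo's actual method. */
public class RowToJson {
    public static String itemJson(Map<String, String> row) {
        StringBuilder sb = new StringBuilder("{\"item\":{");
        boolean first = true;
        for (Map.Entry<String, String> e : row.entrySet()) {
            if (!first) sb.append(',');
            sb.append('"').append(e.getKey()).append("\":\"")
              .append(e.getValue()).append('"');
            first = false;
        }
        return sb.append("}}").toString();
    }

    public static void main(String[] args) {
        Map<String, String> row = new LinkedHashMap<>(); // preserves column order
        row.put("KEY_RETRO", "tulot001");
        row.put("KEY_MLBAM", "453064");
        System.out.println(itemJson(row));
        // {"item":{"KEY_RETRO":"tulot001","KEY_MLBAM":"453064"}}
    }
}
```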

You can see all the code in the GitHub repo linked above. It’s not the best code but the goal was just to get the output.

Downloading MLB Gameday Data in Java

I have a project that I’m working on that involves downloading an entire season’s worth of MLB Gameday data. These files include per-pitch data for all at bats since 2007, with increasingly detailed information being recorded every season. I’ve written a tool in Java that downloads the pertinent files for my needs.

Here’s a sample from a Rangers @ Jays game from the 2015 ALDS:

The last 3 files hold the PITCHf/x data, and the first two are used for at-bat and game metadata.

How to do it

If you have Java installed, the quick way is to download this and run it from your terminal or command line. The command takes two parameters: 1) the year to download, and 2) the local directory in which to save the files.

java -jar Downloader.jar 2014 "c:\Users\majorsaber\data"

I think it took about 25 minutes for the 2014 season. The tool will continue where it left off if you have to cancel it mid-download, but it may leave the one file that was being downloaded at the time corrupted. This could probably be improved by multithreading the download, which is why I made the code open source.

[UPDATE: The tool is now multithreaded and can download an entire season in 5 minutes!]
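The multithreading idea boils down to submitting one download task per game directory to a small fixed pool. This is an illustrative sketch, not the tool’s actual code; the pool size of 8 is an arbitrary assumption.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Sketch of parallelizing the downloads: one Runnable per game directory
 *  on a fixed thread pool. Pool size and names are illustrative only. */
public class ParallelDownload {
    public static void downloadAll(List<Runnable> gameTasks) {
        ExecutorService pool = Executors.newFixedThreadPool(8);
        gameTasks.forEach(pool::submit);
        pool.shutdown(); // accept no new tasks, but finish everything queued
        try {
            pool.awaitTermination(1, TimeUnit.HOURS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Since the files are independent of each other, the downloads are embarrassingly parallel and the speedup is limited mostly by bandwidth and the server.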

How it works

GitHub Repository

The URL template for a single game’s worth of data is

http://gd2.mlb.com/components/game/mlb/year_2015/month_10/

and this is the contents of that directory. By fetching the DOM for that HTML directory listing and parsing it with jsoup, I get a list of all folders that begin with “gid_”. I can then fetch the 5 files that I need from each game’s directory using Apache Commons IO’s very useful copyURLToFile method.
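The directory-listing step can be sketched as follows. The real tool walks the DOM with jsoup; to keep this example dependency-free, a regex over the raw HTML shows the same idea of keeping every href that starts with “gid_”.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: extract the "gid_" game folders from an HTML directory listing.
 *  The actual tool uses jsoup; this regex version is for illustration. */
public class GidLister {
    private static final Pattern GID_LINK = Pattern.compile("href=\"(gid_[^\"]+)\"");

    public static List<String> gameDirs(String html) {
        List<String> dirs = new ArrayList<>();
        Matcher m = GID_LINK.matcher(html);
        while (m.find()) dirs.add(m.group(1)); // the href value, e.g. gid_.../
        return dirs;
    }

    public static void main(String[] args) {
        String listing =
            "<li><a href=\"gid_2015_10_14_texmlb_tormlb_1/\">gid_...</a></li>"
          + "<li><a href=\"batters/\">batters/</a></li>";
        System.out.println(gameDirs(listing)); // only the gid_ entry survives
    }
}
```

Each returned folder name is then appended to the month URL, and the 5 needed files are pulled from inside it.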

That’s it. My next task is to map the batter and pitcher IDs from the Gameday data to that in the Lahman Database and Retrosheet so I can cross reference player stats with at-bat level data.

*bat flip*