How Do You Find Data that Doesn't Exist?

jdweck42

Jul 5, 2023 · 4 min read

Updated: Jul 13, 2023

Joshua Dweck

July 5, 2023

Over the last few years, analytics communities across many sports and levels have compiled and published large swaths of data going back deep into the past. From my research, tennis has one primary resource: Tennis Abstract’s Jeff Sackmann and his Match Charting Project have crowdsourced the compilation of over 30 different variables from each point of every tagged match.

Jeff Sackmann’s compilation, while an incredible service to tennis, is imperfect. There are some sampling issues involved in the tagging, but no other publicly available source offers anywhere near as much data. The bigger limitation is that a match’s data is not available until someone decides to tag it and finishes doing so. This creates a time lag that makes it impossible for an analyst to create and distribute content relevant to a tournament while it is being played. I used that data for the math involved in this project, but I needed another way of getting at current data.

With the data not publicly available in an easily accessible format, what comes next? The typical next step is to scrape it directly from the source. I started by attempting to pull down data from Roland Garros and their data provider, Infosys. Infosys provides data for the Australian Open and Roland Garros, and each of those tournament websites has an extensive, easily accessible archive of past matches. When scraping Roland Garros data, I was able to find point-level data for all but a few singles matches – both main draw and qualifiers – from the last six tournaments. For the year’s first two Slams, you can stop here. But that only gets you to the middle of June and the middle of the Slam season.

The other two Grand Slams, Wimbledon and the US Open, have IBM as their data provider. Neither tournament’s website provides archives the way the others do. The US Open website, however, does have the matches – point-by-point data and all – for 2022. At least for now. But now, in July 2023, the tournament in progress is Wimbledon, struggling through the rain and a lack of lighting on its outdoor courts in London’s SW19. The Championships, alongside IBM, do not provide archives of any match. In fact, as soon as a match ends, its point-by-point data becomes unavailable on the Wimbledon website. So, among the four Grand Slams, timely Wimbledon data is the Holy Grail. If the data goes away before you can go get it, how can you get at it anyway?

After failing to find a way in at the source, the logical next step is to ask someone who might have the data and be able to give it to you. I reached out to a connection who works in data at one of tennis’s national governing bodies. They confirmed that, whether or not they had the data, they could not share it with an outside party.

I was determined to find a way to get this data. I wanted the statistic that I am developing to see the light of day before the run-in to the US Open in late August. The data must be SOMEWHERE, right? Otherwise, how could ESPN reliably maintain the scorebug? How would anyone be able to follow the match from the website? So, I dug further, got a little bit creative, and found a way.

With every public resource exhausted, and with the people who might have the data obligated not to give it to me, I had to think outside the box. The solution I found can best be described as the computerized version of sticking your foot in a closing door. While Wimbledon does not keep the data stored in a publicly accessible place, the data does go out during the match in a tab that disappears as soon as the match ends. The tab is not clickable after the match ends, but if it is already open, it will not close on you. That’s where you stick your foot. My solution: keep that tab open, in a form that can be scraped, for every match out on court simultaneously. By running the function that gets the data for each match in its own thread, I figured out how to hold that tab open for every match at the same time. There are no second chances this way, and errors in the wrong places are unrecoverable, but this method gives you exactly what you need, most of the time.
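To make the one-thread-per-match idea concrete, here is a minimal Python sketch. It is not the actual program, just an illustration under stated assumptions: `fetch_points`, `watch_match`, and `watch_all` are hypothetical names, and `fetch_points` stands in for whatever reads the already-open tab, returning the points seen so far or `None` once the match is over.

```python
import json
import threading
import time
from pathlib import Path
from typing import Callable, List, Optional

# Hypothetical callable: returns the points currently shown in a match's
# open tab, or None once the match has ended. A real version would read
# from a browser tab or HTTP session opened before the match ended.
FetchPoints = Callable[[str], Optional[List[dict]]]


def watch_match(match_id: str, fetch_points: FetchPoints,
                out_dir: Path, poll_seconds: float = 5.0) -> None:
    """Poll one match until it ends, then write its points to a JSON file."""
    points: List[dict] = []
    try:
        while True:
            snapshot = fetch_points(match_id)
            if snapshot is None:  # match over: stop polling
                break
            points = snapshot  # keep the latest full snapshot
            time.sleep(poll_seconds)
    finally:
        # Save whatever was collected, even if fetching raised an error.
        (out_dir / f"{match_id}.json").write_text(json.dumps(points, indent=2))


def watch_all(match_ids: List[str], fetch_points: FetchPoints,
              out_dir: Path, poll_seconds: float = 5.0) -> List[Path]:
    """Watch every match at once, one thread per match."""
    out_dir.mkdir(parents=True, exist_ok=True)
    threads = [
        threading.Thread(target=watch_match,
                         args=(match_id, fetch_points, out_dir, poll_seconds),
                         name=f"match-{match_id}")
        for match_id in match_ids
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()  # wait until every match has finished
    return [out_dir / f"{match_id}.json" for match_id in match_ids]
```

Threads fit here because each watcher spends nearly all of its time waiting, and a failure in one match does not stop the others. The `finally` block is one way to soften the "no second chances" problem: if a fetch fails partway through, whatever was already collected still lands in the folder.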

I began building an algorithm and writing a program to scrape this data on Friday, the day after the last day of qualifying. Testing began on Monday, Day 1 of the main draw. And, as of midday on Wednesday, Day 3 of the main draw, I am pleased to announce that it is working properly. Whenever a match ends, a file containing all the data from that match that I need for my calculations appears in a folder. In honor of that achievement, this post will be accompanied in a few days by a primer on the new statistic that I am creating using the data retrieved from the Grand Slams and Jeff Sackmann’s Match Charting Project. Watch this space – let’s see if everything keeps working.

Update 07/09/2023: Everything still works well. In the days since this was posted, the Wimbledon website has started leaving the tab open with all the data. I am continuing to use the program, since it does everything automatically, even when I am not near my computer. But if anything goes wrong, there is an issue with the internet, or something looks off, I am now able to get that match off the website.
