DFS Moneyball: Beating daily fantasy sports with predictive analytics

By Michael A. Henk, Nicholas Blaubach | 24 March 2016

As Major League Baseball’s opening day approaches, everyone watching sporting events can expect to see advertisements for daily fantasy sports (DFS) contests. This phenomenon continues to sweep the globe despite well-documented and still-unresolved concerns regarding its legality. Most sports fans are familiar with fantasy sports, but this take on a long-existing game is relatively new and offers huge cash prizes for entrants who can conquer the most elite competitions. The million-dollar question (sometimes literally) is “How can I win the game?”

Before we can answer that question, we’ll need a brief history lesson on where this DFS phenomenon originated. Then we’ll analyze the strategy of one known million-dollar winner. Finally, we’ll see if we can use our predictive analytics, data science, and actuarial forecasting background to “win” the game and make DFS a rational investment option for advanced modelers.


For those unfamiliar with the basic concept, a “fantasy sport” is defined as a game where participants build teams that compete against each other based on the performance of the real-life individual athletes or teams of a professional sport. In 2015, over 55 million people in the United States and Canada played some form of fantasy sports.

While fantasy sports have been around for decades, daily fantasy sports have been gaining popularity over the last few seasons. The concept behind them is relatively simple and differs from traditional fantasy sports only in the length of the commitment: participants enter into a one-day (or one-week) contest where they attempt to assemble the optimal lineup. “Points” are accrued by a drafted athlete’s performance, where more points are allocated for more favorable outcomes (a hypothetical baseball example may be 10 points for a home run versus 1 point for a walk). Participants must also stay within a predetermined salary cap, which is essentially a “budget,” preventing participants from selecting the most elite, and thereby most expensive, athletes across the board. Salaries are determined by the DFS provider and can be considered representative of the “expected” number of points that particular athlete will accrue during the game. After the real-life games are complete, the lineup that scores the most points versus other entries in that particular contest wins.
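To make the scoring and salary cap mechanics concrete, here is a minimal Python sketch. Only the 10-point home run and 1-point walk come from the hypothetical example above; the point value for a single, the $50,000 cap, the salaries, and the function names are all assumptions made for illustration, not any provider's actual rules.

```python
# Hypothetical scoring table: the home run and walk values follow the
# example in the text; the value for a single is an assumption.
SCORING = {"home_run": 10, "walk": 1, "single": 3}

SALARY_CAP = 50_000  # assumed cap, not a real provider's figure


def athlete_points(stat_line):
    """Convert one athlete's real-life stat line into fantasy points."""
    return sum(SCORING[event] * count for event, count in stat_line.items())


def lineup_score(lineup):
    """Total fantasy points for a lineup, rejecting lineups over the cap."""
    total_salary = sum(player["salary"] for player in lineup)
    if total_salary > SALARY_CAP:
        raise ValueError(f"lineup salary {total_salary} exceeds the cap")
    return sum(athlete_points(player["stats"]) for player in lineup)


# Two made-up athletes with made-up salaries and stat lines.
lineup = [
    {"name": "Batter A", "salary": 30_000, "stats": {"home_run": 1, "walk": 2}},
    {"name": "Batter B", "salary": 18_000, "stats": {"single": 2, "walk": 1}},
]

print(lineup_score(lineup))  # 12 points for Batter A plus 7 for Batter B
```

The lineup with the highest total among all entries in a contest wins, so the whole game reduces to choosing the athletes whose stat lines will be worth the most for the salary spent.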

A typical DFS contest involves participants entering into “pools” with a specific number of other participants, each contributing a specified dollar amount upfront. The DFS provider, of course, will take a small portion of each entry into the pool in order to generate revenue and cover the costs of operating the contest. Legally, the prizes for each pool must be stated up front. In the event that the contest doesn’t fill to capacity, the DFS provider must still pay out the stated prize. In the DFS industry, this is referred to as “overlay” and is a common problem for new DFS providers (as it was for the current industry providers at their own inceptions).
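The arithmetic behind overlay can be sketched in a few lines of Python. The entry fee, contest size, and 10% provider share below are invented for illustration, and we assume here that overlay is measured as the shortfall between the stated prize pool and the entry fees actually collected.

```python
def contest_economics(entry_fee, max_entries, actual_entries, rake_rate):
    """Return (stated_prize_pool, fees_collected, overlay) for one contest.

    Assumes the stated prize pool is set as if the contest fills, less the
    provider's share (rake_rate). All inputs are hypothetical.
    """
    stated_prize_pool = entry_fee * max_entries * (1 - rake_rate)
    fees_collected = entry_fee * actual_entries
    # Overlay: the part of the stated prize that entry fees do not cover.
    overlay = max(0.0, stated_prize_pool - fees_collected)
    return stated_prize_pool, fees_collected, overlay


# A $20 contest built for 1,000 entries that attracts only 800.
prize, collected, overlay = contest_economics(20, 1000, 800, 0.10)
print(prize, collected, overlay)
```

In this made-up case the provider owes roughly $18,000 in prizes but collects only $16,000, so it must fund the difference itself. When the contest fills, the same function returns an overlay of zero and the provider keeps its share.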


In 2006, a federal law commonly known as the “Unlawful Internet Gambling Enforcement Act” (UIGEA) was adopted that “prohibits gambling businesses from knowingly accepting payments in connection with the participation of another person in a bet or wager that involves the use of the Internet and that is unlawful under any federal or state law.” The act includes subsections that specifically address the legality of fantasy sports, setting out a series of guidelines that Internet-based fantasy sports must comply with to continue to operate legally. Essentially, the law defines fantasy sports as a game of skill rather than a game of chance. This definition is far from universally accepted and has come under fire recently, with a number of states (Arizona, Iowa, Louisiana, Missouri, Nevada, and Washington) already banning DFS, while others are currently challenging the DFS laws.

The legality of DFS continues to be questioned and cannot be accepted as a given going forward. However, until the laws written surrounding fantasy sports are clarified or terms such as “game of skill” or “game of chance” are better defined, it’s anticipated that DFS will continue to operate.

Winning the game

Now that we’ve touched on the history and (assumed) continued legality of DFS, we can start to discuss how to win the game. In 2015, the top market share providers paid out a combined $3 billion. How do we get our share?

There’s a reason for this, and the constant blast marketing campaigns are only a part of it. These companies have been valued very highly and have attracted heavy investment. To generate a return on this investment, these companies must grow. To grow, they need new participants. This is the reason for the significant investment in almost every advertising medium. The ads showcase the big winners, focusing on the lack of commitment and the potential to get rich quickly, an approach that is very similar to a traditional lottery advertisement.

Like traditional lottery advertisements, these ads never focus on the downside of the contests. The fact of the matter is most players lose. In addition to paying out a huge sum of money in 2015, DFS providers generated more losers than ever before. True, any participant can get lucky with a handful of entries and walk away with tens of thousands or even millions of dollars. In reality, almost all of the prize money is flowing to a small few who have built elaborate statistical models and use automated tools to generate hundreds of entries at once. These models not only build an optimized roster for competition, they’re also being used to identify the weakest opponents, further allowing them to cash in.

However, these elite DFS participants are a threat to growth in the industry. It’s hard to imagine that a lot of participants will stick around and continue to pay entry fees if they feel like they can’t win. On the other hand, state-run lotteries still have a significant number of people who keep coming back, so maybe there’s no threat to growth after all. That remains to be seen.

One of the “elite” winners of fantasy sports is Saahil Sud,1 who now plays fantasy sports full time, pulling in over $3 million in prizes in 2015 (net of entry fees). He’s now one of the top-ranked DFS participants (according to Rotogrinders, an online DFS tracker). He spends up to 15 hours a day “working” on his daily fantasy sports job. During the baseball season, he puts about 200 entries into tournaments each night and puts more than 1,000 entries a week into football contests during the NFL season.

So how did he do it, and more importantly, can we do it? His first step is something we can all replicate, and something actuaries do on a regular basis. He pulls data from various public resources online. As with any predictive modeling exercise, one of the first steps is procuring the data needed to calibrate our models.

Sud then takes this publicly available data and inputs it into his custom-built predictive models, which generate hundreds of different lineups based on his forecasts. There are plenty of websites available that offer to do this for you, but if they’re willing to do it for you, they’re willing to do it for everyone. Much like investment advice, a “hot tip” becomes less and less valuable the more publicly available it is. This is why Sud created his own software and makes sure that no one else can access his personalized database.

So how many participants have the secret sauce, the models that will win the most on average, and just how badly are they dominating the field? Analysis done by Rotogrinders for Bloomberg shows that the top 100 players enter more than 300 winning lineups per day, and the top 10 players (combined) win an average of 873 times a day. The rest of the 20,000 players tracked by Rotogrinders? They average 13 wins a day (combined).

Through the summer of 2015, players such as Sud came under scrutiny from other participants in a debate centered on automation. Some took issue with participants using software to change hundreds of lineups at a time, generally in response to last-minute injury announcements or other roster-related developments. In response, the two major providers clarified their policies, now requiring participants to get permission to run automation scripts on their sites. They do still allow automation, which represents a new area for potential growth in the DFS space.

So truly “winning” at DFS on a consistent basis is an exercise in big data management and predictive analytics. Those who identify the most predictive variables, the best “value” players, and the best overall matchups, in the shortest amount of time, stand to win the most. However, it’s important to think of “winning” at DFS the same way we think of “winning” at the stock markets. It’s generally a long-term investment, and the odds of striking it rich on any given play are low. Even Sud says his return on DFS investments is about 8%. The magnitude of his investments is what helps him generate these returns consistently.

When playing the stock market or poker or DFS, any information you have that others don’t gives you a competitive edge and a greater chance at winning, whether it's from a predictive model, prior knowledge, or elsewhere. The better the information, the better your odds of winning. However, if this information becomes public, it also increases the odds of a split pot, which dramatically reduces potential returns. So if you come up with a predictive model that consistently beats the odds and consistently wins you money, keep it to yourself.

Building our own predictive model

Building a predictive model to improve our odds at DFS is something far beyond the scope of an article of this nature (and as we said, if we develop a good one, we’re keeping it to ourselves), but it does give us a frame of reference in which we can talk about some basics of predictive modeling and data science. It’s important to note that, unlike most work in the corporate world, DFS predictive models are generally based on publicly available data. Thus, if you have an idea for a model, there is no harm in trying it out as the data is readily available and free statistical software packages (such as R) require little to no front-end investment.

At the core, and as you’ve likely guessed by now, predictive modeling leverages statistical data in order to forecast outcomes of an event. Predictive modeling ranges from things as simple as a linear regression model to something as complex as neural networks (or beyond). As an example, one predictive model that most everyone in the United States is familiar with is a credit score, where models process a person’s credit history, loan application, and customer data (among other things) in order to rank-order individuals by likelihood of making future credit payments on time.

There are some basic steps that serve as general “rules of thumb” when we set out to develop our predictive model to make us millions in DFS.

First, we need an objective. We want our model to optimize our roster, giving us the most potential points. In our DFS example, we’d want a predictive model that will help us identify the best players for the cost (in order to stay under the salary caps) for any given contest.
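This objective can be written as a small optimization problem: pick the roster with the most projected points whose total salary fits under the cap. The Python sketch below brute-forces every possible roster from a made-up pool of five players. The names, salaries, projections, roster size, and cap are all assumptions, and brute force is used only because the toy pool is tiny; it is not a description of how any actual participant builds lineups.

```python
from itertools import combinations

# Hypothetical player pool: (name, salary, projected points).
players = [
    ("A", 9000, 12.0),
    ("B", 7000, 10.0),
    ("C", 6000, 8.5),
    ("D", 4000, 6.0),
    ("E", 3000, 5.0),
]


def best_lineup(players, roster_size, salary_cap):
    """Return (lineup, projected_points) for the best roster under the cap.

    Tries every combination of roster_size players. Returns (None, -inf)
    if no combination fits under the cap.
    """
    best, best_points = None, float("-inf")
    for combo in combinations(players, roster_size):
        salary = sum(player[1] for player in combo)
        points = sum(player[2] for player in combo)
        if salary <= salary_cap and points > best_points:
            best, best_points = combo, points
    return best, best_points


lineup, points = best_lineup(players, roster_size=3, salary_cap=17_000)
print([player[0] for player in lineup], points)
```

Note that the most expensive player is left out of the winning roster here: the cap forces a trade-off between star power and value. With a realistic pool of hundreds of athletes and positional constraints, the number of combinations explodes, so a dedicated optimizer (integer programming, for instance) would be a more practical choice than enumeration.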

Next, we gather our data. In the insurance space, this typically takes the form of a claim or policy history database and is occasionally supplemented by aggregated industry data. As we mentioned above, in the DFS space, all of this information is readily available online. Gathering the data and getting it into a proper format for our predictive model is another story, but historical sports data is easy to find online. One thing to consider here is the traditional actuarial concern of credibility. If the data isn’t credible, it’s highly unlikely that we’ll be able to build a successful model from it. We are also concerned with what data to use. Baseball Reference has MLB box score data from 1914 to the present. Do we think that player performance during World War I is predictive of the 2016 MLB season? Probably not. But where do we cut it off? 1990? 2010? This is part of the “art” of predictive modeling: testing and recalibrating not only our models, but also the data being used to calibrate them.

After we choose the data to use, we need to select and transform the specific variables in the data set. The structure of the predictive (or independent) variables in relation to the target (or dependent) variable determines how well a model works. We can transform variables (by taking logarithms, for example) or bucket variables to see what gives us the best fit. Sports data can have hundreds (or even thousands) of variables. There are programs and algorithms available to quickly identify the best candidate variables as well as software suites that automatically segment and transform the most powerful variables to get the best fit.
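As a small illustration of the two transformations mentioned above, the Python sketch below applies a logarithm to a skewed count variable and buckets a continuous variable into bands. The variable names, sample values, and bucket edges are invented for the example; which transformations actually improve fit is something only testing against the target variable can tell us.

```python
import math
from bisect import bisect_right


def log_transform(values):
    """Compress a skewed, non-negative variable with log(1 + x)."""
    return [math.log1p(value) for value in values]


def bucket(value, edges):
    """Return the index of the band that value falls into.

    edges must be sorted. A value equal to an edge goes to the upper band.
    """
    return bisect_right(edges, value)


# Hypothetical raw variables for three batters.
plate_appearances = [0, 9, 99]
batting_averages = [0.240, 0.275, 0.310]

# Assumed band edges: below .250, .250 up to .300, and .300 or better.
edges = [0.250, 0.300]

print(log_transform(plate_appearances))
print([bucket(avg, edges) for avg in batting_averages])
```

Using `log1p` rather than a plain logarithm is a design choice that keeps zero counts usable, since the logarithm of zero is undefined.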

Next, we process and evaluate our model. The key to good model performance is obviously getting the best fit. If we’ve done the other steps up to this point well, this step should run smoothly. Here we identify the ideal number of variables and use performance metrics to evaluate the model fits. Goodness-of-fit statistics are easily calculable with a spreadsheet package or something more advanced, such as R. Examples include C-statistics, information criteria, Gini coefficients, and coefficients of determination (R-squared).
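To show how simple one of these statistics is to compute, here is a minimal Python sketch that fits a one-variable least-squares line and reports its coefficient of determination. The data points are made up, and a single-variable linear fit is used only to keep the example short; a real DFS model would have many more variables.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(actual, predicted):
    """Share of the variance in actual explained by predicted."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical data: a predictive variable and observed fantasy points.
x = [1, 2, 3, 4]
y = [1, 3, 2, 4]

slope, intercept = fit_line(x, y)
predicted = [intercept + slope * xi for xi in x]
print(round(r_squared(y, predicted), 2))  # 0.64 for this toy data
```

An R-squared of 0.64 says the line explains about 64% of the variation in the observed points. On its own this says nothing about how the model will behave on new data, which is why validation is a separate step.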

Once we have a model with the data selected properly and categorized appropriately and which passes our goodness-of-fit tests, we’re not quite done. It’s very important that models be validated. By definition, our model is going to perform well on the data that we used to calibrate it. Even on a random holdout sample, which is standard practice, the model should perform similarly (assuming the holdout sample was truly random), because that sample comes from the same underlying data. The true test is how well it performs on different data.

For example, if we calibrate a DFS model on MLB data from 2005 to 2012 and have what appears to be a good model, passing the tests, and making intuitive sense, we could consider applying the model to similar data from 2013 to 2015 to truly gauge performance. Scoring alternate data is the best way to tell if our model will perform well. Other options for model validation include bootstrapping, which uses resampling in order to find confidence intervals around our model output, and “variable analysis,” which calculates statistics on the predictive/independent variables as they are affected by the model. These steps are crucial to making sure we generate reasonable results.
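Two of these ideas, scoring data from later seasons and bootstrapping, can be sketched in Python as follows. The record layout, the `season` field name, the error values, and the 90% level are assumptions for illustration; the bootstrap shown is the simple percentile variant applied to the mean projection error.

```python
import random


def out_of_time_split(records, last_calibration_year):
    """Split records into calibration and validation sets by season."""
    calibration = [r for r in records if r["season"] <= last_calibration_year]
    validation = [r for r in records if r["season"] > last_calibration_year]
    return calibration, validation


def bootstrap_interval(errors, n_resamples=2000, level=0.90, seed=0):
    """Percentile bootstrap interval for the mean of errors.

    Resamples the errors with replacement, records each resample's mean,
    and reads the interval off the sorted means.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    means = []
    for _ in range(n_resamples):
        sample = rng.choices(errors, k=len(errors))
        means.append(sum(sample) / len(sample))
    means.sort()
    lower_index = int((1 - level) / 2 * n_resamples)
    upper_index = int((1 + level) / 2 * n_resamples) - 1
    return means[lower_index], means[upper_index]


# Hypothetical records: one per season, 2005 through 2015.
records = [{"season": year} for year in range(2005, 2016)]
calibration, validation = out_of_time_split(records, 2012)
print(len(calibration), len(validation))

# Made-up projection errors (projected minus actual points) on validation data.
errors = [1.5, -0.5, 2.0, -1.0, 0.5, 3.0, -2.5, 1.0]
print(bootstrap_interval(errors))
```

If the resulting interval for the mean error sits well away from zero, the model is systematically over- or under-projecting on the newer seasons, which is exactly the kind of problem the calibration data alone would not reveal.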

Once all of that is done, it’s important to not merely implement the model and ignore it. It requires routine maintenance. As time goes by and data continues to emerge, we need to take time to reinvestigate the data, update the models, and challenge some of our initial assumptions. The best models are continually updated and recalibrated, audited on a regular basis, and replaced when they are no longer effective. How often should we update our models? Sports data comes in rapidly and is generally available in an easy-to-download format within 12 to 24 hours of the conclusion of the real-life games. This leads to a data set that’s constantly increasing in size. In addition, as technology continues to be implemented in sports analytics, the data is becoming more and more complex.

For example, in MLB stadiums, it’s now possible to get pitch-by-pitch information, not only on the type of pitch, but also on the pitcher’s release point and spin rate, as well as the launch angles and speeds of balls that batters make contact with. Is this data predictive and relevant to DFS? That remains to be seen, but new variables cannot be discounted simply because they are new. Constantly reevaluating all of the possible relationships between data and results (while maintaining goodness-of-fit and credibility) is key in keeping our predictive model ahead of the game.


No, you’re probably not going to win any significant sum of money playing daily fantasy sports. You’re far more likely to lose. As we’ve seen, only a handful of sophisticated participants win big using their predictive models. Their goal is to win “most” of the time and generate a return on an investment, not to win every game. In addition, generating a consistent return requires participants to put a relatively large amount of money on the line (essentially participants hedge their bets by entering a large number of contests). This amount of money may be too large for the casual participant to ever consider risking.

Consider Sud’s purported 8% return on investment. That sounds impressive until you consider that, since 1928, the annual return on investment on the S&P 500 has been greater than 8% in more than 70% of calendar years through 2015. Yes, there’s money to be made in daily fantasy sports, but only if you’re willing to be an educated investor and willing to take a long-term point of view. Further, as predictive models get better and better and more available to the general public, it will be increasingly difficult to separate yourself from the pack, even if you have a well-performing model.

Just as with any financial modeling, when it comes to a DFS model, it’s important to constantly update and validate the model and make sure you’re reacting to changes in the underlying data, not confusing the signal with the noise.

1. Nickisch, C. (November 23, 2015). Meet a Bostonian who's made $3 million this year playing daily fantasy sports. WBUR News. Retrieved March 16, 2016, from http://www.wbur.org/2015/11/23/dfs-power-player-profile.