This is a summary of my academic research, which aimed to determine whether the success of Steam games can be evaluated before their release, without knowing anything about their reception. The idea was to evaluate a concept rather than a nearly-finished or even a released game.
One could probably get very good predictions of a game’s sales a few days or weeks after its release by looking at YouTube, Twitch, reviews, social media, etc. But I wanted to know whether it’s possible to estimate a game’s potential early on, making it possible to, e.g., suggest changes during development that would make the final product more successful.
When I started my research, I had only one measure of success available – the average number of players from Steam Charts. I later obtained ownership data from Steam Spy, hoping to simply infer owner numbers from player numbers. However, the two metrics don’t correlate strongly enough. I don’t believe this is due to an error. There are, for example, games sold in bundles, or just very cheap games. These rack up high sales figures but not necessarily profit or a player base.
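To make that correlation check concrete, here’s a minimal pure-Python sketch of the comparison; the `pearson` helper and all the numbers below are illustrative, not figures from the actual dataset:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Made-up example: cheap or bundled games can have many owners but few
# concurrent players, which weakens the owners-vs-players relationship.
avg_players = [1200, 45, 300, 8, 150]
owners      = [90000, 250000, 40000, 180000, 20000]
r = pearson(avg_players, owners)
```

On real Steam Charts / Steam Spy data, a low `r` here is exactly the “not enough correlation” observation above.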
It’s been almost a year since I started my Steam games analysis, so here’s a little summary.
The goal: investigate how well the post-release success of Steam games can be predicted from basic information about them, available before release. My first intention was to use reviews, YouTube view counts, social network activity, etc. These are, however, hard or even impossible to obtain retroactively. Hence, I decided to omit this sort of data entirely and focus solely on how much the game itself matters.
I used to think that it was media coverage, YouTube and Twitch view counts, etc. that drive sales. But maybe we shouldn’t ask how much coverage a game got in order to understand its sales figures. Perhaps the right question is why it got that amount of coverage. This is not random – we don’t usually hear about games that have nothing interesting to offer, but rather about games that do something right, bring something new to the table, were developed by experienced devs, etc.
Since the beginning, I’ve been trying various machine learning models, as it’s almost impossible to predict which one will provide the best results. And since the beginning, Random Forest has been outperforming the other models. I started to like Random Forest and didn’t really care much about the others. That’s generally a bad idea unless you’re 100% confident. I wasn’t.
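As an illustration of this kind of model shootout, here’s a minimal k-fold cross-validation harness in pure Python. The two toy regressors (a mean baseline and a nearest-neighbour model) merely stand in for Random Forest and its competitors; none of this is the actual code from the research:

```python
import random

def kfold_indices(n, k, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

def mean_abs_error(y_true, y_pred):
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

def cv_score(fit_predict, X, y, k=5):
    """Average MAE of a model over k folds; fit_predict(Xtr, ytr, Xte) -> preds."""
    scores = []
    for train, test in kfold_indices(len(X), k):
        preds = fit_predict([X[i] for i in train], [y[i] for i in train],
                            [X[i] for i in test])
        scores.append(mean_abs_error([y[i] for i in test], preds))
    return sum(scores) / k

# Two toy "models": predict the training mean, or the nearest neighbour's target.
def mean_model(Xtr, ytr, Xte):
    m = sum(ytr) / len(ytr)
    return [m] * len(Xte)

def nn_model(Xtr, ytr, Xte):
    def nearest(x):
        return min(range(len(Xtr)), key=lambda i: abs(Xtr[i] - x))
    return [ytr[nearest(x)] for x in Xte]

# On simple synthetic data the neighbour model should beat the mean baseline.
X = list(range(30))
y = [2 * x for x in X]
baseline = cv_score(mean_model, X, y)
neighbour = cv_score(nn_model, X, y)
```

Ranking candidate models by a score like this, rather than trusting one favourite, is the safeguard the paragraph above argues for.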
I tend to think of data mining as a two-stage process. Data acquisition and preprocessing is the more boring and often quite time-consuming part. Then there are the actual predictions – that’s where machine learning comes in. If you don’t know anything about that, you might want to scroll through this ridiculously beautiful introduction. However, this is also the stage where you might realize you did something wrong earlier, so in practice you iterate between tuning the data and training models.
I purposefully did not include data about reviews and overall reception, to see whether games are predictable based on what kind of games they are. It’s hard to tell what the result should be. Some might say it’s impossible to predict anything from that, but I definitely expected at least some correlation.
Let’s look at the data we have available. On October 17th, 2015, I downloaded info about more than 6,000 games released on Steam. Thankfully, Steam offers an API which was enough to get most of the info I wanted. Sadly, there are some missing or even incorrect pieces such as release dates. This led to some games being left out of the dataset and some having an incorrect release date attached to them.
Besides the API, I parsed user tags directly from the HTML. Tags are one of the questionable attributes that change a lot after a game’s release, but I’ll get to that at some point. In order to retrieve launch prices, I had to use something other than Steam, as it only shows the current price. steamsales.rhekua.com has really good historical data on sales, and I might actually find more use for it.
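For illustration, here’s roughly what pulling a game’s pre-release attributes out of a store API response looks like. The embedded JSON only mimics the general shape of the appdetails reply and is trimmed to a few fields, so treat the exact field names as an assumption rather than the full payload:

```python
import json

# Sample response in the general shape returned by the Steam store API's
# appdetails endpoint (store.steampowered.com/api/appdetails?appids=ID);
# trimmed and illustrative, not a verbatim payload.
sample = json.loads("""
{
  "440": {
    "success": true,
    "data": {
      "name": "Team Fortress 2",
      "is_free": true,
      "genres": [{"id": "1", "description": "Action"}],
      "release_date": {"coming_soon": false, "date": "10 Oct, 2007"}
    }
  }
}
""")

def extract_game(appid, response):
    """Pull a handful of pre-release attributes for the dataset."""
    entry = response.get(str(appid), {})
    if not entry.get("success"):
        return None  # missing or hidden app - left out of the dataset
    data = entry["data"]
    return {
        "appid": appid,
        "name": data["name"],
        "is_free": data.get("is_free", False),
        "genres": [g["description"] for g in data.get("genres", [])],
        "release_date": data.get("release_date", {}).get("date"),
    }

game = extract_game(440, sample)
```

The `None` branch is where games with missing or broken API entries fall out of the dataset, as described above.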
I filtered out free-to-play and Early Access titles, and kept games released between August 1st, 2012 and June 30th, 2015. I needed data from steamcharts.com, which launched in July 2012. After some cleaning, I obtained a dataset of 3,021 games. A few games are missing, but that should have close to no effect on the results.
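The filtering step itself is straightforward; here’s a sketch over made-up records (the field names are mine for illustration, not Steam’s):

```python
from datetime import date

# Illustrative records; in the real dataset these came from the API and HTML.
games = [
    {"name": "A", "is_free": False, "early_access": False, "released": date(2013, 5, 1)},
    {"name": "B", "is_free": True,  "early_access": False, "released": date(2014, 2, 1)},
    {"name": "C", "is_free": False, "early_access": True,  "released": date(2014, 8, 9)},
    {"name": "D", "is_free": False, "early_access": False, "released": date(2012, 7, 20)},
]

# Window bounded below by Steam Charts' launch (July 2012).
START, END = date(2012, 8, 1), date(2015, 6, 30)

kept = [g for g in games
        if not g["is_free"]
        and not g["early_access"]
        and START <= g["released"] <= END]
```

Here only game "A" survives: "B" is free-to-play, "C" is Early Access, and "D" predates the window.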
On August 30th, 2012, Steam launched Greenlight with the idea of bringing promising games to Steam with added exposure. However, Greenlight has since received a lot of criticism for not doing its job very well. Because I had gathered info about games released from August 2012 to July 2015 (as discussed in this post; for those who won’t read it – I excluded free-to-play games and those in Early Access), I decided to take a look at how successful all these games have been.
Unfortunately, I wasn’t able to find out which of the 3,021 games in my dataset have been Greenlit as Valve doesn’t really like to mention Greenlight (I wonder why). But it can be safely assumed that most of the games released since the beginning of 2014 have gone through Greenlight.
How does one measure the success of games? Since Valve doesn’t make sales numbers public, this is quite a difficult task. steamspy.com has been gathering owner counts for each game, but not for long enough. I decided that comparing games against each other would have to suffice for now, so I used steamcharts.com to obtain the average number of players in the first two months after each game’s release. On top of this not being an entirely reliable measure, I had to compensate for the fact that the provided tables only contain numbers for calendar months. Still, it should be enough to give us an idea of how popular each game was after its launch.
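The calendar-month compensation can be done by weighting each month’s average by how many of its days fall inside the 60-day window after release. This is a sketch of that approximation with made-up numbers, not the exact script used:

```python
from datetime import date, timedelta

def first_60_day_average(release, monthly_avg):
    """Approximate the average player count over the 60 days after release
    from per-calendar-month averages (monthly_avg maps (year, month) -> avg),
    weighting each month by how many of its days fall inside the window."""
    weighted, days_counted = 0.0, 0
    for offset in range(60):
        d = release + timedelta(days=offset)
        key = (d.year, d.month)
        if key in monthly_avg:
            weighted += monthly_avg[key]
            days_counted += 1
    return weighted / days_counted if days_counted else None

# Made-up numbers: a mid-month release spanning three calendar months
# (17 days of March, all of April, 13 days of May).
avg = first_60_day_average(
    date(2014, 3, 15),
    {(2014, 3): 900, (2014, 4): 400, (2014, 5): 250},
)
```

The result lands between the launch-month peak and the later decline, which is the “first two months” popularity figure used throughout the analysis.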