Let’s look at the data we have available. On October 17th, 2015, I downloaded info about more than 6,000 games released on Steam. Thankfully, Steam offers an API which was enough to get most of the info I wanted. Sadly, there are some missing or even incorrect pieces such as release dates. This led to some games being left out of the dataset and some having an incorrect release date attached to them.
Besides the API, I parsed user tags directly from HTML. This is one of the questionable attributes, since tags change a lot after a game’s release, but I’ll get to that at some point. To retrieve launch prices, I had to use something other than Steam, as it only shows the current price. steamsales.rhekua.com has really good historical data regarding sales and I might actually find more use for that.
I filtered out free-to-play and Early Access titles, and kept games released from August 1st, 2012 through June 30th, 2015. The start date is there because I needed data from steamcharts.com, which was launched in July 2012. After some cleaning, I obtained a dataset of 3,021 games. There are some games missing but that should have close to no effect on the results.
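To make the filtering step concrete, here is a minimal sketch of what it could look like. The record layout and the field names (`is_free`, `early_access`, `release_date`) are made up for illustration and are not the actual schema I used; the date window simply covers August 2012 up to the end of June 2015.

```python
from datetime import date

# Assumed date window: August 1st, 2012 up to the end of June 2015.
START = date(2012, 8, 1)
END = date(2015, 6, 30)

def keep_game(game):
    """Return True if a game passes the filters described above.

    `game` is a hypothetical dict; its keys are illustrative assumptions.
    """
    if game.get("is_free") or game.get("early_access"):
        return False
    released = game.get("release_date")
    if released is None:  # missing release date: leave the game out
        return False
    return START <= released <= END

# Toy records, not real Steam data.
games = [
    {"name": "A", "is_free": False, "early_access": False, "release_date": date(2013, 5, 2)},
    {"name": "B", "is_free": True, "early_access": False, "release_date": date(2014, 1, 10)},
    {"name": "C", "is_free": False, "early_access": True, "release_date": date(2014, 3, 3)},
    {"name": "D", "is_free": False, "early_access": False, "release_date": date(2012, 7, 15)},
    {"name": "E", "is_free": False, "early_access": False, "release_date": None},
]

kept = [g["name"] for g in games if keep_game(g)]
print(kept)  # ['A']
```

Games with a missing release date are dropped here rather than guessed at, which matches why some titles never made it into the dataset.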
Here are all the attributes (the names should be mostly self-explanatory):
6 Required Age
14 Local Coop
15 Steam Achievements
16 Steam Trading Cards
17 Steam Workshop
25 Launch Price
26 Price Group
44 Massively Multiplayer
47 Name Length
48 Description Length
49 User Tags
Some are just for statistical analysis (like 24 Year-Month, which I used in my previous post). Players is the average number of players in the first two months after release, and Class groups games with similar numbers of players (again, see my previous post). I still need to figure out how exactly I should use the user tags, if at all. But all in all, the dataset is in pretty good shape.
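As a rough illustration of the Players attribute, the sketch below averages the monthly figures for the release month and the month after it. This is an assumption about how "the first two months" could be computed from monthly averages; the function name and the `monthly` mapping are hypothetical, and the numbers are toy values rather than real data.

```python
from statistics import mean

def first_two_months_players(release_year, release_month, monthly):
    """Average the player counts for the release month and the next month.

    `monthly` is a hypothetical dict mapping (year, month) to the average
    number of players in that month. Returns None if neither month is present.
    """
    months = [(release_year, release_month)]
    # Step to the following calendar month, rolling over at December.
    if release_month == 12:
        months.append((release_year + 1, 1))
    else:
        months.append((release_year, release_month + 1))
    values = [monthly[m] for m in months if m in monthly]
    return mean(values) if values else None

# Toy monthly averages for a game released in December 2014.
monthly = {(2014, 12): 1200.0, (2015, 1): 800.0, (2015, 2): 500.0}
players = first_two_months_players(2014, 12, monthly)
print(players)  # 1000.0
```

Class would then be derived by bucketing this value; the bucket boundaries come from the previous post and are not reproduced here.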
Now we have the data we need and the numbers we want to predict. Whoa, that means we can make some predictions!