Closer Look at Available Data

Let’s look at the data we have available. On October 17th, 2015, I downloaded info about more than 6,000 games released on Steam. Thankfully, Steam offers an API which was enough to get most of the info I wanted. Sadly, there are some missing or even incorrect pieces such as release dates. This led to some games being left out of the dataset and some having an incorrect release date attached to them.

Besides the API, I parsed user tags directly from HTML. This is one of the questionable attributes, since it changes a lot after a game’s release, but I’ll get to that at some point. To retrieve launch prices, I had to use something other than Steam, as it only shows the current price. The source I used has really good historical data regarding sales and I might actually find more use for that.
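To make the tag parsing a bit more concrete, here is a minimal sketch using only Python's standard library. The `app_tag` class name, the sample markup, and the `TagCollector` and `extract_user_tags` names are assumptions made for illustration; they are not taken from the actual scraper, and the real store pages may be structured differently.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collect the text of every <a> element carrying a marker class."""

    def __init__(self, marker_class="app_tag"):  # class name is an assumption
        super().__init__()
        self.marker_class = marker_class
        self.tags = []
        self._inside = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # The class attribute may be missing or empty, so fall back to "".
        classes = (dict(attrs).get("class") or "").split()
        if tag == "a" and self.marker_class in classes:
            self._inside = True
            self._buffer = []

    def handle_data(self, data):
        if self._inside:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._inside:
            text = "".join(self._buffer).strip()
            if text:
                self.tags.append(text)
            self._inside = False


def extract_user_tags(html):
    """Return the user tags found in a page, in document order."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


# Made-up markup, only meant to exercise the parser.
sample = """
<div class="popular_tags">
  <a class="app_tag" href="#">  Indie </a>
  <a class="app_tag" href="#">Action</a>
  <a class="other" href="#">Not a tag</a>
</div>
"""
print(extract_user_tags(sample))  # ['Indie', 'Action']
```

Because tags keep changing after release, anything collected this way is a snapshot of the day the page was downloaded, not of launch day.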

I filtered out free-to-play and Early Access titles, and kept games released from August 1st, 2012 through June 30th, 2015. I needed data from a source which was launched in July 2012, hence the August 2012 start date. After some cleaning, I obtained a dataset of 3,021 games. There are some games missing but that should have close to no effect on the results.
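The filtering step can be sketched roughly like this. The record layout (`release_date`, `is_free`, `early_access`) and the `keep_game` helper are hypothetical names chosen for the example, and the end of the window is taken as the last day of June 2015.

```python
from datetime import date

START = date(2012, 8, 1)   # first release date kept
END = date(2015, 6, 30)    # last day of June 2015


def keep_game(game):
    """Return True if a game record passes the filters described above."""
    released = game.get("release_date")
    if released is None:  # missing release date: leave the game out
        return False
    if game.get("is_free") or game.get("early_access"):
        return False
    return START <= released <= END


# Made-up records, only meant to show each filter in action.
games = [
    {"name": "A", "release_date": date(2013, 5, 2), "is_free": False, "early_access": False},
    {"name": "B", "release_date": date(2012, 7, 15), "is_free": False, "early_access": False},
    {"name": "C", "release_date": date(2014, 1, 9), "is_free": True, "early_access": False},
    {"name": "D", "release_date": None, "is_free": False, "early_access": False},
]

dataset = [g for g in games if keep_game(g)]
print([g["name"] for g in dataset])  # ['A']
```

Game B falls before the window, C is free-to-play, and D has no usable release date, so only A survives.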

Here are all the attributes (the names should be mostly self-explanatory):

1 id
2 Name
3 Description
4 Developer
5 Publisher
6 Required Age
7 Controller
8 Windows
9 Mac
10 Linux
11 Singleplayer
12 Multiplayer
13 Coop
14 Local Coop
15 Steam Achievements
16 Steam Trading Cards
17 Steam Workshop
18 Screenshots
19 Trailers
20 Year
21 Month
22 Day
23 Weekday
24 Year-Month
25 Launch Price
26 Price Group
27 English
28 French
29 German
30 Italian
31 Japanese
32 Polish
33 Portuguese
34 Russian
35 Spanish
36 RPG
37 Strategy
38 Adventure
39 Action
40 Simulation
41 Racing
42 Casual
43 Sports
44 Massively Multiplayer
45 Education
46 Indie
47 Name Length
48 Description Length
49 User Tags
50 Players
51 Class

Some are just for statistical analysis (like 24 Year-Month, which I used in my previous post). Players is the average number of players in the first two months after release, and Class groups games with a similar number of players (again, see my previous post). I still need to figure out how exactly I should use the user tags, if at all. But all in all, the dataset is in pretty good shape.
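Several of the attributes in the list are derived from others rather than downloaded. As a rough sketch, the date-based columns and the two length columns could be computed as follows. The `derive_attributes` helper, the `YYYY-MM` format for Year-Month, storing Weekday as an English name, and measuring length in characters are all assumptions; the dataset may encode them differently.

```python
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def derive_attributes(name, description, released):
    """Build the derived columns for one game from its raw fields."""
    return {
        "Year": released.year,
        "Month": released.month,
        "Day": released.day,
        # date.weekday() returns 0 for Monday through 6 for Sunday.
        "Weekday": WEEKDAYS[released.weekday()],
        "Year-Month": f"{released.year:04d}-{released.month:02d}",
        "Name Length": len(name),                # characters, by assumption
        "Description Length": len(description),  # characters, by assumption
    }


# Made-up game, only meant to show the shape of the result.
row = derive_attributes("Example Game", "A short description.", date(2013, 8, 13))
for key, value in row.items():
    print(f"{key}: {value}")
```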

Now we have the data we need and the numbers we want to predict. Whoa, that means we can make some predictions!

