Summary and Future Plans

It’s been almost a year since I started my Steam games analysis so here’s a little summary.

Goal

Investigate how well the post-release success of Steam games can be predicted from basic information about the games that is available before release. My first intention was to use reviews, YouTube view counts, social network activity, etc. These are, however, hard or even impossible to obtain retroactively. Hence, I decided to omit this sort of data completely and focus solely on how much the game itself matters.

I used to think that it was media coverage, YouTube and Twitch view numbers, etc. that drive sales. But maybe we shouldn’t ask how much coverage a game got in order to understand its sales figures. Perhaps the right question is why it got this amount of coverage. This is not random: we don’t usually hear about games that have nothing interesting to offer, but rather about games that do something right, bring something new to the table, were developed by experienced devs, etc.

Data

I needed data, ideally from before the release of each game. I assumed that the information I’m interested in doesn’t change over time on the Steam page. Price is an obvious exception, so I had to look up launch prices on various sites, while the majority of the data was downloaded via the Steam API. To make things easier, I omitted Early Access and free-to-play titles.

I would eventually like to predict sales numbers, but the only source of this kind of data is SteamSpy, which doesn’t have a sufficiently long history to begin with. Hence I went with steamcharts.com and calculated an estimate of average concurrent players in the first two months after a game’s release. It is not the best measure of success but the best one available, sadly. This limited the games I can work with to releases after July 2012 (in the end, I only kept games released after August 2013, when Greenlight fully kicked in).
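A minimal sketch of such an estimate, assuming the player counts are available as one figure per day. The input format, the 60-day window, and the function name are placeholders for illustration, and the numbers are made up:

```python
from datetime import date, timedelta

def average_early_players(release, daily_counts, window_days=60):
    """Mean of the concurrent-player figures that fall inside the window
    starting on the release date (window length is an assumption)."""
    end = release + timedelta(days=window_days)
    in_window = [n for day, n in daily_counts.items() if release <= day < end]
    if not in_window:
        return None  # no observations for this game in the window
    return sum(in_window) / len(in_window)

# Made-up figures, for illustration only.
release = date(2014, 3, 1)
daily_counts = {
    date(2014, 3, 1): 1200,
    date(2014, 3, 15): 800,
    date(2014, 4, 20): 400,
    date(2014, 6, 1): 100,  # more than 60 days after release, ignored
}
avg = average_early_players(release, daily_counts)
```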

I ended up with a table of 3,000 games and the attributes listed below. Most should be self-explanatory (e.g. “Linux” says whether a game is available on Linux – true/false).


RequiredAge - numeric
Controller - true/false
Mac - true/false
Linux - true/false
Singleplayer - true/false
Multiplayer - true/false
Coop - true/false
LocalCoop - true/false
SteamAchievements - true/false
SteamTradingCards - true/false
SteamWorkshop - true/false
Screenshots - numeric
Trailers - numeric
Month - this says how many months passed since a certain time
              point until the release of the game
Day - day of month of release
Weekday - Mon-Sat (no game in my database has been released on
              Sunday, surprise!)
LaunchPrice - numeric
French - true/false
German - true/false
Italian - true/false
Japanese - true/false
Polish - true/false
Portuguese - true/false
Russian - true/false
Spanish - true/false
RPG - true/false
Strategy - true/false
Adventure - true/false
Action - true/false
Simulation - true/false
Racing - true/false
Casual - true/false
Sports - true/false
MassivelyMultiplayer - true/false
Indie - true/false
NameLength - numeric (number of characters)
DescriptionLength - numeric
DescIsInf - true/false (I classified descriptions as 
              (non-)informative in my earlier little experiment)
DevPrevGamesCount - numeric (number of games previously
              released by the same developer)
DevPrevGamesMax - numeric (average players of the best
              previously released game by the same developer)
PubPrevGamesCount - numeric (analogically for publishers)
PubPrevGamesMax - numeric
PubExp - true/false (similar to PubPrevGamesCount, just says
              whether they released any game in the past)
PubBig - true/false (whether the game was published by a large 
              publisher - those were hand-picked)
TagPuzzle - true/false (I decided to include some user tags 
              picked by hand and then filtered using feature 
              selection magic)
TagPlatformer
Tag2D
TagRemake
TagPointClick
TagTurnBased
TagTowerDefense
TagJRPG
Tag4X
TagSpace
TagScifi
TagSteampunk
TagBoardGame
TagShort
TagFirstPerson
TagFPS
TagThirdPerson
TagThirdPersonShooter
TagStoryRich
TagFemaleProtagonist
TagHorror
TagSurvival
TagOpenWorld
TagRoguelike
TagFlight
TagWorldWarII
TagSuperhero
TagZombies
TagRTS
TagRhythm
TagTurnBasedStrategy
TagFantasy
TagStealth
TagMedieval
TagCityBuilder
TagSandbox
TagParkour
TagFighting
TagPixelGraphics
TagHiddenObject
TagRetro
TagWalkingSimulator
TagCardGame
TagCyberpunk
TagNudity
TagVisualNovel
TagNoir
TagEpisodic
TagSurvivalHorror
TagFamilyFriendly
TagDatingSim
TagRoguelite
DescriptionComp[1-10] - numeric (top 10 principal components
              of the term-document matrix built from
              descriptions*)
IsSequel - true/false
IsCustomizable - true/false (Does the game allow the player
              to customize something according to
              the description?)
LanguagesNum - numeric (number of supported languages)
Players - average concurrent players in the first two months
              after a game's release - what I'm predicting
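
Three of the attributes above (Month, Day, Weekday) are derived from the release date. A minimal sketch of that derivation, assuming a reference point of August 2013; the list only says "a certain time point", so that date and the function name are placeholders:

```python
from datetime import date

REFERENCE = date(2013, 8, 1)  # hypothetical reference point, not from the post
WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def release_date_features(release):
    """Derive Month, Day and Weekday from a release date."""
    months = (release.year - REFERENCE.year) * 12 + (release.month - REFERENCE.month)
    return {
        "Month": months,                         # whole months since REFERENCE
        "Day": release.day,                      # day of month
        "Weekday": WEEKDAYS[release.weekday()],  # weekday() returns 0 for Monday
    }

features = release_date_features(date(2014, 11, 18))
```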

*Take all the text descriptions and extract all words that
occur at least several times (I narrowed it down to about
1,500 words). These words are put into a table's columns.
Each game has a row which says how many times each word
occurs. So you get a giant matrix on which you apply some
PCA magic. The result is just a couple of columns instead
of the 1,500 but they're about as useful as the original 1,500.
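
The footnote above can be sketched in plain Python. This is only an illustration under stated assumptions: tokenization is a simple whitespace split, the word threshold is 2 instead of "several", and only the first principal component is computed, using power iteration rather than a full PCA routine. The function names and sample descriptions are made up:

```python
from collections import Counter
import math

def term_document_matrix(descriptions, min_count=2):
    """Count how often each sufficiently common word occurs in each description."""
    tokenized = [text.lower().split() for text in descriptions]
    totals = Counter(word for doc in tokenized for word in doc)
    # Keep only words that occur at least min_count times overall.
    vocab = sorted(word for word, count in totals.items() if count >= min_count)
    matrix = []
    for doc in tokenized:
        counts = Counter(doc)
        matrix.append([counts[word] for word in vocab])
    return vocab, matrix

def first_principal_component(matrix, iterations=100):
    """Return (direction, scores) of the top principal component."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    means = [sum(row[j] for row in matrix) / n_rows for j in range(n_cols)]
    centered = [[row[j] - means[j] for j in range(n_cols)] for row in matrix]

    def project(direction):
        return [sum(x * d for x, d in zip(row, direction)) for row in centered]

    # Non-uniform start vector, so it is unlikely to be orthogonal to the answer.
    direction = [float(j + 1) for j in range(n_cols)]
    for _ in range(iterations):
        scores = project(direction)
        # Multiply by (centered^T centered), then renormalise to unit length.
        updated = [sum(centered[i][j] * scores[i] for i in range(n_rows))
                   for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in updated))
        if norm == 0.0:
            break
        direction = [x / norm for x in updated]
    return direction, project(direction)

descriptions = [
    "explore a vast open world",
    "build a city in an open world",
    "fast arcade racing",
    "arcade racing with fast cars",
]
vocab, matrix = term_document_matrix(descriptions)
direction, scores = first_principal_component(matrix)
```

Each game gets one score per component, and those scores replace the original word-count columns. In this toy example the first component separates the two "open world" descriptions from the two "arcade racing" ones.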

This is pretty much all you need to make predictions. That’s what I hate about data mining by the way – “This is all I’ve done. Doesn’t look like much. But it took me months.”

Results

I’ve been trying two approaches to predictions: 1) predict directly the number of players, 2) divide the games into 10 groups (almost no players, very few players etc.) and guess in which group a game will belong. Recently, I’ve had more success with the former approach – regression. So I’ll show just that.
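
For the second approach, the post does not say how the 10 groups were formed. One possible scheme, shown purely as an assumption, is to rank the games by player count and cut the ranking into equally sized groups:

```python
def assign_groups(values, n_groups=10):
    """Rank-based grouping: group 0 holds the lowest values, n_groups - 1 the highest.
    Equal-sized groups are an assumption, not necessarily the scheme used here."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    groups = [0] * len(values)
    for rank, index in enumerate(order):
        groups[index] = rank * n_groups // len(values)
    return groups

# Made-up average player counts.
players = [5, 1, 300, 20, 0, 7, 1500, 60, 2, 90000]
groups = assign_groups(players)
```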

Below are results from Support Vector Machines, which gave me about the best numbers and behavior. Since the average numbers of players range from nearly zero to around 115,000, I applied a logarithm, which brought the range down to [0-16] and narrowed large differences. In case you wondered, I evaluated the models on a separate test set.
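
A minimal sketch of such a transform. Both the base and the +1 offset are assumptions: the base is not stated here, and base 2 is picked only because it maps 115,000 to roughly 16.8, which is close to the quoted range. The offset keeps a game with zero players at 0 instead of minus infinity:

```python
import math

def to_log_scale(players, base=2.0):
    """Compress player counts; base and the +1 offset are assumptions."""
    return math.log(players + 1.0, base)

def from_log_scale(value, base=2.0):
    """Inverse of to_log_scale, to turn a prediction back into players."""
    return base ** value - 1.0

low = to_log_scale(0)          # a game with no players
high = to_log_scale(115000)    # the most popular game mentioned above
```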

                          SVM (polynomial kernel)   Baseline
Correlation coefficient   0.79                      -
MAE                       1.19                      -
RMSE                      1.62                      -
within +-1 from actual    52.9 %                    58.2 % *
within +-2 from actual    81.8 %                    80.1 %
within +-3 from actual    93.7 %                    89.3 %

* Thanks to the distribution, this can be taken as all games from [0-2]
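
The measures in the table can be computed along these lines. This is a sketch with made-up log-scale values, not the actual evaluation code. The baseline is modelled as a constant prediction of 1, which matches the footnote: within +-1 then means every game in [0-2]:

```python
import math

def evaluate(actual, predicted):
    """MAE, RMSE and the share of predictions within +-1, +-2, +-3 of the actual value."""
    errors = [p - a for a, p in zip(actual, predicted)]
    n = len(errors)
    return {
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(sum(e * e for e in errors) / n),
        "within": {k: sum(abs(e) <= k for e in errors) / n for k in (1, 2, 3)},
    }

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up values on the log scale, for illustration only.
actual = [0.5, 1.0, 2.5, 4.0, 9.0]
predicted = [1.0, 1.5, 2.0, 6.0, 7.5]

model = evaluate(actual, predicted)
baseline = evaluate(actual, [1.0] * len(actual))  # constant guess: "very few players"
correlation = pearson(actual, predicted)
```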

The interpretation partially depends on what you expect. If you believe that it’s solely reviews, YouTube, and other “social” factors that determine how popular a game will be, then these results show that the game itself matters, too – a lot. It’s nowhere near accurate, however, and I definitely couldn’t recommend these predictions for any business decisions.

Ideally, all predictions should be very close to the actual value. “Very close” could be in this case defined as +-1. As shown above, about 53 % of games are well predicted. That may not sound too bad but notice the baseline. If I say that all games will have very few players, I’ll get 58 % of them right. (I wrote about what Steam Greenlight caused earlier by the way.) This gets better as I allow a larger margin of error. However, +-3 basically means “It might sell 100,000 copies but maybe just 1,000”.
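
Because the predictions live on a log scale, a margin of +-k is a multiplicative factor once translated back into players. A small sketch, assuming base 2 (the base is not stated here) and ignoring any offset used in the transform:

```python
def implied_range(predicted_log, margin, base=2.0):
    """Range of player counts covered by predicted_log +- margin (base is an assumption)."""
    low = base ** (predicted_log - margin)
    high = base ** (predicted_log + margin)
    return low, high

low, high = implied_range(10.0, 3)
ratio = high / low  # equals base ** (2 * margin), whatever the prediction is
```

With these assumptions, a prediction of 10 with a margin of +-3 covers anything from 128 to 8,192 players, a 64-fold spread.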

Below is my preferred look at regression results which tells me more than any numbers. It says how far each individual prediction is from its actual value.

[Figure reg5_svm: predicted vs. actual values for the SVM regression]

You can see a high concentration of games that don’t receive much attention at the bottom. In an ideal scenario, all points would lie on the diagonal. It doesn’t look very nice in this regard, but it is definitely far from random. Notice that the actually successful games mostly get underestimated – this is where you can say “those reviews and social media matter”.

Fun fact: when trying out different models, some gave me pretty much exact prediction for GTA V despite not seeing such a high-value example before.

Future Work

I’m planning to re-download everything. There were some errors with release dates in the API, so I’d like to resolve that. Also, I didn’t consider the minimum HW requirements before. Those actually say quite a lot about a game. Another thing I’d like to add is screenshots. I can’t really do any in-depth analysis, but perhaps dominant colors or color depth might be useful.

Hopefully, this should also give me over 1,000 additional games. That itself should improve the predictions and give me some room for better evaluation. I’ll see if I can translate the average player numbers into copies sold. I’ll never get exact predictions, so some deviation isn’t a huge deal at this point.

Finally, after several frustrating realizations during my research, I feel like I can accomplish something. It probably won’t be an accurate prediction system. Maybe I’ll just separate some games that are predictable. Basically, I’ll be looking for the right sub-task which will give good results. I’m positive there’s something exciting and useful waiting to get dug up.
