It’s been almost a year since I started my Steam games analysis so here’s a little summary.
The goal was to investigate how well the post-release success of Steam games can be predicted from basic information about the games that is available before release. My first intention was to use reviews, YouTube view numbers, social network activity etc. These are, however, hard or even impossible to obtain retrospectively. Hence, I decided to omit this sort of data completely and focus solely on how much the game itself matters.
I used to think that it was media coverage, YouTube and Twitch view numbers etc. that drive sales. But maybe we shouldn’t ask how much coverage a game got in order to understand its sales figures. Perhaps the right question is why it got this amount of coverage. This is not random – we don’t usually hear about games that don’t have anything interesting to offer but rather games that do something right, bring something new to the table, were developed by experienced devs etc.
I need data ideally from before the release of each game. I assume that the information I’m interested in doesn’t change over time on the Steam page. Price is an obvious exception so I had to look for launch prices at different sites while the majority of the data has been downloaded via Steam API. To make things easier, I omitted Early Access and free-to-play titles.
I would eventually like to predict the sales numbers but the only source of this kind of data is SteamSpy, which doesn’t have a sufficiently long history to begin with. Hence I went with steamcharts.com and calculated an estimate of average concurrent players in the first two months after a game’s release – not the best measure of success but the best one available, sadly. This limited the games I can work with to releases after July 2012 (in the end, I only kept games released after August 2013, when Greenlight fully kicked in).
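To make the target measure concrete, here is a minimal sketch of how such an average could be computed. The input format (a list of date/player-count pairs), the 60-day window standing in for "two months", and the function name are all my assumptions for illustration; this is not the actual steamcharts.com data format.

```python
from datetime import date, timedelta


def early_average_players(samples, release, window_days=60):
    """Mean of the player counts recorded within window_days of release.

    samples: iterable of (date, concurrent_players) pairs (assumed format).
    Returns None when no sample falls inside the window.
    """
    end = release + timedelta(days=window_days)
    counts = [players for day, players in samples if release <= day < end]
    return sum(counts) / len(counts) if counts else None


# Hypothetical samples for a game released on 10 January 2014
samples = [
    (date(2014, 1, 10), 1200),
    (date(2014, 2, 10), 800),
    (date(2014, 3, 5), 400),   # still inside the 60-day window
    (date(2014, 4, 10), 100),  # outside the window, ignored
]
avg = early_average_players(samples, release=date(2014, 1, 10))
print(avg)
```

Only the samples inside the window count, so a game with a long tail of players years later gets no credit for it here.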
I ended up with a table of 3,000 games and the attributes listed below. Most should be self-explanatory (e.g. “Linux” says whether a game is available on Linux – true/false).
RequiredAge - numeric
Controller - true/false
Mac - true/false
Linux - true/false
Singleplayer - true/false
Multiplayer - true/false
Coop - true/false
LocalCoop - true/false
SteamAchievements - true/false
SteamTradingCards - true/false
SteamWorkshop - true/false
Screenshots - numeric
Trailers - numeric
Month - number of months between a fixed reference point and the release of the game
Day - day of month of release
Weekday - Mon-Sat (no game in my database has been released on Sunday, surprise!)
LaunchPrice - numeric
French - true/false
German - true/false
Italian - true/false
Japanese - true/false
Polish - true/false
Portuguese - true/false
Russian - true/false
Spanish - true/false
RPG - true/false
Strategy - true/false
Adventure - true/false
Action - true/false
Simulation - true/false
Racing - true/false
Casual - true/false
Sports - true/false
MassivelyMultiplayer - true/false
Indie - true/false
NameLength - numeric (number of characters)
DescriptionLength - numeric
DescIsInf - true/false (I classified descriptions as (non-)informative in my earlier little experiment)
DevPrevGamesCount - numeric (number of games previously released by the same developer)
DevPrevGamesMax - numeric (average players of the best previously released game by the same developer)
PubPrevGamesCount - numeric (analogously for publishers)
PubPrevGamesMax - numeric
PubExp - true/false (similar to PubPrevGamesCount, just says whether they released any game in the past)
PubBig - true/false (whether the game was published by a large publisher - those were hand-picked)
TagPuzzle - true/false (I decided to include some user tags picked by hand and then filtered using feature selection magic)
TagPlatformer
Tag2D
TagRemake
TagPointClick
TagTurnBased
TagTowerDefense
TagJRPG
Tag4X
TagSpace
TagScifi
TagSteampunk
TagBoardGame
TagShort
TagFirstPerson
TagFPS
TagThirdPerson
TagThirdPersonShooter
TagStoryRich
TagFemaleProtagonist
TagHorror
TagSurvival
TagOpenWorld
TagRoguelike
TagFlight
TagWorldWarII
TagSuperhero
TagZombies
TagRTS
TagRhythm
TagTurnBasedStrategy
TagFantasy
TagStealth
TagMedieval
TagCityBuilder
TagSandbox
TagParkour
TagFighting
TagPixelGraphics
TagHiddenObject
TagRetro
TagWalkingSimulator
TagCardGame
TagCyberpunk
TagNudity
TagVisualNovel
TagNoir
TagEpisodic
TagSurvivalHorror
TagFamilyFriendly
TagDatingSim
TagRoguelite
DescriptionComp[1-10] - numeric (top 10 principal components of the term-document matrix built from descriptions*)
IsSequel - true/false
IsCustomizable - true/false (Does the game allow the player to customize something according to the description?)
LanguagesNum - numeric (number of supported languages)
Players - average concurrent players in the first two months after a game's release - what I'm predicting

*Take all the text descriptions and extract all words that occur at least several times (I narrowed it down to about 1,500 words). These words become the columns of a table. Each game has a row which says how many times each word occurs. So you get a giant matrix on which you apply some PCA magic. The result is just a couple of columns instead of the 1,500 but they're about as useful as the original 1,500.
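The term-document matrix and the PCA step can be sketched roughly as follows. This is a toy illustration in plain Python, not the code I actually used: the whitespace tokenizer, the `min_count` threshold and both function names are assumptions, and only the first principal component is extracted (via power iteration) rather than the top 10. In practice a library PCA implementation would do this job.

```python
from collections import Counter
import math


def term_document_matrix(descriptions, min_count=2):
    """Rows = games, columns = words occurring at least min_count times overall."""
    tokenized = [text.lower().split() for text in descriptions]
    totals = Counter(word for words in tokenized for word in words)
    vocab = sorted(word for word, count in totals.items() if count >= min_count)
    rows = []
    for words in tokenized:
        counts = Counter(words)
        rows.append([counts[word] for word in vocab])
    return vocab, rows


def first_principal_component(rows, iterations=200):
    """Power iteration on the covariance of the column-centered matrix."""
    n, m = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(m)]
    centered = [[row[j] - means[j] for j in range(m)] for row in rows]
    # Non-uniform start vector, so it is unlikely to be orthogonal to the result
    v = [float(j + 1) for j in range(m)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        # w = X^T (X v), which is proportional to covariance times v
        proj = [sum(x * y for x, y in zip(row, v)) for row in centered]
        w = [sum(centered[i][j] * proj[i] for i in range(n)) for j in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break  # no variance left to explain
        v = [x / norm for x in w]
    scores = [sum(x * y for x, y in zip(row, v)) for row in centered]
    return v, scores


# Hypothetical descriptions; "arcade" and "relaxing" occur once and are dropped
descriptions = [
    "space shooter arcade",
    "space shooter",
    "farm puzzle",
    "farm puzzle relaxing",
]
vocab, rows = term_document_matrix(descriptions)
component, scores = first_principal_component(rows)
print(vocab)
print(scores)
```

Each game ends up with one number per component instead of one number per word. In this toy example the two space shooters get one score and the two farm puzzles get the opposite one, which is the kind of compression the DescriptionComp columns rely on.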
This is pretty much all you need to make predictions. That’s what I hate about data mining by the way – “This is all I’ve done. Doesn’t look like much. But it took me months.”
I’ve been trying two approaches to predictions: 1) predict directly the number of players, 2) divide the games into 10 groups (almost no players, very few players etc.) and guess in which group a game will belong. Recently, I’ve had more success with the former approach – regression. So I’ll show just that.
Below are the results from Support Vector Machines, which gave me about the best numbers and behavior. Since the average numbers of players range from nearly zero to around 115,000, I applied a logarithm, which brought the range down to [0-16] and narrowed the large differences. In case you wondered, I evaluated the models on a separate test set.
| | SVM (polynomial kernel) | Baseline |
|---|---|---|
| within +-1 from actual | 52.9 % | 58.2 % * |
| within +-2 from actual | 81.8 % | 80.1 % |
| within +-3 from actual | 93.7 % | 89.3 % |
* Given the distribution, this is simply the share of games whose actual value lies in [0-2].
The interpretation partially depends on what you expect. If you believe that it’s solely reviews, YouTube, and other “social” factors that determine how popular a game will be, then these results show that the game itself matters, too – a lot. It’s nowhere near accurate, however, and I definitely couldn’t recommend these predictions for any business decisions.
Ideally, all predictions should be very close to the actual value. “Very close” could be in this case defined as +-1. As shown above, about 53 % of games are well predicted. That may not sound too bad but notice the baseline. If I say that all games will have very few players, I’ll get 58 % of them right. (I wrote about what Steam Greenlight caused earlier by the way.) This gets better as I allow a larger margin of error. However, +-3 basically means “It might sell 100,000 copies but maybe just 1,000”.
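For clarity, this is roughly how the "within +-k" numbers and the constant baseline can be computed. It is a sketch on made-up values: the player counts, the model predictions, the base-2 logarithm with a +1 offset, and the baseline constant of 1.0 are all assumptions chosen for illustration, not my actual data or settings.

```python
import math


def within_margin(actual, predicted, margin):
    """Share of predictions whose absolute error is at most margin."""
    hits = sum(1 for a, p in zip(actual, predicted) if abs(a - p) <= margin)
    return hits / len(actual)


# Hypothetical average player counts for seven games
players = [1, 2, 2, 5, 40, 900, 20000]

# Log scale; base 2 and the +1 offset (to avoid log of zero) are assumptions
actual = [math.log2(p + 1) for p in players]

# Hypothetical model output on the same log scale
model = [1.5, 1.5, 2.0, 3.0, 4.0, 7.0, 11.0]

# Baseline: claim that every game has very few players
baseline = [1.0] * len(actual)

for margin in (1, 2, 3):
    print(margin,
          round(within_margin(actual, model, margin), 3),
          round(within_margin(actual, baseline, margin), 3))
```

Because most games sit near the bottom of the scale, the constant baseline already scores well at the tightest margin. That is why a model has to be compared against it rather than judged on the raw percentage alone.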
Below is my preferred look at regression results which tells me more than any numbers. It says how far each individual prediction is from its actual value.
You can see a high concentration of games that don’t receive much attention at the bottom. In an ideal scenario, all points would lie on the diagonal. It doesn’t look very nice in this regard but it is definitely far from random. Notice that the actually successful games mostly get underestimated – this is where you can say “those reviews and social media matter”.
Fun fact: when trying out different models, some gave me pretty much exact prediction for GTA V despite not seeing such a high-value example before.
I’m planning to re-download everything. There were some errors with release dates in the API so I’d like to resolve that. Also, I didn’t consider the minimum hardware requirements before. Those actually say quite a lot about a game. Another thing I’d like to add is screenshots. I can’t really do any in-depth analysis but perhaps dominant colors or color depth might be useful.
Hopefully, this should also give me over 1,000 additional games. That itself should improve the predictions and give me some room for better evaluation. I’ll see if I can translate the average players numbers into sold copies. I’ll never get exact predictions so some deviation isn’t a huge deal at this point.
Finally, after several frustrating realizations during my research, I feel like I can accomplish something. It probably won’t be an accurate prediction system. Maybe I’ll just separate some games that are predictable. Basically, I’ll be looking for the right sub-task which will give good results. I’m positive there’s something exciting and useful waiting to get dug up.