I tend to think of data mining as a two-stage process. Data acquisition and preprocessing is the more boring and often quite time-consuming part. Then there are the actual predictions, which is where we use machine learning. If you don’t know anything about that, you might want to scroll through this ridiculously beautiful introduction. However, this is also the part where you might realize you did something wrong earlier, so in practice the work iterates between tuning the data and training models.
I purposefully did not include data about reviews and overall reception, to see if games are predictable based purely on what kind of games they are. It’s hard to tell what the result should be. Some might say it’s impossible to predict anything from that, but I definitely expected at least some correlation.
There’s one pretty huge problem here. As I showed earlier, the majority of games don’t get much attention. Suppose I can give reasonable predictions for 80% of games. That may not sound too bad, quite the contrary in fact. The problem is, if I simply claim that every game will sell only a small(-ish) number of copies, I will be right for 85% of games!
This will always skew the results, and I can’t do anything about it unless I figure out a way to filter these low-selling games out. That’s not the only problem, though. It also makes the models biased towards examples with the more frequent (in this case low) target values. I can mitigate this by populating the dataset with copies of examples with higher values. This tells the models “Hey, there’s also us, the guys with high values!”
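A minimal sketch of that oversampling idea, assuming the data sits in NumPy arrays (the names `X`, `y`, the threshold, and the duplication factor are all hypothetical):

```python
import numpy as np

def oversample_high(X, y, threshold, factor):
    """Duplicate examples whose target exceeds `threshold` an extra
    `factor` times, so the model sees the rare high-valued games more often."""
    high = y > threshold
    X_extra = np.repeat(X[high], factor, axis=0)
    y_extra = np.repeat(y[high], factor)
    return np.vstack([X, X_extra]), np.concatenate([y, y_extra])

# toy example: two low-valued games, one high-valued game
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([10, 20, 5000])
X_bal, y_bal = oversample_high(X, y, threshold=1000, factor=4)
# the high-valued game now appears 5 times in total
```

Plain duplication is the crudest form of oversampling; weighting examples or synthesizing new ones (e.g. SMOTE-style) are common alternatives.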
Anyway, let’s move on to the actual predictions. Since I’m predicting the average number of players, ranging from 0 (yes, zero) to about 115,000, this is a regression task (predicting the actual value). However, I also split the data into 10 groups so I can perform classification (assigning the examples to distinct classes). The two approaches can perform differently, so it’s worth trying both when possible.
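One way to derive those 10 groups is to cut the player counts into equal-width bins on a log scale; this is a sketch under that assumption, with made-up player counts:

```python
import numpy as np

# hypothetical average-player counts, spanning 0 to ~115,000
players = np.array([0, 3, 12, 85, 400, 2500, 18000, 115000])

# work on a log scale (log1p handles the zeros), then cut into 10 bins
log_players = np.log1p(players)
edges = np.linspace(log_players.min(), log_players.max(), 11)  # 10 bins -> 11 edges
classes = np.clip(np.digitize(log_players, edges) - 1, 0, 9)   # labels 0..9
```

Equal-frequency bins (quantiles) would be another reasonable choice and would sidestep some of the class-imbalance problem described above.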
First, regression. I log-transformed the values to make the job easier. The algorithm used was Random Forest (an ensemble of many decision trees, each built on a random sample of the data and a random subset of the attributes; it usually performs very well):
- Correlation coefficient: 0.75
- Mean Absolute Error (MAE): 1.51
- Root Mean Square Error (RMSE): 1.85
- Normalized RMSE: 71.8 %
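A sketch of this setup with scikit-learn. The feature matrix and target here are synthetic stand-ins for the real game attributes, so the numbers it produces won’t match the ones above; it only illustrates the log transform, the Random Forest, and how the reported metrics are computed:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                # placeholder game features
y = np.expm1(np.abs(X[:, 0]) * 3 + rng.normal(scale=0.5, size=500))  # skewed target

# log-transform the target to tame its huge range
y_log = np.log1p(y)
X_train, X_test, y_train, y_test = train_test_split(X, y_log, random_state=0)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
pred = model.predict(X_test)

corr = np.corrcoef(y_test, pred)[0, 1]                # correlation coefficient
mae = mean_absolute_error(y_test, pred)               # MAE on the log scale
rmse = mean_squared_error(y_test, pred) ** 0.5        # RMSE on the log scale
nrmse = rmse / (y_test.max() - y_test.min())          # normalized RMSE
```

Note that the errors are measured on the log scale; to report them in actual player counts you would invert the transform with `np.expm1` first.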
I prefer visualization when evaluating regression because you can clearly see how the model behaves across all the data. Here you can see that it’s prone to overestimating the less-played games while mostly underestimating the games on the other side of the spectrum. Overall, nothing I can boast about.
Let’s look at classification on 10 classes, again using Random Forest:
- Accuracy: 51.0 % (baseline: 58.8 %)
- Correct or off by one class: 81.7 % (baseline: 85.5 %)
- Correct or off by two classes: 93.5 % (baseline: 94.6 %)
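The “correct or off by one class” metric isn’t a standard scikit-learn scorer, but since the classes are ordered it’s easy to compute by hand. A sketch with made-up labels (the function name is my own):

```python
import numpy as np

def within_k_accuracy(y_true, y_pred, k):
    """Fraction of predictions that land within k classes of the truth."""
    return float(np.mean(np.abs(y_true - y_pred) <= k))

# hypothetical class labels (0-9) for a handful of games
y_true = np.array([0, 1, 3, 5, 7, 9])
y_pred = np.array([0, 2, 3, 3, 8, 4])

acc = within_k_accuracy(y_true, y_pred, 0)   # plain accuracy: 2/6
off1 = within_k_accuracy(y_true, y_pred, 1)  # correct or off by one: 4/6
```

The same function with `k=2` gives the “off by two classes” figure; the baseline values come from applying it to an all-majority-class prediction.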
As you can see, the predictions are pretty much all over the place. The good thing is, I don’t have a lot of problems with severe overestimation. I’m mostly concerned about performance on games with many players: that’s where I get a lot of underestimation, but not the opposite. This can actually be expected, as it suggests it doesn’t matter how feature-packed a game is if it gets a lot of attention for some reason not captured by the data.
In the end, I learned that I probably need to stop looking at the games as a whole and instead find distinct groups for which the predictions work better. Perhaps try only two groups and focus on the games with a lot of potential. Another promising way to split the data may be by developer/publisher. I can also keep tuning the data I have, and even add more. There’s a lot to be done.