
Data trainees predict Champions League winner via machine learning [update April 2022]
A few weeks ago, the biggest football event of the year kicked off: the Champions League. Coincidentally, Ormit Talent’s brand new data traineeship started at around the same time. We seized this opportunity to combine our passion for data and football in an ambitious data project: predicting the winner of the Champions League 2021-2022. We are data trainees Jonathan Kemel and Tom Martens, nice to meet you!
Update [April 2022]
With the Champions League final approaching, it is a good time to review the predictions we made and find out whether our model is still on track.
Comparing the group stages, we correctly predicted 12 of the 16 teams that advanced to the knockout phase. Looking at the rankings within the groups, 18 of the 32 teams finished in the position we forecast. Keep in mind that ranking errors never come alone: if you mispredict the rank of one team, you have necessarily mispredicted the rank of another team in the same group as well.
But what does this tell us about which team you should put your money on?
The next phase of the tournament is the round of 16, the first round of the knockout phase. At the time of the prediction, we didn’t know which teams would advance or what the draw would look like. Therefore, we simulated a random draw ourselves, using the teams we forecast to advance. We correctly predicted 5 of the 8 teams that reached the quarter-finals, and 2 of the 4 semi-finalists. Villarreal is the biggest surprise, since they knocked out Bayern Munich in the quarter-finals. We still need to find out whether we correctly predicted the winner of the Champions League 2021-2022. Since Manchester City is still in the running, this is still possible!
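The random draw described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the team names are placeholders, the function name `random_draw` is ours, and the sketch ignores the restrictions of the official draw (such as keeping teams from the same group apart), because we do not describe here which restrictions, if any, were applied.

```python
import random


def random_draw(teams, seed=None):
    """Pair teams at random into two-team knockout ties.

    Assumption: a plain shuffle without seeding or group/country protection.
    """
    if len(teams) % 2 != 0:
        raise ValueError("an even number of teams is needed to form ties")
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    shuffled = list(teams)     # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    # Take the shuffled teams two at a time: (0, 1), (2, 3), ...
    return [(shuffled[i], shuffled[i + 1]) for i in range(0, len(shuffled), 2)]


# Placeholder names standing in for the 16 teams forecast to advance.
forecast_qualifiers = [f"Team {i}" for i in range(1, 17)]
ties = random_draw(forecast_qualifiers, seed=42)

for home, away in ties:
    print(f"{home} vs {away}")
```

Running the draw many times with different seeds and averaging the outcomes would reduce the influence of a single lucky or unlucky pairing.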
Our mission
Will Chelsea win for the second time in a row, will Manchester City finally live up to its title ambitions, or will we get a major surprise? Predicting the next Champions League winner is not something you do by guesswork. The success of a team depends on several factors, such as its past performance, market value and ratings. The population size and wealth of the country in which the team plays also have an impact. That’s why we took two approaches: Jonathan based his model on recent results, while Tom developed two models based on the team profile. For our data analyses, we used various machine learning techniques.
Jonathan’s model: data prediction based on recent team results
Tom’s model: data prediction based on team profile
1. Data collection.
To determine the team profile, I combined several data sets. For example, I took into account population size and gross domestic product (GDP), since wealthier countries presumably invest more money in football. I measured the strength of each team by analyzing team ratings from all FIFA games and Champions League results from 2005 onwards. I did this for all teams that have reached at least the round of 16 of the Champions League since 2005. Some teams, such as Shakhtar Donetsk, were not yet included in the FIFA 2005 game, so their FIFA rating for that year was missing from the dataset. To solve this, I calculated the average rating over the years in which these teams did have a FIFA rating and filled in the missing values with that average. Clubs without any historical data, such as Sheriff, were excluded from the analysis. Once all the data was collected, I trained the models using WEKA, a tool for data mining (that is, finding relationships in data sets).
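As a minimal sketch of the gap-filling step, the snippet below replaces each missing rating with the club's own average over the years that do have a rating, and drops clubs with no ratings at all. The club names and numbers are made up for illustration, and the function name `impute_with_club_mean` is hypothetical; the actual work was done with WEKA, not with this code.

```python
from statistics import mean


def impute_with_club_mean(ratings):
    """Fill missing yearly ratings with the club's own average rating.

    ratings maps club -> {year: rating or None}. Clubs without a single
    known rating are excluded from the result.
    """
    filled = {}
    for club, by_year in ratings.items():
        known = [r for r in by_year.values() if r is not None]
        if not known:
            continue  # no historical data at all: exclude the club
        club_mean = mean(known)
        filled[club] = {
            year: (r if r is not None else club_mean)
            for year, r in by_year.items()
        }
    return filled


# Made-up example data, not the real FIFA ratings.
ratings = {
    "Club A": {2005: None, 2006: 70, 2007: 74},    # missing in 2005
    "Club B": {2005: 80, 2006: 82, 2007: 84},      # complete
    "Club C": {2005: None, 2006: None, 2007: None},  # no data at all
}
filled = impute_with_club_mean(ratings)
print(filled)
```

Filling with the club's own average keeps the club in the dataset without pulling its rating towards that of unrelated teams, at the cost of hiding any trend over the years.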
2A. Data analysis with linear regression.
For the first model, I combined the variables ‘team rating’, ‘population’, ‘number of spectators’ and ‘competition’ for each team in a single formula. The result was a score that gave a good estimate of the performance of each team. The higher the score, the further the team is predicted to advance in the tournament.
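To make the idea of such a formula concrete, here is a small sketch of how a fitted linear model turns a team profile into a score: each variable is multiplied by a weight and the results are added up. All weights, feature names, units and team values below are invented for illustration; the real weights were learned by WEKA from the historical data, and the encoding of ‘competition’ as a 0/1 flag is an assumption.

```python
def linear_score(features, weights, intercept=0.0):
    """Weighted sum of the features: intercept + sum(weight * value)."""
    return intercept + sum(weights[name] * value for name, value in features.items())


# Hypothetical weights; a real model learns these from historical results.
weights = {
    "team_rating": 0.5,
    "population_millions": 0.01,
    "spectators_thousands": 0.02,
    "competition_top5": 1.0,  # assumed 0/1 flag for the domestic competition
}

# Made-up team profiles.
teams = {
    "Team A": {"team_rating": 85, "population_millions": 80,
               "spectators_thousands": 70, "competition_top5": 1},
    "Team B": {"team_rating": 78, "population_millions": 10,
               "spectators_thousands": 30, "competition_top5": 0},
}

scores = {name: linear_score(profile, weights) for name, profile in teams.items()}
# Rank teams from highest to lowest score.
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

The team at the top of the ranking is the one the model expects to go furthest.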
3A. And the winner is … (pt 1)
According to this model, Bayern Munich will be crowned kings of football on May 28. The losing finalist is another traditional German club: Dortmund.
2B. Data analysis via nearest neighbors.
For the second model, I applied the machine learning method nearest neighbors. For each team, the model searches for the seven clubs with the most similar profile. It then averages the scores of those seven clubs to estimate how well the team will perform. For example, if team A has many similarities to teams that do well in the Champions League, the model predicts that team A will advance far.
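The nearest-neighbors idea can be sketched as follows, assuming Euclidean distance between profiles and features that have already been scaled to a comparable range. Both are assumptions, as are the made-up profiles and scores; the function name `knn_predict` is hypothetical and the actual model was built in WEKA.

```python
import math


def knn_predict(target, history, k=7):
    """Predict a score as the mean score of the k most similar profiles.

    history is a list of (feature_vector, score) pairs; similarity is
    measured with Euclidean distance (an assumption of this sketch).
    """
    if len(history) < k:
        raise ValueError("need at least k historical profiles")
    nearest = sorted(history, key=lambda item: math.dist(item[0], target))[:k]
    return sum(score for _, score in nearest) / k


# Made-up profiles (two scaled features) with a made-up performance score.
history = [
    ((0.90, 0.80), 6), ((0.85, 0.90), 5), ((0.80, 0.70), 4),
    ((0.75, 0.60), 4), ((0.70, 0.65), 3), ((0.60, 0.50), 2),
    ((0.55, 0.40), 2), ((0.20, 0.15), 1), ((0.10, 0.10), 0),
]

# Profile of the team we want a prediction for.
target = (0.80, 0.75)
prediction = knn_predict(target, history, k=7)
print(round(prediction, 2))
```

Here the two profiles that are least similar to the target are ignored, and the prediction is the average score of the remaining seven.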
3B. And the winner is … (pt 2)
Manchester City wins over Dortmund by a large margin. This puts the result in line with Jonathan’s model, which also predicted Manchester City as the winner. The statistics are favorable for the English champions!
Who are we?
Jonathan Kemel
I am… a data geek with a big heart for the people behind the data. I value the input of colleagues and believe that the whole is always more than the sum of its parts.
My strongest data skill is… visualizing data and making results clear to others at a glance.
Other assets are… my analytical mind and data-driven mindset. Translating numbers and data into understandable models and insights? That’s what I get a kick out of!
In addition… I’m a real sports freak, on and off the field. Beat Tom in a game of FIFA? Been there, done that, nailed it.
Tom Martens
