Data trainees predict Champions League winner via machine learning [update April 2022]
A few weeks ago, the kick-off was given for the biggest football event of the year: the Champions League. Coincidentally, at around the same time, Ormit Talent’s brand new data traineeship kicked off. We seized this opportunity to combine our passion for data and football in an ambitious data project: predict the winner of the Champions League 2021-2022. Data trainees Jonathan Kemel and Tom Martens, nice to meet you!
Update [April 2022]
Now that the Champions League is heading towards the final, it is a good time to review the predictions we made and see whether our model is still on track.
When we compare the group stages, we see that we correctly predicted 12 of the 16 teams that advanced to the knockout phase. Looking at the rankings within the groups, 18 of the 32 teams finished in the position we forecast. Errors come in pairs here: if you mispredict the ranking of one team, you have necessarily also mispredicted the ranking of another team in the same group.
But what does this tell us about which team you should put your money on?
The next phase of the tournament is the round of 16, the first round of the knockout phase. At the time of the prediction, we didn’t know which teams would advance or what the draw would look like. We therefore carried out a random draw ourselves among the teams we forecast to advance. We correctly predicted 5 of the 8 teams that reached the quarter-finals and 2 of the 4 semi-finalists. Villarreal is the biggest surprise, since they knocked out Bayern Munich in the quarter-finals. We still need to find out whether we correctly predicted the winner of the Champions League 2022. Since Manchester City is still in the running, this is still possible!
Will Chelsea win for the second time in a row, will Manchester City finally live up to its title ambitions, or will we get a big surprise? Predicting the next Champions League winner is not something you do by guesswork. The success of a team depends on several factors, such as past performance, market value and ratings. The population size and wealth of the country in which the team plays also have an impact. That’s why we took two approaches: Jonathan relied on recent results, while Tom developed two models based on the team profile. For our data analyses, we used various machine learning techniques.
Jonathan’s model: data prediction based on recent team results
1. Data collection.
First, I collected all league results of the last four years for each participant. Since last year’s results say the most about a team’s current form, I gave them the most weight. But a goal in the Premier League means something different than a goal in the Jupiler Pro League. I therefore calculated the market value of each league and integrated it into the model as a coefficient. For example, the Belgian league got a coefficient of 0.74 for goals scored and 1.21 for goals against. For the English league, the coefficient was 1.29 for goals scored and 0.79 for goals against. This means that English teams score more goals and concede fewer goals than average. For Belgian teams, unfortunately, the reverse is true.
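As a rough illustration of the recency weighting described above, the sketch below averages a club's goals per game over four seasons and gives the most recent season the largest weight. The weights and the goal figures are hypothetical; the article does not state the values that were actually used.

```python
# Hypothetical recency weights, most recent season first (not from the article).
SEASON_WEIGHTS = [0.4, 0.3, 0.2, 0.1]


def weighted_goal_average(goals_per_game, weights=SEASON_WEIGHTS):
    """Recency-weighted average of per-season goals per game (newest first)."""
    if len(goals_per_game) != len(weights):
        raise ValueError("need exactly one weight per season")
    total_weight = sum(weights)
    return sum(g * w for g, w in zip(goals_per_game, weights)) / total_weight


# Made-up goals-per-game figures for one club over the last four seasons.
club_goals_scored = [2.0, 1.8, 1.5, 1.6]
recent_form = weighted_goal_average(club_goals_scored)
print(round(recent_form, 2))
```

Dividing by the sum of the weights keeps the result a proper average even if the chosen weights do not add up to exactly 1.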
2. Data analysis with Python
After all groups were formed and the schedule was known, I determined the probability of each team winning, drawing or losing each match. To do so, I first calculated how many goals each club scores and concedes on average in its own league. This gave me four values per team: the number of goals they score at home (“home scored”), the number of goals the visiting team scores against them (“home conceded”), the number of goals they score away (“away scored”) and the number of goals the opponent scores when they play away (“away conceded”). To maintain the distinction between the leagues, these variables were multiplied by the market value coefficient.
With those results, I went back to the groups, where I calculated for each game how likely each team was to score a certain number of goals. The limit was set at 10 goals per team. This allowed me to create a new dataset with the probability of each outcome. For example, for the game PSG versus Manchester City, there was a 29% chance of a draw, a 24% chance of PSG winning, and a 46% chance of Manchester City winning. By multiplying these probabilities by the number of points each outcome yields (3 for a win, 1 for a draw, and 0 for a loss) and repeating this calculation for each match, I was able to estimate the group standings in fine detail. Since a number of games have been played in the meantime, I have updated the group results in the picture with the actual results.
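The league adjustment could look roughly like the sketch below. The coefficients are the Belgian and English values quoted in this article; the `TeamProfile` class, the `adjust_for_league` helper and the domestic averages are hypothetical, and applying the "goals against" coefficient to both conceded values is an assumption.

```python
from dataclasses import dataclass

# League coefficients quoted in the article: (goals scored, goals against).
LEAGUE_COEFFICIENTS = {
    "Belgium": (0.74, 1.21),
    "England": (1.29, 0.79),
}


@dataclass
class TeamProfile:
    home_scored: float
    home_conceded: float
    away_scored: float
    away_conceded: float


def adjust_for_league(profile, league):
    """Scale scoring values by the 'scored' coefficient, conceding values by the 'against' one."""
    scored_coef, against_coef = LEAGUE_COEFFICIENTS[league]
    return TeamProfile(
        home_scored=profile.home_scored * scored_coef,
        home_conceded=profile.home_conceded * against_coef,
        away_scored=profile.away_scored * scored_coef,
        away_conceded=profile.away_conceded * against_coef,
    )


# Made-up domestic averages (goals per game) for a Belgian club.
raw = TeamProfile(home_scored=2.0, home_conceded=1.0, away_scored=1.5, away_conceded=1.2)
adjusted = adjust_for_league(raw, "Belgium")
print(adjusted)
```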
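The article does not say which distribution was used for the number of goals. A common choice for this kind of model is the Poisson distribution, so the sketch below assumes it, together with the assumption that both teams score independently. The expected-goals inputs are hypothetical; only the cap of 10 goals and the 3/1/0 points scheme come from the text.

```python
import math

MAX_GOALS = 10  # the article caps each team at 10 goals


def goal_probabilities(expected_goals, max_goals=MAX_GOALS):
    """Poisson probability of scoring 0..max_goals goals (assumed distribution)."""
    return [
        math.exp(-expected_goals) * expected_goals**k / math.factorial(k)
        for k in range(max_goals + 1)
    ]


def outcome_probabilities(home_xg, away_xg):
    """Return (home win, draw, away win) by summing over every scoreline."""
    home = goal_probabilities(home_xg)
    away = goal_probabilities(away_xg)
    win = draw = loss = 0.0
    for h, p_h in enumerate(home):
        for a, p_a in enumerate(away):
            p = p_h * p_a  # assumes both teams score independently
            if h > a:
                win += p
            elif h == a:
                draw += p
            else:
                loss += p
    return win, draw, loss


def expected_points(p_win, p_draw):
    """3 points for a win, 1 for a draw, 0 for a loss."""
    return 3 * p_win + 1 * p_draw


# Hypothetical expected goals for the home and away team.
p_home, p_draw, p_away = outcome_probabilities(1.3, 1.7)
print(round(expected_points(p_home, p_draw), 2), round(expected_points(p_away, p_draw), 2))
```

Summing the expected points over all six group matches of a team gives its projected group total. With the PSG versus Manchester City figures above, PSG would expect 3 × 0.24 + 0.29 = 1.01 points from that match.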
For the knockout phase – taking into account the current Champions League rules – I determined the draw myself. Per round, the chances of losing and winning were calculated based on the same formulas. In case of a draw, I gave each team a 50% chance of advancing, because penalties are a lottery.
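The 50/50 rule for drawn ties can be written down in one line. The sketch below is a minimal illustration; the function name is hypothetical, and the inputs simply reuse the rounded PSG versus Manchester City percentages from the group stage as example numbers.

```python
def advance_probability(p_win, p_draw):
    """Chance of progressing when a drawn tie is treated as a 50/50 lottery."""
    return p_win + 0.5 * p_draw


# Example inputs: the rounded PSG vs Manchester City figures quoted earlier.
psg_advances = advance_probability(p_win=0.24, p_draw=0.29)
city_advances = advance_probability(p_win=0.46, p_draw=0.29)
print(round(psg_advances, 3), round(city_advances, 3))
```

Because the quoted percentages are rounded, the two results add up to 0.99 rather than exactly 1.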
3. And the winner is ….
The analysis shows that Manchester City will win against Bayern Munich in the final. This prediction corresponds nicely with what betting sites predict. The question remains, of course, what the knockout schedule will look like after the draw, as it is quite possible that Manchester City and Bayern Munich will meet earlier. Once I have that info, I can make even more accurate predictions. And who knows, maybe Club Brugge will continue to smash all the predictions. Anything is possible in football, the ball is still round 😉
Tom’s model: data prediction based on team profile
1. Data collection.
To determine the team profile, I combined several data sets. For example, I took into account population size and gross domestic product (GDP), since wealthier countries presumably invest more money in football. I measured the strength of each team by analyzing team ratings from all FIFA games and Champions League results from 2005 onwards. I did this for all teams that have reached at least the round of 16 of the Champions League since 2005. Some teams, such as Shakhtar Donetsk, were not yet included in the FIFA 2005 game, so their FIFA rating for that year was missing from the dataset. To solve this, I calculated for these teams the average rating over the years in which they did have a FIFA rating and filled in the missing values with that average. Clubs without any historical data, like Sheriff, were excluded from the analysis. Once all the data was collected, I trained the models using WEKA, a tool for data mining (= finding relationships in data sets).
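The way missing ratings were filled in can be sketched as follows. The actual work was done in WEKA, so this Python version is only an illustration; the `impute_ratings` helper, the club names and the ratings are all made up.

```python
from statistics import mean


def impute_ratings(ratings_by_team):
    """Fill missing yearly ratings with the team's own mean; drop teams with no ratings."""
    completed = {}
    for team, ratings in ratings_by_team.items():
        known = [r for r in ratings.values() if r is not None]
        if not known:
            continue  # no history at all, so the team is excluded
        team_mean = mean(known)
        completed[team] = {
            year: (team_mean if r is None else r) for year, r in ratings.items()
        }
    return completed


# Made-up ratings; None marks a year in which the club had no rating.
ratings = {
    "Club A": {2005: None, 2006: 70, 2007: 74},
    "Club B": {2005: None, 2006: None, 2007: None},
}
filled = impute_ratings(ratings)
print(filled)
```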
2A. Data analysis with linear regression.
For the first model, I fed the variables ‘team rating’, ‘population’, ‘number of spectators’ and ‘competition’ for each team into a formula. The result was a score that allowed me to estimate the performance of each team. The higher the score, the further the team is expected to advance in the tournament.
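To show what such a formula looks like once it has been fitted, the sketch below scores two teams with a linear model. Every coefficient and feature value is hypothetical, since the fitted WEKA model is not given in this article, and encoding ‘competition’ as a numeric league strength is an assumption.

```python
# Hypothetical intercept and coefficients (not the values fitted in WEKA).
INTERCEPT = -5.0
WEIGHTS = {
    "team_rating": 0.10,
    "population_millions": 0.01,
    "spectators_thousands": 0.02,
    "competition_strength": 1.5,  # assumed numeric encoding of the league
}


def linear_score(features):
    """Intercept plus the weighted sum of the team's features."""
    return INTERCEPT + sum(WEIGHTS[name] * features[name] for name in WEIGHTS)


# Made-up team profiles.
teams = {
    "Team A": {"team_rating": 85, "population_millions": 80,
               "spectators_thousands": 70, "competition_strength": 1.1},
    "Team B": {"team_rating": 78, "population_millions": 11,
               "spectators_thousands": 25, "competition_strength": 0.8},
}

# A higher score means the team is expected to advance further.
ranking = sorted(teams, key=lambda name: linear_score(teams[name]), reverse=True)
print(ranking)
```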
3A. And the winner is …. (pt 1)
According to this model, Bayern Munich may crown themselves football kings on May 28. The losing finalist is another traditional German club: Dortmund.
2B. Data analysis via nearest neighbors.
For the second model, I applied the nearest neighbors machine learning method. For each team, the model searches for the seven clubs with the most similar profile. It then averages the scores of those seven clubs to estimate how strongly the team will perform. For example, if team A has many similarities to teams that do well in the Champions League, the model predicts that team A will advance far.
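A minimal nearest-neighbors sketch with seven neighbors is shown below. It assumes Euclidean distance and features that are already scaled to a comparable range; the article does not specify either. The `knn_predict` function and the reference clubs are hypothetical.

```python
import math


def knn_predict(query, clubs, k=7):
    """Average the scores of the k clubs whose feature vectors are closest to query."""
    if k > len(clubs):
        raise ValueError("k cannot exceed the number of reference clubs")
    by_distance = sorted(clubs, key=lambda club: math.dist(query, club[0]))
    nearest = by_distance[:k]
    return sum(score for _, score in nearest) / k


# Made-up (features, score) pairs; features are assumed to be scaled to 0..1.
reference_clubs = [
    ((0.90, 0.80), 6.0),
    ((0.85, 0.90), 5.0),
    ((0.80, 0.70), 5.0),
    ((0.70, 0.60), 4.0),
    ((0.60, 0.50), 3.0),
    ((0.50, 0.40), 3.0),
    ((0.40, 0.50), 2.0),
    ((0.20, 0.10), 1.0),
    ((0.10, 0.20), 0.0),
]

predicted = knn_predict((0.88, 0.85), reference_clubs)
print(predicted)
```

Scaling matters here: without it, a feature with large numbers such as population would dominate the distance and drown out the team rating.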
3B. And the winner is …. (pt 2)
Manchester City wins over Dortmund by a large margin. This puts the result in line with Jonathan’s model that also predicted Manchester City as winners. The statistics are favorable for the English champion!
Who are we?
I am… a data geek with a big heart for the people behind the data. I value the input of colleagues and believe that the whole is always more than the sum of its parts.
My strongest data skill is… visualizing data and making results clear to others at a glance.
Other assets are… my analytical mind and data-driven mindset. Translating numbers and data into understandable models and insights? That’s what I get a kick out of!
In addition… I’m a real sports freak, on and off the field. Beat Tom in a game of FIFA? Been there, done that, nailed it.
I am… a team player who likes to learn and who is always ready to support colleagues in word and deed. Helping others gives me energy!
My strongest data skill is… finding connections in large amounts of data. This makes it easier for me to predict relevant trends.
Other assets are… my solution-oriented approach and eagerness. In my data projects, I will often explore multiple options to arrive at the best solution.
In addition… I am a huge sports fan. Getting Jonathan to win during a game of FIFA? Been there, done that, nailed it.