Analysis of European Soccer Database¶

Erick Herrera, Hyejeong Kim, Kibreab Gebrehiwot

Introduction¶

As the popularity of machine learning has drastically increased over the last decade, so has the demand for data literacy as a skill in the job market. This tutorial aims to give a general introduction to data science by walking the data science lifecycle (Collecting data, Cleaning data, Exploring data, Building models, and Interpretation) as we try to analyze the data within our chosen database.

We will be using a dataset based on the European fútball 2008-2016 seasons. This dataset is comprised of the information on 11 different European fútball league and their teams. Specifically, the data contains 7 tables (Country, League, Player, player attributes, team, team attributes, and all the matches played during 2008 - 2016 seasons. The goal is to some insight on the beautiful game based on the data held by this dataset.

For this tutorial, We will be using the Pandas library to read and organize the data we are collecting, the Matplot library to render plots for our visualizations of the dataset, and SciKit-Learn library for building our models.

During our exploratory analysis & Data visualization step, some of the questions that we thought would be interesting to try to answer were:

Which team had highest amount of goals per league and per year?
Which team won the most games per league and per year?
How do the team attributes of the two top teams in each league differ from one another?
Are teams scoring more goal when playing home or away? How about the number of wins when home or away?
Is there a relationship between player attributes and their overall rating?
Which players have the highest overall rating per year?
What precent of player prefer their left foot? right foot? or both? Does this connect with their rating at all?
What is the relationship between number of goals scored and the win percentage of the teams?

For our models in this tutorial, we will be presenting two linear regression examples:

Determining if there is a linear relationship between a player's overall rating and the player's attributes documented in the dataset
Determining if there is a linear relationship between a team's winning percentage(team) and team's attributes documented in the dataset

As well, a logistical regression in the form of:

Determining if a team will be the winners of a match from the overall rating of each player on the team's starting line up

import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
import re
import sqlite3
import matplotlib.pyplot as plt
import warnings
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
import statsmodels.api as sm
import seaborn as sn

Collecting Data¶

The first step of the data lifescyle is the collecting phase. For this dataset, it is as simple as downloading the sql file from the database's host https://www.kaggle.com/hugomathien/soccer and performing SQL commands directly on the database to retrive the tables needed for the analysis of the data. The database contains 7 different tables, as such we created 7 variables to hold each the tables before we begin cleaning up.

sqlite_file = 'database.sqlite'
conn = sqlite3.connect(sqlite_file)    # connect to the database

Reading in tables from the database¶

query = "SELECT * FROM Country" 
countries = pd.read_sql(query, conn)      # read and store country table 

query = "SELECT * FROM League"               
leagues = pd.read_sql(query, conn)        # read and store league table

query = "SELECT * FROM Match"                    
matches = pd.read_sql(query, conn)       # read and store match table

query = "SELECT * FROM Player"              
players = pd.read_sql(query, conn)       # read and store player table


query = "SELECT * FROM Player_Attributes"          
player_Attributes = pd.read_sql(query, conn)    # read and store player_attributes table


query = "SELECT * FROM Team"                 
teams = pd.read_sql(query, conn)               # read and store team table

query = "SELECT * FROM Team_Attributes"        
team_Attributes = pd.read_sql(query, conn)     # read and store team_Attributs table

The tables have been stored in panda dataframes and each can be seen below.

countries.head()

leagues.head()

# Creating copy of match table for later use
matches_copy= matches.copy()

# getting the whole column names from match table 
my_list = list(matches)                      
my_list = matches.columns.values.tolist()

matches.head()

players.head()

player_Attributes.head()

teams.head()

team_Attributes.head()

Data Processing¶

Now with our data retrieved and stored, our next step (and the next step of the data lifecycle) is to tidy up the raw data to make it more readable and easier for us to use for our analysis and visualizations. To accomplish this, we will be merging a few of the tables together to make seeing useful insights in the data easier and we will be drop some of the columns from all the tables to remove irrelevant information.

cursor = conn.cursor()

Combining the country and league tables¶

Country and league tables can be combined into one so we can avoid having to link the two by their country id.

result = cursor.execute("""SELECT Country.id, Country.name, League.name FROM Country INNER JOIN
League ON Country.ID = League.ID""")
countryLeague = pd.DataFrame(result.fetchall())
countryLeague.columns = ['id', 'countryName', 'leagueName']
countryLeague.head()

Cleaning Matches Table¶

For tidying the matches table we remove the null columns in the table, we split the season of the match into two columns (season_start and season_end) and we insert the information we gained from merging the country and league tables above to make recognizing who the match belongs to easier

matches.dropna(axis = 1, inplace = True)
matches['date'] = pd.to_datetime(matches['date'])
matches['season_start'] = matches['season'].apply(lambda x: x.split('/')[0])
matches['season_end'] = matches['season'].apply(lambda x: x.split('/')[1])
matches.drop(columns = ['season'], axis = 1, inplace = True)
matchCountry = matches.merge(countryLeague, left_on = 'country_id', right_on = 'id')
matchCountry.drop(columns = ['league_id', 'id_y'], axis = 1, inplace = True)
matchCountry = matchCountry.rename(columns={"id_x": "id", "country_id": "country_league_id"})
matchCountry.head()

Taking it a step further, here we're including the actual names of the teams next to their team ids to make things more legible and introducing a column to keep track of who won the match in that entry.

team_Attributes['date'] = pd.to_datetime(team_Attributes['date'])

team_match = matchCountry.merge(teams, left_on = 'home_team_api_id', right_on = 'team_api_id')
team_match = team_match.rename(columns={"id_x": "id", "team_long_name": "home_team", "team_fifa_api_id": "home_team_fifa_api_id"})
team_match['home_team'] =  team_match['home_team'] + " (" + team_match['team_short_name'] + ")"
team_match.drop(columns = ['id_y', 'team_api_id', 'team_short_name'], axis = 1, inplace = True)

team_match = team_match.merge(teams, left_on = 'away_team_api_id', right_on = 'team_api_id')
team_match = team_match.rename(columns={"id_x": "id", "team_long_name": "away_team", "team_fifa_api_id": "away_team_fifa_api_id"})
team_match['away_team'] =  team_match['away_team'] + " (" + team_match['team_short_name'] + ")"
team_match.drop(columns = ['id_y', 'team_api_id', 'team_short_name'], axis = 1, inplace = True)

team_match = team_match[['id', 'match_api_id', 'date', 'home_team', 'home_team_api_id', 'home_team_fifa_api_id', 
                         'home_team_goal', 'away_team', 'away_team_api_id', 'away_team_fifa_api_id', 'away_team_goal', 
                         'country_league_id', 'stage', 'season_start', 'season_end', 'countryName', 'leagueName']]

winner = []
for index, row in team_match.iterrows():
    if team_match.iloc[index, 6] > team_match.iloc[index, 10]:
        winner.append(team_match.iloc[index, 3])
    elif team_match.iloc[index, 6] < team_match.iloc[index, 10]:
        winner.append(team_match.iloc[index, 7])
    else:
        winner.append("Tie")
team_match['winner'] = winner 

team_match.drop(columns = ['home_team_fifa_api_id', 'away_team_fifa_api_id'], axis = 1, inplace = True)
team_match.head()

Cleaning Player and Player_Attributes tables¶

Here we are merging the player and the player attributes tables into one while dropping any unnecessary columns.

players['birthday'] = pd.to_datetime(players['birthday'])
player_Attributes['date'] = pd.to_datetime(player_Attributes['date'])
player_attribute = players.merge(player_Attributes, left_on = 'player_api_id', right_on = 'player_api_id')
player_attribute.drop(columns = ['id_y', 'player_fifa_api_id_y'], axis = 1, inplace = True)
player_attribute = player_attribute.rename(columns={"id_x": "id", "player_fifa_api_id_x": "player_fifa_api_id"})
player_attribute.head()

Cleaning Team and Team_Attributes tables¶

Similarly to what we did with the player and player_attributes tables, here we are merging the team and team_attributes and then sorting them by name alphabetically.

team_attribute = team_Attributes.merge(teams, left_on = 'team_api_id', right_on = 'team_api_id')
team_attribute = team_attribute.rename(columns={"id_x": "id", "team_long_name": "team_name", "team_fifa_api_id_x": "team_fifa_api_id"})
team_attribute['team_name'] =  team_attribute['team_name'] + " (" + team_attribute['team_short_name'] + ")"
team_attribute.drop(columns = ['id_y', 'team_fifa_api_id_y', 'team_short_name'], axis = 1, inplace = True)
team_attribute.head()

cols = list(team_attribute.columns)
cols = [cols[0]] + [cols[-1]] + cols[1:-1]

team_attribute = team_attribute[cols]
team_attribute.sort_values(by='team_name', ascending=False)
team_attribute.head()

Exploratory analysis & Data visualization¶

Now with our data in order, we can begin the next step of the data lifecycle - data exploration and visualization. This following section will accomplish this section by running the gaunlet of questions we about the dataset as varies examples of the kinds of exploratory analysis that can be performed on data.

1. Which team had highest amount of goals per league and per year?¶

pd.set_option('mode.chained_assignment', None)

df = team_match[['season_start', 'home_team', 'home_team_goal', 'away_team', 'away_team_goal', 'leagueName']]
df['season_start'] = df['season_start'].astype(int)

Firstly, we are starting here by getting total number of goal for each team as home team and as away team each season for each league.

num_goal_home = df.groupby(['leagueName', 'season_start', 'home_team']).sum()
num_goal_home.drop(columns = ['away_team_goal'], axis = 1, inplace = True)
num_goal_home.head()

num_goal_away = df.groupby(['leagueName', 'season_start', 'away_team']).sum()
num_goal_away.drop(columns = ['home_team_goal'], axis = 1, inplace = True)
num_goal_away.head()

From there, we can just combine these columns to get the total number of goals for each team.

num_goal = num_goal_home.merge(num_goal_away, left_index=True, right_on=['leagueName', 'season_start', 'away_team'])
num_goal['total_num_goal'] = num_goal['home_team_goal'] + num_goal['away_team_goal']
num_goal.head()

A = num_goal # A is for later use

We then divided the teams by thier leagues and sorted the teams by their goal count.

num_goal.reset_index(inplace=True)  
num_goal_highest = num_goal.loc[num_goal.groupby(["leagueName", "season_start"])["total_num_goal"].idxmax()]

num_goal_highest = num_goal_highest.rename(columns={"away_team": "team_name"})
num_goal_highest.drop(columns = ['home_team_goal', 'away_team_goal'], axis = 1, inplace = True)
num_goal_highest.reset_index(drop=False, inplace=True)
num_goal_highest

for i, row in num_goal_highest.iterrows():
    num_goal_highest.iloc[i, 2] = str(num_goal_highest.iloc[i, 2]) + " - " + str(num_goal_highest.iloc[i, 2] + 1)

Now we can produce visualizations of these goal counts and see which teams held the most goals each season in each league.

%matplotlib inline

leagues = num_goal_highest.leagueName.unique()
count = 0
for league in leagues:
    new_df = num_goal_highest[num_goal_highest['leagueName'] == league]
    labels = new_df.season_start
    x = np.arange(len(labels))
    goals = new_df.total_num_goal
    width = 0.5
    fig, ax = plt.subplots()
    rects = ax.bar(x , goals, width)
    ax.set_ylabel('num goals')
    ax.set_xlabel('season')
    ax.set_title("team of highest goal per season of " + league)
    ax.set_xticks(x)
    ax.set_xticklabels(labels, fontsize=6)


    teams_label = new_df.team_name
    for rect in rects:
        team_name = teams_label[count].split("(")[0]
        height = rect.get_height() +1
        ax.text(rect.get_x() + width*0.5, 0.7*height, team_name,
                    ha='center', va='bottom', fontsize=7)
        count = count +1
    plt.show()

2. Which team won the most games per league and per year?¶

Very similarly to what we did for the previous question, we will first aggregate each teams wins.

df = team_match[team_match['winner'] != 'Tie'] # find rows with winner is not equal to Tie

df = df[['leagueName', 'season_start', 'winner']]
df['season_start'] = df['season_start'].astype(int)

Below we can see the teams with the most wins for each leaque over the years that the data set covers.

most_wins = df.groupby(['leagueName', 'season_start', 'winner'])['winner'].count().sort_values(ascending=False)
most_wins.head()

leagueName               season_start  winner              
Italy Serie A            2013          Juventus (JUV)          33
Spain LIGA BBVA          2011          Real Madrid CF (REA)    32
                         2012          FC Barcelona (BAR)      32
Scotland Premier League  2013          Celtic (CEL)            31
Spain LIGA BBVA          2009          FC Barcelona (BAR)      31
Name: winner, dtype: int64

Now we will be dividing them again to find the teams with the most wins in each league over the years.

leagues = []
seasons = []
teams = []
count_wins = []
for index, value in most_wins.items():
    leagues.append(index[0])
    season_full = str(index[1]) + " - " + str(index[1] + 1)
    seasons.append(season_full)
    teams.append(index[2])
    count_wins.append(value)
d = {'league':leagues,'season':seasons,'team_name':teams,'count_wins':count_wins}
most_wins = pd.DataFrame(d)
B = most_wins

most_wins = most_wins.loc[most_wins.groupby(["league", "season"])["count_wins"].idxmax()]
most_wins.reset_index(drop=False, inplace=True)
most_wins.head(10)

Now we can produce visualizations of these win counts and see which teams held the most wins each season in each league.

leagues = most_wins.league.unique()
count = 0
for league in leagues:
    new_df = most_wins[most_wins['league'] == league]
    labels = new_df.season
    x = np.arange(len(labels))
    count_wins = new_df.count_wins
    width = 0.35
    fig, ax = plt.subplots()
    rects = ax.bar(x , count_wins, width)
    ax.set_ylabel('num wins')
    ax.set_xlabel('season')
    ax.set_title("team of most won per season of " + league)
    ax.set_xticks(x)
    ax.set_xticklabels(labels, fontsize=6)
    teams_label = new_df.team_name
    for rect in rects:
        team_name = teams_label[count].split("(")[0]
        height = rect.get_height() +1
        ax.text(rect.get_x()+ width*0.5, 0.7*height, team_name,
                    ha='center', va='bottom', fontsize=7)
        count = count +1
    plt.show()

With both these sets of data retieved, we can attempt some cross analysis and compare the goals scored by each team to the wins each team then holds.

df = team_match[team_match['leagueName'] == 'Belgium Jupiler League']
df = df[df['season_start'] == '2013']
df.head()

Comparing most goal team vs most won team per season and per league¶

num_goal_highest.head()

most_wins.head()

team_goal_won = pd.merge(num_goal_highest, most_wins,  how='left', 
                         left_on=['leagueName','season_start'], right_on = ['league','season'])
team_goal_won = team_goal_won.drop(['index_x', 'league', 'season', 'index_y'], axis = 1)
team_goal_won = team_goal_won.rename(columns={"team_name_x": "team_highest_goal", "team_name_y": "team_most_won", 
                                              "season_start": "season"})
team_goal_won.head()

goal_won_same = team_goal_won[team_goal_won['team_highest_goal'] == team_goal_won['team_most_won']]

percent = goal_won_same.shape[0] / team_goal_won.shape[0]
percent * 100

63.63636363636363

From this analysis, we find that the team with the highest goals won the most games in a season roughly 64%.

We then can create a table with total goal number and total win count together for the best team per season and per league.

# A with the copy of the number of goals table from earlier
for index, row in A.iterrows():
    A.iloc[index, 1] = str(A.iloc[index, 1]) + " - " + str(A.iloc[index, 1] + 1)
A.head()

A = A.rename(columns={"away_team": "team_name"})
A.drop(columns = ['home_team_goal', 'away_team_goal'], axis = 1, inplace = True)

A.head()

B.head()

team_goal_won_total = pd.merge(A, B,  how='left', 
                         left_on=['leagueName','season_start', 'team_name'], right_on = ['league','season', 'team_name'])
team_goal_won_total = team_goal_won_total.drop(['league', 'season'], axis = 1)
team_goal_won_total = team_goal_won_total.rename(columns={"season_start": "season"})
team_goal_won_total.head()

3. How do the team attributes of the two top teams in each league differ from one another?¶

warnings.simplefilter(action='ignore', category=FutureWarning)

Here, we are aggregating and matching up the two best teams (based on each team's number of wins) of each league.

com_teams = []
leagues = most_wins.league.unique()
for league in leagues:
    new_df = team_goal_won[team_goal_won['leagueName'] == league]
    a = new_df['team_most_won'].value_counts()
    com_teams.append([league, a.index[0], a.index[1]])
df = pd.DataFrame(com_teams, columns = ['leangue', 'team1', 'team2']) 
df.head()

Now we can begin to visualize the differences in each of the team's attributes and compare to each other to see if we notice any trends in the two best teams that might account for their success.

# Belgium Jupiler League
compare_attr = pd.DataFrame(data=team_attribute.query('team_name == "RSC Anderlecht (AND)"').mean(),
                            columns=['RSC Anderlecht (AND)'])
compare_attr['Club Brugge KV (CLB)'] = team_attribute.query('team_name == "Club Brugge KV (CLB)"').mean()
compare_attr = compare_attr[3:]

compare_attr.plot.barh(width=0.50, figsize=(12,8))
plt.title('comparing best two teams of Belgium Jupiler League')
plt.ylabel('Attribute type')
plt.xlabel('Attribute value')
plt.legend(loc=0);

RSC Anderlecht was the team with the most wins in the Belgium Jupiler League and when their stats are compared to the runner up of the league, we can begin to assume from the visualization that their lies with their chance creation with passes and their creation of opportunities for taking shots at the goal.

# England Premier League
compare_attr = pd.DataFrame(data=team_attribute.query('team_name == "Manchester United (MUN)"').mean(),
                            columns=['Manchester United (MUN)'])
compare_attr['Manchester City (MCI)'] = team_attribute.query('team_name == "Manchester City (MCI)"').mean()
compare_attr = compare_attr[3:]
compare_attr

compare_attr.plot.barh(width=0.50, figsize=(12,8))
plt.title('comparing best two teams of England Premier League')
plt.ylabel('Attribute type')
plt.xlabel('Attribute value')
plt.legend(loc = 0);

Manchester United's victory in the English Premier league seems to be a bit more contested by the runner up in terms of stats as manchester seems to really only pull ahead in chances created from crosses while all the attributes are pretty similar to their rivals, Manchester City.

# France Ligue 1
compare_attr = pd.DataFrame(data=team_attribute.query('team_name == "Paris Saint-Germain (PSG)"').mean(),
                            columns=['Paris Saint-Germain (PSG)'])
compare_attr['Girondins de Bordeaux (BOR)'] = team_attribute.query('team_name == "Girondins de Bordeaux (BOR)"').mean()
compare_attr = compare_attr[3:]
compare_attr

compare_attr.plot.barh(width=0.50, figsize=(12,8))
plt.title('comparing best two teams of France Ligue 1')
plt.ylabel('Attribute type')
plt.xlabel('Attribute value')
plt.legend(loc = 0);

# Italy Serie A
compare_attr = pd.DataFrame(data=team_attribute.query('team_name == "Juventus (JUV)"').mean(),
                            columns=['Juventus (JUV)'])
compare_attr['Milan (ACM)'] = team_attribute.query('team_name == "Milan (ACM)"').mean()
compare_attr = compare_attr[3:]
compare_attr

compare_attr.plot.barh(width=0.50, figsize=(12,8))
plt.title('comparing best two teams of Italy Serie A')
plt.ylabel('Attribute type')
plt.xlabel('Attribute value')
plt.legend(loc = 0);

The Italy Serie A stats show something rather interesting. Milan(The league runner up) is actually ahead in a lot of the attribute but Juventus still managed to take the most wins in the league.

We continue with these same steps and analysis for the remaining leagues.

# Netherlands Eredivisie	Ajax (AJA)	PSV (PSV)
# Netherlands Eredivisie
compare_attr = pd.DataFrame(data=team_attribute.query('team_name == "Ajax (AJA)"').mean(),
                            columns=['Ajax (AJA)'])
compare_attr['PSV (PSV)'] = team_attribute.query('team_name == "PSV (PSV)"').mean()
compare_attr = compare_attr[3:]
compare_attr

compare_attr.plot.barh(width=0.50, figsize=(12,8))
plt.title('comparing best two teams of Netherlands Eredivisie')
plt.ylabel('Attribute type')
plt.xlabel('Attribute value')
plt.legend(loc = 0);

# Poland Ekstraklasa	Legia Warszawa (LEG)	Wisła Kraków (WIS)
# Poland Ekstraklasa
compare_attr = pd.DataFrame(data=team_attribute.query('team_name == "Legia Warszawa (LEG)"').mean(),
                            columns=['Legia Warszawa (LEG)'])
compare_attr['Wisła Kraków (WIS)'] = team_attribute.query('team_name == "Wisła Kraków (WIS)"').mean()
compare_attr = compare_attr[3:]
compare_attr

compare_attr.plot.barh(width=0.50, figsize=(12,8))
plt.title('comparing best two teams of Poland Ekstraklasa')
plt.ylabel('Attribute type')
plt.xlabel('Attribute value')
plt.legend(loc = 0);

# Portugal Liga ZON Sagres	SL Benfica (BEN)	FC Porto (POR)
# Portugal Liga ZON Sagres
compare_attr = pd.DataFrame(data=team_attribute.query('team_name == "SL Benfica (BEN)"').mean(),
                            columns=['SL Benfica (BEN)'])
compare_attr['FC Porto (POR)'] = team_attribute.query('team_name == "FC Porto (POR)"').mean()
compare_attr = compare_attr[3:]
compare_attr

compare_attr.plot.barh(width=0.50, figsize=(12,8))
plt.title('comparing best two teams of Portugal Liga ZON Sagres')
plt.ylabel('Attribute type')
plt.xlabel('Attribute value')
plt.legend(loc = 0);

# Scotland Premier League	Celtic (CEL)	Rangers (RAN)
# Scotland Premier League
compare_attr = pd.DataFrame(data=team_attribute.query('team_name == "Celtic (CEL)"').mean(),
                            columns=['Celtic (CEL)'])
compare_attr['Rangers (RAN)'] = team_attribute.query('team_name == "Rangers (RAN)"').mean()
compare_attr = compare_attr[3:]
compare_attr

compare_attr.plot.barh(width=0.50, figsize=(12,8))
plt.title('comparing best two teams of Scotland Premier League')
plt.ylabel('Attribute type')
plt.xlabel('Attribute value')
plt.legend(loc = 0);

# Spain LIGA BBVA	FC Barcelona (BAR)	Real Madrid CF (REA)
# Spain LIGA BBVA
compare_attr = pd.DataFrame(data=team_attribute.query('team_name == "FC Barcelona (BAR)"').mean(),
                            columns=['FC Barcelona (BAR)'])
compare_attr['Real Madrid CF (REA)'] = team_attribute.query('team_name == "Real Madrid CF (REA)"').mean()
compare_attr = compare_attr[3:]
compare_attr

compare_attr.plot.barh(width=0.50, figsize=(12,8))
plt.title('comparing best two teams of Spain LIGA BBVA')
plt.ylabel('Attribute type')
plt.xlabel('Attribute value')
plt.legend(loc = 0);

Interestingly in LIGA BBVA league, even though the second team(Real Madrid) was ahead 6 attributes in total 9, the first most winner was FC Barcelona.

# Switzerland Super League	FC Basel (BAS)	FC Zürich (ZUR)
# Switzerland Super League
compare_attr = pd.DataFrame(data=team_attribute.query('team_name == "FC Basel (BAS)"').mean(),
                            columns=['FC Basel (BAS)'])
compare_attr['FC Zürich (ZUR)'] = team_attribute.query('team_name == "FC Zürich (ZUR)"').mean()
compare_attr = compare_attr[3:]
compare_attr

compare_attr.plot.barh(width=0.50, figsize=(12,8))
plt.title('comparing best two teams of Switzerland Super League')
plt.ylabel('Attribute type')
plt.xlabel('Attribute value')
plt.legend(loc = 0);

4. Are teams scoring more goal when playing home or away? How about the number of wins when home or away?¶

For this quesetion, we wanted to anaylsis if the old superstition of the "home field advantage" was true by comparing the wins of teams as home teams vs their away wins.

team_goal_won_total['game_count_as_hometeam'] = 'NAN'
team_goal_won_total['goal_count_as_hometeam'] = 'NAN'
team_goal_won_total['win_count_as_hometeam'] = 'NAN'

team_goal_won_total['game_count_as_awayteam'] = 'NAN'
team_goal_won_total['goal_count_as_awayteam'] = 'NAN'
team_goal_won_total['win_count_as_awayteam'] = 'NAN'

team_goal_won_total['num_total_game'] = 'NAN'

team_match['season'] = team_match['season_start'] + " - " + team_match['season_end']

for index, row in team_goal_won_total.iterrows():
    season = team_goal_won_total.iloc[index, 1]
    team_name = team_goal_won_total.iloc[index, 2]
    df = team_match[(team_match['season'] == season) & (team_match['home_team'] == team_name)]
    team_goal_won_total.iloc[index, 6] = df['home_team_goal'].sum()
    team_goal_won_total.iloc[index, 5] = df.shape[0]
    new_df = df[df['winner'] == team_name]
    team_goal_won_total.iloc[index, 7] = new_df.shape[0]
    
    df = team_match[(team_match['season'] == season) & (team_match['away_team'] == team_name)]
    team_goal_won_total.iloc[index, 9] = df['away_team_goal'].sum()
    team_goal_won_total.iloc[index, 8] = df.shape[0]
    new_df = df[df['winner'] == team_name]
    team_goal_won_total.iloc[index, 10] = new_df.shape[0]
    team_goal_won_total.iloc[index, 11] = team_goal_won_total.iloc[index, 5] + team_goal_won_total.iloc[index, 8]

team_goal_won_total = team_goal_won_total[['leagueName', 'season', 'team_name', 'num_total_game', 'game_count_as_hometeam',
                                           'game_count_as_awayteam', 'count_wins', 'win_count_as_hometeam', 
                         'win_count_as_awayteam', 'total_num_goal', 'goal_count_as_hometeam', 'goal_count_as_awayteam']]
team_goal_won_total = team_goal_won_total.rename(columns={"num_total_game": "game_count_total", "count_wins": "win_count_total", "total_num_goal": "goal_count_total"})
team_goal_won_total.head()

With the table created, we can now visualize the results to make it easier to interpret.

new_df = team_goal_won_total[team_goal_won_total.columns[-2:]]
df1 = new_df.mean().to_frame()
df1.reset_index(inplace=True)  
df1 = df1.rename(columns={"index": "type", 0: "mean_goal_count"})

fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
x = df1.type
y = df1.mean_goal_count
ax.bar(x,y)
plt.show()

We come up with the mean value of total goals as a home team is higher than the mean value of total goals as an away team.

# win vs home or away
new_df = team_goal_won_total[team_goal_won_total.columns[7:9]]
df1 = new_df.mean().to_frame()
df1.reset_index(inplace=True)  
df1 = df1.rename(columns={"index": "type", 0: "mean_win_count"})

fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
x = df1.type
y = df1.mean_win_count
ax.bar(x,y)
plt.show()

Likewise, the mean value of home team victories is much higher than when away.

5. Is there a relationship between player attributes and their overall rating?¶

For this section, we wanted to start a surface level exploration into the relationship between the individual player's overall rating and their attributes. We accomplish this by creating a linear regression with a subset of the player's attributes and their associated rating to see if there were any trends shown by plotting the results.

attributes = player_Attributes[['free_kick_accuracy','agility','shot_power','stamina', 'short_passing', 'vision', 'ball_control', 'positioning', 'finishing']].fillna(0).astype('int32')

reg = LinearRegression().fit(attributes, player_Attributes['overall_rating'].fillna(0))

print(reg.coef_)

i = 0
for x in list(attributes):
    fig, ax = plt.subplots()
    ax.set_ylabel('overall_rating')
    ax.set_xlabel(x)
    ax.scatter(attributes[x],player_Attributes['overall_rating'].fillna(0))
    ax.plot(attributes[x], reg.coef_[i]*attributes[x] + reg.intercept_, 'k')
    i = i + 1
plt.show()

[-0.02712711  0.01414654  0.10463825  0.08053166  0.13160613  0.07051224
  0.00833057 -0.01379527 -0.01003446]

From the results, we can see that there is a relationship present and that could inducate that some attributes weight towards a higher rating more than others. We will continue this exploration when we get to building our models.

6. Which players have the highest overall rating per year?¶

Here we wanted to see who was at the top of the game for each year in the dataset.

grouped_player = player_Attributes.groupby([player_Attributes['date'].dt.year])
players_list = []
ratings = []
seasons = []
for name, group in grouped_player:
    df = group.sort_values('overall_rating', ascending=False)
    row = df.iloc[0]
    player = players.loc[players[players['player_fifa_api_id'] == row['player_fifa_api_id']].index, 'player_name']
    players_list.append(player.values[0])
    ratings.append(row['overall_rating'])
    seasons.append(name)

overall_rating = pd.DataFrame(data={"seasons" : seasons, 'player' : players_list, "rating" : ratings})
overall_rating

Correlations from these results:

Wayne Rooney and cristiano Ronaldo were the two highest rated players during in 2007/2008, both of which played on Manchester United. Manchester United was also the team with the most wins in the Premier League during those years. Similarly, Messi and Neymar have been apart of Barcelona the years they were the highest rated players and while concurrently there at Barcelona, the club had the most wins in the spanish league.

Incidentally, it seems like the highest rated player comes from the spanish league for 8 years in a row here.

7. What precent of player prefer their left foot? right foot? or both? Does this connect with their rating at all?¶

For this question, we were curious to see the distributions of the players preferred foot. As well, in some sports using a different hand or foot than the major sometimes results in small advantages (take left handers in tennis for example), does this apply to this sport as well?

p_A_1= player_Attributes[['player_api_id', 'preferred_foot', 'overall_rating', 'potential']]
p_A_1.head()

p_A_1= p_A_1.groupby(['player_api_id', 'preferred_foot']).mean()  # getting the average player rating
p_A_1.head()

# get the total number of players that are right, left or both and there rating
left_foot=0; left_rating=0.0
right_foot=0; right_rating =0.0
both=0;       both_rating = 0.0
checked=[]
for key, value in p_A_1.iterrows():
    if key[0] in checked:
        both +=1
        both_rating += value[0] 
    elif key[1] == 'right':
        right_foot +=1
        checked.append(key[0])
        right_rating += value[0]
    else:
        left_foot +=1
        checked.append(key[0])
        left_rating += value[0]

total_players= left_foot + right_foot + both
average_left_rating= left_rating/ left_foot
average_right_rating = right_rating/ right_foot
both_rating = both_rating/ both

# Data to plot
labels = 'left', 'right', 'both'
sizes = [left_foot, right_foot, both]
colors = ['yellowgreen', 'lightcoral', 'lightskyblue']
explode = (0.1, 0, 0)  # explode 1st slice

# Plot
plt.pie(sizes, explode=explode, labels=labels, colors=colors,
autopct='%1.1f%%', shadow=True, startangle=140)

plt.axis('equal')
plt.title("Overall players foot Preferance percentage ")
plt.show()

From this graph, we can see that the major of players in our dataset are prefer their right foot, while a smaller percentage prefer their left or both.

# Data to plot
labels = average_left_rating, average_right_rating, both_rating
sizes = [left_foot, right_foot, both]
colors = ['yellowgreen', 'lightcoral', 'lightskyblue']
explode = (0.1, 0, 0)  # explode 1st slice

# Plot
plt.pie(sizes, explode=explode, labels=labels, colors=colors,
autopct='%1.1f%%', shadow=True, startangle=140)

plt.axis('equal')
plt.title("Total foot preferance percentage with respect to average rating ")
plt.show()

And from this graph, we can tell that their does not seem to be a bias in the ratings based off the player's prefer foot, possibly correlating to their also being no advantage for using the other foot or both.

8. What is the relationship between number of goals scored and the win percentage of the teams?¶

For this question, we wanted to see if their is a correlation between a high number of goals scored and a high win percentage.

plt.scatter(team_goal_won_total['goal_count_total'], team_goal_won_total['win_count_total'])
plt.ylabel("Total number of wins")
plt.xlabel("Total nubmer of goals")
plt.title("Total number of goals vs wins from 2008- 2016 for each team")

x= team_goal_won_total['goal_count_total'].values.reshape(-1,1)
y= team_goal_won_total['win_count_total']
reg_1 = LinearRegression()
reg_1 = reg_1.fit(x, y)

intercept= reg_1.intercept_
slope= reg_1.coef_
plt.plot(x, intercept + slope * x, 'red', label='reg')
plt.show()

team_goal_won_total['winning_percentage']= (team_goal_won_total['win_count_total'] / team_goal_won_total['game_count_total'])*100
team_goal_won_total.head()

plt.scatter(team_goal_won_total['goal_count_total'], team_goal_won_total['winning_percentage']) 
plt.ylabel("wining percentage")
plt.xlabel("total_number of goals")
plt.title("Wining percentage of all Europian leagues teams Vs there respective goals per season")

x= team_goal_won_total['goal_count_total'].values.reshape(-1,1)
y=team_goal_won_total['winning_percentage']
reg_1 = LinearRegression()
reg_1 = reg_1.fit(x,y)

intercept= reg_1.intercept_
slope= reg_1.coef_
plt.plot(x, intercept + slope * x, 'red', label='reg')
plt.show()

a= team_goal_won_total.groupby(['leagueName'])

for index, value in a:
    
    x= value.goal_count_total
    y= value.winning_percentage
    
    plt.scatter(x, y) 
    plt.ylabel("wining percentage")
    plt.xlabel("total_number of goals")
    plt.title(index)
    
    reg_1 = LinearRegression()
    reg_1 = reg_1.fit(x.values.reshape(-1,1),y)

    intercept= reg_1.intercept_
    slope= reg_1.coef_
    plt.plot(x, intercept + slope * x, 'red', label='reg')
    plt.show()

From each of these visualizations, aside from a couple of outliers, their is a consistant correlation between a high amount fo goals scored and a high number of wins.

Analysis¶

With exploration of the data done, we move on to the penultimate step of the data lifecycle - the analysis. In this section we apply machine learning and statistics to better understand the trends and relationships present in the data. We will be using two machine learning models to accomplish this in this tutorial - Linear regression and logistical regression.

Determining if there is a linear relationship between a player's overall rating and the player's attributes documented in the dataset¶

The first analysis we wanted to demonstate is the continuation of the question posed in question 5 of our exploration.

We will start by defining our null and alternative hypothesis.

Null hypothesis 1: There is no relation between the overall rating of a player and the attributes of player.

Alternative Hypothesis 1: There is a relation between the overall rating of a player and the attributes of player.

We will either accept or reject the null hypothesis based on the statistical results of our linear regression.

df = player_Attributes

# Preparing data for regression
selected_attr = [
       'potential', 'crossing', 'finishing', 'heading_accuracy',
       'short_passing', 'volleys', 'dribbling', 'curve', 'free_kick_accuracy',
       'long_passing', 'ball_control', 'acceleration', 'sprint_speed',
       'agility', 'reactions', 'balance', 'shot_power', 'jumping', 'stamina',
       'strength', 'long_shots', 'aggression', 'interceptions', 'positioning',
       'vision', 'penalties', 'marking', 'standing_tackle', 'sliding_tackle',
       'gk_diving', 'gk_handling', 'gk_kicking', 'gk_positioning',
       'gk_reflexes']
target = ['overall_rating']
df = df.dropna(0)
X = df[selected_attr]
y = df[target]

X_train, X_test, y_train, y_test = train_test_split(X,y)
reg = LinearRegression()
reg.fit(X_train,y_train)
print("coef_: " , reg.coef_)
print("intercept_: " , reg.intercept_)

coef_:  [[ 0.37695799  0.02093622  0.01136773  0.07026387  0.04911308  0.00396994
  -0.01316442  0.01198513  0.01265543  0.00757413  0.13461736  0.00689957
   0.0094194  -0.00881508  0.21155022  0.00888179  0.01722714  0.01439787
  -0.00499012  0.06054882 -0.01402617  0.02098035  0.01258783 -0.01056229
  -0.00180798  0.01427677  0.0324026   0.00251907 -0.02827941  0.16193239
   0.03178607 -0.03368222  0.05750351  0.02506919]]
intercept_:  [-3.484707]

X=sm.add_constant(X_train) 
model= sm.OLS(y_train, X).fit()
model.summary()

y_prediction = reg.predict(X_test)
score = reg.score(X_test, y_test)
print("score: " , score)

score:  0.8428583351656973

By look at the p values of the different attributes we can confidently reject the null hypothesis as we can see that the p values are less than 0.05, meaning that there is a relationship between player rating and attribute scores. From the coefficients we can see though we can see that some attributes are more heavily related to the rating the player receives that others. Lastly we find that we are able to predict the player's rating correctly approximately 84% percent of the time based on the attributes passed into the regression.

Determining if there is a linear relationship between a team's winning percentage(team) and team's attributes documented in the dataset¶

Null hypothesis 2: There is no relation between the winning percentage and other features of team.

Alternative Hypothesis 2: There is a relation between the winning percentage and other features of team.

team_goal_won_total['season_start'] = team_goal_won_total['season'].apply(lambda x: x.split('-')[0])
team_goal_won_total['season_end'] = team_goal_won_total['season'].apply(lambda x: x.split('-')[1])
team_goal_won_total['season_start'] = team_goal_won_total['season_start'].astype(int)

selected_col = [
       'team_name', 'date', 'buildUpPlaySpeed', 'buildUpPlayDribbling',
       'buildUpPlayPassing', 'chanceCreationPassing', 'chanceCreationCrossing', 'chanceCreationShooting', 'defencePressure',
       'defenceAggression', 'defenceTeamWidth']
df = team_attribute[selected_col]
df = df.dropna()
df['year'] = pd.DatetimeIndex(df['date']).year
df = df.drop(['date'], axis=1)

#match season start and team name
all_team_info = pd.merge(df, team_goal_won_total,  how='inner', 
                         left_on=['team_name','year'], right_on = ['team_name','season_start'])
all_team_info = all_team_info.drop(['season_start', 'season_end', 'team_name', 'year', 'leagueName', 'season', 'game_count_total',
       'game_count_as_hometeam', 'game_count_as_awayteam', 'win_count_total',
       'win_count_as_hometeam', 'win_count_as_awayteam', 'goal_count_total',
       'goal_count_as_hometeam', 'goal_count_as_awayteam'], axis=1)
all_team_info['winning_percentage'] = all_team_info['winning_percentage'].astype(float)

all_team_info.head()
# This table will have the winnign_percentage and team attributes.

y_target = all_team_info['winning_percentage']
X = all_team_info.iloc[:,0:-1]

X_train, X_test, y_train, y_test = train_test_split(X,y_target)
reg = LinearRegression()
reg.fit(X_train,y_train)
print("coef_: " , reg.coef_)
print("intercept_: " , reg.intercept_)

coef_:  [-0.00062286 -0.00068492 -0.30921599  0.08060845  0.27237476  0.01387741
  0.30282332  0.28520692  0.02394299]
intercept_:  4.200224758849281

y_prediction = reg.predict(X_test)
score = reg.score(X_test, y_test)
print("score: " , score)

score:  0.09874665373937364

X1=sm.add_constant(X_train) 
model= sm.OLS(y_train, X1.astype(float)).fit()
model.summary()

From the results of the linear regression, it seems as though most of the teams attributes could not be found to have a strong relationship with that team's winning percentage. As we can see all but the p values of chanceCreationCrossing, defencePressure, buildUpPlayPassing, and defenceAggression were less than 0.05. As well, the model scores very low in it's prediction of the win percentage when giving the team's attributes.

When attempting to predict the winning percentage with only the attributes which has p value less than 0.05

X_new = X[['chanceCreationCrossing', 'defencePressure', 'buildUpPlayPassing', 'defenceAggression']]
X_train, X_test, y_train, y_test = train_test_split(X_new, y_target)
reg = LinearRegression()
reg.fit(X_train,y_train)
print("coef_: " , reg.coef_)
print("intercept_: " , reg.intercept_)

coef_:  [ 0.32497495  0.32966984 -0.21479505  0.22177895]
intercept_:  4.502832143242543

y_prediction = reg.predict(X_test)
score = reg.score(X_test, y_test)
print("score: " , score)

score:  0.13486437688212916

X2=sm.add_constant(X_train) 
model= sm.OLS(y_train, X2.astype(float)).fit()
model.summary()

The score is still very low, in other word this model describe only about 13% of the data. But the coef of those attributes are not 0 because the p values are less than 0.05.

===========================================================================================================================

Logistic regression prediction on a given match for teams home play win based on the overall rating of starting players for the home team.¶

creating a logistic regression on matchs that teams either win, loss or draw based on the starting players overall rating for the year.

Null hypothesis 1: There is no strong relation between the players rating of the team and the teams win chance for a given match

Alternative Hypothesis 1: There is a strong relation between the overall player rating of a team for a given match and chance of winning

# Getting the columns needed/ Data processing

matches_and_players= matches_copy[['league_id', 'season', 'date', 'match_api_id', 'home_team_api_id', 'away_team_api_id', 
                            'home_player_1','home_player_2', 'home_player_3','home_player_4','home_player_5','home_player_6',
                            'home_player_7','home_player_8','home_player_9','home_player_10','home_player_11',
                            'away_player_1', 'away_player_2', 'away_player_3', 'away_player_4', 'away_player_5', 'away_player_6',
                            'away_player_7', 'away_player_8', 'away_player_9', 'away_player_10' , 'away_player_11',
                            'home_team_goal', 'away_team_goal']]
# clean up the data and 
# create label to who won the game. 1 home wins 0 drow and -1 away wins
matches_and_players.dropna(inplace=True)
matches_and_players['win_loss_drow']= np.where( matches_and_players['home_team_goal']==matches_and_players['away_team_goal'], 
                                                0, np.where(matches_and_players['home_team_goal']>matches_and_players['away_team_goal'],
                                                1, -1))

# melting the data sho each player is associated with team 
home_team_players= pd.melt(matches_and_players, id_vars=['league_id','season', 'date', 'match_api_id', 'home_team_api_id', 'win_loss_drow'], value_vars=['home_player_1','home_player_2', 'home_player_3','home_player_4',
                                                                     'home_player_5','home_player_6','home_player_7','home_player_8',
                                                                     'home_player_9','home_player_10','home_player_11'], 
                      ignore_index=False, var_name='home_players', value_name='players_id')
# convert the date col to the python datetime 
home_team_players['date'] = pd.to_datetime(home_team_players['date'])
home_team_players.head()

# create a dataframe for players and there overall rating with respect to time.
# extract the year from the given date
player_rating= player_Attributes[['player_api_id', 'date', 'overall_rating']]
player_rating['year'] = pd.DatetimeIndex(player_rating['date']).year
player_rating.head()

# group them by the year and player_id
player_rating= player_rating.groupby(['player_api_id', 'year']).mean()
player_rating.head()

# convert player id from float to int. so that it is easy for comparison 
home_team_players['players_id']= home_team_players['players_id'].astype(np.int64)
# extract year from date 
home_team_players['year'] = pd.DatetimeIndex(home_team_players['date']).year

# Connecting matches teams and players. 
matches_teams_players = pd.merge(home_team_players, player_rating,  how='left', 
                       left_on=['players_id', 'year'], 
                       right_on = ['player_api_id', 'year'])
print(matches_teams_players.shape)
# for the nan values for players ratting replace them with average value of the team for the given match
matches_teams_players['overall_rating']= matches_teams_players.groupby('match_api_id', 
                               )['overall_rating'].apply(lambda x: x.fillna(x.mean()))
print(matches_teams_players.shape)
# group the home players by match id 
grouped_matches_teams_players= matches_teams_players.groupby('match_api_id')

(235114, 10)
(235114, 10)

matches_teams_players.head()

from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import MinMaxScaler

def process_logistic_reg(grouped_data_frame):
    
    home_team_X_values= list()
    home_team_Y_values= list()
    for index, value in grouped_data_frame:
    #     home_team_X_values.append( [round(num,3)  for num in value.overall_rating.tolist()])
        home_team_X_values.append(value.overall_rating.tolist())
        home_team_Y_values.append(value.win_loss_drow.tolist()[0])
        
# making it binary classification    win=1 draw/loss= 0
    home_team_y_binary= [0 if x==-1 else x for x in home_team_Y_values]
        
    X_train,X_test,y_train,y_test = train_test_split(home_team_X_values, home_team_y_binary, test_size=0.25, random_state=0)

    scaler = MinMaxScaler(feature_range=(0, 1.5))
    scaler.fit(X_train)
    scaler.fit(X_test)
    normalized_train = scaler.transform(X_train)
    normalized_test = scaler.transform(X_test)

    kf = KFold(n_splits = 10)

    logistic_regression= LogisticRegression(multi_class='ovr', solver='liblinear') 
    scores = cross_val_score(logistic_regression, normalized_train, y_train, cv = kf)
    print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

    # Do logistic regresssion on the train data
    logistic_regression.fit(normalized_train, y_train)
    
    y_pred=logistic_regression.predict(normalized_test)      # prediction using the X test data 
    
    same_values= [i for i, j in zip(y_test, y_pred) if i == j]  # getting values that are the same prediction == actual

    accuracy = len(same_values)/ len(y_pred) *100     # this is the percetage that the prediction is correct 
    print("Accuracy of prediction: " + str(accuracy))
    
    #log_reg = sm.MNLogit(y_train, X_train).fit() 
    log_reg= sm.Logit(y_train, X_train).fit()
    
    print(log_reg.summary())

# logistic regression on overall Europe team's home match win vs the starting players ratting. If the p value of this
# outcome is low then this can be a good classification prediction for wchich one team can win a match based on the
# starting players ratting. If the p value is high then we can acept the null hypothisis of being there is less relation
# between the win and starting players rating.

process_logistic_reg(grouped_matches_teams_players)

Accuracy: 0.59 (+/- 0.02)
Accuracy of prediction: 58.75748502994011
Optimization terminated successfully.
         Current function value: 0.688455
         Iterations 4
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                      y   No. Observations:                16030
Model:                          Logit   Df Residuals:                    16019
Method:                           MLE   Df Model:                           10
Date:                Sat, 19 Dec 2020   Pseudo R-squ.:               0.0005141
Time:                        04:14:41   Log-Likelihood:                -11036.
converged:                       True   LL-Null:                       -11042.
Covariance Type:            nonrobust   LLR p-value:                    0.3307
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
x1            -0.0166      0.004     -4.128      0.000      -0.025      -0.009
x2            -0.0109      0.005     -2.406      0.016      -0.020      -0.002
x3            -0.0108      0.005     -2.308      0.021      -0.020      -0.002
x4             0.0056      0.004      1.238      0.216      -0.003       0.014
x5            -0.0104      0.004     -2.382      0.017      -0.019      -0.002
x6             0.0070      0.004      1.570      0.116      -0.002       0.016
x7            -0.0003      0.004     -0.075      0.940      -0.009       0.008
x8             0.0061      0.004      1.398      0.162      -0.002       0.015
x9             0.0057      0.004      1.377      0.168      -0.002       0.014
x10            0.0115      0.004      2.745      0.006       0.003       0.020
x11            0.0106      0.004      2.497      0.013       0.002       0.019
==============================================================================

As shown above the logistic regression model to predict the winning chance of home team is about 59% accurate. This shows that it would be really hard to determine the outcome the game with just the overall players rating of the team. Although some of the p values for few players are in good condition. They are less than 0.05. It shows that players x1,x2,x3,x10,x11 have a significant values in this model. This makes sense as these players are goalkeeper, defenders and attackers.

# do logistic regression on some of the well known leagues such as England and Spanish leagues and see the outcome
league_match_grouped= matches_teams_players.groupby(['league_id'])
# Spanish league id= 21518
England_matches= league_match_grouped.get_group(21518).groupby('match_api_id')
process_logistic_reg(England_matches)

Accuracy: 0.63 (+/- 0.05)
Accuracy of prediction: 59.97045790251108
Optimization terminated successfully.
         Current function value: 0.680998
         Iterations 4
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                      y   No. Observations:                 2030
Model:                          Logit   Df Residuals:                     2019
Method:                           MLE   Df Model:                           10
Date:                Sat, 19 Dec 2020   Pseudo R-squ.:                 0.01613
Time:                        04:14:43   Log-Likelihood:                -1382.4
converged:                       True   LL-Null:                       -1405.1
Covariance Type:            nonrobust   LLR p-value:                 1.893e-06
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
x1            -0.0230      0.012     -1.987      0.047      -0.046      -0.000
x2            -0.0103      0.014     -0.749      0.454      -0.037       0.017
x3            -0.0222      0.015     -1.516      0.130      -0.051       0.006
x4            -0.0294      0.014     -2.060      0.039      -0.057      -0.001
x5            -0.0176      0.013     -1.327      0.184      -0.044       0.008
x6            -0.0116      0.014     -0.815      0.415      -0.039       0.016
x7            -0.0037      0.015     -0.254      0.799      -0.032       0.025
x8             0.0174      0.013      1.350      0.177      -0.008       0.043
x9            -0.0019      0.012     -0.157      0.875      -0.026       0.022
x10            0.0584      0.012      4.853      0.000       0.035       0.082
x11            0.0411      0.012      3.307      0.001       0.017       0.065
==============================================================================

In the above logistic regression, we have taken only the Spanish La Liga to see if there are some changes for the overall logistic regression model. Which has a little higher accuracy with 63% accuracy. As the same as the overall European league logistic regression model this one also show the significant of the keeper and attackers with defenders having less accurate p value.

Away team logistic regresion¶

# melting the data sho each player is associated with team 
away_team_players= pd.melt(matches_and_players, id_vars=['league_id','season', 'date', 'match_api_id', 'away_team_api_id', 'win_loss_drow'], 
                           value_vars=['away_player_1','away_player_2', 'away_player_3','away_player_4',
                                                                     'away_player_5','away_player_6','away_player_7','away_player_8',
                                                                     'away_player_9','away_player_10','away_player_11'], 
                      ignore_index=False, var_name='away_players', value_name='players_id')
# convert the date col to the python datetime 
away_team_players['date'] = pd.to_datetime(away_team_players['date'])
away_team_players['year'] = pd.DatetimeIndex(away_team_players['date']).year

# Connecting matches teams and players. 
matches_teams_players_away = pd.merge(away_team_players, player_rating,  how='left', 
                       left_on=['players_id', 'year'], 
                       right_on = ['player_api_id', 'year'])

# for the nan values for players ratting replace them with average value of the team for the given match
matches_teams_players_away['overall_rating']= matches_teams_players_away.groupby('match_api_id', 
                               )['overall_rating'].apply(lambda x: x.fillna(x.mean()))
# group the home players by match id 
grouped_matches_teams_players_away= matches_teams_players_away.groupby('match_api_id')

process_logistic_reg(grouped_matches_teams_players_away)

Accuracy: 0.55 (+/- 0.02)
Accuracy of prediction: 54.64071856287425
Optimization terminated successfully.
         Current function value: 0.687124
         Iterations 4
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                      y   No. Observations:                16030
Model:                          Logit   Df Residuals:                    16019
Method:                           MLE   Df Model:                           10
Date:                Sat, 19 Dec 2020   Pseudo R-squ.:                0.002447
Time:                        04:15:19   Log-Likelihood:                -11015.
converged:                       True   LL-Null:                       -11042.
Covariance Type:            nonrobust   LLR p-value:                 4.777e-08
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
x1             0.0018      0.004      0.451      0.652      -0.006       0.010
x2             0.0073      0.004      1.646      0.100      -0.001       0.016
x3            -0.0045      0.005     -0.965      0.335      -0.014       0.005
x4            -0.0018      0.004     -0.396      0.692      -0.010       0.007
x5             0.0056      0.004      1.292      0.196      -0.003       0.014
x6             0.0042      0.004      0.955      0.340      -0.004       0.013
x7            -0.0023      0.004     -0.544      0.587      -0.011       0.006
x8            -0.0075      0.004     -1.732      0.083      -0.016       0.001
x9             0.0021      0.004      0.501      0.616      -0.006       0.010
x10           -0.0144      0.004     -3.493      0.000      -0.022      -0.006
x11            0.0069      0.004      1.623      0.105      -0.001       0.015
==============================================================================

Above, we have also did logistic regression model to see away's team winning chance prediction. This model shows that an even lesses accuracy. In the away team model there is only one significat player in the team that has less than 0.05 p value, which is attacker.

Insight¶

At last, we come to the last step of the data lifecycle, for this step we draw our conclusions based off the insight we have gained from our analysis of the data.

From the exploratory analysis, we guessed there is linear relationship between overall player ratting and each players attributes. This model can describe 85% of data according to the linear regression score. From the statistic summery we got valid coefficients for each attribute with p value less than 0.05. As a result, we can reject the null hypothesis of there is no relationship between overall rating of players and player attributes.

However, we cannot see direct relationship in team attributes from the graph of the comparison of best two teams of each league, so we did linear regression model. But the model's score is too low to predict winning percentage from team attributes with the given dataset. There are some attributes that have low p value that are significant to this model. we cannot reject the null hypothesis

Based on the logistic regression model we have done, we can accept the null hypothesis of no strong relation between team players overall rating and winning chance of given match. This does not mean there is no relation! It is just we cannot reject the null hypostasis as the p values of the prediction model are higher, which makes it hard to predict the outcome of the match.

	id	country_id	name
0	1	1	Belgium Jupiler League
1	1729	1729	England Premier League
2	4769	4769	France Ligue 1
3	7809	7809	Germany 1. Bundesliga
4	10257	10257	Italy Serie A

	id	country_id	league_id	season	stage	date	match_api_id	home_team_api_id	away_team_api_id	home_team_goal	...	SJA	VCH	VCD	VCA	GBH	GBD	GBA	BSH	BSD	BSA
0	1	1	1	2008/2009	1	2008-08-17 00:00:00	492473	9987	9993	1	...	4.00	1.65	3.40	4.50	1.78	3.25	4.00	1.73	3.40	4.20
1	2	1	1	2008/2009	1	2008-08-16 00:00:00	492474	10000	9994	0	...	3.80	2.00	3.25	3.25	1.85	3.25	3.75	1.91	3.25	3.60
2	3	1	1	2008/2009	1	2008-08-16 00:00:00	492475	9984	8635	0	...	2.50	2.35	3.25	2.65	2.50	3.20	2.50	2.30	3.20	2.75
3	4	1	1	2008/2009	1	2008-08-17 00:00:00	492476	9991	9998	5	...	7.50	1.45	3.75	6.50	1.50	3.75	5.50	1.44	3.75	6.50
4	5	1	1	2008/2009	1	2008-08-16 00:00:00	492477	7947	9985	1	...	1.73	4.50	3.40	1.65	4.50	3.50	1.65	4.75	3.30	1.67

	id	player_api_id	player_name	player_fifa_api_id	birthday	height	weight
0	1	505942	Aaron Appindangoye	218353	1992-02-29 00:00:00	182.88	187
1	2	155782	Aaron Cresswell	189615	1989-12-15 00:00:00	170.18	146
2	3	162549	Aaron Doran	186170	1991-05-13 00:00:00	170.18	163
3	4	30572	Aaron Galindo	140161	1982-05-08 00:00:00	182.88	198
4	5	23780	Aaron Hughes	17725	1979-11-08 00:00:00	182.88	154

	id	player_fifa_api_id	player_api_id	date	overall_rating	potential	preferred_foot	attacking_work_rate	defensive_work_rate	crossing	...	vision	penalties	marking	standing_tackle	sliding_tackle	gk_diving	gk_handling	gk_kicking	gk_positioning	gk_reflexes
0	1	218353	505942	2016-02-18 00:00:00	67.0	71.0	right	medium	medium	49.0	...	54.0	48.0	65.0	69.0	69.0	6.0	11.0	10.0	8.0	8.0
1	2	218353	505942	2015-11-19 00:00:00	67.0	71.0	right	medium	medium	49.0	...	54.0	48.0	65.0	69.0	69.0	6.0	11.0	10.0	8.0	8.0
2	3	218353	505942	2015-09-21 00:00:00	62.0	66.0	right	medium	medium	49.0	...	54.0	48.0	65.0	66.0	69.0	6.0	11.0	10.0	8.0	8.0
3	4	218353	505942	2015-03-20 00:00:00	61.0	65.0	right	medium	medium	48.0	...	53.0	47.0	62.0	63.0	66.0	5.0	10.0	9.0	7.0	7.0
4	5	218353	505942	2007-02-22 00:00:00	61.0	65.0	right	medium	medium	48.0	...	53.0	47.0	62.0	63.0	66.0	5.0	10.0	9.0	7.0	7.0

	id	team_api_id	team_fifa_api_id	team_long_name	team_short_name
0	1	9987	673.0	KRC Genk	GEN
1	2	9993	675.0	Beerschot AC	BAC
2	3	10000	15005.0	SV Zulte-Waregem	ZUL
3	4	9994	2007.0	Sporting Lokeren	LOK
4	5	9984	1750.0	KSV Cercle Brugge	CEB

	id	team_fifa_api_id	team_api_id	date	buildUpPlaySpeed	buildUpPlaySpeedClass	buildUpPlayDribbling	buildUpPlayDribblingClass	buildUpPlayPassing	buildUpPlayPassingClass	...	chanceCreationShooting	chanceCreationShootingClass	chanceCreationPositioningClass	defencePressure	defencePressureClass	defenceAggression	defenceAggressionClass	defenceTeamWidth	defenceTeamWidthClass	defenceDefenderLineClass
0	1	434	9930	2010-02-22 00:00:00	60	Balanced	NaN	Little	50	Mixed	...	55	Normal	Organised	50	Medium	55	Press	45	Normal	Cover
1	2	434	9930	2014-09-19 00:00:00	52	Balanced	48.0	Normal	56	Mixed	...	64	Normal	Organised	47	Medium	44	Press	54	Normal	Cover
2	3	434	9930	2015-09-10 00:00:00	47	Balanced	41.0	Normal	54	Mixed	...	64	Normal	Organised	47	Medium	44	Press	54	Normal	Cover
3	4	77	8485	2010-02-22 00:00:00	70	Fast	NaN	Little	70	Long	...	70	Lots	Organised	60	Medium	70	Double	70	Wide	Cover
4	5	77	8485	2011-02-22 00:00:00	47	Balanced	NaN	Little	52	Mixed	...	52	Normal	Organised	47	Medium	47	Press	52	Normal	Cover

	id	countryName	leagueName
0	1	Belgium	Belgium Jupiler League
1	1729	England	England Premier League
2	4769	France	France Ligue 1
3	7809	Germany	Germany 1. Bundesliga
4	10257	Italy	Italy Serie A

	id	country_league_id	stage	date	match_api_id	home_team_api_id	away_team_api_id	home_team_goal	away_team_goal	season_start	season_end	countryName	leagueName
0	1	1	1	2008-08-17	492473	9987	9993	1	1	2008	2009	Belgium	Belgium Jupiler League
1	2	1	1	2008-08-16	492474	10000	9994	0	0	2008	2009	Belgium	Belgium Jupiler League
2	3	1	1	2008-08-16	492475	9984	8635	0	3	2008	2009	Belgium	Belgium Jupiler League
3	4	1	1	2008-08-17	492476	9991	9998	5	0	2008	2009	Belgium	Belgium Jupiler League
4	5	1	1	2008-08-16	492477	7947	9985	1	3	2008	2009	Belgium	Belgium Jupiler League

	id	match_api_id	date	home_team	home_team_api_id	home_team_goal	away_team	away_team_api_id	away_team_goal	country_league_id	stage	season_start	season_end	countryName	leagueName	winner
0	1	492473	2008-08-17	KRC Genk (GEN)	9987	1	Beerschot AC (BAC)	9993	1	1	1	2008	2009	Belgium	Belgium Jupiler League	Tie
1	483	665353	2009-08-30	KRC Genk (GEN)	9987	1	Beerschot AC (BAC)	9993	1	1	5	2009	2010	Belgium	Belgium Jupiler League	Tie
2	524	838476	2010-08-01	KRC Genk (GEN)	9987	2	Beerschot AC (BAC)	9993	1	1	1	2010	2011	Belgium	Belgium Jupiler League	KRC Genk (GEN)
3	759	1032694	2011-07-30	KRC Genk (GEN)	9987	3	Beerschot AC (BAC)	9993	1	1	1	2011	2012	Belgium	Belgium Jupiler League	KRC Genk (GEN)
4	1014	1224064	2012-10-20	KRC Genk (GEN)	9987	3	Beerschot AC (BAC)	9993	0	1	11	2012	2013	Belgium	Belgium Jupiler League	KRC Genk (GEN)

	id	team_fifa_api_id	team_api_id	date	buildUpPlaySpeed	buildUpPlaySpeedClass	buildUpPlayDribbling	buildUpPlayDribblingClass	buildUpPlayPassing	buildUpPlayPassingClass	...	chanceCreationShootingClass	chanceCreationPositioningClass	defencePressure	defencePressureClass	defenceAggression	defenceAggressionClass	defenceTeamWidth	defenceTeamWidthClass	defenceDefenderLineClass	team_name
0	1	434	9930	2010-02-22	60	Balanced	NaN	Little	50	Mixed	...	Normal	Organised	50	Medium	55	Press	45	Normal	Cover	FC Aarau (AAR)
1	2	434	9930	2014-09-19	52	Balanced	48.0	Normal	56	Mixed	...	Normal	Organised	47	Medium	44	Press	54	Normal	Cover	FC Aarau (AAR)
2	3	434	9930	2015-09-10	47	Balanced	41.0	Normal	54	Mixed	...	Normal	Organised	47	Medium	44	Press	54	Normal	Cover	FC Aarau (AAR)
3	4	77	8485	2010-02-22	70	Fast	NaN	Little	70	Long	...	Lots	Organised	60	Medium	70	Double	70	Wide	Cover	Aberdeen (ABE)
4	5	77	8485	2011-02-22	47	Balanced	NaN	Little	52	Mixed	...	Normal	Organised	47	Medium	47	Press	52	Normal	Cover	Aberdeen (ABE)

			home_team_goal
leagueName	season_start	home_team
Belgium Jupiler League	2008	Beerschot AC (BAC)	32
		Club Brugge KV (CLB)	37
		FCV Dender EH (DEN)	21
		KAA Gent (GEN)	31
		KRC Genk (GEN)	23

			away_team_goal
leagueName	season_start	away_team
Belgium Jupiler League	2008	Beerschot AC (BAC)	12
		Club Brugge KV (CLB)	22
		FCV Dender EH (DEN)	23
		KAA Gent (GEN)	36
		KRC Genk (GEN)	25

	index	leagueName	season_start	team_name	total_num_goal
0	11	Belgium Jupiler League	2008	RSC Anderlecht (AND)	75
1	27	Belgium Jupiler League	2009	RSC Anderlecht (AND)	62
2	37	Belgium Jupiler League	2010	KRC Genk (GEN)	64
3	51	Belgium Jupiler League	2011	KAA Gent (GEN)	63
4	75	Belgium Jupiler League	2012	RSC Anderlecht (AND)	69
...	...	...	...	...	...
83	1432	Switzerland Super League	2011	FC Basel (BAS)	78
84	1448	Switzerland Super League	2012	FC Zürich (ZUR)	62
85	1453	Switzerland Super League	2013	FC Basel (BAS)	70
86	1463	Switzerland Super League	2014	FC Basel (BAS)	84
87	1472	Switzerland Super League	2015	FC Basel (BAS)	88

	index	league	season	team_name	count_wins
0	75	Belgium Jupiler League	2008 - 2009	RSC Anderlecht (AND)	24
1	117	Belgium Jupiler League	2009 - 2010	RSC Anderlecht (AND)	22
2	226	Belgium Jupiler League	2010 - 2011	KRC Genk (GEN)	19
3	177	Belgium Jupiler League	2011 - 2012	RSC Anderlecht (AND)	20
4	201	Belgium Jupiler League	2012 - 2013	RSC Anderlecht (AND)	20
5	1459	Belgium Jupiler League	2013 - 2014	KV Oostende (OOS)	4
6	309	Belgium Jupiler League	2014 - 2015	Club Brugge KV (CLB)	17
7	150	Belgium Jupiler League	2015 - 2016	Club Brugge KV (CLB)	21
8	28	England Premier League	2008 - 2009	Manchester United (MUN)	28
9	31	England Premier League	2009 - 2010	Manchester United (MUN)	27

	id	match_api_id	date	home_team	home_team_api_id	home_team_goal	away_team	away_team_api_id	away_team_goal	country_league_id	stage	season_start	season_end	countryName	leagueName	winner
784	1239	1677181	2014-04-05	Lierse SK (LIE)	9989	1	KAA Gent (GEN)	9991	0	1	2	2013	2014	Belgium	Belgium Jupiler League	Lierse SK (LIE)
791	1243	1677185	2014-04-19	Waasland-Beveren (WAA)	8475	2	KAA Gent (GEN)	9991	4	1	4	2013	2014	Belgium	Belgium Jupiler League	KAA Gent (GEN)
794	1246	1677188	2014-04-26	KV Oostende (OOS)	8573	1	KAA Gent (GEN)	9991	1	1	5	2013	2014	Belgium	Belgium Jupiler League	Tie
1418	1248	1677190	2014-05-03	KAA Gent (GEN)	9991	4	Lierse SK (LIE)	9989	0	1	6	2013	2014	Belgium	Belgium Jupiler League	KAA Gent (GEN)
1461	1237	1677179	2014-03-29	Waasland-Beveren (WAA)	8475	2	Lierse SK (LIE)	9989	0	1	1	2013	2014	Belgium	Belgium Jupiler League	Waasland-Beveren (WAA)

		overall_rating	potential
player_api_id	preferred_foot
2625	right	60.142857	61.142857
2752	right	69.380952	70.380952
2768	left	69.285714	70.571429
2770	left	71.571429	75.285714
2770	right	70.750000	72.000000

OLS Regression Results
Dep. Variable:	overall_rating	R-squared:	0.841
Model:	OLS	Adj. R-squared:	0.841
Method:	Least Squares	F-statistic:	2.108e+04
Date:	Sat, 19 Dec 2020	Prob (F-statistic):	0.00
Time:	04:13:49	Log-Likelihood:	-3.3093e+05
No. Observations:	135265	AIC:	6.619e+05
Df Residuals:	135230	BIC:	6.623e+05
Df Model:	34
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[0.025	0.975]
const	-3.4847	0.109	-32.028	0.000	-3.698	-3.271
potential	0.3770	0.002	222.886	0.000	0.374	0.380
crossing	0.0209	0.001	21.599	0.000	0.019	0.023
finishing	0.0114	0.001	10.683	0.000	0.009	0.013
heading_accuracy	0.0703	0.001	74.060	0.000	0.068	0.072
short_passing	0.0491	0.002	30.713	0.000	0.046	0.052
volleys	0.0040	0.001	4.158	0.000	0.002	0.006
dribbling	-0.0132	0.001	-9.740	0.000	-0.016	-0.011
curve	0.0120	0.001	12.919	0.000	0.010	0.014
free_kick_accuracy	0.0127	0.001	15.368	0.000	0.011	0.014
long_passing	0.0076	0.001	6.926	0.000	0.005	0.010
ball_control	0.1346	0.002	73.138	0.000	0.131	0.138
acceleration	0.0069	0.002	4.313	0.000	0.004	0.010
sprint_speed	0.0094	0.002	6.209	0.000	0.006	0.012
agility	-0.0088	0.001	-7.501	0.000	-0.011	-0.007
reactions	0.2116	0.001	168.101	0.000	0.209	0.214
balance	0.0089	0.001	9.804	0.000	0.007	0.011
shot_power	0.0172	0.001	16.938	0.000	0.015	0.019
jumping	0.0144	0.001	17.051	0.000	0.013	0.016
stamina	-0.0050	0.001	-5.305	0.000	-0.007	-0.003
strength	0.0605	0.001	62.246	0.000	0.059	0.062
long_shots	-0.0140	0.001	-13.220	0.000	-0.016	-0.012
aggression	0.0210	0.001	26.371	0.000	0.019	0.023
interceptions	0.0126	0.001	14.737	0.000	0.011	0.014
positioning	-0.0106	0.001	-11.539	0.000	-0.012	-0.009
vision	-0.0018	0.001	-1.782	0.075	-0.004	0.000
penalties	0.0143	0.001	16.457	0.000	0.013	0.016
marking	0.0324	0.001	24.892	0.000	0.030	0.035
standing_tackle	0.0025	0.002	1.669	0.095	-0.000	0.005
sliding_tackle	-0.0283	0.001	-22.079	0.000	-0.031	-0.026
gk_diving	0.1619	0.002	95.462	0.000	0.159	0.165
gk_handling	0.0318	0.002	14.292	0.000	0.027	0.036
gk_kicking	-0.0337	0.001	-46.584	0.000	-0.035	-0.032
gk_positioning	0.0575	0.002	25.992	0.000	0.053	0.062
gk_reflexes	0.0251	0.002	11.563	0.000	0.021	0.029

Omnibus:	5554.938	Durbin-Watson:	1.999
Prob(Omnibus):	0.000	Jarque-Bera (JB):	12606.865
Skew:	-0.258	Prob(JB):	0.00
Kurtosis:	4.404	Cond. No.	4.66e+03

	league	season	team_name	count_wins
0	Italy Serie A	2013 - 2014	Juventus (JUV)	33
1	Spain LIGA BBVA	2011 - 2012	Real Madrid CF (REA)	32
2	Spain LIGA BBVA	2012 - 2013	FC Barcelona (BAR)	32
3	Scotland Premier League	2013 - 2014	Celtic (CEL)	31
4	Spain LIGA BBVA	2009 - 2010	FC Barcelona (BAR)	31

	seasons	player	rating
0	2007	Wayne Rooney	93.0
1	2008	Cristiano Ronaldo	91.0
2	2009	Iker Casillas	91.0
3	2010	Lionel Messi	90.0
4	2011	Lionel Messi	94.0
5	2012	Lionel Messi	94.0
6	2013	Lionel Messi	94.0
7	2014	Lionel Messi	94.0
8	2015	Lionel Messi	94.0
9	2016	Neymar	90.0

	buildUpPlaySpeed	buildUpPlayDribbling	buildUpPlayPassing	chanceCreationPassing	chanceCreationCrossing	chanceCreationShooting	defencePressure	defenceAggression	defenceTeamWidth	winning_percentage
0	52	48.0	56	54	63	64	47	44	54	16.666667
1	58	64.0	62	56	68	57	41	42	60	60.526316
2	59	64.0	53	51	72	63	49	45	63	57.894737
3	48	70.0	52	66	36	75	58	57	49	34.210526
4	48	70.0	52	66	36	69	58	57	49	39.473684

Dep. Variable:	winning_percentage	R-squared:	0.162
Model:	OLS	Adj. R-squared:	0.132
Method:	Least Squares	F-statistic:	5.405
Date:	Sat, 19 Dec 2020	Prob (F-statistic):	8.89e-07
Time:	04:13:55	Log-Likelihood:	-1066.8
No. Observations:	262	AIC:	2154.
Df Residuals:	252	BIC:	2189.
Df Model:	9
Covariance Type:	nonrobust

Omnibus:	14.028	Durbin-Watson:	2.171
Prob(Omnibus):	0.001	Jarque-Bera (JB):	14.884
Skew:	0.577	Prob(JB):	0.000586
Kurtosis:	3.176	Cond. No.	2.04e+03

Omnibus:	18.912	Durbin-Watson:	1.960
Prob(Omnibus):	0.000	Jarque-Bera (JB):	20.841
Skew:	0.665	Prob(JB):	2.98e-05
Kurtosis:	3.372	Cond. No.	943.

	league_id	season	date	match_api_id	home_team_api_id	win_loss_drow	home_players	players_id
145	1	2008/2009	2009-02-27	493017	8203	1	home_player_1	38327.0
153	1	2008/2009	2009-03-08	493025	9984	-1	home_player_1	36835.0
155	1	2008/2009	2009-03-07	493027	8635	1	home_player_1	34480.0
162	1	2008/2009	2009-03-13	493034	8203	1	home_player_1	38327.0
168	1	2008/2009	2009-03-14	493040	10000	0	home_player_1	37900.0

	player_api_id	date	overall_rating	year
0	505942	2016-02-18	67.0	2016
1	505942	2015-11-19	67.0	2015
2	505942	2015-09-21	62.0	2015
3	505942	2015-03-20	61.0	2015
4	505942	2007-02-22	61.0	2007

		overall_rating
player_api_id	year
2625	2007	63.0
	2008	60.0
	2010	60.0
	2011	58.5
	2012	58.0