Analysis of European Soccer Database

Erick Herrera, Hyejeong Kim, Kibreab Gebrehiwot

Introduction

As the popularity of machine learning has drastically increased over the last decade, so has the demand for data literacy as a skill in the job market. This tutorial aims to give a general introduction to data science by walking the data science lifecycle (Collecting data, Cleaning data, Exploring data, Building models, and Interpretation) as we try to analyze the data within our chosen database.

We will be using a dataset based on the European fútball 2008-2016 seasons. This dataset is comprised of the information on 11 different European fútball league and their teams. Specifically, the data contains 7 tables (Country, League, Player, player attributes, team, team attributes, and all the matches played during 2008 - 2016 seasons. The goal is to some insight on the beautiful game based on the data held by this dataset.

For this tutorial, We will be using the Pandas library to read and organize the data we are collecting, the Matplot library to render plots for our visualizations of the dataset, and SciKit-Learn library for building our models.


During our exploratory analysis & Data visualization step, some of the questions that we thought would be interesting to try to answer were:

  1. Which team had highest amount of goals per league and per year?
  2. Which team won the most games per league and per year?
  3. How do the team attributes of the two top teams in each league differ from one another?
  4. Are teams scoring more goal when playing home or away? How about the number of wins when home or away?
  5. Is there a relationship between player attributes and their overall rating?
  6. Which players have the highest overall rating per year?
  7. What precent of player prefer their left foot? right foot? or both? Does this connect with their rating at all?
  8. What is the relationship between number of goals scored and the win percentage of the teams?

For our models in this tutorial, we will be presenting two linear regression examples:

  • Determining if there is a linear relationship between a player's overall rating and the player's attributes documented in the dataset
  • Determining if there is a linear relationship between a team's winning percentage(team) and team's attributes documented in the dataset

As well, a logistical regression in the form of:

  • Determining if a team will be the winners of a match from the overall rating of each player on the team's starting line up
In [1]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
import re
import sqlite3
import matplotlib.pyplot as plt
import warnings
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
import statsmodels.api as sm
import seaborn as sn

Collecting Data

The first step of the data lifescyle is the collecting phase. For this dataset, it is as simple as downloading the sql file from the database's host https://www.kaggle.com/hugomathien/soccer and performing SQL commands directly on the database to retrive the tables needed for the analysis of the data. The database contains 7 different tables, as such we created 7 variables to hold each the tables before we begin cleaning up.

In [2]:
sqlite_file = 'database.sqlite'
conn = sqlite3.connect(sqlite_file)    # connect to the database 

Reading in tables from the database

In [3]:
query = "SELECT * FROM Country" 
countries = pd.read_sql(query, conn)      # read and store country table 

query = "SELECT * FROM League"               
leagues = pd.read_sql(query, conn)        # read and store league table

query = "SELECT * FROM Match"                    
matches = pd.read_sql(query, conn)       # read and store match table

query = "SELECT * FROM Player"              
players = pd.read_sql(query, conn)       # read and store player table


query = "SELECT * FROM Player_Attributes"          
player_Attributes = pd.read_sql(query, conn)    # read and store player_attributes table


query = "SELECT * FROM Team"                 
teams = pd.read_sql(query, conn)               # read and store team table

query = "SELECT * FROM Team_Attributes"        
team_Attributes = pd.read_sql(query, conn)     # read and store team_Attributs table 

The tables have been stored in panda dataframes and each can be seen below.

In [4]:
countries.head()
Out[4]:
id name
0 1 Belgium
1 1729 England
2 4769 France
3 7809 Germany
4 10257 Italy
In [5]:
leagues.head()
Out[5]:
id country_id name
0 1 1 Belgium Jupiler League
1 1729 1729 England Premier League
2 4769 4769 France Ligue 1
3 7809 7809 Germany 1. Bundesliga
4 10257 10257 Italy Serie A
In [6]:
# Creating copy of match table for later use
matches_copy= matches.copy()

# getting the whole column names from match table 
my_list = list(matches)                      
my_list = matches.columns.values.tolist()

matches.head()
Out[6]:
id country_id league_id season stage date match_api_id home_team_api_id away_team_api_id home_team_goal ... SJA VCH VCD VCA GBH GBD GBA BSH BSD BSA
0 1 1 1 2008/2009 1 2008-08-17 00:00:00 492473 9987 9993 1 ... 4.00 1.65 3.40 4.50 1.78 3.25 4.00 1.73 3.40 4.20
1 2 1 1 2008/2009 1 2008-08-16 00:00:00 492474 10000 9994 0 ... 3.80 2.00 3.25 3.25 1.85 3.25 3.75 1.91 3.25 3.60
2 3 1 1 2008/2009 1 2008-08-16 00:00:00 492475 9984 8635 0 ... 2.50 2.35 3.25 2.65 2.50 3.20 2.50 2.30 3.20 2.75
3 4 1 1 2008/2009 1 2008-08-17 00:00:00 492476 9991 9998 5 ... 7.50 1.45 3.75 6.50 1.50 3.75 5.50 1.44 3.75 6.50
4 5 1 1 2008/2009 1 2008-08-16 00:00:00 492477 7947 9985 1 ... 1.73 4.50 3.40 1.65 4.50 3.50 1.65 4.75 3.30 1.67

5 rows × 115 columns

In [7]:
players.head()
Out[7]:
id player_api_id player_name player_fifa_api_id birthday height weight
0 1 505942 Aaron Appindangoye 218353 1992-02-29 00:00:00 182.88 187
1 2 155782 Aaron Cresswell 189615 1989-12-15 00:00:00 170.18 146
2 3 162549 Aaron Doran 186170 1991-05-13 00:00:00 170.18 163
3 4 30572 Aaron Galindo 140161 1982-05-08 00:00:00 182.88 198
4 5 23780 Aaron Hughes 17725 1979-11-08 00:00:00 182.88 154
In [8]:
player_Attributes.head()
Out[8]:
id player_fifa_api_id player_api_id date overall_rating potential preferred_foot attacking_work_rate defensive_work_rate crossing ... vision penalties marking standing_tackle sliding_tackle gk_diving gk_handling gk_kicking gk_positioning gk_reflexes
0 1 218353 505942 2016-02-18 00:00:00 67.0 71.0 right medium medium 49.0 ... 54.0 48.0 65.0 69.0 69.0 6.0 11.0 10.0 8.0 8.0
1 2 218353 505942 2015-11-19 00:00:00 67.0 71.0 right medium medium 49.0 ... 54.0 48.0 65.0 69.0 69.0 6.0 11.0 10.0 8.0 8.0
2 3 218353 505942 2015-09-21 00:00:00 62.0 66.0 right medium medium 49.0 ... 54.0 48.0 65.0 66.0 69.0 6.0 11.0 10.0 8.0 8.0
3 4 218353 505942 2015-03-20 00:00:00 61.0 65.0 right medium medium 48.0 ... 53.0 47.0 62.0 63.0 66.0 5.0 10.0 9.0 7.0 7.0
4 5 218353 505942 2007-02-22 00:00:00 61.0 65.0 right medium medium 48.0 ... 53.0 47.0 62.0 63.0 66.0 5.0 10.0 9.0 7.0 7.0

5 rows × 42 columns

In [9]:
teams.head()
Out[9]:
id team_api_id team_fifa_api_id team_long_name team_short_name
0 1 9987 673.0 KRC Genk GEN
1 2 9993 675.0 Beerschot AC BAC
2 3 10000 15005.0 SV Zulte-Waregem ZUL
3 4 9994 2007.0 Sporting Lokeren LOK
4 5 9984 1750.0 KSV Cercle Brugge CEB
In [10]:
team_Attributes.head()
Out[10]:
id team_fifa_api_id team_api_id date buildUpPlaySpeed buildUpPlaySpeedClass buildUpPlayDribbling buildUpPlayDribblingClass buildUpPlayPassing buildUpPlayPassingClass ... chanceCreationShooting chanceCreationShootingClass chanceCreationPositioningClass defencePressure defencePressureClass defenceAggression defenceAggressionClass defenceTeamWidth defenceTeamWidthClass defenceDefenderLineClass
0 1 434 9930 2010-02-22 00:00:00 60 Balanced NaN Little 50 Mixed ... 55 Normal Organised 50 Medium 55 Press 45 Normal Cover
1 2 434 9930 2014-09-19 00:00:00 52 Balanced 48.0 Normal 56 Mixed ... 64 Normal Organised 47 Medium 44 Press 54 Normal Cover
2 3 434 9930 2015-09-10 00:00:00 47 Balanced 41.0 Normal 54 Mixed ... 64 Normal Organised 47 Medium 44 Press 54 Normal Cover
3 4 77 8485 2010-02-22 00:00:00 70 Fast NaN Little 70 Long ... 70 Lots Organised 60 Medium 70 Double 70 Wide Cover
4 5 77 8485 2011-02-22 00:00:00 47 Balanced NaN Little 52 Mixed ... 52 Normal Organised 47 Medium 47 Press 52 Normal Cover

5 rows × 25 columns


Data Processing

Now with our data retrieved and stored, our next step (and the next step of the data lifecycle) is to tidy up the raw data to make it more readable and easier for us to use for our analysis and visualizations. To accomplish this, we will be merging a few of the tables together to make seeing useful insights in the data easier and we will be drop some of the columns from all the tables to remove irrelevant information.

In [11]:
cursor = conn.cursor()

Combining the country and league tables

Country and league tables can be combined into one so we can avoid having to link the two by their country id.

In [12]:
result = cursor.execute("""SELECT Country.id, Country.name, League.name FROM Country INNER JOIN
League ON Country.ID = League.ID""")
countryLeague = pd.DataFrame(result.fetchall())
countryLeague.columns = ['id', 'countryName', 'leagueName']
countryLeague.head()
Out[12]:
id countryName leagueName
0 1 Belgium Belgium Jupiler League
1 1729 England England Premier League
2 4769 France France Ligue 1
3 7809 Germany Germany 1. Bundesliga
4 10257 Italy Italy Serie A

Cleaning Matches Table

For tidying the matches table we remove the null columns in the table, we split the season of the match into two columns (season_start and season_end) and we insert the information we gained from merging the country and league tables above to make recognizing who the match belongs to easier

In [13]:
matches.dropna(axis = 1, inplace = True)
matches['date'] = pd.to_datetime(matches['date'])
matches['season_start'] = matches['season'].apply(lambda x: x.split('/')[0])
matches['season_end'] = matches['season'].apply(lambda x: x.split('/')[1])
matches.drop(columns = ['season'], axis = 1, inplace = True)
matchCountry = matches.merge(countryLeague, left_on = 'country_id', right_on = 'id')
matchCountry.drop(columns = ['league_id', 'id_y'], axis = 1, inplace = True)
matchCountry = matchCountry.rename(columns={"id_x": "id", "country_id": "country_league_id"})
matchCountry.head()
Out[13]:
id country_league_id stage date match_api_id home_team_api_id away_team_api_id home_team_goal away_team_goal season_start season_end countryName leagueName
0 1 1 1 2008-08-17 492473 9987 9993 1 1 2008 2009 Belgium Belgium Jupiler League
1 2 1 1 2008-08-16 492474 10000 9994 0 0 2008 2009 Belgium Belgium Jupiler League
2 3 1 1 2008-08-16 492475 9984 8635 0 3 2008 2009 Belgium Belgium Jupiler League
3 4 1 1 2008-08-17 492476 9991 9998 5 0 2008 2009 Belgium Belgium Jupiler League
4 5 1 1 2008-08-16 492477 7947 9985 1 3 2008 2009 Belgium Belgium Jupiler League

Taking it a step further, here we're including the actual names of the teams next to their team ids to make things more legible and introducing a column to keep track of who won the match in that entry.

In [14]:
team_Attributes['date'] = pd.to_datetime(team_Attributes['date'])
In [15]:
team_match = matchCountry.merge(teams, left_on = 'home_team_api_id', right_on = 'team_api_id')
team_match = team_match.rename(columns={"id_x": "id", "team_long_name": "home_team", "team_fifa_api_id": "home_team_fifa_api_id"})
team_match['home_team'] =  team_match['home_team'] + " (" + team_match['team_short_name'] + ")"
team_match.drop(columns = ['id_y', 'team_api_id', 'team_short_name'], axis = 1, inplace = True)

team_match = team_match.merge(teams, left_on = 'away_team_api_id', right_on = 'team_api_id')
team_match = team_match.rename(columns={"id_x": "id", "team_long_name": "away_team", "team_fifa_api_id": "away_team_fifa_api_id"})
team_match['away_team'] =  team_match['away_team'] + " (" + team_match['team_short_name'] + ")"
team_match.drop(columns = ['id_y', 'team_api_id', 'team_short_name'], axis = 1, inplace = True)

team_match = team_match[['id', 'match_api_id', 'date', 'home_team', 'home_team_api_id', 'home_team_fifa_api_id', 
                         'home_team_goal', 'away_team', 'away_team_api_id', 'away_team_fifa_api_id', 'away_team_goal', 
                         'country_league_id', 'stage', 'season_start', 'season_end', 'countryName', 'leagueName']]

winner = []
for index, row in team_match.iterrows():
    if team_match.iloc[index, 6] > team_match.iloc[index, 10]:
        winner.append(team_match.iloc[index, 3])
    elif team_match.iloc[index, 6] < team_match.iloc[index, 10]:
        winner.append(team_match.iloc[index, 7])
    else:
        winner.append("Tie")
team_match['winner'] = winner 

team_match.drop(columns = ['home_team_fifa_api_id', 'away_team_fifa_api_id'], axis = 1, inplace = True)
team_match.head()
Out[15]:
id match_api_id date home_team home_team_api_id home_team_goal away_team away_team_api_id away_team_goal country_league_id stage season_start season_end countryName leagueName winner
0 1 492473 2008-08-17 KRC Genk (GEN) 9987 1 Beerschot AC (BAC) 9993 1 1 1 2008 2009 Belgium Belgium Jupiler League Tie
1 483 665353 2009-08-30 KRC Genk (GEN) 9987 1 Beerschot AC (BAC) 9993 1 1 5 2009 2010 Belgium Belgium Jupiler League Tie
2 524 838476 2010-08-01 KRC Genk (GEN) 9987 2 Beerschot AC (BAC) 9993 1 1 1 2010 2011 Belgium Belgium Jupiler League KRC Genk (GEN)
3 759 1032694 2011-07-30 KRC Genk (GEN) 9987 3 Beerschot AC (BAC) 9993 1 1 1 2011 2012 Belgium Belgium Jupiler League KRC Genk (GEN)
4 1014 1224064 2012-10-20 KRC Genk (GEN) 9987 3 Beerschot AC (BAC) 9993 0 1 11 2012 2013 Belgium Belgium Jupiler League KRC Genk (GEN)

Cleaning Player and Player_Attributes tables

Here we are merging the player and the player attributes tables into one while dropping any unnecessary columns.

In [16]:
players['birthday'] = pd.to_datetime(players['birthday'])
player_Attributes['date'] = pd.to_datetime(player_Attributes['date'])
player_attribute = players.merge(player_Attributes, left_on = 'player_api_id', right_on = 'player_api_id')
player_attribute.drop(columns = ['id_y', 'player_fifa_api_id_y'], axis = 1, inplace = True)
player_attribute = player_attribute.rename(columns={"id_x": "id", "player_fifa_api_id_x": "player_fifa_api_id"})
player_attribute.head()
Out[16]:
id player_api_id player_name player_fifa_api_id birthday height weight date overall_rating potential ... vision penalties marking standing_tackle sliding_tackle gk_diving gk_handling gk_kicking gk_positioning gk_reflexes
0 1 505942 Aaron Appindangoye 218353 1992-02-29 182.88 187 2016-02-18 67.0 71.0 ... 54.0 48.0 65.0 69.0 69.0 6.0 11.0 10.0 8.0 8.0
1 1 505942 Aaron Appindangoye 218353 1992-02-29 182.88 187 2015-11-19 67.0 71.0 ... 54.0 48.0 65.0 69.0 69.0 6.0 11.0 10.0 8.0 8.0
2 1 505942 Aaron Appindangoye 218353 1992-02-29 182.88 187 2015-09-21 62.0 66.0 ... 54.0 48.0 65.0 66.0 69.0 6.0 11.0 10.0 8.0 8.0
3 1 505942 Aaron Appindangoye 218353 1992-02-29 182.88 187 2015-03-20 61.0 65.0 ... 53.0 47.0 62.0 63.0 66.0 5.0 10.0 9.0 7.0 7.0
4 1 505942 Aaron Appindangoye 218353 1992-02-29 182.88 187 2007-02-22 61.0 65.0 ... 53.0 47.0 62.0 63.0 66.0 5.0 10.0 9.0 7.0 7.0

5 rows × 46 columns

Cleaning Team and Team_Attributes tables

Similarly to what we did with the player and player_attributes tables, here we are merging the team and team_attributes and then sorting them by name alphabetically.

In [17]:
team_attribute = team_Attributes.merge(teams, left_on = 'team_api_id', right_on = 'team_api_id')
team_attribute = team_attribute.rename(columns={"id_x": "id", "team_long_name": "team_name", "team_fifa_api_id_x": "team_fifa_api_id"})
team_attribute['team_name'] =  team_attribute['team_name'] + " (" + team_attribute['team_short_name'] + ")"
team_attribute.drop(columns = ['id_y', 'team_fifa_api_id_y', 'team_short_name'], axis = 1, inplace = True)
team_attribute.head()
Out[17]:
id team_fifa_api_id team_api_id date buildUpPlaySpeed buildUpPlaySpeedClass buildUpPlayDribbling buildUpPlayDribblingClass buildUpPlayPassing buildUpPlayPassingClass ... chanceCreationShootingClass chanceCreationPositioningClass defencePressure defencePressureClass defenceAggression defenceAggressionClass defenceTeamWidth defenceTeamWidthClass defenceDefenderLineClass team_name
0 1 434 9930 2010-02-22 60 Balanced NaN Little 50 Mixed ... Normal Organised 50 Medium 55 Press 45 Normal Cover FC Aarau (AAR)
1 2 434 9930 2014-09-19 52 Balanced 48.0 Normal 56 Mixed ... Normal Organised 47 Medium 44 Press 54 Normal Cover FC Aarau (AAR)
2 3 434 9930 2015-09-10 47 Balanced 41.0 Normal 54 Mixed ... Normal Organised 47 Medium 44 Press 54 Normal Cover FC Aarau (AAR)
3 4 77 8485 2010-02-22 70 Fast NaN Little 70 Long ... Lots Organised 60 Medium 70 Double 70 Wide Cover Aberdeen (ABE)
4 5 77 8485 2011-02-22 47 Balanced NaN Little 52 Mixed ... Normal Organised 47 Medium 47 Press 52 Normal Cover Aberdeen (ABE)

5 rows × 26 columns

In [18]:
cols = list(team_attribute.columns)
cols = [cols[0]] + [cols[-1]] + cols[1:-1]

team_attribute = team_attribute[cols]
team_attribute.sort_values(by='team_name', ascending=False)
team_attribute.head()
Out[18]:
id team_name team_fifa_api_id team_api_id date buildUpPlaySpeed buildUpPlaySpeedClass buildUpPlayDribbling buildUpPlayDribblingClass buildUpPlayPassing ... chanceCreationShooting chanceCreationShootingClass chanceCreationPositioningClass defencePressure defencePressureClass defenceAggression defenceAggressionClass defenceTeamWidth defenceTeamWidthClass defenceDefenderLineClass
0 1 FC Aarau (AAR) 434 9930 2010-02-22 60 Balanced NaN Little 50 ... 55 Normal Organised 50 Medium 55 Press 45 Normal Cover
1 2 FC Aarau (AAR) 434 9930 2014-09-19 52 Balanced 48.0 Normal 56 ... 64 Normal Organised 47 Medium 44 Press 54 Normal Cover
2 3 FC Aarau (AAR) 434 9930 2015-09-10 47 Balanced 41.0 Normal 54 ... 64 Normal Organised 47 Medium 44 Press 54 Normal Cover
3 4 Aberdeen (ABE) 77 8485 2010-02-22 70 Fast NaN Little 70 ... 70 Lots Organised 60 Medium 70 Double 70 Wide Cover
4 5 Aberdeen (ABE) 77 8485 2011-02-22 47 Balanced NaN Little 52 ... 52 Normal Organised 47 Medium 47 Press 52 Normal Cover

5 rows × 26 columns


Exploratory analysis & Data visualization

Now with our data in order, we can begin the next step of the data lifecycle - data exploration and visualization. This following section will accomplish this section by running the gaunlet of questions we about the dataset as varies examples of the kinds of exploratory analysis that can be performed on data.

1. Which team had highest amount of goals per league and per year?

In [19]:
pd.set_option('mode.chained_assignment', None)
In [20]:
df = team_match[['season_start', 'home_team', 'home_team_goal', 'away_team', 'away_team_goal', 'leagueName']]
df['season_start'] = df['season_start'].astype(int)

Firstly, we are starting here by getting total number of goal for each team as home team and as away team each season for each league.

In [21]:
num_goal_home = df.groupby(['leagueName', 'season_start', 'home_team']).sum()
num_goal_home.drop(columns = ['away_team_goal'], axis = 1, inplace = True)
num_goal_home.head()
Out[21]:
home_team_goal
leagueName season_start home_team
Belgium Jupiler League 2008 Beerschot AC (BAC) 32
Club Brugge KV (CLB) 37
FCV Dender EH (DEN) 21
KAA Gent (GEN) 31
KRC Genk (GEN) 23
In [22]:
num_goal_away = df.groupby(['leagueName', 'season_start', 'away_team']).sum()
num_goal_away.drop(columns = ['home_team_goal'], axis = 1, inplace = True)
num_goal_away.head()
Out[22]:
away_team_goal
leagueName season_start away_team
Belgium Jupiler League 2008 Beerschot AC (BAC) 12
Club Brugge KV (CLB) 22
FCV Dender EH (DEN) 23
KAA Gent (GEN) 36
KRC Genk (GEN) 25

From there, we can just combine these columns to get the total number of goals for each team.

In [23]:
num_goal = num_goal_home.merge(num_goal_away, left_index=True, right_on=['leagueName', 'season_start', 'away_team'])
num_goal['total_num_goal'] = num_goal['home_team_goal'] + num_goal['away_team_goal']
num_goal.head()
Out[23]:
home_team_goal away_team_goal total_num_goal
leagueName season_start away_team
Belgium Jupiler League 2008 Beerschot AC (BAC) 32 12 44
Club Brugge KV (CLB) 37 22 59
FCV Dender EH (DEN) 21 23 44
KAA Gent (GEN) 31 36 67
KRC Genk (GEN) 23 25 48
In [24]:
A = num_goal # A is for later use

We then divided the teams by thier leagues and sorted the teams by their goal count.

In [25]:
num_goal.reset_index(inplace=True)  
num_goal_highest = num_goal.loc[num_goal.groupby(["leagueName", "season_start"])["total_num_goal"].idxmax()]
In [26]:
num_goal_highest = num_goal_highest.rename(columns={"away_team": "team_name"})
num_goal_highest.drop(columns = ['home_team_goal', 'away_team_goal'], axis = 1, inplace = True)
num_goal_highest.reset_index(drop=False, inplace=True)
num_goal_highest
Out[26]:
index leagueName season_start team_name total_num_goal
0 11 Belgium Jupiler League 2008 RSC Anderlecht (AND) 75
1 27 Belgium Jupiler League 2009 RSC Anderlecht (AND) 62
2 37 Belgium Jupiler League 2010 KRC Genk (GEN) 64
3 51 Belgium Jupiler League 2011 KAA Gent (GEN) 63
4 75 Belgium Jupiler League 2012 RSC Anderlecht (AND) 69
... ... ... ... ... ...
83 1432 Switzerland Super League 2011 FC Basel (BAS) 78
84 1448 Switzerland Super League 2012 FC Zürich (ZUR) 62
85 1453 Switzerland Super League 2013 FC Basel (BAS) 70
86 1463 Switzerland Super League 2014 FC Basel (BAS) 84
87 1472 Switzerland Super League 2015 FC Basel (BAS) 88

88 rows × 5 columns

In [27]:
for i, row in num_goal_highest.iterrows():
    num_goal_highest.iloc[i, 2] = str(num_goal_highest.iloc[i, 2]) + " - " + str(num_goal_highest.iloc[i, 2] + 1)

Now we can produce visualizations of these goal counts and see which teams held the most goals each season in each league.

In [28]:
%matplotlib inline

leagues = num_goal_highest.leagueName.unique()
count = 0
for league in leagues:
    new_df = num_goal_highest[num_goal_highest['leagueName'] == league]
    labels = new_df.season_start
    x = np.arange(len(labels))
    goals = new_df.total_num_goal
    width = 0.5
    fig, ax = plt.subplots()
    rects = ax.bar(x , goals, width)
    ax.set_ylabel('num goals')
    ax.set_xlabel('season')
    ax.set_title("team of highest goal per season of " + league)
    ax.set_xticks(x)
    ax.set_xticklabels(labels, fontsize=6)


    teams_label = new_df.team_name
    for rect in rects:
        team_name = teams_label[count].split("(")[0]
        height = rect.get_height() +1
        ax.text(rect.get_x() + width*0.5, 0.7*height, team_name,
                    ha='center', va='bottom', fontsize=7)
        count = count +1
    plt.show()

2. Which team won the most games per league and per year?

Very similarly to what we did for the previous question, we will first aggregate each teams wins.

In [29]:
df = team_match[team_match['winner'] != 'Tie'] # find rows with winner is not equal to Tie

df = df[['leagueName', 'season_start', 'winner']]
df['season_start'] = df['season_start'].astype(int)

Below we can see the teams with the most wins for each leaque over the years that the data set covers.

In [30]:
most_wins = df.groupby(['leagueName', 'season_start', 'winner'])['winner'].count().sort_values(ascending=False)
most_wins.head()
Out[30]:
leagueName               season_start  winner              
Italy Serie A            2013          Juventus (JUV)          33
Spain LIGA BBVA          2011          Real Madrid CF (REA)    32
                         2012          FC Barcelona (BAR)      32
Scotland Premier League  2013          Celtic (CEL)            31
Spain LIGA BBVA          2009          FC Barcelona (BAR)      31
Name: winner, dtype: int64

Now we will be dividing them again to find the teams with the most wins in each league over the years.

In [31]:
leagues = []
seasons = []
teams = []
count_wins = []
for index, value in most_wins.items():
    leagues.append(index[0])
    season_full = str(index[1]) + " - " + str(index[1] + 1)
    seasons.append(season_full)
    teams.append(index[2])
    count_wins.append(value)
d = {'league':leagues,'season':seasons,'team_name':teams,'count_wins':count_wins}
most_wins = pd.DataFrame(d)
B = most_wins
In [32]:
most_wins = most_wins.loc[most_wins.groupby(["league", "season"])["count_wins"].idxmax()]
most_wins.reset_index(drop=False, inplace=True)
most_wins.head(10)
Out[32]:
index league season team_name count_wins
0 75 Belgium Jupiler League 2008 - 2009 RSC Anderlecht (AND) 24
1 117 Belgium Jupiler League 2009 - 2010 RSC Anderlecht (AND) 22
2 226 Belgium Jupiler League 2010 - 2011 KRC Genk (GEN) 19
3 177 Belgium Jupiler League 2011 - 2012 RSC Anderlecht (AND) 20
4 201 Belgium Jupiler League 2012 - 2013 RSC Anderlecht (AND) 20
5 1459 Belgium Jupiler League 2013 - 2014 KV Oostende (OOS) 4
6 309 Belgium Jupiler League 2014 - 2015 Club Brugge KV (CLB) 17
7 150 Belgium Jupiler League 2015 - 2016 Club Brugge KV (CLB) 21
8 28 England Premier League 2008 - 2009 Manchester United (MUN) 28
9 31 England Premier League 2009 - 2010 Manchester United (MUN) 27

Now we can produce visualizations of these win counts and see which teams held the most wins each season in each league.

In [33]:
leagues = most_wins.league.unique()
count = 0
for league in leagues:
    new_df = most_wins[most_wins['league'] == league]
    labels = new_df.season
    x = np.arange(len(labels))
    count_wins = new_df.count_wins
    width = 0.35
    fig, ax = plt.subplots()
    rects = ax.bar(x , count_wins, width)
    ax.set_ylabel('num wins')
    ax.set_xlabel('season')
    ax.set_title("team of most won per season of " + league)
    ax.set_xticks(x)
    ax.set_xticklabels(labels, fontsize=6)
    teams_label = new_df.team_name
    for rect in rects:
        team_name = teams_label[count].split("(")[0]
        height = rect.get_height() +1
        ax.text(rect.get_x()+ width*0.5, 0.7*height, team_name,
                    ha='center', va='bottom', fontsize=7)
        count = count +1
    plt.show()

With both these sets of data retieved, we can attempt some cross analysis and compare the goals scored by each team to the wins each team then holds.

In [34]:
df = team_match[team_match['leagueName'] == 'Belgium Jupiler League']
df = df[df['season_start'] == '2013']
df.head()
Out[34]:
id match_api_id date home_team home_team_api_id home_team_goal away_team away_team_api_id away_team_goal country_league_id stage season_start season_end countryName leagueName winner
784 1239 1677181 2014-04-05 Lierse SK (LIE) 9989 1 KAA Gent (GEN) 9991 0 1 2 2013 2014 Belgium Belgium Jupiler League Lierse SK (LIE)
791 1243 1677185 2014-04-19 Waasland-Beveren (WAA) 8475 2 KAA Gent (GEN) 9991 4 1 4 2013 2014 Belgium Belgium Jupiler League KAA Gent (GEN)
794 1246 1677188 2014-04-26 KV Oostende (OOS) 8573 1 KAA Gent (GEN) 9991 1 1 5 2013 2014 Belgium Belgium Jupiler League Tie
1418 1248 1677190 2014-05-03 KAA Gent (GEN) 9991 4 Lierse SK (LIE) 9989 0 1 6 2013 2014 Belgium Belgium Jupiler League KAA Gent (GEN)
1461 1237 1677179 2014-03-29 Waasland-Beveren (WAA) 8475 2 Lierse SK (LIE) 9989 0 1 1 2013 2014 Belgium Belgium Jupiler League Waasland-Beveren (WAA)

Comparing most goal team vs most won team per season and per league

In [35]:
num_goal_highest.head()
Out[35]:
index leagueName season_start team_name total_num_goal
0 11 Belgium Jupiler League 2008 - 2009 RSC Anderlecht (AND) 75
1 27 Belgium Jupiler League 2009 - 2010 RSC Anderlecht (AND) 62
2 37 Belgium Jupiler League 2010 - 2011 KRC Genk (GEN) 64
3 51 Belgium Jupiler League 2011 - 2012 KAA Gent (GEN) 63
4 75 Belgium Jupiler League 2012 - 2013 RSC Anderlecht (AND) 69
In [36]:
most_wins.head()
Out[36]:
index league season team_name count_wins
0 75 Belgium Jupiler League 2008 - 2009 RSC Anderlecht (AND) 24
1 117 Belgium Jupiler League 2009 - 2010 RSC Anderlecht (AND) 22
2 226 Belgium Jupiler League 2010 - 2011 KRC Genk (GEN) 19
3 177 Belgium Jupiler League 2011 - 2012 RSC Anderlecht (AND) 20
4 201 Belgium Jupiler League 2012 - 2013 RSC Anderlecht (AND) 20
In [37]:
team_goal_won = pd.merge(num_goal_highest, most_wins,  how='left', 
                         left_on=['leagueName','season_start'], right_on = ['league','season'])
team_goal_won = team_goal_won.drop(['index_x', 'league', 'season', 'index_y'], axis = 1)
team_goal_won = team_goal_won.rename(columns={"team_name_x": "team_highest_goal", "team_name_y": "team_most_won", 
                                              "season_start": "season"})
team_goal_won.head()
Out[37]:
leagueName season team_highest_goal total_num_goal team_most_won count_wins
0 Belgium Jupiler League 2008 - 2009 RSC Anderlecht (AND) 75 RSC Anderlecht (AND) 24
1 Belgium Jupiler League 2009 - 2010 RSC Anderlecht (AND) 62 RSC Anderlecht (AND) 22
2 Belgium Jupiler League 2010 - 2011 KRC Genk (GEN) 64 KRC Genk (GEN) 19
3 Belgium Jupiler League 2011 - 2012 KAA Gent (GEN) 63 RSC Anderlecht (AND) 20
4 Belgium Jupiler League 2012 - 2013 RSC Anderlecht (AND) 69 RSC Anderlecht (AND) 20
In [38]:
goal_won_same = team_goal_won[team_goal_won['team_highest_goal'] == team_goal_won['team_most_won']]
In [39]:
percent = goal_won_same.shape[0] / team_goal_won.shape[0]
percent * 100
Out[39]:
63.63636363636363

From this analysis, we find that the team with the highest goals won the most games in a season roughly 64%.

We then can create a table with total goal number and total win count together for the best team per season and per league.

In [40]:
# A with the copy of the number of goals table from earlier
for index, row in A.iterrows():
    A.iloc[index, 1] = str(A.iloc[index, 1]) + " - " + str(A.iloc[index, 1] + 1)
A.head()
Out[40]:
leagueName season_start away_team home_team_goal away_team_goal total_num_goal
0 Belgium Jupiler League 2008 - 2009 Beerschot AC (BAC) 32 12 44
1 Belgium Jupiler League 2008 - 2009 Club Brugge KV (CLB) 37 22 59
2 Belgium Jupiler League 2008 - 2009 FCV Dender EH (DEN) 21 23 44
3 Belgium Jupiler League 2008 - 2009 KAA Gent (GEN) 31 36 67
4 Belgium Jupiler League 2008 - 2009 KRC Genk (GEN) 23 25 48
In [41]:
A = A.rename(columns={"away_team": "team_name"})
A.drop(columns = ['home_team_goal', 'away_team_goal'], axis = 1, inplace = True)
In [42]:
A.head()
Out[42]:
leagueName season_start team_name total_num_goal
0 Belgium Jupiler League 2008 - 2009 Beerschot AC (BAC) 44
1 Belgium Jupiler League 2008 - 2009 Club Brugge KV (CLB) 59
2 Belgium Jupiler League 2008 - 2009 FCV Dender EH (DEN) 44
3 Belgium Jupiler League 2008 - 2009 KAA Gent (GEN) 67
4 Belgium Jupiler League 2008 - 2009 KRC Genk (GEN) 48
In [43]:
B.head()
Out[43]:
league season team_name count_wins
0 Italy Serie A 2013 - 2014 Juventus (JUV) 33
1 Spain LIGA BBVA 2011 - 2012 Real Madrid CF (REA) 32
2 Spain LIGA BBVA 2012 - 2013 FC Barcelona (BAR) 32
3 Scotland Premier League 2013 - 2014 Celtic (CEL) 31
4 Spain LIGA BBVA 2009 - 2010 FC Barcelona (BAR) 31
In [44]:
team_goal_won_total = pd.merge(A, B,  how='left', 
                         left_on=['leagueName','season_start', 'team_name'], right_on = ['league','season', 'team_name'])
team_goal_won_total = team_goal_won_total.drop(['league', 'season'], axis = 1)
team_goal_won_total = team_goal_won_total.rename(columns={"season_start": "season"})
team_goal_won_total.head()
Out[44]:
leagueName season team_name total_num_goal count_wins
0 Belgium Jupiler League 2008 - 2009 Beerschot AC (BAC) 44 11
1 Belgium Jupiler League 2008 - 2009 Club Brugge KV (CLB) 59 18
2 Belgium Jupiler League 2008 - 2009 FCV Dender EH (DEN) 44 9
3 Belgium Jupiler League 2008 - 2009 KAA Gent (GEN) 67 17
4 Belgium Jupiler League 2008 - 2009 KRC Genk (GEN) 48 15

3. How do the team attributes of the two top teams in each league differ from one another?

In [45]:
warnings.simplefilter(action='ignore', category=FutureWarning)

Here, we are aggregating and matching up the two best teams (based on each team's number of wins) of each league.

In [46]:
com_teams = []
leagues = most_wins.league.unique()
for league in leagues:
    new_df = team_goal_won[team_goal_won['leagueName'] == league]
    a = new_df['team_most_won'].value_counts()
    com_teams.append([league, a.index[0], a.index[1]])
df = pd.DataFrame(com_teams, columns = ['leangue', 'team1', 'team2']) 
df.head()
Out[46]:
leangue team1 team2
0 Belgium Jupiler League RSC Anderlecht (AND) Club Brugge KV (CLB)
1 England Premier League Manchester United (MUN) Manchester City (MCI)
2 France Ligue 1 Paris Saint-Germain (PSG) LOSC Lille (LIL)
3 Germany 1. Bundesliga FC Bayern Munich (BMU) Borussia Dortmund (DOR)
4 Italy Serie A Juventus (JUV) Milan (ACM)

Now we can begin to visualize the differences in each of the team's attributes and compare to each other to see if we notice any trends in the two best teams that might account for their success.

In [47]:
# Belgium Jupiler League
compare_attr = pd.DataFrame(data=team_attribute.query('team_name == "RSC Anderlecht (AND)"').mean(),
                            columns=['RSC Anderlecht (AND)'])
compare_attr['Club Brugge KV (CLB)'] = team_attribute.query('team_name == "Club Brugge KV (CLB)"').mean()
compare_attr = compare_attr[3:]

compare_attr.plot.barh(width=0.50, figsize=(12,8))
plt.title('comparing best two teams of Belgium Jupiler League')
plt.ylabel('Attribute type')
plt.xlabel('Attribute value')
plt.legend(loc=0);

RSC Anderlecht was the team with the most wins in the Belgium Jupiler League and when their stats are compared to the runner up of the league, we can begin to assume from the visualization that their lies with their chance creation with passes and their creation of opportunities for taking shots at the goal.

In [48]:
# England Premier League
compare_attr = pd.DataFrame(data=team_attribute.query('team_name == "Manchester United (MUN)"').mean(),
                            columns=['Manchester United (MUN)'])
compare_attr['Manchester City (MCI)'] = team_attribute.query('team_name == "Manchester City (MCI)"').mean()
compare_attr = compare_attr[3:]
compare_attr

compare_attr.plot.barh(width=0.50, figsize=(12,8))
plt.title('comparing best two teams of England Premier League')
plt.ylabel('Attribute type')
plt.xlabel('Attribute value')
plt.legend(loc = 0);

Manchester United's victory in the English Premier league seems to be a bit more contested by the runner up in terms of stats as manchester seems to really only pull ahead in chances created from crosses while all the attributes are pretty similar to their rivals, Manchester City.

In [49]:
# France Ligue 1
compare_attr = pd.DataFrame(data=team_attribute.query('team_name == "Paris Saint-Germain (PSG)"').mean(),
                            columns=['Paris Saint-Germain (PSG)'])
compare_attr['Girondins de Bordeaux (BOR)'] = team_attribute.query('team_name == "Girondins de Bordeaux (BOR)"').mean()
compare_attr = compare_attr[3:]
compare_attr

compare_attr.plot.barh(width=0.50, figsize=(12,8))
plt.title('comparing best two teams of France Ligue 1')
plt.ylabel('Attribute type')
plt.xlabel('Attribute value')
plt.legend(loc = 0);
In [50]:
# Italy Serie A
compare_attr = pd.DataFrame(data=team_attribute.query('team_name == "Juventus (JUV)"').mean(),
                            columns=['Juventus (JUV)'])
compare_attr['Milan (ACM)'] = team_attribute.query('team_name == "Milan (ACM)"').mean()
compare_attr = compare_attr[3:]
compare_attr

compare_attr.plot.barh(width=0.50, figsize=(12,8))
plt.title('comparing best two teams of Italy Serie A')
plt.ylabel('Attribute type')
plt.xlabel('Attribute value')
plt.legend(loc = 0);

The Italy Serie A stats show something rather interesting. Milan(The league runner up) is actually ahead in a lot of the attribute but Juventus still managed to take the most wins in the league.

We continue with these same steps and analysis for the remaining leagues.

In [51]:
# Netherlands Eredivisie	Ajax (AJA)	PSV (PSV)
# Netherlands Eredivisie
compare_attr = pd.DataFrame(data=team_attribute.query('team_name == "Ajax (AJA)"').mean(),
                            columns=['Ajax (AJA)'])
compare_attr['PSV (PSV)'] = team_attribute.query('team_name == "PSV (PSV)"').mean()
compare_attr = compare_attr[3:]
compare_attr

compare_attr.plot.barh(width=0.50, figsize=(12,8))
plt.title('comparing best two teams of Netherlands Eredivisie')
plt.ylabel('Attribute type')
plt.xlabel('Attribute value')
plt.legend(loc = 0);
In [52]:
# Poland Ekstraklasa	Legia Warszawa (LEG)	Wisła Kraków (WIS)
# Poland Ekstraklasa
compare_attr = pd.DataFrame(data=team_attribute.query('team_name == "Legia Warszawa (LEG)"').mean(),
                            columns=['Legia Warszawa (LEG)'])
compare_attr['Wisła Kraków (WIS)'] = team_attribute.query('team_name == "Wisła Kraków (WIS)"').mean()
compare_attr = compare_attr[3:]
compare_attr

compare_attr.plot.barh(width=0.50, figsize=(12,8))
plt.title('comparing best two teams of Poland Ekstraklasa')
plt.ylabel('Attribute type')
plt.xlabel('Attribute value')
plt.legend(loc = 0);
In [53]:
# Portugal Liga ZON Sagres	SL Benfica (BEN)	FC Porto (POR)
# Portugal Liga ZON Sagres
compare_attr = pd.DataFrame(data=team_attribute.query('team_name == "SL Benfica (BEN)"').mean(),
                            columns=['SL Benfica (BEN)'])
compare_attr['FC Porto (POR)'] = team_attribute.query('team_name == "FC Porto (POR)"').mean()
compare_attr = compare_attr[3:]
compare_attr

compare_attr.plot.barh(width=0.50, figsize=(12,8))
plt.title('comparing best two teams of Portugal Liga ZON Sagres')
plt.ylabel('Attribute type')
plt.xlabel('Attribute value')
plt.legend(loc = 0);
In [54]:
# Scotland Premier League	Celtic (CEL)	Rangers (RAN)
# Scotland Premier League
compare_attr = pd.DataFrame(data=team_attribute.query('team_name == "Celtic (CEL)"').mean(),
                            columns=['Celtic (CEL)'])
compare_attr['Rangers (RAN)'] = team_attribute.query('team_name == "Rangers (RAN)"').mean()
compare_attr = compare_attr[3:]
compare_attr

compare_attr.plot.barh(width=0.50, figsize=(12,8))
plt.title('comparing best two teams of Scotland Premier League')
plt.ylabel('Attribute type')
plt.xlabel('Attribute value')
plt.legend(loc = 0);
In [55]:
# Spain LIGA BBVA	FC Barcelona (BAR)	Real Madrid CF (REA)
# Spain LIGA BBVA
compare_attr = pd.DataFrame(data=team_attribute.query('team_name == "FC Barcelona (BAR)"').mean(),
                            columns=['FC Barcelona (BAR)'])
compare_attr['Real Madrid CF (REA)'] = team_attribute.query('team_name == "Real Madrid CF (REA)"').mean()
compare_attr = compare_attr[3:]
compare_attr

compare_attr.plot.barh(width=0.50, figsize=(12,8))
plt.title('comparing best two teams of Spain LIGA BBVA')
plt.ylabel('Attribute type')
plt.xlabel('Attribute value')
plt.legend(loc = 0);

Interestingly in LIGA BBVA league, even though the second team(Real Madrid) was ahead 6 attributes in total 9, the first most winner was FC Barcelona.

In [56]:
# Switzerland Super League	FC Basel (BAS)	FC Zürich (ZUR)
# Switzerland Super League
compare_attr = pd.DataFrame(data=team_attribute.query('team_name == "FC Basel (BAS)"').mean(),
                            columns=['FC Basel (BAS)'])
compare_attr['FC Zürich (ZUR)'] = team_attribute.query('team_name == "FC Zürich (ZUR)"').mean()
compare_attr = compare_attr[3:]
compare_attr

compare_attr.plot.barh(width=0.50, figsize=(12,8))
plt.title('comparing best two teams of Switzerland Super League')
plt.ylabel('Attribute type')
plt.xlabel('Attribute value')
plt.legend(loc = 0);

4. Are teams scoring more goal when playing home or away? How about the number of wins when home or away?

For this quesetion, we wanted to anaylsis if the old superstition of the "home field advantage" was true by comparing the wins of teams as home teams vs their away wins.

In [57]:
team_goal_won_total['game_count_as_hometeam'] = 'NAN'
team_goal_won_total['goal_count_as_hometeam'] = 'NAN'
team_goal_won_total['win_count_as_hometeam'] = 'NAN'

team_goal_won_total['game_count_as_awayteam'] = 'NAN'
team_goal_won_total['goal_count_as_awayteam'] = 'NAN'
team_goal_won_total['win_count_as_awayteam'] = 'NAN'

team_goal_won_total['num_total_game'] = 'NAN'
In [58]:
team_match['season'] = team_match['season_start'] + " - " + team_match['season_end']
In [59]:
for index, row in team_goal_won_total.iterrows():
    season = team_goal_won_total.iloc[index, 1]
    team_name = team_goal_won_total.iloc[index, 2]
    df = team_match[(team_match['season'] == season) & (team_match['home_team'] == team_name)]
    team_goal_won_total.iloc[index, 6] = df['home_team_goal'].sum()
    team_goal_won_total.iloc[index, 5] = df.shape[0]
    new_df = df[df['winner'] == team_name]
    team_goal_won_total.iloc[index, 7] = new_df.shape[0]
    
    df = team_match[(team_match['season'] == season) & (team_match['away_team'] == team_name)]
    team_goal_won_total.iloc[index, 9] = df['away_team_goal'].sum()
    team_goal_won_total.iloc[index, 8] = df.shape[0]
    new_df = df[df['winner'] == team_name]
    team_goal_won_total.iloc[index, 10] = new_df.shape[0]
    team_goal_won_total.iloc[index, 11] = team_goal_won_total.iloc[index, 5] + team_goal_won_total.iloc[index, 8]
In [60]:
team_goal_won_total = team_goal_won_total[['leagueName', 'season', 'team_name', 'num_total_game', 'game_count_as_hometeam',
                                           'game_count_as_awayteam', 'count_wins', 'win_count_as_hometeam', 
                         'win_count_as_awayteam', 'total_num_goal', 'goal_count_as_hometeam', 'goal_count_as_awayteam']]
team_goal_won_total = team_goal_won_total.rename(columns={"num_total_game": "game_count_total", "count_wins": "win_count_total", "total_num_goal": "goal_count_total"})
team_goal_won_total.head()
Out[60]:
leagueName season team_name game_count_total game_count_as_hometeam game_count_as_awayteam win_count_total win_count_as_hometeam win_count_as_awayteam goal_count_total goal_count_as_hometeam goal_count_as_awayteam
0 Belgium Jupiler League 2008 - 2009 Beerschot AC (BAC) 34 17 17 11 9 2 44 32 12
1 Belgium Jupiler League 2008 - 2009 Club Brugge KV (CLB) 34 17 17 18 11 7 59 37 22
2 Belgium Jupiler League 2008 - 2009 FCV Dender EH (DEN) 34 17 17 9 5 4 44 21 23
3 Belgium Jupiler League 2008 - 2009 KAA Gent (GEN) 34 17 17 17 9 8 67 31 36
4 Belgium Jupiler League 2008 - 2009 KRC Genk (GEN) 34 17 17 15 7 8 48 23 25

With the table created, we can now visualize the results to make it easier to interpret.

In [61]:
new_df = team_goal_won_total[team_goal_won_total.columns[-2:]]
df1 = new_df.mean().to_frame()
df1.reset_index(inplace=True)  
df1 = df1.rename(columns={"index": "type", 0: "mean_goal_count"})

fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
x = df1.type
y = df1.mean_goal_count
ax.bar(x,y)
plt.show()

We come up with the mean value of total goals as a home team is higher than the mean value of total goals as an away team.

In [62]:
# win vs home or away
new_df = team_goal_won_total[team_goal_won_total.columns[7:9]]
df1 = new_df.mean().to_frame()
df1.reset_index(inplace=True)  
df1 = df1.rename(columns={"index": "type", 0: "mean_win_count"})

fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
x = df1.type
y = df1.mean_win_count
ax.bar(x,y)
plt.show()

Likewise, the mean value of home team victories is much higher than when away.

5. Is there a relationship between player attributes and their overall rating?

For this section, we wanted to start a surface level exploration into the relationship between the individual player's overall rating and their attributes. We accomplish this by creating a linear regression with a subset of the player's attributes and their associated rating to see if there were any trends shown by plotting the results.

In [63]:
attributes = player_Attributes[['free_kick_accuracy','agility','shot_power','stamina', 'short_passing', 'vision', 'ball_control', 'positioning', 'finishing']].fillna(0).astype('int32')

reg = LinearRegression().fit(attributes, player_Attributes['overall_rating'].fillna(0))

print(reg.coef_)

i = 0
for x in list(attributes):
    fig, ax = plt.subplots()
    ax.set_ylabel('overall_rating')
    ax.set_xlabel(x)
    ax.scatter(attributes[x],player_Attributes['overall_rating'].fillna(0))
    ax.plot(attributes[x], reg.coef_[i]*attributes[x] + reg.intercept_, 'k')
    i = i + 1
plt.show()
[-0.02712711  0.01414654  0.10463825  0.08053166  0.13160613  0.07051224
  0.00833057 -0.01379527 -0.01003446]

From the results, we can see that there is a relationship present and that could inducate that some attributes weight towards a higher rating more than others. We will continue this exploration when we get to building our models.

6. Which players have the highest overall rating per year?

Here we wanted to see who was at the top of the game for each year in the dataset.

In [64]:
grouped_player = player_Attributes.groupby([player_Attributes['date'].dt.year])
players_list = []
ratings = []
seasons = []
for name, group in grouped_player:
    df = group.sort_values('overall_rating', ascending=False)
    row = df.iloc[0]
    player = players.loc[players[players['player_fifa_api_id'] == row['player_fifa_api_id']].index, 'player_name']
    players_list.append(player.values[0])
    ratings.append(row['overall_rating'])
    seasons.append(name)

overall_rating = pd.DataFrame(data={"seasons" : seasons, 'player' : players_list, "rating" : ratings})
overall_rating
Out[64]:
seasons player rating
0 2007 Wayne Rooney 93.0
1 2008 Cristiano Ronaldo 91.0
2 2009 Iker Casillas 91.0
3 2010 Lionel Messi 90.0
4 2011 Lionel Messi 94.0
5 2012 Lionel Messi 94.0
6 2013 Lionel Messi 94.0
7 2014 Lionel Messi 94.0
8 2015 Lionel Messi 94.0
9 2016 Neymar 90.0

Correlations from these results:

Wayne Rooney and cristiano Ronaldo were the two highest rated players during in 2007/2008, both of which played on Manchester United. Manchester United was also the team with the most wins in the Premier League during those years. Similarly, Messi and Neymar have been apart of Barcelona the years they were the highest rated players and while concurrently there at Barcelona, the club had the most wins in the spanish league.


Incidentally, it seems like the highest rated player comes from the spanish league for 8 years in a row here.

7. What precent of player prefer their left foot? right foot? or both? Does this connect with their rating at all?

For this question, we were curious to see the distributions of the players preferred foot. As well, in some sports using a different hand or foot than the major sometimes results in small advantages (take left handers in tennis for example), does this apply to this sport as well?

In [65]:
p_A_1= player_Attributes[['player_api_id', 'preferred_foot', 'overall_rating', 'potential']]
p_A_1.head()
Out[65]:
player_api_id preferred_foot overall_rating potential
0 505942 right 67.0 71.0
1 505942 right 67.0 71.0
2 505942 right 62.0 66.0
3 505942 right 61.0 65.0
4 505942 right 61.0 65.0
In [66]:
p_A_1= p_A_1.groupby(['player_api_id', 'preferred_foot']).mean()  # getting the average player rating
p_A_1.head()
Out[66]:
overall_rating potential
player_api_id preferred_foot
2625 right 60.142857 61.142857
2752 right 69.380952 70.380952
2768 left 69.285714 70.571429
2770 left 71.571429 75.285714
right 70.750000 72.000000
In [67]:
# get the total number of players that are right, left or both and there rating
left_foot=0; left_rating=0.0
right_foot=0; right_rating =0.0
both=0;       both_rating = 0.0
checked=[]
for key, value in p_A_1.iterrows():
    if key[0] in checked:
        both +=1
        both_rating += value[0] 
    elif key[1] == 'right':
        right_foot +=1
        checked.append(key[0])
        right_rating += value[0]
    else:
        left_foot +=1
        checked.append(key[0])
        left_rating += value[0]
In [68]:
total_players= left_foot + right_foot + both
average_left_rating= left_rating/ left_foot
average_right_rating = right_rating/ right_foot
both_rating = both_rating/ both
In [69]:
# Data to plot
labels = 'left', 'right', 'both'
sizes = [left_foot, right_foot, both]
colors = ['yellowgreen', 'lightcoral', 'lightskyblue']
explode = (0.1, 0, 0)  # explode 1st slice

# Plot
plt.pie(sizes, explode=explode, labels=labels, colors=colors,
autopct='%1.1f%%', shadow=True, startangle=140)

plt.axis('equal')
plt.title("Overall players foot Preferance percentage ")
plt.show()

From this graph, we can see that the major of players in our dataset are prefer their right foot, while a smaller percentage prefer their left or both.

In [70]:
# Data to plot
labels = average_left_rating, average_right_rating, both_rating
sizes = [left_foot, right_foot, both]
colors = ['yellowgreen', 'lightcoral', 'lightskyblue']
explode = (0.1, 0, 0)  # explode 1st slice

# Plot
plt.pie(sizes, explode=explode, labels=labels, colors=colors,
autopct='%1.1f%%', shadow=True, startangle=140)

plt.axis('equal')
plt.title("Total foot preferance percentage with respect to average rating ")
plt.show()

And from this graph, we can tell that their does not seem to be a bias in the ratings based off the player's prefer foot, possibly correlating to their also being no advantage for using the other foot or both.

8. What is the relationship between number of goals scored and the win percentage of the teams?

For this question, we wanted to see if their is a correlation between a high number of goals scored and a high win percentage.

In [71]:
plt.scatter(team_goal_won_total['goal_count_total'], team_goal_won_total['win_count_total'])
plt.ylabel("Total number of wins")
plt.xlabel("Total nubmer of goals")
plt.title("Total number of goals vs wins from 2008- 2016 for each team")

x= team_goal_won_total['goal_count_total'].values.reshape(-1,1)
y= team_goal_won_total['win_count_total']
reg_1 = LinearRegression()
reg_1 = reg_1.fit(x, y)

intercept= reg_1.intercept_
slope= reg_1.coef_
plt.plot(x, intercept + slope * x, 'red', label='reg')
plt.show()
In [72]:
team_goal_won_total['winning_percentage']= (team_goal_won_total['win_count_total'] / team_goal_won_total['game_count_total'])*100
team_goal_won_total.head()
Out[72]:
leagueName season team_name game_count_total game_count_as_hometeam game_count_as_awayteam win_count_total win_count_as_hometeam win_count_as_awayteam goal_count_total goal_count_as_hometeam goal_count_as_awayteam winning_percentage
0 Belgium Jupiler League 2008 - 2009 Beerschot AC (BAC) 34 17 17 11 9 2 44 32 12 32.3529
1 Belgium Jupiler League 2008 - 2009 Club Brugge KV (CLB) 34 17 17 18 11 7 59 37 22 52.9412
2 Belgium Jupiler League 2008 - 2009 FCV Dender EH (DEN) 34 17 17 9 5 4 44 21 23 26.4706
3 Belgium Jupiler League 2008 - 2009 KAA Gent (GEN) 34 17 17 17 9 8 67 31 36 50
4 Belgium Jupiler League 2008 - 2009 KRC Genk (GEN) 34 17 17 15 7 8 48 23 25 44.1176
In [73]:
plt.scatter(team_goal_won_total['goal_count_total'], team_goal_won_total['winning_percentage']) 
plt.ylabel("wining percentage")
plt.xlabel("total_number of goals")
plt.title("Wining percentage of all Europian leagues teams Vs there respective goals per season")

x= team_goal_won_total['goal_count_total'].values.reshape(-1,1)
y=team_goal_won_total['winning_percentage']
reg_1 = LinearRegression()
reg_1 = reg_1.fit(x,y)

intercept= reg_1.intercept_
slope= reg_1.coef_
plt.plot(x, intercept + slope * x, 'red', label='reg')
plt.show()
In [74]:
a= team_goal_won_total.groupby(['leagueName'])

for index, value in a:
    
    x= value.goal_count_total
    y= value.winning_percentage
    
    plt.scatter(x, y) 
    plt.ylabel("wining percentage")
    plt.xlabel("total_number of goals")
    plt.title(index)
    
    reg_1 = LinearRegression()
    reg_1 = reg_1.fit(x.values.reshape(-1,1),y)

    intercept= reg_1.intercept_
    slope= reg_1.coef_
    plt.plot(x, intercept + slope * x, 'red', label='reg')
    plt.show()

From each of these visualizations, aside from a couple of outliers, their is a consistant correlation between a high amount fo goals scored and a high number of wins.


Analysis

With exploration of the data done, we move on to the penultimate step of the data lifecycle - the analysis. In this section we apply machine learning and statistics to better understand the trends and relationships present in the data. We will be using two machine learning models to accomplish this in this tutorial - Linear regression and logistical regression.

Determining if there is a linear relationship between a player's overall rating and the player's attributes documented in the dataset

The first analysis we wanted to demonstate is the continuation of the question posed in question 5 of our exploration.

We will start by defining our null and alternative hypothesis.

Null hypothesis 1: There is no relation between the overall rating of a player and the attributes of player.

Alternative Hypothesis 1: There is a relation between the overall rating of a player and the attributes of player.

We will either accept or reject the null hypothesis based on the statistical results of our linear regression.

In [75]:
df = player_Attributes

# Preparing data for regression
selected_attr = [
       'potential', 'crossing', 'finishing', 'heading_accuracy',
       'short_passing', 'volleys', 'dribbling', 'curve', 'free_kick_accuracy',
       'long_passing', 'ball_control', 'acceleration', 'sprint_speed',
       'agility', 'reactions', 'balance', 'shot_power', 'jumping', 'stamina',
       'strength', 'long_shots', 'aggression', 'interceptions', 'positioning',
       'vision', 'penalties', 'marking', 'standing_tackle', 'sliding_tackle',
       'gk_diving', 'gk_handling', 'gk_kicking', 'gk_positioning',
       'gk_reflexes']
target = ['overall_rating']
df = df.dropna(0)
X = df[selected_attr]
y = df[target]
In [76]:
X_train, X_test, y_train, y_test = train_test_split(X,y)
reg = LinearRegression()
reg.fit(X_train,y_train)
print("coef_: " , reg.coef_)
print("intercept_: " , reg.intercept_)
coef_:  [[ 0.37695799  0.02093622  0.01136773  0.07026387  0.04911308  0.00396994
  -0.01316442  0.01198513  0.01265543  0.00757413  0.13461736  0.00689957
   0.0094194  -0.00881508  0.21155022  0.00888179  0.01722714  0.01439787
  -0.00499012  0.06054882 -0.01402617  0.02098035  0.01258783 -0.01056229
  -0.00180798  0.01427677  0.0324026   0.00251907 -0.02827941  0.16193239
   0.03178607 -0.03368222  0.05750351  0.02506919]]
intercept_:  [-3.484707]
In [77]:
X=sm.add_constant(X_train) 
model= sm.OLS(y_train, X).fit()
model.summary()
Out[77]:
OLS Regression Results
Dep. Variable: overall_rating R-squared: 0.841
Model: OLS Adj. R-squared: 0.841
Method: Least Squares F-statistic: 2.108e+04
Date: Sat, 19 Dec 2020 Prob (F-statistic): 0.00
Time: 04:13:49 Log-Likelihood: -3.3093e+05
No. Observations: 135265 AIC: 6.619e+05
Df Residuals: 135230 BIC: 6.623e+05
Df Model: 34
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const -3.4847 0.109 -32.028 0.000 -3.698 -3.271
potential 0.3770 0.002 222.886 0.000 0.374 0.380
crossing 0.0209 0.001 21.599 0.000 0.019 0.023
finishing 0.0114 0.001 10.683 0.000 0.009 0.013
heading_accuracy 0.0703 0.001 74.060 0.000 0.068 0.072
short_passing 0.0491 0.002 30.713 0.000 0.046 0.052
volleys 0.0040 0.001 4.158 0.000 0.002 0.006
dribbling -0.0132 0.001 -9.740 0.000 -0.016 -0.011
curve 0.0120 0.001 12.919 0.000 0.010 0.014
free_kick_accuracy 0.0127 0.001 15.368 0.000 0.011 0.014
long_passing 0.0076 0.001 6.926 0.000 0.005 0.010
ball_control 0.1346 0.002 73.138 0.000 0.131 0.138
acceleration 0.0069 0.002 4.313 0.000 0.004 0.010
sprint_speed 0.0094 0.002 6.209 0.000 0.006 0.012
agility -0.0088 0.001 -7.501 0.000 -0.011 -0.007
reactions 0.2116 0.001 168.101 0.000 0.209 0.214
balance 0.0089 0.001 9.804 0.000 0.007 0.011
shot_power 0.0172 0.001 16.938 0.000 0.015 0.019
jumping 0.0144 0.001 17.051 0.000 0.013 0.016
stamina -0.0050 0.001 -5.305 0.000 -0.007 -0.003
strength 0.0605 0.001 62.246 0.000 0.059 0.062
long_shots -0.0140 0.001 -13.220 0.000 -0.016 -0.012
aggression 0.0210 0.001 26.371 0.000 0.019 0.023
interceptions 0.0126 0.001 14.737 0.000 0.011 0.014
positioning -0.0106 0.001 -11.539 0.000 -0.012 -0.009
vision -0.0018 0.001 -1.782 0.075 -0.004 0.000
penalties 0.0143 0.001 16.457 0.000 0.013 0.016
marking 0.0324 0.001 24.892 0.000 0.030 0.035
standing_tackle 0.0025 0.002 1.669 0.095 -0.000 0.005
sliding_tackle -0.0283 0.001 -22.079 0.000 -0.031 -0.026
gk_diving 0.1619 0.002 95.462 0.000 0.159 0.165
gk_handling 0.0318 0.002 14.292 0.000 0.027 0.036
gk_kicking -0.0337 0.001 -46.584 0.000 -0.035 -0.032
gk_positioning 0.0575 0.002 25.992 0.000 0.053 0.062
gk_reflexes 0.0251 0.002 11.563 0.000 0.021 0.029
Omnibus: 5554.938 Durbin-Watson: 1.999
Prob(Omnibus): 0.000 Jarque-Bera (JB): 12606.865
Skew: -0.258 Prob(JB): 0.00
Kurtosis: 4.404 Cond. No. 4.66e+03


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.66e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
In [78]:
y_prediction = reg.predict(X_test)
score = reg.score(X_test, y_test)
print("score: " , score)
score:  0.8428583351656973

By look at the p values of the different attributes we can confidently reject the null hypothesis as we can see that the p values are less than 0.05, meaning that there is a relationship between player rating and attribute scores. From the coefficients we can see though we can see that some attributes are more heavily related to the rating the player receives that others. Lastly we find that we are able to predict the player's rating correctly approximately 84% percent of the time based on the attributes passed into the regression.


Determining if there is a linear relationship between a team's winning percentage(team) and team's attributes documented in the dataset

Null hypothesis 2: There is no relation between the winning percentage and other features of team.

Alternative Hypothesis 2: There is a relation between the winning percentage and other features of team.

In [79]:
team_goal_won_total['season_start'] = team_goal_won_total['season'].apply(lambda x: x.split('-')[0])
team_goal_won_total['season_end'] = team_goal_won_total['season'].apply(lambda x: x.split('-')[1])
team_goal_won_total['season_start'] = team_goal_won_total['season_start'].astype(int)
In [80]:
selected_col = [
       'team_name', 'date', 'buildUpPlaySpeed', 'buildUpPlayDribbling',
       'buildUpPlayPassing', 'chanceCreationPassing', 'chanceCreationCrossing', 'chanceCreationShooting', 'defencePressure',
       'defenceAggression', 'defenceTeamWidth']
df = team_attribute[selected_col]
df = df.dropna()
df['year'] = pd.DatetimeIndex(df['date']).year
df = df.drop(['date'], axis=1)
In [81]:
#match season start and team name
all_team_info = pd.merge(df, team_goal_won_total,  how='inner', 
                         left_on=['team_name','year'], right_on = ['team_name','season_start'])
all_team_info = all_team_info.drop(['season_start', 'season_end', 'team_name', 'year', 'leagueName', 'season', 'game_count_total',
       'game_count_as_hometeam', 'game_count_as_awayteam', 'win_count_total',
       'win_count_as_hometeam', 'win_count_as_awayteam', 'goal_count_total',
       'goal_count_as_hometeam', 'goal_count_as_awayteam'], axis=1)
all_team_info['winning_percentage'] = all_team_info['winning_percentage'].astype(float)

all_team_info.head()
# This table will have the winnign_percentage and team attributes. 
Out[81]:
buildUpPlaySpeed buildUpPlayDribbling buildUpPlayPassing chanceCreationPassing chanceCreationCrossing chanceCreationShooting defencePressure defenceAggression defenceTeamWidth winning_percentage
0 52 48.0 56 54 63 64 47 44 54 16.666667
1 58 64.0 62 56 68 57 41 42 60 60.526316
2 59 64.0 53 51 72 63 49 45 63 57.894737
3 48 70.0 52 66 36 75 58 57 49 34.210526
4 48 70.0 52 66 36 69 58 57 49 39.473684
In [82]:
y_target = all_team_info['winning_percentage']
X = all_team_info.iloc[:,0:-1]
In [83]:
X_train, X_test, y_train, y_test = train_test_split(X,y_target)
reg = LinearRegression()
reg.fit(X_train,y_train)
print("coef_: " , reg.coef_)
print("intercept_: " , reg.intercept_)
coef_:  [-0.00062286 -0.00068492 -0.30921599  0.08060845  0.27237476  0.01387741
  0.30282332  0.28520692  0.02394299]
intercept_:  4.200224758849281
In [84]:
y_prediction = reg.predict(X_test)
score = reg.score(X_test, y_test)
print("score: " , score)
score:  0.09874665373937364
In [85]:
X1=sm.add_constant(X_train) 
model= sm.OLS(y_train, X1.astype(float)).fit()
model.summary()
Out[85]:
OLS Regression Results
Dep. Variable: winning_percentage R-squared: 0.162
Model: OLS Adj. R-squared: 0.132
Method: Least Squares F-statistic: 5.405
Date: Sat, 19 Dec 2020 Prob (F-statistic): 8.89e-07
Time: 04:13:55 Log-Likelihood: -1066.8
No. Observations: 262 AIC: 2154.
Df Residuals: 252 BIC: 2189.
Df Model: 9
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 4.2002 11.959 0.351 0.726 -19.352 27.753
buildUpPlaySpeed -0.0006 0.102 -0.006 0.995 -0.202 0.200
buildUpPlayDribbling -0.0007 0.099 -0.007 0.995 -0.196 0.195
buildUpPlayPassing -0.3092 0.098 -3.146 0.002 -0.503 -0.116
chanceCreationPassing 0.0806 0.096 0.836 0.404 -0.109 0.270
chanceCreationCrossing 0.2724 0.097 2.797 0.006 0.081 0.464
chanceCreationShooting 0.0139 0.089 0.155 0.877 -0.162 0.190
defencePressure 0.3028 0.117 2.595 0.010 0.073 0.533
defenceAggression 0.2852 0.129 2.218 0.027 0.032 0.538
defenceTeamWidth 0.0239 0.133 0.180 0.857 -0.237 0.285
Omnibus: 14.028 Durbin-Watson: 2.171
Prob(Omnibus): 0.001 Jarque-Bera (JB): 14.884
Skew: 0.577 Prob(JB): 0.000586
Kurtosis: 3.176 Cond. No. 2.04e+03


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.04e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

From the results of the linear regression, it seems as though most of the teams attributes could not be found to have a strong relationship with that team's winning percentage. As we can see all but the p values of chanceCreationCrossing, defencePressure, buildUpPlayPassing, and defenceAggression were less than 0.05. As well, the model scores very low in it's prediction of the win percentage when giving the team's attributes.


When attempting to predict the winning percentage with only the attributes which has p value less than 0.05

In [86]:
X_new = X[['chanceCreationCrossing', 'defencePressure', 'buildUpPlayPassing', 'defenceAggression']]
X_train, X_test, y_train, y_test = train_test_split(X_new, y_target)
reg = LinearRegression()
reg.fit(X_train,y_train)
print("coef_: " , reg.coef_)
print("intercept_: " , reg.intercept_)
coef_:  [ 0.32497495  0.32966984 -0.21479505  0.22177895]
intercept_:  4.502832143242543
In [87]:
y_prediction = reg.predict(X_test)
score = reg.score(X_test, y_test)
print("score: " , score)
score:  0.13486437688212916
In [88]:
X2=sm.add_constant(X_train) 
model= sm.OLS(y_train, X2.astype(float)).fit()
model.summary()
Out[88]:
OLS Regression Results
Dep. Variable: winning_percentage R-squared: 0.143
Model: OLS Adj. R-squared: 0.130
Method: Least Squares F-statistic: 10.72
Date: Sat, 19 Dec 2020 Prob (F-statistic): 4.73e-08
Time: 04:13:59 Log-Likelihood: -1056.8
No. Observations: 262 AIC: 2124.
Df Residuals: 257 BIC: 2141.
Df Model: 4
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 4.5028 8.128 0.554 0.580 -11.502 20.508
chanceCreationCrossing 0.3250 0.090 3.596 0.000 0.147 0.503
defencePressure 0.3297 0.106 3.123 0.002 0.122 0.538
buildUpPlayPassing -0.2148 0.089 -2.420 0.016 -0.390 -0.040
defenceAggression 0.2218 0.114 1.938 0.054 -0.004 0.447
Omnibus: 18.912 Durbin-Watson: 1.960
Prob(Omnibus): 0.000 Jarque-Bera (JB): 20.841
Skew: 0.665 Prob(JB): 2.98e-05
Kurtosis: 3.372 Cond. No. 943.


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The score is still very low, in other word this model describe only about 13% of the data. But the coef of those attributes are not 0 because the p values are less than 0.05.

===========================================================================================================================

Logistic regression prediction on a given match for teams home play win based on the overall rating of starting players for the home team.

creating a logistic regression on matchs that teams either win, loss or draw based on the starting players overall rating for the year.

Null hypothesis 1: There is no strong relation between the players rating of the team and the teams win chance for a given match

Alternative Hypothesis 1: There is a strong relation between the overall player rating of a team for a given match and chance of winning

In [89]:
# Getting the columns needed/ Data processing

matches_and_players= matches_copy[['league_id', 'season', 'date', 'match_api_id', 'home_team_api_id', 'away_team_api_id', 
                            'home_player_1','home_player_2', 'home_player_3','home_player_4','home_player_5','home_player_6',
                            'home_player_7','home_player_8','home_player_9','home_player_10','home_player_11',
                            'away_player_1', 'away_player_2', 'away_player_3', 'away_player_4', 'away_player_5', 'away_player_6',
                            'away_player_7', 'away_player_8', 'away_player_9', 'away_player_10' , 'away_player_11',
                            'home_team_goal', 'away_team_goal']]
# clean up the data and 
# create label to who won the game. 1 home wins 0 drow and -1 away wins
matches_and_players.dropna(inplace=True)
matches_and_players['win_loss_drow']= np.where( matches_and_players['home_team_goal']==matches_and_players['away_team_goal'], 
                                                0, np.where(matches_and_players['home_team_goal']>matches_and_players['away_team_goal'],
                                                1, -1))

# melting the data sho each player is associated with team 
home_team_players= pd.melt(matches_and_players, id_vars=['league_id','season', 'date', 'match_api_id', 'home_team_api_id', 'win_loss_drow'], value_vars=['home_player_1','home_player_2', 'home_player_3','home_player_4',
                                                                     'home_player_5','home_player_6','home_player_7','home_player_8',
                                                                     'home_player_9','home_player_10','home_player_11'], 
                      ignore_index=False, var_name='home_players', value_name='players_id')
# convert the date col to the python datetime 
home_team_players['date'] = pd.to_datetime(home_team_players['date'])
home_team_players.head()
Out[89]:
league_id season date match_api_id home_team_api_id win_loss_drow home_players players_id
145 1 2008/2009 2009-02-27 493017 8203 1 home_player_1 38327.0
153 1 2008/2009 2009-03-08 493025 9984 -1 home_player_1 36835.0
155 1 2008/2009 2009-03-07 493027 8635 1 home_player_1 34480.0
162 1 2008/2009 2009-03-13 493034 8203 1 home_player_1 38327.0
168 1 2008/2009 2009-03-14 493040 10000 0 home_player_1 37900.0
In [90]:
# create a dataframe for players and there overall rating with respect to time.
# extract the year from the given date
player_rating= player_Attributes[['player_api_id', 'date', 'overall_rating']]
player_rating['year'] = pd.DatetimeIndex(player_rating['date']).year
player_rating.head()
Out[90]:
player_api_id date overall_rating year
0 505942 2016-02-18 67.0 2016
1 505942 2015-11-19 67.0 2015
2 505942 2015-09-21 62.0 2015
3 505942 2015-03-20 61.0 2015
4 505942 2007-02-22 61.0 2007
In [91]:
# group them by the year and player_id
player_rating= player_rating.groupby(['player_api_id', 'year']).mean()
player_rating.head()
Out[91]:
overall_rating
player_api_id year
2625 2007 63.0
2008 60.0
2010 60.0
2011 58.5
2012 58.0
In [92]:
# convert player id from float to int. so that it is easy for comparison 
home_team_players['players_id']= home_team_players['players_id'].astype(np.int64)
# extract year from date 
home_team_players['year'] = pd.DatetimeIndex(home_team_players['date']).year

# Connecting matches teams and players. 
matches_teams_players = pd.merge(home_team_players, player_rating,  how='left', 
                       left_on=['players_id', 'year'], 
                       right_on = ['player_api_id', 'year'])
print(matches_teams_players.shape)
# for the nan values for players ratting replace them with average value of the team for the given match
matches_teams_players['overall_rating']= matches_teams_players.groupby('match_api_id', 
                               )['overall_rating'].apply(lambda x: x.fillna(x.mean()))
print(matches_teams_players.shape)
# group the home players by match id 
grouped_matches_teams_players= matches_teams_players.groupby('match_api_id')
(235114, 10)
(235114, 10)
In [93]:
matches_teams_players.head()
Out[93]:
league_id season date match_api_id home_team_api_id win_loss_drow home_players players_id year overall_rating
0 1 2008/2009 2009-02-27 493017 8203 1 home_player_1 38327 2009 61.0
1 1 2008/2009 2009-03-08 493025 9984 -1 home_player_1 36835 2009 64.0
2 1 2008/2009 2009-03-07 493027 8635 1 home_player_1 34480 2009 67.0
3 1 2008/2009 2009-03-13 493034 8203 1 home_player_1 38327 2009 61.0
4 1 2008/2009 2009-03-14 493040 10000 0 home_player_1 37900 2009 64.0
In [94]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import MinMaxScaler

def process_logistic_reg(grouped_data_frame):
    
    home_team_X_values= list()
    home_team_Y_values= list()
    for index, value in grouped_data_frame:
    #     home_team_X_values.append( [round(num,3)  for num in value.overall_rating.tolist()])
        home_team_X_values.append(value.overall_rating.tolist())
        home_team_Y_values.append(value.win_loss_drow.tolist()[0])
        
# making it binary classification    win=1 draw/loss= 0
    home_team_y_binary= [0 if x==-1 else x for x in home_team_Y_values]
        
    X_train,X_test,y_train,y_test = train_test_split(home_team_X_values, home_team_y_binary, test_size=0.25, random_state=0)

    scaler = MinMaxScaler(feature_range=(0, 1.5))
    scaler.fit(X_train)
    scaler.fit(X_test)
    normalized_train = scaler.transform(X_train)
    normalized_test = scaler.transform(X_test)

    kf = KFold(n_splits = 10)

    logistic_regression= LogisticRegression(multi_class='ovr', solver='liblinear') 
    scores = cross_val_score(logistic_regression, normalized_train, y_train, cv = kf)
    print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

    # Do logistic regresssion on the train data
    logistic_regression.fit(normalized_train, y_train)
    
    y_pred=logistic_regression.predict(normalized_test)      # prediction using the X test data 
    
    same_values= [i for i, j in zip(y_test, y_pred) if i == j]  # getting values that are the same prediction == actual

    accuracy = len(same_values)/ len(y_pred) *100     # this is the percetage that the prediction is correct 
    print("Accuracy of prediction: " + str(accuracy))
    
    #log_reg = sm.MNLogit(y_train, X_train).fit() 
    log_reg= sm.Logit(y_train, X_train).fit()
    
    print(log_reg.summary())
In [95]:
# logistic regression on overall Europe team's home match win vs the starting players ratting. If the p value of this
# outcome is low then this can be a good classification prediction for wchich one team can win a match based on the
# starting players ratting. If the p value is high then we can acept the null hypothisis of being there is less relation
# between the win and starting players rating.

process_logistic_reg(grouped_matches_teams_players)
Accuracy: 0.59 (+/- 0.02)
Accuracy of prediction: 58.75748502994011
Optimization terminated successfully.
         Current function value: 0.688455
         Iterations 4
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                      y   No. Observations:                16030
Model:                          Logit   Df Residuals:                    16019
Method:                           MLE   Df Model:                           10
Date:                Sat, 19 Dec 2020   Pseudo R-squ.:               0.0005141
Time:                        04:14:41   Log-Likelihood:                -11036.
converged:                       True   LL-Null:                       -11042.
Covariance Type:            nonrobust   LLR p-value:                    0.3307
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
x1            -0.0166      0.004     -4.128      0.000      -0.025      -0.009
x2            -0.0109      0.005     -2.406      0.016      -0.020      -0.002
x3            -0.0108      0.005     -2.308      0.021      -0.020      -0.002
x4             0.0056      0.004      1.238      0.216      -0.003       0.014
x5            -0.0104      0.004     -2.382      0.017      -0.019      -0.002
x6             0.0070      0.004      1.570      0.116      -0.002       0.016
x7            -0.0003      0.004     -0.075      0.940      -0.009       0.008
x8             0.0061      0.004      1.398      0.162      -0.002       0.015
x9             0.0057      0.004      1.377      0.168      -0.002       0.014
x10            0.0115      0.004      2.745      0.006       0.003       0.020
x11            0.0106      0.004      2.497      0.013       0.002       0.019
==============================================================================

As shown above the logistic regression model to predict the winning chance of home team is about 59% accurate. This shows that it would be really hard to determine the outcome the game with just the overall players rating of the team. Although some of the p values for few players are in good condition. They are less than 0.05. It shows that players x1,x2,x3,x10,x11 have a significant values in this model. This makes sense as these players are goalkeeper, defenders and attackers.

In [96]:
# do logistic regression on some of the well known leagues such as England and Spanish leagues and see the outcome
league_match_grouped= matches_teams_players.groupby(['league_id'])
# Spanish league id= 21518
England_matches= league_match_grouped.get_group(21518).groupby('match_api_id')
process_logistic_reg(England_matches)
Accuracy: 0.63 (+/- 0.05)
Accuracy of prediction: 59.97045790251108
Optimization terminated successfully.
         Current function value: 0.680998
         Iterations 4
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                      y   No. Observations:                 2030
Model:                          Logit   Df Residuals:                     2019
Method:                           MLE   Df Model:                           10
Date:                Sat, 19 Dec 2020   Pseudo R-squ.:                 0.01613
Time:                        04:14:43   Log-Likelihood:                -1382.4
converged:                       True   LL-Null:                       -1405.1
Covariance Type:            nonrobust   LLR p-value:                 1.893e-06
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
x1            -0.0230      0.012     -1.987      0.047      -0.046      -0.000
x2            -0.0103      0.014     -0.749      0.454      -0.037       0.017
x3            -0.0222      0.015     -1.516      0.130      -0.051       0.006
x4            -0.0294      0.014     -2.060      0.039      -0.057      -0.001
x5            -0.0176      0.013     -1.327      0.184      -0.044       0.008
x6            -0.0116      0.014     -0.815      0.415      -0.039       0.016
x7            -0.0037      0.015     -0.254      0.799      -0.032       0.025
x8             0.0174      0.013      1.350      0.177      -0.008       0.043
x9            -0.0019      0.012     -0.157      0.875      -0.026       0.022
x10            0.0584      0.012      4.853      0.000       0.035       0.082
x11            0.0411      0.012      3.307      0.001       0.017       0.065
==============================================================================

In the above logistic regression, we have taken only the Spanish La Liga to see if there are some changes for the overall logistic regression model. Which has a little higher accuracy with 63% accuracy. As the same as the overall European league logistic regression model this one also show the significant of the keeper and attackers with defenders having less accurate p value.

Away team logistic regresion

In [97]:
# melting the data sho each player is associated with team 
away_team_players= pd.melt(matches_and_players, id_vars=['league_id','season', 'date', 'match_api_id', 'away_team_api_id', 'win_loss_drow'], 
                           value_vars=['away_player_1','away_player_2', 'away_player_3','away_player_4',
                                                                     'away_player_5','away_player_6','away_player_7','away_player_8',
                                                                     'away_player_9','away_player_10','away_player_11'], 
                      ignore_index=False, var_name='away_players', value_name='players_id')
# convert the date col to the python datetime 
away_team_players['date'] = pd.to_datetime(away_team_players['date'])
away_team_players['year'] = pd.DatetimeIndex(away_team_players['date']).year

# Connecting matches teams and players. 
matches_teams_players_away = pd.merge(away_team_players, player_rating,  how='left', 
                       left_on=['players_id', 'year'], 
                       right_on = ['player_api_id', 'year'])

# for the nan values for players ratting replace them with average value of the team for the given match
matches_teams_players_away['overall_rating']= matches_teams_players_away.groupby('match_api_id', 
                               )['overall_rating'].apply(lambda x: x.fillna(x.mean()))
# group the home players by match id 
grouped_matches_teams_players_away= matches_teams_players_away.groupby('match_api_id')

process_logistic_reg(grouped_matches_teams_players_away)
Accuracy: 0.55 (+/- 0.02)
Accuracy of prediction: 54.64071856287425
Optimization terminated successfully.
         Current function value: 0.687124
         Iterations 4
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                      y   No. Observations:                16030
Model:                          Logit   Df Residuals:                    16019
Method:                           MLE   Df Model:                           10
Date:                Sat, 19 Dec 2020   Pseudo R-squ.:                0.002447
Time:                        04:15:19   Log-Likelihood:                -11015.
converged:                       True   LL-Null:                       -11042.
Covariance Type:            nonrobust   LLR p-value:                 4.777e-08
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
x1             0.0018      0.004      0.451      0.652      -0.006       0.010
x2             0.0073      0.004      1.646      0.100      -0.001       0.016
x3            -0.0045      0.005     -0.965      0.335      -0.014       0.005
x4            -0.0018      0.004     -0.396      0.692      -0.010       0.007
x5             0.0056      0.004      1.292      0.196      -0.003       0.014
x6             0.0042      0.004      0.955      0.340      -0.004       0.013
x7            -0.0023      0.004     -0.544      0.587      -0.011       0.006
x8            -0.0075      0.004     -1.732      0.083      -0.016       0.001
x9             0.0021      0.004      0.501      0.616      -0.006       0.010
x10           -0.0144      0.004     -3.493      0.000      -0.022      -0.006
x11            0.0069      0.004      1.623      0.105      -0.001       0.015
==============================================================================

Above, we have also did logistic regression model to see away's team winning chance prediction. This model shows that an even lesses accuracy. In the away team model there is only one significat player in the team that has less than 0.05 p value, which is attacker.

Insight

At last, we come to the last step of the data lifecycle, for this step we draw our conclusions based off the insight we have gained from our analysis of the data.

From the exploratory analysis, we guessed there is linear relationship between overall player ratting and each players attributes. This model can describe 85% of data according to the linear regression score. From the statistic summery we got valid coefficients for each attribute with p value less than 0.05. As a result, we can reject the null hypothesis of there is no relationship between overall rating of players and player attributes.

However, we cannot see direct relationship in team attributes from the graph of the comparison of best two teams of each league, so we did linear regression model. But the model's score is too low to predict winning percentage from team attributes with the given dataset. There are some attributes that have low p value that are significant to this model. we cannot reject the null hypothesis

Based on the logistic regression model we have done, we can accept the null hypothesis of no strong relation between team players overall rating and winning chance of given match. This does not mean there is no relation! It is just we cannot reject the null hypostasis as the p values of the prediction model are higher, which makes it hard to predict the outcome of the match.