Erick Herrera, Hyejeong Kim, Kibreab Gebrehiwot
As the popularity of machine learning has drastically increased over the last decade, so has the demand for data literacy as a skill in the job market. This tutorial aims to give a general introduction to data science by walking the data science lifecycle (Collecting data, Cleaning data, Exploring data, Building models, and Interpretation) as we try to analyze the data within our chosen database.
We will be using a dataset based on the European fútball 2008-2016 seasons. This dataset is comprised of the information on 11 different European fútball league and their teams. Specifically, the data contains 7 tables (Country, League, Player, player attributes, team, team attributes, and all the matches played during 2008 - 2016 seasons. The goal is to some insight on the beautiful game based on the data held by this dataset.
For this tutorial, We will be using the Pandas library to read and organize the data we are collecting, the Matplot library to render plots for our visualizations of the dataset, and SciKit-Learn library for building our models.
During our exploratory analysis & Data visualization step, some of the questions that we thought would be interesting to try to answer were:
For our models in this tutorial, we will be presenting two linear regression examples:
As well, a logistical regression in the form of:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
import re
import sqlite3
import matplotlib.pyplot as plt
import warnings
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
import statsmodels.api as sm
import seaborn as sn
The first step of the data lifescyle is the collecting phase. For this dataset, it is as simple as downloading the sql file from the database's host https://www.kaggle.com/hugomathien/soccer and performing SQL commands directly on the database to retrive the tables needed for the analysis of the data. The database contains 7 different tables, as such we created 7 variables to hold each the tables before we begin cleaning up.
sqlite_file = 'database.sqlite'
conn = sqlite3.connect(sqlite_file) # connect to the database
query = "SELECT * FROM Country"
countries = pd.read_sql(query, conn) # read and store country table
query = "SELECT * FROM League"
leagues = pd.read_sql(query, conn) # read and store league table
query = "SELECT * FROM Match"
matches = pd.read_sql(query, conn) # read and store match table
query = "SELECT * FROM Player"
players = pd.read_sql(query, conn) # read and store player table
query = "SELECT * FROM Player_Attributes"
player_Attributes = pd.read_sql(query, conn) # read and store player_attributes table
query = "SELECT * FROM Team"
teams = pd.read_sql(query, conn) # read and store team table
query = "SELECT * FROM Team_Attributes"
team_Attributes = pd.read_sql(query, conn) # read and store team_Attributs table
The tables have been stored in panda dataframes and each can be seen below.
countries.head()
leagues.head()
# Creating copy of match table for later use
matches_copy= matches.copy()
# getting the whole column names from match table
my_list = list(matches)
my_list = matches.columns.values.tolist()
matches.head()
players.head()
player_Attributes.head()
teams.head()
team_Attributes.head()
Now with our data retrieved and stored, our next step (and the next step of the data lifecycle) is to tidy up the raw data to make it more readable and easier for us to use for our analysis and visualizations. To accomplish this, we will be merging a few of the tables together to make seeing useful insights in the data easier and we will be drop some of the columns from all the tables to remove irrelevant information.
cursor = conn.cursor()
Country and league tables can be combined into one so we can avoid having to link the two by their country id.
result = cursor.execute("""SELECT Country.id, Country.name, League.name FROM Country INNER JOIN
League ON Country.ID = League.ID""")
countryLeague = pd.DataFrame(result.fetchall())
countryLeague.columns = ['id', 'countryName', 'leagueName']
countryLeague.head()
For tidying the matches table we remove the null columns in the table, we split the season of the match into two columns (season_start and season_end) and we insert the information we gained from merging the country and league tables above to make recognizing who the match belongs to easier
matches.dropna(axis = 1, inplace = True)
matches['date'] = pd.to_datetime(matches['date'])
matches['season_start'] = matches['season'].apply(lambda x: x.split('/')[0])
matches['season_end'] = matches['season'].apply(lambda x: x.split('/')[1])
matches.drop(columns = ['season'], axis = 1, inplace = True)
matchCountry = matches.merge(countryLeague, left_on = 'country_id', right_on = 'id')
matchCountry.drop(columns = ['league_id', 'id_y'], axis = 1, inplace = True)
matchCountry = matchCountry.rename(columns={"id_x": "id", "country_id": "country_league_id"})
matchCountry.head()
Taking it a step further, here we're including the actual names of the teams next to their team ids to make things more legible and introducing a column to keep track of who won the match in that entry.
team_Attributes['date'] = pd.to_datetime(team_Attributes['date'])
team_match = matchCountry.merge(teams, left_on = 'home_team_api_id', right_on = 'team_api_id')
team_match = team_match.rename(columns={"id_x": "id", "team_long_name": "home_team", "team_fifa_api_id": "home_team_fifa_api_id"})
team_match['home_team'] = team_match['home_team'] + " (" + team_match['team_short_name'] + ")"
team_match.drop(columns = ['id_y', 'team_api_id', 'team_short_name'], axis = 1, inplace = True)
team_match = team_match.merge(teams, left_on = 'away_team_api_id', right_on = 'team_api_id')
team_match = team_match.rename(columns={"id_x": "id", "team_long_name": "away_team", "team_fifa_api_id": "away_team_fifa_api_id"})
team_match['away_team'] = team_match['away_team'] + " (" + team_match['team_short_name'] + ")"
team_match.drop(columns = ['id_y', 'team_api_id', 'team_short_name'], axis = 1, inplace = True)
team_match = team_match[['id', 'match_api_id', 'date', 'home_team', 'home_team_api_id', 'home_team_fifa_api_id',
'home_team_goal', 'away_team', 'away_team_api_id', 'away_team_fifa_api_id', 'away_team_goal',
'country_league_id', 'stage', 'season_start', 'season_end', 'countryName', 'leagueName']]
winner = []
for index, row in team_match.iterrows():
if team_match.iloc[index, 6] > team_match.iloc[index, 10]:
winner.append(team_match.iloc[index, 3])
elif team_match.iloc[index, 6] < team_match.iloc[index, 10]:
winner.append(team_match.iloc[index, 7])
else:
winner.append("Tie")
team_match['winner'] = winner
team_match.drop(columns = ['home_team_fifa_api_id', 'away_team_fifa_api_id'], axis = 1, inplace = True)
team_match.head()
Here we are merging the player and the player attributes tables into one while dropping any unnecessary columns.
players['birthday'] = pd.to_datetime(players['birthday'])
player_Attributes['date'] = pd.to_datetime(player_Attributes['date'])
player_attribute = players.merge(player_Attributes, left_on = 'player_api_id', right_on = 'player_api_id')
player_attribute.drop(columns = ['id_y', 'player_fifa_api_id_y'], axis = 1, inplace = True)
player_attribute = player_attribute.rename(columns={"id_x": "id", "player_fifa_api_id_x": "player_fifa_api_id"})
player_attribute.head()
Similarly to what we did with the player and player_attributes tables, here we are merging the team and team_attributes and then sorting them by name alphabetically.
team_attribute = team_Attributes.merge(teams, left_on = 'team_api_id', right_on = 'team_api_id')
team_attribute = team_attribute.rename(columns={"id_x": "id", "team_long_name": "team_name", "team_fifa_api_id_x": "team_fifa_api_id"})
team_attribute['team_name'] = team_attribute['team_name'] + " (" + team_attribute['team_short_name'] + ")"
team_attribute.drop(columns = ['id_y', 'team_fifa_api_id_y', 'team_short_name'], axis = 1, inplace = True)
team_attribute.head()
cols = list(team_attribute.columns)
cols = [cols[0]] + [cols[-1]] + cols[1:-1]
team_attribute = team_attribute[cols]
team_attribute.sort_values(by='team_name', ascending=False)
team_attribute.head()
Now with our data in order, we can begin the next step of the data lifecycle - data exploration and visualization. This following section will accomplish this section by running the gaunlet of questions we about the dataset as varies examples of the kinds of exploratory analysis that can be performed on data.
pd.set_option('mode.chained_assignment', None)
df = team_match[['season_start', 'home_team', 'home_team_goal', 'away_team', 'away_team_goal', 'leagueName']]
df['season_start'] = df['season_start'].astype(int)
Firstly, we are starting here by getting total number of goal for each team as home team and as away team each season for each league.
num_goal_home = df.groupby(['leagueName', 'season_start', 'home_team']).sum()
num_goal_home.drop(columns = ['away_team_goal'], axis = 1, inplace = True)
num_goal_home.head()
num_goal_away = df.groupby(['leagueName', 'season_start', 'away_team']).sum()
num_goal_away.drop(columns = ['home_team_goal'], axis = 1, inplace = True)
num_goal_away.head()
From there, we can just combine these columns to get the total number of goals for each team.
num_goal = num_goal_home.merge(num_goal_away, left_index=True, right_on=['leagueName', 'season_start', 'away_team'])
num_goal['total_num_goal'] = num_goal['home_team_goal'] + num_goal['away_team_goal']
num_goal.head()
A = num_goal # A is for later use
We then divided the teams by thier leagues and sorted the teams by their goal count.
num_goal.reset_index(inplace=True)
num_goal_highest = num_goal.loc[num_goal.groupby(["leagueName", "season_start"])["total_num_goal"].idxmax()]
num_goal_highest = num_goal_highest.rename(columns={"away_team": "team_name"})
num_goal_highest.drop(columns = ['home_team_goal', 'away_team_goal'], axis = 1, inplace = True)
num_goal_highest.reset_index(drop=False, inplace=True)
num_goal_highest
for i, row in num_goal_highest.iterrows():
num_goal_highest.iloc[i, 2] = str(num_goal_highest.iloc[i, 2]) + " - " + str(num_goal_highest.iloc[i, 2] + 1)
Now we can produce visualizations of these goal counts and see which teams held the most goals each season in each league.
%matplotlib inline
leagues = num_goal_highest.leagueName.unique()
count = 0
for league in leagues:
new_df = num_goal_highest[num_goal_highest['leagueName'] == league]
labels = new_df.season_start
x = np.arange(len(labels))
goals = new_df.total_num_goal
width = 0.5
fig, ax = plt.subplots()
rects = ax.bar(x , goals, width)
ax.set_ylabel('num goals')
ax.set_xlabel('season')
ax.set_title("team of highest goal per season of " + league)
ax.set_xticks(x)
ax.set_xticklabels(labels, fontsize=6)
teams_label = new_df.team_name
for rect in rects:
team_name = teams_label[count].split("(")[0]
height = rect.get_height() +1
ax.text(rect.get_x() + width*0.5, 0.7*height, team_name,
ha='center', va='bottom', fontsize=7)
count = count +1
plt.show()
Very similarly to what we did for the previous question, we will first aggregate each teams wins.
df = team_match[team_match['winner'] != 'Tie'] # find rows with winner is not equal to Tie
df = df[['leagueName', 'season_start', 'winner']]
df['season_start'] = df['season_start'].astype(int)
Below we can see the teams with the most wins for each leaque over the years that the data set covers.
most_wins = df.groupby(['leagueName', 'season_start', 'winner'])['winner'].count().sort_values(ascending=False)
most_wins.head()
Now we will be dividing them again to find the teams with the most wins in each league over the years.
leagues = []
seasons = []
teams = []
count_wins = []
for index, value in most_wins.items():
leagues.append(index[0])
season_full = str(index[1]) + " - " + str(index[1] + 1)
seasons.append(season_full)
teams.append(index[2])
count_wins.append(value)
d = {'league':leagues,'season':seasons,'team_name':teams,'count_wins':count_wins}
most_wins = pd.DataFrame(d)
B = most_wins
most_wins = most_wins.loc[most_wins.groupby(["league", "season"])["count_wins"].idxmax()]
most_wins.reset_index(drop=False, inplace=True)
most_wins.head(10)
Now we can produce visualizations of these win counts and see which teams held the most wins each season in each league.
leagues = most_wins.league.unique()
count = 0
for league in leagues:
new_df = most_wins[most_wins['league'] == league]
labels = new_df.season
x = np.arange(len(labels))
count_wins = new_df.count_wins
width = 0.35
fig, ax = plt.subplots()
rects = ax.bar(x , count_wins, width)
ax.set_ylabel('num wins')
ax.set_xlabel('season')
ax.set_title("team of most won per season of " + league)
ax.set_xticks(x)
ax.set_xticklabels(labels, fontsize=6)
teams_label = new_df.team_name
for rect in rects:
team_name = teams_label[count].split("(")[0]
height = rect.get_height() +1
ax.text(rect.get_x()+ width*0.5, 0.7*height, team_name,
ha='center', va='bottom', fontsize=7)
count = count +1
plt.show()
With both these sets of data retieved, we can attempt some cross analysis and compare the goals scored by each team to the wins each team then holds.
df = team_match[team_match['leagueName'] == 'Belgium Jupiler League']
df = df[df['season_start'] == '2013']
df.head()
num_goal_highest.head()
most_wins.head()
team_goal_won = pd.merge(num_goal_highest, most_wins, how='left',
left_on=['leagueName','season_start'], right_on = ['league','season'])
team_goal_won = team_goal_won.drop(['index_x', 'league', 'season', 'index_y'], axis = 1)
team_goal_won = team_goal_won.rename(columns={"team_name_x": "team_highest_goal", "team_name_y": "team_most_won",
"season_start": "season"})
team_goal_won.head()
goal_won_same = team_goal_won[team_goal_won['team_highest_goal'] == team_goal_won['team_most_won']]
percent = goal_won_same.shape[0] / team_goal_won.shape[0]
percent * 100
From this analysis, we find that the team with the highest goals won the most games in a season roughly 64%.
We then can create a table with total goal number and total win count together for the best team per season and per league.
# A with the copy of the number of goals table from earlier
for index, row in A.iterrows():
A.iloc[index, 1] = str(A.iloc[index, 1]) + " - " + str(A.iloc[index, 1] + 1)
A.head()
A = A.rename(columns={"away_team": "team_name"})
A.drop(columns = ['home_team_goal', 'away_team_goal'], axis = 1, inplace = True)
A.head()
B.head()
team_goal_won_total = pd.merge(A, B, how='left',
left_on=['leagueName','season_start', 'team_name'], right_on = ['league','season', 'team_name'])
team_goal_won_total = team_goal_won_total.drop(['league', 'season'], axis = 1)
team_goal_won_total = team_goal_won_total.rename(columns={"season_start": "season"})
team_goal_won_total.head()
warnings.simplefilter(action='ignore', category=FutureWarning)
Here, we are aggregating and matching up the two best teams (based on each team's number of wins) of each league.
com_teams = []
leagues = most_wins.league.unique()
for league in leagues:
new_df = team_goal_won[team_goal_won['leagueName'] == league]
a = new_df['team_most_won'].value_counts()
com_teams.append([league, a.index[0], a.index[1]])
df = pd.DataFrame(com_teams, columns = ['leangue', 'team1', 'team2'])
df.head()
Now we can begin to visualize the differences in each of the team's attributes and compare to each other to see if we notice any trends in the two best teams that might account for their success.
# Belgium Jupiler League
compare_attr = pd.DataFrame(data=team_attribute.query('team_name == "RSC Anderlecht (AND)"').mean(),
columns=['RSC Anderlecht (AND)'])
compare_attr['Club Brugge KV (CLB)'] = team_attribute.query('team_name == "Club Brugge KV (CLB)"').mean()
compare_attr = compare_attr[3:]
compare_attr.plot.barh(width=0.50, figsize=(12,8))
plt.title('comparing best two teams of Belgium Jupiler League')
plt.ylabel('Attribute type')
plt.xlabel('Attribute value')
plt.legend(loc=0);
RSC Anderlecht was the team with the most wins in the Belgium Jupiler League and when their stats are compared to the runner up of the league, we can begin to assume from the visualization that their lies with their chance creation with passes and their creation of opportunities for taking shots at the goal.
# England Premier League
compare_attr = pd.DataFrame(data=team_attribute.query('team_name == "Manchester United (MUN)"').mean(),
columns=['Manchester United (MUN)'])
compare_attr['Manchester City (MCI)'] = team_attribute.query('team_name == "Manchester City (MCI)"').mean()
compare_attr = compare_attr[3:]
compare_attr
compare_attr.plot.barh(width=0.50, figsize=(12,8))
plt.title('comparing best two teams of England Premier League')
plt.ylabel('Attribute type')
plt.xlabel('Attribute value')
plt.legend(loc = 0);
Manchester United's victory in the English Premier league seems to be a bit more contested by the runner up in terms of stats as manchester seems to really only pull ahead in chances created from crosses while all the attributes are pretty similar to their rivals, Manchester City.
# France Ligue 1
compare_attr = pd.DataFrame(data=team_attribute.query('team_name == "Paris Saint-Germain (PSG)"').mean(),
columns=['Paris Saint-Germain (PSG)'])
compare_attr['Girondins de Bordeaux (BOR)'] = team_attribute.query('team_name == "Girondins de Bordeaux (BOR)"').mean()
compare_attr = compare_attr[3:]
compare_attr
compare_attr.plot.barh(width=0.50, figsize=(12,8))
plt.title('comparing best two teams of France Ligue 1')
plt.ylabel('Attribute type')
plt.xlabel('Attribute value')
plt.legend(loc = 0);
# Italy Serie A
compare_attr = pd.DataFrame(data=team_attribute.query('team_name == "Juventus (JUV)"').mean(),
columns=['Juventus (JUV)'])
compare_attr['Milan (ACM)'] = team_attribute.query('team_name == "Milan (ACM)"').mean()
compare_attr = compare_attr[3:]
compare_attr
compare_attr.plot.barh(width=0.50, figsize=(12,8))
plt.title('comparing best two teams of Italy Serie A')
plt.ylabel('Attribute type')
plt.xlabel('Attribute value')
plt.legend(loc = 0);
The Italy Serie A stats show something rather interesting. Milan(The league runner up) is actually ahead in a lot of the attribute but Juventus still managed to take the most wins in the league.
We continue with these same steps and analysis for the remaining leagues.
# Netherlands Eredivisie Ajax (AJA) PSV (PSV)
# Netherlands Eredivisie
compare_attr = pd.DataFrame(data=team_attribute.query('team_name == "Ajax (AJA)"').mean(),
columns=['Ajax (AJA)'])
compare_attr['PSV (PSV)'] = team_attribute.query('team_name == "PSV (PSV)"').mean()
compare_attr = compare_attr[3:]
compare_attr
compare_attr.plot.barh(width=0.50, figsize=(12,8))
plt.title('comparing best two teams of Netherlands Eredivisie')
plt.ylabel('Attribute type')
plt.xlabel('Attribute value')
plt.legend(loc = 0);
# Poland Ekstraklasa Legia Warszawa (LEG) Wisła Kraków (WIS)
# Poland Ekstraklasa
compare_attr = pd.DataFrame(data=team_attribute.query('team_name == "Legia Warszawa (LEG)"').mean(),
columns=['Legia Warszawa (LEG)'])
compare_attr['Wisła Kraków (WIS)'] = team_attribute.query('team_name == "Wisła Kraków (WIS)"').mean()
compare_attr = compare_attr[3:]
compare_attr
compare_attr.plot.barh(width=0.50, figsize=(12,8))
plt.title('comparing best two teams of Poland Ekstraklasa')
plt.ylabel('Attribute type')
plt.xlabel('Attribute value')
plt.legend(loc = 0);
# Portugal Liga ZON Sagres SL Benfica (BEN) FC Porto (POR)
# Portugal Liga ZON Sagres
compare_attr = pd.DataFrame(data=team_attribute.query('team_name == "SL Benfica (BEN)"').mean(),
columns=['SL Benfica (BEN)'])
compare_attr['FC Porto (POR)'] = team_attribute.query('team_name == "FC Porto (POR)"').mean()
compare_attr = compare_attr[3:]
compare_attr
compare_attr.plot.barh(width=0.50, figsize=(12,8))
plt.title('comparing best two teams of Portugal Liga ZON Sagres')
plt.ylabel('Attribute type')
plt.xlabel('Attribute value')
plt.legend(loc = 0);
# Scotland Premier League Celtic (CEL) Rangers (RAN)
# Scotland Premier League
compare_attr = pd.DataFrame(data=team_attribute.query('team_name == "Celtic (CEL)"').mean(),
columns=['Celtic (CEL)'])
compare_attr['Rangers (RAN)'] = team_attribute.query('team_name == "Rangers (RAN)"').mean()
compare_attr = compare_attr[3:]
compare_attr
compare_attr.plot.barh(width=0.50, figsize=(12,8))
plt.title('comparing best two teams of Scotland Premier League')
plt.ylabel('Attribute type')
plt.xlabel('Attribute value')
plt.legend(loc = 0);
# Spain LIGA BBVA FC Barcelona (BAR) Real Madrid CF (REA)
# Spain LIGA BBVA
compare_attr = pd.DataFrame(data=team_attribute.query('team_name == "FC Barcelona (BAR)"').mean(),
columns=['FC Barcelona (BAR)'])
compare_attr['Real Madrid CF (REA)'] = team_attribute.query('team_name == "Real Madrid CF (REA)"').mean()
compare_attr = compare_attr[3:]
compare_attr
compare_attr.plot.barh(width=0.50, figsize=(12,8))
plt.title('comparing best two teams of Spain LIGA BBVA')
plt.ylabel('Attribute type')
plt.xlabel('Attribute value')
plt.legend(loc = 0);
Interestingly in LIGA BBVA league, even though the second team(Real Madrid) was ahead 6 attributes in total 9, the first most winner was FC Barcelona.
# Switzerland Super League FC Basel (BAS) FC Zürich (ZUR)
# Switzerland Super League
compare_attr = pd.DataFrame(data=team_attribute.query('team_name == "FC Basel (BAS)"').mean(),
columns=['FC Basel (BAS)'])
compare_attr['FC Zürich (ZUR)'] = team_attribute.query('team_name == "FC Zürich (ZUR)"').mean()
compare_attr = compare_attr[3:]
compare_attr
compare_attr.plot.barh(width=0.50, figsize=(12,8))
plt.title('comparing best two teams of Switzerland Super League')
plt.ylabel('Attribute type')
plt.xlabel('Attribute value')
plt.legend(loc = 0);
For this quesetion, we wanted to anaylsis if the old superstition of the "home field advantage" was true by comparing the wins of teams as home teams vs their away wins.
team_goal_won_total['game_count_as_hometeam'] = 'NAN'
team_goal_won_total['goal_count_as_hometeam'] = 'NAN'
team_goal_won_total['win_count_as_hometeam'] = 'NAN'
team_goal_won_total['game_count_as_awayteam'] = 'NAN'
team_goal_won_total['goal_count_as_awayteam'] = 'NAN'
team_goal_won_total['win_count_as_awayteam'] = 'NAN'
team_goal_won_total['num_total_game'] = 'NAN'
team_match['season'] = team_match['season_start'] + " - " + team_match['season_end']
for index, row in team_goal_won_total.iterrows():
season = team_goal_won_total.iloc[index, 1]
team_name = team_goal_won_total.iloc[index, 2]
df = team_match[(team_match['season'] == season) & (team_match['home_team'] == team_name)]
team_goal_won_total.iloc[index, 6] = df['home_team_goal'].sum()
team_goal_won_total.iloc[index, 5] = df.shape[0]
new_df = df[df['winner'] == team_name]
team_goal_won_total.iloc[index, 7] = new_df.shape[0]
df = team_match[(team_match['season'] == season) & (team_match['away_team'] == team_name)]
team_goal_won_total.iloc[index, 9] = df['away_team_goal'].sum()
team_goal_won_total.iloc[index, 8] = df.shape[0]
new_df = df[df['winner'] == team_name]
team_goal_won_total.iloc[index, 10] = new_df.shape[0]
team_goal_won_total.iloc[index, 11] = team_goal_won_total.iloc[index, 5] + team_goal_won_total.iloc[index, 8]
team_goal_won_total = team_goal_won_total[['leagueName', 'season', 'team_name', 'num_total_game', 'game_count_as_hometeam',
'game_count_as_awayteam', 'count_wins', 'win_count_as_hometeam',
'win_count_as_awayteam', 'total_num_goal', 'goal_count_as_hometeam', 'goal_count_as_awayteam']]
team_goal_won_total = team_goal_won_total.rename(columns={"num_total_game": "game_count_total", "count_wins": "win_count_total", "total_num_goal": "goal_count_total"})
team_goal_won_total.head()
With the table created, we can now visualize the results to make it easier to interpret.
new_df = team_goal_won_total[team_goal_won_total.columns[-2:]]
df1 = new_df.mean().to_frame()
df1.reset_index(inplace=True)
df1 = df1.rename(columns={"index": "type", 0: "mean_goal_count"})
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
x = df1.type
y = df1.mean_goal_count
ax.bar(x,y)
plt.show()
We come up with the mean value of total goals as a home team is higher than the mean value of total goals as an away team.
# win vs home or away
new_df = team_goal_won_total[team_goal_won_total.columns[7:9]]
df1 = new_df.mean().to_frame()
df1.reset_index(inplace=True)
df1 = df1.rename(columns={"index": "type", 0: "mean_win_count"})
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
x = df1.type
y = df1.mean_win_count
ax.bar(x,y)
plt.show()
Likewise, the mean value of home team victories is much higher than when away.
For this section, we wanted to start a surface level exploration into the relationship between the individual player's overall rating and their attributes. We accomplish this by creating a linear regression with a subset of the player's attributes and their associated rating to see if there were any trends shown by plotting the results.
attributes = player_Attributes[['free_kick_accuracy','agility','shot_power','stamina', 'short_passing', 'vision', 'ball_control', 'positioning', 'finishing']].fillna(0).astype('int32')
reg = LinearRegression().fit(attributes, player_Attributes['overall_rating'].fillna(0))
print(reg.coef_)
i = 0
for x in list(attributes):
fig, ax = plt.subplots()
ax.set_ylabel('overall_rating')
ax.set_xlabel(x)
ax.scatter(attributes[x],player_Attributes['overall_rating'].fillna(0))
ax.plot(attributes[x], reg.coef_[i]*attributes[x] + reg.intercept_, 'k')
i = i + 1
plt.show()
From the results, we can see that there is a relationship present and that could inducate that some attributes weight towards a higher rating more than others. We will continue this exploration when we get to building our models.
Here we wanted to see who was at the top of the game for each year in the dataset.
grouped_player = player_Attributes.groupby([player_Attributes['date'].dt.year])
players_list = []
ratings = []
seasons = []
for name, group in grouped_player:
df = group.sort_values('overall_rating', ascending=False)
row = df.iloc[0]
player = players.loc[players[players['player_fifa_api_id'] == row['player_fifa_api_id']].index, 'player_name']
players_list.append(player.values[0])
ratings.append(row['overall_rating'])
seasons.append(name)
overall_rating = pd.DataFrame(data={"seasons" : seasons, 'player' : players_list, "rating" : ratings})
overall_rating
Correlations from these results:
Wayne Rooney and cristiano Ronaldo were the two highest rated players during in 2007/2008, both of which played on Manchester United. Manchester United was also the team with the most wins in the Premier League during those years. Similarly, Messi and Neymar have been apart of Barcelona the years they were the highest rated players and while concurrently there at Barcelona, the club had the most wins in the spanish league.
Incidentally, it seems like the highest rated player comes from the spanish league for 8 years in a row here.
For this question, we were curious to see the distributions of the players preferred foot. As well, in some sports using a different hand or foot than the major sometimes results in small advantages (take left handers in tennis for example), does this apply to this sport as well?
p_A_1= player_Attributes[['player_api_id', 'preferred_foot', 'overall_rating', 'potential']]
p_A_1.head()
p_A_1= p_A_1.groupby(['player_api_id', 'preferred_foot']).mean() # getting the average player rating
p_A_1.head()
# get the total number of players that are right, left or both and there rating
left_foot=0; left_rating=0.0
right_foot=0; right_rating =0.0
both=0; both_rating = 0.0
checked=[]
for key, value in p_A_1.iterrows():
if key[0] in checked:
both +=1
both_rating += value[0]
elif key[1] == 'right':
right_foot +=1
checked.append(key[0])
right_rating += value[0]
else:
left_foot +=1
checked.append(key[0])
left_rating += value[0]
total_players= left_foot + right_foot + both
average_left_rating= left_rating/ left_foot
average_right_rating = right_rating/ right_foot
both_rating = both_rating/ both
# Data to plot
labels = 'left', 'right', 'both'
sizes = [left_foot, right_foot, both]
colors = ['yellowgreen', 'lightcoral', 'lightskyblue']
explode = (0.1, 0, 0) # explode 1st slice
# Plot
plt.pie(sizes, explode=explode, labels=labels, colors=colors,
autopct='%1.1f%%', shadow=True, startangle=140)
plt.axis('equal')
plt.title("Overall players foot Preferance percentage ")
plt.show()
From this graph, we can see that the major of players in our dataset are prefer their right foot, while a smaller percentage prefer their left or both.
# Data to plot
labels = average_left_rating, average_right_rating, both_rating
sizes = [left_foot, right_foot, both]
colors = ['yellowgreen', 'lightcoral', 'lightskyblue']
explode = (0.1, 0, 0) # explode 1st slice
# Plot
plt.pie(sizes, explode=explode, labels=labels, colors=colors,
autopct='%1.1f%%', shadow=True, startangle=140)
plt.axis('equal')
plt.title("Total foot preferance percentage with respect to average rating ")
plt.show()
And from this graph, we can tell that their does not seem to be a bias in the ratings based off the player's prefer foot, possibly correlating to their also being no advantage for using the other foot or both.
For this question, we wanted to see if their is a correlation between a high number of goals scored and a high win percentage.
plt.scatter(team_goal_won_total['goal_count_total'], team_goal_won_total['win_count_total'])
plt.ylabel("Total number of wins")
plt.xlabel("Total nubmer of goals")
plt.title("Total number of goals vs wins from 2008- 2016 for each team")
x= team_goal_won_total['goal_count_total'].values.reshape(-1,1)
y= team_goal_won_total['win_count_total']
reg_1 = LinearRegression()
reg_1 = reg_1.fit(x, y)
intercept= reg_1.intercept_
slope= reg_1.coef_
plt.plot(x, intercept + slope * x, 'red', label='reg')
plt.show()
team_goal_won_total['winning_percentage']= (team_goal_won_total['win_count_total'] / team_goal_won_total['game_count_total'])*100
team_goal_won_total.head()
plt.scatter(team_goal_won_total['goal_count_total'], team_goal_won_total['winning_percentage'])
plt.ylabel("wining percentage")
plt.xlabel("total_number of goals")
plt.title("Wining percentage of all Europian leagues teams Vs there respective goals per season")
x= team_goal_won_total['goal_count_total'].values.reshape(-1,1)
y=team_goal_won_total['winning_percentage']
reg_1 = LinearRegression()
reg_1 = reg_1.fit(x,y)
intercept= reg_1.intercept_
slope= reg_1.coef_
plt.plot(x, intercept + slope * x, 'red', label='reg')
plt.show()
a= team_goal_won_total.groupby(['leagueName'])
for index, value in a:
x= value.goal_count_total
y= value.winning_percentage
plt.scatter(x, y)
plt.ylabel("wining percentage")
plt.xlabel("total_number of goals")
plt.title(index)
reg_1 = LinearRegression()
reg_1 = reg_1.fit(x.values.reshape(-1,1),y)
intercept= reg_1.intercept_
slope= reg_1.coef_
plt.plot(x, intercept + slope * x, 'red', label='reg')
plt.show()
From each of these visualizations, aside from a couple of outliers, their is a consistant correlation between a high amount fo goals scored and a high number of wins.
With exploration of the data done, we move on to the penultimate step of the data lifecycle - the analysis. In this section we apply machine learning and statistics to better understand the trends and relationships present in the data. We will be using two machine learning models to accomplish this in this tutorial - Linear regression and logistical regression.
The first analysis we wanted to demonstate is the continuation of the question posed in question 5 of our exploration.
We will start by defining our null and alternative hypothesis.
Null hypothesis 1: There is no relation between the overall rating of a player and the attributes of player.
Alternative Hypothesis 1: There is a relation between the overall rating of a player and the attributes of player.
We will either accept or reject the null hypothesis based on the statistical results of our linear regression.
df = player_Attributes
# Preparing data for regression
selected_attr = [
'potential', 'crossing', 'finishing', 'heading_accuracy',
'short_passing', 'volleys', 'dribbling', 'curve', 'free_kick_accuracy',
'long_passing', 'ball_control', 'acceleration', 'sprint_speed',
'agility', 'reactions', 'balance', 'shot_power', 'jumping', 'stamina',
'strength', 'long_shots', 'aggression', 'interceptions', 'positioning',
'vision', 'penalties', 'marking', 'standing_tackle', 'sliding_tackle',
'gk_diving', 'gk_handling', 'gk_kicking', 'gk_positioning',
'gk_reflexes']
target = ['overall_rating']
df = df.dropna(0)
X = df[selected_attr]
y = df[target]
X_train, X_test, y_train, y_test = train_test_split(X,y)
reg = LinearRegression()
reg.fit(X_train,y_train)
print("coef_: " , reg.coef_)
print("intercept_: " , reg.intercept_)
X=sm.add_constant(X_train)
model= sm.OLS(y_train, X).fit()
model.summary()
y_prediction = reg.predict(X_test)
score = reg.score(X_test, y_test)
print("score: " , score)
By look at the p values of the different attributes we can confidently reject the null hypothesis as we can see that the p values are less than 0.05, meaning that there is a relationship between player rating and attribute scores. From the coefficients we can see though we can see that some attributes are more heavily related to the rating the player receives that others. Lastly we find that we are able to predict the player's rating correctly approximately 84% percent of the time based on the attributes passed into the regression.
Null hypothesis 2: There is no relation between the winning percentage and other features of team.
Alternative Hypothesis 2: There is a relation between the winning percentage and other features of team.
team_goal_won_total['season_start'] = team_goal_won_total['season'].apply(lambda x: x.split('-')[0])
team_goal_won_total['season_end'] = team_goal_won_total['season'].apply(lambda x: x.split('-')[1])
team_goal_won_total['season_start'] = team_goal_won_total['season_start'].astype(int)
selected_col = [
'team_name', 'date', 'buildUpPlaySpeed', 'buildUpPlayDribbling',
'buildUpPlayPassing', 'chanceCreationPassing', 'chanceCreationCrossing', 'chanceCreationShooting', 'defencePressure',
'defenceAggression', 'defenceTeamWidth']
df = team_attribute[selected_col]
df = df.dropna()
df['year'] = pd.DatetimeIndex(df['date']).year
df = df.drop(['date'], axis=1)
#match season start and team name
all_team_info = pd.merge(df, team_goal_won_total, how='inner',
left_on=['team_name','year'], right_on = ['team_name','season_start'])
all_team_info = all_team_info.drop(['season_start', 'season_end', 'team_name', 'year', 'leagueName', 'season', 'game_count_total',
'game_count_as_hometeam', 'game_count_as_awayteam', 'win_count_total',
'win_count_as_hometeam', 'win_count_as_awayteam', 'goal_count_total',
'goal_count_as_hometeam', 'goal_count_as_awayteam'], axis=1)
all_team_info['winning_percentage'] = all_team_info['winning_percentage'].astype(float)
all_team_info.head()
# This table will have the winnign_percentage and team attributes.
y_target = all_team_info['winning_percentage']
X = all_team_info.iloc[:,0:-1]
X_train, X_test, y_train, y_test = train_test_split(X,y_target)
reg = LinearRegression()
reg.fit(X_train,y_train)
print("coef_: " , reg.coef_)
print("intercept_: " , reg.intercept_)
y_prediction = reg.predict(X_test)
score = reg.score(X_test, y_test)
print("score: " , score)
X1=sm.add_constant(X_train)
model= sm.OLS(y_train, X1.astype(float)).fit()
model.summary()
From the results of the linear regression, it seems as though most of the teams attributes could not be found to have a strong relationship with that team's winning percentage. As we can see all but the p values of chanceCreationCrossing, defencePressure, buildUpPlayPassing, and defenceAggression were less than 0.05. As well, the model scores very low in it's prediction of the win percentage when giving the team's attributes.
When attempting to predict the winning percentage with only the attributes which has p value less than 0.05
X_new = X[['chanceCreationCrossing', 'defencePressure', 'buildUpPlayPassing', 'defenceAggression']]
X_train, X_test, y_train, y_test = train_test_split(X_new, y_target)
reg = LinearRegression()
reg.fit(X_train,y_train)
print("coef_: " , reg.coef_)
print("intercept_: " , reg.intercept_)
y_prediction = reg.predict(X_test)
score = reg.score(X_test, y_test)
print("score: " , score)
X2=sm.add_constant(X_train)
model= sm.OLS(y_train, X2.astype(float)).fit()
model.summary()
The score is still very low, in other word this model describe only about 13% of the data. But the coef of those attributes are not 0 because the p values are less than 0.05.
===========================================================================================================================
creating a logistic regression on matchs that teams either win, loss or draw based on the starting players overall rating for the year.
Null hypothesis 1: There is no strong relation between the players rating of the team and the teams win chance for a given match
Alternative Hypothesis 1: There is a strong relation between the overall player rating of a team for a given match and chance of winning
# Getting the columns needed/ Data processing
matches_and_players= matches_copy[['league_id', 'season', 'date', 'match_api_id', 'home_team_api_id', 'away_team_api_id',
'home_player_1','home_player_2', 'home_player_3','home_player_4','home_player_5','home_player_6',
'home_player_7','home_player_8','home_player_9','home_player_10','home_player_11',
'away_player_1', 'away_player_2', 'away_player_3', 'away_player_4', 'away_player_5', 'away_player_6',
'away_player_7', 'away_player_8', 'away_player_9', 'away_player_10' , 'away_player_11',
'home_team_goal', 'away_team_goal']]
# clean up the data and
# create label to who won the game. 1 home wins 0 drow and -1 away wins
matches_and_players.dropna(inplace=True)
matches_and_players['win_loss_drow']= np.where( matches_and_players['home_team_goal']==matches_and_players['away_team_goal'],
0, np.where(matches_and_players['home_team_goal']>matches_and_players['away_team_goal'],
1, -1))
# melting the data sho each player is associated with team
home_team_players= pd.melt(matches_and_players, id_vars=['league_id','season', 'date', 'match_api_id', 'home_team_api_id', 'win_loss_drow'], value_vars=['home_player_1','home_player_2', 'home_player_3','home_player_4',
'home_player_5','home_player_6','home_player_7','home_player_8',
'home_player_9','home_player_10','home_player_11'],
ignore_index=False, var_name='home_players', value_name='players_id')
# convert the date col to the python datetime
home_team_players['date'] = pd.to_datetime(home_team_players['date'])
home_team_players.head()
# create a dataframe for players and there overall rating with respect to time.
# extract the year from the given date
player_rating= player_Attributes[['player_api_id', 'date', 'overall_rating']]
player_rating['year'] = pd.DatetimeIndex(player_rating['date']).year
player_rating.head()
# group them by the year and player_id
player_rating= player_rating.groupby(['player_api_id', 'year']).mean()
player_rating.head()
# convert player id from float to int. so that it is easy for comparison
home_team_players['players_id']= home_team_players['players_id'].astype(np.int64)
# extract year from date
home_team_players['year'] = pd.DatetimeIndex(home_team_players['date']).year
# Connecting matches teams and players.
matches_teams_players = pd.merge(home_team_players, player_rating, how='left',
left_on=['players_id', 'year'],
right_on = ['player_api_id', 'year'])
print(matches_teams_players.shape)
# for the nan values for players ratting replace them with average value of the team for the given match
matches_teams_players['overall_rating']= matches_teams_players.groupby('match_api_id',
)['overall_rating'].apply(lambda x: x.fillna(x.mean()))
print(matches_teams_players.shape)
# group the home players by match id
grouped_matches_teams_players= matches_teams_players.groupby('match_api_id')
matches_teams_players.head()
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import MinMaxScaler
def process_logistic_reg(grouped_data_frame):
home_team_X_values= list()
home_team_Y_values= list()
for index, value in grouped_data_frame:
# home_team_X_values.append( [round(num,3) for num in value.overall_rating.tolist()])
home_team_X_values.append(value.overall_rating.tolist())
home_team_Y_values.append(value.win_loss_drow.tolist()[0])
# making it binary classification win=1 draw/loss= 0
home_team_y_binary= [0 if x==-1 else x for x in home_team_Y_values]
X_train,X_test,y_train,y_test = train_test_split(home_team_X_values, home_team_y_binary, test_size=0.25, random_state=0)
scaler = MinMaxScaler(feature_range=(0, 1.5))
scaler.fit(X_train)
scaler.fit(X_test)
normalized_train = scaler.transform(X_train)
normalized_test = scaler.transform(X_test)
kf = KFold(n_splits = 10)
logistic_regression= LogisticRegression(multi_class='ovr', solver='liblinear')
scores = cross_val_score(logistic_regression, normalized_train, y_train, cv = kf)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
# Do logistic regresssion on the train data
logistic_regression.fit(normalized_train, y_train)
y_pred=logistic_regression.predict(normalized_test) # prediction using the X test data
same_values= [i for i, j in zip(y_test, y_pred) if i == j] # getting values that are the same prediction == actual
accuracy = len(same_values)/ len(y_pred) *100 # this is the percetage that the prediction is correct
print("Accuracy of prediction: " + str(accuracy))
#log_reg = sm.MNLogit(y_train, X_train).fit()
log_reg= sm.Logit(y_train, X_train).fit()
print(log_reg.summary())
# logistic regression on overall Europe team's home match win vs the starting players ratting. If the p value of this
# outcome is low then this can be a good classification prediction for wchich one team can win a match based on the
# starting players ratting. If the p value is high then we can acept the null hypothisis of being there is less relation
# between the win and starting players rating.
process_logistic_reg(grouped_matches_teams_players)
As shown above the logistic regression model to predict the winning chance of home team is about 59% accurate. This shows that it would be really hard to determine the outcome the game with just the overall players rating of the team. Although some of the p values for few players are in good condition. They are less than 0.05. It shows that players x1,x2,x3,x10,x11 have a significant values in this model. This makes sense as these players are goalkeeper, defenders and attackers.
# do logistic regression on some of the well known leagues such as England and Spanish leagues and see the outcome
league_match_grouped= matches_teams_players.groupby(['league_id'])
# Spanish league id= 21518
England_matches= league_match_grouped.get_group(21518).groupby('match_api_id')
process_logistic_reg(England_matches)
In the above logistic regression, we have taken only the Spanish La Liga to see if there are some changes for the overall logistic regression model. Which has a little higher accuracy with 63% accuracy. As the same as the overall European league logistic regression model this one also show the significant of the keeper and attackers with defenders having less accurate p value.
# melting the data sho each player is associated with team
away_team_players= pd.melt(matches_and_players, id_vars=['league_id','season', 'date', 'match_api_id', 'away_team_api_id', 'win_loss_drow'],
value_vars=['away_player_1','away_player_2', 'away_player_3','away_player_4',
'away_player_5','away_player_6','away_player_7','away_player_8',
'away_player_9','away_player_10','away_player_11'],
ignore_index=False, var_name='away_players', value_name='players_id')
# convert the date col to the python datetime
away_team_players['date'] = pd.to_datetime(away_team_players['date'])
away_team_players['year'] = pd.DatetimeIndex(away_team_players['date']).year
# Connecting matches teams and players.
matches_teams_players_away = pd.merge(away_team_players, player_rating, how='left',
left_on=['players_id', 'year'],
right_on = ['player_api_id', 'year'])
# for the nan values for players ratting replace them with average value of the team for the given match
matches_teams_players_away['overall_rating']= matches_teams_players_away.groupby('match_api_id',
)['overall_rating'].apply(lambda x: x.fillna(x.mean()))
# group the home players by match id
grouped_matches_teams_players_away= matches_teams_players_away.groupby('match_api_id')
process_logistic_reg(grouped_matches_teams_players_away)
Above, we have also did logistic regression model to see away's team winning chance prediction. This model shows that an even lesses accuracy. In the away team model there is only one significat player in the team that has less than 0.05 p value, which is attacker.
At last, we come to the last step of the data lifecycle, for this step we draw our conclusions based off the insight we have gained from our analysis of the data.
From the exploratory analysis, we guessed there is linear relationship between overall player ratting and each players attributes. This model can describe 85% of data according to the linear regression score. From the statistic summery we got valid coefficients for each attribute with p value less than 0.05. As a result, we can reject the null hypothesis of there is no relationship between overall rating of players and player attributes.
However, we cannot see direct relationship in team attributes from the graph of the comparison of best two teams of each league, so we did linear regression model. But the model's score is too low to predict winning percentage from team attributes with the given dataset. There are some attributes that have low p value that are significant to this model. we cannot reject the null hypothesis
Based on the logistic regression model we have done, we can accept the null hypothesis of no strong relation between team players overall rating and winning chance of given match. This does not mean there is no relation! It is just we cannot reject the null hypostasis as the p values of the prediction model are higher, which makes it hard to predict the outcome of the match.