Now more than ever, data collection is a well established commodity despite often being of poor quality and inconsitent in its obtaining process. The FIFA Data Ecosystem is no exception however, several data sources, data processors and distribution layers help to provide needed consistency and quality for all relevant stakeholders. The FIFA Arab Cup 2021™ and the FIFA Club World Cup™ in 2019 and 2021 are some of several FIFA tournaments where data collected by FIFA’s Football Data Ecosystem has been developed and tested through recent years. We will be looking at a fairly messy data sample from 2021 and attempt to clean it for future use. Dataset source:https://www.kaggle.com/datasets/yagunnersya/fifa-21-messy-raw-dataset-for-cleaning-exploring?select=fifa21_raw_data.csv%E2%80%8B Lets get started.
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Of the 80 columns of data in this sample, we will first look at the Height of the listed player data and off the bat there is inconsistency as the majority of data collected for this statistic is in centimeters however, a few rows have entries in feet and inches. We observe these inconsitencies specifically in rows 820:875 and can probably assume there are more. Not good.
## c("195cm", "179cm", "191cm", "170cm", "182cm", "195cm", "178cm",
## "187cm", "187cm", "190cm", "188cm", "198cm", "180cm", "172cm",
## "189cm", "192cm", "179cm", "184cm", "186cm", "178cm", "173cm",
## "176cm", "173cm", "187cm", "182cm", "186cm", "188cm", "170cm",
## "6'3\"", "6'5\"", "185cm", "179cm", "195cm", "179cm", "185cm",
## "174cm", "183cm", "180cm", "181cm", "186cm", "178cm", "5'11\"",
## "6'4\"", "6'1\"", "6'0\"", "6'1\"", "5'11\"", "6'2\"", "6'0\"",
## "6'3\"", "6'0\"", "184cm", "5'10\"", "5'9\"", "5'11\"", "5'6\""
## )
First, separate the Height column into two new columns Feet and Inches
Second, convert Feet and Inches columns to numeric
Third, create a Height(cm) column that sends through cm variables from Height but converts any Feet(’) or Inches(“) variables to cm
a <- data.frame(Height = c("174cm", "183cm", "180cm", "181cm", "186cm", "178cm", "5'11\"", "6'4\"", "6'1\"", "6'0\"", "6'1\"", "5'11\"", "6'2\"", "6'0\"", "6'3\"", "6'0\"", "184cm", "5'10\"", "5'9\"", "5'11\"", "5'6\""))
a |>
separate(Height, c("Feet", "Inches"), sep = "'", remove = FALSE) |>
mutate(across(Feet:Inches, ~parse_number(.x))) |>
mutate("Height(cm)" = if_else(is.na(Inches), Feet, round((Feet+ Inches/12)*30.48)))
## Height Feet Inches Height(cm)
## 1 174cm 174 NA 174
## 2 183cm 183 NA 183
## 3 180cm 180 NA 180
## 4 181cm 181 NA 181
## 5 186cm 186 NA 186
## 6 178cm 178 NA 178
## 7 5'11" 5 11 180
## 8 6'4" 6 4 193
## 9 6'1" 6 1 185
## 10 6'0" 6 0 183
## 11 6'1" 6 1 185
## 12 5'11" 5 11 180
## 13 6'2" 6 2 188
## 14 6'0" 6 0 183
## 15 6'3" 6 3 190
## 16 6'0" 6 0 183
## 17 184cm 184 NA 184
## 18 5'10" 5 10 178
## 19 5'9" 5 9 175
## 20 5'11" 5 11 180
## 21 5'6" 5 6 168
## fifa21_raw_data.Joined.1.20.
## 1 Jul 1, 2004
## 2 Jul 10, 2018
## 3 Jul 16, 2014
## 4 Aug 30, 2015
## 5 Aug 3, 2017
## 6 Jul 1, 2014
## 7 Jul 1, 2017
## 8 Jul 19, 2018
## 9 Jul 1, 2018
## 10 Jul 1, 2014
## 11 Jan 1, 2018
## 12 Jul 1, 2016
## 13 Jul 11, 2013
## 14 Aug 9, 2018
## 15 Jul 1, 2011
## 16 Jul 9, 2009
## 17 Aug 1, 2005
## 18 Jul 28, 2011
## 19 Jul 14, 2015
## 20 Jul 16, 2016
We look at the Joined column in this data set and see that it is in a format that is difficult to work with. Fortunately, we can split this column into three respective columns with Year, Month, and Day in an appropriate format that is much easier to analyze. Currently we have inputs in the format of Jul 1, 2004. Not the easiest formatting. Let’s change this using:
fifa21_raw_data$Joined[1:20] |>
as.Date(vec, format = “%b %d, %Y”) |>
as.character() |>
strsplit(“-”) |>
do.call(rbind, args=_)
## [,1] [,2] [,3]
## [1,] "2004" "07" "01"
## [2,] "2018" "07" "10"
## [3,] "2014" "07" "16"
## [4,] "2015" "08" "30"
## [5,] "2017" "08" "03"
## [6,] "2014" "07" "01"
## [7,] "2017" "07" "01"
## [8,] "2018" "07" "19"
## [9,] "2018" "07" "01"
## [10,] "2014" "07" "01"
## [11,] "2018" "01" "01"
## [12,] "2016" "07" "01"
## [13,] "2013" "07" "11"
## [14,] "2018" "08" "09"
## [15,] "2011" "07" "01"
## [16,] "2009" "07" "09"
## [17,] "2005" "08" "01"
## [18,] "2011" "07" "28"
## [19,] "2015" "07" "14"
## [20,] "2016" "07" "16"
Fantastic! Let’s go ahead and rename the columns as well while we’re at it:
fifa21_raw_data$Joined[1:20] |>
as.Date(vec, format = “%b %d, %Y”) |>
as.character() |>
strsplit(“-”) |>
do.call(rbind, args=_) |>
as.data.frame() |>
setNames(c(“Year”,“Month”, “Day”))
## Year Month Day
## 1 2004 07 01
## 2 2018 07 10
## 3 2014 07 16
## 4 2015 08 30
## 5 2017 08 03
## 6 2014 07 01
## 7 2017 07 01
## 8 2018 07 19
## 9 2018 07 01
## 10 2014 07 01
## 11 2018 01 01
## 12 2016 07 01
## 13 2013 07 11
## 14 2018 08 09
## 15 2011 07 01
## 16 2009 07 09
## 17 2005 08 01
## 18 2011 07 28
## 19 2015 07 14
## 20 2016 07 16
## c("€103.5M", "€63M", "€120M", "€129M", "€132M", "€111M",
## "€120.5M", "€102M", "€185.5M", "€110M", "€113M", "€120.5M",
## "€90.5M", "€82M", "€17.5M", "€83.5M", "€33.5M", "€83.5M",
## "€114.5M", "€78M")
## c("€560K", "€220K", "€125K", "€370K", "€270K", "€240K",
## "€250K", "€160K", "€160K", "€260K", "€210K", "€250K",
## "€310K", "€250K", "€130K", "€350K", "€300K", "€300K",
## "€270K", "€190K")
In these columns we can see that instead of integers, we have character types. Creating a data frame for all three, we are able to adjust these strings into integers in one fell swoop after defining values to “K” and “M”. We can use this information to get an idea of how many players are highly valued but still underpaid (low wage). Via scatterplot, we can see players who are paid the same yearly wage but are valued differently:
df <- structure(list(Wage = c(“�082�560K”, “�082�220K”, “�082�125K”, “�082�370K”, “�082�270K”, “�082�240K”, “�082�160K”, “�082�160K”, “�082�250K”, “�082�250K”, “�082�210K”, “�082�260K”, “�082�310K”, “�082�250K”, “�082�125K”, “�082�350K”, “�082�300K”, “�082�300K”, “�082�190K”, “�082�145K”), Value = c(“�082�67.5M”, “�082�46M”, “�082�75M”, “�082�87M”, “�082�90M”, “�082�80M”, “�082�105.5M”, “�082�62.5M”, “�082�78M”, “�082�78M”, “�082�75.5M”, “�082�69.5M”, “�082�59.5M”, “�082�56M”, “�082�29M”, “�082�53M”, “�082�24.5M”, “�082�53M”, “�082�51M”, “�082�65M” )), class = “data.frame”, row.names = c(NA, -20L))
dct <- c(K= ‘e3’, M = ‘e6’)
map_df(df, ~as.numeric(str_replace_all(str_remove(.x, “\D+”), dct)))
ggplot(df,aes(x=Wage,y=Value))+geom_point()
## # A tibble: 20 × 2
## Wage Value
## <dbl> <dbl>
## 1 560000 67500000
## 2 220000 46000000
## 3 125000 75000000
## 4 370000 87000000
## 5 270000 90000000
## 6 240000 80000000
## 7 160000 105500000
## 8 160000 62500000
## 9 250000 78000000
## 10 250000 78000000
## 11 210000 75500000
## 12 260000 69500000
## 13 310000 59500000
## 14 250000 56000000
## 15 125000 29000000
## 16 350000 53000000
## 17 300000 24500000
## 18 300000 53000000
## 19 190000 51000000
## 20 145000 65000000
Here we see that, although not shown on the dataframe column itself, there are hidden newline characters in the Hits column eg: n372, n344, n86, n163, n273, n182, n646, n79, n164, n170, n170, n93, n131, n89, n90, n169, n187, n103, n169, n317 We will use sapply to remove these from the elements of the column without compromising the entire dataframe:
fifa21_raw_data$Hits <- sapply(fifa21_raw_data$Hits, function(x) {gsub(“[]”, ““, x)})
## c("771", "562", "150", "207", "595", "248", "246", "120", "1.6K",
## "130", "321", "189", "175", "96", "118", "216", "212", "154",
## "205", "202")
dput(fifa21_raw_data$‘W/F’[1:20])
## NULL
c(“4 ★”, “4 ★”, “3 ★”, “5 ★”, “5 ★”, “4 ★”, “4 ★”, “3 ★”, “3 ★”, “4 ★”, “3 ★”, “4 ★”, “3 ★”, “3 ★”, “4 ★”, “4 ★”, “3 ★”, “4 ★”, “3 ★”, “4 ★”)
We have a couple columns that have ★ symbols, particularly the W/F column. Unfortunately, although the chunk output displays this, rmarkdown will not show the same output. I will leave the chunk and reproduce the output via text as I have not found a solution to this. We will use a gsub calling to address this.
gsub(” ★“,” “, fifa21_raw_data$‘W/F’[1:20])
## character(0)
[1] “4” “4” “3” “5” “5” “4” “4” “3” “3” “4” “3”
[12] “4” “3” “3” “4” “4” “3” “4” “3” “4”
We may use this simple gsub call to adjust the other columns containing the non-alphanumeric symbol.