Now more than ever, data collection is a well established commodity despite often being of poor quality and inconsitent in its obtaining process. The FIFA Data Ecosystem is no exception however, several data sources, data processors and distribution layers help to provide needed consistency and quality for all relevant stakeholders. The FIFA Arab Cup 2021™ and the FIFA Club World Cup™ in 2019 and 2021 are some of several FIFA tournaments where data collected by FIFA’s Football Data Ecosystem has been developed and tested through recent years. We will be looking at a fairly messy data sample from 2021 and attempt to clean it for future use. Dataset source:https://www.kaggle.com/datasets/yagunnersya/fifa-21-messy-raw-dataset-for-cleaning-exploring?select=fifa21_raw_data.csv%E2%80%8B Lets get started.

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Convert height and weight columns to consistent numerics

Of the 80 columns of data in this sample, we will first look at the Height of the listed player data and off the bat there is inconsistency as the majority of data collected for this statistic is in centimeters however, a few rows have entries in feet and inches. We observe these inconsitencies specifically in rows 820:875 and can probably assume there are more. Not good.

## c("195cm", "179cm", "191cm", "170cm", "182cm", "195cm", "178cm", 
## "187cm", "187cm", "190cm", "188cm", "198cm", "180cm", "172cm", 
## "189cm", "192cm", "179cm", "184cm", "186cm", "178cm", "173cm", 
## "176cm", "173cm", "187cm", "182cm", "186cm", "188cm", "170cm", 
## "6'3\"", "6'5\"", "185cm", "179cm", "195cm", "179cm", "185cm", 
## "174cm", "183cm", "180cm", "181cm", "186cm", "178cm", "5'11\"", 
## "6'4\"", "6'1\"", "6'0\"", "6'1\"", "5'11\"", "6'2\"", "6'0\"", 
## "6'3\"", "6'0\"", "184cm", "5'10\"", "5'9\"", "5'11\"", "5'6\""
## )
  • First, separate the Height column into two new columns Feet and Inches

  • Second, convert Feet and Inches columns to numeric

  • Third, create a Height(cm) column that sends through cm variables from Height but converts any Feet(’) or Inches(“) variables to cm

  • a <- data.frame(Height = c("174cm", "183cm", "180cm", "181cm", "186cm", "178cm", "5'11\"", "6'4\"", "6'1\"", "6'0\"", "6'1\"", "5'11\"", "6'2\"", "6'0\"", "6'3\"", "6'0\"", "184cm", "5'10\"", "5'9\"", "5'11\"", "5'6\""))

    a |>

    separate(Height, c("Feet", "Inches"), sep = "'", remove = FALSE) |>

    mutate(across(Feet:Inches, ~parse_number(.x))) |>

    mutate("Height(cm)" = if_else(is.na(Inches), Feet, round((Feet+ Inches/12)*30.48)))

##    Height Feet Inches Height(cm)
## 1   174cm  174     NA        174
## 2   183cm  183     NA        183
## 3   180cm  180     NA        180
## 4   181cm  181     NA        181
## 5   186cm  186     NA        186
## 6   178cm  178     NA        178
## 7   5'11"    5     11        180
## 8    6'4"    6      4        193
## 9    6'1"    6      1        185
## 10   6'0"    6      0        183
## 11   6'1"    6      1        185
## 12  5'11"    5     11        180
## 13   6'2"    6      2        188
## 14   6'0"    6      0        183
## 15   6'3"    6      3        190
## 16   6'0"    6      0        183
## 17  184cm  184     NA        184
## 18  5'10"    5     10        178
## 19   5'9"    5      9        175
## 20  5'11"    5     11        180
## 21   5'6"    5      6        168

Separating joined column into year, month, and day

##    fifa21_raw_data.Joined.1.20.
## 1                   Jul 1, 2004
## 2                  Jul 10, 2018
## 3                  Jul 16, 2014
## 4                  Aug 30, 2015
## 5                   Aug 3, 2017
## 6                   Jul 1, 2014
## 7                   Jul 1, 2017
## 8                  Jul 19, 2018
## 9                   Jul 1, 2018
## 10                  Jul 1, 2014
## 11                  Jan 1, 2018
## 12                  Jul 1, 2016
## 13                 Jul 11, 2013
## 14                  Aug 9, 2018
## 15                  Jul 1, 2011
## 16                  Jul 9, 2009
## 17                  Aug 1, 2005
## 18                 Jul 28, 2011
## 19                 Jul 14, 2015
## 20                 Jul 16, 2016

We look at the Joined column in this data set and see that it is in a format that is difficult to work with. Fortunately, we can split this column into three respective columns with Year, Month, and Day in an appropriate format that is much easier to analyze. Currently we have inputs in the format of Jul 1, 2004. Not the easiest formatting. Let’s change this using:

fifa21_raw_data$Joined[1:20] |>

as.Date(vec, format = “%b %d, %Y”) |>

as.character() |>

strsplit(“-”) |>

do.call(rbind, args=_)

##       [,1]   [,2] [,3]
##  [1,] "2004" "07" "01"
##  [2,] "2018" "07" "10"
##  [3,] "2014" "07" "16"
##  [4,] "2015" "08" "30"
##  [5,] "2017" "08" "03"
##  [6,] "2014" "07" "01"
##  [7,] "2017" "07" "01"
##  [8,] "2018" "07" "19"
##  [9,] "2018" "07" "01"
## [10,] "2014" "07" "01"
## [11,] "2018" "01" "01"
## [12,] "2016" "07" "01"
## [13,] "2013" "07" "11"
## [14,] "2018" "08" "09"
## [15,] "2011" "07" "01"
## [16,] "2009" "07" "09"
## [17,] "2005" "08" "01"
## [18,] "2011" "07" "28"
## [19,] "2015" "07" "14"
## [20,] "2016" "07" "16"

Fantastic! Let’s go ahead and rename the columns as well while we’re at it:

fifa21_raw_data$Joined[1:20] |>

as.Date(vec, format = “%b %d, %Y”) |>

as.character() |>

strsplit(“-”) |>

do.call(rbind, args=_) |>

as.data.frame() |>

setNames(c(“Year”,“Month”, “Day”))

##    Year Month Day
## 1  2004    07  01
## 2  2018    07  10
## 3  2014    07  16
## 4  2015    08  30
## 5  2017    08  03
## 6  2014    07  01
## 7  2017    07  01
## 8  2018    07  19
## 9  2018    07  01
## 10 2014    07  01
## 11 2018    01  01
## 12 2016    07  01
## 13 2013    07  11
## 14 2018    08  09
## 15 2011    07  01
## 16 2009    07  09
## 17 2005    08  01
## 18 2011    07  28
## 19 2015    07  14
## 20 2016    07  16

Transform value and wage columns

## c("€103.5M", "€63M", "€120M", "€129M", "€132M", "€111M", 
## "€120.5M", "€102M", "€185.5M", "€110M", "€113M", "€120.5M", 
## "€90.5M", "€82M", "€17.5M", "€83.5M", "€33.5M", "€83.5M", 
## "€114.5M", "€78M")
## c("€560K", "€220K", "€125K", "€370K", "€270K", "€240K", 
## "€250K", "€160K", "€160K", "€260K", "€210K", "€250K", 
## "€310K", "€250K", "€130K", "€350K", "€300K", "€300K", 
## "€270K", "€190K")

In these columns we can see that instead of integers, we have character types. Creating a data frame for all three, we are able to adjust these strings into integers in one fell swoop after defining values to “K” and “M”. We can use this information to get an idea of how many players are highly valued but still underpaid (low wage). Via scatterplot, we can see players who are paid the same yearly wage but are valued differently:

df <- structure(list(Wage = c(“�082�560K”, “�082�220K”, “�082�125K”, “�082�370K”, “�082�270K”, “�082�240K”, “�082�160K”, “�082�160K”, “�082�250K”, “�082�250K”, “�082�210K”, “�082�260K”, “�082�310K”, “�082�250K”, “�082�125K”, “�082�350K”, “�082�300K”, “�082�300K”, “�082�190K”, “�082�145K”), Value = c(“�082�67.5M”, “�082�46M”, “�082�75M”, “�082�87M”, “�082�90M”, “�082�80M”, “�082�105.5M”, “�082�62.5M”, “�082�78M”, “�082�78M”, “�082�75.5M”, “�082�69.5M”, “�082�59.5M”, “�082�56M”, “�082�29M”, “�082�53M”, “�082�24.5M”, “�082�53M”, “�082�51M”, “�082�65M” )), class = “data.frame”, row.names = c(NA, -20L))

dct <- c(K= ‘e3’, M = ‘e6’)

map_df(df, ~as.numeric(str_replace_all(str_remove(.x, “\D+”), dct)))

ggplot(df,aes(x=Wage,y=Value))+geom_point()

## # A tibble: 20 × 2
##      Wage     Value
##     <dbl>     <dbl>
##  1 560000  67500000
##  2 220000  46000000
##  3 125000  75000000
##  4 370000  87000000
##  5 270000  90000000
##  6 240000  80000000
##  7 160000 105500000
##  8 160000  62500000
##  9 250000  78000000
## 10 250000  78000000
## 11 210000  75500000
## 12 260000  69500000
## 13 310000  59500000
## 14 250000  56000000
## 15 125000  29000000
## 16 350000  53000000
## 17 300000  24500000
## 18 300000  53000000
## 19 190000  51000000
## 20 145000  65000000

Removing newline character strings from a column

Here we see that, although not shown on the dataframe column itself, there are hidden newline characters in the Hits column eg: n372, n344, n86, n163, n273, n182, n646, n79, n164, n170, n170, n93, n131, n89, n90, n169, n187, n103, n169, n317 We will use sapply to remove these from the elements of the column without compromising the entire dataframe:

fifa21_raw_data$Hits <- sapply(fifa21_raw_data$Hits, function(x) {gsub(“[]”, ““, x)})

## c("771", "562", "150", "207", "595", "248", "246", "120", "1.6K", 
## "130", "321", "189", "175", "96", "118", "216", "212", "154", 
## "205", "202")

Removing non-alphanumeric characters

dput(fifa21_raw_data$‘W/F’[1:20])

## NULL

c(“4 ★”, “4 ★”, “3 ★”, “5 ★”, “5 ★”, “4 ★”, “4 ★”, “3 ★”, “3 ★”, “4 ★”, “3 ★”, “4 ★”, “3 ★”, “3 ★”, “4 ★”, “4 ★”, “3 ★”, “4 ★”, “3 ★”, “4 ★”)

We have a couple columns that have ★ symbols, particularly the W/F column. Unfortunately, although the chunk output displays this, rmarkdown will not show the same output. I will leave the chunk and reproduce the output via text as I have not found a solution to this. We will use a gsub calling to address this.

gsub(” ★“,” “, fifa21_raw_data$‘W/F’[1:20])

## character(0)

[1] “4” “4” “3” “5” “5” “4” “4” “3” “3” “4” “3”

[12] “4” “3” “3” “4” “4” “3” “4” “3” “4”

We may use this simple gsub call to adjust the other columns containing the non-alphanumeric symbol.