Thursday, October 18, 2018

GOAT Talk - Cleaning Up The Data

Amazingly, I was able to scrape data for 11 NBA players thanks to Selenium and some decent knowledge of BeautifulSoup. Now it was time to actually do some initial cleaning and EDA.

First things first: Kareem's data. The good thing was that there was plenty of it. He played for 20 years, with solid contributions for most of those years. For data science purposes, though, the NBA statisticians did him a disservice very early on.



It might not be obvious, but there are many holes in Kareem's data from the 1969-70 season up to the 1976-77 season. Part of this is easy to fill in, as there was no 3-point shot in the NBA until the 1979-80 season. However, Steals, Blocks, Offensive Rebounds, Defensive Rebounds, VORP, and Plus/Minus were not recorded stats until the 1973-74 season. With those types of gaps, I had no real way to fill them in, and I didn't want to just input data for the sake of keeping a full table. My only option was to remove all of his data from my dataset.
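For anyone following along, here is a rough sketch of that decision, assuming the scraped stats sit in a pandas DataFrame. The column names (`Player`, `Season`, `3P`, `STL`, `BLK`) and every number below are made-up placeholders for illustration, not the actual scraped table.

```python
import numpy as np
import pandas as pd

# Toy rows with made-up numbers; column names are assumptions for illustration.
stats = pd.DataFrame({
    "Player": ["Player A", "Player A", "Player B", "Player B"],
    "Season": ["1969-70", "1979-80", "1996-97", "1997-98"],
    "3P":  [np.nan, 0.0, 1.2, 1.5],
    "STL": [np.nan, 1.0, 2.0, 2.1],
    "BLK": [np.nan, 3.0, 0.3, 0.2],
})

# Seasons that started before 1979 had no 3-point line,
# so a missing 3P value there can safely be set to 0.
start_year = stats["Season"].str[:4].astype(int)
pre_three = start_year < 1979
stats.loc[pre_three, "3P"] = stats.loc[pre_three, "3P"].fillna(0)

# Stats that were never recorded cannot be filled in honestly,
# so drop every player who still has a gap in any stat column.
has_gap = stats.drop(columns=["Player", "Season"]).isnull().any(axis=1)
players_with_gaps = stats.loc[has_gap, "Player"].unique()
cleaned = stats[~stats["Player"].isin(players_with_gaps)].reset_index(drop=True)

print(cleaned)
```

The 3-point column is the only one filled, because zero is the true value for those seasons; the other gaps remove the whole player rather than inventing numbers.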


Now, that leaves me with 10 players, but that doesn't satisfy the scientist in me.  So, we need more data.


I decided to pick 4 new players to at least add to the variety. All of these players have won at least 1 MVP, have been to numerous playoffs, and won an NBA Championship (except for Iverson, with the awesome MVP year he had in 2000-01, but I digress).

I did the same as before by scraping their Per Game statistics,

and then some of the Advanced Statistics,

then adding Seasons, All-Star appearances, shares of the MVP voting, MVP Placing, and MVP trophies won.
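One way the three pieces could be stitched together is a pair of left joins on player and season. This is only a sketch under the assumption that each scrape ends up as its own pandas DataFrame; the frame names, column names, and values are hypothetical.

```python
import pandas as pd

# Made-up stand-ins for the three scraped tables.
per_game = pd.DataFrame({
    "Player": ["Player B", "Player C"],
    "Season": ["2001-02", "2001-02"],
    "PTS": [10.0, 20.0],
})
advanced = pd.DataFrame({
    "Player": ["Player B", "Player C"],
    "Season": ["2001-02", "2001-02"],
    "PER": [15.0, 25.0],
})
awards = pd.DataFrame({
    "Player": ["Player B", "Player C"],
    "Season": ["2001-02", "2001-02"],
    "All_Star": [0, 1],
    "MVP_Share": [0.0, 0.5],
})

# Join everything on the (Player, Season) pair so each row is one player-season.
keys = ["Player", "Season"]
combined = (
    per_game
    .merge(advanced, on=keys, how="left")
    .merge(awards, on=keys, how="left")
)

print(combined)
```

A left join keeps every per-game row even when an advanced or awards row is missing, which makes any gap show up as a null in the later check.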

After getting the new players added, it was time to check for any null values. **Notice the disclaimer: the work had already been performed, but I saved the file to .csv, and once it was loaded back into Jupyter Notebook I would get errors when running the command.
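A null check like this can be sketched in a few lines, assuming pandas. The table below is a made-up stand-in, and the CSV round trip happens in memory rather than through a real file.

```python
import io

import pandas as pd

# Hypothetical stand-in for the combined player table (made-up values).
players = pd.DataFrame({
    "Player": ["Player B", "Player C"],
    "Season": ["2001-02", "2001-02"],
    "PTS": [10.0, 20.0],
    "MVP_Share": [0.0, 0.5],
})

# Round-trip through CSV in memory, mimicking a save and reload.
# index=False keeps the row index from coming back as an extra column.
buffer = io.StringIO()
players.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Count nulls per column; all zeros means nothing is missing.
null_counts = reloaded.isnull().sum()
total_nulls = int(null_counts.sum())

print(null_counts)
```

If any count is above zero, `reloaded[reloaded.isnull().any(axis=1)]` shows the rows that still need attention.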

Once I saw that there were no more null values for any of my players, I felt comfortable calling my data cleaned and ready for some EDA.
