How Did I Get Myself Into This - My Walk Into the Data Science Realm: 2018

Monday, October 22, 2018

GOAT Talk - Modeling AKA Hey, my man....What It Look Like....

After cleaning and doing EDA and charts with the data, now was the time for some modeling.

I started with taking a look at the general correlation of each feature in the table.
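A quick sketch of what that correlation check can look like in pandas. The rows and column names below are toy values made up for illustration, not the actual player table.

```python
import pandas as pd

# Toy rows standing in for the real player-season table (values are made up).
df = pd.DataFrame({
    "PER": [20.1, 24.3, 27.9, 31.2, 18.5],
    "PTS": [18.0, 24.5, 28.1, 32.0, 16.2],
    "TOV": [3.1, 2.8, 3.0, 2.5, 3.3],
})

# Pearson correlation of every feature with PER, strongest relationship first.
corr_with_per = (
    df.corr()["PER"]
    .drop("PER")
    .sort_values(key=abs, ascending=False)
)
print(corr_with_per)
```

Sorting by absolute value keeps strong negative relationships near the top instead of burying them at the bottom.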

I decided to use PER as the output that I wanted to predict. I wanted to start with a baseline and just use all of the features as predictors. I had to remove either VORP or Wins Over Replacement, because Wins Over Replacement is simply a multiple of VORP.
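To see why one of the two has to go, here is a small sketch with made-up VORP values. The column name `WORP` is my own placeholder, and 2.70 is the conversion factor quoted from Basketball Reference in the scraping post.

```python
import pandas as pd

# Made-up VORP values for illustration only.
df = pd.DataFrame({"VORP": [5.2, 7.9, 9.6, 3.1]})

# Wins Over Replacement as a fixed multiple of VORP.
df["WORP"] = df["VORP"] * 2.70

# A correlation of 1.0 means the second column adds no new information.
r = df["VORP"].corr(df["WORP"])
print(round(r, 6))

# Keep one of the pair; which one to drop is a judgment call.
features = df.drop(columns=["WORP"])
print(list(features.columns))
```

Leaving both columns in would hand the linear models two perfectly collinear predictors, which makes the individual coefficients unstable without adding any predictive power.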

I used a list of regression models to start with: Linear Regression, KNeighbors Regressor, Decision Tree, Extra Trees, Random Forest, AdaBoost, Gradient Boosting, SVR, Lasso, and Ridge. After train-test splitting my samples and doing the initial run, Linear Regression had the best score (0.994 - train, 0.986 - test), followed by Lasso (0.974 - train, 0.982 - test) and Ridge (0.984 - train, 0.9895 - test).
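A baseline loop over those models might look roughly like this in scikit-learn. The data here is synthetic (`make_regression`), with the row and feature counts borrowed from this post, and every model uses its default settings, so the scores will not match the ones above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    AdaBoostRegressor,
    ExtraTreesRegressor,
    GradientBoostingRegressor,
    RandomForestRegressor,
)
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the player table: 196 rows, 43 features.
X, y = make_regression(n_samples=196, n_features=43, noise=5.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

models = {
    "Linear Regression": LinearRegression(),
    "KNeighbors": KNeighborsRegressor(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Extra Trees": ExtraTreesRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(random_state=42),
    "AdaBoost": AdaBoostRegressor(random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
    "SVR": SVR(),
    "Lasso": Lasso(),
    "Ridge": Ridge(),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # For regressors, .score() returns R^2.
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))

for name, (train_r2, test_r2) in scores.items():
    print(f"{name}: {train_r2:.3f} - train, {test_r2:.3f} - test")
```

Printing train and test side by side makes overfitting easy to spot: a tree model with a perfect train score and a much lower test score is memorizing rather than generalizing.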

After receiving those scores, I wanted to see what the testing would look like after using PCA on the features. The initial run explained around 45% of the variance. Running through the same models, I got similar scores for Linear Regression and Ridge, but Lasso slipped slightly (0.971 - train, 0.978 - test). With this knowledge, I wanted to attempt one more adjustment, and I lowered the features from 43 to 21. Using the same models, I was able to explain closer to 70% of the variance this time.
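Here is a minimal sketch of checking explained variance with scikit-learn's PCA. It assumes the 21 refers to the number of components kept, which is only one possible reading, and it runs on synthetic data, so the percentage it prints says nothing about the real table.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 43-feature player table.
X, _ = make_regression(n_samples=196, n_features=43, random_state=42)

# PCA is sensitive to scale, so standardize the features first.
X_scaled = StandardScaler().fit_transform(X)

# Assumption: keep 21 principal components.
pca = PCA(n_components=21)
X_reduced = pca.fit_transform(X_scaled)

# Share of the total variance captured by the kept components.
explained = pca.explained_variance_ratio_.sum()
print(f"{explained:.1%} of the variance explained by {pca.n_components_} components")
```

`X_reduced` is what would then be passed through the same train-test split and model loop in place of the original features.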

The scores dropped more: Linear (0.957 - train, 0.959 - test), Lasso (0.939 - train, 0.945 - test), Ridge (0.948 - train, 0.947 - test), but overall the same three models still came out on top. And after using that Linear Regression model, the predictions did not appear far off.

GOAT Talk - Take That For Data

David Fizdale, the current head coach of the New York Knicks, was a rookie head coach for the Memphis Grizzlies in the 2016-17 NBA season. In the midst of a 1st-round playoff matchup with the San Antonio Spurs, his team got beat 96-82 in the 2nd game of the series. He started the press conference off calmly, discussing his team's effort against the #2 seed in the Western Conference. But soon he would start to show his frustration with the officiating crew, noting how his star players weren't getting calls and how there was a discrepancy in the free throws. "I'm not a numbers guy, but that doesn't seem to add up," Fizdale proclaimed. Eventually, his frustration boiled over, and after discussing Gregg Popovich's coaching tenure and "...they're not going to rook us," he gave us the phrase.......

And this is where the 3rd part of my Capstone Project led me. All of this seemingly bountiful data that I scraped and cleaned like a deer from Red Dead Redemption. This was the time to evaluate what I had and see how I could use it.

First things first, recover from the previous cleanings and merge all of the CSVs together into one giant dataframe.
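A rough sketch of that merge in pandas. The two in-memory CSVs, the column names, and the `Player`/`Season` join keys are all placeholders I made up to keep the example self-contained.

```python
import io

import pandas as pd

# Two tiny in-memory CSVs standing in for the saved files (made-up values).
per_game_csv = io.StringIO(
    "Player,Season,PTS\n"
    "Player A,2000-01,25.0\n"
    "Player B,2000-01,18.5\n"
)
advanced_csv = io.StringIO(
    "Player,Season,PER\n"
    "Player A,2000-01,24.1\n"
    "Player B,2000-01,19.3\n"
)

per_game = pd.read_csv(per_game_csv)
advanced = pd.read_csv(advanced_csv)

# Join on player and season; validate raises if a key is duplicated on either side.
merged = per_game.merge(
    advanced, on=["Player", "Season"], how="inner", validate="one_to_one"
)
print(merged)
```

The `validate="one_to_one"` argument is a cheap guard against silently duplicated rows, which is easy to miss once several tables are chained together.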

Then I narrowed down some of the data gathered for the modeling.

There were 214 rows of data collected across 14 NBA players to start with. I narrowed it down to those who played at least half of an 82-game season, and that dropped our data from 214 to 202 rows.

Then I wanted only the PER scores that were above the league average of 15. That took the data from 202 to 196.
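Those two filters boil down to a pair of boolean masks. A small sketch with made-up rows, where the column names `G` and `PER` are assumptions:

```python
import pandas as pd

# Made-up player-seasons for illustration.
df = pd.DataFrame({
    "Player": ["A", "A", "B", "B"],
    "G": [82, 35, 41, 70],
    "PER": [27.0, 22.0, 14.2, 19.5],
})

MIN_GAMES = 82 / 2        # at least half of an 82-game season
LEAGUE_AVG_PER = 15       # PER is standardized so the league average is 15

played_enough = df[df["G"] >= MIN_GAMES]
above_average = played_enough[played_enough["PER"] > LEAGUE_AVG_PER]

print(len(df), "->", len(played_enough), "->", len(above_average))
```

Printing the row counts after each step gives the same kind of running tally as the 214, 202, 196 numbers above.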

From there, it was adjusting some of the percentage stats into decimal format (25% = .25). After that, nothing left to do but some graphing and plotting to see how the features relate to each other.
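The percentage conversion is a one-liner once the right columns are picked out. In this sketch the column names and the choice of which ones sit on a 0-100 scale are assumptions.

```python
import pandas as pd

# Made-up values: two stats on a 0-100 scale, one already a decimal.
df = pd.DataFrame({
    "TRB%": [12.5, 18.0],
    "USG%": [25.0, 31.7],
    "FG%": [0.512, 0.463],
})

# Only convert the columns that are actually stored as percentages.
percent_cols = ["TRB%", "USG%"]
df[percent_cols] = df[percent_cols] / 100

print(df)
```

Listing the columns explicitly avoids dividing a stat like FG% a second time when it is already in decimal form.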

Thursday, October 18, 2018

GOAT Talk - Cleaning Up The Data

Amazingly, I was able to scrape 11 NBA players thanks to Selenium and some decent knowledge of BeautifulSoup. Now, it was time to actually do some initial cleaning and EDA.

First things first, Kareem's data. Good thing was that there was plenty of it. He played for 20 years with solid contributions for most of those years. For data science purposes, the NBA statisticians did him a disservice very early on.

It might not be as obvious, but there are many holes in Kareem's data from the 1969-70 season up to the 1976-77 season. Part of this is easy to fill in, as there was no 3-point shot in the NBA until the 1979-80 season. However, Steals, Blocks, Offensive Rebounds, Defensive Rebounds, VORP, and Plus/Minus were not known stats until the 1973-74 season. With those types of gaps, I had no real way to fill them in. I didn't want to just input data for the sake of keeping a full table. My only action was to remove all of his data from my gatherings.

Now, that leaves me with 10 players, but that doesn't satisfy the scientist in me. So, we need more data.

I decided to pick 4 new players to at least add to the variety. All of these players have won at least 1 MVP, have been to numerous playoffs, and won an NBA Championship (except for Iverson, with the awesome MVP year he had in 2000-01, but I digress).

I did the same as before by scraping their Per Game statistics,

and then some of the Advanced Statistics,

then adding Seasons, All-Star appearances, shares of the MVP voting, MVP Placing, and MVP trophies won.
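For a sense of the parsing side of the scrape, here is a BeautifulSoup sketch run against a made-up HTML snippet. The table `id`, the `data-stat` attribute names, and the numbers are assumptions for illustration; in a Selenium workflow the HTML string would typically come from `driver.page_source`.

```python
from bs4 import BeautifulSoup

# Made-up HTML shaped like a stats table (not copied from any real page).
html = """
<table id="per_game">
  <tbody>
    <tr><th data-stat="season">2000-01</th><td data-stat="g">80</td><td data-stat="pts_per_g">20.0</td></tr>
    <tr><th data-stat="season">2001-02</th><td data-stat="g">75</td><td data-stat="pts_per_g">22.5</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", id="per_game")

rows = []
for tr in table.find("tbody").find_all("tr"):
    # In this toy markup each cell names its column in a data-stat attribute.
    cells = tr.find_all(["th", "td"])
    rows.append({cell["data-stat"]: cell.get_text(strip=True) for cell in cells})

print(rows)
```

Every value comes back as a string, so the numeric columns still need converting before any modeling.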

After getting the new players added, it was time to check for any null values. **Notice the disclaimer. The work was already performed, but I saved the file to .csv, and once it was loaded back into Jupyter Notebook I would get errors when running the command.

Once I viewed that there were no more null values in any of my players, I felt at least comfortable with calling my data cleaned and ready for some EDA.
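A null check like that one can be as short as this in pandas. The frame is a toy example with one missing value planted on purpose.

```python
import numpy as np
import pandas as pd

# Toy table with one missing steals value.
df = pd.DataFrame({
    "Player": ["A", "B", "C"],
    "STL": [1.2, np.nan, 0.9],
    "BLK": [0.4, 2.1, 3.0],
})

# Count missing values per column and show only the columns that have any.
null_counts = df.isnull().sum()
print(null_counts[null_counts > 0])

# A single number is handy as a final "is it clean yet?" check.
total_nulls = int(null_counts.sum())
print(total_nulls)
```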

Friday, October 5, 2018

GOAT Talk - The Joys of Scraping

Well, after finally deciding on the subject of my Capstone, I had to start diving into data collection.

Yes, my feelings fully encapsulated the feelings of Kip, Jonah, and the Channel 4 News Team. No judgment on this page.

First, I had to pick the players that I thought best described the Greatest of All Time as far as the NBA goes. I didn't pick any International players simply because this is America!!!!!

I also wanted to include only players from the 1970s on, simply because most defensive stats like blocks and steals were not captured or thought of before that time. Also, the 3-point shot was not implemented until the 1979-80 season. That eliminated such greats as Bill Russell (11-time NBA Champion) and Wilt "The Stilt" Chamberlain (2-time NBA Champion, but more known for the only 100-point game in NBA history as well as the 2nd greatest movie appearance of all time....Kareem Abdul-Jabbar having the #1 of course....and #3).

From that point, I had it narrowed down to at least 11. Hard to pick honestly.

Kareem Abdul-Jabbar
Magic Johnson
Larry Bird
Isiah Thomas (The Detroit Pistons Point GOD!!!!!!)
Michael Jordan
Hakeem Olajuwon
Shaquille O'Neal
Tim Duncan
Kobe Bryant (The Black Mamba.....my personal GOAT)
LeBron James
Kevin Durant

LeBron and KD were difficult choices only because they are still playing in the league. As accomplished as they already are, it leads me to believe that if this project were done 4-5 years from now it would be much more robust.

Next, I had to decide what data I wanted exactly. Basketball Reference (https://www.basketball-reference.com/) is pretty much a cornucopia of knowledge, but how much did I want to take?

I figured the Per Game data was where I wanted to start: Age, Games Played, Minutes Per G, Field Goals Per G, FG Attempts Per G, FG %, 3PT Per G, 3PT Attempts Per G, 3PT %, 2PT Per G, 2PT Attempts Per G, 2PT %, Effective FG% (same as FG%, but adjusting for the fact that a 3PT is worth more and weighing more for those shots), Free Throws Per G, FT %, Offensive Rebounds Per G, Defensive Rebounds Per G, Total Rebounds Per G, Assists Per G, Steals Per G, Blocks Per G, Turnovers Per G, Fouls Per G, and Points Per G.

With the basic stats, I wanted something that would be more in depth as far as their skills. Perhaps, something more.........Advanced????

So, under the Advanced tab, I wanted the following stats: Age (to match up on the other table), PER (Player Efficiency Rating - measure of per-minute production standardized such that the league average is 15), True Shooting % (a measure of shooting efficiency that includes 2-pointers, 3-pointers and Free Throws), Total Rebound % (percentage of available rebounds a player grabbed while he was on the floor), Assist %, Steal %, Block %, Turnover %, Usage %, Win Shares (estimate of the number of wins contributed by a player), Offensive and Defensive Box +/- (box score estimate of the offensive points per 100 possessions and defensive points per 100 possessions a player contributed above a league-average player, translated to an average team), Box +/-, and VORP (Value Over Replacement Player - a box score estimate of the points per 100 TEAM possessions that a player contributed above a replacement-level (-2.0) player, translated to an average team and prorated to an 82-game season; can be multiplied by 2.70 to convert to Wins Over Replacement Player).

I also wanted to find and notate their accomplishments. So time to find all awards given....

All-Star Games. MVPs. MVP Voting Shares (fun fact - the league MVP was voted on by the players until the 1979-80 season. Since the 1980-81 season, the award is decided by a panel of sportswriters and broadcasters throughout the United States and Canada.). All-League Team (All-Rookie, All-NBA, All-Defensive).

Now that I have the list of players, and the data I think would be important, it's time to actually start the data scrape. Here is where my headache began.

Sunday, September 30, 2018

GOAT Talk - The Big Bad Voodoo Daddy

In between work and the DSI class, I haven't had a lot of time to update my blog. As I sit in the middle of class.....which is probably not the best time to blog post.....I have been trying to decide on my capstone project. I wanted to stay with something basketball related. My first few searches were mostly prediction based. Lots of "Predict the MVP", "Predict the Rookie of the Year", or "Predict the Scoring Champ". However, I like to make life much more difficult on myself than I should.......and I wanted to pick the best basketball player of ALL TIME based off stats. So, here is to hours of possible headaches, but what could possibly be an interesting project in general.

Friday, August 10, 2018

Project 1 - I Love the Smell of Napalm In The Morning

This has quite possibly been the nightmare I have had for the last few weeks. I knew this course was going to be rough, but I genuinely was not prepared mentally for the first project.

We dove into the ideas of distributions, hypothesis testing and confidence intervals this week. I have been up for the past 5 hours, but I think I have something to show for it. The Confidence Interval portion has certainly piqued my interest. Being able to (remember, it has been 10+ years since college math) calculate a range where a value should lie with 95% confidence seems like it will come in handy for presentations.
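As a refresher on the mechanics, here is a small sketch of a 95% confidence interval for a mean using only the standard library. The sample values are made up, and it uses the normal approximation; for a sample this small a t-distribution would give a slightly wider interval.

```python
import math
import statistics

# Made-up sample values for illustration.
sample = [22.1, 19.8, 24.5, 21.0, 23.3, 20.7, 25.1, 18.9, 22.8, 21.6]

n = len(sample)
mean = statistics.mean(sample)

# Standard error of the mean: sample standard deviation over sqrt(n).
sem = statistics.stdev(sample) / math.sqrt(n)

# z-value leaving 2.5% in each tail of the standard normal (about 1.96).
z = statistics.NormalDist().inv_cdf(0.975)

lower, upper = mean - z * sem, mean + z * sem
print(f"mean = {mean:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

Strictly speaking, the 95% describes the procedure: if the sampling were repeated many times, about 95% of the intervals built this way would contain the true mean.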

The graphing portion, I definitely have work to do. I still do not feel as confident remembering the formulas for putting out histograms and distribution plots.

PowerPoint is still a mystery, but overall.....I feel a lot better now than I did earlier this week.

https://git.generalassemb.ly/omarcarr/project-1