Monday, October 22, 2018

GOAT Talk - Modeling AKA Hey, my man....What It Look Like....

After cleaning the data, doing EDA, and building charts, it was time for some modeling.

I started by taking a look at the general correlation of each feature in the table.
I decided to use PER (Player Efficiency Rating) as the output that I wanted to predict, and for a baseline I just used all of the features as predictors.  I did have to remove either VORP or Wins Over Replacement, because Wins Over Replacement is simply a multiple of VORP, and keeping both would mean feeding the model two perfectly collinear columns.
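
As a rough sketch of that correlation check (the file path and the 'WOR' column label are assumptions on my part, not from the original code):

```python
import pandas as pd

# Load the cleaned player stats table (path is hypothetical)
df = pd.read_csv('player_stats_clean.csv')

# Rank every feature by its correlation with the PER target
corr_with_per = df.corr(numeric_only=True)['PER'].drop('PER')
print(corr_with_per.sort_values(ascending=False))

# Wins Over Replacement is just a multiple of VORP, so only one
# of the two can stay ('WOR' is an assumed column name)
df = df.drop(columns=['WOR'])
```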

I used a list of regression models to start with: Linear Regression, KNeighbors Regressor, Decision Tree, Extra Trees, Random Forest, AdaBoost, Gradient Boosting, SVR, Lasso, and Ridge.  After train-test splitting my samples, the initial run had Linear Regression with the best score (0.994 train, 0.986 test), followed by Ridge (0.984 train, 0.9895 test) and Lasso (0.974 train, 0.982 test).
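
A minimal sketch of that baseline loop, assuming the cleaned DataFrame df from above with PER as the target (default hyperparameters throughout; the random_state is my own choice):

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import (ExtraTreesRegressor, RandomForestRegressor,
                              AdaBoostRegressor, GradientBoostingRegressor)
from sklearn.svm import SVR

X = df.drop(columns=['PER'])
y = df['PER']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

models = {
    'Linear Regression': LinearRegression(),
    'KNeighbors': KNeighborsRegressor(),
    'Decision Tree': DecisionTreeRegressor(),
    'Extra Trees': ExtraTreesRegressor(),
    'Random Forest': RandomForestRegressor(),
    'AdaBoost': AdaBoostRegressor(),
    'Gradient Boosting': GradientBoostingRegressor(),
    'SVR': SVR(),
    'Lasso': Lasso(),
    'Ridge': Ridge(),
}

# R^2 on train and test for each candidate model
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f'{name}: train {model.score(X_train, y_train):.3f}, '
          f'test {model.score(X_test, y_test):.3f}')
```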

After getting those scores, I wanted to see what the testing would look like after using PCA on the features.  The initial run only explained around 45% of the variance.  Running through the same models, I got similar scores for Linear Regression and Ridge, while Lasso slipped a bit (0.971 train, 0.978 test).  With this knowledge, I wanted to attempt one more adjustment, so I lowered the feature count from 43 to 21.  With the same models, I was able to explain closer to 70% of the variance this time.
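
Something like the following for the PCA step (scaling before PCA is an assumption on my part; the 21-component count comes from the run described above):

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Scale first so PCA isn't dominated by large-magnitude stats,
# then project down to 21 components
pca = make_pipeline(StandardScaler(), PCA(n_components=21))
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

# Total variance captured by the retained components
print(pca.named_steps['pca'].explained_variance_ratio_.sum())
```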

The scores dropped a little more: Linear Regression (0.957 train, 0.959 test), Lasso (0.939 train, 0.945 test), and Ridge (0.948 train, 0.947 test), but the overall ranking stayed the same.  And when I used that Linear Regression model to make predictions, they did not appear far off from the actual PER values.
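
A quick way to eyeball those predictions against the actual values (a sketch, reusing the PCA-transformed splits from above):

```python
# Fit the final model on the reduced feature space and
# compare predicted PER to actual PER on held-out players
lr = LinearRegression()
lr.fit(X_train_pca, y_train)
preds = lr.predict(X_test_pca)

comparison = pd.DataFrame({'actual': y_test, 'predicted': preds})
print(comparison.head(10))
```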


