After cleaning the data and exploring it with EDA and charts, it was time for some modeling.
I started by taking a look at the overall correlation of each feature in the table.
I decided to use PER as the target I wanted to predict. For a baseline, I simply used all of the features as predictors. I did have to remove either VORP or Wins Over Replacement, because Wins Over Replacement is simply a multiple of VORP.
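The correlation check and the collinearity drop can be sketched with pandas. This is a minimal toy stand-in, not my actual table; the column names (and the exact multiplier relating Wins Over Replacement to VORP) are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the cleaned stats table (column names assumed)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "MP": rng.uniform(5, 40, 100),       # minutes played
    "VORP": rng.normal(0, 2, 100),       # value over replacement player
})
df["WORP"] = df["VORP"] * 2.7            # a pure multiple of VORP (assumed factor)
df["PER"] = 15 + 0.3 * df["MP"] + rng.normal(0, 1, 100)

# Pairwise correlation of every feature
corr = df.corr()
print(corr.loc["VORP", "WORP"])          # perfectly collinear pair

# Keep only one of the collinear pair before modeling
df = df.drop(columns=["WORP"])
```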
I used a list of regression models to start with: Linear Regression, KNeighbors Regressor, Decision Tree, Extra Trees, Random Forest, AdaBoost, Gradient Boosting, SVR, Lasso, and Ridge. After train-test splitting my samples, the initial run showed Linear Regression with the best score (0.994 train, 0.986 test), followed by Lasso (0.974 train, 0.982 test) and Ridge (0.984 train, 0.9895 test).
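Looping over that list of regressors looks something like the sketch below. I use scikit-learn's synthetic `make_regression` as a stand-in for the real feature table (43 features, matching the count mentioned later); the actual data, split ratio, and random seeds are assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (AdaBoostRegressor, ExtraTreesRegressor,
                              GradientBoostingRegressor, RandomForestRegressor)
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the real features/target (assumption)
X, y = make_regression(n_samples=500, n_features=43, noise=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

models = {
    "Linear Regression": LinearRegression(),
    "KNeighbors": KNeighborsRegressor(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Extra Trees": ExtraTreesRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(random_state=42),
    "AdaBoost": AdaBoostRegressor(random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
    "SVR": SVR(),
    "Lasso": Lasso(),
    "Ridge": Ridge(),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    # .score() returns R^2 for regressors
    print(f"{name}: train={model.score(X_train, y_train):.3f}, "
          f"test={model.score(X_test, y_test):.3f}")
```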
After receiving those scores, I wanted to see how the testing would look after applying PCA to the features. The initial run explained only around 45% of the variance. Running through the same models, I got similar scores for Linear Regression and Ridge, while Lasso came in at 0.971 train, 0.978 test. With that knowledge, I attempted one more adjustment and went from 43 features down to 21 components. With the same models, I was able to explain closer to 70% of the variance this time.
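The PCA step itself is a short one in scikit-learn. Again, the data here is a synthetic stand-in, and scaling before PCA is my assumption about the preprocessing; `explained_variance_ratio_` is what gives the "percent of variance explained" figure.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real 43-feature table (assumption)
X, _ = make_regression(n_samples=500, n_features=43, random_state=42)
X_scaled = StandardScaler().fit_transform(X)   # PCA is scale-sensitive

pca = PCA(n_components=21)                     # reduce 43 features to 21 components
X_reduced = pca.fit_transform(X_scaled)

# Fraction of total variance the kept components explain
print(pca.explained_variance_ratio_.sum())
```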
The scores dropped further: Linear (0.957 train, 0.959 test), Lasso (0.939 train, 0.945 test), Ridge (0.948 train, 0.947 test), but the models' relative ranking stayed the same. And after using that Linear Regression model, the predictions did not appear far off.