Name: Suhas Venkatesan
Introduction
In this project, I analyze a large set of NBA team data in order to build a model that predicts an NBA team's win percentage for a given season.
The NBA generates roughly 8 billion dollars in revenue each year. The ultimate goal for every team is to win the coveted NBA championship in June, but the first step toward making that happen is maximizing the number of wins in the regular season. Only 8 teams from each of the two conferences qualify for the playoffs every year, and the teams with more wins receive higher playoff seeds, meaning they have home court advantage in more matchups and an easier path to the finals. The more wins a team earns in the regular season, the more likely it is to attract sponsorships, contracts, and ticket sales the following year. To learn more about how standings work in the NBA, you can take a quick look at https://basketballnoise.com/how-do-standings-work-in-the-nba/.
The modern NBA is heavily dependent on data analytics, and as a huge NBA fan myself, I wanted to see whether a team's win percentage could be predicted using data science. Data like this is not always easy to access in a clean form, and relatively few public projects tackle this exact question. It is a valuable problem: sports analysts and sports bettors alike need accurate models to estimate how many games a team will win by the end of the season given its current statistics.
I will be using these libraries in my code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb
To collect the data, I built a custom scraper that pulls statistics from https://www.basketball-reference.com/ , a well-known basketball statistics site. I adapted code from an existing scraper at https://github.com/vishaalagartha/basketball_reference_scraper/blob/master/API.md, modifying it to scrape team statistics for every year from 1980 through 2019. The source code for the scrapers is included in my Github repo, at https://github.com/Suhas-Venkatesan/Suhas-Venkatesan.github.io/blob/main/MyNBAScraper.py. I will be looking at all NBA team data from 1980 to 2019. There are 30 teams in the NBA, and each row of the data represents one team in one season from 1980 to 2019, along with that team's per-game statistics at the end of the year. For example, the feature 'PTS' is the points per game the team scored over that season, and '3P%' is the team's three-point shooting percentage for that season. The team data itself is accessible at https://github.com/Suhas-Venkatesan/Suhas-Venkatesan.github.io/blob/main/nbateamdata.csv.
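For readers curious about the mechanics of the scraping, here is a minimal sketch of the kind of call the scraper is built around. Basketball-reference publishes its team tables as HTML, which pandas can parse directly; the URL pattern below is an assumption based on the site's league-summary pages, and the full scraper in the repo handles the year-by-year looping and cleanup.
# Minimal sketch (assumed URL pattern) -- not the full scraper from the repo.
# pandas.read_html returns every HTML table it can find on the page.
tables = pd.read_html("https://www.basketball-reference.com/leagues/NBA_2019.html")
print(len(tables))  # the team per-game stats live in one of these tables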
df = pd.read_csv('nbateamdata.csv')
df.head()
This is a rich dataset with over 50 features. The column names are abbreviations for various per-game statistics, and they may be confusing to non-NBA fans. However, every NBA statistic abbreviation, along with its name and definition, can be found in this glossary: https://www.nba.com/stats/help/glossary/
The main feature we are trying to predict is the win/loss percentage of each team. Since there are typically 82 games in an NBA season, the number of wins alone would normally suffice. However, there are a few seasons where the total number of games was not 82; one example is the NBA lockout, which shortened the 2011-12 season to 66 games. To normalize for this, we look at win percentage, the fraction of games played that were wins, which we add as a new feature.
df['Win Percentage'] = df['W']/ (df['W'] + df['L'])
Something else that is important is to remove any features that are either completely irrelevant or too obvious a giveaway for predicting win percentage. We can remove games played, wins, and losses because these numbers are already factored into the win percentage calculation. PW and PL refer to Pythagorean wins and losses, which are win and loss estimates based on a sports analytics formula originally developed by the baseball statistician Bill James. These estimates correlate very highly with a team's actual wins and losses, so it's a good idea to remove them as well (a brief sketch of the formula appears below). We should also drop attendance and attendance per game because they have a lot of null values, and I do not believe it is worth imputing them since attendance per game is very difficult to estimate. Arena is also irrelevant here because it essentially just corresponds to the team name.
df = df.drop(columns = ['G', 'W', 'L', 'PW', 'PL', 'ARENA', 'ATTENDANCE', 'ATTENDANCE/G'])
df.describe(include='all')
df.head()
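As a brief aside before moving on, here is a small sketch of what the Pythagorean formula mentioned above looks like for basketball. The exponent varies by source (values around 14 are commonly cited), and the inputs here are hypothetical numbers rather than columns from our dataset; this is just to illustrate why PW and PL track actual wins so closely.
# Illustrative sketch only -- the exponent and example inputs are assumptions,
# not values taken from the dataset above.
def pythagorean_win_pct(pts_for, pts_against, exponent=14):
    # Estimated win fraction from points scored and points allowed per game
    return pts_for**exponent / (pts_for**exponent + pts_against**exponent)

# A team outscoring opponents by ~4 points per game projects to roughly a 62% win rate
print(pythagorean_win_pct(112.0, 108.0))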
With such a wealth of features it is important that we engage in exploratory data analysis. We should first start with a correlation heatmap of all the features at once.
plt.figure(figsize=(20,15))
dataplot= sb.heatmap(df.corr())
plt.show()
If we look at what is correlated with win percentage, we can see that statistics like MOV (average margin of victory), SRS (a rating which factors in strength of schedule and average margin of victory), and NRtg (net rating, which is offensive rating minus defensive rating) are almost perfectly positively correlated with win percentage, with coefficients near 1.0. This makes sense because these statistics generally measure how strong the team is overall, which naturally translates into more wins.
On the opposite end, DRtg (defensive rating) and SOS (strength of schedule) are almost perfectly negatively correlated with win percentage, at around -1. This makes sense for defensive rating, because the lower a defensive rating is, the better the team is at defense.
However, I am very surprised by the strength of schedule correlation. The heatmap shows that the easier a team's schedule is (meaning the teams it plays against are weaker), the higher its win percentage. Yet many great teams over the past decade, such as the Golden State Warriors, play in competitive Western divisions where they face some of the toughest schedules in the league, while some subpar teams like the Orlando Magic generally have easier schedules. Despite this, Orlando almost always has a losing record while Golden State usually posts among the highest win percentages in the NBA. One partial explanation may be that a team never appears on its own schedule, so the very best teams face a slightly weaker average opponent and the worst teams face a slightly stronger one. Either way, it is a very interesting finding.
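To read the correlations with win percentage directly rather than eyeballing the heatmap, we can pull out that single column and sort it (selecting only numeric columns first so the team and season labels stay out of the way).
# Correlation of every numeric feature with win percentage, strongest first
win_corr = df.select_dtypes('number').corr()['Win Percentage'].sort_values(ascending=False)
print(win_corr)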
Next, we will make histograms to investigate the distribution of each feature.
df.hist(figsize = (20,20))
plt.show()
Most features appear roughly normally distributed without noticeable skew, but there are some exceptions. MP (minutes played) is heavily skewed to the right, meaning most teams cluster around a lower number of minutes played per game while a small number of teams log more. FG (field goals) and FGA (field goal attempts) are also slightly right skewed. This makes sense: teams in the newer era of the NBA, over the past decade, attempt and make more shots than teams in the past did, as the league's focus has shifted from defense to offense.
However, the most important finding from these histograms, in my opinion, is the distribution of 3P (three pointers made), 3PA (three point attempts), and 3P% (three point percentage). It is no secret that the modern NBA has transitioned from barely shooting any threes in its early days to a heavily increased emphasis on the three pointer. 3P and 3PA are heavily skewed right because in most seasons from 1980 to around 2010 teams attempted relatively few threes, while a small number of recent teams attempt a very large number. 3P%, by contrast, is skewed to the left: most teams convert threes at a reasonably high rate, with a small tail of teams (largely from the early 1980s, when the shot was new) that made only a small percentage of them.
I also want to point out 2P (two pointers made) and 2PA (two point attempts), which appear to be the only bimodal features, with values clustering around two distinct ranges. This is an interesting finding, and I cannot think of an obvious reason why this would be the case.
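To back up these visual impressions with numbers, we can compute the sample skewness of the features discussed above; positive values indicate a right skew and negative values a left skew.
# Skewness of the distributions discussed above (positive = right skew)
print(df[['MP', 'FG', 'FGA', '3P', '3PA', '3P%', '2P', '2PA']].skew())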
Something else that may be worth looking at is the win percentage of different teams, or trying to see how much the team/organization itself influences winning. As NBA fans, we know that certain organizations such as the Lakers and Celtics have continued success over time due to their market, brand, and management style.
# Making a new dataframe with each team and their average win percentage
df_teams = df.groupby("TEAM")["Win Percentage"].mean()
df_teams = df_teams.to_frame()
index = df_teams.index
df_teams["TEAM"] = index
df_teams = df_teams.sort_values('Win Percentage')
# making a bar graph
y_axis = df_teams["Win Percentage"]
x_axis = df_teams["TEAM"]
plt.figure(figsize=(8,32))
plt.barh(x_axis, y_axis)
plt.title('Historical Team vs. Average Win Percentage')
plt.ylabel('Team')
plt.xlabel('Average Win Percentage')
plt.show()
By average win percentage over this period, the winningest franchises are the San Antonio Spurs, Los Angeles Lakers, Boston Celtics, and Oklahoma City Thunder. It is somewhat surprising to see the Thunder on that list, because they are not a large-market team. On the other end, the worst franchises by these numbers have been the Minnesota Timberwolves, the Charlotte Bobcats (now the Charlotte Hornets), and the Los Angeles Clippers.
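To read off the exact averages behind the chart, we can look at the two ends of the sorted table (df_teams is sorted in ascending order, so the tail holds the strongest franchises).
# Lowest and highest average win percentages in the dataset
print(df_teams['Win Percentage'].head())   # weakest franchises, 1980-2019
print(df_teams['Win Percentage'].tail())   # strongest franchises, 1980-2019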
Next, we can look at how different team statistics have changed over time across the whole league. There are a few trends I expect to see, such as the three-point statistics trending upward and overall field goal percentage trending downward over time. I am choosing to exclude a few statistics that are not relevant here.
from sklearn.linear_model import LinearRegression
# The year can be the first year of each season
df['year'] = df['SEASON'].astype(str).str[0:4].astype(int)
df_overtime = df
# Since we are investigating percentages, attempts and makes for every category
# of shot can be dropped.
df_overtime = df_overtime.drop(columns = ['SEASON', 'FG', 'FGA', '3P', '3PA', '2P', '2PA', 'FT', 'FTA'])
# minutes played is basically always the same so that can also be dropped.
# We can also drop some really advanced statistics that are difficult to
# interpret, and some statistics that are really just calculations using statistics
# that we already are plotting
df_overtime = df_overtime.drop(columns = ['MP','MOV', 'SOS', 'SRS', 'NRtg',
'FTr', '3PAr', 'TS%', 'eFG%', 'TOV%',
'ORB%', 'FT/FGA', 'DRB%', 'TRB', 'Win Percentage'])
#creating subplots and relevant columns/labels
fig, axes = plt.subplots(ncols = 1, nrows = 16, squeeze = False, figsize = (10, 100))
cols = ['FG%', '3P%', '2P%', 'FT%', 'ORB', 'DRB', 'AST', 'STL', 'BLK',
'TOV', 'PF', 'PTS', 'AGE', 'ORtg', 'DRtg', 'PACE']
y_labels = ['Field Goal Percentage', 'Three Point Percentage', 'Two Point Percentage', 'Free Throw Percentage', 'Offensive Rebounds',
'Defensive Rebounds', 'Assists', 'Steals', 'Blocks', 'Turnovers', 'Personal Fouls', 'Points', 'Age', 'Offensive Rating',
'Defensive Rating', 'Pace']
years = [x for x in range(1979, 2019)]
# Make a subplot for each statistic: a scatter of the league-wide average per
# season, plus a fitted linear trend line.
for k, stat in enumerate(cols):
    stat_results = []
    # Average this statistic across all teams for each season
    for year in years:
        year_stats = df_overtime[df_overtime["year"] == year]
        stat_results.append(year_stats[stat].mean())
    # Fit a linear regression to the yearly averages
    linear_regressor = LinearRegression()
    linear_regressor.fit(np.array(years).reshape(-1, 1), stat_results)
    y_pred = linear_regressor.predict(np.array(years).reshape(-1, 1))
    # Add the scatter points, trend line, and labels to the subplot
    axes[k, 0].scatter(years, stat_results)
    axes[k, 0].plot(years, y_pred, color = 'red')
    axes[k, 0].title.set_text("Average " + y_labels[k] + " versus Year")
    axes[k, 0].set_xlabel("Year")
    axes[k, 0].set_ylabel("Average " + y_labels[k])
plt.show()
These graphs are fascinating. Here is my analysis and interpretation of the various graphs:
Three Point Percentage: As expected, the average three point percentage has trended strongly upward over the years, and the curve has a slight parabolic shape. This article has more information about how the NBA has evolved to place more emphasis on the three pointer: https://www.nba.com/news/3-point-era-nba-75
Field Goal Percentage: The negative trend in field goal percentage is expected. However, there is a huge dip around the year of 2000. This may be attributed to the fact that the pace of the league also follows roughly the same trend with a similar dip.
Two Point Percentage: This also follows a parabola like curve with a dip near 2000.
Free Throw Percentage: There is no correlation.
Offensive Rebounds: There is a very strong negative linear trend; teams have steadily grabbed fewer offensive rebounds every year. The only explanation I can think of is that there are fewer dominant centers today who are strong enough to grab offensive rebounds, so a larger share of rebounds end up as defensive rebounds.
Defensive Rebounds: There is a positive correlation here, especially in the last few years where it has risen at a faster rate. I think this is expected, and can be explained by the lack of dominant centers in the modern NBA.
Assists: This feature trends negatively with a parabolic shape, but is starting to increase again in the last few years.
Steals: This trends negatively, and this makes sense because the league has become less focused on defense.
Blocks: This trends negatively, this makes sense for the same reason as the steals.
Turnovers: This has a strong, negative, linear correlation. This also makes sense for the same reason as the steals and blocks.
Personal Fouls: There is a reasonably strong negative, linear correlation here which I am very surprised by. The officiating in the modern game has a reputation for being "soft", or calling too many fouls, so it is interesting to see that the number of fouls is actually going down from the supposedly tougher eras of the 80's and 90's.
Points: This feature follows a parabola shaped trend, and is trending back upwards in recent years. This statistic also has a dip around the year of 2000.
Age: This statistic follows a parabola shaped trend, but peaks around the year of 2000. This is very strange, as there seems to be no reason for the average age of the league to fluctuate much at all.
Offensive Rating: There is a very slight positive correlation.
Defensive Rating: There is a very slight positive correlation here as well. Since a higher defensive rating means more points allowed per 100 possessions, this is consistent with the common view that defenses today are, as a whole, weaker than defenses of the past, although it could equally reflect stronger offenses.
Pace: This statistic follows a parabola shaped trend which is maximized around the 80's and the modern day, with a dip around the year of 2000. This is interesting because today's game is very fast paced, and NBA fans do not typically think of the 80's as being a fast paced era, but the numbers suggest otherwise.
Overall, I am very perplexed by the sudden change in trends around the year 2000, as nothing significant that I am aware of changed in the league around that time besides the loosening of a few defensive rules. For more information on how the NBA has changed over time, I would recommend visiting https://bleacherreport.com/articles/1282804-how-the-nba-game-has-changed-over-the-last-decade.
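To attach rough numbers to these trends, we can also print the slope of each fitted line, i.e. the average change in each statistic per year over 1979-2018 (this reuses the df_overtime frame and the column lists defined above).
# Slope of the linear fit for each statistic, in units of that stat per year
for stat, label in zip(cols, y_labels):
    yearly_avg = df_overtime.groupby("year")[stat].mean()
    slope = np.polyfit(yearly_avg.index, yearly_avg.values, 1)[0]
    print("{}: {:+.4f} per year".format(label, slope))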
In the last section, we visualized and analyzed the data thoroughly in order to gain a deeper understanding of what we were dealing with. The next step is to use our data to build an actual machine learning model which can be used to predict a team's win percentage. Since the win percentage is continuous, I will need to use regression. The two machine learning models I am going to use are multiple linear regression (MLR) and random forest regression.
from sklearn.preprocessing import MinMaxScaler
# We will again drop all of the stats that factor into percentage stats that we
# are already including, or are just calculations using statistics that we are already
# including.
df_ML = df.drop(columns = ['SEASON', 'FG', 'FGA', '3P', '3PA', '2P', '2PA', 'FT', 'FTA', 'year'])
df_ML1 = df_ML.drop(columns = ['MP','MOV', 'SOS', 'SRS', 'NRtg',
'FTr', '3PAr', 'TS%', 'eFG%', 'TOV%',
'ORB%', 'FT/FGA', 'DRB%', 'TRB'])
# The team is
# categorical and is also not essential for our model since there are
# much better predictors we can use.
df_ML2 = df_ML1.drop(columns = ['TEAM'])
df_ML2.head()
# Since some of the variables are on a percentage scale, and others are not,
# we need to normalize all of the variables on a scale from 0 to 1.
# create a scaler object
scaler = MinMaxScaler()
# fit and transform the data
df_norm = pd.DataFrame(scaler.fit_transform(df_ML2), columns=df_ML2.columns)
df_norm = df_norm.drop(columns=df.columns[0])
df_norm.head()
Now that the data is normalized and our predictors are ready to go, we can split the data into training and testing sets. I am using an 80-20 split between training and testing.
y = df_norm['Win Percentage']
X = df_norm.drop('Win Percentage',axis=1)
X.shape, y.shape
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 12)
X_train.shape, X_test.shape
The first model I will be using is multiple linear regression. Regression models describe relationships between variables by fitting a line to the observed data. Linear regression uses OLS (Ordinary Least Squares), meaning it finds the line that minimizes the sum of squared residuals across the data points. Multiple linear regression fits this relationship using more than one predictor. I will be attempting to create a model that uses a specific subset of the features in my data to predict a team's win percentage. To learn more about how multiple linear regression works, visit https://www.scribbr.com/statistics/multiple-linear-regression/.
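As a quick illustration of what OLS is doing under the hood, the coefficients it produces can be written in closed form as beta = (X^T X)^(-1) X^T y. The small check below solves that least-squares problem directly with numpy (using a least-squares solver for numerical stability, since our features are highly correlated) and should recover essentially the same coefficients statsmodels reports in the next step; it is purely illustrative and not part of the modeling pipeline.
# Illustration: solve the OLS problem directly with numpy.
# An intercept column of ones is prepended, matching sm.add_constant below.
X_mat = np.column_stack([np.ones(len(X_train)), X_train.values])
beta, residuals, rank, sv = np.linalg.lstsq(X_mat, y_train.values, rcond=None)
print(beta[:5])  # intercept followed by the first few feature coefficients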
#Build a linear model
import statsmodels.api as sm
X_train_lm = sm.add_constant(X_train)
lr_1 = sm.OLS(y_train, X_train_lm).fit()
lr_1.summary()
After running a multiple linear regression model on our data, we can see which features might be removed to improve the model. Using an alpha level of .1, several of the predictors have p-values above .1, so they are candidates for removal (they are listed programmatically below). Before dropping anything, however, we must also check for multicollinearity.
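As a quick aside before the multicollinearity check, one way to list those weak predictors programmatically is to filter the p-values stored on the fitted statsmodels results.
# Predictors from the first model with p-values above the .1 threshold
high_p = lr_1.pvalues[lr_1.pvalues > 0.1].sort_values(ascending=False)
print(high_p)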
# Checking for the VIF values of the variables.
from statsmodels.stats.outliers_influence import variance_inflation_factor
# Creating a dataframe that will contain the names of all the feature variables and their VIFs
vif = pd.DataFrame()
vif['Features'] = X_train.columns
vif['VIF'] = [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif
This model has extremely high VIFs, meaning the features are highly correlated with each other. This makes a lot of sense, because many statistics in basketball are inherently collinear: for example, a team that scores a lot of points will inevitably have a high offensive rating. I will remove a few of the parameters with extremely high VIFs so that we do not supply redundant information to the model, along with the parameters from my original model whose p-values are above the .1 threshold. Although, as a rule of thumb, variables with a VIF above 5 should be removed, I cannot follow that rule strictly here because it would leave no variables at all.
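For intuition, the VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all of the remaining features. The sketch below verifies this for a single, arbitrarily chosen column, reusing the LinearRegression import from earlier; note that it fits an intercept in the auxiliary regression, so the number may differ slightly from the table above, which was computed without an added constant.
# Manual VIF for one feature: regress it on the other features and use R^2
feature = 'PTS'  # any column of X_train would work here
others = X_train.drop(columns=[feature])
aux = LinearRegression().fit(others, X_train[feature])
r_squared = aux.score(others, X_train[feature])
print('VIF for', feature, '=', 1 / (1 - r_squared))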
# Dropping highly correlated variables and insignificant variables
X_train_new = X_train.drop(columns = ['ORtg', 'ORB', 'TOV', 'AST', 'DRB',
                                      'STL', '2P%', 'FT%', '3P%'])
lr_2 = sm.OLS(y_train, X_train_new).fit()
# Printing the summary of the final model
print(lr_2.summary())
# Calculate the VIFs again for the new model
vif = pd.DataFrame()
vif['Features'] = X_train_new.columns
vif['VIF'] = [variance_inflation_factor(X_train_new.values, i) for i in range(X_train_new.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif
All of the high p values are taken care of. Although we still do have multicollinearity, our VIF values are much smaller with the new model and this is the best we can do with this data. Next, we must check the important assumptions of linear regression to see that we have a valid model. We will plot a histogram of the error terms below.
import seaborn as sns
y_train_percentage = lr_2.predict(X_train_new)
# Plot the histogram of the error terms
# (sns.distplot is deprecated, so we use histplot with a KDE overlay instead)
fig = plt.figure()
sns.histplot(y_train - y_train_percentage, bins = 20, kde = True, stat = 'density')
fig.suptitle('Residuals Histogram', fontsize = 20) # Plot heading
plt.xlabel('Residuals', fontsize = 18) # X-label
plt.show()
The error terms closely resemble a normal distribution, so the normality of the errors is satisfied.
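As an additional, optional check on the normality assumption, a Q-Q plot of the residuals against a normal distribution should fall roughly along a straight line.
# Q-Q plot of the residuals against a theoretical normal distribution
import scipy.stats as stats
fig = plt.figure()
stats.probplot(y_train - y_train_percentage, dist="norm", plot=plt)
plt.title('Q-Q Plot of Residuals')
plt.show()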
from sklearn.linear_model import LinearRegression
x = df_norm[['PTS', 'PACE', 'FG%', 'DRtg', 'PF', 'BLK', 'AGE']]
y = df_norm['Win Percentage']
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 100)
mlr = LinearRegression()
mlr.fit(x_train, y_train)
y_pred_mlr= mlr.predict(x_test)
# This dataframe shows what the actual test values are, and what my model predicted
mlr_diff = pd.DataFrame({'Actual value': y_test, 'Predicted value': y_pred_mlr})
mlr_diff.head()
Now, we can evaluate the final model and actually use it to predict win percentages for teams.
#Model Evaluation
from sklearn import metrics
meanAbErr = metrics.mean_absolute_error(y_test, y_pred_mlr)
meanSqErr = metrics.mean_squared_error(y_test, y_pred_mlr)
rootMeanSqErr = np.sqrt(metrics.mean_squared_error(y_test, y_pred_mlr))
acc_train_lr = mlr.score(x_train, y_train)
acc_test_lr = mlr.score(x_test, y_test)
print('R squared (train): {:.2f}'.format(acc_train_lr))
print('R squared (test): {:.2f}'.format(acc_test_lr))
print('Mean Absolute Error:', meanAbErr)
print('Mean Square Error:', meanSqErr)
print('Root Mean Square Error:', rootMeanSqErr)
This is a strong model: the errors are small and the R-squared is high. The R-squared of the final linear model on the testing data was about .94, essentially the same as the R-squared on the training data (reported in the OLS summary above), which suggests the model is not overfitting. Root mean squared error is a standard way to evaluate a regression model, and here it is a small value of about .052, meaning the predictions sit very close to the actual values. Our model can predict the win percentage of an NBA team quite accurately.
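To show the model in action on a single team, we can feed it one row of normalized statistics from the test set and compare the prediction to the truth. Keep in mind that because the target was min-max scaled along with the features, both numbers are on the normalized 0-1 scale rather than the raw win percentage.
# Predict the (normalized) win percentage for a single team from the test set
sample = x_test.iloc[[0]]            # one row, kept as a DataFrame
predicted = mlr.predict(sample)[0]
actual = y_test.iloc[0]
print('Predicted:', round(predicted, 3), ' Actual:', round(actual, 3))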
The next model we are using is random forest regression, which is built on decision trees, a very powerful tool in machine learning. A decision tree is a supervised learning model that passes each input through a series of decision nodes until a final prediction is reached at a leaf. A random forest trains many decision trees and, for regression, averages their outputs to produce the final prediction. If you want to learn more about how random forests work in regression problems, I would recommend visiting the link below: https://levelup.gitconnected.com/random-forest-regression-209c0f354c84
# The random forest is trained on the same full-feature train/test split
# (X_train, y_train) that we created earlier, before any features were dropped.
# Creating and fitting the model
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor()
rf.fit(X_train, y_train)
Now, we can evaluate the model.
#predicting the target value from the model for the samples
y_test_rf = rf.predict(X_test)
y_train_rf = rf.predict(X_train)
#computing the accuracy of the model performance
acc_train_rf = rf.score(X_train, y_train)
acc_test_rf = rf.score(X_test, y_test)
from sklearn.metrics import mean_squared_error
#computing root mean squared error (RMSE)
rmse_train_rf = np.sqrt(mean_squared_error(y_train, y_train_rf))
rmse_test_rf = np.sqrt(mean_squared_error(y_test, y_test_rf))
print('\nRandom Forest: The RMSE of the training set is:', rmse_train_rf)
print('Random Forest: The RMSE of the testing set is:', rmse_test_rf)
This model has a much higher root mean squared error on the testing data than the linear model: .2051. The closer this number is to zero, the better the model, since it means the errors are smaller. We can try to improve the model by tuning its hyperparameters.
# We first increase the number of trees to 700. More trees generally make the
# ensemble's predictions more stable, since more trees are averaged together.
# We also set n_jobs to -1 to remove any restriction on the number of
# processors the model is allowed to use.
# We set max_depth to 50, which allows each tree to make many splits and
# capture fine-grained structure in the data.
rf_improved = RandomForestRegressor(n_estimators=700, n_jobs= -1, max_depth=50)
rf_improved.fit(X_train, y_train)
We can now evaluate the improved model.
#predicting the target value from the model for the samples
y_test_rf_improved = rf_improved.predict(X_test)
y_train_rf_improved = rf_improved.predict(X_train)
#computing the accuracy of the model performance
acc_train_rf_improved = rf_improved.score(X_train, y_train)
acc_test_rf_improved = rf_improved.score(X_test, y_test)
from sklearn.metrics import mean_squared_error
#computing root mean squared error (RMSE)
rmse_train_rf_improved = np.sqrt(mean_squared_error(y_train, y_train_rf_improved))
rmse_test_rf_improved = np.sqrt(mean_squared_error(y_test, y_test_rf_improved))
print('\nRandom Forest Improved: The RMSE of the training set is:', rmse_train_rf_improved)
print('Random Forest Improved: The RMSE of the testing set is:', rmse_test_rf_improved)
After tuning the parameters, we were able to get the root mean squared error down to .2040. Although this is not a significant improvement, it is still slightly better.
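It is also worth peeking at which statistics the tuned forest actually relies on; scikit-learn exposes this through the fitted model's feature_importances_ attribute, indexed here by the training columns.
# Which features drive the random forest's predictions the most
importances = pd.Series(rf_improved.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False))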
After creating both models, it is clear that the multiple linear regression model wins out, due to its significantly lower error values. If we were to choose one model to predict win percentages, it should be the linear regression model.
We covered a lot of ground in this tutorial, and I have attached some more links for anyone who wants to continue working with data science/machine learning or NBA data analytics.
Other Regression Models:
NBA Analytics
In this tutorial, we set out to analyze a large set of NBA team data and build a model to predict the win percentage of an NBA team. I first scraped the data using my own scraper, then cleaned it to remove irrelevant columns and prepare it for analysis. We then performed exploratory data analysis, looking for relationships between the features. Through a correlation heatmap, we found that some features were very highly correlated with win percentage while others were not. We then examined the distribution of each individual feature through histograms, which gave us important clues about what our model would eventually include as predictors. We also looked at how the team itself could impact winning, and found a modest effect there as well. Lastly, we analyzed how some important features changed over time, to see whether the era had a noticeable effect on the statistics teams were putting up. Finally, we created, refined, and evaluated two machine learning models: multiple linear regression and random forest regression. We found that the linear regression model was clearly the more accurate of the two, with an R-squared of about .93.
For future analyses, we could bring back the more advanced statistics we excluded, such as net rating and SRS, to see whether these more refined measures could predict a team's win percentage even more accurately. We could also incorporate the trends over time that we analyzed during the visualization step, so that the model accounts for how team statistics are changing from era to era. Doing so would take the model to a whole new level: not only would we be using our current predictors, but we would also account for the change in team statistics over time.
I hope you enjoyed reading this, and learned a thing or two about NBA analytics!
-Suhas