"Additionally, we decided to use an “ensemble” of these algorithms to fit our SPM model (here’s wikipedia’s explanation). Given that we wanted to use our model for both long-term and in-season (read: small sample) analysis and evaluation, we found using a collection of various algorithms blended together allowed us to better fit the long-term RAPM outputs for the ultimate use in our WAR model. After much testing and tuning, we found that using three algorithms for each component was the best approach."
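For readers unfamiliar with the term, the "ensemble" they describe amounts to averaging (possibly with weights) the predictions of several separately trained models. A minimal sketch, with entirely made-up numbers and a generic `blend` helper standing in for whatever combination scheme they actually use:

```python
# Hypothetical sketch of blending several regressors into one "ensemble"
# prediction. The three "models" below are just hard-coded toy outputs;
# the real system trains separate algorithms on play-by-play features.

def blend(predictions, weights=None):
    """Combine per-model prediction lists into one list via a
    (weighted) average; equal weights by default."""
    n_models = len(predictions)
    if weights is None:
        weights = [1.0 / n_models] * n_models
    n_obs = len(predictions[0])
    return [
        sum(w * preds[i] for w, preds in zip(weights, predictions))
        for i in range(n_obs)
    ]

# Toy predictions from three models for the same five players:
model_a = [0.10, 0.20, 0.30, 0.40, 0.50]
model_b = [0.12, 0.18, 0.33, 0.41, 0.47]
model_c = [0.08, 0.22, 0.27, 0.42, 0.52]

ensemble = blend([model_a, model_b, model_c])
print(round(ensemble[0], 3))  # 0.1
```

Note that averaging smooths out each model's idiosyncrasies, which is exactly why the blended output can track the long-run RAPM targets more closely than any single model; it is also why, as argued below, the sampling properties of the result are opaque.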
Now any statistician would tell you that the statistical properties of such an amalgamation of models are unknown.
This is data mining, pure and simple:
"Each “tuning” iteration consisted of 300 cross-validation runs where a model was trained on 80% of the data and tested on the 20% of data that was held out. The results of these 300 runs were then averaged. This process was repeated, each time adjusting the features that were included in each algorithm until the best set of features was achieved based on the aggregated root-mean-square error for a given tuning iteration. This was done for all 5 of the algorithms for each of the individual components (EV Offense, SH Defense etc). In total, we allowed five algorithms to be used for the 8 component models, which means 40 total algorithms were trained and tuned using the above process. Of those 40 algorithms, we selected the three best for each component, which means 24 algorithms were used in total to create the eight SPM component models (remember 4 components, split by position)."
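To make the scale of the search concrete, here is a rough sketch of the inner scoring loop they describe: score one candidate feature set by averaging held-out RMSE over repeated random 80/20 splits. Everything here is illustrative (the toy "model" just predicts the training-set mean); the point is that this score is what gets minimized over features, algorithms, and components, which is precisely the data-mining loop at issue.

```python
import math
import random

def rmse(actual, predicted):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def cv_score(data, fit, n_runs=300, test_frac=0.2, seed=0):
    """Average held-out RMSE over n_runs random train/test splits
    (80% train / 20% test by default), as in the quoted procedure."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_runs):
        shuffled = data[:]
        rng.shuffle(shuffled)
        n_test = int(len(shuffled) * test_frac)
        test, train = shuffled[:n_test], shuffled[n_test:]
        model = fit(train)                    # train on the 80%
        preds = [model(x) for x, _ in test]   # predict the held-out 20%
        scores.append(rmse([y for _, y in test], preds))
    return sum(scores) / len(scores)

# Toy "model": always predict the training-set mean, ignoring features.
def fit_mean(train):
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y

data = [(x, 0.5 * x) for x in range(100)]  # fake (feature, target) pairs
score = cv_score(data, fit_mean)
print(round(score, 3))
```

Repeat this score for every candidate feature set, times five algorithms, times eight components, and you have the 40 tuned models from which the best 24 were hand-picked.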
"Finally, what we might actually want to look at in this case is something akin to the t-values in a linear regression output"
Which is very generous. "Akin." What they're measuring is how well their ad hoc model fits their data.
The team strength adjustment is also ad hoc: "We then determine which multiplication factor results in the player summed values that most closely mirror the 11-year RAPM that the models were trained on."
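In other words, pick a scalar by brute force. A hedged sketch of what that calibration amounts to, with an assumed grid of candidate factors and an assumed sum-of-squares distance (the post does not say which metric they use):

```python
# Illustrative sketch of a "multiplication factor" calibration: scan
# candidate scale factors and keep the one whose scaled player values
# are closest (lowest sum of squared differences here, an assumption)
# to the long-run RAPM targets. All numbers are made up.

def best_multiplier(player_values, rapm_targets, grid):
    def distance(k):
        return sum((k * v - t) ** 2 for v, t in zip(player_values, rapm_targets))
    return min(grid, key=distance)

values = [1.0, 2.0, 3.0, 4.0]
targets = [1.5, 3.0, 4.5, 6.0]                    # exactly 1.5x the values
grid = [round(0.1 * i, 1) for i in range(5, 31)]  # 0.5, 0.6, ..., 3.0

k = best_multiplier(values, targets, grid)
print(k)  # 1.5
```

The chosen factor inherits no standard errors or distributional theory; it is just the grid point that best reproduces the training target, which is the same in-sample-fit logic as the rest of the pipeline.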
I could go on and on, but the point is that they are "curve fitting": applying a set of models and picking the one that gives the best results. That isn't a statistically valid approach to inference. But since they're just creating a descriptive model, one that basically assumes results aren't stochastic, it doesn't matter much: they're fitting an ad hoc model to past results, not building an explanatory model to predict outcomes.
The point is that their results are suggestive, but nothing more.
Nothing wrong with that, but using EV numbers as if they're "gospel" is foolish.