Blending is a combination of two or more models in order to create a model which is better than any of them. Since different models have different weak and strong sides, blending may significantly improve performance.
Blending of many (10-1000) models is necessary to achive the best possible performance in contests. In real world problems, blending is often very useful, but the number of models is not that big. For instance, for the Netflix prize contest, even the first year winning submission was a blend of 107 different models. For the real world, Netflix selected two models: Matrix Factorization and Restricted Boltzmann Machines. Their RMSE were 0.89 and 0.90 correspondently, the blend's RMSE was 0.88 . Note that RMSE of the best model was 0.8567. This was achieved by blending probably hundreds of models [1].
Definitions Edit
Let
- $ y_i $ is the dependent variable for the observation $ i $
- $ \hat y_{mi} $ is the prediction for $ y_i $ by the model m
- $ \hat y_i $ is the final prediction for $ y_i $
Linear regression Edit
If there are M models, and the prediction of the model m for observation i is $ \hat p_{mi} $, then the total prediction is
- $ \hat y_i = \sum_i \beta_m \hat y_{mi} $
The vector $ \beta $ are obtained via linear regression.
It is important that the observation used for calculating $ \beta $ and the observations used for training the models themselves will be non-intersecting. This is because most models perform better on the training set than on the test set, therefore if we use the training set (or part of it) for estimating $ \beta $, the coefficients will be wrong.
Even estimating the metaparameters might overfit the model, therefore it may be useful to divide the set of observations into 3 parts:
- training set, for training the models
- validation set 1, for estimating metaparameters of the model (i. e. learning ratio)
- validation set 2, for estimating $ \beta $.
Lasso or ridge regression Edit
When there are many models, it may be useful to use lasso, ridge regression, stepwise elimination, or other relaxation methods for $ \beta $. See the linear regression article for more details. The metaparameters of relaxation methods are obtained via cross-validation of the validation set 2.
Generalized Additive Model Edit
For generalized additive model (GAM), target function is a sum of one-dimensional functions:
- $ \hat y_i = \alpha + \sum_m f_m\left(\hat y_{mi}\right)\! $
Algorithm: estimate $ f_m $ one by one in a loop. Each time estimate the residual $ \mathbf{y}-\sum_{l\ne m} \hat f_l(\mathbf{\hat y_l})\! $ as a smooth spline and subtract its average (so that the average of $ f_m $ will be zero).
Alternatively, we can calibrate each model separately:
- $ \hat y^{(final)}_{mi} = f_m\left(\hat y^{(raw)}_{mi}\right) $
As in the previous section, the validation set 2 should be used for calibrations and parameter estimations.
Neural networks Edit
Sometimes, neural network is used to estimate the final prediction
- $ \hat y_{mi} = \mathrm{NeuralNet}\left(\hat y_{1i} , ... , \hat y_{Mi}\right) $
Dependence on parameters Edit
Different models performs better on different parts of the input space. The linear model can be upgraded as:
- $ \hat y_i = \sum_i \beta_m(\mathbf{x_i}) \hat y_{mi} $
where $ \mathbf{x_i} $ are explanatory variables.
For recommender engines at the Netflix contents, the following was useful:
- $ \hat r_{ui} = \sum_m \left( \beta_m + \beta^\prime_m \log (1+S_u) + \beta^{\prime\prime}_m \log (1+S_i) \right) \hat r^{(m)}_{ui} $ ,
where
- $ \hat r_{ui} $ is the final prediction of the rating for user u, item i.
- $ \hat r^{(m)}_{ui} $ is the model m's prediction of the rating for user u, item i.
- $ S_u $ is the support for the user u, that is, number of items rated by him.
- $ S_i $ is the support for the item, that is, number of users who rated it.
Liang Sun's models Edit
Oracle blending Edit
The oracle blending model is for use in competitions. In order to describe it, recall that the ridge regression is given by the formula:
- $ \beta = \left(\hat Y^T \hat Y + \lambda I \right)^{-1} \hat Y y $ ,
where
- $ \beta $ are oracle weights
- $ \hat Y_{mi} $ is the prediction of the model m for the observation i
- $ y_i $ is the value of the dependent variable for the observation i
- $ \lambda $ is the ridge regression parameter
If we can obtain RMSE of any vector of predictions on the test set, can we use it to estimate $ \beta $?
On the test set, we know the covariance matrix $ Y^T Y $. As for $ \hat Y y $ , we have:
- $ \left( \hat Y y \right)_m = \sum_i \hat y_{mi} y_i = \frac{1}{2} \left( \lVert \hat y_m - y \rVert^2 - \lVert \hat y_m \rVert^2 - \lVert y \rVert^2 \right) $ ,
where $ \hat y_m = \left( \hat y_{m1} , ... , \hat y_{mn} \right) $, n is the number of observations
The first term can be found as:
- $ \lVert \hat y_m - y \rVert^2 = n \cdot \operatorname{RMSE}(\hat y_m)^2 $
Similarly,
- $ \lVert y \rVert^2 = n \cdot \operatorname{RMSE}(0)^2 $ ,
where $ \operatorname{RMSE}(0) $ means RMSE of the submission containing zeros in all positions.
Finally, $ \lVert \hat y_m \rVert^2 $ is known.
Therefore,
- $ \left( \hat Y y \right)_m = \frac{1}{2} \left( n \cdot \operatorname{RMSE}(\hat y_m)^2 - n \cdot \operatorname{RMSE}(0)^2 - \lVert y \rVert^2 \right) $
Therefore, we can find $ \beta $ on the test set.
Bagged oracle blending Edit
For bagged oracle blending, we take $ N $ random subsets of the initial set of oracles, each subset contains $ k_0 $ oracles, $ N $ and $ k_0 $ are model parameters. For each subset, we determine weights via oracle blending. Finally, we take average weights of all $ N $ subsets.
This procedure reduces overfitting.
Constrained oracle blending Edit
Constrained oracle blending imposes constraints on both weights and errors on the validation set:
- $ \min_\beta f(\beta) = \lVert y - \hat Y \beta \rVert^2 $
subject to:
- $ -w_0 \le \beta_m \le w_0 $ for $ m=1,...,M $
- $ \lVert y^{(v)} - \hat Y^{(v)} \beta \rVert^2 \le t $
where
- $ y^{(v)} $ is the vector of dependent variables on the validation set, and $ \hat Y^{(v)} $ is the matrix of model predictions, also on the validation set.
- $ w_0 $ and $ t $ are parameters of the model
These constraints do not let any (potentially overfit) model become too big. Finding the weights is a quadratic optimization problem.