
## Tool

We have a tool for training SVD models (factorizing sparse matrices) that takes very general input in the form of CSV files. The documentation needs some work, but you can find it here: jssvd.

## Mathematical Description

### Prediction Formula

$\hat{r}_{ui} = \mu +b_u + b_i + p_u^{T}q_i$

where

- $\mu$ is the overall average rating,
- $b_u$ is the bias of user $u$,
- $b_i$ is the bias of movie $i$,
- $p_u$ is the feature vector of the $u^{th}$ user,
- $q_i$ is the feature vector of the $i^{th}$ movie.

Sometimes biases are not used. In this case, the prediction formula is:

$\hat{r}_{ui} = p_u^{T}q_i$

Note that the two models are essentially the same, since we can increase the dimensionality of all $p_u$ and $q_i$ by 2 and write:

$p_{u,n+1} = b_u + \mu$
$q_{i,n+1} = 1$
$p_{u,n+2} = 1$
$q_{i,n+2} = b_i$
However, their training may produce different results.
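As a concrete illustration, both prediction formulas can be sketched in Python with NumPy (the function and variable names are illustrative, not part of jssvd):

```python
import numpy as np

def predict(mu, b_u, b_i, p_u, q_i):
    # r_hat_ui = mu + b_u + b_i + p_u^T q_i
    return mu + b_u + b_i + float(np.dot(p_u, q_i))

def predict_no_bias(p_u, q_i):
    # Bias-free variant: r_hat_ui = p_u^T q_i
    return float(np.dot(p_u, q_i))
```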

### Training Algorithm

Objective:

$\min_{q_{*}, p_{*}, b_{*}}{\sum_{(u,i) \in K}} ( r_{ui} - \mu - b_u -b_i - p_u^{T}q_i ) ^ 2 + \lambda ( b_u^2 +b_i^2 + ||q_i||^2 + ||p_u ||^2 )$
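Written out in code, the objective is a regularized squared error over the observed pairs. A minimal sketch in Python (the `ratings` triple format and array shapes are assumptions for illustration, not part of jssvd):

```python
import numpy as np

def objective(ratings, mu, b_u, b_i, P, Q, lam):
    """Regularized squared error over the observed pairs K.

    ratings: iterable of (u, i, r_ui) triples; b_u, b_i: bias vectors;
    P, Q: factor matrices with one row per user / item.
    """
    total = 0.0
    for u, i, r_ui in ratings:
        err = r_ui - mu - b_u[u] - b_i[i] - P[u] @ Q[i]
        total += err ** 2 + lam * (b_u[u] ** 2 + b_i[i] ** 2
                                   + np.dot(Q[i], Q[i]) + np.dot(P[u], P[u]))
    return total
```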

### Weight update formula

For each pair $(u, i)$ in the training set do:

Calculate the residual: $d_{ui} = r_{ui} - p_u^{T} q_i$ (shown here for the bias-free model)
$\Delta q_i = \eta \left(d_{ui} p_u - \lambda q_i\right)$
$\Delta p_u = \eta \left(d_{ui} q_i - \lambda p_u\right)$
$q_i \leftarrow q_i + \Delta q_i$
$p_u \leftarrow p_u + \Delta p_u$

We will refer to these as standard formulas.
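One pass of the standard formulas can be sketched as follows (a minimal NumPy implementation of the bias-free updates; names and hyperparameter values are illustrative):

```python
import numpy as np

def sgd_epoch(ratings, P, Q, eta=0.01, lam=0.05):
    """One pass of the standard update formulas (bias-free model).

    ratings: list of (u, i, r_ui) triples; P, Q: factor matrices
    with one row per user / item.
    """
    for u, i, r_ui in ratings:
        d_ui = r_ui - P[u] @ Q[i]              # residual
        # Compute both deltas from the OLD values before applying either,
        # matching the formulas above.
        dq = eta * (d_ui * P[u] - lam * Q[i])
        dp = eta * (d_ui * Q[i] - lam * P[u])
        Q[i] += dq
        P[u] += dp
    return P, Q
```

Note that both deltas are computed before either factor is updated; applying $\Delta q_i$ first and then reading the new $q_i$ when computing $\Delta p_u$ would deviate from the formulas above.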

### Variants

#### Regression to average

$\Delta q_i = \eta \left(d_{ui} p_u -\lambda (q_i-\bar q)\right)$
$\Delta p_u = \eta \left(d_{ui} q_i -\lambda (p_u-\bar p)\right)$

where

$\bar q = {\sum_{j=1}^{N_\mathrm{items}} q_j \over N_\mathrm{items}}$
$\bar p = {\sum_{j=1}^{N_\mathrm{users}} p_j \over N_\mathrm{users}}$

The reasoning behind this is that the prior distribution for $p_{jk}$ is taken to be $N\left({\bar p}_k, \sigma^2\right)$ rather than $N\left(0, \sigma^2\right)$.
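A sketch of a single regression-to-average update in Python (the mean vectors are recomputed on each call here for clarity; a real implementation would cache them):

```python
import numpy as np

def update_regress_to_mean(u, i, r_ui, P, Q, eta=0.01, lam=0.05):
    """One regression-to-average update for the pair (u, i).

    The relaxation term pulls each factor toward the current column
    means of P and Q rather than toward zero.
    """
    p_bar = P.mean(axis=0)                     # \bar p over all users
    q_bar = Q.mean(axis=0)                     # \bar q over all items
    d_ui = r_ui - P[u] @ Q[i]
    dq = eta * (d_ui * P[u] - lam * (Q[i] - q_bar))
    dp = eta * (d_ui * Q[i] - lam * (P[u] - p_bar))
    Q[i] += dq
    P[u] += dp
```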

#### Support-dependent relaxation rate

Note that the standard formulas do NOT minimize the function above. Indeed, if $\eta$ is very small, then using the original formulas the total change of $p_u$ over one pass through the training set is:

$\left(\Delta p_u\right)^\mathrm{total} = \eta\left(\sum_{i\in I(u)} d_{ui} q_i - \lambda p_u S_u \right)$

where $I(u)$ is the set of items that appear together with $u$ in the training set and $S_u = |I(u)|$ is the support of user $u$.

Note the $S_u$ factor in the second term. It would not be there if we were moving exactly in the direction of the gradient.

This suggests decreasing the relaxation rate as the support grows:

$\Delta q_i = \eta \left(d_{ui} p_u -\lambda_i q_i\right)$
$\Delta p_u = \eta \left(d_{ui} q_i -\lambda_u p_u\right)$

where $\lambda_i = f(S_i)$ and $\lambda_u = g(S_u)$, with $f$ and $g$ non-increasing functions.

For Netflix, the following formula was useful:

$\lambda_u(S_u) = \lambda^{(1)} + {\lambda^{(2)}\over \max(S_u, \bar S)}$,

where

- $S_u$ is the support of user $u$,
- $\bar S$ is the average support over all users,

and similarly for items.
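The support-dependent rate is a one-liner; a sketch in Python (the default values of $\lambda^{(1)}$ and $\lambda^{(2)}$ here are placeholders, not the values actually used for Netflix):

```python
def lambda_support(S, S_bar, lam1=0.015, lam2=0.25):
    """Support-dependent relaxation rate:
    lambda(S) = lam1 + lam2 / max(S, S_bar).

    Constant for supports below the average S_bar, then
    non-increasing in S, as required of f and g above.
    """
    return lam1 + lam2 / max(S, S_bar)
```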

#### Support-dependent relaxation rate with regression to average

This combines both approaches above:

$\Delta q_i = \eta \left(d_{ui} p_u -\lambda_i (q_i-\bar q)\right)$
$\Delta p_u = \eta \left(d_{ui} q_i -\lambda_u (p_u-\bar p)\right)$
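The combined update can be sketched by passing the per-user and per-item rates into the regression-to-average step (illustrative names; a real implementation would cache the mean vectors):

```python
import numpy as np

def update_combined(u, i, r_ui, P, Q, lam_u, lam_i, eta=0.01):
    """One combined update: support-dependent relaxation rates
    plus regression toward the current mean factor vectors."""
    p_bar = P.mean(axis=0)
    q_bar = Q.mean(axis=0)
    d_ui = r_ui - P[u] @ Q[i]
    # Both deltas are computed from the old values before applying.
    dq = eta * (d_ui * P[u] - lam_i * (Q[i] - q_bar))
    dp = eta * (d_ui * Q[i] - lam_u * (P[u] - p_bar))
    Q[i] += dq
    P[u] += dp
```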

#### Different learning rates and relaxation rates

The relaxation rates for users and items may differ, and so may the learning rates.

When biases ($\mu$, $b_i$, $b_u$) are used, they may have their own learning and relaxation rates.
