*This article is about the Restricted Boltzmann Machine as a recommender model.*

*RBM* stands for Restricted Boltzmann Machine.

## Notation

Below:

- u denotes a user
- i denotes an item
- j also denotes an item (when two items are involved, they are denoted by i and j)
- r denotes a rating (or anything else a recommender tries to forecast, e.g. the probability of selecting an item)
- R denotes the maximal rating
- f denotes a feature number
- F denotes the total number of features
- U denotes the number of users
- I denotes the number of items

## Model

Each user has a vector v of ratings $ v = (v_{11}, ..., v_{1R}, v_{21}, ..., v_{2R}, ..., v_{I1}, ..., v_{IR}) $. Here $ v_{ir} $ is 1 if item i received the rating r, and 0 otherwise. For instance, if item 7 received the rating 3 in the 1...5 rating system, then $ v_{73}=1 $, while $ v_{71}=v_{72}=v_{74}=v_{75}=0 $.
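As a small sketch of this one-hot encoding (the item count and rating scale here are hypothetical):

```python
import numpy as np

I, R = 10, 5  # hypothetical: 10 items, ratings 1..5

def one_hot_ratings(ratings):
    """Encode a {item: rating} dict as the binary matrix v, with v[i, r-1] = 1."""
    v = np.zeros((I, R))
    for item, rating in ratings.items():
        v[item, rating - 1] = 1.0
    return v

v = one_hot_ratings({7: 3})
# row 7 is [0, 0, 1, 0, 0]; rows of unrated items stay all-zero
```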

Each user also has a *hidden vector* $ h\in \mathbb{R}^F $. The hidden vector determines the probability distribution for the vector of ratings. However, as we will see, the RBM model does not try to recover the hidden vectors explicitly, which is why they are called *hidden*.

The joint distribution of v and h is:

- $ P(v, h) = \exp(-E(v, h)) / \sum_{v', h'} \exp(-E(v', h')) $ $ (1.1a) $

where $ E(v,h) $ is called energy and defined as

- $ E(v,h)= - \sum_{ir}b_{ir}v_{ir} - \sum_{f} b_f h_f - \sum_{ifr}W_{ifr} v_{ir} h_f + \sum_i \log Z_i $ $ (1.1b) $
- $ Z_i = \sum_r \exp\left(b_{ir}+\sum_f h_f W_{ifr}\right) $ $ (1.1c) $

Here b are called *biases* and W are called *weights*. It is biases and weights, not hidden vectors, which the learning process should restore.

Missing (unrated) items are excluded from all sums.

The term $ \sum_i \log Z_i $ is the normalization term that ensures $ \sum_r p(v_{ir}=1|h)=1 $ for each i.
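The energy (1.1b)-(1.1c) can be written out directly; below is a minimal NumPy sketch, where the sizes and the random initialization are hypothetical and, for simplicity, every item is treated as rated:

```python
import numpy as np

rng = np.random.default_rng(0)
I, F, R = 10, 4, 5                        # hypothetical: items, features, ratings
W = 0.01 * rng.standard_normal((I, F, R))  # weights W_ifr
b_v = np.zeros((I, R))                     # biases b_ir
b_h = np.zeros(F)                          # biases b_f

def energy(v, h):
    """E(v, h) from (1.1b)-(1.1c); v has shape (I, R), h has shape (F,)."""
    # log Z_i = log sum_r exp(b_ir + sum_f h_f W_ifr)
    log_Z = np.log(np.sum(np.exp(b_v + np.einsum('f,ifr->ir', h, W)), axis=1))
    return (-np.sum(b_v * v)
            - np.sum(b_h * h)
            - np.einsum('ir,f,ifr->', v, h, W)
            + np.sum(log_Z))

v = np.zeros((I, R)); v[7, 2] = 1          # item 7 rated 3
h = rng.integers(0, 2, F).astype(float)    # some binary hidden vector
E = energy(v, h)
```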

From the joint distribution of v and h, we can obtain conditional distributions and the distribution of v:

- $ P(v_{ir}=1|h) = {\exp \left( b_{ir}+\sum_f h_f W_{ifr} \right) \over \sum_{r'} \exp \left( b_{ir'}+\sum_f h_f W_{ifr'} \right)} $ $ (1.2) $
- $ P(h_f=1|v) = \sigma \left( b_f + \sum_i \sum_r v_{ir} W_{ifr} \right) $ $ (1.3) $,
- where $ \sigma(t)=1/\left(1+e^{-t}\right) $

- $ P(v) = \sum_h {\exp \left( -E(v, h) \right) \over \sum_{v', h'} \exp \left( -E(v', h') \right)} $ $ (1.4) $
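The conditionals (1.2) and (1.3) are the workhorses of both learning and prediction. A minimal NumPy sketch (sizes and initialization are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
I, F, R = 10, 4, 5                        # hypothetical sizes
W = 0.01 * rng.standard_normal((I, F, R))
b_v = np.zeros((I, R))
b_h = np.zeros(F)

def p_v_given_h(h):
    """(1.2): a softmax over the R ratings of every item."""
    a = b_v + np.einsum('f,ifr->ir', h, W)   # shape (I, R)
    a -= a.max(axis=1, keepdims=True)        # numerical stability
    e = np.exp(a)
    return e / e.sum(axis=1, keepdims=True)

def p_h_given_v(v):
    """(1.3): independent sigmoid hidden units."""
    return 1.0 / (1.0 + np.exp(-(b_h + np.einsum('ir,ifr->f', v, W))))

pv = p_v_given_h(np.ones(F))
v = np.zeros((I, R)); v[7, 2] = 1
ph = p_h_given_v(v)
```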

## Learning

Given the expression (1.4) for $ P(v) $, learning should maximize the likelihood with respect to $ W $ and $ b $, which is performed by gradient ascent:

- $ \Delta W_{ifr} = \epsilon {\partial \log P(v) \over \partial W_{ifr}} $

Denoting the gradient with respect to $ W_{ifr} $ as $ \nabla_{ifr} $, we have

- $ \Delta W_{ifr} = \epsilon \nabla_{ifr} \log P(v) $
- $ \nabla_{ifr} \log P(v) = \nabla_{ifr} \log \sum_h \exp \left( -E(v, h) \right) - \nabla_{ifr} \log \sum_{v', h'} \exp \left( -E(v', h') \right) $

The first term is

- $ \nabla_{ifr} \log \sum_h \exp \left( -E(v, h) \right) = $
- $ = {\nabla_{ifr} \sum_h \exp \left( -E(v, h) \right) \over \sum_h \exp \left( -E(v, h) \right)} = $
- $ = - {\sum_h \exp \left( -E(v, h) \right) \nabla_{ifr} E(v, h) \over \sum_h \exp \left( -E(v, h) \right)} = $
- $ = {\sum_h \exp \left( -E(v, h) \right) v_{ir} h_f \over \sum_h \exp \left( -E(v, h) \right)} = $
- $ = \left\langle v_{ir} h_f \right\rangle_{data} $

The notation $ \left\langle ... \right\rangle_{data} $ means that the expression in the brackets is calculated for the given v and averaged over the conditional distribution $ P(h|v) $.

The second term is

- $ \nabla_{ifr} \log \sum_{v', h'} \exp \left( -E(v', h') \right) = $
- $ = {\nabla_{ifr} \sum_{v', h'} \exp \left( -E(v', h') \right) \over \sum_{v', h'} \exp \left( -E(v', h') \right)} = $
- $ = - {\sum_{v', h'} \exp \left( -E(v', h') \right) \nabla_{ifr} E(v', h') \over \sum_{v', h'} \exp \left( -E(v', h') \right)} = $
- $ = {\sum_{v', h'} \exp \left( -E(v', h') \right) v^\prime_{ir} h^\prime_f \over \sum_{v', h'} \exp \left( -E(v', h') \right)} = $
- $ = \left\langle v_{ir} h_f \right\rangle_{model} $

The notation $ \left\langle ... \right\rangle_{model} $ means that the expression in the brackets is averaged over the joint distribution $ P(v, h) $.

- $ \Delta W_{ifr} = \epsilon \left( \left\langle v_{ir} h_f \right\rangle_{data} - \left\langle v_{ir} h_f \right\rangle_{model} \right) $

The first term can be computed easily: given v, the distribution of each element of h follows from (1.3). The second term, however, requires exponential time to compute exactly.

However, the gradient can be approximated:

- $ \Delta W_{ifr} = \epsilon \left( \left\langle v_{ir} h_f \right\rangle_{data} - \left\langle v_{ir} h_f \right\rangle_T \right) $

This approximation is called *Contrastive Divergence* (CD) and was proposed by Hinton in 2002. The notation $ \left\langle ... \right\rangle_{T} $ means averaging over the values obtained by running the Gibbs sampler for T full steps. Below is an excerpt from the wikipedia.org article, modified:

- Take a training sample $ v $, compute the probabilities of the hidden units with (1.3), and sample a hidden activation vector $ h $ from this probability distribution.
- Compute the outer product of $ v $ and $ h $ and call this the *positive gradient*.
- Set $ h'=h $.
- Repeat the following step T times:
  - From $ h' $, sample a reconstruction $ v' $ of the visible units using (1.2), then resample the hidden activations $ h' $ from this using (1.3).
- Compute the outer product of $ v' $ and $ h' $ and call this the *negative gradient*.
- Let the updates to $ W_{ifr} $, $ b_{ir} $ and $ b_f $ be the positive gradient minus the negative gradient, times the learning rate $ \epsilon $.

In practice, T=3...10 is usually sufficient. It is also possible to start learning with a small T and then increase it.
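The steps above can be sketched as a single-user CD-T update in NumPy. This is a minimal illustration with hypothetical sizes, initialization, and learning rate; the positive and negative gradients use the hidden probabilities rather than the sampled states, a common practical choice:

```python
import numpy as np

rng = np.random.default_rng(0)
I, F, R = 10, 4, 5                        # hypothetical sizes
W = 0.01 * rng.standard_normal((I, F, R))
b_v = np.zeros((I, R))
b_h = np.zeros(F)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def cd_step(v, T=3, eps=0.01):
    """One CD-T update from a single user's rating matrix v, shape (I, R)."""
    global W, b_v, b_h
    rated = np.flatnonzero(v.sum(axis=1))          # indices of rated items
    ph = sigmoid(b_h + np.einsum('ir,ifr->f', v, W))   # (1.3)
    h = (rng.random(F) < ph).astype(float)         # sample hidden vector
    pos = np.einsum('ir,f->ifr', v, ph)            # positive gradient
    for _ in range(T):                             # T full Gibbs steps
        a = b_v + np.einsum('f,ifr->ir', h, W)     # (1.2), up to softmax
        a -= a.max(axis=1, keepdims=True)
        p = np.exp(a); p /= p.sum(axis=1, keepdims=True)
        v1 = np.zeros_like(v)                      # reconstruction v'
        for i in rated:                            # only rated items take part
            v1[i, rng.choice(R, p=p[i])] = 1
        ph1 = sigmoid(b_h + np.einsum('ir,ifr->f', v1, W))
        h = (rng.random(F) < ph1).astype(float)    # resample h'
    neg = np.einsum('ir,f->ifr', v1, ph1)          # negative gradient
    W += eps * (pos - neg)
    b_v += eps * (v - v1)
    b_h += eps * (ph - ph1)

v = np.zeros((I, R)); v[7, 2] = 1; v[3, 0] = 1     # two rated items
cd_step(v)
```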

## Predictions

After we have all the $ W $ and $ b $, we are able to make predictions. Given a user $ u $, a set of items with known ratings $ I(u) $, and a set of known ratings $ v_{ir} $, $ i \in I(u) $, we are to predict the ratings $ v_{jr} $ for some $ j \notin I(u) $:

- $ p(v_{jr}=1) = {e^{b_{jr}} \prod_f \left(p(h_f=1|v)e^{W_{jfr}}+p(h_f=0|v)\right) \over \sum_{r'} e^{b_{jr'}} \prod_f \left( p(h_f=1|v)e^{W_{jfr'}}+p(h_f=0|v) \right) } $,

where $ p(h_f=1|v) $ is given by (1.3) which we reproduce here:

- $ P(h_f=1|v) = \sigma \left( b_f + \sum_{i \in I(u)} \sum_r v_{ir} W_{ifr} \right) $
- $ \sigma(t)=1/\left(1+e^{-t}\right) $
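This prediction rule marginalizes the hidden units factor by factor. A minimal NumPy sketch, with hypothetical sizes and untrained random weights used only to exercise the formula:

```python
import numpy as np

rng = np.random.default_rng(0)
I, F, R = 10, 4, 5                        # hypothetical sizes
W = 0.01 * rng.standard_normal((I, F, R))
b_v = np.zeros((I, R))
b_h = np.zeros(F)

def predict(v, j):
    """Return the distribution over ratings r of an unrated item j."""
    # p(h_f = 1 | v) from (1.3); unrated items contribute zeros to the sum
    ph = 1.0 / (1.0 + np.exp(-(b_h + np.einsum('ir,ifr->f', v, W))))
    # score_r = e^{b_jr} * prod_f (p(h_f=1|v) e^{W_jfr} + p(h_f=0|v))
    scores = np.exp(b_v[j]) * np.prod(
        ph[:, None] * np.exp(W[j]) + (1 - ph)[:, None], axis=0)
    return scores / scores.sum()

v = np.zeros((I, R)); v[7, 2] = 1          # the user rated item 7 with 3
p = predict(v, 2)                          # rating distribution for item 2
```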

## Modifications

### Gaussian Hidden Units

For Gaussian Hidden Units, the formula (1.1b)

- $ E(v,h)= - \sum_{ir}b_{ir}v_{ir} - \sum_{f} b_f h_f - \sum_{ifr}W_{ifr} v_{ir} h_f + \sum_i \log Z_i $

is replaced by

- $ E(v,h)= - \sum_{ir}b_{ir}v_{ir} + \sum_{f} {(h_f-b_f)^2 \over 2\sigma_f^2} - \sum_{ifr}W_{ifr} v_{ir} {h_f \over \sigma_f} + \sum_i \log Z_i $ $ (1.1b') $

This modifies (1.3) from

- $ P(h_f=1|v) = \sigma \left( b_f + \sum_i \sum_r v_{ir} W_{ifr} \right) $

to

- $ P(h_f=h|v) = {1 \over \sqrt{2\pi} \sigma_f} \exp \left( - { \left( h - b_f - \sigma_f \sum_{ir} v_{ir} W_{ifr} \right)^2 \over 2 \sigma_f^2} \right) $

In order to avoid updating the variances, they can be fixed as $ \sigma_f=1 $.
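Sampling a Gaussian hidden unit is then a single draw from a normal distribution. A minimal sketch with hypothetical sizes and the variances fixed at 1:

```python
import numpy as np

rng = np.random.default_rng(0)
I, F, R = 10, 4, 5                        # hypothetical sizes
W = 0.01 * rng.standard_normal((I, F, R))
b_h = np.zeros(F)
sigma = np.ones(F)                        # fixed variances, sigma_f = 1

def sample_gaussian_hidden(v):
    """Draw h_f ~ N(b_f + sigma_f * sum_ir v_ir W_ifr, sigma_f^2)."""
    mean = b_h + sigma * np.einsum('ir,ifr->f', v, W)
    return rng.normal(mean, sigma)

v = np.zeros((I, R)); v[7, 2] = 1
h = sample_gaussian_hidden(v)
```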

### Conditional Factored RBM

This approach factorizes the matrix of weights:

- $ W_{ifr} = \sum_{k=1}^C A_{ikr} B_{kfr} $
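The factorization replaces the $ I \cdot F \cdot R $ free parameters of $ W $ with $ (I + F) \cdot C \cdot R $ parameters, which is much smaller when $ C \ll \min(I, F) $. A minimal sketch with hypothetical sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
I, F, R, C = 10, 4, 5, 2    # hypothetical sizes; C is the number of factors
A = 0.01 * rng.standard_normal((I, C, R))
B = 0.01 * rng.standard_normal((C, F, R))

# W_ifr = sum_k A_ikr B_kfr, materialized for use in the formulas above
W = np.einsum('ikr,kfr->ifr', A, B)
```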

## Parameters

For Netflix prize, the following parameters were selected:

- F=100 for RBM
- F=500, C=30 for conditional factored RBM
- The weights were trained with a learning rate of 0.01/(batch size), momentum of 0.9, and weight decay of 0.01
- T started as 1, then increased (to 9 for RBM, to 3 for conditional factored RBM).
- After about 40-50 iterations (epochs), the model converged.