**Linear ordering isotonic regression** can be understood as approximating a given series of 1-dimensional observations with a non-decreasing function. It is similar to inexact smoothing splines, with the difference that monotonicity, rather than smoothness, is used to remove noise from the data.

**General isotonic regression** approximates a given series of values with values satisfying a given partial ordering.

## Problem

Given values $ a_1 $, ..., $ a_n $ and their positive weights $ w_1 $, ..., $ w_n $, approximate them by $ y_1 $, ..., $ y_n $ as closely as possible, subject to a set of constraints of the form $ y_i \ge y_j $:

- Given
  - values $ \mathbf{a} \in \Bbb{R}^n $,
  - weights $ \mathbf{w} \in \Bbb{R}^n $ such that $ w_i>0 $ for all $ i $,
  - a set of constraints $ E \subset \{1,...,n\}^2 $,
- minimize $ \sum_i w_i \left( y_i-a_i \right)^2 $ with respect to $ \mathbf{y} \in \Bbb{R}^n $ subject to $ y_i \ge y_j $ for all $ (i,j)\in E $

If all weights equal 1, the problem is called *unweighted isotonic regression*; otherwise it is called *weighted isotonic regression*.

The graph $ G=(V,E) $ with $ V=\{1,2,...,n\} $ is assumed to be acyclic. For example, the constraint set cannot contain both $ y_2 \ge y_5 $ and $ y_5 \ge y_2 $.

### Example

Minimize $ 3(y_1-2)^2+(y_2-1)^2+(y_3-4)^2+2y_4^2 $ subject to

- $ y_1\le y_2 $
- $ y_2\le y_3 $
- $ y_2\le y_4 $
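For this small instance the optimum can be worked out by hand; the following sketch (my own derivation via the KKT conditions, worth re-checking) is one way:

```latex
% Unconstrained minimizers: (y_1, y_2, y_3, y_4) = (2, 1, 4, 0).
% The constraints y_1 \le y_2 and y_2 \le y_4 are violated there, so take
% them active at the optimum: y_1 = y_2 = y_4 = t, with y_3 = 4 unconstrained.
\min_t\; 3(t-2)^2 + (t-1)^2 + 2t^2
\;\Longrightarrow\; 12t - 14 = 0
\;\Longrightarrow\; t = \tfrac{7}{6}
% Hence y = (7/6, 7/6, 4, 7/6); the multipliers of the two active
% constraints come out as 5 and 14/3, both positive, confirming optimality.
```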

### Linear ordering

Of particular interest is *linear ordering isotonic regression*:

- Given
  - values $ \mathbf{a} \in \Bbb{R}^n $,
  - weights $ \mathbf{w} \in \Bbb{R}^n $ such that $ w_i>0 $ for all $ i $,
- minimize $ \sum_i w_i \left( y_i-a_i \right)^2 $ with respect to $ \mathbf{y} \in \Bbb{R}^n $ subject to $ y_1 \le y_2\le\;...\;\le y_n $

For linear ordering isotonic regression, there is a simple linear-time algorithm, called the Pool Adjacent Violators Algorithm (PAVA).

If all weights equal 1, the problem is called *unweighted linear ordering isotonic regression*.

#### Example

- Isotonic regression for $ 10^6 $ points.
- Left: points $ (x_i, a_i) $, where $ a_i=0 \text{ or } 1 $. The probability that $ a_i=1 $ is determined by a logistic function. Only 1000 points are shown.
- Right: the outcome of isotonic regression (black graph) versus the logistic function (red graph). The logistic function is recovered with high precision.

### Linear ordering isotonic regression as a model

Sometimes, linear ordering isotonic regression is applied to a set of observations $ (x_i, y_i) $, $ 1 \le i \le n $, where $ x $ is the explanatory variable and $ y $ is the dependent variable. The observations are sorted by their $ x $'s, then isotonic regression is applied to the $ y $'s with the additional restriction $ x_i=x_{i+1}\Rightarrow y_i=y_{i+1} $ for all $ i $.

### Other variants

#### Non-Euclidean metrics

Sometimes other metrics are used instead of the Euclidean metric, for instance the $ L_1 $ metric:

- $ \sum_i w_i |y_i-a_i|\! $

or the unweighted $ L_\infty $ metric:

- $ \max_i |y_i-a_i| $

#### Points on a grid

Sometimes, values are placed on a 2-d or higher-dimensional grid. The fitted values must be non-decreasing along each dimension, i.e.

- Minimize $ \sum_{ij} w_{ij}\left(y_{ij}-a_{ij}\right)^2 $ with respect to $ y $ subject to
  - $ y_{ij} \le y_{kl} \text{ if } i \le k \text{ and } j \le l $
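In practice only adjacent-cell constraints need to be generated, since transitivity implies the rest. A minimal sketch (illustrative, not from any library):

```python
def grid_constraints(rows, cols):
    """Adjacent-cell constraints for a rows x cols grid.

    Each pair ((i, j), (k, l)) means y[i][j] <= y[k][l]; by transitivity
    these adjacent constraints imply y_ij <= y_kl whenever i <= k and j <= l.
    """
    cons = []
    for i in range(rows):
        for j in range(cols):
            if j + 1 < cols:
                cons.append(((i, j), (i, j + 1)))  # along a row
            if i + 1 < rows:
                cons.append(((i, j), (i + 1, j)))  # along a column
    return cons
```

A 2x2 grid, for instance, needs only 4 constraints instead of all 6 comparable pairs.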

## Algorithms

### Pool Adjacent Violators Algorithm

Pool Adjacent Violators Algorithm (PAVA) is a linear time (and linear memory) algorithm for linear ordering isotonic regression.

#### Preliminary considerations

The algorithm is based on the following theorem:

**Theorem**: For an optimal solution, if $ a_i \ge a_{i+1} $, then $ y_i = y_{i+1} $

**Proof**: Suppose the opposite, i.e. $ y_i < y_{i+1} $. Then for sufficiently small $ \varepsilon > 0 $, we can set

- $ y_i^\mathrm{new} = y_i + w_{i+1} \varepsilon $
- $ y_{i+1}^\mathrm{new} = y_{i+1} - w_i \varepsilon $

which preserves the constraints (for small enough $ \varepsilon $ we still have $ y_i^\mathrm{new} \le y_{i+1}^\mathrm{new} $) and decreases the sum $ \sum_i w_i(y_i-a_i)^2\! $: its derivative with respect to $ \varepsilon $ at $ \varepsilon=0 $ equals $ 2 w_i w_{i+1}\left( (y_i-a_i) - (y_{i+1}-a_{i+1}) \right) $, which is negative because $ y_i < y_{i+1} $ and $ a_i \ge a_{i+1} $. Therefore, our original solution was not optimal. Contradiction.

Since $ y_i = y_{i+1} $, we can combine both points $ (w_i, a_i) $ and $ (w_{i+1}, a_{i+1}) $ to a new point $ \left(w_i + w_{i+1}, {w_i a_i + w_{i+1} a_{i+1} \over w_i + w_{i+1}}\right) $.

However, after we combine two points $ (w_i, a_i) $ and $ (w_{i+1},a_{i+1}) $ into a new point $ \left(w_i^\prime, a_i^\prime\right) $, this new point can violate the constraint $ a_{i-1}^\prime \le a_i^\prime $. In this case it should be combined with the $ (i-1) $-th point. If the combined point again violates its constraint, it should be combined with the previous point, and so on.

#### Algorithm

**Input**:

- values $ a_1,...,a_n\! $
- weights $ w_1,...,w_n\! $

**Output**:

- non-decreasing values $ y_1,...,y_n\! $ minimizing $ \sum_i w_i (y_i-a_i)^2 \! $

**The Algorithm**

- $ a'_1 \leftarrow a_1 $
- $ w'_1 \leftarrow w_1 $
- $ j \leftarrow 1 $
- $ S_0 \leftarrow 0 $
- $ S_1 \leftarrow 1 $
- for $ i=2,3,...,n\! $ do
  - $ j \leftarrow j+1 $
  - $ a'_j \leftarrow a_i $
  - $ w'_j \leftarrow w_i $
  - while $ j>1\! $ and $ a'_j < a'_{j-1} \! $ do
    - $ a'_{j-1} \leftarrow {w'_j a'_j + w'_{j-1} a'_{j-1} \over w'_j + w'_{j-1}} $
    - $ w'_{j-1} \leftarrow w'_j + w'_{j-1} $
    - $ j \leftarrow j-1 $
  - $ S_j \leftarrow i $
- for $ k=1,2,...,j\! $ do
  - for $ l=S_{k-1}+1,...,S_k\! $ do
    - $ y_l \leftarrow a'_k\! $

Here $ S $ records which original points each merged point covers: the $ k $-th merged point corresponds to original points $ S_{k-1}+1,...,S_k $.
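The pseudocode above can be transcribed into a short self-contained Python function (an illustrative sketch; the names are mine, not from any library):

```python
def pava(values, weights=None):
    """Linear ordering isotonic regression via Pool Adjacent Violators.

    Returns the non-decreasing y minimizing sum_i w_i * (y_i - a_i)**2.
    """
    n = len(values)
    if weights is None:
        weights = [1.0] * n  # unweighted case
    # Each block: [weighted mean, total weight, count of original points].
    blocks = []
    for a, w in zip(values, weights):
        blocks.append([a, w, 1])
        # Pool the last two blocks while they violate monotonicity.
        while len(blocks) > 1 and blocks[-1][0] < blocks[-2][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            blocks.append([(w1 * m1 + w2 * m2) / (w1 + w2), w1 + w2, c1 + c2])
    # Expand each pooled mean back over its original points.
    y = []
    for mean, _, count in blocks:
        y.extend([mean] * count)
    return y
```

For example, `pava([1, 3, 2, 4])` gives `[1, 2.5, 2.5, 4]`: the violating pair (3, 2) is pooled into their mean.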

### Arbitrary case algorithms

In the arbitrary case, the problem can be solved as a quadratic program. The best known algorithm takes $ \Theta(n^4) $ time; see:

- Maxwell, WL and Muckstadt, JA (1985), "Establishing consistent and realistic reorder intervals in production-distribution systems", *Operations Research* 33, pp. 1316-1341.
- Spouge, J, Wan, H, and Wilbur, WJ (2003), "Least squares isotonic regression in two dimensions", *J. Optimization Theory and Apps.* 117, pp. 585-605.

## Implementations

### R

#### isoreg

The function **isoreg** performs unweighted linear ordering isotonic regression. It does not require any packages. For many simple cases, it is enough.

Example of usage:

```r
x = sort(rnorm(10000))
y = x + rnorm(10000)
y.iso = isoreg(y)$yf
plot(x, y, cex=0.2); lines(x, y.iso, lwd=2, col=2)
```

The **isoreg** function also implements linear ordering isotonic regression as a model:

```r
x = rnorm(10000)
y = x + rnorm(10000)
y.iso = isoreg(x, y)$yf
plot(x, y, cex=0.2); lines(sort(x), y.iso, lwd=2, col=2)
```

#### Iso

The package **Iso** contains three functions:

- pava - linear ordering isotonic regression, weighted or unweighted
- biviso - 2-d isotonic regression
- ufit - unimodal order (increasing then decreasing)

Example of usage:

```r
install.packages("Iso")  # should be done only once
library("Iso")           # should be done once per session
x = sort(rnorm(10000))
y = x + rnorm(10000)
y.iso = pava(y)
plot(x, y, cex=0.2); lines(x, y.iso, lwd=2, col=2)
```

#### isotone

**isotone** is the most advanced of the three packages.

- gpava - linear ordering isotonic regression, weighted or unweighted, for any metric. Like isoreg, gpava can implement linear ordering isotonic regression as a model.
- activeSet - general isotonic regression for any metric.

Example of usage:

```r
install.packages("isotone")  # should be done only once
library("isotone")           # should be done once per session
x = sort(rnorm(10000))
y = x + rnorm(10000)
y.iso = gpava(y)$x
plot(x, y, cex=0.2); lines(x, y.iso, lwd=2, col=2)
```

#### Comparison of speed

Since all three packages implement some variant of the PAVA algorithm, we can compare their speed.

As we can see in the graph below, for unweighted linear ordering isotonic regression (LOIR) with $ n>200 $, **isoreg** should be used. For weighted LOIR, and for unweighted LOIR with $ n \le 200 $, **pava** should be used. **gpava** should be used only for non-Euclidean metrics.

Also, the R implementations of weighted linear ordering isotonic regression are far from perfect.

### Java

Weka, a free collection of machine learning software for data mining tasks developed at the University of Waikato, contains, among others, an isotonic regression classifier. The classifier is deeply ingrained into Weka's class hierarchy and cannot be used without Weka.

### Python

While scikit-learn implements isotonic regression, Andrew Tulloch wrote a Cython implementation of the algorithm for linear ordering isotonic regression, which is 14-5000 times faster, depending on the data size. See *Speeding up isotonic regression in scikit-learn by 5,000x*.

## Usage

### Calibration of output for categorical probabilities

*For more information, see "Predicting Good Probabilities With Supervised Learning", by Alexandru Niculescu-Mizil and Rich Caruana.*

Most supervised learning algorithms, for example Random Forests [1], boosted trees, SVMs, etc., are good at predicting the most probable category, *but not the probability*. Some of them tend to overestimate high probabilities and underestimate low probabilities. (One notable exception is neural networks, which by themselves produce well-calibrated predictions.)

In order to obtain the correct probability, unweighted linear order isotonic regression can be used:

- $ \hat y^\mathrm{final} = f(\hat y) $

where

- $ \hat y^\mathrm{final} $ is a final prediction for probability
- $ \hat y $ is a prediction given by the model
- $ f $ is a non-decreasing function

In order to find $ f $, the model's predictions on the validation set are matched with the output variable, and isotonic regression is applied to the pairs.
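This calibration step can be sketched in plain Python; the sketch below bundles a minimal unweighted PAVA with a step-function lookup (all names are illustrative, not from scikit-learn):

```python
import bisect

def pava(values):
    """Unweighted Pool Adjacent Violators: least-squares non-decreasing fit."""
    blocks = []  # each block: [mean, count of original points]
    for a in values:
        blocks.append([a, 1])
        while len(blocks) > 1 and blocks[-1][0] < blocks[-2][0]:
            m2, c2 = blocks.pop()
            m1, c1 = blocks.pop()
            blocks.append([(c1 * m1 + c2 * m2) / (c1 + c2), c1 + c2])
    return [m for m, c in blocks for _ in range(c)]

def fit_calibration(scores, labels):
    """Learn a non-decreasing map f from raw model scores to probabilities."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    xs = [scores[i] for i in order]        # validation scores, sorted
    ps = pava([labels[i] for i in order])  # calibrated probabilities

    def f(score):
        # Step-function lookup: value at the largest xs[j] not exceeding score.
        j = bisect.bisect_right(xs, score) - 1
        return ps[max(j, 0)]

    return f
```

For example, `fit_calibration([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])` returns a step function taking the values 0, 0.5, 0.5, 1 on the four sorted scores.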

An alternative approach is to pass $ \hat y $ through a sigmoid function:

- $ \hat y^\mathrm{final} = {1\over 1 + e^{a-b\hat y}} $

This is called Platt calibration.

At small dataset sizes, Platt calibration is better than isotonic regression. Starting at 200-5000 observations, isotonic regression slightly surpasses Platt calibration. Note also that this kind of isotonic regression is easier and faster than the logistic regression needed by Platt calibration.

### Calibration of recommendation models

#### The problem

We have a model M which predicts the residuals r of another model trained on the training set. For cases with a small number of users or items, the model overfits, and its predictions need relaxation:

- $ \hat r_{ui}^\mathrm{final} = f(S_{ui})\hat r_{ui} $

where

- $ \hat r_{ui}^\mathrm{final} $ is a final prediction for user u, item i
- $ \hat r_{ui} $ is a prediction given by the model
- $ f $ is a non-decreasing function
- $ S_{ui} $ may be either $ S_u $ (number of observations with user u in the training set), or $ S_i $ (number of observations with item i in the training set), or $ \min(S_u, S_i) $, depending on the nature of the model. The letter S stands for Support.

#### Solution

Given a set of triplets $ (\hat r_k, r_k, S_k) $ from the validation database, we have:

- $ \min_f \sum_k \left(r_k - \hat r_k f(S_k)\right)^2 = \min_f \sum_k \hat r_k^2 \left( f(S_k) - \frac{r_k}{\hat r_k} \right)^2 $

Therefore, for the $ k $-th observation we should set weight $ w_k=\hat r_k^2 $ and value $ y_k = r_k / \hat r_k $. The triplets $ (S_k, w_k, y_k) $ are then sorted with respect to $ S $, and isotonic regression is applied to them.

If a model directly predicts the rating or acceptance rate, rather than the residual of another model, we can define $ r_{ui} $ as the difference between this model and some simpler model (e.g. the average rating for the item plus the average rating correction for the user). The simpler model is called the *basic model*.
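The weight-and-value transformation described above can be sketched as follows (a hypothetical helper with illustrative names; the resulting weighted points would then be fed to weighted isotonic regression):

```python
def calibration_points(r_hat, r, support):
    """Turn triplets (r_hat_k, r_k, S_k) into weighted points for isotonic
    regression: weight w_k = r_hat_k ** 2, value y_k = r_k / r_hat_k,
    sorted by support S_k."""
    triples = sorted(zip(support, r_hat, r))
    s_sorted = [s for s, _, _ in triples]
    weights = [rh ** 2 for _, rh, _ in triples]
    values = [rk / rh for _, rh, rk in triples]
    return s_sorted, weights, values
```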

### Analysis and calibration of input variables

Often, we know an output variable depends on an input variable monotonically, but other than this we have very little prior knowledge about the dependence. In this case, isotonic regression can provide important hints about the dependence.

Below is a real-world example of a bankruptcy flag vs. revolving utilization of unsecured lines, from the Give Me Some Credit contest.

The dependence can be used in several ways:

- for calibration of the input variable, possibly after smoothing. In the example given, the uncalibrated indicator produced a correlation of -0.002, due to outliers. The standard calibration, log(x+1), produced a correlation of 0.18. The isotonic regression calibration produced a correlation of 0.30.
- by the person who makes the final decision, in the context of human-machine cooperation.
- for filtering outliers. In this case, we see that the outlier filter should be set very low, probably around 1.1 or lower. The horizontal line starts at x=1.029.

### Nonmetric multidimensional scaling

A multidimensional scaling (MDS) algorithm, given the matrix of item–item similarities, puts items into an N-dimensional space such that similar items are closer to each other than dissimilar ones. This is particularly useful for visualization purposes. [2]

There are several kinds of MDS, depending on the relationship between similarity and distance, as well as on the definition of distance. For nonmetric MDS, the distance may be an arbitrary monotonic function of similarity. The function is typically found using isotonic regression [3]. Nonmetric multidimensional scaling is implemented as isoMDS in the MASS library [4].

## Advantages, disadvantages, and post-processing

Advantages:

- Non-parametric
- Fast (linear time)
- Simple

Disadvantages:

- one or several points at the ends of the interval are sometimes noisy
- it compares favorably to other approaches only when there is enough data ($ n\gtrapprox 10^3 $, ideally $ n>10^4\! $)

These disadvantages can be mitigated by smoothing the outcome of isotonic regression. This way, we can get the best of both worlds (smoothness and monotonicity).
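One simple smoother that provably preserves monotonicity is a centered moving average; a sketch (the window size here is an arbitrary illustrative choice):

```python
def smooth_monotone(y, half_window=2):
    """Centered moving average. Averaging over a sliding window preserves
    monotonicity, so a non-decreasing input (e.g. the output of isotonic
    regression) stays non-decreasing while becoming smoother."""
    n = len(y)
    out = []
    for i in range(n):
        lo = max(0, i - half_window)
        hi = min(n, i + half_window + 1)
        out.append(sum(y[lo:hi]) / (hi - lo))
    return out
```

Monotonicity is preserved because moving from one window to the next only drops the smallest elements and/or adds larger ones, which cannot decrease the mean.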

## External links

- Isotonic Regression Algorithms, by Quentin F. Stout - a review of the best currently existing algorithms

### Articles

- Predicting Good Probabilities With Supervised Learning, by Alexandru Niculescu-Mizil, Rich Caruana
- Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods, by John C. Platt