Maximum Entropy Classifier


Maximum Entropy modeling has been successfully applied to many fields such as Computer Vision, Spatial Physics and Natural Language Processing. Maximum Entropy models can be used for learning from many heterogeneous information sources. Maximum entropy classifiers are considered alternatives to Naive Bayes classifiers, because they do not require the naive assumption of the independent features. However, they are significantly slower than Naive Bayes classifiers, and thus may not be appropriate when there is a very large number of classes.

A typical binary maximum entropy classifier can be implemented using logistic regression. Logistic regression can be explained with the logistic function as follows:

$$ L_f(z) = \frac{1}{1 + e^{-z} } $$

where the variable \(z\) is defined as:

$$z=\beta_0 + \beta_1y_1 + \beta_2y_2 + \beta_3y_3 + \cdots + \beta_ky_k$$
where \\(\beta_0\\) is called the "intercept", and \\(\beta_1, \beta_2,\dots,\beta_k\\) are called the "regression coefficients" of features \\(y_1, y_2,\dots y_k\\) respectively. Logistic regression describes the relationship between one or more independent features and a binary label, expressed as a probability.

Generally, maximum entropy classifiers are implemented using the principle of maximum entropy. Suppose we have some testable label \(L\) about a quantity \(x\) taking values in \({x_1, x_2,..., x_n}\) where \(x_i\) is a feature vector of size \(m\). This label can be expressed by \(m\) constraints on the expectations of the feature values; that is, the following probability distribution must be satisfied, where \(f_k(x_i)\) returns the \(k\)th element of the feature vector \(x_i\):

$$\sum_{i=1}^n p(x_i|L)f_k(x_i) = C_k \qquad k = 1, \ldots,m.$$

In addition, the probabilities must sum to one.

$$\sum_{i=1}^n p(x_i|L) = 1.$$

The probability distribution with maximum information entropy subject to above constraints is:

$$p(x_i|L) = \frac{e^{\lambda_1 f_1(x_i) + \cdots + \lambda_m f_m(x_i)}}{\sum_{i=1}^n e^{\lambda_1 f_1(x_i) + \cdots + \lambda_m f_m(x_i)}}$$

The \(\lambda_k\) parameters are Lagrange multipliers whose particular values are determined by the constraints according to

$$C_k = \frac{\partial}{\partial \lambda_k} \log \sum_{i=1}^n e^{\lambda_1 f_1(x_i) + \cdots + \lambda_m f_m(x_i)} \qquad k = 1, \ldots,m.$$

The above equation presents \(m\) simultaneous equations, which are usually solved by numerical methods.