 # Introduction to Supervised Learning

Given a set of input data points $$\{x_1,...,x_m\}$$ associated to a set of output data $$\{y_1,...,y_m\}$$ the goal is to train a model that is able to predict $$y$$ from $$x$$.

The problem of supervised learning can be formulated as follows: Given an unknown function $$g: \mathcal{X}\rightarrow \mathcal{Y}$$ which is known as the ground truth and maps input instances $$\boldsymbol{x} \in \mathcal{X}$$ to output labels $$y \in \mathcal{Y}$$, along with training data $$\mathbf{D} = {(\boldsymbol{x}_1,y_1),\dots,(\boldsymbol{x}_n, y_n)}$$, the goal is to find a function $$h:\mathcal{X}\rightarrow\mathcal{Y}$$ that approximates as closely as possible the correct mapping $$g$$. In decision theory, the condition "approximates as closely as possible" is defined by specifying a loss function that assigns a specific value to "loss" resulting from producing an incorrect label. The goal then is the expected loss minimization. The distribution of $$\mathcal{X}$$ and the ground truth function $$g:\mathcal{X}\rightarrow\mathcal{Y}$$ can only be computed empirically by providing a large number of samples of $$\mathcal{X}$$ and their corresponding labels $$\mathcal{Y}$$. The type of label being predicted determines the particular loss function needed. For example, in the case of binary classification, a simple zero-one loss function would be sufficient. The zero-one loss function corresponds to assigning a loss of 1 to any incorrect labeling. In other words it is equivalent to computing the accuracy of the classification procedure over a set of test data where the goal is to maximize this accuracy.

From a probabilistic point of view, the supervised learning problem is to estimate a function $$f$$ which is the probability of each possible output label $$y$$ given a particular input instance $$x$$: $$p({\rm y}|\boldsymbol{x},\boldsymbol\theta) = f\left(\boldsymbol{x};\boldsymbol{\theta}\right)$$

where the feature vector input is $$x$$, and the function $$f$$ is typically parametrized by some parameters $$\theta$$. There are two approaches to the problem:

1. Discriminative approach, where the function $$f$$ is estimated directly;
2. Generative approach, where the inverse probability $$p(x|{\rm y})$$ is estimated and combined with the prior probability $$p({\rm y}|\boldsymbol\theta)$$ using Bayes' rule, as follows: $$p({\rm y}|\boldsymbol{x},\boldsymbol\theta) = \frac{p({\boldsymbol{x}|\rm y}) p({\rm y|\boldsymbol\theta})}{\sum_{y^{'} \in \mathcal{Y}} p(\boldsymbol{x}|y^{'}) p(y^{'} |\boldsymbol\theta)}$$

In the case that the labels are continuously distributed[^1], the denominator will be an integration:

$$p({\rm y}|\boldsymbol{x},\boldsymbol\theta) = \frac{p({\boldsymbol{x}|\rm y}) p({\rm y|\boldsymbol\theta})}{\int_{y^{'} \in \mathcal{Y}} p(\boldsymbol{x}|y^{'}) p(y^{'}|\boldsymbol\theta) \operatorname{d}y^{'}}$$

The following supervised learning methods belong to the set of most commonly applied machine learning methods: