Machine Learning class note 3 - Logistic Regression

II. Logistic regression

0. Presentation

Logistic Regression

Idea: classify y = 0 (negative class) or y = 1 (positive class)

From linear regression we have hθ(x) = θᵀX. For classification we need to choose a hypothesis function such that 0 ≤ hθ(x) ≤ 1.

1. Hypothesis function

h_\theta(x) = g\left(\sum_{i=0}^{n}\theta_i x_i\right) \quad \textrm{for} \quad g(z) = \frac{1}{1 + e^{-z}}

Note: Octave implementation of the sigmoid function

g = 1 ./ (1 + exp(-z));   % element-wise, so z can be a scalar, vector, or matrix
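A quick sanity check (my addition; the values follow directly from the definition): g(0) = 0.5, and g(z) saturates toward 0 and 1 for large negative and positive z.

sigmoid([-10 0 10])   % approximately [0.000045  0.500000  0.999955]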

Vectorized form:
Since the sum ∑ θᵢxᵢ over i = 0, …, n equals θᵀX, the vectorized form of hθ(x) is

\frac{1}{1 + e^{-\theta^TX}} \quad \textrm{or} \quad sigmoid(\theta^TX)

Octave implementation

h = sigmoid(X * theta);   % X is the m x (n+1) design matrix, theta is (n+1) x 1

Logistic Regression

hθ(x) is the estimated probability that y = 1 for input x.

When sigmoid(θᵀX) ≥ 0.5 we predict y = 1. And sigmoid(θᵀX) ≥ 0.5 exactly when θᵀX ≥ 0, since g(0) = 0.5 and g is monotonically increasing.

So we predict y = 1 whenever θᵀX ≥ 0. The equation θᵀX = 0 defines the decision boundary, the line (or curve) that separates the region where we predict y = 0 from the region where we predict y = 1. It does not need to be linear, since X can contain polynomial terms.

The decision boundary is a property of the hypothesis and its parameters θ, not of the training set.
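In Octave, this decision rule is a one-liner. A minimal sketch, assuming theta has already been learned and X is the design matrix:

p = sigmoid(X * theta) >= 0.5;   % logical vector: 1 where we predict y = 1, 0 where we predict y = 0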

2. The cost function

We need to choose a cost function that is convex, with a single global minimum, so that gradient descent is guaranteed to converge to it.
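For reference (this step is left implicit in the note): reusing the linear-regression squared error with the sigmoid hypothesis gives a non-convex function of θ, which is why a different cost function is needed:

J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{2} \left( \frac{1}{1+e^{-\theta^Tx^{(i)}}} - y^{(i)} \right)^2 \quad \textrm{(non-convex in } \theta \textrm{)}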

Cost function

\begin{aligned}
&J(\theta) = \frac{1}{m} \sum_{i=1}^{m} Cost(h_\theta(x^{(i)}), y^{(i)}) \\
&\textrm{with} \\
&Cost(h_\theta(x^{(i)}), y^{(i)}) =
\left\{
\begin{array}{lcl}
-\log(h_\theta(x)) & \text{if} & y = 1 \\
-\log(1 - h_\theta(x)) & \text{if} & y = 0
\end{array}
\right.
\end{aligned}
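Intuition (my addition): this cost heavily penalizes confident wrong predictions. For example, when y = 1 the cost is 0 if hθ(x) = 1, but grows without bound as hθ(x) approaches 0:

\lim_{h_\theta(x) \to 0^{+}} -\log(h_\theta(x)) = \infty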

A simplified form of the cost function, equivalent because for each example one of the two terms vanishes depending on whether y is 1 or 0, is:
J(\theta) = -\frac{1}{m}\left[ \sum_{i=1}^{m} y^{(i)} \log(h_\theta(x^{(i)})) + (1-y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right]

Vectorized form:
J(\theta) = \frac{1}{m}\left( -y^T \log(sigmoid(X\theta)) - (1-y)^T \log(1-sigmoid(X\theta)) \right)

Code in Octave to compute the cost function

% m is the number of training examples; X is m x (n+1), y is m x 1
J = (1/m) * ( -y' * log(sigmoid(X*theta)) - (1-y') * log(1-sigmoid(X*theta)) );
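A quick check on a toy dataset (my own example, not from the lecture): with θ = 0 the hypothesis outputs 0.5 for every example, so the cost must be −log(0.5) ≈ 0.693 regardless of the labels.

X = [1 1; 1 2; 1 3];   % toy design matrix: bias column plus one feature
y = [0; 0; 1];         % toy labels
theta = zeros(2, 1);   % start at theta = 0
m = length(y);
J = (1/m) * (-y' * log(sigmoid(X*theta)) - (1-y') * log(1-sigmoid(X*theta)))
% J = 0.6931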

We need to find the parameter θ that minimizes J(θ). Then we can make a prediction for a new x using
h_\theta(x) = \frac{1}{1 + e^{-\theta^TX}}

3. Gradient Descent algorithm

Using gradient descent to find the value of θ that minimizes J(θ), we have

Repeat until convergence
\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)

Computing the partial derivative gives the explicit update rule. Repeat until convergence
\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}

Vectorized form:
\theta := \theta - \frac{\alpha}{m} \left( X^T (sigmoid(X\theta) - \vec{y}) \right)

Code in Octave to compute the gradient ∂J(θ)/∂θⱼ

grad = (1/m) * (X' * (sigmoid(X * theta) - y));   % (n+1) x 1 vector of partial derivatives
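Putting the update rule into a loop gives a minimal gradient-descent sketch; alpha and num_iters below are hand-picked placeholder values, not from the lecture:

alpha = 0.01;                    % learning rate (assumed; tune for your data)
num_iters = 1000;                % number of iterations (assumed)
theta = zeros(size(X, 2), 1);    % initialize theta to zeros
for iter = 1:num_iters
  grad = (1/m) * (X' * (sigmoid(X * theta) - y));   % gradient as above
  theta = theta - alpha * grad;                     % simultaneous update of all theta_j
end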

4. Adding a regularization parameter

Regularized cost function:
J(\theta) = -\frac{1}{m}\left[ \sum_{i=1}^{m} y^{(i)} \log(h_\theta(x^{(i)})) + (1-y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2

The second sum, ∑ θⱼ² for j = 1, …, n, explicitly excludes the bias term θ₀: the θ vector is indexed from 0 to n (holding n+1 values, θ₀ through θₙ), and the sum runs from 1 to n, skipping index 0.

Octave code to compute the cost function with regularization

% theta(1) holds the bias theta_0 (Octave indexes from 1), so only theta(2:end) is regularized
J = (1/m) * (-y' * log(sigmoid(X*theta)) - (1-y') * log(1-sigmoid(X*theta))) + lambda/(2*m) * sum(theta(2:end).^2);

Thus, when running gradient descent, we repeatedly apply the two following update rules:
Regularized Gradient Descent:
Repeat
{
\begin{aligned}
\theta_0 &:= \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_0^{(i)} \\
\theta_j &:= \theta_j - \alpha \left[ \left( \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} \right) + \frac{\lambda}{m} \theta_j \right] \quad \textrm{for} \quad j \geq 1
\end{aligned}

}

Octave code to compute the regularized gradient ∂J(θ)/∂θⱼ

grad = (1/m) * (X' * (sigmoid(X * theta) - y)) + (lambda/m) * [0; theta(2:end)];   % the leading 0 skips theta_0

Notice that we don't add the regularization term for θ₀.
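Putting the regularized cost and gradient together, a sketch of a single helper function (the name costFunctionReg and its signature are my choice, not defined in this note):

function [J, grad] = costFunctionReg(theta, X, y, lambda)
  m = length(y);
  h = sigmoid(X * theta);
  J = (1/m) * (-y' * log(h) - (1-y') * log(1-h)) + lambda/(2*m) * sum(theta(2:end).^2);
  grad = (1/m) * (X' * (h - y)) + (lambda/m) * [0; theta(2:end)];
end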

5. Advanced Optimization

Prepare a function that computes J(θ) and its gradient ∂J(θ)/∂θⱼ for a given θ:

function [jVal, gradient] = costFunction(theta, X, y)
  % cost J(theta) and its gradient for logistic regression, as derived above
  m = length(y);
  h = sigmoid(X * theta);
  jVal = (1/m) * (-y' * log(h) - (1-y') * log(1-h));
  gradient = (1/m) * (X' * (h - y));
end

Then, given this function, Octave provides several advanced optimization algorithms to minimize J(θ); we should not implement the algorithms below ourselves:

  • Conjugate gradient
  • BFGS
  • L-BFGS
options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2, 1);   % one value per column of X (2 parameters in this example)
[optTheta, functionVal, exitFlag] = fminunc(@(t) costFunction(t, X, y), initialTheta, options);

We give fminunc() our cost function (wrapped in an anonymous function so that it also receives X and y), our initial vector of theta values, and the options object we created beforehand.

Advantages:
  • No need to manually pick α.
  • Often faster than gradient descent.

Disadvantages:
  • More complex.

Practical advice: try a couple of different libraries.