Machine Learning class note 3 - Logistic Regression

II. Logistic regression

0. Presentation

Logistic Regression

Idea: classify y = 0 (negative class) or y = 1 (positive class)

From linear regression we have hθ(x) = θᵀX. For classification we need to choose a hypothesis function such that 0 ≤ hθ(x) ≤ 1.

1. Hypothesis function

h_\theta(x) = g\left(\sum_{i=0}^{n}\theta_i x_i\right) \quad \textrm{for} \quad g(z) = \frac{1}{1 + e^{-z}}

Note: Octave implementation of the sigmoid function

g = 1 ./ (1 + exp(-z));   % element-wise, so z can be a scalar, vector, or matrix
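A quick sanity check (my addition; the values follow directly from the definition): g(0) = 0.5, and g(z) saturates toward 0 and 1 for large negative and positive z.

sigmoid([-10 0 10])   % approximately [0.000045  0.500000  0.999955]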

Vectorized form:
Since the sum ∑ θᵢxᵢ over i = 0, …, n equals θᵀX, the vectorized form of hθ(x) is

\frac{1}{1 + e^{-\theta^TX}} \quad \textrm{or} \quad sigmoid(\theta^TX)

Octave implementation

h = sigmoid(X * theta);   % X is the m x (n+1) design matrix, theta is (n+1) x 1

Logistic Regression

hθ(x) is the estimated probability that y = 1 for input x.

When sigmoid(θᵀX) ≥ 0.5 we predict y = 1. And sigmoid(θᵀX) ≥ 0.5 exactly when θᵀX ≥ 0, since g(0) = 0.5 and g is monotonically increasing.

So we predict y = 1 whenever θᵀX ≥ 0. The equation θᵀX = 0 defines the decision boundary, the line (or curve) that separates the region where we predict y = 0 from the region where we predict y = 1. It does not need to be linear, since X can contain polynomial terms.

The decision boundary is a property of the hypothesis and its parameters θ, not of the training set.
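In Octave, this decision rule is a one-liner. A minimal sketch, assuming theta has already been learned and X is the design matrix:

p = sigmoid(X * theta) >= 0.5;   % logical vector: 1 where we predict y = 1, 0 where we predict y = 0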

2. The cost function

We need to choose a cost function that is convex, with a single global minimum, so that gradient descent is guaranteed to converge to it.
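For reference (this step is left implicit in the note): reusing the linear-regression squared error with the sigmoid hypothesis gives a non-convex function of θ, which is why a different cost function is needed:

J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{2} \left( \frac{1}{1+e^{-\theta^Tx^{(i)}}} - y^{(i)} \right)^2 \quad \textrm{(non-convex in } \theta \textrm{)}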

Cost function

\begin{aligned}
&J(\theta) = \frac{1}{m} \sum_{i=1}^{m} Cost(h_\theta(x^{(i)}), y^{(i)}) \\
&\textrm{with} \\
&Cost(h_\theta(x^{(i)}), y^{(i)}) =
\left\{
\begin{array}{lcl}
-\log(h_\theta(x)) & \text{if} & y = 1 \\
-\log(1 - h_\theta(x)) & \text{if} & y = 0
\end{array}
\right.
\end{aligned}
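Intuition (my addition): this cost heavily penalizes confident wrong predictions. For example, when y = 1 the cost is 0 if hθ(x) = 1, but grows without bound as hθ(x) approaches 0:

\lim_{h_\theta(x) \to 0^{+}} -\log(h_\theta(x)) = \infty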

A simplified form of the cost function, equivalent because for each example one of the two terms vanishes depending on whether y is 1 or 0, is:
J(\theta) = -\frac{1}{m}\left[ \sum_{i=1}^{m} y^{(i)} \log(h_\theta(x^{(i)})) + (1-y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right]

Vectorized form:
J(\theta) = \frac{1}{m}\left( -y^T \log(sigmoid(X\theta)) - (1-y)^T \log(1-sigmoid(X\theta)) \right)

Code in Octave to compute the cost function

% m is the number of training examples; X is m x (n+1), y is m x 1
J = (1/m) * ( -y' * log(sigmoid(X*theta)) - (1-y') * log(1-sigmoid(X*theta)) );
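A quick check on a toy dataset (my own example, not from the lecture): with θ = 0 the hypothesis outputs 0.5 for every example, so the cost must be −log(0.5) ≈ 0.693 regardless of the labels.

X = [1 1; 1 2; 1 3];   % toy design matrix: bias column plus one feature
y = [0; 0; 1];         % toy labels
theta = zeros(2, 1);   % start at theta = 0
m = length(y);
J = (1/m) * (-y' * log(sigmoid(X*theta)) - (1-y') * log(1-sigmoid(X*theta)))
% J = 0.6931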

We need to find the parameter θ that minimizes J(θ). Then we can make a prediction for a new x using
h_\theta(x) = \frac{1}{1 + e^{-\theta^TX}}

3. Gradient Descent algorithm

Using gradient descent to find the value of θ that minimizes J(θ), we have

Repeat until convergence
\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)

Computing the partial derivative gives the explicit update rule. Repeat until convergence
\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}

Vectorized form:
\theta := \theta - \frac{\alpha}{m} \left( X^T (sigmoid(X\theta) - \vec{y}) \right)

Code in Octave to compute the gradient ∂J(θ)/∂θⱼ

grad = (1/m) * (X' * (sigmoid(X * theta) - y));   % (n+1) x 1 vector of partial derivatives
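Putting the update rule into a loop gives a minimal gradient-descent sketch; alpha and num_iters below are hand-picked placeholder values, not from the lecture:

alpha = 0.01;                    % learning rate (assumed; tune for your data)
num_iters = 1000;                % number of iterations (assumed)
theta = zeros(size(X, 2), 1);    % initialize theta to zeros
for iter = 1:num_iters
  grad = (1/m) * (X' * (sigmoid(X * theta) - y));   % gradient as above
  theta = theta - alpha * grad;                     % simultaneous update of all theta_j
end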

4. Adding a regularization parameter

Regularized cost function:
J(\theta) = -\frac{1}{m}\left[ \sum_{i=1}^{m} y^{(i)} \log(h_\theta(x^{(i)})) + (1-y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2

The second sum, ∑ θⱼ² for j = 1, …, n, explicitly excludes the bias term θ₀: the θ vector is indexed from 0 to n (holding n+1 values, θ₀ through θₙ), and the sum runs from 1 to n, skipping index 0.

Octave code to compute the cost function with regularization

% theta(1) holds the bias theta_0 (Octave indexes from 1), so only theta(2:end) is regularized
J = (1/m) * (-y' * log(sigmoid(X*theta)) - (1-y') * log(1-sigmoid(X*theta))) + lambda/(2*m) * sum(theta(2:end).^2);

Thus, when running gradient descent, we repeatedly apply the two following update rules:
Regularized Gradient Descent:
Repeat
{
\begin{aligned}
\theta_0 &:= \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_0^{(i)} \\
\theta_j &:= \theta_j - \alpha \left[ \left( \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} \right) + \frac{\lambda}{m} \theta_j \right] \quad \textrm{for} \quad j \geq 1
\end{aligned}

}

Octave code to compute the regularized gradient ∂J(θ)/∂θⱼ

grad = (1/m) * (X' * (sigmoid(X * theta) - y)) + (lambda/m) * [0; theta(2:end)];   % the leading 0 skips theta_0

Notice that we don't add the regularization term for θ₀.
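Putting the regularized cost and gradient together, a sketch of a single helper function (the name costFunctionReg and its signature are my choice, not defined in this note):

function [J, grad] = costFunctionReg(theta, X, y, lambda)
  m = length(y);
  h = sigmoid(X * theta);
  J = (1/m) * (-y' * log(h) - (1-y') * log(1-h)) + lambda/(2*m) * sum(theta(2:end).^2);
  grad = (1/m) * (X' * (h - y)) + (lambda/m) * [0; theta(2:end)];
end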

5. Advanced Optimization

Prepare a function that computes J(θ) and its gradient ∂J(θ)/∂θⱼ for a given θ:

function [jVal, gradient] = costFunction(theta, X, y)
  % cost J(theta) and its gradient for logistic regression, as derived above
  m = length(y);
  h = sigmoid(X * theta);
  jVal = (1/m) * (-y' * log(h) - (1-y') * log(1-h));
  gradient = (1/m) * (X' * (h - y));
end

Then, given this function, Octave provides several advanced optimization algorithms to minimize J(θ); we should not implement the algorithms below ourselves:

  • Conjugate gradient
  • BFGS
  • L-BFGS
options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2, 1);   % one value per column of X (2 parameters in this example)
[optTheta, functionVal, exitFlag] = fminunc(@(t) costFunction(t, X, y), initialTheta, options);

We give fminunc() our cost function (wrapped in an anonymous function so that it also receives X and y), our initial vector of theta values, and the options object we created beforehand.

Advantages:
  • No need to manually pick α.
  • Often faster than gradient descent.

Disadvantages:
  • More complex.

Practical advice: try a couple of different libraries.