Here’s a step-by-step explanation of logistic regression:
Step 1: Understand the Data
You need a dataset with:
- Independent variables (X): Features or predictors.
- Dependent variable (Y): A binary outcome (0 or 1).
For example, you might want to predict whether an email is spam (1) or not spam (0) based on various features like the frequency of certain words.
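To make this concrete, here is a minimal sketch of how such a dataset might be represented with NumPy; the feature columns and label values below are invented purely for illustration.

```python
import numpy as np

# Each row is one email; the columns are hypothetical features,
# e.g. frequency of the word "free", frequency of the word "winner",
# and a count of exclamation marks.
X = np.array([
    [0.30, 0.10, 5],   # spammy-looking email
    [0.00, 0.00, 0],   # ordinary email
    [0.25, 0.20, 3],
    [0.01, 0.00, 1],
])

# Binary labels: 1 = spam, 0 = not spam
y = np.array([1, 0, 1, 0])
```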
Step 2: Define the Hypothesis Function
Logistic regression models the probability that the dependent variable equals 1 given the independent variables. This probability is modeled using the logistic function (or sigmoid function):

P(Y=1∣X) = 1 / (1 + e^(−(b0 + b1x1 + b2x2 + ... + bnxn)))

Where:
- P(Y=1∣X) is the probability of the outcome being 1.
- x1, x2, ..., xn are the values of the independent variables.
- b0 is the intercept.
- b1, b2, ..., bn are the coefficients (weights) of the independent variables.
- e is the base of the natural logarithm.
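A minimal sketch of this hypothesis function in Python (the names `sigmoid` and `predict_proba` are illustrative, not from any particular library):

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, b0, b):
    """P(Y=1 | X) for feature matrix X, intercept b0, coefficient vector b."""
    z = b0 + X @ b          # linear combination b0 + b1*x1 + ... + bn*xn
    return sigmoid(z)
```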
Step 3: Compute the Cost Function
To find the best-fitting model, you need to minimize the cost function (also called the loss function). For logistic regression, the cost function is the logistic loss, or binary cross-entropy loss:

J = −(1/m) Σ(i=1 to m) [ yi log(ŷi) + (1 − yi) log(1 − ŷi) ]

Where:
- m is the number of samples.
- yi is the actual class label (0 or 1) for the i-th sample.
- ŷi is the predicted probability for the i-th sample.
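The loss can be sketched directly from this definition; clipping predictions by a small `eps` is an implementation detail to avoid log(0), not part of the formula itself:

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Average logistic loss over m samples."""
    y_pred = np.clip(y_pred, eps, 1 - eps)  # guard against log(0)
    m = y_true.shape[0]
    return -(1 / m) * np.sum(
        y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)
    )
```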
Step 4: Optimize the Cost Function
The goal is to find the parameters b0,b1,...,bn that minimize the cost function. This is typically done using gradient descent, an iterative optimization algorithm that adjusts the parameters in the direction that reduces the cost function.
The update rule for gradient descent is:

bj := bj − α · ∂J/∂bj

Where α is the learning rate, and ∂J/∂bj is the partial derivative of the cost function with respect to the parameter bj.
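Putting Steps 2 through 4 together, here is a sketch of batch gradient descent. It relies on the standard result that the gradient of the binary cross-entropy for logistic regression is (1/m)·Xᵀ(ŷ − y); the learning rate and iteration count below are arbitrary illustrative choices.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, n_iters=1000):
    """Fit intercept b0 and weights b by batch gradient descent."""
    m, n = X.shape
    b0, b = 0.0, np.zeros(n)
    for _ in range(n_iters):
        y_hat = 1.0 / (1.0 + np.exp(-(b0 + X @ b)))  # predicted probabilities
        error = y_hat - y
        b0 -= alpha * np.mean(error)         # gradient w.r.t. the intercept
        b -= alpha * (X.T @ error) / m       # gradient w.r.t. each coefficient
    return b0, b
```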
Step 5: Make Predictions
Once the parameters are learned, use the logistic function to predict probabilities. To classify a sample, compare the predicted probability to a threshold (typically 0.5). If the probability is greater than or equal to 0.5, classify the sample as 1; otherwise, classify it as 0.
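Thresholding is then a one-liner; a sketch, reusing parameters of the form learned above:

```python
import numpy as np

def predict(X, b0, b, threshold=0.5):
    """Classify each sample as 1 if P(Y=1|X) >= threshold, else 0."""
    probs = 1.0 / (1.0 + np.exp(-(b0 + X @ b)))
    return (probs >= threshold).astype(int)
```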
Step 6: Evaluate the Model
Evaluate the performance of your logistic regression model using metrics such as:
- Accuracy: The proportion of correctly classified samples.
- Precision: The proportion of positive identifications that were actually correct.
- Recall (Sensitivity): The proportion of actual positives that were correctly identified.
- F1 Score: The harmonic mean of precision and recall.
- ROC Curve and AUC: The Receiver Operating Characteristic curve plots the true positive rate versus the false positive rate, and the Area Under the Curve (AUC) measures the model’s ability to distinguish between classes.
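If scikit-learn is available, all of these metrics come built in. A minimal sketch, where `y_true` and `y_prob` stand in for your actual labels and model outputs (the values shown are made up):

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true = [0, 1, 1, 0, 1]            # actual labels (illustrative values)
y_prob = [0.2, 0.8, 0.6, 0.4, 0.9]  # predicted probabilities from the model
y_pred = [int(p >= 0.5) for p in y_prob]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))
print("ROC AUC  :", roc_auc_score(y_true, y_prob))
```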
Step 7: Assumptions of Logistic Regression
Logistic regression makes some assumptions about the data:
- Linearity of the Logit: The log-odds of the dependent variable is a linear combination of the independent variables.
- Independence of Observations: The observations should be independent of each other.
- Absence of Multicollinearity: Independent variables should not be highly correlated with each other.
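A quick, if rough, screen for multicollinearity is the pairwise correlation matrix of the features; a sketch with NumPy (the 0.8 cutoff in the comment is a common rule of thumb, not a hard rule):

```python
import numpy as np

X = np.random.rand(100, 3)           # illustrative feature matrix
corr = np.corrcoef(X, rowvar=False)  # pairwise correlations between feature columns
print(corr)
# As a rough heuristic, absolute correlations above ~0.8
# suggest problematic multicollinearity.
```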
Step 8: Improve the Model (Optional)
- Feature Engineering: Create new features or transform existing ones to better capture the relationship between X and Y.
- Regularization: Apply techniques like L1 (Lasso) or L2 (Ridge) regularization to prevent overfitting and improve model generalization.
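In scikit-learn, regularization for logistic regression is controlled through the `penalty` and `C` parameters of `LogisticRegression` (`C` is the inverse of regularization strength, so smaller values mean stronger regularization); a minimal sketch:

```python
from sklearn.linear_model import LogisticRegression

# L2 (Ridge) regularization is the default.
model_l2 = LogisticRegression(penalty="l2", C=1.0)

# L1 (Lasso) regularization requires a compatible solver such as liblinear or saga.
model_l1 = LogisticRegression(penalty="l1", C=1.0, solver="liblinear")

# model_l1.fit(X, y) would then train the regularized model on your data.
```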