With the number of Machine Learning algorithms constantly growing, it is nice to have a reference point to brush up on some of the fundamental models, be it for an interview or just a quick refresher. I wanted to provide a resource covering the pros and cons of some of the most common models, along with sample code implementations of each of these algorithms in Python.
1. Multiple Linear Regression
Pros
- Easy to implement, the theory is not complex, and it requires low computational power compared to other algorithms.
- Easy to interpret coefficients for analysis.
- Performs well when the relationship between the features and the target is approximately linear.
- Susceptible to overfitting, but overfitting can be mitigated with dimensionality reduction techniques, cross-validation, and regularization methods.
Cons
- Real-world data rarely has a perfectly linear relationship, so the model often suffers from under-fitting in real-world scenarios or is outperformed by other ML and Deep Learning algorithms.
- Parametric, with many assumptions about the data and its distribution that need to be met. Assumes a linear relationship between the dependent and independent variables.
- Examples of assumptions: there is a linear relationship between the dependent variable and the independent variables; the independent variables aren't too highly correlated with each other; observations of the dependent variable are selected independently and at random; regression residuals are normally distributed.
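Below is a minimal sketch of multiple linear regression using scikit-learn. The synthetic two-feature dataset, the noise level, and the train/test split are illustrative assumptions, not part of any particular project.

```python
# Minimal sketch: multiple linear regression with scikit-learn on synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: y depends linearly on two features plus noise (illustrative only).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

print("Coefficients:", model.coef_)   # easy to interpret for analysis
print("Intercept:", model.intercept_)
print("R^2 on test set:", model.score(X_test, y_test))
```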
2. Logistic Regression
Pros
- Simple algorithm that is easy to implement and does not require high computational power.
- Performs extremely well when the data/response variable is linearly separable.
- Less prone to over-fitting with low-dimensional data.
- Very easy to interpret; can give a measure of how relevant a predictor is and the direction of the association (positive or negative impact on the response variable).
Cons
- Logistic regression has a linear decision surface separating its classes; in the real world it is extremely rare to have linearly separable data.
- Requires careful data exploration: logistic regression suffers on datasets with high multicollinearity between variables, where repetition of information can lead to wrongly trained parameters.
- Requires that independent variables are linearly related to the log odds (log(p/(1-p))).
- Algorithm is sensitive to outliers.
- Hard to capture complex relationships; deep learning models and classifiers such as Random Forest can outperform it on more realistic datasets.
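A minimal logistic regression sketch with scikit-learn; the synthetic dataset from make_classification and the chosen parameters are illustrative assumptions.

```python
# Minimal sketch: logistic regression for binary classification with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic, roughly linearly separable data (illustrative only).
X, y = make_classification(n_samples=500, n_features=4, n_informative=3,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# Coefficients indicate each predictor's direction of association with the response.
print("Coefficients:", clf.coef_)
print("Test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```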
3. k-Nearest Neighbors (KNN)
Pros
- A lazy-learning algorithm with no actual training step; a new data point is simply assigned the majority class of its nearest neighbors in the historical data. Very easy to understand and implement.
- Can be used for both classification and regression.
- Only one hyper-parameter: k value.
- Non-parametric, makes no assumptions about the data/parameters.
Cons
- Struggles with a large number of dimensions: the more dimensions there are, the harder it is for the algorithm to calculate distances efficiently (the Curse of Dimensionality). Often need to use dimensionality reduction techniques, especially in regression tasks with noisy data.
- Very sensitive to outliers & noise.
- Can become very computationally expensive as the dataset grows; it needs to hold the entire dataset in memory and prediction becomes very slow on large datasets.
- K value selection: often need to estimate ranges or combine with cross-validation techniques to obtain the optimal k value. A frequent technique is to plot an elbow graph to find the optimal k.
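A minimal KNN sketch that picks k by cross-validating over a range of values; the synthetic dataset and the candidate range of k (1 to 20) are illustrative assumptions.

```python
# Minimal sketch: k-nearest neighbors classification, choosing k with cross-validation.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=1)

# Evaluate a range of k values; keep the one with the best cross-validated accuracy.
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in range(1, 21)}
best_k = max(scores, key=scores.get)
print("Best k:", best_k, "CV accuracy:", round(scores[best_k], 3))

knn = KNeighborsClassifier(n_neighbors=best_k).fit(X, y)
print("Prediction for one new point:", knn.predict(X[:1]))
```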
4. k-Means Clustering
Pros
- Very easy to interpret the results and to highlight conclusions in a visual manner.
- Very flexible and fast, also scalable for large datasets.
- Always yields a result.
Cons
- Struggles with a high number of dimensions; PCA or spectral clustering can help fix the issue.
- The K value must be chosen manually or based on domain knowledge of the problem; the elbow method is commonly used to assess the best K value.
- Sensitive to outliers.
- Sensitive to initialization: if the initial centroids are chosen poorly, the algorithm can converge to poor final clusters.
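A minimal k-means sketch including an elbow plot; the synthetic blob data, the range of k values tried, and the final choice of k=4 are illustrative assumptions.

```python
# Minimal sketch: k-means clustering with an elbow plot to assess k.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data generated with 4 clusters (illustrative only).
X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

# Elbow method: plot within-cluster sum of squares (inertia) against k.
ks = range(1, 10)
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

plt.plot(ks, inertias, marker="o")
plt.xlabel("k")
plt.ylabel("Inertia")
plt.title("Elbow method for choosing k")
plt.show()

# Fit the final model with the k suggested by the elbow (4 for this synthetic data).
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
```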
5. Decision Trees/Random Forest
Pros
- Do not need to scale and normalize data.
- Handles missing values very well.
- Less effort required for preprocessing.
Cons
- Individual decision trees are very prone to overfitting (Random Forests reduce this by averaging many trees).
- Sensitive to outliers and changes in the data.
- Can take a long time to train and is computationally expensive.
- Comparatively weak for regression tasks.
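A minimal sketch comparing a single decision tree to a random forest on the same data; the synthetic dataset and the number of trees are illustrative assumptions.

```python
# Minimal sketch: decision tree vs. random forest classifiers on the same data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# A single unpruned tree typically overfits more than the averaged forest.
print("Tree   train/test accuracy:", tree.score(X_train, y_train), tree.score(X_test, y_test))
print("Forest train/test accuracy:", forest.score(X_train, y_train), forest.score(X_test, y_test))
```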
6. Support Vector Machine (SVM)
Pros
- Very effective with high-dimensional data.
- Works extremely well when there is a clear margin of separation.
Cons
- Selecting an appropriate kernel can be computationally expensive, and you need to know the dataset very well to be able to pick the right kernel.
- Training can take a large amount of time on a large dataset.
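A minimal SVM sketch; the RBF kernel, the C value, and the synthetic dataset are illustrative assumptions, and features are scaled beforehand since SVMs are sensitive to feature scales.

```python
# Minimal sketch: support vector machine with an RBF kernel on synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Scale features before the SVM; distance-based margins are sensitive to feature scales.
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
svm.fit(X_train, y_train)
print("Test accuracy:", svm.score(X_test, y_test))
```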
7. Naive Bayes
Pros
- Speed: the assumption of feature independence allows the algorithm to be very fast. If this assumption holds true, it performs exceptionally well.
- Performs well with multi-class prediction.
Cons
- Assumes all features are independent, which is rarely true in real life.
- Zero frequency: if a categorical variable has a category in the test data set that was not observed in the training data set, the model assigns a zero probability to this category and fails to make a prediction. Smoothing (e.g., Laplace smoothing) is used to deal with this issue.
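A minimal Gaussian Naive Bayes sketch on a multi-class problem; using the Iris dataset here is an illustrative assumption.

```python
# Minimal sketch: Gaussian Naive Bayes for a multi-class problem.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)  # three classes
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

nb = GaussianNB()
nb.fit(X_train, y_train)
print("Test accuracy:", nb.score(X_test, y_test))

# For categorical features, CategoricalNB/MultinomialNB expose an alpha parameter
# (Laplace smoothing), which addresses the zero-frequency problem described above.
```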