This post draws on Microsoft’s blog and offers a practical approach to the dilemma of algorithm selection.

### Considerations when choosing an algorithm

#### Accuracy

Getting the most accurate answer possible isn’t always necessary. Sometimes an approximation is adequate, depending on what you want to use it for. If that’s the case, you may be able to cut your processing time dramatically by sticking with more approximate methods. Another advantage of more approximate methods is that they naturally tend to avoid overfitting.
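As a minimal sketch of this trade-off, assuming scikit-learn is available, the snippet below compares an unconstrained decision tree against a deliberately "approximate" depth-limited one on noisy synthetic data; the dataset and depth are illustrative choices, not recommendations.

```python
# Sketch: an unconstrained model vs. a deliberately approximate one.
# Synthetic, noisy data; depth limit chosen purely for illustration.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, flip_y=0.2,
                           random_state=0)  # noisy labels invite overfitting
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)  # memorizes noise
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

print("deep    train/test:", deep.score(X_tr, y_tr), deep.score(X_te, y_te))
print("shallow train/test:", shallow.score(X_tr, y_tr), shallow.score(X_te, y_te))
```

The unconstrained tree scores perfectly on its own training data but generalizes no better than the shallow one, which also trains faster.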

#### Training time

The number of minutes or hours necessary to train a model varies a great deal between algorithms. Training time is often closely tied to accuracy; the two typically rise together. In addition, some algorithms are more sensitive to the number of data points than others. When time is limited, it can drive the choice of algorithm, especially when the data set is large.
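To make the sensitivity to data size concrete, here is a rough sketch (assuming scikit-learn; the algorithms and sample sizes are illustrative) timing a fast counting-based learner against a kernel SVM as the data set grows:

```python
# Sketch: wall-clock training time as the data set grows.
# GaussianNB scales roughly linearly; kernel SVMs much worse.
import time
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

times = {}
for n in (1_000, 5_000):
    X, y = make_classification(n_samples=n, n_features=20, random_state=0)
    for model in (GaussianNB(), SVC(kernel="rbf")):
        name = type(model).__name__
        t0 = time.perf_counter()
        model.fit(X, y)
        times[(n, name)] = time.perf_counter() - t0
        print(f"n={n:>5}  {name:10s} {times[(n, name)]:.3f}s")
```

On the larger set, the gap between the two training times widens sharply, which is exactly the effect that can drive algorithm choice under a deadline.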

#### Linearity

Lots of machine learning algorithms make use of linearity. Linear classification algorithms assume that classes can be separated by a straight line (or its higher-dimensional analog). These include logistic regression and support vector machines (as implemented in Azure Machine Learning). Linear regression algorithms assume that data trends follow a straight line. These assumptions aren’t bad for some problems, but on others they bring accuracy down.

Despite their dangers, linear algorithms are very popular as a first line of attack. They tend to be algorithmically simple and fast to train.
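The danger is easy to demonstrate. In this sketch (scikit-learn assumed; `make_circles` is an illustrative stand-in for real data), a linear classifier faces a problem where no straight line can separate the classes, while a kernel method bends the boundary:

```python
# Sketch: a linearly-inseparable problem, where the linearity assumption hurts.
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# one class forms a ring around the other -- no straight line separates them
X, y = make_circles(n_samples=500, noise=0.05, factor=0.5, random_state=0)

linear = LogisticRegression().fit(X, y)  # assumes a straight-line boundary
kernel = SVC(kernel="rbf").fit(X, y)     # can learn a curved boundary

print("linear accuracy:", linear.score(X, y))
print("kernel accuracy:", kernel.score(X, y))
```

The linear model lands near chance while the kernel SVM separates the rings almost perfectly.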

#### Number of parameters

Parameters are the knobs a data scientist gets to turn when setting up an algorithm. They are numbers that affect the algorithm’s behavior, such as error tolerance or number of iterations, or options between variants of how the algorithm behaves. The training time and accuracy of the algorithm can sometimes be quite sensitive to getting just the right settings. Typically, algorithms with large numbers of parameters require the most trial and error to find a good combination.

Having many parameters typically indicates that an algorithm has greater flexibility, and it can often achieve very good accuracy, provided you can find the right combination of parameter settings.
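One common way to automate that trial and error is a grid search over candidate settings. The sketch below assumes scikit-learn, and the grid values are illustrative only; note that the number of fits grows multiplicatively with each parameter you add:

```python
# Sketch: parameter search as systematic trial and error.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=10, random_state=0)

grid = GridSearchCV(
    SVC(),
    param_grid={"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.1]},
    cv=3,          # 3-fold cross-validation for each combination
)
grid.fit(X, y)     # tries all 9 combinations; cost grows with the grid
print("best settings:", grid.best_params_)
print("best CV accuracy:", round(grid.best_score_, 3))
```

A 3×3 grid already means nine models fit three times each; algorithms with many sensitive parameters make this search space, and the training bill, much larger.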

#### Number of features

For certain types of data, the number of features can be very large compared to the number of data points. This is often the case with genetic or textual data. A large number of features can bog down some learning algorithms, making training time unfeasibly long. Support Vector Machines, by contrast, are particularly well suited to this case (see below).
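A quick sketch of that regime, assuming scikit-learn, with sizes chosen purely for illustration: far more features than examples, which is where a linear SVM's regularization pays off.

```python
# Sketch: many more features than examples, as in text or genetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# 200 examples, 10,000 features -- think word counts or gene expression levels
X, y = make_classification(n_samples=200, n_features=10_000,
                           n_informative=20, random_state=0)

scores = cross_val_score(LinearSVC(), X, y, cv=5)
print("mean CV accuracy:", scores.mean().round(3))
```

Despite having fifty times more features than examples, the linear SVM still trains in moments and generalizes respectably.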

### Advantages of particular algorithms*

Advantages of Naive Bayes: Super simple, you’re just doing a bunch of counts. If the NB conditional independence assumption actually holds, a Naive Bayes classifier will converge more quickly than discriminative models like logistic regression, so you need less training data. And even if the NB assumption doesn’t hold, an NB classifier still often does a great job in practice. A good bet if you want something fast and easy that performs pretty well. Its main disadvantage is that it can’t learn interactions between features (e.g., it can’t learn that although you love movies with Brad Pitt and movies with Tom Cruise, you hate movies where they’re together).
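The Pitt/Cruise limitation is just the XOR problem, and it can be sketched directly (scikit-learn assumed; the data is synthetic and purely illustrative):

```python
# Sketch: an XOR-style feature interaction that Naive Bayes cannot represent.
import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.tree import DecisionTreeClassifier

# features: [has_pitt, has_cruise]; label 1 = "you like the movie"
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]] * 50)
y = X[:, 0] ^ X[:, 1]      # like each star alone, hate them together

nb = BernoulliNB().fit(X, y)
tree = DecisionTreeClassifier().fit(X, y)

print("Naive Bayes accuracy:", nb.score(X, y))     # stuck at chance
print("Decision tree accuracy:", tree.score(X, y))  # learns the interaction
```

Each feature alone carries no information about the label, so NB's per-feature counts are useless here, while a model that can split on feature combinations solves it exactly.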

Advantages of Logistic Regression: Lots of ways to regularize your model, and you don’t have to worry as much about your features being correlated, like you do in Naive Bayes. You also have a nice probabilistic interpretation, unlike decision trees or SVMs, and you can easily update your model to take in new data (using an online gradient descent method), again unlike decision trees or SVMs. Use it if you want a probabilistic framework (e.g., to easily adjust classification thresholds, to say when you’re unsure, or to get confidence intervals) or if you expect to receive more training data in the future that you want to be able to quickly incorporate into your model.

Advantages of Decision Trees: Easy to interpret and explain (for some people – I’m not sure I fall into this camp). They easily handle feature interactions and they’re non-parametric, so you don’t have to worry about outliers or whether the data is linearly separable (e.g., decision trees easily take care of cases where you have class A at the low end of some feature x, class B in the mid-range of feature x, and A again at the high end). One disadvantage is that they don’t support online learning, so you have to rebuild your tree when new examples come in. Another disadvantage is that they easily overfit, but that’s where ensemble methods like random forests (or boosted trees) come in. Plus, random forests are often the winner for lots of problems in classification (usually slightly ahead of SVMs, I believe), they’re fast and scalable, and you don’t have to worry about tuning a bunch of parameters like you do with SVMs, so they seem to be quite popular these days.
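The "class A low, class B mid, class A high" case mentioned above can be sketched directly (scikit-learn assumed; the one-dimensional data is synthetic and illustrative):

```python
# Sketch: a non-monotonic, single-feature problem that defeats one threshold.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
x = rng.uniform(0, 3, size=(1500, 1))
y = ((x[:, 0] > 1) & (x[:, 0] < 2)).astype(int)  # class B only in the mid-range

linear = LogisticRegression().fit(x, y)            # one threshold can't do it
forest = RandomForestClassifier(random_state=0).fit(x, y)

print("logistic regression:", linear.score(x, y))
print("random forest:", forest.score(x, y))
```

The linear model can only draw one cut point along x, so it defaults to the majority class; the forest splits the range twice and nails it.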

Advantages of SVMs: High accuracy, nice theoretical guarantees regarding overfitting, and with an appropriate kernel they can work well even if your data isn’t linearly separable in the base feature space. Especially popular in text classification problems where very high-dimensional spaces are the norm. Memory-intensive, hard to interpret, and kind of annoying to run and tune, though, so I think random forests are starting to steal the crown.
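A minimal sketch of the text-classification use case, with a tiny made-up corpus (scikit-learn assumed; the documents and labels are invented for illustration). TF-IDF turns each document into a very high-dimensional sparse vector, the regime SVMs handle well:

```python
# Sketch: linear SVM on text, where TF-IDF features are very high-dimensional.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

docs = ["cheap pills buy now", "meeting moved to friday",
        "win money fast", "quarterly report attached",
        "limited offer buy today", "lunch on thursday?"]
labels = [1, 0, 1, 0, 1, 0]  # 1 = spam, 0 = not spam

clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(docs, labels)

print(clf.predict(["buy cheap offer", "see you at the meeting"]))
```

Every distinct word becomes a feature, so real corpora easily reach tens of thousands of dimensions, yet the linear SVM's training cost stays manageable.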

### Special cases

Some learning algorithms make particular assumptions about the structure of the data or the desired results. If you can find one that fits your needs, it can give you more useful results, more accurate predictions, or faster training times.

| Algorithm | Accuracy | Training time | Linearity | Parameters | Notes |
|---|---|---|---|---|---|
| **Two-class classification** | | | | | |
| Logistic regression | | ● | ● | 5 | |
| Decision forest | ● | ○ | | 6 | |
| Decision jungle | ● | ○ | | 6 | Low memory footprint |
| Boosted decision tree | ● | ○ | | 6 | Large memory footprint |
| Neural network | ● | | | 9 | Additional customization is possible |
| Averaged perceptron | ○ | ○ | ● | 4 | |
| Support vector machine | | ○ | ● | 5 | Good for large feature sets |
| Locally deep support vector machine | ○ | | | 8 | Good for large feature sets |
| Bayes’ point machine | | ○ | ● | 3 | |
| **Multi-class classification** | | | | | |
| Logistic regression | | ● | ● | 5 | |
| Decision forest | ● | ○ | | 6 | |
| Decision jungle | ● | ○ | | 6 | Low memory footprint |
| Neural network | ● | | | 9 | Additional customization is possible |
| One-v-all | – | – | – | – | See properties of the two-class method selected |
| **Regression** | | | | | |
| Linear | | ● | ● | 4 | |
| Bayesian linear | | ○ | ● | 2 | |
| Decision forest | ● | ○ | | 6 | |
| Boosted decision tree | ● | ○ | | 5 | Large memory footprint |
| Fast forest quantile | ● | ○ | | 9 | Distributions rather than point predictions |
| Neural network | ● | | | 9 | Additional customization is possible |
| Poisson | | | | 5 | Technically log-linear. For predicting counts |
| Ordinal | | | | 0 | For predicting rank-ordering |
| **Anomaly detection** | | | | | |
| Support vector machine | ○ | ○ | | 2 | Especially good for large feature sets |
| PCA-based anomaly detection | | ○ | ● | 3 | |
| K-means | | ○ | ● | 4 | A clustering algorithm |

**Algorithm properties:**

**●** – marks excellent accuracy, fast training times, or the use of linearity, depending on the column

**○** – marks good accuracy or moderate training times

\*Source: http://blog.echen.me/
