Machine Learning Done Wrong (ml.posthaven.com)
94 points by tchalla on Feb 28, 2015 | 29 comments


"Statistical modeling is a lot like engineering."

I can certainly see why this is a good comparison, because both engineering methods and statistical methods rely on sets of given assumptions, but it's also really important not to take the analogy too far. Engineering is ultimately done in a mechanistic world with primarily deterministic outcomes, whereas statistical modeling is conducted in a stochastic world with probabilistic outcomes, so it wouldn't be good to think of machine learning as predominantly mechanistic in nature (in spite of its name). Of course, a lot of the seven points that follow in the post actually emphasize the importance of stochastic factors (e.g. outliers, variance issues, collinearity, etc.), so the author is clearly not making this mistake, but it might be good to clarify for anyone else who is reading.

"6. Use linear model without considering multi-collinear predictors"

This is a great point, and just to expand on it a bit: you can also have situations with simultaneity, i.e. two or more of your features or predictors are functions of each other and/or functions of some third variable. This is more difficult to detect, but it can cause serious problems when interpreting the regression coefficients, as it's ultimately a type of endogeneity, which means that common approaches like OLS will not be consistent.
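
A minimal sketch of why this bites (my own illustration, not from the article; assumes only numpy): two near-collinear predictors make the OLS coefficients swing wildly across resamples even though the combined fit stays stable.

    import numpy as np

    rng = np.random.default_rng(0)
    for trial in range(3):
        x1 = rng.normal(size=200)
        x2 = x1 + rng.normal(scale=0.01, size=200)  # x2 is almost a copy of x1
        y = 3 * x1 + rng.normal(size=200)           # only x1 truly matters
        X = np.column_stack([np.ones(200), x1, x2])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        print(trial, np.round(beta, 2))  # b1 and b2 bounce around; b1 + b2 stays near 3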


great comment and +1


I strongly disagree with not using linear models, at least to build some theory and intuition before continuing with more sophisticated algorithms. What I find to be more egregious in practice is that everyone too often flocks to the state of the art with little understanding of why. There's no reason, for example, to spend weeks (or months) tuning an incredibly deep neural network if the current predictive ability is enough and there are higher-priority matters to work on.

Moreover, there's just too much emphasis on prediction. Design and analysis of experiments, handling missing data and the context of the data sets, and quantifying one's uncertainty about parameters in a principled manner for robust estimators are very underappreciated skills in the community. Using p-values arbitrarily and "95% confidence intervals" based on an unchecked normal approximation is far more harmful than not doing anything at all. There's just so much more to machine learning than supervised learning.


In natural language processing, we can get close to state of the art performance on nearly every major task with a linear model; usually, the feature sets contain what are essentially conjunctions of features, but these are chosen by hand, by domain experts, rather than produced with, say, a polynomial kernel.


To add on top of that: even with great data analysis skills, it takes product, data, and engineering skills together to make a good data science team. I have another blog post talking about this: http://ml.posthaven.com/why-building-a-data-science-team-is-...


"2. Use plain linear models for non-linear interaction" It should be noted that Linear models are only linear in the model parameters, while the features can be transformed using non-linear functions. This trick makes linear models very powerful. Also if you have big data (in millions/billions) then you are better off with linear models, as SVM is very difficult to scale.

In my experience (all in big data), I have rarely seen people use SVMs; the usual choices are logistic regression and tree-based models. In some finance and insurance industries you are restricted to using only interpretable models, which linear models are.
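
A minimal sketch of the "non-linear features, linear model" trick mentioned above (illustrative only, assuming scikit-learn): on data that no straight line can separate, adding squared terms lets plain logistic regression do the job while the model stays linear in its parameters.

    from sklearn.datasets import make_circles
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    X, y = make_circles(n_samples=500, noise=0.1, factor=0.5, random_state=0)

    plain = LogisticRegression().fit(X, y)
    expanded = make_pipeline(PolynomialFeatures(degree=2), LogisticRegression()).fit(X, y)

    print("plain linear features:", plain.score(X, y))     # roughly chance on concentric circles
    print("with squared terms:   ", expanded.score(X, y))  # near-perfect, still linear in the parameters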


As you pointed out, transforming features is powerful, and I believe that's exactly what makes SVMs powerful. Though the ways features can be combined in an SVM are limited, that limitation is what makes SVM training fast in the dual space.

On the other hand, if you want to compare logistic regression with SVM, the details are pretty tricky. One simplified view is to compare a linear SVM, which is essentially hinge loss with L2 regularization, against L2-regularized logistic regression, which is essentially the negative of the binomial log-likelihood (log loss) with L2 regularization. If you plot the two loss functions, it's easy to see how differently they penalize negative and positive margins.
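
If you want to see it, here's a quick plot of the two losses as a function of the margin m = y * f(x) (my own sketch, assuming numpy and matplotlib):

    import numpy as np
    import matplotlib.pyplot as plt

    m = np.linspace(-3, 3, 300)
    hinge = np.maximum(0, 1 - m)     # linear SVM: zero loss beyond margin 1
    logistic = np.log1p(np.exp(-m))  # logistic regression: never exactly zero

    plt.plot(m, hinge, label="hinge (linear SVM)")
    plt.plot(m, logistic, label="logistic (log loss)")
    plt.xlabel("margin y * f(x)")
    plt.ylabel("loss")
    plt.legend()
    plt.show()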


I think the points are good, but I am not very happy about this statement

"When dealing with small amounts of data, it’s reasonable to try as many algorithms as possible and to pick the best one since the cost of experimentation is low. But as we hit “big data”, it pays off to analyze the data upfront and then design the modeling pipeline (pre-processing, modeling, optimization algorithm, evaluation, productionization) accordingly."

If done correctly, then I agree. But we have to be careful about overfitting when we try out several models or make an initial analysis to determine which model to use. In this sense, choosing a model is no different from fitting the parameters of the model.


If you are disciplined and separate your data into training and testing sets, you can try as many models as you want without fear of overfitting. Indeed, optimizing over the parameters of a model on the training set is essential (pruning parameters in a tree, regularization weights, etc.) and can be thought of as training a large number of models.

If you aren't doing this correctly, then you can't really interpret the performance of even a single model. I've seen people screw this up in so many ways; my favorite recent one, which was quite high on HN, was someone using the full dataset for variable selection before doing the training-testing split.
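
A rough sketch of that discipline (not anyone's production code; assumes scikit-learn and uses a built-in dataset as a stand-in): all tuning happens via cross-validation inside the training set, and the test set is touched exactly once at the end.

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    # Model selection (here, tree depth) uses only the training set.
    search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                          {"max_depth": [2, 4, 8, None]}, cv=5)
    search.fit(X_train, y_train)

    # Final, one-time estimate on held-out data.
    print(search.best_params_, search.score(X_test, y_test))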


If you use performance on the test set for model selection, this is not true. It follows from simple probabilistic reasoning: the more models you try, the higher the chance that one will score well on both the training set and the test set by "luck", and this is especially true with small datasets. In fact, it is best practice to use a separate validation set for model selection and to use the test set only for the final performance evaluation; see e.g. the answer to this question:

http://stats.stackexchange.com/questions/9357/why-only-three...
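
A sketch of the three-way split described there (illustrative only, assuming scikit-learn): candidate models are compared on the validation set, and the test set is reserved for the final number.

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = load_breast_cancer(return_X_y=True)
    X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=0)

    candidates = [LogisticRegression(max_iter=5000), RandomForestClassifier(random_state=0)]
    fitted = [m.fit(X_train, y_train) for m in candidates]
    best = max(fitted, key=lambda m: m.score(X_val, y_val))  # selection uses the validation set only

    print(type(best).__name__, best.score(X_test, y_test))   # test set used once, at the very end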


I personally love the topic of Bayesian optimization over all the possible parameters, including model choice. My point was more that, since resources are always constrained, it typically pays off in the long term for practitioners to analyze the data and understand the underlying mechanics before jumping into modeling.


I thought exactly the same thing. Statistics is about uncertainty, and it's very easy to be misled when you don't correct for trying lots of hypotheses.


Returning to fraud detection, high-order interaction features like "billing address = shipping address and transaction amount < $50" are required for good model performance.

I agree that non-linear models are often able to beat linear ones, but if you have limited amounts of data, feature engineering will always beat clever algorithms.
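
For what it's worth, that kind of conjunction is just a hand-crafted column (column names here are hypothetical; assumes pandas):

    import pandas as pd

    txns = pd.DataFrame({
        "billing_zip":  ["94103", "10001", "60601"],
        "shipping_zip": ["94103", "90210", "60601"],
        "amount":       [32.00, 45.00, 250.00],
    })

    # 1 when the addresses match and the amount is small, else 0; a linear model
    # then needs only a single coefficient to exploit the conjunction.
    txns["addr_match_and_small_amount"] = (
        (txns["billing_zip"] == txns["shipping_zip"]) & (txns["amount"] < 50)
    ).astype(int)

    print(txns)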


Yes, and IMO, most of the time the insight behind the data is far more important than the modeling algorithm for achieving high performance, with few exceptions (say computer vision, NLP, etc., which really require a lot of data). Even with a large data set, take PageRank as an example: the fundamental insight was that the popularity of a site would be a great signal for ranking search results, and that a random walk would be a great way to approximate that popularity. As a result, Google had great success in search ranking.


Good post! A bunch of mistakes is a bunch of opportunities to improve. It kind of complements Domingos' paper:

http://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf


Also, thank you all for reading the post. I'm the author and I'll be happy to clarify any of the points in the blog~


Good list. I am new to Machine Learning with only ~1 year of real work and sometimes I slip and make one of these mistakes.

I have a question on #7. I have not used the coefficients to mean feature importance, but I sometimes get tempted to use them. How do you explain to non-stats people which factors are most important behind some outcome?


Point #7 is just referring to the magnitudes (absolute values) of the coefficients. You can still determine which features are relatively important using the coefficient p-values, if those are available. This of course depends on the assumptions of the regression method you are using being satisfied; otherwise the p-values will be biased.

In terms of explaining this to non-stats people, you might want to avoid explaining the p-values directly to them (as it's very easy for people to get confused about what p-values actually mean), so instead you might simply show them which features are "statistically significant". In other words, try to explain the results in a qualitative way rather than a strictly quantitative one.
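
For the mechanics, here's a minimal sketch of where those p-values come from (assumes statsmodels and simulated data; the numbers are only meaningful if the model's assumptions roughly hold):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    y = 2.0 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(size=200)  # third feature is pure noise

    result = sm.OLS(y, sm.add_constant(X)).fit()
    print(result.summary())  # coefficients, standard errors, and a p-value per feature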


I have no professional experience with ML, so I might be missing something obvious here that's part of the industry paradigm.

But the article gives two points why you shouldn't use coefficient values to determine feature importance, which I think are only valid to some extent.

>a) changing the scale of the variable changes the absolute value of the coefficient

and

>(b) if features are multi-collinear, coefficients can shift from one feature to others.

Regarding a), well, that's what standardized coefficients are for.

b) is a bit trickier, but most regression models are based on the assumption of non-collinearity. This is of course a problem with real-world data, because you will quite often find some level of collinearity. That's when you (1) test for this issue and (2) look towards multilevel models.
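
For step (1), variance inflation factors are a common check (a sketch on simulated data, assuming statsmodels; a frequent rule of thumb flags VIFs above roughly 5-10):

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    rng = np.random.default_rng(0)
    x1 = rng.normal(size=300)
    x2 = 0.95 * x1 + 0.05 * rng.normal(size=300)  # nearly collinear with x1
    x3 = rng.normal(size=300)
    X = sm.add_constant(np.column_stack([x1, x2, x3]))

    for i, name in enumerate(["const", "x1", "x2", "x3"]):
        print(name, round(variance_inflation_factor(X, i), 1))  # x1 and x2 come out huge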


All the other comments are great. Just bear in mind that it's important to really understand the mechanics behind each importance measurement. Some use information gain, some use a t-test on the coefficient, while some use a random forest and see whether removing a feature makes a big impact, etc. They all make different assumptions, and the key point is, again, to understand whether those assumptions apply to your situation.


Some techniques, e.g. random forest, give variable importance indicators for free. If you can test it out, give it a go; you don't have to use the random forest as the final model.
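
Something like this, for example (assumes scikit-learn; the dataset is just a stand-in):

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier

    data = load_breast_cancer()
    forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(data.data, data.target)

    # Impurity-based importances come with the fitted forest at no extra cost.
    ranked = sorted(zip(forest.feature_importances_, data.feature_names), reverse=True)
    for importance, name in ranked[:5]:
        print(f"{name}: {importance:.3f}")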


You can use correlation or the coefficients of a linear model iff the features are on the same scale. Another method is to train a model leaving out each feature once; then you see how much accuracy drops.
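
A rough leave-one-feature-out loop along those lines (my sketch, assuming scikit-learn; cross-validation keeps the comparison honest):

    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = load_breast_cancer(return_X_y=True)
    model = LogisticRegression(max_iter=5000)
    baseline = cross_val_score(model, X, y, cv=5).mean()

    for j in range(X.shape[1]):
        X_drop = np.delete(X, j, axis=1)           # retrain without feature j
        dropped = cross_val_score(model, X_drop, y, cv=5).mean()
        print(f"feature {j}: accuracy change {dropped - baseline:+.4f}")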


I'm glad #1 is #1. If I had to partition these into two sets, they would be "make sure it works (2-7)" and "make sure your definition of 'works' works (1)."


You suggest up/down sampling rare cases. Can you please elaborate on the standard approaches for this kind of problem, for both linear and nonlinear classifiers? Thank you.


Great question, and my main point is less about up-sampling the rare cases and more that the default loss function used in model training might not directly align with the final business metric (which is the metric practitioners should care more about). As a result, it's important to align the two. For some algorithms it's easy to incorporate a different loss function, while for others it might not be the case. Over- or under-sampling is one fairly generally applicable way to tweak the loss function.

While I'm not an expert on the theory behind sampling, if you do find the need to tweak sampling to align the default loss function with the business metric, I would say do a grid search first and validate the result against business insight, e.g. if you find that getting the rare cases right is much more important than getting the common cases right, does that align with the business insight?
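
As a concrete, hedged example of one such knob (assumes scikit-learn; the 1:99 imbalance and the choice of recall as the "business metric" are made up for illustration):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import recall_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=5000, weights=[0.99, 0.01], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    default = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

    # Recall on the rare class stands in for the business metric; tune the weights
    # (or the resampling) against whatever metric actually matters.
    print("default :", recall_score(y_te, default.predict(X_te)))
    print("weighted:", recall_score(y_te, weighted.predict(X_te)))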


Is it somehow inspired by Linear Algebra Done Wrong by Treil, which itself was probably inspired by Linear Algebra Done Right by Axler? :)


Not really, it was actually more inspired by Statistics Done Wrong.


Oh, Ok. Never knew there was a book by that title.


It was written by someone in my PhD cohort—it's available for preorder now. http://www.statisticsdonewrong.com



