Anomaly detection is not a novel problem. Indeed, it is a very old problem and, arguably, one of the most interesting in statistics and machine learning. So it comes as a surprise that most lessons learned over the years seem to have been lost and that some companies nowadays treat it like a stepchild. Even worse, many companies seem to be unaware of the problem altogether, and I wonder how many opportunities have been lost due to this ignorance.
For this article, I identified 5 big misconceptions about anomaly detection that pop up frequently and tried to debunk them. But before we jump right into the list, let us first revisit what an anomaly actually is. As Hawkins put it in his book Identification of Outliers (1980):
“[…] an outlier [is] an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism”. (Hawkins, Identification of Outliers, 1980)
So, we are looking for uncommon occurrences, strong deviations from the norm, surprises, or, in other words, ultimately, real insights.
Without further ado, here is the list of the 5 most common misconceptions about anomaly detection:
It is a common misconception that all anomaly detection methods are unsupervised. There is a kernel of truth in this: anomaly detection is considered an inherently unsupervised task. However, this means neither that labels should be ignored when they are available nor that you don’t need any. Keep in mind that every piece of information helps to make better decisions.
While you do not necessarily need labels to apply anomaly detection methods to your data set, you will definitely need labels in order to evaluate the effectiveness of your method. Here, labels can be either implicit or explicit: you can rely on a domain expert who judges the results (implicit), or you can have a corresponding class label for each data point (explicit). If nobody can verify your solution, then you could just as well return randomly generated class labels, which, of course, does not make any sense.
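To make the evaluation point concrete, here is a minimal sketch that compares a simple detector against random scores on a labeled toy data set. Everything in it is an assumption for illustration: the data is synthetic, the detector (distance from the median) is just a placeholder, and ROC AUC is only one of several reasonable metrics.

```python
import random

def roc_auc(scores, labels):
    """Chance that a random anomaly outscores a random normal point (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

random.seed(0)
# Hypothetical toy data: 200 normal points around 0, 5 anomalies around 6.
values = [random.gauss(0.0, 1.0) for _ in range(200)]
values += [random.gauss(6.0, 1.0) for _ in range(5)]
labels = [0] * 200 + [1] * 5

# Placeholder detector: distance from the median is the anomaly score.
median = sorted(values)[len(values) // 2]
detector_scores = [abs(v - median) for v in values]

# Baseline: random scores, i.e. "any solution is equally valid".
random_scores = [random.random() for _ in values]

auc_detector = roc_auc(detector_scores, labels)
auc_random = roc_auc(random_scores, labels)
print(f"detector AUC: {auc_detector:.3f}")
print(f"random AUC:   {auc_random:.3f}")
```

Without the `labels` list there would be no way to tell these two scorings apart, which is exactly the problem described above.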
Now that we have established that labels are necessary for anomaly detection, you should put some effort into getting as many as possible. Due to the special setting (lots of normal data, only a few anomalies), anomalies carry way more information than other data, and it should be a primary goal to find them (isn’t it ironic that you need anomalies in order to find anomalies?). A very good and common approach is to include a human expert in the loop.
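A human-in-the-loop setup can be sketched roughly as follows. This is only one possible shape of such a loop, not a reference implementation: the function `expert_says_anomaly`, the `hidden_truth` list, and the query budget are hypothetical stand-ins for a real expert and a real review process, and the scoring rule (distance from the median) is a placeholder.

```python
import random

def expert_says_anomaly(index, hidden_truth):
    """Hypothetical stand-in for a domain expert; in practice this is a person."""
    return hidden_truth[index]

random.seed(1)
values = [random.gauss(0.0, 1.0) for _ in range(300)] + [9.0, -8.5, 10.2]
hidden_truth = [False] * 300 + [True] * 3  # only the "expert" can see this

labels = {}           # index -> True (anomaly) / False (normal), filled by the expert
budget_per_round = 5  # how many points the expert is willing to review per round

for round_no in range(3):
    # Estimate "normal" from everything not yet confirmed as an anomaly.
    pool = sorted(v for i, v in enumerate(values) if not labels.get(i, False))
    center = pool[len(pool) // 2]
    # Rank the unlabeled points by how surprising they are and ask about the top few.
    unlabeled = [i for i in range(len(values)) if i not in labels]
    unlabeled.sort(key=lambda i: abs(values[i] - center), reverse=True)
    for i in unlabeled[:budget_per_round]:
        labels[i] = expert_says_anomaly(i, hidden_truth)

confirmed = sorted(i for i, is_anomaly in labels.items() if is_anomaly)
print("expert queries:", len(labels), "confirmed anomalies:", confirmed)
```

The point of the design is that the expert only looks at a handful of candidates per round, and every verdict, anomalous or not, is kept as a label for later evaluation or training.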
The quintessence here: do not neglect any information that helps unveil anomalies, e.g. labeled anomalies.
Must be a very boring data set then indeed. Well, I am exaggerating here, of course, but the essential idea of anomaly detection is actually not to filter out noise. Don’t get me wrong, this is a valid use case. However, anomalies are generally not noise (random perturbations of valid data) but rather single instances that are unlikely and surprising. Hence, they live in the tails of the underlying data distributions. If there were no need for anomaly detection, that would mean that every data point is equally likely, which is, well, boring.
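The difference between noise and a point in the tail can be illustrated with a small sketch. It assumes, purely for illustration, that the normal data is roughly Gaussian; the sensor readings and the two test values are made up.

```python
import math
import statistics

def tail_probability(x, mean, std):
    """Two-sided probability of a value at least as far from the mean as x (Gaussian assumption)."""
    z = abs(x - mean) / std
    return math.erfc(z / math.sqrt(2.0))

# Hypothetical sensor readings that define "normal" behaviour.
readings = [20.1, 19.8, 20.3, 20.0, 19.9, 20.2, 20.1, 19.7, 20.0, 19.9]
mean = statistics.mean(readings)
std = statistics.stdev(readings)

noisy_reading = 20.15       # a valid value with a small random perturbation
surprising_reading = 23.0   # far out in the tail

p_noisy = tail_probability(noisy_reading, mean, std)
p_surprising = tail_probability(surprising_reading, mean, std)
print(f"noisy reading:      p = {p_noisy:.3f}")
print(f"surprising reading: p = {p_surprising:.2e}")
```

The noisy reading is entirely plausible under the fitted distribution, while the surprising one is so unlikely that a different mechanism is the more reasonable explanation.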
Summary: Anomaly detection is not (only) noise removal.
In a way, this is related to the first point of the list. To reiterate: you’ll need labels. Period. However, what defines a label is a bit fuzzier than you’d expect it to be. A label is not only a categorical variable that comes with each data point. It also refers to insights, knowledge, or an information source such as an expert who can tell you whether a certain data point is anomalous or not. It might be very costly or time-consuming to get this information (e.g. the expert is busy and expensive). Nonetheless, you definitely have a way to tell normal data and anomalies apart, because if not, any solution, e.g. a randomly generated one, would be equally valid.
On a side note: the information inherent in labels is not equally distributed. Due to the scarcity of anomalies, they carry way more information than normal data. In one of my papers, “Toward Supervised Anomaly Detection”, all empirical evidence showed that finding and incorporating anomalies gives huge accuracy gains when compared to labeled normal data and unsupervised scenarios.
If you do not possess any stored label information, there are basically two ways to get there. The most common one is described in Point 1, where we assume that anomalies are present in the data set and employ a so-called human-in-the-loop approach to acquire the label information. Now, there might be situations where gathering anomalies is so expensive that you cannot wait until they occur, e.g. nuclear reactor failures and such. Here, you need to put some effort into studying and simulating the problem, including the anomalies. This is extremely time-consuming and costly, but in the end you will have a lot of in-depth insight into your problem. However, these cases are rare, and the first approach is usually the way to go.
Summary: You will need labels, especially labeled anomalies. Use the human-in-the-loop approach to acquire them.
Are they? Really? While robust methods are a real thing, most of the research papers containing the words “our method is robust” don’t actually mean it that way. To be clear, we are specifically talking about the impact of anomalies in the training data set here.
Unless you designed your method to be theoretically robust against certain anomalies (e.g. large deviations) and you tested it empirically, there is no reason to believe that your method is robust. Moreover, a key characteristic of anomalies is that they might stem from multiple, independent processes. Even if your method is robust against one kind of anomaly, that doesn’t make it magically robust against all kinds.
A very common setting is the following. First, we assume anomalies are large deviations. Second, we know that squaring a tiny number results in a tiny number and squaring a large number results in a humongous number. So, whenever you see a squared norm or distance measure, you should be suspicious of its “robustness” against large-deviation outliers. Many common methods, such as least squares and deep neural nets (not always, of course), use a squared L2 distance measure.
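A tiny example of the squaring effect, using made-up numbers: the mean is the value that minimizes the sum of squared deviations, while the median minimizes the sum of absolute deviations. Adding one large-deviation outlier shows how differently the two react.

```python
import statistics

# Hypothetical measurements centred around 10.
clean = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.9, 10.1]
contaminated = clean + [1000.0]  # one large-deviation outlier

# Least-squares estimate of the centre (mean) vs. least-absolute-deviation estimate (median).
mean_shift = abs(statistics.mean(contaminated) - statistics.mean(clean))
median_shift = abs(statistics.median(contaminated) - statistics.median(clean))

# How much the single outlier contributes to each loss, measured from the clean centre.
center = statistics.median(clean)
outlier_squared_loss = (contaminated[-1] - center) ** 2
outlier_absolute_loss = abs(contaminated[-1] - center)

print(f"mean moved by   {mean_shift:.2f}")
print(f"median moved by {median_shift:.2f}")
print(f"outlier loss: squared = {outlier_squared_loss:.0f}, absolute = {outlier_absolute_loss:.0f}")
```

A single point drags the squared-loss estimate far away from the bulk of the data, whereas the absolute-loss estimate barely notices it.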
Actually, a standard question for master’s students in machine learning is: what happens if you apply PCA and your data set contains an outlier? Answer: this single data point can have a huge influence and distort the whole solution.
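The PCA question can be tried out in a few lines. This is a sketch on synthetic 2-D data, using the closed-form direction of the leading eigenvector of a 2×2 covariance matrix instead of a full PCA library; the helper names and the data are assumptions made for the example.

```python
import math
import random

def first_pc_angle(points):
    """Angle in degrees, in [0, 180), of the first principal axis of 2-D points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    # Closed form for the leading eigenvector direction of a 2x2 covariance matrix.
    angle = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return math.degrees(angle) % 180.0

def tilt_from_x_axis(points):
    """How far the first principal axis is rotated away from the x-axis (0 to 90 degrees)."""
    a = first_pc_angle(points)
    return min(a, 180.0 - a)

random.seed(2)
# Hypothetical data stretched along the x-axis.
clean = [(random.gauss(0.0, 3.0), random.gauss(0.0, 0.3)) for _ in range(100)]
with_outlier = clean + [(0.0, 100.0)]  # a single point far off along the y-axis

tilt_clean = tilt_from_x_axis(clean)
tilt_outlier = tilt_from_x_axis(with_outlier)
print(f"first PC tilt, clean data:       {tilt_clean:.1f} degrees")
print(f"first PC tilt, with one outlier: {tilt_outlier:.1f} degrees")
```

One point out of 101 is enough to turn the first principal component from the direction of the data to the direction of the outlier, because the covariance matrix is built from squared deviations.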
Take-away: If a method is said to be robust against anomalies, you should always ask why and against what kind of anomaly, because many methods really are not.
The other day, I was listening to a Tim Ferriss podcast where he was interviewing Andy Rachleff about entrepreneurship. Andy is a co-founder and Executive Chairman of Wealthfront and co-founded Benchmark Capital in 1995. He invested early in well-known companies including eBay, OpenTable, Snapchat, Twitter, and Uber. He also teaches at Stanford and became famous for his product/market fit (PMF) concept.
In his view, the road to success for a startup is a 2×2 matrix: you can be either right or wrong, and you can be either agreeable or non-agreeable. What you should aim for is to be right (of course) and non-agreeable. He talks about the reasoning in detail in the podcast, but it boils down to this: if you are agreeable and right, there is already a market with fierce competition. Non-agreeable means that you have some insights that others don’t have and don’t share (yet).
Now, these non-agreeable insights would, in data science terms, refer to surprises, something that is odd but true, i.e. anomalies. Finding these should be a prime effort of companies, and especially of young startups. This will give them the edge over their competition. According to Andy, it is a waste of time to run evaluations of your product and acquire data, in a possibly time-consuming and costly manner, that just confirm your beliefs and biases. Only the surprises and new insights will enable you to make decisions and place your product in ways others cannot.
Summary: Savor the surprises. Look out for anomalies.
You cannot go wrong with applying anomaly detection to your data. The benefits can be huge, ranging from more insights into your problems and customers, to making better decisions, to having a cleaner data set for downstream analysis. The only drawback is that getting meaningful results requires some knowledge and hard work in order to find the right answer and not just any answer.