on 17 October 18
What are outliers? Why are they important, and why do we need to be sure that we are giving them due attention?
In the context of data analytics, outliers are individual data readings whose values differ greatly from most other values in a set. They are important because, depending on just how different they are, they can disproportionately bias the results of a statistical analysis of the data set as a whole.
For example, consider this set of temperature readings taken once a day at the same time every day for 10 days: 15, 15, 16, 16.5, 15, 15.5, 25, 16, 15, 16 (degrees Centigrade). A simple average of the entire set of readings is 16.5 degrees. That’s a straightforward mathematical perspective taken without questioning the validity of any of the readings. But looking at them more closely, doesn’t it seem odd that when all the other readings hover around the 15 or 16 degree mark, there’s one reading of 25 degrees? There’s no gradual increase in temperature to 25 in the preceding days, followed by a gradual decrease back to the 15 degree mark in the succeeding days. It’s a sudden spike. Do things like that usually happen? Maybe certain freak weather conditions could cause such an oddity, but how likely is that? If we exclude that particular reading from our calculation, the average changes to about 15.56 degrees (140 divided by 9), which is almost a whole degree lower.
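The arithmetic above can be checked with a few lines of Python. This is only a minimal sketch: the variable names are illustrative, and dropping the 25 degree reading by its value is an assumption made for this example, not a general rule for removing outliers.

```python
from statistics import mean

# The ten daily temperature readings from the example (degrees Centigrade).
readings = [15, 15, 16, 16.5, 15, 15.5, 25, 16, 15, 16]

# Average of all ten readings, taken at face value.
mean_all = mean(readings)

# Average after excluding the suspicious 25 degree reading.
# Filtering by value is an assumption made purely for this illustration.
trimmed = [r for r in readings if r != 25]
mean_trimmed = mean(trimmed)

print(f"Mean of all readings:    {mean_all:.2f}")      # 16.50
print(f"Mean without the 25:     {mean_trimmed:.2f}")  # 15.56
print(f"Shift caused by one row: {mean_all - mean_trimmed:.2f}")  # 0.94
```

A single reading out of ten moves the average by roughly 0.94 degrees, which is the kind of disproportionate influence described above.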
A number of additional questions arise:
- Is it possible that the reading of 25 degrees is the result of a measurement error? A faulty thermometer perhaps? Or is it possible that the reading was real and correct, but that there were extraordinary factors that caused it to be so different that day?
- Should we include the reading of 25 degrees while computing the average or should we exclude it?
- What if we had taken 100 readings instead of just 10? Would we have had more readings of 25, or between 15 and 25? Is there a chance there could have been any readings above 25?
- What if there was no reading of 25, but there were two other significantly different readings of 20 each?
- How different does a value have to be before it is considered different enough to distort the result of a statistical analysis?
- If we consider the 25 degree reading to be an outlier, would other analysts do the same with this set of data?
- If we were to use such data over time in a machine learning system, would it delay the system in reaching its maximum effectiveness?
- Is it possible to have a single generic method by which we can decide whether a value should be considered to be an outlier or not?
In practice, the decision about whether or not a data value should be treated as an outlier is at least to some degree a subjective one. The decision may initially be based on a set of objective identification rules using a standard mathematical technique, but it must then be reconsidered subjectively, questioning the value (and the entire data set) within the context of its business meaning. The same subjectivity needs to be applied when considering the results of any statistical analysis run on the overall data set, with an awareness of whether or not the outliers were included in the analysis.
It is because of all these questions that it is important to be able to identify outliers and evaluate them fully before deciding how to treat them. There are a number of mathematical methods for identifying outliers, starting with John Tukey’s IQR or box plot method and the simple z-score method, and going on to others that may be more robust when the data is small, skewed or otherwise difficult to analyse. It is important to have both a statistical feel and a qualitative business feel for the data so that the most appropriate method is chosen to identify and treat the outliers.
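To make the two methods named above a little more concrete, here is a rough Python sketch of each, applied to the temperature readings from the earlier example. The function names, the fence multiplier of 1.5 and the z-score cutoff of 3 are assumptions: they are commonly used conventions, not values prescribed by this article, and the quartiles are computed with the default method of Python's `statistics.quantiles`, which is only one of several quartile definitions in use.

```python
from statistics import mean, quantiles, stdev

readings = [15, 15, 16, 16.5, 15, 15.5, 25, 16, 15, 16]


def iqr_outliers(values, k=1.5):
    """Flag values outside Tukey's fences: Q1 - k*IQR and Q3 + k*IQR.

    k=1.5 is the conventional multiplier (an assumption, not a fixed rule).
    """
    q1, _, q3 = quantiles(values, n=4)  # quartiles, default 'exclusive' method
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


def z_score_outliers(values, threshold=3.0):
    """Flag values more than `threshold` sample standard deviations from the mean.

    threshold=3.0 is a commonly quoted cutoff (an assumption, not a fixed rule).
    """
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    if sigma == 0:
        return []  # all values identical, nothing can stand out
    return [v for v in values if abs(v - mu) / sigma > threshold]


print("IQR method:      ", iqr_outliers(readings))
print("z-score (cut 3): ", z_score_outliers(readings))
print("z-score (cut 2.5):", z_score_outliers(readings, threshold=2.5))
```

On these ten readings the IQR fences flag the 25 degree reading, while the z-score method with a cutoff of 3 flags nothing: the 25 itself inflates the standard deviation, so its z-score comes out at roughly 2.8. Lowering the cutoff to 2.5 flags it. The two methods disagreeing on the same small data set is a practical illustration of why the choice of method and threshold remains partly a judgment call.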