Contrast Analysis in High Dimensional Databases - Detection and Description of Subspace Outliers
Speaker: Dr. Emmanuel Müller
University: Karlsruhe Institute of Technology
Outlier mining is a major task in data analysis. Outliers are objects that deviate strongly from the regular objects in their local neighborhood. To detect them, local outlier ranking methods score each object by its degree of deviation. Unfortunately, in many applications these methods fail on high-dimensional databases, where the contrast between outliers and regular objects is low, so local outlier rankings degenerate to random listings. Outliers do not show up in the full space; they are hidden in multiple high-contrast subspace projections. For example, in health surveillance, attributes such as “age” and “skin humidity” might be important for detecting the abnormal “dehydration” status of one patient. Other attributes such as “heart beat rate” are irrelevant for detecting this outlier but relevant for detecting abnormal patients with heart disease. Selecting these subspaces is a major open research challenge.
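The loss of contrast in the full space can be illustrated with a small experiment. The following is only a sketch under simple assumptions (uniformly random data, Euclidean distance, and a hypothetical helper named `relative_contrast`); it is not taken from the talk. It measures how much the farthest point differs from the nearest one, relative to the nearest distance.

```python
import math
import random


def relative_contrast(n_points, n_dims, rng):
    """Return (max - min) / min over distances from a random query
    to n_points uniformly random points in the unit cube."""
    query = [rng.random() for _ in range(n_dims)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)


rng = random.Random(0)  # fixed seed so the run is repeatable
contrast_low_dim = relative_contrast(500, 2, rng)
contrast_high_dim = relative_contrast(500, 200, rng)
print("2 dimensions:  ", contrast_low_dim)
print("200 dimensions:", contrast_high_dim)
```

In such a setting the value for 200 dimensions is typically far smaller than for 2 dimensions, which is one way to see why distance-based scores computed in the full space carry little information.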
We propose a novel subspace search method that selects high-contrast subspaces for local outlier ranking. The selection is based on statistical tests: it searches for subspaces with a significant amount of conditional dependency among the selected dimensions. With these tests, we exclude scattered subspaces and select only a few projections for outlier ranking. We thus enhance the quality of traditional outlier rankings by computing scores in these lower-dimensional projections. In addition to the ranking, we provide descriptive information about why an object seems outlying. In our example, reporting the high deviation in “age” and “skin humidity”, together with normal measurements in all other attributes, helps health professionals verify the automatically detected outlier.
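One way to make "conditional dependency" measurable is sketched below. This is an illustrative assumption, not necessarily the test used in the talk: the contrast of a subspace is estimated by repeatedly conditioning on a random slice of all but one attribute and comparing the remaining attribute's conditional sample with its marginal sample, here with a two-sample Kolmogorov-Smirnov statistic. The function names, the slice fraction, the iteration count, and the synthetic data are all hypothetical choices.

```python
import itertools
import random


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between
    the empirical CDFs of the two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        gap = max(gap, abs(i / len(a) - j / len(b)))
    return gap


def subspace_contrast(data, subspace, rng, iterations=50, slice_fraction=0.3):
    """Average deviation between the marginal and the conditional
    distribution of a randomly chosen attribute of the subspace."""
    n = len(data)
    # Width of each conditioning slice (assumed heuristic): the intersection
    # keeps about slice_fraction of the objects if attributes are independent.
    width = max(1, int(n * slice_fraction ** (1.0 / (len(subspace) - 1))))
    # Row indices sorted by each attribute, computed once.
    orders = {a: sorted(range(n), key=lambda r, a=a: data[r][a]) for a in subspace}
    total = 0.0
    for _ in range(iterations):
        target = rng.choice(subspace)
        keep = set(range(n))
        for attr in subspace:
            if attr != target:
                start = rng.randrange(n - width + 1)
                keep &= set(orders[attr][start:start + width])
        if keep:  # an empty slice contributes zero deviation
            marginal = [row[target] for row in data]
            conditional = [data[r][target] for r in keep]
            total += ks_statistic(marginal, conditional)
    return total / iterations


# Synthetic data: attributes 0 and 1 are dependent, attribute 2 is independent.
rng = random.Random(42)
data = []
for _ in range(300):
    x = rng.random()
    data.append((x, x + rng.gauss(0, 0.05), rng.random()))

contrasts = {
    pair: subspace_contrast(data, list(pair), rng)
    for pair in itertools.combinations(range(3), 2)
}
best = max(contrasts, key=contrasts.get)
print("contrast per subspace:", contrasts)
print("highest contrast:", best)
```

Under these assumptions the dependent pair of attributes receives a clearly higher contrast than the pairs involving the independent attribute. A local outlier score would then be computed only in the few subspaces that rank highest, and the attributes of the subspace in which an object scores as outlying can serve as the description of why it was flagged.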