Sampling labeled data for anomaly detection

Question

Prager

May 10, 2022, 05:03

I'm currently working on a project that requires the use of unsupervised anomaly detection, but I'm unable to find a relevant data set, so I'm considering the following option:

Assume I have a data set $X$ of $m$ examples labeled with $K$ classes. Let $X(k)$ be the subset of $X$ whose examples are all labeled $k$, and let $k_{max}$ be the largest class. Can I use $X(k_{max})$ as the training set for an anomaly detector whose task is to flag any example not labeled $k_{max}$ as an anomaly? I would then use a fraction $p$ of the remaining examples, i.e. $p [m - size(X(k_{max}))]$ of them, as the anomalous examples in the cv and test sets.
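A minimal sketch of the sampling scheme described above, using only the Python standard library. The function name `split_for_anomaly_detection`, the 60/20/20 split of the $k_{max}$ examples, and the even division of the sampled anomalies between cv and test are assumptions made for illustration; the question does not specify them.

```python
import random
from collections import Counter


def split_for_anomaly_detection(labels, p, train_frac=0.6, seed=0):
    """Return (k_max, train, cv, test) built from a list of class labels.

    train is a list of indices into `labels`, all belonging to k_max.
    cv and test are lists of (index, is_anomaly) pairs.
    train_frac and the 50/50 cv/test division are illustrative assumptions.
    """
    rng = random.Random(seed)

    # k_max is the most frequent label; X(k_max) plays the role of "normal".
    k_max = Counter(labels).most_common(1)[0][0]
    normal = [i for i, y in enumerate(labels) if y == k_max]
    other = [i for i, y in enumerate(labels) if y != k_max]

    # Training uses only k_max examples; the rest of them are held out.
    rng.shuffle(normal)
    n_train = round(train_frac * len(normal))
    train, held_out = normal[:n_train], normal[n_train:]

    # Sample a fraction p of the non-k_max examples to act as anomalies.
    anomalies = rng.sample(other, round(p * len(other)))

    half_n, half_a = len(held_out) // 2, len(anomalies) // 2
    cv = [(i, 0) for i in held_out[:half_n]] + [(i, 1) for i in anomalies[:half_a]]
    test = [(i, 0) for i in held_out[half_n:]] + [(i, 1) for i in anomalies[half_a:]]
    return k_max, train, cv, test


# Toy labels: class 0 is the largest class.
labels = [0] * 60 + [1] * 30 + [2] * 10
k_max, train, cv, test = split_for_anomaly_detection(labels, p=0.5)
```

Keeping some $k_{max}$ examples out of training matters here: without normal examples in the cv and test sets, there is no way to measure false alarms.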

Topic anomaly-detection dataset

Category Data Science

Samarth · Accepted Answer · November 5, 2019, 21:15

I guess it depends on how you define an anomaly. If you already know for sure that everything other than k_max counts as an anomaly, then what you describe makes sense, provided there is no significant overlap between k_max and the other classes in your feature space. I would train an autoencoder on the k_max examples, measure its reconstruction error, and classify anything whose error falls outside the bounds seen on the training data as anomalous.
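A rough sketch of the reconstruction-error idea, under several assumptions: scikit-learn's `MLPRegressor` trained to reproduce its own input stands in for an autoencoder, the data is synthetic, and the error bound is taken as the 99th percentile of the training errors. None of these choices come from the answer itself; a dedicated deep learning library and a threshold tuned on the cv set would be the more usual route.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for X(k_max): 5-D points lying close to a 2-D subspace.
W = rng.normal(size=(2, 5))


def sample_normal(n):
    return rng.normal(size=(n, 2)) @ W + 0.05 * rng.normal(size=(n, 5))


X_train = sample_normal(500)
X_normal_test = sample_normal(200)
# Points scattered through all 5 dimensions, standing in for the other classes.
X_anomalous = 4.0 * rng.normal(size=(200, 5))

# An MLP with a 2-unit bottleneck, fitted to map each input to itself.
autoencoder = MLPRegressor(hidden_layer_sizes=(8, 2, 8), activation="tanh",
                           max_iter=2000, random_state=0)
autoencoder.fit(X_train, X_train)


def reconstruction_error(X):
    # Mean squared difference between each row and its reconstruction.
    return np.mean((X - autoencoder.predict(X)) ** 2, axis=1)


# Assumed error bound: the 99th percentile of errors on the training data.
threshold = np.percentile(reconstruction_error(X_train), 99)


def is_anomaly(X):
    return reconstruction_error(X) > threshold


false_alarm_rate = is_anomaly(X_normal_test).mean()
detection_rate = is_anomaly(X_anomalous).mean()
```

The overlap caveat shows up directly in this setup: anomalous points that happen to lie near the subspace the network learned would reconstruct well and go undetected.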

If you don't know what should be considered an anomaly, I would suggest looking at the Isolation Forest algorithm (https://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/icdm08b.pdf), a tree-based method for outlier detection. It lets you take an unlabeled dataset (no class tags), work out which examples count as 'clean/healthy' for your anomaly detection algorithm, and use the outliers as anomalous examples.
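As an illustration, here is a small sketch using scikit-learn's `IsolationForest` on synthetic data. The data, the `contamination=0.05` setting, and the variable names are assumptions; on a real dataset the expected share of outliers is usually unknown and has to be estimated or tuned.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic unlabeled data: a dense cluster plus a few scattered points.
inliers = rng.normal(size=(950, 2))
outliers = rng.uniform(-10, 10, size=(50, 2))
X = np.vstack([inliers, outliers])

# Ground truth is kept only to sanity-check the result; the forest never sees it.
is_true_outlier = np.concatenate([np.zeros(950, dtype=bool),
                                  np.ones(50, dtype=bool)])

# contamination is the assumed fraction of outliers in the data.
forest = IsolationForest(n_estimators=100, contamination=0.05, random_state=0)
pred = forest.fit_predict(X)  # +1 for inliers, -1 for outliers

clean = X[pred == 1]     # candidates for the 'clean/healthy' training set
flagged = X[pred == -1]  # candidates for the anomalous examples

# Fraction of flagged points that really came from the scattered group.
precision = is_true_outlier[pred == -1].mean()
```

The `clean` rows could then be used to train the detector, with the `flagged` rows held back as anomalous examples for the cv and test sets.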

Here is another article you might find useful; it does something similar to what I am suggesting here: https://towardsdatascience.com/anomaly-detection-for-dummies-15f148e559c1
