Professional Data Engineer Exam - Question 14

Question

You want to use a database of information about tissue samples to classify future tissue samples as either normal or mutated. You are evaluating an unsupervised anomaly detection method for classifying the tissue samples. Which two characteristics support this method? (Choose two.)

Examice · Accepted Answer

For an unsupervised anomaly detection method, it is crucial that anomalies (mutations in this case) are rare compared to normal instances. This condition ensures that the method can effectively detect deviations from the norm. Therefore, having very few occurrences of mutations relative to normal samples supports this method. Additionally, if future mutations are expected to have different features from those currently in the database, an unsupervised method would be beneficial. This is because it does not rely on predefined patterns but instead identifies new anomalies based on their deviation from the normal data patterns in the database.

jvg637 · Answer

I think AD makes more sense. D is the explanation you gave. Of the rest, A makes more sense: any anomaly detection algorithm assumes a priori that you have many more "normal" samples than mutated ones, so that you can model normal patterns and detect patterns that are "off" from that normal pattern. For that, the number of normal samples always needs to be much larger than the number of mutated samples.
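
A minimal sketch of why the normal samples need to dominate (not from the thread). It assumes a single hypothetical numeric feature with made-up values, and uses a median/MAD rule as a simple stand-in for a real detector; the function names and the threshold of 5 are illustrative choices only.

```python
import statistics

def fit_normal_profile(values):
    """Estimate the 'normal' centre and spread from unlabeled data."""
    center = statistics.median(values)
    # Median absolute deviation: a spread estimate that a few outliers barely move.
    mad = statistics.median(abs(v - center) for v in values)
    return center, mad

def is_anomaly(value, profile, threshold=5.0):
    """Flag a value lying more than `threshold` MADs from the centre."""
    center, mad = profile
    return abs(value - center) > threshold * mad

# Hypothetical feature values: normal tissue near 10, mutated tissue near 25.
normal = [9.8, 10.1, 10.0, 9.9, 10.2, 10.3, 9.7, 10.0, 10.1, 9.9]
mutated = [v + 15.0 for v in normal]

rare_profile = fit_normal_profile(normal * 2 + [25.0])   # 1 mutation in 21 samples
balanced_profile = fit_normal_profile(normal + mutated)  # 10 mutations in 20 samples

print(is_anomaly(25.0, rare_profile))      # True: the mutation stands out
print(is_anomaly(25.0, balanced_profile))  # False: "normal" is no longer well defined
```

With balanced classes the estimated centre falls between the two groups and the spread grows to cover both, so nothing looks unusual any more.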

jvg637 · Answer

A instead of B:
"anomaly detection (also outlier detection) is the identification of rare items, events or observations which raise suspicions by differing significantly from the majority of the data."

azmiozgen · Answer

D should be correct. You expect future samples to correlate with the training samples. That's the whole point of the learning procedure. If you do not expect them to have similar features, then why would you use the features in the training samples in the first place? A is also correct, since anomaly labels would be seen rarely.

charline · Answer

D. You expect future mutations to have similar features to the MUTATED samples in the database.

juliobs · Answer

The question is more about why *unsupervised* *anomaly detection*.
A explains the *anomaly detection*
C explains why *unsupervised*

If the mutations were like the database you could simply do supervised learning.

despee · Answer

Guys, if there are equal numbers of normal and mutated samples in the DB, it becomes a classification problem, not anomaly detection.

cchen8181 · Answer

I would choose A and C.

Not B because mutations should be rare.
Not D because mutations can be unpredictable and if D were true it would point to supervised learning.
Not E since it would point to supervised learning.

imran79 · Answer

For unsupervised anomaly detection to be effective, it works best when anomalies (or mutations in this case) are rare compared to normal instances. Moreover, if future mutations are expected to have different features from those currently in the database, an unsupervised method would be beneficial since it doesn't rely on previously seen patterns of mutations.

The two characteristics that support the use of an unsupervised anomaly detection method in this scenario are:

A. There are very few occurrences of mutations relative to normal samples.
C. You expect future mutations to have different features from the mutated samples in the database.

axantroff · Answer

AC. This might also be interesting, as in the comments below: https://towardsdatascience.com/unsupervised-learning-for-anomaly-detection-44c55a96b8c1

rocky48 · Answer

A. There are very few occurrences of mutations relative to normal samples. This characteristic is supportive of using an unsupervised anomaly detection method, as it is well suited for identifying rare events or anomalies in large amounts of data. By training the algorithm on the normal tissue samples in the database, it can then identify new tissue samples that have different features from the normal samples and classify them as mutated.

D. You expect future mutations to have similar features to the mutated samples in the database. This characteristic is supportive of using an unsupervised anomaly detection method, as it is well suited for identifying patterns or anomalies in the data. By training the algorithm on the mutated tissue samples in the database, it can then identify new tissue samples that have similar features and classify them as mutated.

spicebits · Answer

Unsupervised anomaly detection is best for scenarios without labels, or when the anomalies are unknown or ever-changing.

tdum76000 · Answer

As A is a good answer, I'd like to give my point of view on the second right answer. I initially thought D was the correct one, as you would normally train your model to detect mutations seen in the training dataset. But the goal of unsupervised learning is to detect unidentified patterns. If you were sure the mutations would always look the same, you'd rather use supervised learning and label the "normal" and "mutated" tissues, which would result in better performance in my view.

Mathew106 · Answer

Answers B and C are both dumb, sorry to say. There are different approaches to anomaly detection. Some expect future anomalies to have different features from the anomalies in the training dataset and some don't. If you cluster the training data and assign an anomaly label to any data point that falls in an anomaly cluster, then you expect them to have similar features. If you disregard the anomaly clusters and simply set a rule saying "a data point is an anomaly if it's further than X away from clusters 1, 2, 3 with healthy tissues", then you don't care about having similar features, as long as they are not similar to the healthy tissues.
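
The two approaches described above can be sketched roughly as follows. This is only an illustration under assumptions: the 2-D feature vectors are invented, the clusters are taken as already found by some earlier clustering step, and the 2x margin on the healthy radius is an arbitrary choice standing in for "X".

```python
import math

def centroid(points):
    """Mean point of a list of equal-length feature tuples."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))

# Hypothetical feature vectors, already grouped by an earlier clustering step.
healthy_cluster = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.0)]
known_mutation_cluster = [(8.0, 8.0), (8.2, 7.9)]

healthy_center = centroid(healthy_cluster)
mutation_center = centroid(known_mutation_cluster)

def label_by_nearest_cluster(sample):
    """Approach 1: a new sample inherits the label of the closest cluster."""
    d_healthy = math.dist(sample, healthy_center)
    d_mutated = math.dist(sample, mutation_center)
    return "mutated" if d_mutated < d_healthy else "normal"

# Healthy region: farthest healthy member from the centre, times an assumed margin.
healthy_radius = 2.0 * max(math.dist(p, healthy_center) for p in healthy_cluster)

def label_by_distance_from_healthy(sample):
    """Approach 2: anything outside the healthy region is an anomaly."""
    far = math.dist(sample, healthy_center) > healthy_radius
    return "mutated" if far else "normal"

# A mutation unlike both the healthy tissue and the known mutations.
novel_mutation = (1.0, -3.0)
print(label_by_nearest_cluster(novel_mutation))        # normal (missed)
print(label_by_distance_from_healthy(novel_mutation))  # mutated (caught)
```

Approach 1 only works when new mutations resemble the known mutation cluster; approach 2 does not care what the mutation looks like, only that it is unlike healthy tissue.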

Mark_86 · Answer

A & C
100% sure, as you would only use unsupervised learning if you cannot supervise your algorithm. The other answers imply that you have enough, expectedly similar mutations to supervise on.

gaurav0480 · Answer

A is definitely true. Next comes the tricky difference between C & D. We can in fact even use supervised learning for case D, where future mutations are similar to the mutations in the training sample: given that we have enough samples in the anomalous class, we can over-sample the anomalous class and under-sample the other class. Therefore I am inclined to choose C instead of D.
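
As a rough sketch of the resampling idea mentioned above (an illustration, not a recommended pipeline): random over-sampling of the mutated class and random under-sampling of the normal class to build a balanced, labeled training set. The function name `rebalance`, the placeholder sample values, and the target size are all hypothetical, and it assumes `target_per_class` does not exceed the number of normal samples.

```python
import random

def rebalance(normal, mutated, target_per_class, seed=0):
    """Return a shuffled, class-balanced list of (sample, label) pairs."""
    rng = random.Random(seed)
    # Under-sample: draw without replacement from the larger (normal) class.
    normal_part = rng.sample(normal, target_per_class)
    # Over-sample: draw with replacement from the smaller (mutated) class.
    mutated_part = rng.choices(mutated, k=target_per_class)
    labeled = [(x, "normal") for x in normal_part]
    labeled += [(x, "mutated") for x in mutated_part]
    rng.shuffle(labeled)
    return labeled

# Hypothetical data: 100 normal samples, 5 mutated ones (values are placeholders).
normal_samples = list(range(100))
mutated_samples = [1000, 1001, 1002, 1003, 1004]

balanced = rebalance(normal_samples, mutated_samples, target_per_class=20)
print(len(balanced))  # 40
```

The balanced set could then be fed to any supervised classifier, which only makes sense if future mutations look like the labeled ones.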

pandey_0307 · Answer

A. There are very few occurrences of mutations relative to normal samples.

Unsupervised anomaly detection is particularly useful in situations where anomalies (mutations) are rare compared to the normal instances. This characteristic aligns well with unsupervised methods that can detect outliers or rare events in a dataset dominated by normal samples.

C. You expect future mutations to have different features from the mutated samples in the database.

Anomaly detection methods are effective when future anomalies do not follow the same patterns as the known anomalies. These methods aim to identify instances that significantly deviate from the norm, which suits the scenario where future mutations might exhibit different characteristics from those currently known.

Roulle · Answer

That's A and D. The aim of unsupervised classification of anomalies is to identify sub-groups with characteristics in common that may resemble anomalies. So, when a new mutation appears, we can determine whether it shares characteristics with previously discovered anomaly subgroups. If this mutation is an anomaly and has very different characteristics from our detected anomaly subgroup, it is likely to be associated with an incorrect group.
