Professional Data Engineer Exam - Question 14


You want to use a database of information about tissue samples to classify future tissue samples as either normal or mutated. You are evaluating an unsupervised anomaly detection method for classifying the tissue samples. Which two characteristics support this method? (Choose two.)

Correct Answer: AC

For an unsupervised anomaly detection method, it is crucial that anomalies (mutations in this case) are rare compared to normal instances. This condition ensures that the method can effectively detect deviations from the norm. Therefore, having very few occurrences of mutations relative to normal samples supports this method. Additionally, if future mutations are expected to have different features from those currently in the database, an unsupervised method would be beneficial. This is because it does not rely on predefined patterns but instead identifies new anomalies based on their deviation from the normal data patterns in the database.
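The reasoning above can be illustrated with a minimal sketch in plain Python. Everything in it is an illustrative assumption rather than part of the question: the two made-up features, the 990/10 split, the z-score scoring rule, and the threshold of 4.0. The profile is fitted on the whole unlabeled database, which stays close to "normal" only because mutations are rare, and a sample is flagged when it deviates strongly from that profile, whether or not it resembles the mutations already in the database.

```python
import math
import random


def fit_profile(samples):
    """Per-feature mean and standard deviation of the whole (unlabeled) database."""
    n, dims = len(samples), len(samples[0])
    means = [sum(s[d] for s in samples) / n for d in range(dims)]
    stds = []
    for d in range(dims):
        variance = sum((s[d] - means[d]) ** 2 for s in samples) / n
        stds.append(math.sqrt(variance) or 1.0)  # guard against a constant feature
    return means, stds


def anomaly_score(sample, means, stds):
    """Largest absolute z-score over all features."""
    return max(abs(x - m) / s for x, m, s in zip(sample, means, stds))


def is_mutated(sample, means, stds, threshold=4.0):
    """Flag a sample whose score exceeds the (assumed) threshold."""
    return anomaly_score(sample, means, stds) > threshold


# Hypothetical database: 990 normal samples and 10 known mutations,
# where the known mutations differ from normal only on the first feature.
rng = random.Random(0)
normal = [(rng.gauss(10.0, 1.0), rng.gauss(5.0, 0.5)) for _ in range(990)]
known_mutations = [(rng.gauss(16.0, 1.0), rng.gauss(5.0, 0.5)) for _ in range(10)]
means, stds = fit_profile(normal + known_mutations)

typical_sample = (10.2, 5.1)   # looks like the bulk of the database
novel_mutation = (10.0, 11.0)  # deviates on the second feature, unlike known mutations

print(is_mutated(typical_sample, means, stds))
print(is_mutated(novel_mutation, means, stds))
```

With these assumed numbers the typical sample is not flagged, while the novel mutation is flagged even though no mutation in the database deviates on that feature. If mutations made up a large share of the database, the fitted profile would drift toward them and the same rule would become much less reliable.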

Discussion

17 comments
jvg637 (Options: AD)
Mar 15, 2020

I think that AD makes more sense. D is the explanation you gave. Of the rest, A makes more sense: any anomaly detection algorithm assumes a priori that you have many more "normal" samples than mutated ones, so that you can model the normal pattern and detect samples that are "off" that pattern. For that, you will always need the number of normal samples to be much bigger than the number of mutated samples.

BigQuery
Dec 6, 2021

Guys, it's A & C. Anomaly detection has two basic assumptions:
-> Anomalies only occur very rarely in the data. (A)
-> Their features differ from the normal instances significantly. (C)
Link -> https://towardsdatascience.com/anomaly-detection-for-dummies-15f148e559c1#:~:text=Unsupervised%20Anomaly%20Detection%20for%20Univariate%20%26%20Multivariate%20Data.&text=Anomaly%20detection%20has%20two%20basic,from%20the%20normal%20instances%20significantly.

szefco
Dec 15, 2021

I don't agree on C. Anomaly detection assumes "Their features differ from the NORMAL INSTANCES significantly", whereas option C says: "You expect future mutations to have different features from the MUTATED SAMPLES IN THE DATABASE". IMHO answer D fits better: "D. You expect future mutations to have similar features to the mutated samples in the database." In other words: expect future anomalies to be similar to the anomalies we already have in the database.

jvg637 (Options: AC)
Mar 11, 2020

A instead of B: "anomaly detection (also outlier detection) is the identification of rare items, events or observations which raise suspicions by differing significantly from the majority of the data."

azmiozgen (Options: AD)
Jul 12, 2023

D should be correct. You expect future samples to correlate with the training samples. That's the whole point of the learning procedure. If you do not expect them to have similar features, then why would you use the features of the training samples in the first place? A is also correct, since anomaly labels would be seen rarely.

charline (Options: AD)
Mar 15, 2023

D. You expect future mutations to have similar features to the MUTATED samples in the database.

juliobs (Options: AC)
Mar 17, 2023

The question is more about why *unsupervised* *anomaly detection*. A explains the *anomaly detection*. C explains why *unsupervised*. If the future mutations were like the ones in the database, you could simply do supervised learning.

shabfat
Mar 18, 2023

I think it should be D instead of C, because for good clustering you want the intra-cluster distance to be low --> that would imply you want similar mutations.

despee (Options: AC)
May 2, 2023

Guys, if there are equals in the DB, it becomes a classification problem, not anomaly detection.

momosoundz
May 6, 2023

agree!! :)

cchen8181 (Options: AC)
May 17, 2023

I would choose A and C. Not B because mutations should be rare. Not D because mutations can be unpredictable and if D were true it would point to supervised learning. Not E since it would point to supervised learning.

imran79 (Options: AC)
Oct 7, 2023

For unsupervised anomaly detection to be effective, it works best when anomalies (or mutations in this case) are rare compared to normal instances. Moreover, if future mutations are expected to have different features from those currently in the database, an unsupervised method would be beneficial since it doesn't rely on previously seen patterns of mutations. The two characteristics that support the use of an unsupervised anomaly detection method in this scenario are: A. There are very few occurrences of mutations relative to normal samples. C. You expect future mutations to have different features from the mutated samples in the database.

axantroff (Options: AC)
Oct 29, 2023

AC; this might also be interesting, in line with the comments below: https://towardsdatascience.com/unsupervised-learning-for-anomaly-detection-44c55a96b8c1

rocky48 (Options: AD)
Nov 4, 2023

A. There are very few occurrences of mutations relative to normal samples. This characteristic is supportive of using an unsupervised anomaly detection method, as it is well suited for identifying rare events or anomalies in large amounts of data. By training the algorithm on the normal tissue samples in the database, it can then identify new tissue samples that have different features from the normal samples and classify them as mutated. D. You expect future mutations to have similar features to the mutated samples in the database. This characteristic is supportive of using an unsupervised anomaly detection method, as it is well suited for identifying patterns or anomalies in the data. By training the algorithm on the mutated tissue samples in the database, it can then identify new tissue samples that have similar features and classify them as mutated.

spicebits (Options: AC)
Nov 9, 2023

Unsupervised anomaly detection is best for scenarios without labels or when the anomalies are unknown or ever-changing

tdum76000 (Options: AC)
Dec 19, 2023

Since A is clearly a good answer, I'd like to give my point of view on the second right answer. I initially thought D was the correct one, as you would normally train your model to detect mutations seen in the training dataset. But the goal of unsupervised learning is to detect unidentified patterns. If you were sure the mutations would always look the same, you'd rather use supervised learning and label the "normal" and "mutated" tissues, which would result in better performance, in my view.

Mathew106 (Options: AD)
Jul 24, 2023

Answers B and C are both dumb, sorry to say. There are different approaches to anomaly detection. Some expect future anomalies to have different features from the anomalies in the training dataset and some don't. If you cluster the training data and assign an anomaly label to any data point that falls in an anomaly cluster, then you expect them to have similar features. If you disregard the anomaly clusters and simply set a rule saying "a data point is an anomaly if it's further away than X from clusters 1, 2, 3 with healthy tissues", then you don't care about having similar features, as long as they are not similar to the healthy tissues.
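The two approaches contrasted in this comment can be sketched side by side. This is only an illustration under stated assumptions: the centroids are hand-picked rather than learned by a clustering algorithm, and the function names, the sample points, and the radius of 2.5 (standing in for "X") are all hypothetical.

```python
import math

# Assumed, hand-picked centroids; a real pipeline would learn them by clustering.
healthy_centroids = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]  # "clusters 1, 2, 3"
anomaly_centroids = [(10.0, 10.0)]                        # cluster of known mutations


def nearest(point, centroids):
    """Euclidean distance from point to the closest centroid."""
    return min(math.dist(point, c) for c in centroids)


def label_by_anomaly_cluster(point):
    """Approach 1: mutated only if the point falls closest to an anomaly cluster."""
    closer_to_anomaly = nearest(point, anomaly_centroids) < nearest(point, healthy_centroids)
    return "mutated" if closer_to_anomaly else "normal"


def label_by_distance_from_healthy(point, radius=2.5):
    """Approach 2: mutated if the point is further than `radius` from every healthy cluster."""
    return "mutated" if nearest(point, healthy_centroids) > radius else "normal"


similar_mutation = (9.5, 10.5)  # resembles the known mutations
novel_mutation = (-5.0, -5.0)   # unlike both healthy tissue and known mutations

for name, point in [("similar", similar_mutation), ("novel", novel_mutation)]:
    print(name, label_by_anomaly_cluster(point), label_by_distance_from_healthy(point))
```

In this toy setup both rules flag the mutation that resembles the known ones, but only the distance-from-healthy rule flags the novel one. The first rule depends on future mutations looking like past ones; the second does not.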

Mark_86 (Options: AC)
Jul 26, 2023

A & C 100% sure, as you would only use unsupervised learning if you cannot supervise your algorithm. The other answers imply that you have enough, expectedly similar mutations to supervise on.

gaurav0480 (Options: AC)
Aug 20, 2023

A is definitely true. Next comes the tricky difference between C & D. We can in fact even use supervised learning for case D, where future mutations are similar to the mutations in the training sample: given that we have enough samples in the anomalous class, we can over-sample the anomalous class and under-sample the other class. Therefore I am inclined to choose C instead of D.

pandey_0307 (Options: AC)
Jun 6, 2024

A. There are very few occurrences of mutations relative to normal samples. Unsupervised anomaly detection is particularly useful in situations where anomalies (mutations) are rare compared to the normal instances. This characteristic aligns well with unsupervised methods that can detect outliers or rare events in a dataset dominated by normal samples. C. You expect future mutations to have different features from the mutated samples in the database. Anomaly detection methods are effective when future anomalies do not follow the same patterns as the known anomalies. These methods aim to identify instances that significantly deviate from the norm, which suits the scenario where future mutations might exhibit different characteristics from those currently known.

Roulle (Options: AD)
Jul 9, 2024

That's A and D. The aim of unsupervised classification of anomalies is to identify sub-groups with characteristics in common that may resemble anomalies. So, when a new mutation appears, we can determine whether it shares characteristics with previously discovered anomaly subgroups. If this mutation is an anomaly and has very different characteristics from our detected anomaly subgroup, it is likely to be associated with an incorrect group.