Certified Data Engineer Professional Exam Questions

Certified Data Engineer Professional Exam - Question 67


The data science team has requested assistance in accelerating queries on free-form text from user reviews. The data is currently stored in Parquet with the schema below:

item_id INT, user_id INT, review_id INT, rating FLOAT, review STRING

The review column contains the full text of the review left by the user. Specifically, the data science team is looking to identify if any of 30 key words exist in this field.

A junior data engineer suggests that converting this data to Delta Lake will improve query performance.

Which response to the junior data engineer's suggestion is correct?

Correct Answer: A

Delta Lake statistics are not optimized for free text fields with high cardinality. Data skipping relies on per-file statistics such as minimum and maximum values, and these only help when a filter can be compared against a file's value range. Each review text is unique or nearly unique, and a keyword can appear anywhere inside it, so the minimum and maximum of the review column say nothing about which files contain a given keyword. Converting to Delta Lake therefore does not significantly improve the performance of queries that search for specific keywords within the text.
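To make the reasoning concrete, here is a minimal sketch in plain Python that simulates min/max file pruning. It is not Delta Lake's implementation: the file names, sample reviews, and helper functions are all hypothetical, and the only assumption carried over is that each data file records the minimum and maximum value of the review column.

```python
# Hypothetical data files, each holding a few review strings.
files = {
    "part-0": ["Arrived late but works", "Battery died in a week"],
    "part-1": ["Great value", "Works well, love the battery"],
}

def min_max(reviews):
    """Per-file statistics: lexicographic min and max of the column."""
    return min(reviews), max(reviews)

stats = {name: min_max(reviews) for name, reviews in files.items()}

def candidate_files_for_equality(stats, value):
    """Range pruning: keep only files whose [min, max] could hold `value`."""
    return [name for name, (lo, hi) in stats.items() if lo <= value <= hi]

def files_containing_keyword(files, keyword):
    """Ground truth for a keyword search: every file has to be read."""
    needle = keyword.lower()
    return [
        name
        for name, reviews in files.items()
        if any(needle in review.lower() for review in reviews)
    ]

# An exact-match filter can be pruned safely using the statistics.
equality_hits = candidate_files_for_equality(stats, "Great value")

# Applying the same range test to a keyword would skip files that
# actually contain it, so the statistics cannot be used for pruning.
range_hits = candidate_files_for_equality(stats, "battery")
keyword_hits = files_containing_keyword(files, "battery")
```

In this toy data the exact-match lookup narrows the scan to one file, while the range test on the keyword rules out both files even though both contain it. That is why a keyword search has to read every file regardless of the statistics.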

Discussion

4 comments
aragorn_brego (Option: A)
Nov 21, 2023

Delta Lake uses statistics and data skipping to improve query performance, but these optimizations are most effective for columns with low to medium cardinality (i.e., columns with a limited set of distinct values). Free-form text fields like the review column typically have high cardinality, meaning each value in the column (each review text) is unique or nearly unique. Consequently, statistics on such columns do not significantly improve the performance of queries searching for specific keywords within the text.

sturcu (Option: A)
Oct 24, 2023

Collecting statistics on long strings is an expensive operation.

mouad_attaqi (Option: A)
Oct 26, 2023

A is correct

Dileepvikram (Option: A)
Nov 9, 2023

answer is A