Exam: Certified Data Engineer Professional
Question 31

A junior data engineer is working to implement logic for a Lakehouse table named silver_device_recordings. The source data contains 100 unique fields in a highly nested JSON structure.

The silver_device_recordings table will be used downstream to power several production monitoring dashboards and a production model. At present, 45 of the 100 fields are being used in at least one of these applications.

The data engineer is trying to determine the best approach for dealing with schema declaration given the highly nested structure of the data and the numerous fields.

Which of the following accurately presents information about Delta Lake and Databricks that may impact their decision-making process?

    Correct Answer: D

    Setting types manually provides greater assurance of data quality enforcement. Relying on inferred schema can be risky due to the possibility of encountering unexpected data types, which may result in errors or inefficient processing. Specifying the schema manually ensures that the data is accurately interpreted and conforms to expected standards, which is crucial for maintaining data integrity in production environments.
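
As a minimal sketch of the manual approach, the snippet below declares an explicit schema for a subset of the nested fields before writing to the Silver table. The field names, source path, and write mode are assumptions for illustration only, and `spark` refers to the session that is available by default in a Databricks notebook.

```python
from pyspark.sql.types import (
    StructType, StructField, StringType, LongType, DoubleType, TimestampType
)

# Hypothetical subset of the 100 source fields; only fields that downstream
# applications rely on are typed explicitly here.
device_schema = StructType([
    StructField("device_id", LongType(), nullable=False),
    StructField("event_time", TimestampType(), nullable=True),
    StructField("metadata", StructType([            # nested struct
        StructField("firmware", StringType(), True),
        StructField("model", StringType(), True),
    ]), nullable=True),
    StructField("reading", DoubleType(), nullable=True),
])

# Applying the declared schema skips a separate inference pass and guarantees
# the types that the dashboards and the production model expect.
raw_df = (
    spark.read
        .schema(device_schema)
        .json("/mnt/raw/device_recordings/")        # hypothetical source path
)

raw_df.write.format("delta").mode("append").saveAsTable("silver_device_recordings")
```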

Discussion
RafaelCFC (Option: D)

A is wrong because Tungsten is a project focused on improving Spark's memory and CPU efficiency. B is wrong because Parquet does not support in-place file edits; on its own it only supports creating and overwriting files. C is wrong because fully automating schema declaration reduces the predictability of data types and weakens data quality guarantees. E is false because an unlucky sample can lead Spark to infer the wrong types.

sturcu (Option: D)

correct

hammer_1234_h (Option: D)

D is correct. We can use schema hints (`cloudFiles.schemaHints` in Auto Loader) to enforce the types we know and expect on top of an inferred schema.
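
A minimal sketch of combining inference with schema hints, assuming Databricks Auto Loader is used to ingest the JSON source; the paths, checkpoint location, and column names are hypothetical:

```python
# Infer most of the 100 fields, but pin the types of the columns that the
# downstream dashboards and model depend on.
bronze_stream = (
    spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", "/mnt/checkpoints/device_recordings/schema")
        # Hints override inference only for the listed columns; the rest are inferred.
        .option("cloudFiles.schemaHints", "device_id BIGINT, reading DOUBLE, event_time TIMESTAMP")
        .load("/mnt/raw/device_recordings/")        # hypothetical source path
)
```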

hal2401me

From my exam today: both C and D are no longer available as choices, so they can't be correct. E and A are still there. E says "always accurate", so I hesitated to choose it. There is a new option stating something like "Delta Lake indexes the first 32 columns in the Delta log for Z-ordering and optimization" (I don't remember the exact wording, but the statement looked factually correct), and I chose this new option, because it should influence the schema decision: put the high-usage fields among the first 32 columns.
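
The option recalled above appears to refer to Delta Lake collecting file-level statistics on the first 32 columns by default, which is controlled by the `delta.dataSkippingNumIndexedCols` table property. As an illustrative sketch only (the value 45 is an assumption matching the number of fields used downstream), the property can be raised so statistics cover all of them:

```python
# Raise the number of leading columns Delta Lake collects statistics on,
# assuming the default of 32 is too few for the 45 fields used downstream.
spark.sql("""
    ALTER TABLE silver_device_recordings
    SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '45')
""")
```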

guillesd (Option: D)

The only answer that makes sense.

AziLa (Option: D)

Correct answer is D.