Certified Data Engineer Professional Exam Questions

Certified Data Engineer Professional Exam - Question 7


The data science team has created and logged a production model using MLflow. The following code correctly imports and applies the production model to output the predictions as a new DataFrame named preds with the schema "customer_id LONG, predictions DOUBLE, date DATE".

The data science team would like predictions saved to a Delta Lake table with the ability to compare all predictions across time. Churn predictions will be made at most once per day.

Which code block accomplishes this task while minimizing potential compute costs?

Correct Answer: A

To save predictions to a Delta Lake table with the ability to compare all predictions across time while minimizing compute costs, the code block in option A is the right choice. It uses the `saveAsTable` method with the mode set to `append`, so each run adds new predictions to the table without overwriting previous entries. This maintains the required historical record and fits a batch-processing context, since churn predictions are made at most once per day. Note that on Databricks, tables created with `saveAsTable` default to the Delta Lake format, so there is no need to specify the format explicitly.
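Since the exam's actual code blocks are not reproduced in this dump, the following is a minimal sketch of the pattern option A describes: applying the registered production model with `mlflow.pyfunc.spark_udf` and appending the resulting `preds` DataFrame to a table. The model name `churn_model`, the table name `churn_preds`, and the `features_df` / `feature_cols` inputs are assumptions used purely for illustration.

```python
import mlflow.pyfunc
from pyspark.sql import functions as F

# Assumed names for illustration: "churn_model" (registered model),
# "churn_preds" (target table), features_df / feature_cols (scoring input).
predict = mlflow.pyfunc.spark_udf(spark, "models:/churn_model/Production")

preds = features_df.select(
    "customer_id",
    predict(*feature_cols).alias("predictions"),
    F.current_date().alias("date"),
)

# Option A pattern: a plain batch write in append mode. No explicit format is
# needed because saveAsTable creates a Delta table by default on Databricks,
# and append keeps every daily run so predictions can be compared over time.
preds.write.mode("append").saveAsTable("churn_preds")
```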

Discussion

9 comments
thxsgod · Option: A
Sep 7, 2023

You need:
- A batch operation, since predictions are made at most once a day
- Append mode, since you need to keep track of past predictions

A is the correct answer. You don't need to specify "format" when you use saveAsTable.

Eertyy · Option: B
Sep 21, 2023

The answer is B.

Eertyy
Sep 21, 2023

Here's why:
- A saves the data as a managed table, which may not be efficient for large-scale data or frequent updates, and it doesn't explicitly use Delta Lake capabilities.
- C is used for streaming operations, not batch processing. Also, using "overwrite" as the output mode replaces the existing data each time, which is not suitable for keeping historical predictions.
- D is similar to option A but with "overwrite" mode; it replaces the entire table each time, which is not suitable for maintaining a historical record of predictions.
- E is also for streaming operations, not batch processing. Additionally, it uses the "table" method, which is not typically used for writing batch data to Delta Lake tables.

Option B is suitable for batch processing, writes data in Delta Lake format, and allows you to efficiently maintain a historical record of predictions while minimizing compute costs.
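To make the append-versus-overwrite point above concrete, here is a minimal contrast using the same assumed table name as earlier; the streaming variants mentioned (C and E) would additionally add checkpointing and trigger overhead that a once-per-day batch job does not need.

```python
# Overwrite replaces the table contents, so only the latest run survives:
preds.write.mode("overwrite").saveAsTable("churn_preds")

# Append adds today's rows alongside earlier ones, preserving the history
# needed to compare predictions across time:
preds.write.mode("append").saveAsTable("churn_preds")
```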

pradyumn9999
Oct 23, 2023

It's also said they want to compare past values, so the mode needs to be append. By default, the mode is error (errorifexists).

Starvosxant
Oct 9, 2023

First: the default format in which Databricks saves tables IS Delta, so there is no reason to say it wouldn't benefit from Lakehouse features. Second: the default write mode is error, meaning that if you try to write to a location where a table already exists, it will raise an error, and the question specifies that you are going to write once a day. You had better revisit basic topics before continuing to the professional-level certification, or buy the dump entirely.
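A quick sketch of the default-mode behaviour described above, again using the assumed table name:

```python
# With no explicit mode, the DataFrameWriter defaults to "errorifexists":
preds.write.saveAsTable("churn_preds")   # day 1: creates the table
preds.write.saveAsTable("churn_preds")   # day 2: raises AnalysisException (table already exists)
```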

buggumaster · Option: B
Aug 28, 2023

Selected answer is wrong, no write mode is specified in A.

buggumaster · Option: A
Aug 28, 2023

Selected answer is wrong, no write format is specified in A.

sturcu
Oct 11, 2023

Correct

sturcu · Option: A
Oct 11, 2023

Correct

kz_data · Option: A
Dec 21, 2023

A is correct

Jay_98_11 · Option: A
Jan 13, 2024

A is correct

coercion · Option: A
May 19, 2024

The default table format is Delta, so there is no need to specify the format. As per the requirement, "append" mode is required to maintain the history. The default mode is "ErrorIfExists".