Exam: Certified Data Engineer Professional
Question 123

The data science team has created and logged a production model using MLflow. The model accepts a list of column names and returns a new column of type DOUBLE.

The following code correctly imports the production model, loads the customers table containing the customer_id key column into a DataFrame, and defines the feature columns needed for the model.

Which code block will output a DataFrame with the schema "customer_id LONG, predictions DOUBLE"?

    Correct Answer: B

    The correct code block uses `select` to keep the `customer_id` key column while applying the model to the feature columns (`columns`). Aliasing the model output as `predictions` ensures the resulting DataFrame has the schema `customer_id LONG, predictions DOUBLE`. This follows PySpark's DataFrame API for applying a model UDF to specific columns, which is the standard approach in the Databricks environment.
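The setup code referenced by the question stem is not reproduced on this page, so the following is a minimal sketch of what the chosen code block looks like in practice, assuming the production model was loaded as a Spark UDF via `mlflow.pyfunc.spark_udf`; the model URI, feature column names, and the names `model`, `df`, and `columns` below are illustrative stand-ins for what the stem defines:

```python
import mlflow.pyfunc

# Hypothetical setup: the question stem provides equivalents of these three lines.
model = mlflow.pyfunc.spark_udf(spark, model_uri="models:/prod_model/Production")  # production model as a Spark UDF
df = spark.table("customers")                      # customers table containing the customer_id key column
columns = ["feature_1", "feature_2", "feature_3"]  # feature columns expected by the model (illustrative names)

# Option B: select the key column and alias the model output as "predictions"
predictions_df = df.select(
    "customer_id",
    model(*columns).alias("predictions"),
)
```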

Discussion
vexor3 (Option: B)

B is correct

Freyr (Option: B)

Correct Answer: B. This option uses `select` to specify columns from the DataFrame and applies the model to the specified feature columns (`columns`). The output of the model is aliased as "predictions", which ensures the output DataFrame has the column names "customer_id" and "predictions" with the appropriate data types, assuming the model returns a DOUBLE. This syntax aligns with PySpark's DataFrame transformations and is a typical way to apply a machine learning model to specific columns in Databricks.
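Under the same assumptions as the sketch above, the result can be checked against the required schema:

```python
# Quick sanity check: the schema should read "customer_id LONG, predictions DOUBLE"
predictions_df.printSchema()
```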