Certified Data Engineer Professional Exam - Question 105

Question

The data science team has created and logged a production model using MLflow. The model accepts a list of column names and returns a new column of type DOUBLE.

The following code correctly imports the production model, loads the customers table containing the customer_id key column into a DataFrame, and defines the feature columns needed for the model.

Which code block will output a DataFrame with the schema "customer_id LONG, predictions DOUBLE"?

Examice · Accepted Answer

The correct code block uses the DataFrame API to apply the UDF to the feature columns and alias the result as predictions: df.select("customer_id", model(*columns).alias("predictions")). Because the model is a Spark UDF, model(*columns) unpacks the list of column names into separate arguments and returns a single DOUBLE column, which alias("predictions") then names. Selecting that column alongside customer_id yields a DataFrame with the desired schema: customer_id LONG, predictions DOUBLE.

divingbell17 · Answer

B is correct. It's a Spark UDF, not pandas.

aragorn_brego · Answer

This code block applies the Spark UDF created from the MLflow model to the DataFrame df by selecting the existing customer_id column and the new column produced by the model, which is aliased to predictions. The model(*columns) part is where the UDF is applied to the columns specified in the columns list, and alias("predictions") is used to name the output column of the model's predictions. This will result in a DataFrame with the desired schema: "customer_id LONG, predictions DOUBLE".

60ties · Answer

I think it is B
