Exam: Certified Machine Learning Professional
Question 41

A machine learning engineer has developed a random forest model using scikit-learn, logged the model using MLflow as random_forest_model, and stored its run ID in the run_id Python variable. They now want to deploy that model by performing batch inference on a Spark DataFrame spark_df.

Which of the following code blocks can they use to create a function called predict that they can use to complete the task?

    Correct Answer: E

    To perform batch inference on a Spark DataFrame with a scikit-learn model logged to MLflow, the model must be loaded as a Spark UDF (user-defined function) and applied to the DataFrame. This is done with mlflow.pyfunc.spark_udf, which takes the SparkSession as its first argument and the model URI as its second. The resulting UDF can then be applied to the DataFrame's feature columns to produce predictions, ensuring the model runs correctly within the Spark environment.
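
    A minimal sketch of what the correct option likely looks like, assuming spark is the active SparkSession, random_forest_model is the logged artifact name from the question, and that every column of spark_df is a model feature (an illustrative assumption; in practice only the feature columns would be passed):

        import mlflow

        # Build the model URI from the stored run ID and the logged artifact name.
        model_uri = f"runs:/{run_id}/random_forest_model"

        # Load the scikit-learn model as a Spark UDF; the SparkSession must be
        # the first argument, the model URI the second.
        predict = mlflow.pyfunc.spark_udf(spark, model_uri=model_uri)

        # Apply the UDF to the feature columns for batch inference.
        predictions_df = spark_df.withColumn("prediction", predict(*spark_df.columns))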

Discussion
spaceexplorer (Option: E)

E is correct

BokNinja (Option: E)

E.

    import mlflow

    logged_model = 'runs:/e905f5759d434a131bbe1e54a2b/best-model'

    # Load model as a Spark UDF.
    loaded_model = mlflow.pyfunc.spark_udf(spark, model_uri=logged_model)

    # Predict on a Spark DataFrame.
    df.withColumn('predictions', loaded_model(*columns)).collect()

victorcolome

Must be A, not E, as the question states that the variable is called "spark_df".

victorcolome

My bad, it is E, because the spark_udf function expects the SparkSession as the first parameter, not the DataFrame!

Mircuz (Option: E)

You need the Spark environment.

64934ca (Option: E)

The spark session is passed as the first argument to mlflow.pyfunc.spark_udf to provide the necessary context for creating and executing the UDF within the Spark environment. The model_uri is passed as the second argument to specify which MLflow model to load and use for predictions. This order is required by the function's design to ensure proper integration with Spark.
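
A minimal illustration of the argument order this comment describes (the model URI built from the question's run_id variable is an assumption):

    # Correct: the SparkSession comes first, then the model URI.
    predict = mlflow.pyfunc.spark_udf(spark, model_uri=f"runs:/{run_id}/random_forest_model")

    # Incorrect: passing the DataFrame first (as option A apparently does) fails,
    # because spark_udf expects a SparkSession as its first argument.
    # predict = mlflow.pyfunc.spark_udf(spark_df, model_uri=...)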

JaydeepT (Option: A)

spark_df is the DataFrame to be evaluated at runtime