Databricks Certified Machine Learning Associate Exam Questions

Question 6 of 35

A data scientist is wanting to explore summary statistics for Spark DataFrame spark_df. The data scientist wants to see the count, mean, standard deviation, minimum, maximum, and interquartile range (IQR) for each numerical feature.

Which of the following lines of code can the data scientist run to accomplish the task?

spark_df.summary ()

spark_df.stats()

spark_df.describe().head()

spark_df.printSchema()

spark_df.toPandas()

Correct Answer: A

Question 7 of 35

An organization is developing a feature repository and is electing to one-hot encode all categorical feature variables. A data scientist suggests that the categorical feature variables should not be one-hot encoded within the feature repository.

Which of the following explanations justifies this suggestion?

One-hot encoding is not supported by most machine learning libraries.

One-hot encoding is dependent on the target variable’s values which differ for each application.

One-hot encoding is computationally intensive and should only be performed on small samples of training sets for individual machine learning problems.

One-hot encoding is not a common strategy for representing categorical feature variables numerically.

One-hot encoding is a potentially problematic categorical variable strategy for some machine learning algorithms.

Correct Answer: C

One-hot encoding is computationally intensive and can lead to a large increase in the dimensionality of the data, especially when dealing with categorical variables that have many unique values. This can make storage and computation more resource-intensive and inefficient. Therefore, it is often more practical to perform one-hot encoding on smaller samples of training sets rather than within the feature repository for broader applications.

Question 8 of 35

A data scientist has created two linear regression models. The first model uses price as a label variable and the second model uses log(price) as a label variable. When evaluating the RMSE of each model by comparing the label predictions to the actual price values, the data scientist notices that the RMSE for the second model is much larger than the RMSE of the first model.

Which of the following possible explanations for this difference is invalid?

The second model is much more accurate than the first model

The data scientist failed to exponentiate the predictions in the second model prior to computing the RMSE

The data scientist failed to take the log of the predictions in the first model prior to computing the RMSE

The first model is much more accurate than the second model

The RMSE is an invalid evaluation metric for regression problems

Correct Answer: E

RMSE (Root Mean Squared Error) is a widely accepted and valid evaluation metric for regression problems. Therefore, the explanation stating that the RMSE is an invalid evaluation metric for regression problems is invalid.

Question 9 of 35

A data scientist uses 3-fold cross-validation when optimizing model hyperparameters for a regression problem. The following root-mean-squared-error values are calculated on each of the validation folds:

• 10.0

• 12.0

• 17.0

Which of the following values represents the overall cross-validation root-mean-squared error?

13.0

17.0

12.0

39.0

10.0

Correct Answer: A

Question 10 of 35

A machine learning engineer is trying to scale a machine learning pipeline pipeline that contains multiple feature engineering stages and a modeling stage. As part of the cross-validation process, they are using the following code block:

Exam Certified Machine Learning Associate Question 10

A colleague suggests that the code block can be changed to speed up the tuning process by passing the model object to the estimator parameter and then placing the updated cv object as the final stage of the pipeline in place of the original model.

Which of the following is a negative consequence of the approach suggested by the colleague?

The model will take longer to train for each unique combination of hyperparameter values

The feature engineering stages will be computed using validation data

The cross-validation process will no longer be parallelizable

The cross-validation process will no longer be reproducible

The model will be refit one more per cross-validation fold

Correct Answer: B