Certified Machine Learning Associate

Practice exam questions for the Databricks Certified Machine Learning Associate exam

You have 35 total questions to study from
These questions were last updated on July 22, 2025.
This site is not affiliated with or endorsed by Databricks.

Question 1 of 35

A machine learning engineer has created a Feature Table new_table using Feature Store Client fs. When creating the table, they specified a metadata description with key information about the Feature Table. They now want to retrieve that metadata programmatically.

Which of the following lines of code will return the metadata description?

A. There is no way to return the metadata description programmatically.

B. fs.create_training_set("new_table")

C. fs.get_table("new_table").description

D. fs.get_table("new_table").load_df()

E. fs.get_table("new_table")

Correct Answer: C

To retrieve the metadata description of a Feature Table created with the Feature Store Client, read the `description` attribute of the table object returned by `get_table`. Therefore, `fs.get_table("new_table").description` returns the metadata description. Calling `fs.get_table("new_table")` on its own returns the whole table object rather than just the description.

Question 2 of 35

A data scientist has a Spark DataFrame spark_df. They want to create a new Spark DataFrame that contains only the rows from spark_df where the value in column price is greater than 0.

Which of the following code blocks will accomplish this task?

A. spark_df[spark_df["price"] > 0]

B. spark_df.filter(col("price") > 0)

C. SELECT * FROM spark_df WHERE price > 0

D. spark_df.loc[spark_df["price"] > 0,:]

E. spark_df.loc[:,spark_df["price"] > 0]

Correct Answer: B

To keep only the rows of a Spark DataFrame that satisfy a condition, use the `filter` method. A common way to reference a column inside the condition is the `col` function, which must be imported from `pyspark.sql.functions`. Therefore, `spark_df.filter(col("price") > 0)` returns a new DataFrame containing only the rows where `price` is greater than 0. The `.loc` indexer belongs to pandas and is not available on a Spark DataFrame, and the SQL statement is not Python code: it would have to be run through `spark.sql` against a registered view or table.

Question 3 of 35

A health organization is developing a classification model to determine whether or not a patient currently has a specific type of infection. The organization's leaders want to maximize the number of positive cases identified by the model.

Which of the following classification metrics should be used to evaluate the model?

A. RMSE

B. Precision

C. Area under the residual operating curve

D. Accuracy

E. Recall

Correct Answer: E

Recall should be used to evaluate the model since it measures the proportion of actual positive cases that were correctly identified. This is particularly important for the health organization because they want to maximize the number of positive cases detected, thereby ensuring fewer positive cases are missed.
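As a rough illustration of why recall is the metric that tracks missed positive cases, the sketch below computes recall, TP / (TP + FN), and precision, TP / (TP + FP), by hand. The labels, the two sets of predictions, and the helper names are hypothetical and are not part of the exam question.

```python
def confusion_counts(y_true, y_pred):
    """Return (true positives, false positives, false negatives) for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, fn


def recall(y_true, y_pred):
    """Share of actual positives that were predicted positive."""
    tp, _, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(y_true, y_pred):
    """Share of predicted positives that were actually positive."""
    tp, fp, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fp) if (tp + fp) else 0.0


# Hypothetical data: 1 = infected, 0 = not infected.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
cautious_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]   # flags only two patients
sensitive_pred = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]  # flags six patients

print("cautious :", recall(y_true, cautious_pred), precision(y_true, cautious_pred))
print("sensitive:", recall(y_true, sensitive_pred), precision(y_true, sensitive_pred))
```

In this made-up example the cautious model has perfect precision but a recall of only 0.5, because it misses two of the four infected patients. The sensitive model finds all four (recall 1.0) at the cost of two false alarms.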

Question 4 of 35

In which of the following situations is it preferable to impute missing feature values with their median value over the mean value?

A. When the features are of the categorical type

B. When the features are of the boolean type

C. When the features contain a lot of extreme outliers

D. When the features contain no outliers

E. When the features contain no missing values

Correct Answer: C

When features contain a lot of extreme outliers, imputing missing values with the median is preferable to using the mean because the median is far less affected by extreme values. The mean is pulled toward the outliers, so it can be a poor estimate of a typical value for the feature.
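A small sketch of the effect, using only Python's standard `statistics` module. The feature values and the `impute` helper are hypothetical; a single extreme value is enough to drag the mean far from the bulk of the data, while the median stays put.

```python
from statistics import mean, median


def impute(values, fill):
    """Replace missing entries (None) with the given fill value."""
    return [fill if v is None else v for v in values]


# Hypothetical feature with one missing value and one extreme outlier.
values = [10.0, 12.0, 11.0, None, 13.0, 1000.0]
observed = [v for v in values if v is not None]

mean_fill = mean(observed)      # pulled toward the outlier
median_fill = median(observed)  # stays near the typical values

print("mean fill:  ", mean_fill)
print("median fill:", median_fill)
print(impute(values, median_fill))
```

Here the mean of the observed values is 209.2, which resembles none of the typical rows, whereas the median is 12.0.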

Question 5 of 35

A data scientist has replaced missing values in their feature set with each respective feature variable’s median value. A colleague suggests that the data scientist is throwing away valuable information by doing this.

Which of the following approaches can they take to include as much information as possible in the feature set?

A. Impute the missing values using each respective feature variable’s mean value instead of the median value

B. Refrain from imputing the missing values in favor of letting the machine learning algorithm determine how to handle them

C. Remove all feature variables that originally contained missing values from the feature set

D. Create a binary feature variable for each feature that contained missing values indicating whether each row’s value has been imputed

E. Create a constant feature variable for each feature that contained missing values indicating the percentage of rows from the feature that was originally missing

Correct Answer: D

Creating a binary feature variable for each feature that contained missing values lets the model see which rows had their value imputed. This preserves the fact that a value was missing, which can itself hold predictive power and improve model performance.
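The idea can be sketched in plain Python as follows. The `age` column and the `impute_with_indicator` helper are hypothetical names chosen for illustration; in practice a DataFrame library would normally do this work.

```python
from statistics import median


def impute_with_indicator(values):
    """Median-impute missing entries (None) and return a 0/1 'was missing' flag per row."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    imputed = [fill if v is None else v for v in values]
    was_missing = [1 if v is None else 0 for v in values]
    return imputed, was_missing


# Hypothetical feature column with two missing values.
age = [34.0, None, 29.0, 41.0, None]
age_imputed, age_was_missing = impute_with_indicator(age)

print(age_imputed)
print(age_was_missing)
```

After imputation, the first two rows both hold 34.0, even though only the first was actually observed. Without the indicator column the model could not tell them apart.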