Question 6 of 163

The security team is exploring whether the Databricks secrets module can be used to connect to an external database.

After testing the code with all Python variables defined as strings, they upload the password to the secrets module and configure the correct permissions for the currently active user. They then modify their code as follows (leaving all other variables unchanged).

Which statement describes what will happen when the above code is executed?

    Correct Answer: A

    When a secret is retrieved with `dbutils.secrets.get`, Databricks handles the value securely by redacting it in notebook output to avoid unintentional exposure. When `print(password)` is executed, it therefore prints the string 'REDACTED' rather than the actual password. Because the connection details and permissions were configured correctly, the connection to the external table will succeed. The correct statement is that the connection to the external table will succeed and the string 'REDACTED' will be printed.
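
    A minimal sketch of the pattern described above, assuming a Databricks notebook where `spark` and `dbutils` are available; the secret scope, key, table, and JDBC connection details are hypothetical placeholders.

        # Retrieve the password from a (hypothetical) secret scope.
        password = dbutils.secrets.get(scope="db-creds", key="jdbc-password")

        # Notebook output redacts the value rather than showing the plain text.
        print(password)

        # The real secret value is still passed through to the JDBC connector,
        # so the connection itself works as long as permissions are correct.
        df = (spark.read
              .format("jdbc")
              .option("url", "jdbc:postgresql://example-host:5432/sales")
              .option("dbtable", "customers")
              .option("user", "svc_databricks")
              .option("password", password)
              .load())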

Question 7 of 163

The data science team has created and logged a production model using MLflow. The following code correctly imports and applies the production model to output the predictions as a new DataFrame named preds with the schema "customer_id LONG, predictions DOUBLE, date DATE".

The data science team would like predictions saved to a Delta Lake table with the ability to compare all predictions across time. Churn predictions will be made at most once per day.

Which code block accomplishes this task while minimizing potential compute costs?

    Correct Answer: A

    Option A uses the `saveAsTable` method with the mode set to `append`, so each daily batch of predictions is added to the table without overwriting previous entries. This preserves a historical record of predictions for comparison across time, and because churn predictions are made at most once per day, a simple batch append avoids the unnecessary compute cost of a continuously running job. Note that tables created in Databricks default to the Delta Lake format, so there is no need to explicitly specify the format when using `saveAsTable`.
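
    A minimal sketch of the append pattern described above, reusing the `preds` DataFrame from the question; the target table name `churn_preds` is a placeholder.

        # Append today's predictions so earlier rows are preserved for comparison
        # across time; tables created this way default to Delta Lake on Databricks.
        (preds.write
              .mode("append")
              .saveAsTable("churn_preds"))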

Question 8 of 163

An upstream source writes Parquet data as hourly batches to directories named with the current date. A nightly batch job runs the following code to ingest all data from the previous day as indicated by the date variable:

Assume that the fields customer_id and order_id serve as a composite key to uniquely identify each order.

If the upstream system is known to occasionally produce duplicate entries for a single order hours apart, which statement is correct?

    Correct Answer: B

    Each write to the orders table will contain only unique records, but newly written records may duplicate records already present in the target table. The code calls `dropDuplicates` on the specified keys (customer_id and order_id), which removes duplicates within the current batch before writing to the orders table. It does not, however, check for duplicates that already exist in the target table from previous writes. Because the upstream system can emit duplicate entries for a single order hours apart, the two copies can land on either side of a date boundary and be processed in different nightly batches, so duplicates may still persist in the orders table even though each batch is de-duplicated.
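
    A hypothetical reconstruction of the nightly ingest described above, assuming `date` holds the previous day's directory name; the mount path is a placeholder.

        # Read one day's Parquet batches and drop duplicates on the composite key.
        batch = (spark.read
                      .format("parquet")
                      .load(f"/mnt/raw_orders/{date}")
                      .dropDuplicates(["customer_id", "order_id"]))  # unique within this batch only

        # Appending does not compare against rows already in the target table,
        # so duplicates spanning two batches can still end up in orders.
        (batch.write
              .mode("append")
              .saveAsTable("orders"))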

Question 9 of 163

A junior member of the data engineering team is exploring the language interoperability of Databricks notebooks. The intended outcome of the code below is to register a view of all sales that occurred in countries on the continent of Africa that appear in the geo_lookup table.

Before executing the code, running SHOW TABLES on the current database indicates the database contains only two tables: geo_lookup and sales.

Which statement correctly describes the outcome of executing these command cells in order in an interactive notebook?

    Correct Answer: E

    Cmd 1 will succeed: it uses PySpark to filter the geo_lookup table to rows where the continent is Africa ('AF') and collect the country names, leaving countries_af as a Python list of country names. Cmd 2 will fail, however, because SQL and Python operate in separate execution contexts within Databricks: the SQL command cannot resolve the Python variable countries_af, so the list cannot be used in the SQL 'IN' clause and the command errors out.
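
    A plausible sketch of the two cells, assuming `country` and `continent` columns in geo_lookup and a `country` column in sales; the workaround at the end is illustrative and not part of the question.

        # Cmd 1 (Python): collect the African country names into a Python list.
        countries_af = [row.country
                        for row in spark.table("geo_lookup")
                                        .filter("continent = 'AF'")
                                        .select("country")
                                        .collect()]

        # Cmd 2 (%sql) cannot resolve the Python variable, so a statement such as
        #   CREATE VIEW sales_af AS SELECT * FROM sales WHERE country IN countries_af
        # fails with an analysis error. Issuing the SQL from Python instead keeps
        # everything in one execution context:
        in_list = ", ".join(f"'{c}'" for c in countries_af)
        spark.sql(f"""
            CREATE OR REPLACE TEMP VIEW sales_af AS
            SELECT * FROM sales
            WHERE country IN ({in_list})
        """)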

Question 10 of 163

A Delta table of weather records is partitioned by date and has the following schema: date DATE, device_id INT, temp FLOAT, latitude FLOAT, longitude FLOAT

To find all the records from within the Arctic Circle, you execute a query with the below filter: latitude > 66.3

Which statement describes how the Delta engine identifies which files to load?

    Correct Answer: D

    The Delta engine identifies which files to load by scanning the Delta transaction log for min and max statistics on the latitude column. Delta Lake captures per-file statistics, including the minimum and maximum values of each column, and uses them for data skipping: only files whose statistics indicate they may contain rows matching latitude > 66.3 are read. Because the table is partitioned by date rather than latitude, partition pruning does not help with this filter; the column statistics are what allow irrelevant files to be skipped.
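
    A brief sketch of the query described above; the table name `weather` is a placeholder.

        # Delta stores per-file min/max column statistics in the transaction log;
        # files whose latitude range lies entirely at or below 66.3 are skipped
        # without being read.
        arctic = spark.read.table("weather").filter("latitude > 66.3")
        arctic.count()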