Certified Data Engineer Professional

Here you have the best Databricks Certified Data Engineer Professional practice exam questions

  • You have 163 total questions to study from
  • Each page has 5 questions, making a total of 33 pages
  • These questions were last updated on November 16, 2024
Question 1 of 163

An upstream system has been configured to pass the date for a given batch of data to the Databricks Jobs API as a parameter. The notebook to be scheduled will use this parameter to load data with the following code: df = spark.read.format("parquet").load(f"/mnt/source/{date}")

Which code block should be used to create the date Python variable used in the above code block?

    Correct Answer: D

    To create the date variable, the notebook must read the parameter that the upstream system passed through the Databricks Jobs API. Parameters passed to a notebook task are exposed to the notebook as widgets, so the code block should declare a widget with dbutils.widgets.text("date", "null") and read its value with date = dbutils.widgets.get("date"). The value of date then matches what the API call supplied, and the f-string in the load path resolves to the folder for that batch.
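Parameters passed to a notebook task through the Jobs API are exposed to the notebook as widgets. The following is a minimal sketch of reading the date parameter that way. It assumes it runs in a Databricks notebook, where dbutils is predefined; the _LocalWidgets class is a hypothetical stand-in (not part of Databricks) that only lets the sketch run outside a notebook.

```python
class _LocalWidgets:
    """Hypothetical stand-in for dbutils.widgets so the sketch runs outside Databricks."""

    def __init__(self):
        self._values = {}

    def text(self, name, default_value, label=None):
        # Register the widget with a default only if no value exists yet.
        self._values.setdefault(name, default_value)

    def get(self, name):
        return self._values[name]


try:
    widgets = dbutils.widgets  # predefined inside a Databricks notebook
except NameError:
    widgets = _LocalWidgets()  # local fallback, for illustration only

# Declare the widget with a default so interactive runs do not fail,
# then read the value that the job run supplied.
widgets.text("date", "null")
date = widgets.get("date")

# Build the path used by spark.read.format("parquet").load(...)
source_path = f"/mnt/source/{date}"
print(source_path)
```

When the job passes a value for date, the widget returns that value instead of the default, so the same notebook works both interactively and on a schedule.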

Question 2 of 163

The Databricks workspace administrator has configured interactive clusters for each of the data engineering groups. To control costs, clusters are set to terminate after 30 minutes of inactivity. Each user should be able to execute workloads against their assigned clusters at any time of the day.

Assuming users have been added to a workspace but not granted any permissions, which of the following describes the minimal permissions a user would need to start and attach to an already configured cluster?

    Correct Answer: D

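    The minimal permission a user needs to start and attach to an already configured cluster is 'Can Restart' on the required cluster. Because the clusters terminate after 30 minutes of inactivity, 'Can Attach To' alone is not enough: it lets a user attach to a running cluster but not start a terminated one. 'Can Restart' covers both attaching and starting, without granting the broader 'Can Manage' rights.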
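One way to picture the grant is as a request to the Databricks Permissions API. The sketch below only builds the request and does not send it. The cluster ID and user name are hypothetical placeholders, and the endpoint path and field names should be checked against the current Permissions API reference before use.

```python
import json

# Cluster permission levels, from least to most privileged.
CLUSTER_PERMISSION_LEVELS = ("CAN_ATTACH_TO", "CAN_RESTART", "CAN_MANAGE")


def cluster_permission_request(cluster_id, user_name, level="CAN_RESTART"):
    """Build (but do not send) a request that grants one user a cluster permission."""
    if level not in CLUSTER_PERMISSION_LEVELS:
        raise ValueError(f"unknown permission level: {level}")
    return {
        "method": "PATCH",  # PATCH adds to the existing access control list
        "path": f"/api/2.0/permissions/clusters/{cluster_id}",
        "body": {
            "access_control_list": [
                {"user_name": user_name, "permission_level": level}
            ]
        },
    }


# Hypothetical cluster ID and user, for illustration only.
request = cluster_permission_request("0123-456789-abcde123", "engineer@example.com")
print(json.dumps(request, indent=2))
```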

Question 3 of 163

When scheduling Structured Streaming jobs for production, which configuration automatically recovers from query failures and keeps costs low?

    Correct Answer: D

    To schedule Structured Streaming jobs for production with automatic recovery from query failures and low cost, use a new job cluster, set retries to unlimited, and limit maximum concurrent runs to 1. A job cluster exists only for the run and is billed at the lower jobs compute rate, so it costs less than an always-on all-purpose cluster. Unlimited retries restart the query automatically after any failure. A maximum of 1 concurrent run ensures that only one instance of the query runs at a time, which avoids resource contention.
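The three settings can be sketched as a Jobs API 2.1 job definition. This is an illustrative payload under stated assumptions, not a complete production configuration: the job name, notebook path, Spark version, node type, and worker count are hypothetical placeholders.

```python
import json

job_settings = {
    "name": "structured-streaming-prod",  # hypothetical job name
    # Only one instance of the streaming query may run at a time.
    "max_concurrent_runs": 1,
    "tasks": [
        {
            "task_key": "stream",
            "notebook_task": {
                "notebook_path": "/Repos/prod/streaming_notebook"  # hypothetical path
            },
            # A new job cluster is created for the run and removed afterwards.
            "new_cluster": {
                "spark_version": "13.3.x-scala2.12",  # placeholder runtime version
                "node_type_id": "i3.xlarge",  # placeholder node type
                "num_workers": 2,  # placeholder size
            },
            # -1 means retry indefinitely, so a failed query is restarted.
            "max_retries": -1,
        }
    ],
}

print(json.dumps(job_settings, indent=2))
```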

Question 4 of 163

The data engineering team has configured a Databricks SQL query and alert to monitor the values in a Delta Lake table. The recent_sensor_recordings table contains an identifying sensor_id alongside the timestamp and temperature for the most recent 5 minutes of recordings.

The below query is used to create the alert:

The query is set to refresh each minute and always completes in less than 10 seconds. The alert is set to trigger when mean(temperature) > 120. Notifications are triggered to be sent at most every 1 minute.

If this alert raises notifications for 3 consecutive minutes and then stops, which statement must be true?

    Correct Answer: E

    The alert is set to trigger when the mean temperature exceeds 120. Given that notifications were raised for three consecutive minutes and then stopped, it must be true that the average temperature recordings for at least one sensor exceeded 120 on three consecutive executions of the query.
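The alert query itself is not shown above. Assuming it computes the mean temperature per sensor_id, as the explanation implies, the trigger condition can be simulated in plain Python. The function name and the sample readings below are hypothetical.

```python
from collections import defaultdict
from statistics import mean

THRESHOLD = 120


def alert_triggers(rows, threshold=THRESHOLD):
    """Return True if any sensor's mean temperature exceeds the threshold.

    rows is an iterable of (sensor_id, temperature) pairs, standing in for
    the contents of recent_sensor_recordings at one query execution.
    """
    by_sensor = defaultdict(list)
    for sensor_id, temperature in rows:
        by_sensor[sensor_id].append(temperature)
    return any(mean(temps) > threshold for temps in by_sensor.values())


# Hypothetical readings: sensor "a" averages 125, sensor "b" averages 70.
rows = [("a", 130), ("a", 120), ("b", 60), ("b", 80)]
print(alert_triggers(rows))
```

In this sample the mean over all four readings is 97.5, below the threshold, yet the alert still fires because one sensor's own mean is above 120. That is why the statement is about at least one sensor rather than about the table as a whole.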

Question 5 of 163

A junior developer complains that the code in their notebook isn't producing the correct results in the development environment. A shared screenshot reveals that while they're using a notebook versioned with Databricks Repos, they're using a personal branch that contains old logic. The desired branch named dev-2.3.9 is not available from the branch selection dropdown.

Which approach will allow this developer to review the current logic for this notebook?

    Correct Answer: B

    The correct approach is to use Repos to pull changes from the remote Git repository and select the dev-2.3.9 branch. This will allow the developer to access the most recent and desired logic for the notebook. Pulling changes from the remote repository ensures that the latest updates are incorporated and selecting the specific branch will provide the correct version of the code.