Professional Data Engineer on Google Cloud Platform

Here you have the best Google Professional Data Engineer practice exam questions

You have 319 total questions to study from
Each page has 5 questions, making a total of 64 pages
You can navigate through the pages using the buttons at the bottom
This questions were last updated on May 16, 2025
This site is not affiliated with or endorsed by Google.

Question 1 of 319

Your company built a TensorFlow neutral-network model with a large number of neurons and layers. The model fits well for the training data. However, when tested against new data, it performs poorly. What method can you employ to address this?

Threading

Serialization

Dropout Methods

Dimensionality Reduction

Correct Answer: C

The poor performance of the model on new data despite fitting well on the training data indicates overfitting. Overfitting occurs when a model learns the details and noise in the training data to an extent that it negatively impacts the model's performance on new data. Dropout Methods are a regularization technique used to prevent overfitting in neural networks. By randomly dropping neurons during training, dropout helps to ensure that the model does not rely too heavily on any individual neurons, thus promoting generalization and improving the model's performance on new data.

Question 2 of 319

You are building a model to make clothing recommendations. You know a user's fashion preference is likely to change over time, so you build a data pipeline to stream new data back to the model as it becomes available. How should you use this data to train the model?

Continuously retrain the model on just the new data.

Continuously retrain the model on a combination of existing data and the new data.

Train on the existing data while using the new data as your test set.

Train on the new data while using the existing data as your test set.

Correct Answer: B

To maintain accuracy and relevance in a clothing recommendation model, it is crucial to continuously retrain the model using both existing data and new data. This approach leverages the historical data to provide context and stability while incorporating the latest trends to keep the model up-to-date. Simply retraining on new data might make the model overly reactive to recent trends and lose the broader perspective provided by historical data. Conversely, using new or old data exclusively for testing is not effective for continuous learning and adaptability. Therefore, integrating both data sources ensures the model remains balanced and effective in reflecting changing user preferences.

Question 3 of 319

You designed a database for patient records as a pilot project to cover a few hundred patients in three clinics. Your design used a single database table to represent all patients and their visits, and you used self-joins to generate reports. The server resource utilization was at 50%. Since then, the scope of the project has expanded. The database must now store 100 times more patient records. You can no longer run the reports, because they either take too long or they encounter errors with insufficient compute resources. How should you adjust the database design?

Add capacity (memory and disk space) to the database server by the order of 200.

Shard the tables into smaller ones based on date ranges, and only generate reports with prespecified date ranges.

Normalize the master patient-record table into the patient table and the visits table, and create other necessary tables to avoid self-join.

Partition the table into smaller tables, with one for each clinic. Run queries against the smaller table pairs, and use unions for consolidated reports.

Correct Answer: C

Since the database must store significantly more patient records, it is important to improve the efficiency and scalability of the design. Normalizing the master patient-record table into separate tables for patients and visits will reduce data redundancy and improve query performance. This approach will help the database handle the increased data volume and allow for more efficient report generation by avoiding the performance issues associated with self-joins.

Question 4 of 319

You create an important report for your large team in Google Data Studio 360. The report uses Google BigQuery as its data source. You notice that visualizations are not showing data that is less than 1 hour old. What should you do?

Disable caching by editing the report settings.

Disable caching in BigQuery by editing table details.

Refresh your browser tab showing the visualizations.

Clear your browser history for the past hour then reload the tab showing the virtualizations.

Correct Answer: A

In Google Data Studio 360, the mechanism that causes visualizations to not show recent data (less than 1 hour old) is caching. By default, Data Studio uses caching to enhance performance by reducing the number of queries sent to the data source. To ensure that the data displayed in your report is always the most up-to-date, you should disable caching in the report settings. This forces Data Studio to retrieve the latest data directly from the data source (in this case, Google BigQuery) each time the report is viewed. Disabling caching can impact performance, but it ensures data accuracy. Therefore, the correct action is to disable caching by editing the report settings.

Question 5 of 319

An external customer provides you with a daily dump of data from their database. The data flows into Google Cloud Storage GCS as comma-separated values

(CSV) files. You want to analyze this data in Google BigQuery, but the data could have rows that are formatted incorrectly or corrupted. How should you build this pipeline?

Use federated data sources, and check data in the SQL query.

Enable BigQuery monitoring in Google Stackdriver and create an alert.

Import the data into BigQuery using the gcloud CLI and set max_bad_records to 0.

Run a Google Cloud Dataflow batch pipeline to import the data into BigQuery, and push errors to another dead-letter table for analysis.

Correct Answer: D

Run a Google Cloud Dataflow batch pipeline to import the data into BigQuery and push errors to another dead-letter table for analysis. Dataflow allows you to preprocess the data, making it possible to handle corrupted or incorrectly formatted rows effectively. By pushing problematic rows to a dead-letter table, you ensure only clean and correctly formatted data is loaded into BigQuery for accurate analysis while also retaining the problematic data for further inspection and resolution.