Professional Data Engineer on Google Cloud Platform Exam Questions

Below are practice exam questions for the Google Professional Data Engineer certification.

A few things to keep in mind about these practice exam questions:

  • You have 311 total questions to study from
  • Each page has 5 questions, making a total of 63 pages
  • These questions were last updated on September 13, 2024

Question 1 of 311


Your company built a TensorFlow neural-network model with a large number of neurons and layers. The model fits the training data well. However, when tested against new data, it performs poorly. What method can you employ to address this?

    Correct Answer: C

    Reference:

    https://medium.com/mlreview/a-simple-deep-learning-model-for-stock-price-prediction-using-tensorflow-30505541d877

Question 2 of 311


You are building a model to make clothing recommendations. You know a user's fashion preference is likely to change over time, so you build a data pipeline to stream new data back to the model as it becomes available. How should you use this data to train the model?

    Correct Answer: B

    To keep a clothing recommendation model accurate and relevant, continuously retrain it on a combination of existing data and new data. The historical data provides context and stability, while the new data captures the latest trends and keeps the model up to date. Retraining on new data alone might make the model overly reactive to recent trends and lose the broader perspective that historical data provides. Conversely, using either the new or the existing data only for testing does not let the model learn from it. Combining both data sources therefore keeps the model balanced and able to reflect changing user preferences.
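
    As a rough illustration of retraining on existing and new data together, the sketch below uses a toy per-user item counter in place of a real recommendation model. The function names (train_popularity_model, retrain, recommend) and the sample data are hypothetical and only meant to show the data-combination step, not any particular TensorFlow or Google Cloud API.

```python
from collections import Counter


def train_popularity_model(interactions):
    """Toy stand-in for a model: count how often each user chose each item."""
    model = {}
    for user, item in interactions:
        model.setdefault(user, Counter())[item] += 1
    return model


def retrain(existing_data, new_data):
    """Retrain from scratch on the union of existing and newly streamed data."""
    return train_popularity_model(existing_data + new_data)


def recommend(model, user):
    """Return the item the user has chosen most often."""
    return model[user].most_common(1)[0][0]


# Hypothetical historical interactions and a newer batch from the stream.
existing_data = [
    ("ana", "denim jacket"),
    ("ana", "denim jacket"),
    ("ana", "sneakers"),
]
new_data = [
    ("ana", "linen shirt"),
    ("ana", "linen shirt"),
    ("ana", "linen shirt"),
]

model = retrain(existing_data, new_data)
top_item = recommend(model, "ana")
```

    Here the recommendation follows the newer preference, while the counts from the historical data are still part of the model rather than being discarded.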

Question 3 of 311


You designed a database for patient records as a pilot project to cover a few hundred patients in three clinics. Your design used a single database table to represent all patients and their visits, and you used self-joins to generate reports. The server resource utilization was at 50%. Since then, the scope of the project has expanded. The database must now store 100 times more patient records. You can no longer run the reports, because they either take too long or they encounter errors with insufficient compute resources. How should you adjust the database design?

    Correct Answer: C

    Since the database must store significantly more patient records, it is important to improve the efficiency and scalability of the design. Normalizing the master patient-record table into separate tables for patients and visits will reduce data redundancy and improve query performance. This approach will help the database handle the increased data volume and allow for more efficient report generation by avoiding the performance issues associated with self-joins.
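
    A minimal sketch of this normalization, using Python's built-in sqlite3 module purely for illustration. The table and column names (patient_records, patients, visits, and so on) and the sample rows are assumptions, since the original schema is not given.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Assumed pilot design: one wide table, patient details repeated on every visit.
cur.execute(
    """CREATE TABLE patient_records (
           patient_name TEXT, birth_date TEXT, clinic TEXT, visit_date TEXT)"""
)
cur.executemany(
    "INSERT INTO patient_records VALUES (?, ?, ?, ?)",
    [
        ("Ana Ruiz", "1980-02-01", "North", "2024-01-05"),
        ("Ana Ruiz", "1980-02-01", "North", "2024-03-12"),
        ("Ben Cho", "1975-07-19", "South", "2024-02-20"),
    ],
)

# Normalized design: one row per patient, one row per visit.
cur.execute(
    """CREATE TABLE patients (
           patient_id INTEGER PRIMARY KEY,
           patient_name TEXT,
           birth_date TEXT,
           UNIQUE (patient_name, birth_date))"""
)
cur.execute(
    """CREATE TABLE visits (
           visit_id INTEGER PRIMARY KEY,
           patient_id INTEGER NOT NULL REFERENCES patients(patient_id),
           clinic TEXT,
           visit_date TEXT)"""
)

# Migrate: each distinct patient once, then each visit linked by patient_id.
cur.execute(
    """INSERT INTO patients (patient_name, birth_date)
       SELECT DISTINCT patient_name, birth_date FROM patient_records"""
)
cur.execute(
    """INSERT INTO visits (patient_id, clinic, visit_date)
       SELECT p.patient_id, r.clinic, r.visit_date
       FROM patient_records r
       JOIN patients p
         ON p.patient_name = r.patient_name AND p.birth_date = r.birth_date"""
)

# The report becomes a plain join between two tables instead of a self-join.
visit_counts = cur.execute(
    """SELECT p.patient_name, COUNT(*)
       FROM patients p
       JOIN visits v ON v.patient_id = p.patient_id
       GROUP BY p.patient_id, p.patient_name
       ORDER BY p.patient_name"""
).fetchall()
patient_count = cur.execute("SELECT COUNT(*) FROM patients").fetchone()[0]
```

    Patient details are stored once in patients, and reports join visits to patients on patient_id rather than joining the wide table to itself.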

Question 4 of 311


You create an important report for your large team in Google Data Studio 360. The report uses Google BigQuery as its data source. You notice that visualizations are not showing data that is less than 1 hour old. What should you do?

    Correct Answer: A

    Reference:

    https://support.google.com/datastudio/answer/7020039?hl=en

Question 5 of 311


An external customer provides you with a daily dump of data from their database. The data flows into Google Cloud Storage (GCS) as comma-separated values (CSV) files. You want to analyze this data in Google BigQuery, but the data could have rows that are formatted incorrectly or corrupted. How should you build this pipeline?

    Correct Answer: D

    Run a Google Cloud Dataflow batch pipeline to import the data into BigQuery and push errors to another dead-letter table for analysis. Dataflow allows you to preprocess the data, making it possible to handle corrupted or incorrectly formatted rows effectively. By pushing problematic rows to a dead-letter table, you ensure only clean and correctly formatted data is loaded into BigQuery for accurate analysis while also retaining the problematic data for further inspection and resolution.
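
    The dead-letter idea can be sketched without any cloud dependencies. The example below is plain Python, not the Dataflow or Apache Beam API: it assumes a hypothetical three-column schema (id, name, amount) and routes rows that fail parsing to a separate list, which stands in for the dead-letter table.

```python
import csv
import io

EXPECTED_COLUMNS = 3  # assumed schema: id, name, amount


def parse_row(fields):
    """Convert one CSV row to a typed record, raising ValueError if malformed."""
    if len(fields) != EXPECTED_COLUMNS:
        raise ValueError(f"expected {EXPECTED_COLUMNS} fields, got {len(fields)}")
    return {"id": int(fields[0]), "name": fields[1], "amount": float(fields[2])}


def split_rows(csv_text):
    """Separate clean records from rows destined for the dead-letter output."""
    good_rows, dead_letters = [], []
    for fields in csv.reader(io.StringIO(csv_text)):
        try:
            good_rows.append(parse_row(fields))
        except ValueError as err:
            # Keep the raw row and the reason so it can be inspected later.
            dead_letters.append({"raw": ",".join(fields), "error": str(err)})
    return good_rows, dead_letters


# Hypothetical daily dump: row 2 is missing a field, row 3 has a non-numeric id.
sample = "1,widget,9.99\n2,gadget\nabc,gizmo,1.50\n3,doohickey,4.00\n"
good_rows, dead_letters = split_rows(sample)
```

    In an actual pipeline, the clean records would be written to the main BigQuery table and the rejected rows, with their error messages, to the dead-letter table.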