
Professional Data Engineer Exam - Question 129


You use BigQuery as your centralized analytics platform. New data is loaded every day, and an ETL pipeline modifies the original data and prepares it for end users. This ETL pipeline is modified regularly and can generate errors, but sometimes an error is detected only after 2 weeks. You need to provide a method to recover from these errors, and your backups should be optimized for storage costs. How should you organize your data in BigQuery and store your backups?

Correct Answer: D

To recover from errors efficiently, the data should be organized in separate tables for each month. This isolates issues to specific time periods, so only the affected table needs to be restored. Snapshot decorators let you read a table as it existed at a point in time prior to the corruption, offering a cost-effective, fully integrated recovery path. Storage costs also stay optimized, because BigQuery stores only the bytes that differ between a snapshot and its base table rather than duplicating the entire dataset.
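As a rough sketch of this approach (not part of the original question; project, dataset, table names, and the timestamp are placeholders), a monthly table could be restored to an earlier state with a legacy snapshot decorator and bq cp. Note that the decorator timestamp must fall within BigQuery's time-travel window:

# Copy the table as it existed at the given epoch-milliseconds timestamp
# into a new table, leaving the corrupted table untouched.
$ bq cp 'your_project:your_dataset.sales_2024_06@1718000000000' 'your_project:your_dataset.sales_2024_06_restored'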

Discussion

17 comments
[Removed] · Option: B
Mar 22, 2020

Should be B

Ganshank · Option: B
Apr 14, 2020

B. The question is specifically about organizing the data in BigQuery and storing backups.

Lanro · Option: D
Jul 31, 2023

From the BigQuery documentation, benefits of using table snapshots include the following:
- Keep a record for longer than seven days. With BigQuery time travel, you can only access a table's data from seven days ago or more recently. With table snapshots, you can preserve a table's data from a specified point in time for as long as you want.
- Minimize storage cost. BigQuery only stores bytes that are different between a snapshot and its base table, so a table snapshot typically uses less storage than a full copy of the table.
So storing the data in GCS would mean full copies of the data for each table. Table snapshots are more optimal in this scenario.
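To illustrate the table-snapshot approach mentioned here (a sketch with placeholder names, not from the original thread), a snapshot of a monthly table can be created with bq cp and the --snapshot flag:

# Create a snapshot of the June table; --expiration is in seconds
# (here roughly one year) and --no_clobber refuses to overwrite.
$ bq cp --snapshot --no_clobber --expiration=31536000 'your_project:your_dataset.sales_2024_06' 'your_project:your_dataset.sales_2024_06_snapshot'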

ckanaar · Option: B
Sep 21, 2023

The answer is B. Why not D? Because snapshot costs can become high if many small changes are made to the base table: https://cloud.google.com/bigquery/docs/table-snapshots-intro#:~:text=Because%20BigQuery%20storage%20is%20column%2Dbased%2C%20small%20changes%20to%20the%20data%20in%20a%20base%20table%20can%20result%20in%20large%20increases%20in%20storage%20cost%20for%20its%20table%20snapshot. Since the question specifically states that the ETL pipeline is regularly modified, lots of small changes are likely. Combined with the requirement to optimize for storage costs, this makes option B the way to go.

Bahubali1988
Oct 2, 2023

90% of the questions have multiple proposed answers, and it's very hard to get into every discussion where there is no clear conclusion.

Nirca · Option: D
Oct 22, 2023

D. This solution is integrated; no extra code is needed.

lucaluca1982
Mar 17, 2023

Why not D?

John_Pongthorn · Option: B
Sep 23, 2022

B https://cloud.google.com/architecture/dr-scenarios-for-data#BigQuery

WillemHendr · Option: B
Jun 7, 2023

"Store your data in different tables for specific time periods. This method ensures that you need to restore only a subset of data to a new table, rather than a whole dataset." "Store the original data on Cloud Storage. This allows you to create a new table and reload the uncorrupted data. From there, you can adjust your applications to point to the new table." B

sdi_studiers · Option: D
Jun 9, 2023

D "With BigQuery time travel, you can only access a table's data from seven days ago or more recently. With table snapshots, you can preserve a table's data from a specified point in time for as long as you want." [source: https://cloud.google.com/bigquery/docs/table-snapshots-intro]

zellck · Option: B
Dec 3, 2022

B is the answer.

phidelics · Option: B
Jun 11, 2023

Organize in separate tables and store in GCS

cetanx
Jun 12, 2023

Just some additional info! Here is an example of an export job:
$ bq extract --destination_format CSV --compression GZIP 'your_project:your_dataset.your_new_table' 'gs://your_bucket/your_object.csv.gz'

cetanx
Jul 5, 2023

I will update my answer to D. Think of a scenario where you are in the last week of June and an error occurred 3 weeks ago (so still in June); you do not have an export of the June table yet, so you cannot recover the data, simply because the export does not exist yet. So snapshots are the way to go!

vamgcp · Option: B
Jul 27, 2023

Organizing your data in separate tables for each month makes it easier to identify the affected data and restore it. Exporting and compressing the data reduces storage costs, as you only need to store the compressed data in Cloud Storage. Storing your backups in Cloud Storage also makes restoring easier, since you can reload the data from Cloud Storage directly.

arien_chen · Option: D
Aug 20, 2023

Keyword: detected only after 2 weeks. Only a snapshot can resolve that problem.

Farah_007 · Option: B
Apr 10, 2024

From https://cloud.google.com/architecture/dr-scenarios-for-data#BigQuery it can't be D: "If the corruption is caught within 7 days, query the table to a point in time in the past to recover the table prior to the corruption using snapshot decorators." Here, errors are sometimes detected only after 2 weeks. Instead: "Store the original data on Cloud Storage. This allows you to create a new table and reload the uncorrupted data. From there, you can adjust your applications to point to the new table." => B

zevexWM · Option: D
Apr 24, 2024

Answer is D. Snapshots are different from time travel: they can hold data as long as we want. Furthermore, "BigQuery only stores bytes that are different between a snapshot and its base table", so it is pretty cost-effective as well. https://cloud.google.com/bigquery/docs/table-snapshots-intro#table_snapshots
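As a complement to the snapshot-creation example earlier in the thread (again a sketch with placeholder names), a table snapshot can be restored to a regular table with bq cp and the --restore flag:

# Restore the snapshot into a new standard table.
$ bq cp --restore --no_clobber 'your_project:your_dataset.sales_2024_06_snapshot' 'your_project:your_dataset.sales_2024_06_restored'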

Lenifia · Option: D
Jul 3, 2024

The best option is D. Organize your data in separate tables for each month, and use snapshot decorators to restore the table to a time prior to the corruption.