
Professional Data Engineer Exam - Question 129


You use BigQuery as your centralized analytics platform. New data is loaded every day, and an ETL pipeline modifies the original data and prepares it for end users. This ETL pipeline is modified regularly and can generate errors, but sometimes an error is detected only after 2 weeks. You need to provide a method to recover from these errors, and your backups should be optimized for storage costs. How should you organize your data in BigQuery and store your backups?

Correct Answer: D

To recover from errors efficiently, the data should be organized in separate tables for each month. This isolates issues to specific time periods, so only the affected table needs to be restored. Snapshot decorators let you read a table as it existed at a point in time prior to the corruption, offering a cost-effective, fully integrated recovery path. Storage costs also stay optimized, because BigQuery stores only the bytes that differ between a snapshot and its base table rather than duplicating the entire dataset.
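As a rough sketch of this approach (not part of the original question; project, dataset, table names, and the timestamp are placeholders), a monthly table could be restored to an earlier state with a legacy snapshot decorator and bq cp. Note that the decorator timestamp must fall within BigQuery's time-travel window:

# Copy the table as it existed at the given epoch-milliseconds timestamp
# into a new table, leaving the corrupted table untouched.
$ bq cp 'your_project:your_dataset.sales_2024_06@1718000000000' 'your_project:your_dataset.sales_2024_06_restored'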

Discussion

17 comments
[Removed] · Option: B
Mar 22, 2020

Should be B

Ganshank · Option: B
Apr 14, 2020

B. The question is specifically about organizing the data in BigQuery and storing backups.

Lanro · Option: D
Jul 31, 2023

From the BigQuery documentation, benefits of using table snapshots include the following:
- Keep a record for longer than seven days. With BigQuery time travel, you can only access a table's data from seven days ago or more recently. With table snapshots, you can preserve a table's data from a specified point in time for as long as you want.
- Minimize storage cost. BigQuery only stores bytes that are different between a snapshot and its base table, so a table snapshot typically uses less storage than a full copy of the table.
So storing the data in GCS would mean full copies of the data for each table. Table snapshots are more optimal in this scenario.
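To illustrate the table-snapshot approach mentioned here (a sketch with placeholder names, not from the original thread), a snapshot of a monthly table can be created with bq cp and the --snapshot flag:

# Create a snapshot of the June table; --expiration is in seconds
# (here roughly one year) and --no_clobber refuses to overwrite.
$ bq cp --snapshot --no_clobber --expiration=31536000 'your_project:your_dataset.sales_2024_06' 'your_project:your_dataset.sales_2024_06_snapshot'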

ckanaar · Option: B
Sep 21, 2023

The answer is B. Why not D? Because snapshot costs can become high if many small changes are made to the base table: https://cloud.google.com/bigquery/docs/table-snapshots-intro#:~:text=Because%20BigQuery%20storage%20is%20column%2Dbased%2C%20small%20changes%20to%20the%20data%20in%20a%20base%20table%20can%20result%20in%20large%20increases%20in%20storage%20cost%20for%20its%20table%20snapshot. Since the question specifically states that the ETL pipeline is regularly modified, lots of small changes are likely. Combined with the requirement to optimize for storage costs, this makes option B the way to go.

Bahubali1988
Oct 2, 2023

90% of the questions have multiple proposed answers, and it's very hard to get into every discussion where there is no clear conclusion.

Nirca · Option: D
Oct 22, 2023

D. This solution is integrated; no extra code is needed.

lucaluca1982
Mar 17, 2023

Why not D?

John_Pongthorn · Option: B
Sep 23, 2022

B https://cloud.google.com/architecture/dr-scenarios-for-data#BigQuery

WillemHendr · Option: B
Jun 7, 2023

"Store your data in different tables for specific time periods. This method ensures that you need to restore only a subset of data to a new table, rather than a whole dataset." "Store the original data on Cloud Storage. This allows you to create a new table and reload the uncorrupted data. From there, you can adjust your applications to point to the new table." B

sdi_studiers · Option: D
Jun 9, 2023

D "With BigQuery time travel, you can only access a table's data from seven days ago or more recently. With table snapshots, you can preserve a table's data from a specified point in time for as long as you want." [source: https://cloud.google.com/bigquery/docs/table-snapshots-intro]

zellck · Option: B
Dec 3, 2022

B is the answer.

phidelics · Option: B
Jun 11, 2023

Organize in separate tables and store in GCS

cetanx
Jun 12, 2023

Just some additional info! Here is an example of an export job:
$ bq extract --destination_format CSV --compression GZIP 'your_project:your_dataset.your_new_table' 'gs://your_bucket/your_object.csv.gz'

cetanx
Jul 5, 2023

I will update my answer to D. Think of a scenario where you are in the last week of June and an error occurred 3 weeks ago (so still in June); you do not have an export of the June table yet, so you cannot recover the data, simply because the export does not exist yet. So snapshots are the way to go!

vamgcp · Option: B
Jul 27, 2023

Organizing your data in separate tables for each month makes it easier to identify the affected data and restore it. Exporting and compressing the data reduces storage costs, as you only need to store the compressed data in Cloud Storage. Storing your backups in Cloud Storage also makes restoring easier, since you can reload the data from Cloud Storage directly.

arien_chen · Option: D
Aug 20, 2023

Keyword: detected only after 2 weeks. Only a snapshot can resolve that problem.

Farah_007 · Option: B
Apr 10, 2024

From https://cloud.google.com/architecture/dr-scenarios-for-data#BigQuery it can't be D: "If the corruption is caught within 7 days, query the table to a point in time in the past to recover the table prior to the corruption using snapshot decorators." Here, errors are sometimes detected only after 2 weeks. Instead: "Store the original data on Cloud Storage. This allows you to create a new table and reload the uncorrupted data. From there, you can adjust your applications to point to the new table." => B

zevexWM · Option: D
Apr 24, 2024

Answer is D. Snapshots are different from time travel: they can hold data as long as we want. Furthermore, "BigQuery only stores bytes that are different between a snapshot and its base table", so it is pretty cost-effective as well. https://cloud.google.com/bigquery/docs/table-snapshots-intro#table_snapshots
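As a complement to the snapshot-creation example earlier in the thread (again a sketch with placeholder names), a table snapshot can be restored to a regular table with bq cp and the --restore flag:

# Restore the snapshot into a new standard table.
$ bq cp --restore --no_clobber 'your_project:your_dataset.sales_2024_06_snapshot' 'your_project:your_dataset.sales_2024_06_restored'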

Lenifia · Option: D
Jul 3, 2024

The best option is D. Organize your data in separate tables for each month, and use snapshot decorators to restore the table to a time prior to the corruption.