
Professional Data Engineer Exam - Question 58


You architect a system to analyze seismic data. Your extract, transform, and load (ETL) process runs as a series of MapReduce jobs on an Apache Hadoop cluster. The ETL process takes days to process a data set because some steps are computationally expensive. Then you discover that a sensor calibration step has been omitted. How should you change your ETL process to carry out sensor calibration systematically in the future?

Correct Answer: B

Introducing a new MapReduce job to apply sensor calibration to the raw data, and ensuring all other MapReduce jobs are chained after it, is the best approach. This systematically guarantees that all data is calibrated right at the start of the ETL process, maintaining data integrity and consistency for downstream processing. Modifying the existing transformMapReduce jobs to apply sensor calibration would add complexity and potential for error, and it is less efficient, since calibration needs to be applied uniformly to all data before any further processing occurs.
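As a rough illustration of what that upstream calibration step could look like, here is a minimal sketch of a map-only Hadoop job that calibrates each raw record before any of the existing transform jobs run. The class names, record layout, calibrate() helper, and HDFS paths are all hypothetical, not taken from the question:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Map-only job: reads raw sensor readings, writes calibrated readings.
// Downstream transform jobs would then read /data/calibrated instead of /data/raw.
public class CalibrationJob {

    public static class CalibrationMapper
            extends Mapper<LongWritable, Text, NullWritable, Text> {

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Hypothetical record layout: "sensorId,timestamp,rawValue"
            String[] fields = value.toString().split(",");
            double raw = Double.parseDouble(fields[2]);
            double calibrated = calibrate(fields[0], raw);
            context.write(NullWritable.get(),
                    new Text(fields[0] + "," + fields[1] + "," + calibrated));
        }

        // Placeholder for the real per-sensor calibration function.
        private double calibrate(String sensorId, double raw) {
            return raw; // e.g. apply per-sensor offset/gain from a calibration table
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "sensor-calibration");
        job.setJarByClass(CalibrationJob.class);
        job.setMapperClass(CalibrationMapper.class);
        job.setNumReduceTasks(0); // map-only: calibration needs no aggregation
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path("/data/raw"));
        FileOutputFormat.setOutputPath(job, new Path("/data/calibrated"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```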

Discussion

16 comments
SteelWarrior (Option: B)
Sep 22, 2020

Should go with B, for two reasons. First, it is a cleaner approach, with a single job handling the calibration before the data is used in the pipeline. Second, doing this step in later stages can be complex, and maintaining those jobs in the future will become challenging.

Yiouk
Aug 3, 2021

B. The different MR jobs execute in series, so adding one more job makes sense in this case.

[Removed] (Option: A)
Mar 28, 2020

Answer: A. My take on this is that for sensor calibration you just need to update the transform function, rather than creating a whole new MapReduce job and storing/passing the values to the next job.

Jphix
May 28, 2021

It's B. A would involve changing every single job (notice it said jobS, plural, not a single job). Since those jobs are computationally intensive, you would be repeating a computationally intense process needlessly several times. SteelWarrior and YuriP are right on this one.
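For contrast, here is a minimal sketch of what option A would mean in practice: the same hypothetical calibrate() helper from the sketch above has to be dropped into the start of every existing transform mapper. Class names and record layout are illustrative only:

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Option A in practice: every existing transform mapper repeats the same
// calibration step before its own logic (names and layout are illustrative).
public class ExistingTransformMapper
        extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Hypothetical record layout: "sensorId,timestamp,rawValue"
        String[] fields = value.toString().split(",");
        double calibrated = calibrate(fields[0], Double.parseDouble(fields[2]));

        // ... the job's original transform logic would follow, now using 'calibrated'
        context.write(new Text(fields[0]), new Text(Double.toString(calibrated)));
    }

    // The same placeholder calibration has to be copied into every transform job.
    private double calibrate(String sensorId, double raw) {
        return raw; // e.g. per-sensor offset/gain correction
    }
}
```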

mark1223jkh
May 17, 2024

Why change all the jobs? Changing only the first job to add calibration should be enough, right?

YuriP (Option: B)
Aug 3, 2020

Should be B. It's a data quality step which has to go right after raw ingest. Otherwise you repeat the same step an unknown number of times (note "jobs", plural, in A), possibly for no reason, thereby extending ETL time.

sumanshu (Option: B)
Jun 28, 2021

Vote for 'B' (introduce a new job) over 'A' (modify the existing jobs).

[Removed] (Option: A)
Mar 21, 2020

It's between A and B. I would choose A.

ZIMARAKI (Option: B)
Jan 16, 2022

SteelWarrior's explanation is correct :)

anji007 (Option: B)
Oct 15, 2021

Ans: B. Adding a new job at the beginning of the chain makes more sense than updating the existing chain of jobs.

hendrixlives (Option: B)
Dec 18, 2021

SteelWarrior's answer is correct

medeis_jar (Option: B)
Jan 4, 2022

SteelWarrior's explanation is correct

lord_ryder (Option: B)
Jan 15, 2022

SteelWarrior's explanation is correct

odacir (Option: B)
Dec 7, 2022

Should be B. My reasoning: this is like an anti-corruption layer, which is good practice. A: if you modify your transformMapReduce jobs they will be harder to test and debug, so it's bad practice. C: introducing a manual operation is an anti-pattern and has a lot of problems. D: it's overkill and doesn't make sense in this scenario.

DGames (Option: B)
Dec 14, 2022

The best approach is to make calibration a separate job, because if we need to tune the calibration later it can be maintained without worrying about all the other jobs.

DipT (Option: B)
Dec 16, 2022

It is a much cleaner approach.

samdhimal (Option: B)
Jan 23, 2023

B. Introduce a new MapReduce job to apply sensor calibration to raw data, and ensure all other MapReduce jobs are chained after this. This approach would ensure that sensor calibration is systematically carried out every time the ETL process runs, as the new MapReduce job would be responsible for calibrating the sensors before the data is processed by the other steps. This would ensure that all data is calibrated before being analyzed, thus avoiding the omission of the sensor calibration step in the future. It also allows you to chain all other MapReduce jobs after this one, so that the calibrated data is used in all the downstream jobs.

samdhimal
Jan 23, 2023

Option A is not ideal, as it would be time-consuming to modify all the transformMapReduce jobs to apply sensor calibration before doing anything else, and there is a risk of introducing bugs or errors. Option C is not ideal, as it would rely on users to apply sensor calibration themselves, which would be inefficient and could introduce inconsistencies in the data. Option D is not ideal, as it would require a lot of simulation and testing to develop an algorithm that can predict the variance of data output accurately, and it may not be as accurate as calibrating the sensors directly.
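For the chaining samdhimal describes, one hedged sketch is to wire the jobs up with Hadoop's JobControl/ControlledJob classes so every existing transform job declares a dependency on the calibration job. The job names are illustrative, and each Job would still be configured with its mapper and input/output paths as usual (omitted here for brevity):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;

public class SeismicEtlDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Each Job would be configured with its mapper/reducer and paths as usual.
        Job calibration = Job.getInstance(conf, "sensor-calibration");
        Job transform1  = Job.getInstance(conf, "transform-step-1");
        Job transform2  = Job.getInstance(conf, "transform-step-2");

        ControlledJob cCalibration = new ControlledJob(calibration, null);
        ControlledJob cTransform1  = new ControlledJob(transform1, null);
        ControlledJob cTransform2  = new ControlledJob(transform2, null);

        // All downstream work depends on calibration having completed first.
        cTransform1.addDependingJob(cCalibration);
        cTransform2.addDependingJob(cTransform1);

        JobControl control = new JobControl("seismic-etl");
        control.addJob(cCalibration);
        control.addJob(cTransform1);
        control.addJob(cTransform2);

        // JobControl is a Runnable that keeps scheduling jobs whose
        // dependencies are satisfied; poll until everything has finished.
        Thread runner = new Thread(control);
        runner.start();
        while (!control.allFinished()) {
            Thread.sleep(5000);
        }
        control.stop();
    }
}
```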

jin0
Feb 28, 2023

What kinds of sensor calibration exist? I don't understand how computation in the pipeline would be expensive due to calibration being omitted.

Marwan95 (Option: A)
Jul 5, 2024

I'll choose A. Why? Because the process already takes DAYS, and adding another step will increase the time even more.