
Professional Data Engineer Exam - Question 58


You architect a system to analyze seismic data. Your extract, transform, and load (ETL) process runs as a series of MapReduce jobs on an Apache Hadoop cluster. The ETL process takes days to process a data set because some steps are computationally expensive. Then you discover that a sensor calibration step has been omitted. How should you change your ETL process to carry out sensor calibration systematically in the future?

Correct Answer: B

Introducing a new MapReduce job to apply sensor calibration to the raw data, and ensuring all other MapReduce jobs are chained after it, is the best approach. This systematically guarantees that all data is calibrated right at the start of the ETL process, maintaining data integrity and consistency for downstream processing. Modifying the existing transformMapReduce jobs to apply sensor calibration would add complexity and potential for error, and it is less efficient, since calibration needs to be applied uniformly to all data before any further processing occurs.
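As a rough illustration of what that upstream calibration step could look like, here is a minimal sketch of a map-only Hadoop job that calibrates each raw record before any of the existing transform jobs run. The class names, record layout, calibrate() helper, and HDFS paths are all hypothetical, not taken from the question:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Map-only job: reads raw sensor readings, writes calibrated readings.
// Downstream transform jobs would then read /data/calibrated instead of /data/raw.
public class CalibrationJob {

    public static class CalibrationMapper
            extends Mapper<LongWritable, Text, NullWritable, Text> {

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Hypothetical record layout: "sensorId,timestamp,rawValue"
            String[] fields = value.toString().split(",");
            double raw = Double.parseDouble(fields[2]);
            double calibrated = calibrate(fields[0], raw);
            context.write(NullWritable.get(),
                    new Text(fields[0] + "," + fields[1] + "," + calibrated));
        }

        // Placeholder for the real per-sensor calibration function.
        private double calibrate(String sensorId, double raw) {
            return raw; // e.g. apply per-sensor offset/gain from a calibration table
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "sensor-calibration");
        job.setJarByClass(CalibrationJob.class);
        job.setMapperClass(CalibrationMapper.class);
        job.setNumReduceTasks(0); // map-only: calibration needs no aggregation
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path("/data/raw"));
        FileOutputFormat.setOutputPath(job, new Path("/data/calibrated"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```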

Discussion

16 comments
SteelWarrior (Option: B)
Sep 22, 2020

Should go with B, for two reasons. First, it is a cleaner approach, with a single job handling the calibration before the data is used in the pipeline. Second, doing this step in later stages can be complex, and maintaining those jobs in the future will become challenging.

Yiouk
Aug 3, 2021

B. The different MR jobs execute in series, so adding one more job makes sense in this case.

[Removed] (Option: A)
Mar 28, 2020

Answer: A. My take on this is that for sensor calibration you just need to update the transform function, rather than creating a whole new MapReduce job and storing/passing the values to the next job.

Jphix
May 28, 2021

It's B. A would involve changing every single job (notice it said jobS, plural, not a single job). Since those jobs are computationally intensive, you would be repeating a computationally intense process needlessly several times. SteelWarrior and YuriP are right on this one.
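For contrast, here is a minimal sketch of what option A would mean in practice: the same hypothetical calibrate() helper from the sketch above has to be dropped into the start of every existing transform mapper. Class names and record layout are illustrative only:

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Option A in practice: every existing transform mapper repeats the same
// calibration step before its own logic (names and layout are illustrative).
public class ExistingTransformMapper
        extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Hypothetical record layout: "sensorId,timestamp,rawValue"
        String[] fields = value.toString().split(",");
        double calibrated = calibrate(fields[0], Double.parseDouble(fields[2]));

        // ... the job's original transform logic would follow, now using 'calibrated'
        context.write(new Text(fields[0]), new Text(Double.toString(calibrated)));
    }

    // The same placeholder calibration has to be copied into every transform job.
    private double calibrate(String sensorId, double raw) {
        return raw; // e.g. per-sensor offset/gain correction
    }
}
```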

mark1223jkh
May 17, 2024

Why change all the jobs? Changing only the first job to add calibration should be enough, right?

YuriP (Option: B)
Aug 3, 2020

Should be B. It's a data quality step which has to go right after raw ingest. Otherwise you repeat the same step an unknown number of times (note "jobs", plural, in A), possibly for no reason, thereby extending ETL time.

sumanshu (Option: B)
Jun 28, 2021

Vote for 'B' (introduce a new job) over 'A' (modify the existing jobs).

[Removed] (Option: A)
Mar 21, 2020

It's between A and B. I would choose A.

ZIMARAKI (Option: B)
Jan 16, 2022

SteelWarrior's explanation is correct :)

anji007 (Option: B)
Oct 15, 2021

Ans: B. Adding a new job at the beginning of the chain makes more sense than updating the existing chain of jobs.

hendrixlives (Option: B)
Dec 18, 2021

SteelWarrior's answer is correct

medeis_jar (Option: B)
Jan 4, 2022

SteelWarrior's explanation is correct

lord_ryder (Option: B)
Jan 15, 2022

SteelWarrior's explanation is correct

odacir (Option: B)
Dec 7, 2022

Should be B. My reasoning: this is like an anti-corruption layer, which is good practice. A: if you modify your transformMapReduce jobs they will be harder to test and debug, so it's bad practice. C: introducing a manual operation is an anti-pattern and has a lot of problems. D: it's overkill and doesn't make sense in this scenario.

DGames (Option: B)
Dec 14, 2022

The best approach is to make calibration a separate job, because if we need to tune the calibration later it can be maintained without worrying about all the other jobs.

DipT (Option: B)
Dec 16, 2022

It is a much cleaner approach.

samdhimal (Option: B)
Jan 23, 2023

B. Introduce a new MapReduce job to apply sensor calibration to raw data, and ensure all other MapReduce jobs are chained after this. This approach would ensure that sensor calibration is systematically carried out every time the ETL process runs, as the new MapReduce job would be responsible for calibrating the sensors before the data is processed by the other steps. This would ensure that all data is calibrated before being analyzed, thus avoiding the omission of the sensor calibration step in the future. It also allows you to chain all other MapReduce jobs after this one, so that the calibrated data is used in all the downstream jobs.

samdhimal
Jan 23, 2023

Option A is not ideal, as it would be time-consuming to modify all the transformMapReduce jobs to apply sensor calibration before doing anything else, and there is a risk of introducing bugs or errors. Option C is not ideal, as it would rely on users to apply sensor calibration themselves, which would be inefficient and could introduce inconsistencies in the data. Option D is not ideal, as it would require a lot of simulation and testing to develop an algorithm that can predict the variance of data output accurately, and it may not be as accurate as calibrating the sensors directly.
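For the chaining samdhimal describes, one hedged sketch is to wire the jobs up with Hadoop's JobControl/ControlledJob classes so every existing transform job declares a dependency on the calibration job. The job names are illustrative, and each Job would still be configured with its mapper and input/output paths as usual (omitted here for brevity):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;

public class SeismicEtlDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Each Job would be configured with its mapper/reducer and paths as usual.
        Job calibration = Job.getInstance(conf, "sensor-calibration");
        Job transform1  = Job.getInstance(conf, "transform-step-1");
        Job transform2  = Job.getInstance(conf, "transform-step-2");

        ControlledJob cCalibration = new ControlledJob(calibration, null);
        ControlledJob cTransform1  = new ControlledJob(transform1, null);
        ControlledJob cTransform2  = new ControlledJob(transform2, null);

        // All downstream work depends on calibration having completed first.
        cTransform1.addDependingJob(cCalibration);
        cTransform2.addDependingJob(cTransform1);

        JobControl control = new JobControl("seismic-etl");
        control.addJob(cCalibration);
        control.addJob(cTransform1);
        control.addJob(cTransform2);

        // JobControl is a Runnable that keeps scheduling jobs whose
        // dependencies are satisfied; poll until everything has finished.
        Thread runner = new Thread(control);
        runner.start();
        while (!control.allFinished()) {
            Thread.sleep(5000);
        }
        control.stop();
    }
}
```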

jin0
Feb 28, 2023

What kinds of sensor calibration exist? I don't understand how computation in the pipeline would be expensive due to calibration being omitted.

Marwan95 (Option: A)
Jul 5, 2024

I'll choose A. Why? Because the process already takes DAYS, and adding another step will increase the time even more.