DEA-C01 Exam Questions

DEA-C01 Exam - Question 80


A data engineer needs to build an extract, transform, and load (ETL) job. The ETL job will process daily incoming .csv files that users upload to an Amazon S3 bucket. The size of each S3 object is less than 100 MB.

Which solution will meet these requirements MOST cost-effectively?

Correct Answer: D

To process daily incoming .csv files that are less than 100 MB in size, writing an AWS Glue Python shell job and using pandas for transformation is the most cost-effective solution. AWS Glue’s Python shell jobs are suitable for smaller-scale ETL tasks and offer lower operational costs compared to other more complex and expensive solutions like Amazon EMR or Amazon EKS. Furthermore, the Python shell job can use a minimal fraction of a DPU for execution, specifically 1/16 DPU, making it highly efficient for the given task.
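As a rough illustration, a Glue Python shell script for this pattern might look like the sketch below. The bucket names, the prefix, and the transformation itself are placeholder assumptions; the sketch relies on the pandas and boto3 libraries that Glue Python shell environments provide.

```python
# Minimal sketch of a Glue Python shell ETL script (bucket names and prefix are hypothetical).
import io

import boto3
import pandas as pd

s3 = boto3.client("s3")
SRC_BUCKET = "example-incoming-bucket"    # hypothetical
DEST_BUCKET = "example-processed-bucket"  # hypothetical


def transform(df: pd.DataFrame) -> pd.DataFrame:
    # Example transformation: drop empty rows and normalize column names.
    df = df.dropna(how="all")
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    return df


# List the day's uploads and process each object. Each file is under 100 MB,
# so it fits comfortably in memory on a 1/16 DPU Python shell job.
resp = s3.list_objects_v2(Bucket=SRC_BUCKET, Prefix="daily/")
for obj in resp.get("Contents", []):
    body = s3.get_object(Bucket=SRC_BUCKET, Key=obj["Key"])["Body"].read()
    df = pd.read_csv(io.BytesIO(body))
    out = transform(df).to_csv(index=False)
    s3.put_object(
        Bucket=DEST_BUCKET,
        Key=f"processed/{obj['Key'].split('/')[-1]}",
        Body=out.encode("utf-8"),
    )
```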

Discussion

13 comments
halogi (Option: C)
Mar 28, 2024

AWS Glue Python shell jobs are billed at $0.44 per DPU-hour. AWS Glue PySpark jobs are billed at $0.29 per DPU-hour with flexible execution and $0.44 per DPU-hour with standard execution. Source: https://aws.amazon.com/glue/pricing/

GustonMari
Jul 11, 2024

That's true at 1 DPU, but the comparison doesn't hold because the minimum capacity for a PySpark job is 1 DPU, while the minimum for a Python shell job is 0.0625 DPU. So the Python shell job is much cheaper for a small dataset and a quick ETL transformation.
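To put those rates in perspective, here is a back-of-the-envelope comparison of the cost per daily run. The 10-minute runtime is an assumption, and the 2-DPU Spark minimum is the figure cited later in this thread.

```python
# Back-of-the-envelope cost comparison per daily run (runtime is an assumed 10 minutes).
RATE_PER_DPU_HOUR = 0.44  # USD, standard execution (aws.amazon.com/glue/pricing/)
RUNTIME_HOURS = 10 / 60   # assumption: the small CSV files finish in ~10 minutes

python_shell_dpus = 0.0625  # 1/16 DPU, the Python shell minimum
pyspark_dpus = 2            # minimum capacity cited for a Glue Spark job

python_shell_cost = python_shell_dpus * RATE_PER_DPU_HOUR * RUNTIME_HOURS
pyspark_cost = pyspark_dpus * RATE_PER_DPU_HOUR * RUNTIME_HOURS

print(f"Python shell: ${python_shell_cost:.4f} per run")  # ~$0.0046
print(f"PySpark:      ${pyspark_cost:.4f} per run")       # ~$0.1467
```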

lucas_rfsb (Option: C)
Apr 2, 2024

AWS Glue Python shell jobs are billed at $0.44 per DPU-hour. AWS Glue PySpark jobs are billed at $0.29 per DPU-hour with flexible execution and $0.44 per DPU-hour with standard execution. Source: https://aws.amazon.com/glue/pricing/

atu1789 (Option: D)
Feb 17, 2024

Option D (write an AWS Glue Python shell job and use pandas to transform the data) is the most cost-effective solution for the described scenario. AWS Glue's Python shell jobs are a good fit for smaller-scale ETL tasks, especially when dealing with .csv files that are less than 100 MB each. The use of pandas, a powerful and efficient data manipulation library in Python, makes it an ideal tool for processing and transforming these types of files. This approach avoids the overhead and additional costs associated with more complex solutions like Amazon EKS or EMR, which are generally better suited for larger-scale, more complex data processing tasks. Given the requirement to process small, daily incoming .csv files, this solution provides the necessary functionality with minimal resources, aligning well with the goal of cost-effectiveness.

rralucard_ (Option: C)
Feb 2, 2024

AWS Glue is a fully managed ETL service, which means you don't need to manage infrastructure, and it automatically scales to handle your data processing needs. This reduces operational overhead and cost. PySpark, as a part of AWS Glue, is a powerful and widely-used framework for distributed data processing, and it's well-suited for handling data transformations on a large scale.
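For contrast, option C would follow the standard Glue PySpark job pattern, roughly like the sketch below. The S3 paths and the transformation are placeholder assumptions, not the exact job from the question.

```python
# Sketch of the option C approach: an AWS Glue PySpark job (S3 paths are hypothetical).
import sys

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the day's CSV uploads into a DynamicFrame.
dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://example-incoming-bucket/daily/"]},
    format="csv",
    format_options={"withHeader": True},
)

# Example transformation: drop rows that are entirely empty.
transformed = DynamicFrame.fromDF(dyf.toDF().dropna(how="all"), glue_context, "transformed")

# Write the result back to S3 as CSV.
glue_context.write_dynamic_frame.from_options(
    frame=transformed,
    connection_type="s3",
    connection_options={"path": "s3://example-processed-bucket/processed/"},
    format="csv",
)

job.commit()
```

Even with the Spark boilerplate aside, this job cannot run below the Spark capacity minimum, which is the crux of the cost argument in this thread.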

pypelyncar (Option: D)
Jun 12, 2024

This is a good candidate for two plausible answers; really, the Spark and Python shell approaches are similar here. I would go with pandas, although it's 50/50 and it could be Spark. I hope not to find this question in the exam.

agg42 (Option: D)
Apr 1, 2024

https://medium.com/@navneetsamarth/reduce-aws-cost-using-glue-python-shell-jobs-70a955d4359f#:~:text=The%20cheapest%20Glue%20Spark%20ETL,1%2F16th%20of%20a%20DPU.&text=This%20can%20result%20in%20massive,just%20a%20better%20design%20overall!

chakka90
Apr 30, 2024

D. Even though PySpark is cheaper per DPU-hour, you have to use a minimum of 2 DPUs, which would increase the cost anyway, so I feel that D should be correct.

VerRi (Option: C)
May 21, 2024

PySpark with Spark (flexible execution): $0.29/hr for 1 DPU
PySpark with Spark (standard execution): $0.44/hr for 1 DPU
Python shell with pandas: $0.44/hr for 1 DPU

GiorgioGss (Option: D)
Mar 19, 2024

D is cheaper than C. Not as scalable, but it is cheaper...

Leo87656789 (Option: D)
Apr 27, 2024

Option D: Even though the Python shell job is more expensive on a DPU-hour basis, you can select the "1/16 DPU" option in the job details for a Python shell job, which is definitely cheaper than a PySpark job.
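The same 1/16 DPU capacity can also be set programmatically when the job is created. A sketch using the boto3 Glue API follows; the role ARN and script location are hypothetical placeholders.

```python
# Sketch: creating a Python shell job pinned to 1/16 DPU via the Glue API
# (role ARN and script location are hypothetical placeholders).
import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="daily-csv-etl",
    Role="arn:aws:iam::123456789012:role/example-glue-role",
    Command={
        "Name": "pythonshell",
        "ScriptLocation": "s3://example-scripts-bucket/daily_csv_etl.py",
        "PythonVersion": "3.9",
    },
    MaxCapacity=0.0625,  # Python shell jobs accept 0.0625 or 1.0 DPU
)
```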

khchan123 (Option: D)
Apr 28, 2024

D. While AWS Glue PySpark jobs are scalable and suitable for large workloads, option C may be overkill for processing small .csv files (less than 100 MB each). The overhead of using Apache Spark may not be cost-effective for this specific use case.

cloudata (Option: D)
May 5, 2024

Python shell is cheaper and can handle small to medium tasks. https://docs.aws.amazon.com/whitepapers/latest/aws-glue-best-practices-build-performant-data-pipeline/additional-considerations.html

LR2023 (Option: D)
Jul 16, 2024

Going with D. https://docs.aws.amazon.com/whitepapers/latest/aws-glue-best-practices-build-performant-data-pipeline/additional-considerations.html