MLS-C01 Exam - Question 311


A data engineer wants to perform exploratory data analysis (EDA) on a petabyte of data. The data engineer does not want to manage compute resources and wants to pay only for queries that are run. The data engineer must write the analysis by using Python from a Jupyter notebook.

Which solution will meet these requirements?

Correct Answer: A

The best fit is to run Apache Spark from within Amazon Athena. Athena is serverless, so there are no compute resources to manage, and charges accrue only for the work that is actually run. Athena for Apache Spark also provides Jupyter-compatible notebooks in which the analysis can be written in Python. Amazon SageMaker supports Jupyter notebooks and Python as well, but it fits the pay-only-for-queries requirement less closely: SageMaker resources can incur charges for as long as they are allocated, even when no work is running. The correct choice is therefore A, Apache Spark from within Amazon Athena.
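For readers who want to see what "Apache Spark from within Amazon Athena" can look like outside the console notebook editor, here is a minimal sketch that drives a Spark session through the Athena API. It assumes `athena` behaves like `boto3.client("athena")` and that `workgroup` names a Spark-enabled workgroup. The function name, the workgroup, the DPU limit, and the polling interval are all illustrative assumptions, not something the question specifies.

```python
import time


def run_spark_code(athena, workgroup, code, poll_seconds=5.0):
    """Run one block of PySpark code in a fresh Athena Spark session.

    `athena` is assumed to behave like boto3.client("athena").
    `workgroup` is a hypothetical Spark-enabled workgroup name.
    Returns the final calculation state, e.g. "COMPLETED" or "FAILED".
    """
    session_id = athena.start_session(
        WorkGroup=workgroup,
        EngineConfiguration={"MaxConcurrentDpus": 20},  # assumed limit
    )["SessionId"]
    try:
        # Wait until the session is ready to accept calculations.
        while True:
            status = athena.get_session_status(SessionId=session_id)
            state = status["Status"]["State"]
            if state == "IDLE":
                break
            if state in ("FAILED", "TERMINATED", "DEGRADED"):
                raise RuntimeError(f"session ended in state {state}")
            time.sleep(poll_seconds)

        calc_id = athena.start_calculation_execution(
            SessionId=session_id, CodeBlock=code
        )["CalculationExecutionId"]

        # Poll until the calculation reaches a terminal state.
        while True:
            status = athena.get_calculation_execution_status(
                CalculationExecutionId=calc_id
            )
            state = status["Status"]["State"]
            if state in ("COMPLETED", "FAILED", "CANCELED"):
                return state
            time.sleep(poll_seconds)
    finally:
        # End the session once the work is done.
        athena.terminate_session(SessionId=session_id)
```

In practice the `code` argument would be ordinary PySpark, for example a `spark.read` call followed by `describe()` on a dataset in S3. Inside an Athena notebook none of this plumbing is needed, because the notebook manages the session and cells run directly.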

Discussion

AIWave (Option: B)
Mar 9, 2024

A: No - Athena does not support Python code.
B: Yes - SageMaker is serverless, and SageMaker Processing allows you to run Spark jobs from a Jupyter notebook using Python. You only pay for resources used during processing jobs.
C: No - involves managing the EMR cluster. You pay for running EC2 instances whether they are in use or not.
D: No - Redshift can't run Spark jobs and has no native support for Python/Jupyter notebooks.

rav009 (Option: A)
Mar 18, 2024

https://docs.amazonaws.cn/en_us/athena/latest/ug/notebooks-spark-getting-started.html

vkbajoria (Option: B)
Mar 22, 2024

Serverless, Python, and Notebook are key elements for making the decision. It's B

vkbajoria
Mar 30, 2024

I changed my mind: Athena supports Spark. It's A

ddaanndann
Mar 30, 2024

Correct Answer: A https://docs.aws.amazon.com/athena/latest/ug/notebooks-spark.html

shivamgulati13
Mar 17, 2024

Just thinking out loud: why couldn't it be Redshift as well? The question also mentions paying per query and handling a petabyte of data. Spark can be integrated with Amazon Redshift, and Redshift has a serverless version too. https://aws.amazon.com/blogs/aws/new-amazon-redshift-integration-with-apache-spark/

JonSno
Mar 30, 2024

It's A - Using Apache Spark on Amazon Athena https://aws-sdk-pandas.readthedocs.io/en/3.2.1/tutorials/041%20-%20Apache%20Spark%20on%20Amazon%20Athena.html

rav009 (Option: A)
May 27, 2024

https://docs.aws.amazon.com/athena/latest/ug/notebooks-spark-editor.html

pandkast (Option: A)
Jun 23, 2024

https://docs.aws.amazon.com/athena/latest/ug/notebooks-spark-working-with-notebooks.html

eicresv2 (Option: A)
Jul 12, 2024

A and not B, also because of the requirement to pay only for the queries that you run. With B, notebooks will continue to run and cost money.
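The billing argument in this thread can be made concrete with a little arithmetic. The sketch below is illustrative only: the hourly rate and the hours are made-up numbers, not AWS prices, and both function names are hypothetical.

```python
def pay_per_use_cost(busy_hours, rate_per_hour):
    """Cost when billing covers only the time work is actually running."""
    return busy_hours * rate_per_hour


def always_on_cost(provisioned_hours, rate_per_hour):
    """Cost when a resource bills for every hour it stays allocated."""
    return provisioned_hours * rate_per_hour


# Hypothetical figures: the same rate for both models, 2 hours of real
# work spread over an 8-hour day during which the notebook stays up.
RATE = 1.0
print(pay_per_use_cost(2, RATE))  # 2.0
print(always_on_cost(8, RATE))    # 8.0
```

With identical rates, the only difference is the idle time, which is the point the commenters make about notebooks that keep running.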