MLS-C01 Exam - Question 311

Question

A data engineer wants to perform exploratory data analysis (EDA) on a petabyte of data. The data engineer does not want to manage compute resources and wants to pay only for queries that are run. The data engineer must write the analysis by using Python from a Jupyter notebook.

Which solution will meet these requirements?

Examice · Accepted Answer

The best fit is to use Apache Spark from within Amazon Athena. Athena is serverless, so there are no compute resources to manage, and its notebook interface runs Python (PySpark) code against the data, with charges accruing only for the work that is actually run. Amazon SageMaker also supports Jupyter notebooks and Python, but it does not match the pay-per-query requirement as closely: SageMaker resources can incur charges while they are allocated, even when they are not actively used.
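As a rough illustration of what the Python side of such an analysis might look like, here is a minimal PySpark sketch. It assumes the notebook already provides a SparkSession named `spark`; the function name `profile_dataset`, the S3 path, and the column name are hypothetical placeholders, not anything taken from the question.

```python
def profile_dataset(spark, path, category_col):
    """Run a few basic EDA steps on a Parquet dataset.

    `spark` is assumed to be the SparkSession supplied by the notebook.
    `path` and `category_col` are hypothetical placeholders.
    """
    # Define the DataFrame; the full scan is deferred until an action runs.
    df = spark.read.parquet(path)

    # Show column names and types.
    df.printSchema()

    # Summary statistics (count, mean, stddev, min, max).
    df.describe().show()

    # Ten most frequent values of one categorical column.
    (df.groupBy(category_col)
       .count()
       .orderBy("count", ascending=False)
       .show(10))

    return df
```

In a notebook cell this would be called as something like `profile_dataset(spark, "s3://example-bucket/data/", "category")`, with the bucket and column swapped for real ones.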

AIWave · Answer

A: No - Athena does not support Python code.
B: Yes - SageMaker is serverless, and SageMaker Processing lets you run Spark jobs from a Jupyter notebook using Python. You pay only for the resources used during processing jobs.
C: No - This involves managing the EMR cluster. You pay for running EC2 instances whether or not they are in use.
D: No - Redshift can't run Spark jobs and has no native support for Python or Jupyter notebooks.

rav009 · Answer

https://docs.amazonaws.cn/en_us/athena/latest/ug/notebooks-spark-getting-started.html

vkbajoria · Answer

Serverless, Python, and Notebook are key elements for making the decision. It's B

ddaanndann · Answer

Correct Answer: A

https://docs.aws.amazon.com/athena/latest/ug/notebooks-spark.html

shivamgulati13 · Answer

Just thinking out loud: why couldn't it be Redshift as well? The question also mentions paying per query and handling a petabyte of data.
Spark integration is available for Amazon Redshift, and Redshift has a serverless version too.
https://aws.amazon.com/blogs/aws/new-amazon-redshift-integration-with-apache-spark/

JonSno · Answer

It's A - Using Apache Spark on Amazon Athena 
https://aws-sdk-pandas.readthedocs.io/en/3.2.1/tutorials/041%20-%20Apache%20Spark%20on%20Amazon%20Athena.html

rav009 · Answer

https://docs.aws.amazon.com/athena/latest/ug/notebooks-spark-editor.html

pandkast · Answer

https://docs.aws.amazon.com/athena/latest/ug/notebooks-spark-working-with-notebooks.html

eicresv2 · Answer

A rather than B, also because of the requirement to pay only for the queries you run. With B, notebooks keep running and costing money.
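To make the billing difference concrete, here is a small arithmetic sketch comparing a pay-per-use model with an always-on resource. Every number in it (the hourly rate, the hours of activity, the 730-hour month) is a made-up placeholder and not an AWS price; the function names are hypothetical too.

```python
# All figures are illustrative placeholders, not real AWS pricing.
HOURS_PER_MONTH = 730


def pay_per_use_cost(active_hours: float, rate_per_hour: float) -> float:
    """Cost when billing stops as soon as the work finishes."""
    return active_hours * rate_per_hour


def always_on_cost(rate_per_hour: float,
                   hours_provisioned: float = HOURS_PER_MONTH) -> float:
    """Cost when a resource is billed for every hour it stays provisioned."""
    return hours_provisioned * rate_per_hour


active_hours = 20.0  # hypothetical hours of actual analysis per month
rate = 1.0           # hypothetical hourly rate, same for both models

on_demand = pay_per_use_cost(active_hours, rate)
always_on = always_on_cost(rate)
print(f"pay-per-use: {on_demand:.2f}, always-on: {always_on:.2f}")
```

With identical hourly rates, the gap comes entirely from idle time, which is the point being made about notebooks that are left running.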
