Exam MLS-C01
Question 311

A data engineer wants to perform exploratory data analysis (EDA) on a petabyte of data. The data engineer does not want to manage compute resources and wants to pay only for queries that are run. The data engineer must write the analysis by using Python from a Jupyter notebook.

Which solution will meet these requirements?

    Correct Answer: A

    The best way to perform exploratory data analysis (EDA) on a petabyte of data, without managing compute resources and while paying only for the queries that are run, is to use Apache Spark from within Amazon Athena. Athena is serverless, provides a Jupyter-compatible notebook interface for Spark, and charges only for the queries or calculations that are actually run. Amazon SageMaker also supports Jupyter notebooks and Python, but it does not match the pay-per-query requirement as closely: SageMaker can incur costs while resources are allocated but idle. Therefore, the correct choice is to use Apache Spark from within Amazon Athena.
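To make option A concrete, here is a minimal sketch of submitting a PySpark block to Athena for Apache Spark from Python. It assumes `athena` is a boto3 Athena client (for example `boto3.client("athena")`) and that a Spark-enabled workgroup already exists. The workgroup name, the S3 path, and the helper name `run_spark_calculation` are hypothetical, and the state names reflect my reading of the Athena API rather than anything stated in the question.

```python
import time

# PySpark code to run remotely. Inside an Athena Spark session the `spark`
# object is predefined. The S3 path is a hypothetical placeholder.
EDA_CODE = """
df = spark.read.parquet("s3://example-bucket/events/")
df.describe().show()
"""

# Assumed terminal states for a calculation and failure states for a session.
CALC_TERMINAL = {"COMPLETED", "FAILED", "CANCELED"}
SESSION_FAILED = {"FAILED", "TERMINATED", "DEGRADED"}


def run_spark_calculation(athena, workgroup, code, max_dpus=4, poll_seconds=2.0):
    """Run one code block in a fresh Athena Spark session; return its final state."""
    session_id = athena.start_session(
        WorkGroup=workgroup,
        EngineConfiguration={"MaxConcurrentDpus": max_dpus},
    )["SessionId"]
    try:
        # Wait until the session is ready to accept calculations.
        while True:
            state = athena.get_session_status(SessionId=session_id)["Status"]["State"]
            if state == "IDLE":
                break
            if state in SESSION_FAILED:
                raise RuntimeError(f"Session ended in state {state}")
            time.sleep(poll_seconds)

        # Submit the PySpark code block.
        calc_id = athena.start_calculation_execution(
            SessionId=session_id, CodeBlock=code
        )["CalculationExecutionId"]

        # Poll until the calculation reaches a terminal state.
        while True:
            state = athena.get_calculation_execution_status(
                CalculationExecutionId=calc_id
            )["Status"]["State"]
            if state in CALC_TERMINAL:
                return state
            time.sleep(poll_seconds)
    finally:
        # Always end the session so no compute stays allocated afterwards.
        athena.terminate_session(SessionId=session_id)
```

With real credentials, a call would look like `run_spark_calculation(boto3.client("athena"), "my-spark-workgroup", EDA_CODE)`, where the workgroup name is again a placeholder. Terminating the session in `finally` mirrors the requirement in the question: nothing is left running once the analysis finishes.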

Discussion
AIWave (Option: B)

A: No - Athena does not support Python code.
B: Yes - SageMaker is serverless, and SageMaker Processing allows you to run Spark jobs from a Jupyter notebook using Python. You pay only for resources used during processing jobs.
C: No - It involves managing the EMR cluster. You pay for running EC2 instances whether they are in use or not.
D: No - Redshift can't run Spark jobs and has no native support for Python or Jupyter notebooks.

rav009 (Option: A)

https://docs.amazonaws.cn/en_us/athena/latest/ug/notebooks-spark-getting-started.html

ddaanndann

Correct Answer: A https://docs.aws.amazon.com/athena/latest/ug/notebooks-spark.html

vkbajoria (Option: B)

Serverless, Python, and Notebook are key elements for making the decision. It's B

vkbajoria

I changed my mind. Athena supports Spark, so it's A.

eicresv2 (Option: A)

A, and not B, also because of the requirement to pay only for queries that you run. SageMaker notebooks continue to run and cost money even when no query is running.

pandkast (Option: A)

https://docs.aws.amazon.com/athena/latest/ug/notebooks-spark-working-with-notebooks.html

rav009 (Option: A)

https://docs.aws.amazon.com/athena/latest/ug/notebooks-spark-editor.html

JonSno

It's A - Using Apache Spark on Amazon Athena: https://aws-sdk-pandas.readthedocs.io/en/3.2.1/tutorials/041%20-%20Apache%20Spark%20on%20Amazon%20Athena.html

shivamgulati13

Just thinking out loud: why couldn't it be Redshift as well? The question also mentions paying per query and handling a petabyte of data. Amazon Redshift offers an integration with Apache Spark, and Redshift has a serverless version too. https://aws.amazon.com/blogs/aws/new-amazon-redshift-integration-with-apache-spark/