DP-201 Exam - Question 7


HOTSPOT -

You are designing a data processing solution that will run as a Spark job on an HDInsight cluster. The solution will be used to provide near real-time information about online ordering for a retailer.

The solution must include a page on the company intranet that displays summary information.

The summary information page must meet the following requirements:

✑ Display a summary of sales to date grouped by product categories, price range, and review scope.

✑ Display sales summary information including total sales, sales as compared to one day ago and sales as compared to one year ago.

✑ Reflect information for new orders as quickly as possible.

You need to recommend a design for the solution.

What should you recommend? To answer, select the appropriate configuration in the answer area.

Hot Area:

[Image: answer area]

Correct Answer:

[Image: answer area with the highlighted selections]

Box 1: DataFrame -

DataFrames -

Best choice in most situations.

Provides query optimization through Catalyst.

Whole-stage code generation.

Direct memory access.

Low garbage collection (GC) overhead.

Not as developer-friendly as DataSets, as there are no compile-time checks or domain object programming.

Box 2: parquet -

The best format for performance is parquet with snappy compression, which is the default in Spark 2.x. Parquet stores data in columnar format, and is highly optimized in Spark.

Incorrect Answers:

DataSets -

Good in complex ETL pipelines where the performance impact is acceptable.

Not good in aggregations where the performance impact can be considerable.

RDDs -

You do not need to use RDDs, unless you need to build a new custom RDD.

No query optimization through Catalyst.

No whole-stage code generation.

High GC overhead.

Reference:

https://docs.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-perf

Discussion

16 comments
kempstonjoystick
Apr 2, 2020

The highlighted answer and the explanation differ. It should be DataFrame, I believe.

apz333
Apr 10, 2020

I think it should be dataframe as well. In most cases parquet and dataframe are the best choice.

frakcha
May 9, 2020

They say Dataset is good for complex ETL situations: https://docs.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-perf

AhmedReda
Jun 25, 2020

The question needs quick processing, but Datasets add overhead. The query is also an aggregation, and Datasets are not good at that. DataSets: add serialization/deserialization overhead, high GC overhead, not good in aggregations where the performance impact can be considerable. DataFrames: best choice in most situations, direct memory access.

Tombarc
Apr 23, 2020

I think it's dataframe too.

Runi
Jun 10, 2020

Datasets are not good in aggregations where the performance impact can be considerable, so I think DataFrame should be the correct one. Can anyone confirm, please? Thanks.

ismaelrihawi
May 21, 2021

Data abstraction = Dataframe

serger
May 27, 2020

dataframe for sure

BaisArun
Nov 22, 2020

Dataset is not good for aggregation. It should be DataFrame.

syu31svc
Dec 9, 2020

https://docs.microsoft.com/en-us/azure/hdinsight/spark/optimize-data-storage: "Parquet stores data in columnar format, and is highly optimized in Spark." "DataFrames Best choice in most situations."

BobFar
May 19, 2021

DataFrame is correct: https://docs.microsoft.com/en-us/azure/hdinsight/spark/optimize-data-storage

Nihar258255
Nov 8, 2020

Can someone correct the answers?

Deepu1987
Feb 17, 2021

The selection shown in the display is wrong. It's actually DataFrame. [Reason for elimination: not as developer-friendly as DataSets, as there are no compile-time checks or domain object programming; you don't need to use RDDs unless you need to build a new custom RDD.] Anyhow, "Parquet" is selected.

HichemZe
Jul 27, 2021

1 - DataFrame. 2 - Data format = Avro, because only Avro supports streaming (as opposed to Parquet).

satyamkishoresingh
Aug 11, 2021

The practical combination is DataFrame + Parquet. The answer explanation here is ambiguous.

mchatrvd
Aug 12, 2021

Does anyone know why ExamTopics has taken the AWS certification questions offline? The AWS certification content that used to be there is gone.

victor90
Nov 30, 2021

Hi, I found the link to the associate SA exam. https://www.examtopics.com/exams/amazon/aws-certified-solutions-architect-associate-saa-c02/view/

hsetin
Sep 3, 2021

1. DataFrame 2. Parquet. Confirmed.