Exam DP-203
Question 261

HOTSPOT -

You develop a dataset named DBTBL1 by using Azure Databricks.

DBTBL1 contains the following columns:

✑ SensorTypeID

✑ GeographyRegionID

✑ Year

✑ Month

✑ Day

✑ Hour

✑ Minute

✑ Temperature

✑ WindSpeed

✑ Other

You need to store the data to support daily incremental load pipelines that vary for each GeographyRegionID. The solution must minimize storage costs.

How should you complete the code? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.

Hot Area:

    Correct Answer:

    Box 1: .partitionBy

    Incorrect Answers:

    ✑ .format:

    Method: format()

    Arguments: "parquet", "csv", "txt", "json", "jdbc", "orc", "avro", etc.

    ✑ .bucketBy:

    Method: bucketBy()

    Arguments: (numBuckets, col, col..., coln)

    The number of buckets and names of columns to bucket by. Uses Hive's bucketing scheme on a filesystem.
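
    For contrast, here is a minimal PySpark sketch of how bucketBy() is typically used (the bucket count, bucket column, and target table name are illustrative assumptions, not taken from the question). A key limitation is that bucketBy() only works together with saveAsTable(); it cannot write to a plain file path, so it does not produce the per-day, per-region folder layout that an incremental load pipeline would rely on:

        # Illustrative only: bucketBy() requires saveAsTable(); save() to a path is not supported.
        # Assumes df already holds the DBTBL1 data as a DataFrame.
        (df.write
            .format("parquet")
            .bucketBy(8, "SensorTypeID")      # 8 buckets is an arbitrary example value
            .sortBy("SensorTypeID")
            .saveAsTable("DBTBL1_bucketed"))  # hypothetical table name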

    Box 2: ("Year", "Month", "Day", "GeographyRegionID")

    Specify the columns on which to partition. Use the date columns followed by the GeographyRegionID column.

    Box 3: .saveAsTable("/DBTBL1")

    Method: saveAsTable()

    Argument: "table_name"

    The table to save to.
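
    Putting the three boxes together, the suggested answer corresponds roughly to the PySpark sketch below (the DataFrame variable, the append mode, and the target table name are assumptions for illustration; the answer text itself shows saveAsTable("/DBTBL1")):

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.getOrCreate()
        df = spark.table("DBTBL1")  # assumption: DBTBL1 is accessible as a table

        # Box 1 + Box 2: partition the output by the date columns, then the region
        # Box 3: persist via saveAsTable()
        (df.write
            .partitionBy("Year", "Month", "Day", "GeographyRegionID")
            .mode("append")              # append fits daily incremental loads (see discussion)
            .saveAsTable("DBTBL1_out"))  # hypothetical name; the answer shows "/DBTBL1"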

    Reference:

    https://www.oreilly.com/library/view/learning-spark-2nd/9781492050032/ch04.html
    https://docs.microsoft.com/en-us/azure/databricks/delta/delta-batch

Discussion
PallaviPatel

1. Partition by 2. GeographyRegionID, Year, Month, Day (as the pipelines are per region, this seems the right choice) 3. Parquet

uzairahm

Regarding point 2: the solution needs to support daily incremental loads, so having Year, Month, Day first would be more useful.

jpgsa11

Partition on date preferably, and hash on non-date fields so that it is balanced...

petulda

I suggest storing the data in parquet

Aurelkb

It is the same question as Topic 1, Question 36. So: 1. Partition by 2. GeographyRegionID, Year, Month, Day 3. Parquet

dev2dev

It's recommended to use the region partition first, before Y/M/D, so that they can be managed easily, such as assigning security or processing by business unit (zone/country/area, etc.). GeographyRegionId/Year/Month/Day and Parquet are the answers.

xmety

1. Partition by 2. Year, Month, Day, GeographyRegionID (it said to minimize storage cost, not query performance; if GeographyRegionID goes first, each region ID will have repeated date folders) 3. Parquet

jpgsa11

Completely true. The number of date folders per RegionID would be huge.

Deeksha1234

Agree with Pallavi 1. Partition by 2. GeographyRegionID, Year, Month, Day 3. Parquet

OldSchool

Agree on 1) & 3), but for 2) it should be Year/Month/Day/GeographyRegionId, and for each day we would generate several GeographyRegionId.parquet files.

OldSchool

Disregard my comment on 2). Provided answer is the correct one.

bad_atitude

My dear friends: 1. Sortby 2. GeographyRegionId, Year, Month, Day 3. Parquet

VikkiC

This question is similar to #36, Topic 1; if you reference that question, the answer should be 1. Partitioned By, 2. GeographyRegionID, Year, Month, Day, 3. Parquet

Backy

// the correct answer is
df.write.partitionBy("GeographyRegionID").mode("append").parquet("/DBTBL1")
// or
df.write.partitionBy("GeographyRegionID", "Year", "Month", "Day").mode("append").parquet("/DBTBL1")
// Question says "minimize storage costs" so I would select the first one

Davico93

Agree, but if you choose the first one, you won't have the daily data

allagowf

No mention of daily data in the question.

Spinozabubble

daily incremental load pipelines

Azure_2023

1. Partition by 2. GeographyRegionID, Year, Month, Day 3. Parquet

j888

Agreed with GeographyRegionID first, before the Year, Month, and Day.

pperf

https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-best-practices

dsp17

Parquet is a must (it offers higher compression rates) - "The solution must minimize storage costs."

Amsterliese

I was wondering if incremental load is supported for parquet, but since "append" mode is used, this should be alright. The question asks to minimize costs, so I go for parquet (not saveAsTable). partitionBy GeographyRegionID, Year, Month, Day (pipelines per region; daily load), parquet.
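
To make the incremental-load point concrete, here is a minimal sketch of a daily load that appends only the new day's rows into a partitioned Parquet location (the staging source, filter values, and output path are illustrative assumptions, not from the question):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    # Assumption: new sensor readings land in a staging table before the daily load.
    daily_df = (spark.table("DBTBL1_staging")
                .filter((F.col("Year") == 2024) & (F.col("Month") == 7) & (F.col("Day") == 15)))

    # Append only writes new partition folders; existing partitions stay untouched,
    # which is what keeps a daily incremental load cheap.
    (daily_df.write
        .partitionBy("GeographyRegionID", "Year", "Month", "Day")
        .mode("append")
        .parquet("/DBTBL1"))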

jv2120

The only reason for using .parquet is that the option seems to be a dataset path, not a table name; otherwise saveAsTable would be right.

e56bb91

ChatGPT 4o: Given the requirement to support daily incremental loads for each GeographyRegionID, the optimal partitioning strategy would be: partition by (GeographyRegionID, Year, Month, Day). This strategy allows the system to efficiently access the data based on GeographyRegionID first, and then further narrow down the partitions based on Year, Month, and Day. This approach is particularly useful for managing and querying large datasets in a time-series fashion, which is typical for sensor data.

learnwell

The second option given in the answer is correct, which is Year, Month, Day, GeographyRegionId. Why? The question states that the "solution must minimize storage cost". One way to do that is to have fewer folders created. Here is an example: let's say we go with the option of Year, Month, Day, GeographyRegionId, where we consider 1 year of data and say we have 10 different regions. The number of folders that would get created would be 1 (Year) + 12 (Month) + 365 (Days) + 3650 (number of days multiplied by 10 regions, as each day's folder would contain 10 subfolders for the 10 regions), which gives 3663 as the count. If we go with the option of GeographyRegionId, Year, Month, Day, then the 1 (Year) + 12 (Month) + 365 (Days), i.e. 378 folders, would be repeated 10 times inside 10 different region folders, which gives a total count of 3780 (378*10), which is definitely higher than 3663.

JosephVishal

For 3), if it were parquet with partitions, then it should be "overwrite" mode instead of "append". Since it is "append" mode, I think saveAsTable is more appropriate.