Exam DP-203 All QuestionsBrowse all questions from this exam
Question 322

HOTSPOT -

You have an on-premises data warehouse that includes the following fact tables. Both tables have the following columns: DateKey, ProductKey, RegionKey.

There are 120 unique product keys and 65 unique region keys.

Queries that use the data warehouse take a long time to complete.

You plan to migrate the solution to use Azure Synapse Analytics. You need to ensure that the Azure-based solution optimizes query performance and minimizes processing skew.

What should you recommend? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point

Hot Area:

    Correct Answer:

    Box 1: Hash-distributed -

    Box 2: ProductKey -

    ProductKey is used extensively in joins.

    Hash-distributed tables improve query performance on large fact tables.

    Box 3: Hash-distributed -

    Box 4: RegionKey -

    Round-robin tables are useful for improving loading speed.

    Consider using the round-robin distribution for your table in the following scenarios:

    ✑ When getting started as a simple starting point since it is the default

    ✑ If there is no obvious joining key

    ✑ If there is not good candidate column for hash distributing the table

    ✑ If the table does not share a common join key with other tables

    ✑ If the join is less significant than other joins in the query

    ✑ When the table is a temporary staging table

    Note: A distributed table appears as a single table, but the rows are actually stored across 60 distributions. The rows are distributed with a hash or round-robin algorithm.

    Reference:

    https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-tables-distribute

Discussion
lara_mia1

1. Hash Distributed, ProductKey because >2GB and ProductKey is extensively used in joins 2. Hash Distributed, RegionKey because "The table size on disk is more than 2 GB." and you have to chose a distribution column which: "Is not used in WHERE clauses. This could narrow the query to not run on all the distributions." source: https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute#choosing-a-distribution-column

Marcello83

I agree with lara_mia1

vblessings

i agree

niceguy0371

Disagree on nr. 1 because of the reason you give for nr. 2. (choose a distribution column that is not used in where clauses. A join is also a where clause

sdokmak

nah mate, check out his link: Is used in JOIN, GROUP BY, DISTINCT, OVER, and HAVING clauses. When two large fact tables have frequent joins, query performance improves when you distribute both tables on one of the join columns. When a table is not used in joins, consider distributing the table on a column that is frequently in the GROUP BY clause. Is not used in WHERE clauses. This could narrow the query to not run on all the distributions. Is not a date column. WHERE clauses often filter by date. When this happens, all the processing could run on only a few distributions.

Rob77

Both hash as both are > 2GB. In the 2nd table RegionKey cannot be used with round_robin distribution as round_robin does not take a distribution key...

ploer

Correct: "A round-robin distributed table distributes table rows evenly across all distributions. The assignment of rows to distributions is random. Unlike hash-distributed tables, rows with equal values are not guaranteed to be assigned to the same distribution." https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute

bc5468521

I AGREE WITH BOTH HASH WITH PRODUCT KEY

zarga

1. Hash on product key 2. Hash on region key (used on group by and have 65 unique values)

Deeksha1234

1. Hash Distributed, ProductKey because table size >2GB and ProductKey is extensively used in joins . another, region key could have been considered (after join key which is product key) since its being used in grouping but 75% records belongs to one region so - NO for region key. 2. Hash Distributed, RegionKey because the table size on disk is more than 2 GB and Its being used in grouping (for this table more than 75% record doesn't fall in same region) and you have to chose a distribution column which is not used in WHERE clause.

dom271219

"Choose a distribution column with data that distributes evenly" ProductKey is more relevant in both cases

DarioEtna

as for me i guess this is the right choice: 1. Hash Distributed, RegionKey because 2. Hash Distributed, RegionKey because "When two large fact tables have frequent joins, query performance improves when you distribute both tables on one of the join columns" [Microsoft Documentation] If we use for one ProductKey and for one RegionKey maybe the data movements would increase...or not?

DarioEtna

But we cannot use ProductKey in both because in Invoice table it is used in WHERE condition

Lucky_me

If we choose RegionKey for Sales, we would have a processing skew.

Aditya0891

DarioEtna where in the question is it mentioned that both tables will be used together in a join query? They have different set of columns in where and group by, so why are you so sure that they will be used together? Answers provided are correct here

BrennaFrenna

The sales table makes sense with hashing distribution on ProductKey and since there is no obvious joining key for invoices, you should use round robin distribution on RegionKey. When it would be a smaller table you should use replicated.

kkk5566

1. Hash Distributed, ProductKey 2. Hash Distributed, RegionKey

Nishikag

To minimize data movement, select a distribution column that: Is used in JOIN, GROUP BY, DISTINCT, OVER, and HAVING clauses. When two large fact tables have frequent joins, query performance improves when you distribute both tables on one of the join columns. When a table is not used in joins, consider distributing the table on a column that is frequently in the GROUP BY clause. Is not used in WHERE clauses. This could narrow the query to not run on all the distributions. Is not a date column. WHERE clauses often filter by date. When this happens, all the processing could run on only a few distributions.

Amalbenrebai

Regarding the invoces table, we can use the Round-robin distribution because there is no obvious joining key in the table

evangelist

Based on the votes, the best agreement for the configuration is: Sales Table: Hash Distributed, ProductKey (134 votes) Invoices Table: Hash Distributed, RegionKey (147 votes)

Dusica

Hash - Product Key x2

Ram9198

When two large fact tables have frequent joins - in this case one is large and another is a small dimension table. Hence highlighted answer is correct

Dusica

both are fact tables

Remedios79

the provided aswers are correct

kiranSargar

Generally facts table are hash distributed. so both the table should use hash distribution and distribution key would be product_key for both.

tubis

When it says 75% of records related to one of the 40 regions, if we partition the Sales by Region, isn't it improve the reading process drastically in compare to productKey?

Preben

That's 75 % of 61 % of the regions that will be done effectively. That's only efficient for 45 % of the queries. Not a whole lot.

patricka95

No, if 75% relate to one region and we hash on region, that means that those will all be on one node and there will be skew. Correct answers are Hash, Product, Hash, Region.