DP-203 Exam - Question 322

Question

HOTSPOT -You have an on-premises data warehouse that includes the following fact tables. Both tables have the following columns: DateKey, ProductKey, RegionKey. There are 120 unique product keys and 65 unique region keys. Queries that use the data warehouse take a long time to complete. You plan to migrate the solution to use Azure Synapse Analytics. You need to ensure that the Azure-based solution optimizes query performance and minimizes processing skew. What should you recommend? To answer, select the appropriate options in the answer area. NOTE: Each correct selection is worth one pointHot Area:.

Examice · Accepted Answer

Box 1: Hash-distributed -Box 2: ProductKey -ProductKey is used extensively in joins. Hash-distributed tables improve query performance on large fact tables. Box 3: Hash-distributed -Box 4: RegionKey -Round-robin tables are useful for improving loading speed. Consider using the round-robin distribution for your table in the following scenarios:✑ When getting started as a simple starting point since it is the default✑ If there is no obvious joining key✑ If there is not good candidate column for hash distributing the table✑ If the table does not share a common join key with other tables✑ If the join is less significant than other joins in the query✑ When the table is a temporary staging tableNote: A distributed table appears as a single table, but the rows are actually stored across 60 distributions. The rows are distributed with a hash or round-robin algorithm. Reference:https://docs. microsoft. com/en-us/azure/sql-data-warehouse/sql-data-warehouse-tables-distribute.

lara_mia1 · Answer

1. Hash Distributed, ProductKey because >2GB and ProductKey is extensively used in joins
2. Hash Distributed, RegionKey because "The table size on disk is more than 2 GB." and you have to chose a distribution column which: "Is not used in WHERE clauses. This could narrow the query to not run on all the distributions."

source: https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute#choosing-a-distribution-column

Rob77 · Answer

Both hash as both are > 2GB. In the 2nd table RegionKey cannot be used with round_robin distribution as round_robin does not take a distribution key...

bc5468521 · Answer

I AGREE WITH BOTH HASH WITH PRODUCT KEY

zarga · Answer

1. Hash on product key
2. Hash on region key (used on group by and have 65 unique values)

Deeksha1234 · Answer

1. Hash Distributed, ProductKey because table size >2GB and ProductKey is extensively used in joins . another, region key could have been considered (after join key which is product key) since its being used in grouping but 75% records belongs to one region so - 
NO for region key.

2. Hash Distributed, RegionKey because the table size on disk is more than 2 GB and Its being used in grouping (for this table more than 75% record doesn't fall in same region) and you have to chose a distribution column which is not used in WHERE clause.

dom271219 · Answer

"Choose a distribution column with data that distributes evenly"
ProductKey is more relevant in both cases

BrennaFrenna · Answer

The sales table makes sense with hashing distribution on ProductKey and since there is no obvious joining key for invoices, you should use round robin distribution on RegionKey. When it would be a smaller table you should use replicated.

DarioEtna · Answer

as for me i guess this is the right choice:
1. Hash Distributed, RegionKey because 
2. Hash Distributed, RegionKey because 
"When two large fact tables have frequent joins, query performance improves when you distribute both tables on one of the join columns" [Microsoft Documentation]
If we use for one ProductKey and for one RegionKey maybe the data movements would increase...or not?

Amalbenrebai · Answer

Regarding the invoces table, we can use the Round-robin distribution because  there is no obvious joining key in the table

Nishikag · Answer

To minimize data movement, select a distribution column that:

Is used in JOIN, GROUP BY, DISTINCT, OVER, and HAVING clauses. When two large fact tables have frequent joins, query performance improves when you distribute both tables on one of the join columns. When a table is not used in joins, consider distributing the table on a column that is frequently in the GROUP BY clause.
Is not used in WHERE clauses. This could narrow the query to not run on all the distributions.
Is not a date column. WHERE clauses often filter by date. When this happens, all the processing could run on only a few distributions.

kkk5566 · Answer

1. Hash Distributed, ProductKey 
2. Hash Distributed, RegionKey

tubis · Answer

When it says 75% of records related to one of the 40 regions, if we partition the Sales by Region, isn't it improve the reading process drastically in compare to productKey?

kiranSargar · Answer

Generally facts table are hash distributed. so both the table should use hash distribution and distribution key would be product_key for both.

Remedios79 · Answer

the provided aswers are correct

Ram9198 · Answer

When two large fact tables have frequent joins - in this case one is large and another is a small dimension table. Hence highlighted answer is correct

Dusica · Answer

Hash - Product Key x2

evangelist · Answer

Based on the votes, the best agreement for the configuration is:

Sales Table: Hash Distributed, ProductKey (134 votes)
Invoices Table: Hash Distributed, RegionKey (147 votes)

DP-203 Exam - Question 322

Discussion