DP-203 Exam - Question 39

Question

You are designing a fact table named FactPurchase in an Azure Synapse Analytics dedicated SQL pool. The table contains purchases from suppliers for a retail store. FactPurchase will contain the following columns.

FactPurchase will have 1 million rows of data added daily and will contain three years of data.

Transact-SQL queries similar to the following query will be executed daily.

SELECT -

SupplierKey, StockItemKey, COUNT(*)

FROM FactPurchase -

WHERE DateKey >= 20210101 -

AND DateKey <= 20210131 -

GROUP By SupplierKey, StockItemKey

Which table distribution will minimize query times?

Examice · Accepted Answer

When optimizing query performance for large fact tables in an Azure Synapse Analytics dedicated SQL pool, choosing an appropriate distribution method is crucial. Hash distribution on a column frequently used in the WHERE clause, such as DateKey, can provide significant query performance benefits. Distributing on DateKey aligns with the primary filtering condition in the query, thereby ensuring that query segments are processed effectively across distributions. While it is advised against using a date column for distribution to prevent data skew, in this case, the DateKey is represented as an integer, distributing data over multiple integers effectively. This distribution approach avoids excessive data movement and optimizes the parallel processing of query segments by focusing on the common filter criteria used. Hence, hash-distributing on DateKey would minimize query times for the given scenario.

AugustineUba · Answer

From the documentation the answer is clear enough. B  is the right answer. 
When choosing a distribution column, select a distribution column that: "Is not a date column. All data for the same date lands in the same distribution. If several users are all filtering on the same date, then only 1 of the 60 distributions do all the processing work."

waterbender19 · Answer

I think the answer should be D for that specific query. If you look at the datatypes, DateKey is an INT datatype not a DATE datatype.

MarkJoh · Answer

You want to distribute by productKey and partition by date.  Then all distributions will be looked at in parallel and then, within each distribution, only the desired partitions will be looked at.   Thereby, the query is fully scaled out and the quickest it can be.

AlejandroU · Answer

B) the chosen distribution column should not be used in WHERE clauses; thus, we can discard DateKey (even though it is not a Date data type) to minimize data movement. The chosen distribution column must have many unique values; thus we potentially have 2 candidates: PurchaseKey or PurchaseOrderID; however, the chosen one should have no NULLS or only a few, making PurchaseKey the ideal in order to distribute evenly.
https://learn.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute

jongert · Answer

Something not immediately clear to me was that distributing and partitioning are different, hence I was confused that one should not distribute over date columns.

Bottom line is, do not distribute over date columns but you can partition over them. In this question they specifically ask about distribution method. Query optimization for large tables directly points to hashing.

SHENOOOO · Answer

B is the correct Answer

Vanq69 · Answer

Is there any "official" answer to this?
    A. Replicated: Replicated tables have copies of the entire table on each distribution. While this option can eliminate data movement, it may not be the most efficient choice for very large tables with frequent updates.
    B. Hash-Distributed on PurchaseKey: Hash distribution on "PurchaseKey" may lead to data skew if "PurchaseKey" doesn't have a wide range of unique values. Additionally, it doesn't align with the primary filtering condition on "DateKey."
    C. Round-Robin: Round-robin distribution ensures even data distribution, but it doesn't take advantage of data locality for specific types of queries.
    D. Hash-Distributed on DateKey: Distributing on "DateKey" aligns with your primary filtering condition, but it's a date column. This could lead to clustering by date, especially if many users filter on the same date.

None of the answers seem to fit. D could be the best guess but it's a date column.

AKTommy · Answer

B is my correct answer

kitesh1994 · Answer

Question 20 and 39 Is same

gozdek · Answer

B is total nonsense if PurchaseKey has a unique value for every row it would end up distributing it evenly so same as round-robin. Distributing by date would slow down the query because in a situation presented in the question only 31 out of 60 distributions would be used. So in my opinion C is the correct answer.

kkk5566 · Answer

B is correct

moneytime · Answer

For me.
I chose D.
The reason is that the" datekey"  is of tyoe "integer" not " Date" .This  qualifies it to be used as a  non-auto-incremental  surrogate key  for the fact table.

pawades · Answer

Explain me something - if you use Purchasekey as a hash distribution, and then want to do a partition, which column will you use for partition, we mostly date column for partition, but if we use date column then during query execution where you want to query data for let's say Jan month, wouldn't the query will need data from multiple nodes? eventually slowing the results? isn't it easy to keep data on a single node get faster results. Am I missing anything here?

Danweo · Answer

You don't want to hash on the Date column generally, definitely not when its being included in the Where clause, PurchaseKey is the only acceptable option given as the table is too large for roundrobin. I want to know if those other columns in the group by could possibly be used also?

evangelist · Answer

Answer could be only D

evangelist · Answer

total votes on B:85 on D; 23

evangelist · Answer

The real approach in actual project is limiting the dateKey:
 DistributionKey AS HASHBYTES('MD5', CONCAT(YEAR(CAST(DateKey AS DATE)), FORMAT(CAST(DateKey AS DATE), 'MM')))
)
WITH
(
    DISTRIBUTION = HASH(DistributionKey),
    CLUSTERED COLUMNSTORE INDEX
);

DP-203 Exam - Question 39

Discussion