Question 6 of 206

Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution.

After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen.

You are designing an Azure SQL Database that will use elastic pools. You plan to store data about customers in a table. Each record uses a value for CustomerID.

You need to recommend a strategy to partition data based on values in CustomerID.

Proposed Solution: Separate data into shards by using horizontal partitioning.

Does the solution meet the goal?

    Correct Answer: A

    Horizontal partitioning, also known as sharding, distributes the rows of a table across multiple databases while keeping the same schema in each one. This approach makes large datasets easier to manage by breaking them into smaller, more manageable pieces. For an Azure SQL Database that uses elastic pools and partitions data on CustomerID, horizontal partitioning is a suitable solution: rows are routed to shards based on their CustomerID value, which supports better performance and scalability. Therefore, the proposed solution meets the goal.
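    As a minimal sketch of how shard routing on CustomerID could work (the shard count, hash choice, and function name are illustrative assumptions, not part of the question), each customer can be mapped deterministically to one shard database, with every shard holding the same table schema:

```python
import hashlib

NUM_SHARDS = 4  # hypothetical number of shard databases in the elastic pool

def shard_for(customer_id: int) -> int:
    """Map a CustomerID to a shard index; all rows for one customer land in one shard."""
    digest = hashlib.md5(str(customer_id).encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Route each record to its shard; the table schema is identical across shards.
for customer_id in (101, 102, 103):
    print(customer_id, "->", f"customer_shard_{shard_for(customer_id)}")
```

    In practice, the Azure Elastic Database client library provides a shard map manager for exactly this routing, but the underlying idea is the same deterministic key-to-shard mapping.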

Question 7 of 206

HOTSPOT -

You are designing a data processing solution that will run as a Spark job on an HDInsight cluster. The solution will be used to provide near real-time information about online ordering for a retailer.

The solution must include a page on the company intranet that displays summary information.

The summary information page must meet the following requirements:

✑ Display a summary of sales to date grouped by product categories, price range, and review scope.

✑ Display sales summary information including total sales, sales as compared to one day ago and sales as compared to one year ago.

✑ Reflect information for new orders as quickly as possible.

You need to recommend a design for the solution.

What should you recommend? To answer, select the appropriate configuration in the answer area.

Hot Area:

    Correct Answer:

    Box 1: DataFrame -

    DataFrames -

    Best choice in most situations.

    Provides query optimization through Catalyst.

    Whole-stage code generation.

    Direct memory access.

    Low garbage collection (GC) overhead.

    Not as developer-friendly as DataSets, as there are no compile-time checks or domain object programming.

    Box 2: parquet -

    The best format for performance is parquet with snappy compression, which is the default in Spark 2.x. Parquet stores data in columnar format, and is highly optimized in Spark.

    Incorrect Answers:

    DataSets -

    Good in complex ETL pipelines where the performance impact is acceptable.

    Not good in aggregations where the performance impact can be considerable.

    RDDs -

    You do not need to use RDDs, unless you need to build a new custom RDD.

    No query optimization through Catalyst.

    No whole-stage code generation.

    High GC overhead.

    Reference:

    https://docs.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-perf
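    As a hedged illustration of the recommended combination (a hypothetical sketch: the file paths, column names, and aggregation are assumptions, not details from the question), a PySpark job can build the sales summary with a DataFrame and persist it as snappy-compressed parquet:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sales-summary").getOrCreate()

# DataFrames go through the Catalyst optimizer and whole-stage code generation.
orders = spark.read.parquet("/data/orders")  # hypothetical input path

summary = (orders
           .groupBy("product_category", "price_range", "review_score")  # assumed columns
           .agg(F.sum("sale_amount").alias("total_sales")))

# Parquet with snappy compression is the Spark 2.x default and the
# best-performing format for this workload.
summary.write.mode("overwrite").option("compression", "snappy").parquet("/data/sales_summary")
```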

Question 8 of 206

You are evaluating data storage solutions to support a new application.

You need to recommend a data storage solution that represents data by using nodes and relationships in graph structures.

Which data storage solution should you recommend?

    Correct Answer: B

    Azure Cosmos DB supports graph structures through its Gremlin API, making it an ideal solution for applications that need to represent data using nodes and relationships. Blob Storage, Azure Data Lake Store, and HDInsight do not provide native capabilities for graph databases.
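    As a hedged sketch of what a graph workload against the Cosmos DB Gremlin API can look like in Python (the endpoint, key, database and graph names, and the 'pk' partition-key property are placeholders, not values from the question), nodes are created as vertices and relationships as edges:

```python
from gremlin_python.driver import client, serializer

# Placeholder Cosmos DB Gremlin endpoint and credentials; the username
# encodes the database and graph (collection) names.
g = client.Client(
    "wss://<account>.gremlin.cosmos.azure.com:443/",
    "g",
    username="/dbs/<database>/colls/<graph>",
    password="<primary-key>",
    message_serializer=serializer.GraphSONSerializersV2d0(),
)

# Two vertices (nodes) and one edge (relationship), then a traversal.
g.submit("g.addV('customer').property('id','c1').property('pk','c1')").all().result()
g.submit("g.addV('product').property('id','p1').property('pk','p1')").all().result()
g.submit("g.V('c1').addE('purchased').to(g.V('p1'))").all().result()
print(g.submit("g.V('c1').out('purchased').values('id')").all().result())

g.close()
```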

Question 9 of 206

HOTSPOT -

You have an on-premises data warehouse that includes the following fact tables. Both tables have the following columns: DateKey, ProductKey, RegionKey.

There are 120 unique product keys and 65 unique region keys.

Queries that use the data warehouse take a long time to complete.

You plan to migrate the solution to use Azure Synapse Analytics. You need to ensure that the Azure-based solution optimizes query performance and minimizes processing skew.

What should you recommend? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.

Hot Area:

    Correct Answer:

    Box 1: Hash-distributed -

    Box 2: ProductKey -

    ProductKey is used extensively in joins.

    Hash-distributed tables improve query performance on large fact tables.

    Box 3: Round-robin -

    Box 4: RegionKey -

    Round-robin tables are useful for improving loading speed.

    Consider using the round-robin distribution for your table in the following scenarios:

    ✑ When getting started as a simple starting point since it is the default

    ✑ If there is no obvious joining key

    ✑ If there is no good candidate column for hash-distributing the table

    ✑ If the table does not share a common join key with other tables

    ✑ If the join is less significant than other joins in the query

    ✑ When the table is a temporary staging table

    Note: A distributed table appears as a single table, but the rows are actually stored across 60 distributions. The rows are distributed with a hash or round-robin algorithm.

    Reference:

    https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-tables-distribute
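    To make the recommendation concrete, here is a hypothetical sketch of the corresponding DDL submitted from Python with pyodbc (the connection string, table names, and the choice of a heap for the staging table are assumptions; the DateKey/ProductKey/RegionKey columns come from the question):

```python
import pyodbc

# Placeholder connection string for a dedicated SQL pool in Azure Synapse Analytics.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=<server>.sql.azuresynapse.net;DATABASE=<database>;"
    "UID=<user>;PWD=<password>",
    autocommit=True,
)
cursor = conn.cursor()

# Hash-distribute the large fact table on ProductKey, the column used
# extensively in joins, to improve query performance and limit skew.
cursor.execute("""
CREATE TABLE dbo.FactSales
(
    DateKey    int NOT NULL,
    ProductKey int NOT NULL,
    RegionKey  int NOT NULL
)
WITH (DISTRIBUTION = HASH(ProductKey), CLUSTERED COLUMNSTORE INDEX)
""")

# Round-robin distribution spreads rows evenly with no distribution column,
# which suits temporary staging tables and fast loading.
cursor.execute("""
CREATE TABLE dbo.StageSales
(
    DateKey    int NOT NULL,
    ProductKey int NOT NULL,
    RegionKey  int NOT NULL
)
WITH (DISTRIBUTION = ROUND_ROBIN, HEAP)
""")
```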

Question 10 of 206

You are designing a data processing solution that will implement the lambda architecture pattern. The solution will use Spark running on HDInsight for data processing.

You need to recommend a data storage technology for the solution.

Which two technologies should you recommend? Each correct answer presents a complete solution.

NOTE: Each correct selection is worth one point.

    Correct Answer: A, E

    To implement a lambda architecture on Azure, you can combine the following technologies to accelerate real-time big data analytics:

    ✑ Azure Cosmos DB, the industry's first globally distributed, multi-model database service.

    ✑ Apache Spark for Azure HDInsight, a processing framework that runs large-scale data analytics applications

    ✑ Azure Cosmos DB change feed, which streams new data to the batch layer for HDInsight to process

    ✑ The Spark to Azure Cosmos DB Connector

    E: You can use Apache Spark to stream data into or out of Apache Kafka on HDInsight using DStreams.

    Reference:

    https://docs.microsoft.com/en-us/azure/cosmos-db/lambda-architecture
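    As a hedged sketch of the speed-layer piece (shown with Structured Streaming rather than the DStreams API the explanation mentions; the Kafka broker address, topic name, and windowing are assumptions, and the Kafka source additionally requires the spark-sql-kafka package on the cluster), Spark on HDInsight can consume new order events from Kafka like this:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

spark = SparkSession.builder.appName("lambda-speed-layer").getOrCreate()

# Speed layer: read new order events from Apache Kafka on HDInsight.
orders = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "<broker>:9092")  # placeholder broker
          .option("subscribe", "orders")                       # placeholder topic
          .load())

# Count events per minute using the timestamp column the Kafka source provides.
counts = (orders
          .selectExpr("CAST(value AS STRING) AS order_json", "timestamp")
          .groupBy(window(col("timestamp"), "1 minute"))
          .count())

# Emit running counts; a real job would write to the serving layer instead.
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```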