Choosing EVEN distribution in Amazon Redshift is most appropriate in two circumstances. First, when a table is highly denormalized and does not participate in frequent joins, EVEN distribution spreads rows uniformly across the node slices without regard to a join key. Second, when a table has just been loaded and it is not yet clear how it will join to dimension tables, EVEN distribution keeps the data balanced while the schema and data relationships are still being worked out.
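As a minimal sketch, the distribution style is set in the table DDL; the table and column names here are illustrative, not from the question:

```sql
-- Staging table whose join pattern is not yet known, so no join key
-- is favored; DISTSTYLE EVEN round-robins rows across node slices.
CREATE TABLE clickstream_staging (
    event_id   BIGINT,
    event_time TIMESTAMP,
    payload    VARCHAR(4096)
)
DISTSTYLE EVEN;
```

Once the join patterns are understood, the table can be rebuilt with DISTSTYLE KEY or ALL as appropriate.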
To optimize EMR job performance when handling large files, it is beneficial to use a compression codec that balances compression ratio against CPU cost and that supports splitting, so a single file can be processed in parallel across tasks. Both bzip2 and Snappy fit this better than gzip, which is not splittable. Bzip2 achieves a higher compression ratio at reasonable speed and is splittable on its own, enabling efficient parallel processing of large datasets. Snappy compresses less aggressively but is much faster to compress and decompress, which suits frequently accessed data; as a raw stream it is not splittable, but it is when used inside container formats such as Parquet, ORC, or SequenceFile. Either choice improves EMR efficiency by allowing faster data processing and better use of cluster resources.
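A toy illustration of the size-versus-CPU trade-off, using only the Python standard library (Snappy itself needs a third-party binding such as python-snappy, so gzip and bzip2 stand in here):

```python
import bz2
import gzip

# Repetitive payload, roughly like log or clickstream text.
data = b"clickstream-event,2024-01-01,server-1\n" * 10_000

gz = gzip.compress(data)  # faster, lighter compression
bz = bz2.compress(data)   # heavier compression, more CPU

# Both codecs are lossless and round-trip exactly.
assert gzip.decompress(gz) == data
assert bz2.decompress(bz) == data
print(f"raw={len(data)} gzip={len(gz)} bzip2={len(bz)}")
```

The same trade-off drives the codec choice on EMR, with splittability deciding whether a large file can be read by many tasks at once.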
The most reliable and fault-tolerant technique for sending data to Amazon Kinesis with every click is to use the Amazon Kinesis PutRecord API along with an exponential back-off algorithm for retries until a successful response is received. This method handles potential throughput issues and ensures that retries do not overwhelm the system, providing a balanced approach to reliability and fault tolerance.
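The retry pattern can be sketched as capped exponential back-off with jitter. The `send` callable below is a stand-in for the actual Kinesis call (e.g. boto3's `kinesis_client.put_record`); the parameter values are illustrative defaults, not prescribed by the question:

```python
import random
import time

def put_record_with_backoff(send, record, max_retries=8,
                            base_delay=0.05, max_delay=2.0):
    """Retry send(record) with capped exponential back-off.

    `send` should raise on a throttled or failed attempt and return
    the service response on success, as a PutRecord call would.
    """
    for attempt in range(max_retries):
        try:
            return send(record)
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the failure
            # Sleep base * 2^attempt, capped, with full jitter so many
            # producers do not retry in lockstep.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))
```

Full jitter (a random sleep up to the computed delay) is a common refinement; a plain doubling delay also satisfies the exponential back-off requirement.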
The most efficient DynamoDB table scheme for querying a given stream or server within a defined time range uses a primary key with StreamName as the partition key and a composite sort key of TimeStamp concatenated with ServerName. Additionally, a Global Secondary Index is defined with ServerName as its partition key and a composite sort key of TimeStamp concatenated with StreamName. Because the timestamp leads each sort key, a Query can apply a range condition on either the base table (by stream) or the index (by server), providing the flexibility needed to meet the requirements.
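A sketch of the composite-sort-key convention described above. The `#` delimiter and ISO-8601 timestamps are assumptions; any encoding that sorts lexicographically by time first works the same way:

```python
def sort_key(timestamp: str, server_name: str) -> str:
    """TimeStamp leads so range queries on time work via BETWEEN."""
    return f"{timestamp}#{server_name}"

def time_range_condition(stream_name: str, start: str, end: str) -> dict:
    """Illustrative boto3-style Query parameters for the base table."""
    return {
        "KeyConditionExpression":
            "StreamName = :s AND TimeStampServerName BETWEEN :lo AND :hi",
        "ExpressionAttributeValues": {
            ":s": stream_name,
            # "#" sorts below alphanumerics and "~" above them, so these
            # bounds bracket every ServerName suffix in the time range.
            ":lo": f"{start}#",
            ":hi": f"{end}~",
        },
    }
```

The GSI query is symmetric: ServerName as the partition key condition, with TimeStamp#StreamName bounded the same way.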
To enable a centralized metadata layer that can expose Amazon S3 objects as tables for teams using SQL queries with Hive, Spark-SQL, and Presto on Amazon EMR, configuring the Hive Metastore to use an Amazon RDS database is the best approach. By hosting the Hive Metastore in Amazon RDS, multiple EMR clusters can share a single, consistent, and centralized metadata store. This allows all analyst teams to access the same metadata from different clusters, ensuring that the S3 objects are represented consistently as tables across all teams. EMRFS consistent view and DynamoDB are primarily used for consistency of file operations in S3 and do not serve as a centralized metadata service.
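On EMR this is wired up through the hive-site configuration classification; a minimal sketch, where the RDS endpoint, database name, and credentials are placeholders:

```json
[
  {
    "Classification": "hive-site",
    "Properties": {
      "javax.jdo.option.ConnectionURL": "jdbc:mysql://my-rds-endpoint:3306/hive?createDatabaseIfNotExist=true",
      "javax.jdo.option.ConnectionDriverName": "org.mariadb.jdbc.Driver",
      "javax.jdo.option.ConnectionUserName": "hive",
      "javax.jdo.option.ConnectionPassword": "hive-password"
    }
  }
]
```

Every cluster launched with this configuration reads and writes the same metastore, so tables defined over S3 objects by one team are immediately visible to the others.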