Question 6 of 164

A company has a business unit uploading .csv files to an Amazon S3 bucket. The company's data platform team has set up an AWS Glue crawler to perform discovery and to create tables and schemas. An AWS Glue job writes processed data from the created tables to an Amazon Redshift database. The AWS Glue job handles column mapping and creates the Amazon Redshift table appropriately. When the AWS Glue job is rerun for any reason during the day, duplicate records are introduced into the Amazon Redshift table.

Which solution will update the Redshift table without duplicates when jobs are rerun?

    Correct Answer: A

    To update the Redshift table without introducing duplicates when AWS Glue jobs are rerun, the best approach is to modify the AWS Glue job to copy the rows into a staging table, and then add SQL commands that replace the existing rows in the main table as postactions in the DynamicFrameWriter class. Because the staged rows replace the corresponding rows in the main table, rerunning the job does not introduce duplicates, and data integrity is maintained in a straightforward and efficient manner. The other options either introduce unnecessary complexity, are not directly applicable to the given task, or do not actually remove duplicates.
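
    The sketch below illustrates this staging-table pattern in a Glue PySpark job. It is only a sketch: the database, connection name, table names, key column, and S3 temporary directory are placeholders rather than details from the question.

```python
# Minimal sketch of the staging-table merge pattern. All names are placeholders.
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the crawled table from the Data Catalog (placeholder database/table names).
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="csv_db", table_name="records"
)

merge_sql = """
    BEGIN;
    DELETE FROM public.records
        USING public.records_stage
        WHERE public.records.record_id = public.records_stage.record_id;
    INSERT INTO public.records SELECT * FROM public.records_stage;
    DROP TABLE public.records_stage;
    END;
"""

glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=dyf,
    catalog_connection="redshift-connection",      # Glue connection name (assumed)
    connection_options={
        "database": "dev",
        "dbtable": "public.records_stage",         # rows land in the staging table
        "preactions": (
            "DROP TABLE IF EXISTS public.records_stage; "
            "CREATE TABLE public.records_stage (LIKE public.records);"
        ),
        "postactions": merge_sql,                  # replace existing rows after the load
    },
    redshift_tmp_dir="s3://example-temp-bucket/redshift/",
)
```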

Question 7 of 164

A streaming application is reading data from Amazon Kinesis Data Streams and immediately writing the data to an Amazon S3 bucket every 10 seconds. The application is reading data from hundreds of shards. The batch interval cannot be changed due to a separate requirement. The data is being accessed by Amazon Athena. Users are seeing degradation in query performance as time progresses.

Which action can help improve query performance?

    Correct Answer: A

    Because the streaming application writes data to Amazon S3 every 10 seconds from hundreds of shards, a very large number of small files accumulates over time. This degrades query performance in Amazon Athena, because Athena has to scan more metadata and perform more file operations. Merging the files in Amazon S3 into larger files reduces the number of objects Athena must open and scan, thereby improving query performance.
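
    One way to do this, sketched below, is a periodic compaction job that reads a prefix's worth of small objects and rewrites them as a handful of larger files. The bucket names, prefix layout, and target file count are assumptions for illustration.

```python
# Rough compaction sketch: read many small objects and rewrite them as fewer,
# larger files. Paths and the output file count are placeholders; converting to
# a columnar format such as Parquet would help further, but the key step here
# is simply producing fewer, larger objects.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-compaction").getOrCreate()

# Read the small files written by the streaming application for one prefix.
df = spark.read.json("s3://example-stream-raw/2024/06/01/")

# Coalesce to a small number of partitions so each output object is large.
(df.coalesce(10)
   .write
   .mode("overwrite")
   .json("s3://example-stream-compacted/2024/06/01/"))
```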

Question 8 of 164

A company uses Amazon OpenSearch Service (Amazon Elasticsearch Service) to store and analyze its website clickstream data. The company ingests 1 TB of data daily using Amazon Kinesis Data Firehose and stores one day's worth of data in an Amazon ES cluster.

The company has very slow query performance on the Amazon ES index and occasionally sees errors from Kinesis Data Firehose when attempting to write to the index. The Amazon ES cluster has 10 data nodes and 3 dedicated master nodes and runs a single index. Each data node has 1.5 TB of Amazon EBS storage attached, and the cluster is configured with 1,000 shards. Occasionally, JVMMemoryPressure errors are found in the cluster logs.

Which solution will improve the performance of Amazon ES?

    Correct Answer: C

    The issue presented involves very slow query performance and JVMMemoryPressure errors in the Amazon ES cluster. The cluster experiences these issues because the index has far too many shards, which creates excessive per-shard overhead. The general recommendation is to keep shard sizes between 10 GiB and 50 GiB; with roughly 1 TB of data spread across 1,000 shards, each shard holds only about 1 GB, so much of the cluster's memory is consumed by shard overhead rather than data. Reducing the number of shards distributes the data more efficiently across the nodes, reduces overhead, and improves performance.
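
    Because the shard count of an existing index cannot be lowered in place, a typical approach is to create a new index with fewer primary shards and reindex into it, as sketched below. The domain endpoint, index names, and shard count are assumptions, and request signing is omitted for brevity.

```python
# Sketch: create a new index with fewer primary shards, then copy documents
# into it with the _reindex API. Endpoint, index names, and shard count are
# placeholders; in practice requests would be authenticated (e.g., SigV4).
import requests

ES_ENDPOINT = "https://search-example-domain.us-east-1.es.amazonaws.com"

# 1. Create the target index with far fewer primary shards, sized so each
#    shard ends up in the recommended 10-50 GiB range.
requests.put(
    f"{ES_ENDPOINT}/clickstream-v2",
    json={"settings": {"index": {"number_of_shards": 30, "number_of_replicas": 1}}},
)

# 2. Reindex documents from the over-sharded index into the new one.
requests.post(
    f"{ES_ENDPOINT}/_reindex",
    json={"source": {"index": "clickstream"}, "dest": {"index": "clickstream-v2"}},
)
```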

Question 9 of 164

A manufacturing company has been collecting IoT sensor data from devices on its factory floor for a year and is storing the data in Amazon Redshift for daily analysis. A data analyst has determined that, at an expected ingestion rate of about 2 TB per day, the cluster will be undersized in less than 4 months. A long-term solution is needed. The data analyst has indicated that most queries only reference the most recent 13 months of data, yet there are also quarterly reports that need to query all the data generated from the past 7 years. The chief technology officer (CTO) is concerned about the costs, administrative effort, and performance of a long-term solution.

Which solution should the data analyst use to meet these requirements?

    Correct Answer: A

    The most appropriate solution is a daily AWS Glue job that offloads records older than 13 months to Amazon S3 and deletes those records from Amazon Redshift. An external table in Amazon Redshift can then point to the S3 location, allowing Amazon Redshift Spectrum to query the data older than 13 months. This approach keeps storage costs under control, minimizes administrative effort, and maintains query performance for both recent and long-term data.
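
    A sketch of the daily offload step is shown below, run from a Python job via the Amazon Redshift Data API. The cluster identifier, database, table, timestamp column, S3 prefix, and IAM role ARN are all placeholders; the one-time Redshift Spectrum setup (an external schema and table over the same S3 prefix) is noted in the comments.

```python
# Sketch of the daily offload: UNLOAD rows older than 13 months to S3, then
# delete them from the Redshift table. All identifiers below are placeholders.
# One-time Spectrum setup (not shown): CREATE EXTERNAL SCHEMA ... FROM DATA CATALOG
# plus an external table over the same S3 prefix for the quarterly reports.
import boto3

rsd = boto3.client("redshift-data")

unload_sql = """
UNLOAD ('SELECT * FROM public.sensor_readings
          WHERE reading_ts < DATEADD(month, -13, GETDATE())')
TO 's3://example-sensor-archive/readings/'
IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleSpectrumRole'
FORMAT AS PARQUET;
"""

delete_sql = """
DELETE FROM public.sensor_readings
 WHERE reading_ts < DATEADD(month, -13, GETDATE());
"""

rsd.batch_execute_statement(
    ClusterIdentifier="example-cluster",
    Database="dev",
    DbUser="etl_user",
    Sqls=[unload_sql, delete_sql],
)
```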

Question 10 of 164

An insurance company has raw data in JSON format that is sent without a predefined schedule through an Amazon Kinesis Data Firehose delivery stream to an Amazon S3 bucket. An AWS Glue crawler is scheduled to run every 8 hours to update the schema in the data catalog of the tables stored in the S3 bucket. Data analysts analyze the data using Apache Spark SQL on Amazon EMR set up with AWS Glue Data Catalog as the metastore. Data analysts say that, occasionally, the data they receive is stale. A data engineer needs to provide access to the most up-to-date data.

Which solution meets these requirements?

    Correct Answer: D

    To ensure the data analysts have access to the most up-to-date data, the AWS Glue crawler should run whenever new data is added to the S3 bucket. Starting the crawler from an AWS Lambda function that is triggered by an S3:ObjectCreated:* event notification on the bucket means the Data Catalog is updated as soon as new objects arrive, rather than only every 8 hours, so the analysts always query the freshest data.
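
    A minimal Lambda handler for this pattern might look like the sketch below; the crawler name is a placeholder, and a production version would also account for crawls already in progress when many objects arrive in a short window.

```python
# Minimal handler sketch: start the Glue crawler when an S3:ObjectCreated:*
# notification invokes the function. The crawler name is a placeholder.
import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    try:
        glue.start_crawler(Name="example-json-crawler")
    except glue.exceptions.CrawlerRunningException:
        # A crawl is already running; the new objects will be covered by that
        # run or by the next invocation.
        pass
```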