Exam SAA-C03
Question 59

A company hosts more than 300 global websites and applications. The company requires a platform to analyze more than 30 TB of clickstream data each day.

What should a solutions architect do to transmit and process the clickstream data?

    A. Design an AWS Data Pipeline to archive the data to an Amazon S3 bucket and run an Amazon EMR cluster with the data to generate analytics.
    B. Create an Auto Scaling group of Amazon EC2 instances to process the data and send it to an Amazon S3 data lake for Amazon Redshift to use for analysis.
    C. Cache the data to Amazon CloudFront. Store the data in an Amazon S3 bucket. When an object is added to the S3 bucket, run an AWS Lambda function to process the data for analysis.
    D. Collect the data from Amazon Kinesis Data Streams. Use Amazon Kinesis Data Firehose to transmit the data to an Amazon S3 data lake. Load the data in Amazon Redshift for analysis.

    Correct Answer: D

    To transmit and process more than 30 TB of clickstream data each day for a company with over 300 global websites, the most suitable approach is to use Amazon Kinesis Data Streams to collect the data in real time. Kinesis Data Firehose can then transmit this data to an Amazon S3 data lake, which provides scalable storage. Finally, Amazon Redshift can load and analyze the data efficiently. This combination leverages managed services, ensuring scalability, durability, and ease of use.
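The ingestion side of this pipeline is typically the PutRecords API, which accepts at most 500 records per call. A minimal sketch of how clickstream events might be shaped and batched for it (the `session_id` field and the stream name are illustrative, not from the question; the actual boto3 call is shown only as a comment):

```python
import json

MAX_BATCH_RECORDS = 500  # Kinesis PutRecords accepts at most 500 records per call


def to_kinesis_record(event: dict) -> dict:
    # Partition by session ID so all events from one session land on the same
    # shard, preserving per-session ordering. "session_id" is an illustrative
    # field name, not something given in the question.
    return {
        "Data": json.dumps(event).encode("utf-8"),
        "PartitionKey": str(event["session_id"]),
    }


def batch_records(events):
    """Yield lists of PutRecords-ready entries, at most 500 per batch."""
    batch = []
    for event in events:
        batch.append(to_kinesis_record(event))
        if len(batch) == MAX_BATCH_RECORDS:
            yield batch
            batch = []
    if batch:
        yield batch


# Each yielded batch would then be sent with a boto3 client call such as:
#   kinesis.put_records(StreamName="clickstream", Records=batch)
# (not executed here, since it needs AWS credentials and a live stream)
```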

Discussion
BuruguduystunstugudunstuyOption: D

Option D is the most appropriate solution for transmitting and processing the clickstream data in this scenario.

Amazon Kinesis Data Streams is a highly scalable and durable service that enables real-time processing of streaming data at high volume and high rate. You can use Kinesis Data Streams to collect and process the clickstream data in real time.

Amazon Kinesis Data Firehose is a fully managed service that loads streaming data into data stores and analytics tools. You can use Kinesis Data Firehose to transmit the data from Kinesis Data Streams to an Amazon S3 data lake.

Once the data is in the data lake, you can use Amazon Redshift to load the data and perform analysis on it. Amazon Redshift is a fully managed, petabyte-scale data warehouse service that allows you to quickly and efficiently analyze data using SQL and your existing business intelligence tools.
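What makes the Firehose-to-S3 hop a usable "data lake" is that Firehose buffers records and, by default, writes objects under a UTC time-based key prefix (`YYYY/MM/DD/HH`), so the bucket is effectively date-partitioned. A rough sketch of that default layout (the `clickstream/` prefix is illustrative):

```python
from datetime import datetime, timezone


def firehose_object_prefix(prefix: str, arrival: datetime) -> str:
    # Firehose's default S3 key layout groups delivered objects by the UTC
    # hour in which they arrived: <prefix>YYYY/MM/DD/HH/
    return arrival.strftime(f"{prefix}%Y/%m/%d/%H/")


# An object buffered at 2023-05-01 09:15 UTC would land under:
print(firehose_object_prefix("clickstream/", datetime(2023, 5, 1, 9, 15, tzinfo=timezone.utc)))
# clickstream/2023/05/01/09/
```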

Buruguduystunstugudunstuy

Option A, which involves using AWS Data Pipeline to archive the data to an Amazon S3 bucket and running an Amazon EMR cluster with the data to generate analytics, is not the most appropriate solution because it does not involve real-time processing of the data.

Option B, which involves creating an Auto Scaling group of Amazon EC2 instances to process the data and sending it to an Amazon S3 data lake for Amazon Redshift to use for analysis, is not the most appropriate solution because it does not involve a fully managed service for transmitting the data from the processing layer to the data lake.

Option C, which involves caching the data to Amazon CloudFront, storing the data in an Amazon S3 bucket, and running an AWS Lambda function to process the data for analysis when an object is added to the S3 bucket, is not the most appropriate solution because it does not involve a scalable and durable service for collecting and processing the data in real time.

MutiverseAgent

The question does not say that real-time is needed here

pentium75

Question asks how to "transmit and process the clickstream data", NOT how to analyze it. Thus D.

ArielSchivoOption: D

Option D. https://aws.amazon.com/es/blogs/big-data/real-time-analytics-with-amazon-redshift-streaming-ingestion/

RBSK

Unsure if this is the right URL for this scenario. Option D refers to S3 and then Redshift, whereas the URL discusses eliminating S3: "We're excited to launch Amazon Redshift streaming ingestion for Amazon Kinesis Data Streams, which enables you to ingest data directly from the Kinesis data stream without having to stage the data in Amazon Simple Storage Service (Amazon S3). Streaming ingestion allows you to achieve low latency in the order of seconds while ingesting hundreds of megabytes of data into your Amazon Redshift cluster."

Reckless_Jas

when you see clickstream data, think about Kinesis Data Stream

PaoloRomaOption: A

I am going to be unpopular here and I'll go for A). Even if there are other services that offer a better experience, Data Pipeline can do the job here: "you can use AWS Data Pipeline to archive your web server's logs to Amazon Simple Storage Service (Amazon S3) each day and then run a weekly Amazon EMR (Amazon EMR) cluster over those logs to generate traffic reports" https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/what-is-datapipeline.html In the question there is no specific timing requirement for analytics, and the EMR cluster job can be scheduled to be executed daily. Option D) is a valid answer too; however, with Amazon Redshift Streaming Ingestion "you can connect to Amazon Kinesis Data Streams data streams and pull data directly to Amazon Redshift without staging data in S3" https://aws.amazon.com/redshift/redshift-streaming-ingestion. So in this scenario Kinesis Data Firehose and S3 are redundant.
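For reference, the streaming ingestion mentioned above is set up entirely in Redshift SQL: an external schema mapped to Kinesis, plus a materialized view over the stream. A sketch, following the pattern in the linked AWS docs (the schema, stream, and IAM role names are made-up placeholders), held as strings as they would be submitted through any Redshift SQL client:

```python
# Redshift streaming-ingestion DDL, per the AWS streaming ingestion docs.
# All names (kds, clickstream, the IAM role ARN) are illustrative placeholders.
CREATE_SCHEMA = """
CREATE EXTERNAL SCHEMA kds
FROM KINESIS
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-streaming-role';
"""

CREATE_VIEW = """
CREATE MATERIALIZED VIEW clickstream_mv AS
SELECT approximate_arrival_timestamp,
       JSON_PARSE(kinesis_data) AS payload
FROM kds.clickstream;
"""

# Refreshing the view pulls records straight from the stream's shards,
# with no Firehose delivery stream or intermediate S3 bucket involved:
REFRESH = "REFRESH MATERIALIZED VIEW clickstream_mv;"
```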

MutiverseAgent

I think I agree with you. It does not make sense in option D) to use Amazon Kinesis Data Firehose to transmit the data to an Amazon S3 data lake and then to Redshift, as you can send the data directly from Firehose to Redshift.

juanrasus2

Also, the Kinesis family is related to real-time or near-real-time services, and that is not a requirement here at all. We have to process data daily, but we do not need to do it in real time.

pentium75

Question asks how to "transmit and process the clickstream data", NOT how to analyze it. This picture shows exactly the scenario in D: Producer - Kinesis - intermediate S3 bucket - Redshift https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2020/07/30/StreamTransformAnalyzeKinesisLambdaRedshift1.png
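The last hop in that picture (intermediate S3 bucket into Redshift) is a plain COPY. A sketch of what that command could look like, with illustrative table, bucket, and IAM role names, held as a string as it would be submitted through a Redshift SQL client:

```python
# Bulk-loads the Firehose-delivered S3 objects into a Redshift table.
# Table, bucket path, and IAM role are placeholders, not from the question.
COPY_SQL = """
COPY clickstream_events
FROM 's3://my-datalake/clickstream/2023/05/01/'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
FORMAT AS JSON 'auto';
"""
```

When Firehose is configured with Redshift as its destination, it issues this COPY itself after staging the data in S3, which is exactly the D pipeline.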

sebasta

Why not A? You can collect data with AWS Data Pipeline and then analyze it with EMR. Whats wrong with this option?

bearcandy

It's not A, the wording is tricky! It says "to archive the data to S3" - there is no mention of archiving in the question, so it has to be D :)

pentium75

And the question is not asking about analyzing the data at all, just about "transmitting and processing".

BoboChowOption: D

D seems to make sense

awsgeek75Option: D

A: Not sure how recent this question is, but Data Pipeline is not really a product AWS recommends anymore https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/what-is-datapipeline.html
B: 30 TB of clickstream data could be done with EC2, but it would be challenging.
C: CloudFront is for CDN and caching, and mostly outgoing data, not incoming.
D: Kinesis, an S3 data lake, and Redshift will work perfectly for this case.

PS_R

Clickstream plus analyze/process: think Kinesis Data Streams.

Guru4CloudOption: D

The key reasons are:
- Kinesis Data Streams can continuously capture and ingest high volumes of clickstream data in real time. This handles the large 30 TB daily data intake.
- Kinesis Firehose can automatically load the streaming data into S3. This creates a data lake for further analysis.
- Firehose can transform and analyze the data in flight before loading to S3 using Lambda. This enables real-time processing.
- The data in S3 can be easily loaded into Amazon Redshift for interactive analysis at scale.
- Kinesis auto-scales to handle the high data volumes. Minimal effort is needed for infrastructure management.
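The in-flight transformation mentioned above is a Lambda function that Firehose invokes with a batch of base64-encoded records; it must return each record with its `recordId` echoed back, a `result` status, and the re-encoded `data`. A minimal sketch of that contract (the enrichment itself, a `"source"` field, is purely illustrative):

```python
import base64
import json


def handler(event, context):
    """Firehose data-transformation Lambda: decode, enrich, re-encode."""
    output = []
    for record in event["records"]:
        # Firehose delivers each record's payload base64-encoded.
        payload = json.loads(base64.b64decode(record["data"]))
        payload["source"] = "firehose"  # illustrative enrichment, not required
        output.append({
            "recordId": record["recordId"],  # must echo the incoming ID
            "result": "Ok",                  # or "Dropped" / "ProcessingFailed"
            "data": base64.b64encode(
                json.dumps(payload).encode("utf-8")
            ).decode("utf-8"),
        })
    return {"records": output}
```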

miki111

Option D is the correct answer

cookieMrOption: D

A. This option utilizes S3 for data storage and EMR for analytics, but Data Pipeline is not an ideal service for real-time streaming data ingestion and processing. It is better suited for batch processing scenarios.

B. This option involves managing and scaling EC2 instances, which adds operational overhead. It is also not a real-time streaming solution.

C. CloudFront is a CDN service and is not designed for real-time data processing or analytics. While using Lambda to process the data can be an option, it may not be the most efficient solution for processing large volumes of clickstream data.

Therefore, collecting the data from Kinesis Data Streams, using Kinesis Data Firehose to transmit it to an S3 data lake, and loading it into Redshift for analysis is the recommended approach. This combination provides a scalable, real-time streaming solution with storage and analytics capabilities that can handle a high volume of clickstream data.

effiecancode

D is the best option

clumsyninja4lifeOption: A

The answer should be A. Clickstream does not mean real time; it just means they capture user interactions on the web page. Kinesis data streaming is not required. Furthermore, Redshift is a data warehousing solution; it can't run complex analysis as well as EMR can. My vote goes for A.

pentium75

Question asks how to "transmit and process the clickstream data", NOT how to analyze it. Also question does NOT ask how to archive the data (as is mentioned in A). Thus D.

Rahulbit34

Clickstream is the key - Answer is D

career360guruOption: D

Option D

studis

It is C. The image here https://aws.amazon.com/kinesis/data-firehose/ shows how Kinesis can send collected data to Firehose, which can send it to Redshift. It is also possible to use an intermediary S3 bucket between Firehose and Redshift. See the image here https://aws.amazon.com/blogs/big-data/stream-transform-and-analyze-xml-data-in-real-time-with-amazon-kinesis-aws-lambda-and-amazon-redshift/

pentium75

Makes sense, but this is D, not C

Wpcorgan

D is correct