Certified Data Engineer Professional Exam Questions

Certified Data Engineer Professional Exam - Question 110


A large company seeks to implement a near real-time solution involving hundreds of pipelines with parallel updates of many tables with extremely high volume and high velocity data.

Which of the following solutions would you implement to achieve this requirement?

Correct Answer: AB

In a scenario involving hundreds of pipelines with parallel updates to many tables handling extremely high volume and high velocity data, it is crucial to choose a solution that optimizes resource utilization and manages concurrent operations efficiently. Databricks High Concurrency clusters are specifically designed to handle multiple concurrent users and workloads. They provide fine-grained sharing of cluster resources and are optimized for operations such as running multiple parallel queries and updates, making them well-suited for this requirement. By leveraging optimized cloud storage connections, they can maximize data throughput, ensuring efficient processing of high-volume data.

Discussion

9 comments
aragorn_brego (Option: A)
Nov 22, 2023

High Concurrency clusters in Databricks are designed for multiple concurrent users and workloads. They provide fine-grained sharing of cluster resources and are optimized for operations such as running multiple parallel queries and updates. This would be suitable for a solution that involves many pipelines with parallel updates, especially with high volume and high velocity data.

bacckom (Option: A)
Jan 11, 2024

Databricks High Concurrency cluster

Curious76 (Option: D)
Feb 29, 2024

Why not D?

petrv (Option: A)
Dec 2, 2023

1) Partitioning by time: Partitioning tables by a small time duration allows for efficient parallelism in data writes. Each time partition can be processed independently, enabling parallel updates to multiple partitions concurrently.

2) Optimizing for parallelism: With the tables partitioned by time, data can be ingested and processed in parallel, which makes it possible to handle high-volume, high-velocity data effectively.

Regarding option A, Databricks High Concurrency clusters are more focused on supporting a large number of concurrent users, which might not directly address the requirement for parallel updates of many tables with extremely high volume and high velocity data.
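To make the "small time duration" idea concrete, here is a minimal sketch in plain Python of how a record's timestamp might be mapped to a time-bucket partition key. The names `partition_key` and `group_by_partition`, the 5-minute bucket size, and the key format are all hypothetical choices for illustration; this is not a Databricks or Delta Lake API.

```python
from collections import defaultdict
from datetime import datetime, timezone


def partition_key(ts: datetime, minutes: int = 5) -> str:
    """Floor a timestamp to the start of its N-minute bucket.

    Assumes `minutes` divides 60 evenly (hypothetical constraint for this sketch).
    """
    ts = ts.astimezone(timezone.utc)
    floored = ts.replace(minute=ts.minute - ts.minute % minutes,
                         second=0, microsecond=0)
    return floored.strftime("%Y-%m-%dT%H:%M")


def group_by_partition(events, minutes: int = 5) -> dict:
    """Group events by bucket so each bucket can be handled independently."""
    buckets = defaultdict(list)
    for event in events:
        buckets[partition_key(event["ts"], minutes)].append(event)
    return dict(buckets)


# Illustrative data only.
events = [
    {"id": 1, "ts": datetime(2024, 6, 5, 10, 2, tzinfo=timezone.utc)},
    {"id": 2, "ts": datetime(2024, 6, 5, 10, 4, tzinfo=timezone.utc)},
    {"id": 3, "ts": datetime(2024, 6, 5, 10, 7, tzinfo=timezone.utc)},
]
grouped = group_by_partition(events)
```

Each key in `grouped` corresponds to one partition, so work on different keys does not touch the same data.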

petrv
Dec 2, 2023

Sorry, the selected answer should have been B.

Isio05
Jun 8, 2024

High Concurrency clusters can be beneficial both for multiple users and for the jobs and queries running on them.

Isio05
Jun 14, 2024

Sorry, after going through this question once more, I'll also go with B. It makes efficient use of parallelism.

PrincipalJoe (Option: B)
Feb 2, 2024

The best way to deal with high volume and high velocity data is to use partitioning.

vctrhugo (Option: A)
Feb 6, 2024

Both options A and B could be relevant depending on the specific details of the use case. If the emphasis is on optimizing concurrent queries and overall data throughput, option A might be more appropriate. If the primary concern is parallel updates of tables with high-volume, high-velocity data, option B is a more targeted approach.

Er5 (Option: A)
Apr 11, 2024

A. Option B is only useful for improving the performance of large table ingestions.

svik (Option: A)
May 23, 2024

Since multiple pipelines are being used, a High Concurrency cluster would give maximum resource utilization.

BrianNguyen95 (Option: B)
Jun 5, 2024

High volume and high-velocity data ingestion often becomes a bottleneck due to limited write parallelism. By partitioning ingestion tables based on small time durations (e.g., hourly or even minutes), you create many smaller partitions. This allows concurrent writes to different partitions, significantly increasing the overall throughput of your data ingestion.
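
The concurrent-write argument can be illustrated with a toy sketch: several workers each write to their own partition directory, so no two writers touch the same files. This uses only the Python standard library and a local temp directory as a stand-in for table storage; `write_partition`, `write_all`, the `bucket=` directory naming, and the JSON Lines file layout are assumptions made for the example, and a real table format applies its own concurrency control on top of this.

```python
import json
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def write_partition(root: Path, key: str, rows: list) -> int:
    """Write one batch into its own partition directory; returns rows written."""
    part_dir = root / f"bucket={key}"
    part_dir.mkdir(parents=True, exist_ok=True)
    with open(part_dir / "part-0000.jsonl", "w", encoding="utf-8") as fh:
        for row in rows:
            fh.write(json.dumps(row) + "\n")
    return len(rows)


def write_all(root: Path, batches: dict, workers: int = 4) -> int:
    """Write every partition concurrently; partitions are disjoint, so writers never collide."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        counts = pool.map(lambda item: write_partition(root, *item),
                          batches.items())
        return sum(counts)


# Illustrative batches keyed by a hypothetical 5-minute bucket label.
batches = {
    "2024-06-05T10-00": [{"id": 1}, {"id": 2}],
    "2024-06-05T10-05": [{"id": 3}],
}
root = Path(tempfile.mkdtemp())
total = write_all(root, batches)
```

The point of the sketch is the isolation: because each batch maps to exactly one partition, adding more partitions adds more independent write targets. Very small buckets also produce many small files, which is a trade-off to weigh when choosing the bucket size.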