Exam DP-203
Question 40

You are implementing a batch dataset in the Parquet format.

Data files will be produced by using Azure Data Factory and stored in Azure Data Lake Storage Gen2. The files will be consumed by an Azure Synapse Analytics serverless SQL pool.

You need to minimize storage costs for the solution.

What should you do?

A. Use Snappy compression for the files.
B. Use OPENROWSET to query the Parquet files.
C. Create an external table that contains a subset of columns from the Parquet files.
D. Store all data as strings in the Parquet files.

    Correct Answer: A

    To minimize storage costs for Parquet files, using Snappy compression is an effective strategy. Snappy is a fast and efficient compression algorithm that provides a good balance between compression ratio and processing speed. By compressing data with Snappy, the file sizes are significantly reduced, resulting in lower storage costs on Azure Data Lake Storage Gen2.
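As an illustration of this effect (not part of the original question), here is a minimal sketch using pyarrow that writes the same table with and without Snappy compression and compares the resulting file sizes. The column names and data are made up, and the actual savings depend heavily on the data.

```python
import os
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical sample data: a low-cardinality category column compresses well.
n = 1_000_000
table = pa.table({
    "id": np.arange(n),
    "category": np.random.choice(["red", "green", "blue"], size=n),
    "amount": np.round(np.random.rand(n) * 100, 2),
})

# Write the same table with and without Snappy compression.
pq.write_table(table, "batch_snappy.parquet", compression="snappy")
pq.write_table(table, "batch_none.parquet", compression="none")

for path in ("batch_snappy.parquet", "batch_none.parquet"):
    print(path, os.path.getsize(path), "bytes")
```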

Discussion
m2shinesOption: A

Answer should be A, because this talks about minimizing storage costs, not querying costs

assU2

Isn't Snappy a default compressionCodec for Parquet in Azure? https://docs.microsoft.com/en-us/azure/data-factory/format-parquet

jongert

I was very confused at first; after thinking about it and rereading, this is what I found: it says we are implementing the batch dataset in Parquet format, so we should think about a situation where we write the file and explicitly specify Snappy compression as an argument. The phrasing is very confusing, I have to say, but if you argue from a 'query externally' perspective, then B and C would yield the same benefit. Therefore, A makes the most sense and connects best with the question.

Homer23

I found this comparison of compression methods, which explains why A should not be the answer: https://www.linkedin.com/pulse/comparison-compression-methods-parquet-file-format-saurav-mohapatra/ "BROTLI: This is a relatively new codec which offers a very high compression ratio, but with lower compression and decompression speeds. This codec is useful when storage space is a major constraint. This technique also offers parallel processing that other methods don't."
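To make that trade-off concrete, here is a small sketch of my own (not from the question) that writes the same table with Snappy and with Brotli using pyarrow and reports file size and write time; the exact numbers depend entirely on the data and the environment.

```python
import os
import time
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical sample data for the codec comparison.
table = pa.table({"reading": np.random.randint(0, 500, size=2_000_000)})

# Brotli typically trades slower writes for a higher compression ratio than Snappy.
for codec in ("snappy", "brotli"):
    path = f"readings_{codec}.parquet"
    start = time.perf_counter()
    pq.write_table(table, path, compression=codec)
    elapsed = time.perf_counter() - start
    print(f"{codec}: {os.path.getsize(path)} bytes, written in {elapsed:.2f}s")
```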

Aslam208Option: C

C is the correct answer, as an external table with a subset of columns over the Parquet files would be cost-effective.

Massy

In a serverless SQL pool you don't create a copy of the data, so how could it be cost-effective?

Bro111

Don't forget that there is a transaction cost as part of the storage cost, so taking a subset of columns will lower the transaction cost and consequently the storage cost.

RehanRajput

This is not correct. 1. External tables are not saved in the database (this is why they're external). 2. You're assuming that serverless SQL pools have local storage. They don't --> https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/best-practices-serverless-sql-pool

Aditya0891

Well, there is a possibility to create an external table and load only the required columns using OPENROWSET in a serverless SQL pool to a different container in ADLS. Remember that a serverless SQL pool does support CETAS with OPENROWSET, but a dedicated pool doesn't support loading data using OPENROWSET. So basically the solution could be: load the required columns with CETAS using OPENROWSET to a different container, and delete the source data from the previous container after loading the filtered data to the other container in ADLS.

Aditya0891

check this https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-cetas. Answer C is correct
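For readers who want to see what the approach described above looks like, the following is a rough sketch of CETAS with OPENROWSET run from Python via pyodbc against a serverless SQL pool. The server name, database, data source, file format, paths, and column names are all placeholders, and the external data source and file format objects are assumed to already exist.

```python
import pyodbc

# Hypothetical connection to a Synapse serverless SQL pool endpoint.
conn_str = (
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=myworkspace-ondemand.sql.azuresynapse.net;"
    "Database=mydb;"
    "Authentication=ActiveDirectoryInteractive;"
)

# CETAS: materialize only the needed columns as new Parquet files in ADLS Gen2.
# MyAdlsDataSource and ParquetFormat are assumed, pre-existing external objects.
cetas_sql = """
CREATE EXTERNAL TABLE dbo.TripsSubset
WITH (
    LOCATION = 'filtered/trips/',
    DATA_SOURCE = MyAdlsDataSource,
    FILE_FORMAT = ParquetFormat
)
AS
SELECT trip_id, fare_amount
FROM OPENROWSET(
    BULK 'raw/trips/*.parquet',
    DATA_SOURCE = 'MyAdlsDataSource',
    FORMAT = 'PARQUET'
) AS src;
"""

with pyodbc.connect(conn_str, autocommit=True) as conn:
    conn.execute(cetas_sql)
```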

auwiaOption: A

I would go by exclusion:
A. Use Snappy compression for the files. --> Nothing against this!
B. Use OPENROWSET to query the Parquet files. --> Doing this I just get a preview of the Parquet files.
C. Create an external table that contains a subset of columns from the Parquet files. --> Nobody asked for a subset.
D. Store all data as strings in the Parquet files. --> Nobody asked for that.

semauni

Storing data as a string would also make the file size bigger

VittalManikondaOption: A

As per ChatGPT, the answer is A.

ElancheOption: A

Using Snappy compression for the Parquet files helps minimize storage costs while still maintaining good compression efficiency. Snappy is a compression library that offers a good balance between compression ratio and processing speed. By compressing the data using Snappy, you can significantly reduce the amount of storage required for your dataset.

Option B, using OPENROWSET to query the Parquet files, doesn't directly impact storage costs. It's a method for querying data but doesn't address storage optimization.

Option C, creating an external table with a subset of columns, may help reduce query costs by minimizing the amount of data that needs to be processed during queries. However, it doesn't directly address storage costs.

Option D, storing all data as strings in the Parquet files, would likely increase storage costs rather than minimize them. Storing data as strings without appropriate compression would result in larger file sizes compared to using efficient compression algorithms like Snappy.
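To illustrate the point about option D (my own sketch, not part of the exam material), the snippet below writes the same hypothetical numeric column once as float64 and once as strings, both Snappy-compressed, so the size difference can be observed directly.

```python
import os
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical numeric column written twice: typed vs. stringified.
values = np.random.rand(500_000)
typed = pa.table({"value": values})
stringified = pa.table({"value": [str(v) for v in values]})

pq.write_table(typed, "values_typed.parquet", compression="snappy")
pq.write_table(stringified, "values_strings.parquet", compression="snappy")

for path in ("values_typed.parquet", "values_strings.parquet"):
    print(path, os.path.getsize(path), "bytes")
```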

Zen9nezOption: C

The answer is C. Parquet has default SNAPPY compression, which cannot be overridden, so why would I apply SNAPPY again?

vctrhugoOption: A

Snappy compression is a popular and efficient compression algorithm for Parquet files. It provides a good balance between compression ratio and query performance. By compressing the Parquet files using Snappy, you can significantly reduce the storage footprint, leading to lower storage costs. C is not effective for minimizing storage costs: While creating an external table with a subset of columns can help reduce storage costs, it doesn't specifically address the Parquet format or compression. This option is more related to data modeling and selecting specific columns for query performance rather than minimizing storage costs.

DusicaOption: A

A, B, and C are all acceptable; D is just stupid. But pay attention to "You need to minimize storage costs for the solution" - that means Snappy Parquet compression. A is correct.

dgerokOption: A

"Use Snappy compression for the files" is the only answer that is about minimizing the cost of storage. While one is using a serverless SQL pool, external tables are available, but they are only metadata...

s_unsworthOption: A

Further information is required for this question; there isn't enough to go on as to what is being asked. The initial question is about storage, which would point to the Snappy compression answer. If it is asking about querying the data, then that should be clearly stated in the question. If someone were to create a user story for this ("As a manager, I want to store data in the data lake at a reduced cost"), then you wouldn't be providing them with an external table. You would give them information on storage.

Ram9198Option: A

Snappy

kkk5566Option: A

using compression

DanweoOption: C

The question is confusing, but I believe it is C, because you can use CETAS to store this external table in Gen2 (this is the storage solution); from there you will query it using the serverless SQL pool.

ankeshpatel2112Option: A

A. Use Snappy compression for the files.

Joanna0Option: A

Snappy compression can reduce the size of Parquet files by up to 70%. This can save you a significant amount of money on storage costs.

kkk5566Option: A

To minimize storage costs for the solution, you should use Snappy compression for the files. Snappy is a fast and efficient data compression and decompression library that can be used to compress Parquet files. This will help reduce the size of the data files and minimize storage costs in Azure Data Lake Storage Gen2. So, the correct answer is A. Use Snappy compression for the files

andjurovicelaOption: C

When presented with only the options of column pruning (a variant of which is C) and compression (an example of which would be Snappy), ChatGPT chooses C.