Which of the following code blocks will most quickly return an approximation for the number of distinct values in column division in DataFrame storesDF?
To most quickly return an approximation for the number of distinct values in column 'division' in DataFrame 'storesDF', you would use the code block that allows the highest estimation error. The function 'approx_count_distinct' uses a probabilistic data structure, and its second parameter determines the maximum relative error. A higher error value results in a faster computation but less accuracy. Therefore, the option with an estimation error of 0.15 will be the quickest because it accepts the most error, trading off accuracy for speed.
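As a rough sketch of why a larger allowed error is cheaper: `approx_count_distinct` is backed by a HyperLogLog-style sketch, whose relative standard deviation is roughly 1.04/sqrt(m) for m registers (that formula is the standard HyperLogLog bound, not something stated in the question). Solving for m shows how much smaller the sketch gets as the allowed error grows:

```python
import math

def hll_registers(rsd: float) -> int:
    """Approximate number of HyperLogLog registers needed to hit a
    target relative standard deviation, from rsd ~= 1.04 / sqrt(m)."""
    return math.ceil((1.04 / rsd) ** 2)

# The three error values from the answer options: a larger rsd
# needs far fewer registers, so the sketch is smaller and cheaper.
for rsd in (0.01, 0.05, 0.15):
    print(f"rsd={rsd}: ~{hll_registers(rsd)} registers")
```

So an rsd of 0.15 needs a sketch two orders of magnitude smaller than 0.01, which is why it is the fastest of the options.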
To quickly return an approximation for the number of distinct values in column division in DataFrame storesDF, the most efficient code block to use would be: B. storesDF.agg(approx_count_distinct(col("division"), 0.01).alias("divisionDistinct")) Using the approx_count_distinct() function allows for an approximate count of the distinct values in the column without scanning the entire DataFrame. The second parameter passed to the function is the maximum estimation error allowed, which in this case is set to 0.01. This is a trade-off between the accuracy of the estimate and the computational cost. Option C may still be efficient but with a larger estimation error of 0.15. Options A and D are not correct as they do not specify the estimation error, which means the function would use the default value of 0.05. Option E specifies an estimation error of 0.05, but a smaller error of 0.01 gives a more accurate estimate, at some additional computational cost.
I see you replying to a lot of questions, and you're barely ever correct. Bro, you need to stop posting wrong information here. This question only asks about speed; there's no need to balance accuracy against efficiency. Stop posting ChatGPT answers here.
I noticed the same thing with this ID - bro answers with confidence, so I have to triple-check everything, because he keeps answering wrong and creating doubts in my head.
But your answer contradicts the question: they only ask for the fastest way. The closer the error value is to zero, the more time and resources it takes. 0.15 > 0.01, which means option C will be faster; it will have more error, but it will be the fastest.
C The code block that will most quickly return an approximation for the number of distinct values in column `division` in DataFrame `storesDF` is **C**, `storesDF.agg(approx_count_distinct(col("division"), 0.15).alias("divisionDistinct"))`. The `approx_count_distinct` function can be used to quickly estimate the number of distinct values in a column by using a probabilistic data structure. The second parameter of the `approx_count_distinct` function specifies the maximum estimation error allowed, with a smaller value resulting in a more accurate but slower estimation. In this case, an error of 0.15 is specified, which will result in a faster but less accurate estimation than the other options.
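To make the speed/accuracy trade concrete outside of Spark, here is a minimal toy HyperLogLog in plain Python. This is a simplified sketch of the kind of probabilistic structure `approx_count_distinct` uses internally, not Spark's actual implementation: the hashing scheme, the fixed register count, and the lack of bias/small-range corrections are all assumptions for illustration. The key point is that fewer registers (i.e. a larger allowed error) means a smaller sketch that is cheaper to update and merge.

```python
import hashlib

def hll_estimate(values, p=8):
    """Toy HyperLogLog with m = 2**p registers.

    A smaller p (larger allowed error) gives a smaller sketch and
    cheaper merges, at the cost of a noisier estimate.
    """
    m = 1 << p
    registers = [0] * m
    for v in values:
        # 64-bit deterministic hash of the value
        h = int.from_bytes(hashlib.sha1(str(v).encode()).digest()[:8], "big")
        idx = h >> (64 - p)                # first p bits pick a register
        rest = h & ((1 << (64 - p)) - 1)   # remaining 64-p bits
        # rank = number of leading zero bits in `rest`, plus one
        rank = (64 - p) - rest.bit_length() + 1
        registers[idx] = max(registers[idx], rank)
    alpha = 0.7213 / (1 + 1.079 / m)       # standard bias constant for m >= 128
    return alpha * m * m / sum(2.0 ** -r for r in registers)

true_n = 50_000
est = hll_estimate(range(true_n))
print(f"true={true_n}, estimate={est:.0f}")
```

With p=8 (256 registers) the expected relative error is around 1.04/16 ≈ 6.5%, which mirrors how Spark maps the rsd argument to a sketch size: a looser rsd like 0.15 buys speed, a tight rsd like 0.01 buys accuracy.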
A. https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.approx_count_distinct.html
While not an option I would use, the question says most quickly (relatively), and this will be the fastest. Note that a 15% error is too high.
C - the less accurate the calculation, the faster it is
The higher the relative error parameter, the less accurate and faster. The lower the relative error parameter, the more accurate and slower.
B. storesDF.agg(approx_count_distinct(col("division"), 0.01).alias("divisionDistinct")) Explanation: approx_count_distinct(col("division"), 0.01): This uses the approx_count_distinct function to approximate the number of distinct values in the "division" column with a relative error of 1%. The smaller the relative error, the more accurate the approximation, but it may require more resources. .alias("divisionDistinct"): This renames the result column to "divisionDistinct" for better readability. So, the correct answer is: B. storesDF.agg(approx_count_distinct(col("division"), 0.01).alias("divisionDistinct"))
C is the correct answer. C. storesDF.agg(approx_count_distinct(col("division"), 0.15).alias("divisionDistinct")) This option uses the largest rsd value (0.15), which means it prioritizes speed over accuracy. The smaller the rsd, the more accurate the result, but the longer it might take to compute. Conversely, a larger rsd value provides a faster result with less accuracy.