
Professional Machine Learning Engineer Exam - Question 32


You developed an ML model with AI Platform, and you want to move it to production. You serve a few thousand queries per second and are experiencing latency issues. Incoming requests are served by a load balancer that distributes them across multiple Kubeflow CPU-only pods running on Google Kubernetes Engine (GKE). Your goal is to improve the serving latency without changing the underlying infrastructure. What should you do?

Correct Answer: D

To improve serving latency without changing the underlying infrastructure, the most effective option is to recompile TensorFlow Serving from source with support for CPU-specific optimizations and instruct GKE to choose an appropriate baseline minimum CPU platform for the serving nodes. This leverages CPU-specific performance enhancements and ensures that GKE schedules the serving pods onto nodes with the required computational capabilities, thereby reducing latency.
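As a rough illustration of what this option involves, here is a minimal sketch assuming a Bazel source build of the model server and a dedicated GKE node pool; the pool and cluster names, machine type, and instruction-set flags are hypothetical and would need to match the actual serving CPUs:

```sh
# Sketch: build tensorflow_model_server from source with CPU-specific
# instruction-set flags (choose flags supported by your serving CPUs).
bazel build -c opt \
  --copt=-msse4.2 --copt=-mavx2 --copt=-mfma \
  tensorflow_serving/model_servers:tensorflow_model_server

# Sketch: pin the serving node pool to a baseline minimum CPU platform so
# GKE only schedules these pods on CPUs that support the compiled-in
# instructions ("serving-pool" and "ml-cluster" are hypothetical names).
gcloud container node-pools create serving-pool \
  --cluster=ml-cluster \
  --machine-type=n1-highcpu-16 \
  --min-cpu-platform="Intel Skylake"
```

Note that the minimum CPU platform is applied when a node pool is created, so in practice the serving pods would then be rescheduled onto the new pool.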

Discussion

17 comments
Y2Data (Option: D)
Sep 15, 2021

D is correct, since this question focuses on serving performance in production, which sees far more load than a development environment. The pods are already throttled, so increasing the pressure on them won't help, and both A and C essentially do that. B is a bit mysterious, but we definitely know that D would work.

mousseUwU
Oct 20, 2021

I think it's D too

pico (Option: C)
Nov 13, 2023

https://github.com/tensorflow/serving/blob/master/tensorflow_serving/batching/README.md#batch-scheduling-parameters-and-tuning A may help to some extent, but it primarily affects how many requests are processed in a single batch, so it might not directly address the latency issue. D is a valid approach for adding CPU-specific optimizations to TensorFlow Serving, but it's a more involved process and might not be the quickest way to address latency issues.

Yajnas_arpohc (Option: A)
Mar 24, 2023

CPU-only: One Approach If your system is CPU-only (no GPU), then consider starting with the following values: num_batch_threads equal to the number of CPU cores; max_batch_size to a really high value; batch_timeout_micros to 0. Then experiment with batch_timeout_micros values in the 1-10 millisecond (1000-10000 microsecond) range, while keeping in mind that 0 may be the optimal value. https://github.com/tensorflow/serving/tree/master/tensorflow_serving/batching

frangm23
Apr 24, 2023

In that very link, it says that max_batch_size is the parameter that governs the latency/throughput tradeoff, and as I understand it, the higher the batch size, the higher the throughput, but that doesn't guarantee lower latency. I would go with D.
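For reference, TensorFlow Serving's batching parameters (including max_batch_size discussed above) are typically tuned via a text-format config file passed to the model server. This is only a sketch using the CPU-only starting values from the README quoted above; the model name and paths are hypothetical:

```sh
# Sketch: TensorFlow Serving batching config with the README's suggested
# CPU-only starting values (then experiment with batch_timeout_micros
# in the 1000-10000 microsecond range).
cat > /tmp/batching_parameters.txt <<'EOF'
num_batch_threads { value: 8 }       # roughly the number of CPU cores
max_batch_size { value: 1024 }       # raising this favors throughput over latency
batch_timeout_micros { value: 0 }    # 0 may be the optimal value for latency
max_enqueued_batches { value: 100 }  # bounds queueing delay and memory use
EOF

# Hypothetical model name and path; the relevant flags here are
# --enable_batching and --batching_parameters_file.
tensorflow_model_server \
  --model_name=my_model \
  --model_base_path=/models/my_model \
  --enable_batching=true \
  --batching_parameters_file=/tmp/batching_parameters.txt
```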

SergioRubiano (Option: A)
Mar 24, 2023

A is correct: the max_batch_size TensorFlow Serving parameter.

felden (Option: D)
Jul 27, 2022

A would further increase latency. It may only help to improve the throughput if the memory and computation power of the GKE pods are not saturated.

sachinxshrivastav (Option: D)
Aug 5, 2022

D is the correct one

sachinxshrivastav (Option: D)
Aug 6, 2022

D is the right one

wish0035 (Option: D)
Dec 15, 2022

ans: D

Omi_04040 (Option: D)
Dec 26, 2022

Answer: D https://www.youtube.com/watch?v=fnZTVQ1SnDg

M25 (Option: D)
May 9, 2023

Went with D

Liting (Option: D)
Jul 7, 2023

Definitely D. To improve the serving latency of an ML model on AI Platform, you can recompile TensorFlow Serving from source to support CPU-specific optimizations and instruct GKE to choose an appropriate baseline minimum CPU platform for serving nodes; this way GKE will schedule the pods on nodes with at least that CPU platform.

harithacML (Option: D)
Jul 12, 2023

max_batch_size parameter controls the maximum number of requests that can be batched together by TensorFlow Serving. Increasing this parameter can help reduce the number of round trips between the client and server, which can improve serving latency. However, increasing the batch size too much can lead to higher memory usage and longer processing times for each batch.

tavva_prudhvi (Option: D)
Aug 10, 2023

Increasing the max_batch_size TensorFlow Serving parameter is not the best choice because a larger batch size may not improve latency. In fact, it may even increase latency for individual requests, since they have to wait for the batch to fill before being processed. This is useful when optimizing for throughput, but not for serving latency, which is the primary goal in this scenario.

ichbinnoah (Option: A)
Nov 11, 2023

I think A is correct, as D implies changes to the infrastructure (question says you must not do that).

edoo
Mar 4, 2024

This is purely a software optimization and a matter of how GKE schedules the serving nodes. GKE should be able to choose different CPU types for nodes within the same cluster, which doesn't represent a change to the underlying infrastructure.

PhilipKoku (Option: C)
Jun 6, 2024

C) Batch enqueued

chirag2506 (Option: D)
Jun 25, 2024

it is D