
Professional Machine Learning Engineer Exam - Question 32


You developed an ML model with AI Platform, and you want to move it to production. You serve a few thousand queries per second and are experiencing latency issues. Incoming requests are served by a load balancer that distributes them across multiple Kubeflow CPU-only pods running on Google Kubernetes Engine (GKE). Your goal is to improve the serving latency without changing the underlying infrastructure. What should you do?

Correct Answer: D

To improve serving latency without changing the underlying infrastructure, the most effective option is to recompile TensorFlow Serving from source with support for CPU-specific optimizations and instruct GKE to choose an appropriate baseline minimum CPU platform for the serving nodes. This leverages CPU-specific performance enhancements and ensures that GKE schedules the serving pods onto nodes with the required computational capabilities, thereby reducing latency.
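As a rough illustration of what this option involves, here is a minimal sketch assuming a Bazel source build of the model server and a dedicated GKE node pool; the pool and cluster names, machine type, and instruction-set flags are hypothetical and would need to match the actual serving CPUs:

```sh
# Sketch: build tensorflow_model_server from source with CPU-specific
# instruction-set flags (choose flags supported by your serving CPUs).
bazel build -c opt \
  --copt=-msse4.2 --copt=-mavx2 --copt=-mfma \
  tensorflow_serving/model_servers:tensorflow_model_server

# Sketch: pin the serving node pool to a baseline minimum CPU platform so
# GKE only schedules these pods on CPUs that support the compiled-in
# instructions ("serving-pool" and "ml-cluster" are hypothetical names).
gcloud container node-pools create serving-pool \
  --cluster=ml-cluster \
  --machine-type=n1-highcpu-16 \
  --min-cpu-platform="Intel Skylake"
```

Note that the minimum CPU platform is applied when a node pool is created, so in practice the serving pods would then be rescheduled onto the new pool.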

Discussion

17 comments
Y2Data (Option: D)
Sep 15, 2021

D is correct, since this question focuses on serving performance in production, which sees far more load than a development environment. The pods are already throttled, so increasing the pressure on them won't help, and both A and C essentially do that. B is a bit mysterious, but we definitely know that D would work.

mousseUwU
Oct 20, 2021

I think it's D too

pico (Option: C)
Nov 13, 2023

https://github.com/tensorflow/serving/blob/master/tensorflow_serving/batching/README.md#batch-scheduling-parameters-and-tuning A may help to some extent, but it primarily affects how many requests are processed in a single batch, so it might not directly address the latency issue. D is a valid approach for adding CPU-specific optimizations to TensorFlow Serving, but it's a more involved process and might not be the quickest way to address latency issues.

Yajnas_arpohc (Option: A)
Mar 24, 2023

CPU-only: One Approach If your system is CPU-only (no GPU), then consider starting with the following values: num_batch_threads equal to the number of CPU cores; max_batch_size to a really high value; batch_timeout_micros to 0. Then experiment with batch_timeout_micros values in the 1-10 millisecond (1000-10000 microsecond) range, while keeping in mind that 0 may be the optimal value. https://github.com/tensorflow/serving/tree/master/tensorflow_serving/batching

frangm23
Apr 24, 2023

In that very link, it says that max_batch_size is the parameter that governs the latency/throughput tradeoff, and as I understand it, the higher the batch size, the higher the throughput, but that doesn't guarantee lower latency. I would go with D.
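For reference, TensorFlow Serving's batching parameters (including max_batch_size discussed above) are typically tuned via a text-format config file passed to the model server. This is only a sketch using the CPU-only starting values from the README quoted above; the model name and paths are hypothetical:

```sh
# Sketch: TensorFlow Serving batching config with the README's suggested
# CPU-only starting values (then experiment with batch_timeout_micros
# in the 1000-10000 microsecond range).
cat > /tmp/batching_parameters.txt <<'EOF'
num_batch_threads { value: 8 }       # roughly the number of CPU cores
max_batch_size { value: 1024 }       # raising this favors throughput over latency
batch_timeout_micros { value: 0 }    # 0 may be the optimal value for latency
max_enqueued_batches { value: 100 }  # bounds queueing delay and memory use
EOF

# Hypothetical model name and path; the relevant flags here are
# --enable_batching and --batching_parameters_file.
tensorflow_model_server \
  --model_name=my_model \
  --model_base_path=/models/my_model \
  --enable_batching=true \
  --batching_parameters_file=/tmp/batching_parameters.txt
```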

SergioRubiano (Option: A)
Mar 24, 2023

A is correct: the max_batch_size TensorFlow Serving parameter.

felden (Option: D)
Jul 27, 2022

A would further increase latency. It may only help to improve the throughput if the memory and computation power of the GKE pods are not saturated.

sachinxshrivastav (Option: D)
Aug 5, 2022

D is the correct one

sachinxshrivastav (Option: D)
Aug 6, 2022

D is the right one

wish0035 (Option: D)
Dec 15, 2022

ans: D

Omi_04040 (Option: D)
Dec 26, 2022

Answer: D https://www.youtube.com/watch?v=fnZTVQ1SnDg

M25 (Option: D)
May 9, 2023

Went with D

Liting (Option: D)
Jul 7, 2023

Definitely D. To improve the serving latency of an ML model on AI Platform, you can recompile TensorFlow Serving from source to support CPU-specific optimizations and instruct GKE to choose an appropriate baseline minimum CPU platform for serving nodes; this way GKE will schedule the pods on nodes with at least that CPU platform.

harithacML (Option: D)
Jul 12, 2023

max_batch_size parameter controls the maximum number of requests that can be batched together by TensorFlow Serving. Increasing this parameter can help reduce the number of round trips between the client and server, which can improve serving latency. However, increasing the batch size too much can lead to higher memory usage and longer processing times for each batch.

tavva_prudhvi (Option: D)
Aug 10, 2023

Increasing the max_batch_size TensorFlow Serving parameter is not the best choice because a larger batch size may not improve latency. In fact, it may even increase latency for individual requests, since they have to wait for the batch to fill before being processed. This is useful when optimizing for throughput, but not for serving latency, which is the primary goal in this scenario.

ichbinnoah (Option: A)
Nov 11, 2023

I think A is correct, as D implies changes to the infrastructure (question says you must not do that).

edoo
Mar 4, 2024

This is purely a software optimization and a matter of how GKE schedules the serving nodes. GKE should be able to choose different CPU types for nodes within the same cluster, which doesn't represent a change to the underlying infrastructure.

PhilipKoku (Option: C)
Jun 6, 2024

C) Batch enqueued

chirag2506 (Option: D)
Jun 25, 2024

it is D