Professional Machine Learning Engineer Exam Questions

Professional Machine Learning Engineer Exam - Question 238


You have deployed a scikit-learn model to a Vertex AI endpoint using a custom model server. You enabled autoscaling; however, the deployed model fails to scale beyond one replica, which leads to dropped requests. You notice that CPU utilization remains low even during periods of high load. What should you do?

Correct Answer: B

In the given scenario, the model fails to scale beyond one replica and CPU utilization remains low despite high load. This indicates that the bottleneck is in the model server's handling of requests rather than a lack of computational resources. Increasing the number of workers in the model server lets it handle more concurrent requests and drives CPU utilization up; because the autoscaler uses CPU utilization as its scaling signal, this both makes better use of each replica and allows scale-out to trigger when it is actually needed.
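For concreteness, here is a minimal sketch of what "increasing the number of workers" can look like for a custom container that serves the scikit-learn model with gunicorn. The file name, app module, and specific values are illustrative assumptions, not part of the question:

```python
# gunicorn.conf.py -- illustrative worker settings for a custom model server.
# Assumes the prediction app is a WSGI app named "app" in main.py (hypothetical).
import multiprocessing
import os

# Start with one worker per core, then raise this if CPU utilization stays
# low under high load, so the replica can actually saturate its CPUs.
workers = multiprocessing.cpu_count()
threads = 2  # extra threads per worker help with I/O-bound request handling

# Vertex AI passes the serving port via the AIP_HTTP_PORT variable (8080 by default).
bind = f"0.0.0.0:{os.environ.get('AIP_HTTP_PORT', '8080')}"
timeout = 120  # give slower predictions time to finish instead of killing workers
```

The container entrypoint would then run something like `gunicorn --config gunicorn.conf.py main:app`.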

Discussion

6 comments
sonicclasps Option: A
Jan 31, 2024

"We generally recommend starting with one worker or thread per core. If you notice that CPU utilization is low, especially under high load, or your model is not scaling up because CPU utilization is low, then increase the number of workers." https://cloud.google.com/vertex-ai/docs/general/deployment

sonicclasps
Jan 31, 2024

Sorry, I clicked the wrong option; the answer is B.

pikachu007 Option: B
Jan 13, 2024

Low CPU utilization: despite high load, low CPU utilization indicates underutilization of the available resources, suggesting a bottleneck within the model server itself, not in overall compute capacity.
Worker concurrency: increasing the number of workers within the model server allows it to handle more concurrent requests, effectively utilizing the available CPU and addressing the bottleneck.

BlehMaks
Jan 19, 2024

I don't get it. The autoscaling system should increase/decrease the number of workers itself. If we do it instead of the autoscaling system, why do we need it?

guilhermebutzke
Feb 16, 2024

Increasing the number of workers within the model server will distribute the load within the single replica, but it wouldn't address the problem of not scaling beyond one replica. Increasing workers would be a good option for reducing prediction delay.

asmgi
Jul 14, 2024

Not scaling beyond one replica is a symptom, not the source of the problem. The problem is the low CPU utilization.

Carlose2108 Option: B
Feb 26, 2024

I went with B.

guilhermebutzke Option: C
Feb 16, 2024

My answer: C. The problem is with scaling; the provided resources are OK. So:
A: Not correct, because CPU is sufficient.
B: Not correct, because increasing the number of workers will speed up processing within a single replica and make predictions faster, for example, but it won't address the scaling problem.
C: Correct. This option involves adjusting the scaling of resources to match the expected demand, ensuring that the system can handle increased loads effectively.
D: This might help ensure at least one replica is always available, but it won't address the issue of not scaling up during high load.

pinimichele01 Option: B
Apr 13, 2024

Agree with sonicclasps -> B

pinimichele01
Apr 21, 2024

NOT D: This might help ensure at least one replica is always available, but it won't address the issue of not scaling up during high load.

fitri001 Option: B
Apr 18, 2024

Agree with sonicclasps -> B