You have deployed a model on Vertex AI for real-time inference. During an online prediction request, you get an “Out of Memory” error. What should you do?
An "Out of Memory" error during an online prediction request suggests that the data being sent in each request is too large and exceeds the memory available on the prediction node. Sending the request again with a smaller batch of instances reduces the amount of data processed at a time, which can avoid the out-of-memory error and let the prediction request complete.
B is the answer: 429 - Out of Memory. https://cloud.google.com/ai-platform/training/docs/troubleshooting
Upvote this comment, it's the right answer!
https://cloud.google.com/ai-platform/training/docs/troubleshooting
B. Send the request again with a smaller batch of instances. An "Out of Memory" error during an online prediction request suggests that the amount of data you are sending in each request is too large and exceeds the available memory. To resolve this, try sending the request again with a smaller batch of instances; this reduces the amount of data sent in each request and helps avoid the out-of-memory error. If the problem persists, you can also try increasing the machine type or the number of instances to provide more resources for the prediction service.
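As an illustration (not part of the question itself), here is a minimal Python sketch of splitting one large online prediction request into smaller batches using the google-cloud-aiplatform SDK. The project, region, endpoint ID, and batch size are placeholders, not values from the question:

```python
# Hypothetical sketch: split one large online prediction request into
# smaller batches so each call stays within the prediction node's memory.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")  # placeholder project/region
endpoint = aiplatform.Endpoint("1234567890")                    # placeholder endpoint ID

def predict_in_batches(instances, batch_size=32):
    """Send instances in chunks of `batch_size` instead of one big request."""
    predictions = []
    for start in range(0, len(instances), batch_size):
        chunk = instances[start:start + batch_size]
        response = endpoint.predict(instances=chunk)
        predictions.extend(response.predictions)
    return predictions
```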
https://cloud.google.com/ai-platform/training/docs/troubleshooting#http_status_codes
answer B as reported here: https://cloud.google.com/ai-platform/training/docs/troubleshooting
The correct answer is B.
This question is about prediction, not training - and specifically it's about _online_ prediction (a.k.a. real-time serving). All the answers are about batch workloads apart from C.
Okay, option D is also about online serving, but the error message indicates a problem for individual predictions, which will not be fixed by increasing the number of predictions per second.
@BenMS this feels like a trick question... it makes one zone in on the word "batch". https://cloud.google.com/ai-platform/training/docs/troubleshooting states that when an error occurs with an online prediction request, you usually get an HTTP status code back from the service. These are some commonly encountered codes and their meaning in the context of online prediction: 429 - Out of Memory. The processing node ran out of memory while running your model. There is no way to increase the memory allocated to prediction nodes at this time. You can try these things to get your model to run: reduce your model size by (1) using less precise variables, (2) quantizing your continuous data, or (3) reducing the size of other input features (using smaller vocab sizes, for example); or send the request again with a smaller batch of instances.
Went with B
By reducing the batch size of instances sent for prediction, you decrease the memory footprint of each request, potentially alleviating the out-of-memory issue. However, be mindful that excessively reducing the batch size might impact the efficiency of your prediction process.
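One hedged way to balance that trade-off is to start with a larger batch and shrink it only when the service reports an out-of-memory style error, rather than always sending tiny batches. The sketch below assumes the error surfaces as an HTTP 429 / TooManyRequests exception from the client library; the endpoint ID and batch sizes are placeholders:

```python
# Hypothetical sketch: shrink the batch only when the service reports an
# out-of-memory / 429-style error, so normal requests keep their efficiency.
from google.api_core import exceptions as gexc
from google.cloud import aiplatform

endpoint = aiplatform.Endpoint("1234567890")  # placeholder endpoint ID

def predict_with_backoff(instances, batch_size=64, min_batch_size=1):
    predictions = []
    start = 0
    while start < len(instances):
        chunk = instances[start:start + batch_size]
        try:
            response = endpoint.predict(instances=chunk)
        except gexc.TooManyRequests:  # assumption: OOM surfaces as HTTP 429
            if batch_size <= min_batch_size:
                raise  # even a single instance does not fit, so give up
            batch_size = max(min_batch_size, batch_size // 2)
            continue  # retry the same offset with the smaller batch
        predictions.extend(response.predictions)
        start += len(chunk)
    return predictions
```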
B) Use a smaller set of tokens.