A. Use a machine type with more memory: While this might seem logical, autoscaling for Vertex AI endpoints is driven by CPU utilization, not memory usage. Even with more memory, the endpoint would not add replicas as long as CPU utilization stays below the target (see the deployment sketch after this list).
B. Decrease the number of workers per machine: A per-machine worker count is a tuning knob in some serving frameworks, but Vertex AI endpoints do not typically expose a worker setting, so this option is not applicable, and reducing workers would not address the memory bottleneck in any case.
C. Increase the CPU utilization target: This would instruct the endpoint to scale out only once CPU usage reaches an even higher threshold, so scaling would happen later, not sooner. Since the bottleneck is memory, raising the CPU target would never trigger scaling while memory is the limiting factor.
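For reference, the CPU utilization target that options A and C revolve around is set when the model is deployed to the endpoint. Below is a minimal sketch using the google-cloud-aiplatform Python SDK; the project, region, model ID, endpoint ID, machine type, and replica counts are all placeholder values.

```python
from google.cloud import aiplatform

# Placeholder project and region; substitute your own values.
aiplatform.init(project="my-project", location="us-central1")

model = aiplatform.Model(model_name="MODEL_ID")              # hypothetical model ID
endpoint = aiplatform.Endpoint(endpoint_name="ENDPOINT_ID")  # hypothetical endpoint ID

# Autoscaling on Vertex AI endpoints is keyed to CPU utilization (the default
# target is 60%). There is no memory-based autoscaling target, which is why
# options A and C cannot fix a memory-bound workload.
model.deploy(
    endpoint=endpoint,
    machine_type="n1-standard-4",
    min_replica_count=1,
    max_replica_count=5,
    # Replicas are added when average CPU utilization exceeds this target;
    # raising it (option C) only delays scale-out further.
    autoscaling_target_cpu_utilization=70,
)
```

The sketch uses the default CPU-based metric; Vertex AI also supports a GPU duty-cycle target for accelerator-backed deployments, but neither metric reflects memory usage.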