Scalable LLM Deployment on AWS with Ray Serve and FastAPI
Deploying Large Language Models (LLMs) effectively requires a robust architecture that can handle high concurrency, manage GPU resources efficiently, and scale dynamically. In this post, I’ll walk through a production-ready setup for hosting open-source models (like Llama 3 or Mistral) on AWS using Ray Serve for orchestration and FastAPI as the interface.
The Architecture
The stack consists of:
- Infrastructure: AWS EC2 instances (g5.xlarge or similar GPU-optimized instances)
- Orchestration: Ray Cluster (Head node + Worker nodes)
- Serving: Ray Serve wrapping a FastAPI application
- Model Engine: vLLM for high-throughput inference
Why Ray Serve?
Ray Serve excels at “model composition” and scaling. Unlike a plain Docker-container deployment, Ray Serve lets us:
- Scale independently: scale model replicas separately from the API-handling logic.
- Batch dynamically: use native dynamic request batching to maximize GPU utilization (see the sketch after this list).
- Compose pipelines: chain multiple models or pre-/post-processing steps.
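To make the batching point concrete, here is a minimal sketch of Ray Serve's `@serve.batch` decorator combined with an autoscaling deployment; the class name, batch size, and replica bounds are illustrative placeholders, not tuned recommendations.

```python
# A minimal sketch of Ray Serve dynamic batching; names and values are illustrative.
from typing import List

from ray import serve


@serve.deployment(
    ray_actor_options={"num_gpus": 1},
    autoscaling_config={"min_replicas": 1, "max_replicas": 4},
)
class BatchedModel:
    @serve.batch(max_batch_size=8, batch_wait_timeout_s=0.1)
    async def generate_batch(self, prompts: List[str]) -> List[str]:
        # Ray collects up to 8 concurrent requests (or waits 100 ms), then
        # passes them here as one list so the GPU runs a single forward pass.
        return [f"generated: {p}" for p in prompts]  # placeholder for real inference

    async def __call__(self, prompt: str) -> str:
        # Callers (e.g., via a DeploymentHandle) send one prompt each;
        # Ray Serve transparently fuses concurrent calls into a batch.
        return await self.generate_batch(prompt)


batched_app = BatchedModel.bind()
```

Concurrent requests arriving within the wait window are served in a single call, which is what keeps the GPU busy with large batches instead of many tiny ones.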
Configuration
Here is a simplified serve_config.yaml to get started. This configuration defines a deployment that autoscales based on request load.
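As a rough sketch, assuming the FastAPI deployment shown in the next section lives in a module named `serve_app` and is bound to a variable called `deployment` (both placeholder names):

```yaml
# serve_config.yaml — a sketch; import_path, names, and limits are placeholders.
proxy_location: EveryNode
http_options:
  host: 0.0.0.0
  port: 8000

applications:
  - name: llm-app
    route_prefix: /
    import_path: serve_app:deployment
    deployments:
      - name: VLLMDeployment
        ray_actor_options:
          num_gpus: 1
        autoscaling_config:
          min_replicas: 1
          max_replicas: 4
          # Newer Ray versions call this target_ongoing_requests; older ones
          # use target_num_ongoing_requests_per_replica.
          target_ongoing_requests: 5
```

Replicas are added or removed to keep the number of in-flight requests per replica near the target, which is the knob that ties scaling to request load.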
The FastAPI Wrapper
We wrap the vLLM engine in a FastAPI app to expose standard REST endpoints. This allows easy integration with existing frontend applications or services.
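A sketch of the wrapper follows; the vLLM async-engine calls reflect one common version of its API (which shifts between releases) and the model name is a placeholder, so adapt the engine arguments and routes to your own setup:

```python
# serve_app.py — a sketch; model name and vLLM API details may differ by version.
import uuid

from fastapi import FastAPI
from pydantic import BaseModel
from ray import serve
from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

app = FastAPI()


class GenerateRequest(BaseModel):
    prompt: str
    max_tokens: int = 256
    temperature: float = 0.7


@serve.deployment(ray_actor_options={"num_gpus": 1})
@serve.ingress(app)
class VLLMDeployment:
    def __init__(self):
        # Load the model once per replica; the model name is a placeholder.
        args = AsyncEngineArgs(model="meta-llama/Meta-Llama-3-8B-Instruct")
        self.engine = AsyncLLMEngine.from_engine_args(args)

    @app.post("/generate")
    async def generate(self, req: GenerateRequest) -> dict:
        params = SamplingParams(max_tokens=req.max_tokens, temperature=req.temperature)
        request_id = uuid.uuid4().hex
        # vLLM streams partial outputs; keep the last one as the final result.
        final = None
        async for output in self.engine.generate(req.prompt, params, request_id):
            final = output
        return {"text": final.outputs[0].text}


deployment = VLLMDeployment.bind()
```

With this module saved as serve_app.py, the `import_path: serve_app:deployment` entry in the config above resolves to the bound deployment.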
Deployment on AWS
- Cluster Setup: Use the Ray Cluster Launcher to provision EC2 instances. Define your cluster configuration in a `cluster.yaml` file (sketched below), specifying the instance types (e.g., `g5.xlarge` for workers).
- Deploy: Run `ray up cluster.yaml` to start the cluster.
- Serve: Submit your serve application using `serve run serve_config.yaml`.
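For reference, a trimmed-down `cluster.yaml` for the AWS provider could look roughly like the following; the region, instance types, and worker counts are placeholders:

```yaml
# cluster.yaml — a sketch for the Ray Cluster Launcher; region, instance
# types, and worker counts are placeholders.
cluster_name: llm-serving
max_workers: 4

provider:
  type: aws
  region: us-east-1

available_node_types:
  head:
    node_config:
      InstanceType: m5.xlarge
    resources: {}          # let Ray auto-detect CPUs
  gpu_worker:
    min_workers: 1
    max_workers: 4
    node_config:
      InstanceType: g5.xlarge
      # node_config takes raw EC2 parameters, so a GPU-ready AMI (ImageId)
      # and key pair would also go here.
    resources: {}          # let Ray auto-detect the GPU

head_node_type: head
```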
Monitoring and Optimization
Ray provides a built-in dashboard to monitor actor status, GPU usage, and request latency. For production, integrate this with Prometheus and Grafana to track:
- Queue Latency: Time requests spend waiting for a replica.
- GPU Utilization: Confirm that expensive hardware is neither sitting idle (over-provisioned) nor saturated (under-provisioned).
- Token Throughput: Measure tokens/second to benchmark performance.
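On the Prometheus side, a minimal static scrape job is enough to pull these metrics, assuming each node was started with `--metrics-export-port=8080`; the target addresses below are placeholders:

```yaml
# prometheus.yml (fragment) — assumes nodes were started with
# --metrics-export-port=8080; the target addresses are placeholders.
scrape_configs:
  - job_name: ray
    static_configs:
      - targets:
          - "<head-node-private-ip>:8080"
          - "<gpu-worker-private-ip>:8080"
```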
By leveraging Ray Serve with AWS, we create a flexible, scalable inference platform that avoids the vendor lock-in of managed services while providing full control over the serving infrastructure.