Deploying Large Language Models (LLMs) effectively requires a robust architecture that can handle high concurrency, manage GPU resources efficiently, and scale dynamically. In this post, I’ll walk through a production-ready setup for hosting open-source models (like Llama 3 or Mistral) on AWS using Ray Serve for orchestration and FastAPI as the interface.
The Architecture
The stack consists of:
- Infrastructure: AWS EC2 GPU instances (g5.xlarge or similar)
- Orchestration: Ray Cluster (Head node + Worker nodes)
- Serving: Ray Serve wrapping a FastAPI application
- Model Engine: vLLM for high-throughput inference
Why Ray Serve?
Ray Serve excels at “model composition” and scaling. Unlike simple Docker containers, Ray lets us:
- Scale independently: scale model replicas separately from the API-handling logic.
- Batch dynamically: use native dynamic request batching to maximize GPU utilization (sketched below).
- Compose pipelines: chain multiple models or pre/post-processing steps easily.
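For the batching point, Ray Serve ships a `@serve.batch` decorator that transparently groups concurrent single-item calls into one list-valued call. In this stack the heavy lifting is actually done by vLLM's own continuous batching, so the snippet below is just a minimal standalone sketch of the mechanism with a placeholder model:

```python
from ray import serve


@serve.deployment
class BatchedEcho:
    # Ray Serve collects individual calls made within the wait window and
    # passes them to this method as a single list.
    @serve.batch(max_batch_size=8, batch_wait_timeout_s=0.05)
    async def handle_batch(self, prompts: list[str]) -> list[str]:
        # Placeholder "model": one pass over the whole batch.
        return [p.upper() for p in prompts]

    async def __call__(self, prompt: str) -> str:
        # Callers still send single prompts; batching is transparent.
        return await self.handle_batch(prompt)


echo_app = BatchedEcho.bind()
```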
Configuration
Here is a simplified `serve_config.yaml` to get started. This configuration defines a deployment that autoscales based on request load.
```yaml
proxy_location: EveryNode

http_options:
  host: 0.0.0.0
  port: 8000

applications:
  - name: llm_app
    route_prefix: /
    import_path: app:deployment
    runtime_env:
      pip:
        - vllm
        - fastapi
    deployments:
      - name: VLLMDeployment
        autoscaling_config:
          min_replicas: 1
          max_replicas: 4
          target_num_ongoing_requests_per_replica: 10
        ray_actor_options:
          num_gpus: 1
```
The FastAPI Wrapper
We wrap the vLLM engine in a FastAPI app to expose standard REST endpoints. This allows easy integration with existing frontend applications or services.
```python
import uuid

from fastapi import FastAPI
from ray import serve
from vllm import AsyncLLMEngine, SamplingParams

app = FastAPI()


@serve.deployment(ray_actor_options={"num_gpus": 1})
@serve.ingress(app)
class VLLMDeployment:
    def __init__(self):
        # Initialize the vLLM engine (model name and other engine args omitted here)
        self.engine = AsyncLLMEngine.from_engine_args(...)

    @app.post("/generate")
    async def generate(self, prompt: str):
        sampling_params = SamplingParams(temperature=0.7, max_tokens=100)
        # vLLM needs a unique request id, and generate() is an async generator
        # that streams partial RequestOutputs until the request completes.
        request_id = str(uuid.uuid4())
        final_output = None
        async for output in self.engine.generate(prompt, sampling_params, request_id):
            final_output = output
        return {"text": final_output.outputs[0].text}


deployment = VLLMDeployment.bind()
```
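Before touching AWS, you can sanity-check the wrapper locally with `serve run app:deployment` (assuming the code above is saved as `app.py`, matching the `import_path` in the config) and call the endpoint. A minimal client sketch; note that `prompt` travels as a query parameter because the endpoint declares it as a plain `str` argument:

```python
import requests

# POST http://localhost:8000/generate?prompt=...
resp = requests.post(
    "http://localhost:8000/generate",
    params={"prompt": "Write a haiku about GPUs."},
)
resp.raise_for_status()
print(resp.json()["text"])
```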
Deployment on AWS
- Cluster Setup: Use the Ray Cluster Launcher to provision EC2 instances. Define your cluster configuration in a `cluster.yaml` file, specifying the instance types (e.g., `g5.xlarge` for workers); a minimal sketch follows this list.
- Deploy: Run `ray up cluster.yaml` to start the cluster.
- Serve: Submit your serve application with `serve run serve_config.yaml`.
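For reference, here is a minimal `cluster.yaml` sketch for the Cluster Launcher; the region, instance types, worker counts, and setup commands are placeholders to adapt to your account:

```yaml
cluster_name: llm-serving
max_workers: 4

provider:
  type: aws
  region: us-east-1

auth:
  ssh_user: ubuntu

available_node_types:
  head:
    node_config:
      InstanceType: m5.xlarge
    resources: {}
  gpu_worker:
    min_workers: 1
    max_workers: 4
    node_config:
      InstanceType: g5.xlarge
    resources: {}

head_node_type: head

setup_commands:
  - pip install "ray[serve]" vllm fastapi
```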
Monitoring and Optimization
Ray provides a built-in dashboard to monitor actor status, GPU usage, and request latency. For production, integrate this with Prometheus and Grafana to track:
- Queue Latency: Time requests spend waiting for a replica.
- GPU Utilization: Ensure you aren’t under-provisioning expensive hardware.
- Token Throughput: Measure tokens/second to benchmark performance (one way to export this is sketched below).
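One option for the token-throughput number is to export a custom counter from inside the deployment via `ray.util.metrics`, which surfaces through the same Prometheus endpoint as Ray's built-in metrics; the metric and tag names below are illustrative:

```python
from ray.util.metrics import Counter

# Create once per replica, e.g. in VLLMDeployment.__init__.
generated_tokens = Counter(
    "generated_tokens",
    description="Output tokens produced by this replica.",
    tag_keys=("model",),
)


def record_request(num_tokens: int, model_name: str) -> None:
    # Call after each completed generation; Prometheus can then compute the
    # tokens/second rate from this counter.
    generated_tokens.inc(num_tokens, tags={"model": model_name})
```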
By leveraging Ray Serve with AWS, we create a flexible, scalable inference platform that avoids the vendor lock-in of managed services while providing full control over the serving infrastructure.