Deploying Large Language Models (LLMs) effectively requires a robust architecture that can handle high concurrency, manage GPU resources efficiently, and scale dynamically. In this post, I’ll walk through a production-ready setup for hosting open-source models (like Llama 3 or Mistral) on AWS using Ray Serve for orchestration and FastAPI as the interface.
The Architecture
The stack consists of:
- Infrastructure: AWS EC2 GPU instances (g5.xlarge or similar)
- Orchestration: Ray Cluster (Head node + Worker nodes)
- Serving: Ray Serve wrapping a FastAPI application
- Model Engine: vLLM for high-throughput inference
Why Ray Serve?
Ray Serve excels at “model composition” and scaling. Unlike simple Docker containers, Ray lets us:
- Scale independently: scale model replicas separately from the API-handling logic.
- Batch dynamically: use native dynamic request batching to maximize GPU utilization (sketched below).
- Compose pipelines: chain multiple models or pre/post-processing steps easily.
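For the batching point, Ray Serve ships a `@serve.batch` decorator that transparently groups concurrent single-item calls into one list-valued call. In this stack the heavy lifting is actually done by vLLM's own continuous batching, so the snippet below is just a minimal standalone sketch of the mechanism with a placeholder model:

```python
from ray import serve


@serve.deployment
class BatchedEcho:
    # Ray Serve collects individual calls made within the wait window and
    # passes them to this method as a single list.
    @serve.batch(max_batch_size=8, batch_wait_timeout_s=0.05)
    async def handle_batch(self, prompts: list[str]) -> list[str]:
        # Placeholder "model": one pass over the whole batch.
        return [p.upper() for p in prompts]

    async def __call__(self, prompt: str) -> str:
        # Callers still send single prompts; batching is transparent.
        return await self.handle_batch(prompt)


echo_app = BatchedEcho.bind()
```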
Configuration
Here is a simplified `serve_config.yaml` to get started. This configuration defines a deployment that autoscales based on request load.
```yaml
proxy_location: EveryNode

http_options:
  host: 0.0.0.0
  port: 8000

applications:
  - name: llm_app
    route_prefix: /
    import_path: app:deployment
    runtime_env:
      pip:
        - vllm
        - fastapi
    deployments:
      - name: VLLMDeployment
        autoscaling_config:
          min_replicas: 1
          max_replicas: 4
          target_num_ongoing_requests_per_replica: 10
        ray_actor_options:
          num_gpus: 1
```
The FastAPI Wrapper
We wrap the vLLM engine in a FastAPI app to expose standard REST endpoints. This allows easy integration with existing frontend applications or services.
```python
import uuid

from fastapi import FastAPI
from ray import serve
from vllm import AsyncLLMEngine, SamplingParams

app = FastAPI()


@serve.deployment(ray_actor_options={"num_gpus": 1})
@serve.ingress(app)
class VLLMDeployment:
    def __init__(self):
        # Initialize the vLLM engine (model name and other engine args omitted here)
        self.engine = AsyncLLMEngine.from_engine_args(...)

    @app.post("/generate")
    async def generate(self, prompt: str):
        sampling_params = SamplingParams(temperature=0.7, max_tokens=100)
        # vLLM needs a unique request id, and generate() is an async generator
        # that streams partial RequestOutputs until the request completes.
        request_id = str(uuid.uuid4())
        final_output = None
        async for output in self.engine.generate(prompt, sampling_params, request_id):
            final_output = output
        return {"text": final_output.outputs[0].text}


deployment = VLLMDeployment.bind()
```
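Before touching AWS, you can sanity-check the wrapper locally with `serve run app:deployment` (assuming the code above is saved as `app.py`, matching the `import_path` in the config) and call the endpoint. A minimal client sketch; note that `prompt` travels as a query parameter because the endpoint declares it as a plain `str` argument:

```python
import requests

# POST http://localhost:8000/generate?prompt=...
resp = requests.post(
    "http://localhost:8000/generate",
    params={"prompt": "Write a haiku about GPUs."},
)
resp.raise_for_status()
print(resp.json()["text"])
```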
Deployment on AWS
- Cluster Setup: Use the Ray Cluster Launcher to provision EC2 instances. Define your cluster configuration in a `cluster.yaml` file, specifying the instance types (e.g., `g5.xlarge` for workers); a minimal sketch follows this list.
- Deploy: Run `ray up cluster.yaml` to start the cluster.
- Serve: Submit your serve application with `serve run serve_config.yaml`.
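For reference, here is a minimal `cluster.yaml` sketch for the Cluster Launcher; the region, instance types, worker counts, and setup commands are placeholders to adapt to your account:

```yaml
cluster_name: llm-serving
max_workers: 4

provider:
  type: aws
  region: us-east-1

auth:
  ssh_user: ubuntu

available_node_types:
  head:
    node_config:
      InstanceType: m5.xlarge
    resources: {}
  gpu_worker:
    min_workers: 1
    max_workers: 4
    node_config:
      InstanceType: g5.xlarge
    resources: {}

head_node_type: head

setup_commands:
  - pip install "ray[serve]" vllm fastapi
```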
Monitoring and Optimization
Ray provides a built-in dashboard to monitor actor status, GPU usage, and request latency. For production, integrate this with Prometheus and Grafana to track:
- Queue Latency: Time requests spend waiting for a replica.
- GPU Utilization: Ensure you aren’t under-provisioning expensive hardware.
- Token Throughput: Measure tokens/second to benchmark performance (one way to export this is sketched below).
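One option for the token-throughput number is to export a custom counter from inside the deployment via `ray.util.metrics`, which surfaces through the same Prometheus endpoint as Ray's built-in metrics; the metric and tag names below are illustrative:

```python
from ray.util.metrics import Counter

# Create once per replica, e.g. in VLLMDeployment.__init__.
generated_tokens = Counter(
    "generated_tokens",
    description="Output tokens produced by this replica.",
    tag_keys=("model",),
)


def record_request(num_tokens: int, model_name: str) -> None:
    # Call after each completed generation; Prometheus can then compute the
    # tokens/second rate from this counter.
    generated_tokens.inc(num_tokens, tags={"model": model_name})
```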
By leveraging Ray Serve with AWS, we create a flexible, scalable inference platform that avoids the vendor lock-in of managed services while providing full control over the serving infrastructure.