Scalable LLM Deployment on AWS with Ray Serve and FastAPI

Deploying Large Language Models (LLMs) effectively requires a robust architecture that can handle high concurrency, manage GPU resources efficiently, and scale dynamically. In this post, I’ll walk through a production-ready setup for hosting open-source models (like Llama 3 or Mistral) on AWS using Ray Serve for orchestration and FastAPI as the interface.

The Architecture

The stack consists of:

  • Infrastructure: AWS EC2 instances (g5.xlarge or similar GPU-optimized instances)
  • Orchestration: Ray Cluster (Head node + Worker nodes)
  • Serving: Ray Serve wrapping a FastAPI application
  • Model Engine: vLLM for high-throughput inference

Why Ray Serve?

Ray Serve excels at “model composition” and scaling. Unlike a plain Docker-container deployment, Ray lets us:

  1. Scale independently: Scale model replicas separately from the API-handling logic.
  2. Batch dynamically: Use native dynamic request batching to maximize GPU utilization (see the sketch after this list).
  3. Compose pipelines: Chain multiple models or pre/post-processing steps.
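
To make the batching point concrete, here is a minimal sketch of Ray Serve's @serve.batch decorator. The deployment and method names are illustrative, and note that vLLM already performs continuous batching internally, so this pattern is most useful for custom models or pre/post-processing steps.

from typing import List

from ray import serve

@serve.deployment
class BatchedStep:
    # Ray Serve groups concurrent calls into one batch, bounded by
    # max_batch_size and the extra latency allowed by batch_wait_timeout_s.
    @serve.batch(max_batch_size=8, batch_wait_timeout_s=0.05)
    async def handle_batch(self, inputs: List[str]) -> List[str]:
        # A real handler would run a single forward pass over the whole batch.
        return [text.upper() for text in inputs]

    async def __call__(self, text: str) -> str:
        # Each individual request awaits its slot in the shared batch.
        return await self.handle_batch(text)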

Configuration

Here is a simplified serve_config.yaml to get started. This configuration defines a deployment that autoscales based on request load.

proxy_location: EveryNode

http_options:
  host: 0.0.0.0
  port: 8000

applications:
  - name: llm_app
    route_prefix: /
    import_path: app:deployment
    runtime_env:
      pip:
        - vllm
        - fastapi
    deployments:
      - name: VLLMDeployment
        autoscaling_config:
          min_replicas: 1
          max_replicas: 4
          target_num_ongoing_requests_per_replica: 10
        ray_actor_options:
          num_gpus: 1
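
The same options can also be set directly in Python via .options() on the deployment class (defined in the next section). Here min_replicas and max_replicas bound the fleet size, and Serve adds a replica when the average number of in-flight requests per replica exceeds the target. A minimal sketch mirroring the YAML above:

deployment = VLLMDeployment.options(
    # Mirrors the autoscaling_config block in serve_config.yaml
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 4,
        "target_num_ongoing_requests_per_replica": 10,
    },
    ray_actor_options={"num_gpus": 1},  # each replica reserves one full GPU
).bind()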

The FastAPI Wrapper

We wrap the vLLM engine in a FastAPI app to expose standard REST endpoints. This allows easy integration with existing frontend applications or services.

from uuid import uuid4

from fastapi import FastAPI
from ray import serve
from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams

app = FastAPI()

@serve.deployment(ray_actor_options={"num_gpus": 1})
@serve.ingress(app)
class VLLMDeployment:
    def __init__(self):
        # Initialize the vLLM engine. The model name is an example; swap in the
        # model you are serving (e.g., a Llama 3 or Mistral checkpoint).
        engine_args = AsyncEngineArgs(model="mistralai/Mistral-7B-Instruct-v0.2")
        self.engine = AsyncLLMEngine.from_engine_args(engine_args)

    @app.post("/generate")
    async def generate(self, prompt: str):
        sampling_params = SamplingParams(temperature=0.7, max_tokens=100)
        # engine.generate() is an async generator; iterate until the final output.
        request_id = str(uuid4())
        final_output = None
        async for output in self.engine.generate(prompt, sampling_params, request_id):
            final_output = output
        return {"text": final_output.outputs[0].text}

deployment = VLLMDeployment.bind()
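
Once the application is up, it behaves like any other REST service. A minimal client sketch, assuming the Serve proxy is reachable on localhost:8000; because prompt is a plain str parameter, FastAPI reads it from the query string.

import requests

# The Serve HTTP proxy listens on the host/port set in http_options (8000 here).
response = requests.post(
    "http://localhost:8000/generate",
    params={"prompt": "Explain Ray Serve in one sentence."},
    timeout=120,
)
print(response.json()["text"])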

Deployment on AWS

  1. Cluster Setup: Use the Ray Cluster Launcher to provision EC2 instances. Define your cluster configuration in a cluster.yaml file, specifying the instance types (e.g., g5.xlarge for workers).
  2. Launch: Run ray up cluster.yaml to start the cluster.
  3. Serve: Deploy the application with serve run serve_config.yaml (a programmatic alternative is sketched below).
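
As an alternative to the serve run CLI, the application can be deployed programmatically from a node attached to the cluster. A minimal sketch, assuming the bound deployment lives in app.py as referenced by the config above:

import ray
from ray import serve

from app import deployment  # the bound VLLMDeployment from the previous section

# Attach to the running Ray cluster (run on the head node, or connect remotely
# via ray.init(address="ray://<head-node-ip>:10001")).
ray.init(address="auto")
serve.run(deployment, name="llm_app", route_prefix="/")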

Monitoring and Optimization

Ray provides a built-in dashboard to monitor actor status, GPU usage, and request latency. For production, integrate this with Prometheus and Grafana to track:

  • Queue Latency: Time requests spend waiting for a replica.
  • GPU Utilization: Ensure you aren’t under-provisioning expensive hardware.
  • Token Throughput: Measure tokens/second to benchmark performance.
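
Before the dashboards are wired up, a quick client-side smoke test can give a rough feel for end-to-end latency under concurrency. This is a hypothetical sketch (the URL and request count are placeholders), not a substitute for proper metrics.

import statistics
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:8000/generate"  # replace with your endpoint

def timed_request(i: int) -> float:
    start = time.perf_counter()
    requests.post(URL, params={"prompt": f"Benchmark prompt {i}"}, timeout=120)
    return time.perf_counter() - start

# Fire 32 concurrent requests and report end-to-end latency percentiles.
with ThreadPoolExecutor(max_workers=32) as pool:
    latencies = sorted(pool.map(timed_request, range(32)))

print(f"p50: {statistics.median(latencies):.2f}s  "
      f"p95: {statistics.quantiles(latencies, n=20)[18]:.2f}s")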

By leveraging Ray Serve with AWS, we create a flexible, scalable inference platform that avoids the vendor lock-in of managed services while providing full control over the serving infrastructure.