Deploying DeepSeek-R1 on vLLM: A Scalable and Efficient Kubernetes Solution

Large language models (LLMs) require significant compute power and optimized infrastructure to serve real-time inference at scale. Deploying these models in a production environment involves managing GPU resources efficiently, handling incoming requests with minimal latency, and ensuring scalability based on workload demand.

In this post, I’ll walk through how I deployed DeepSeek-R1 on vLLM within a Kubernetes cluster, leveraging GPU acceleration and autoscaling to deliver high-throughput inference. The reference GitHub repository can be found here.

Why vLLM Over Other Serving Solutions?

Traditional model-serving solutions, such as Hugging Face Transformers’ pipeline(), Triton Inference Server, and TorchServe, often struggle to keep GPUs saturated when serving large-scale LLMs. These tools work well for smaller inference workloads, but out of the box they typically lack LLM-specific optimizations, such as continuous batching and paged KV-cache management, needed to serve a model like DeepSeek-R1 efficiently.

vLLM is designed specifically for high-performance LLM inference and offers key advantages:

  • Continuous Batching: Unlike traditional batch processing, which waits for a full batch before running inference, vLLM schedules incoming requests into the running batch in real time. This enables higher throughput and lower latency, especially under unpredictable workloads.
  • Paged Attention: The KV cache of a large model fragments VRAM quickly, wasting memory and capping batch size. vLLM’s PagedAttention stores the cache in fixed-size blocks, allowing inference to scale efficiently even with long input sequences.
  • CUDA/HIP Graph Execution: Instead of paying kernel-launch overhead on every decoding step, vLLM captures hot execution paths as CUDA/HIP graphs and replays them, improving inference speed on GPUs.
  • Flexible Deployment & Multi-GPU Scaling: Unlike inference stacks that often require rigid configurations, vLLM can shard a model across multiple GPUs (tensor parallelism) with minimal changes.
  • Model Quantization & Tuning: vLLM supports reduced-precision serving, including 4-bit and 8-bit quantized checkpoints, which can dramatically reduce VRAM consumption while maintaining accuracy.

For large-scale deployments, vLLM consistently outperforms standard inference engines in GPU efficiency, throughput, and scalability—making it the best choice for serving DeepSeek-R1 in a Kubernetes-based environment.
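To give a feel for the API behind these optimizations, here is a minimal offline-inference sketch. The distilled checkpoint name and the two-GPU tensor-parallel setting are illustrative choices, not the exact configuration used in this deployment:

```python
from vllm import LLM, SamplingParams

# Continuous batching, paged attention, and CUDA-graph capture are handled by the
# engine itself; the caller only picks a model, precision, and parallelism degree.
llm = LLM(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",  # illustrative checkpoint
    dtype="float16",
    tensor_parallel_size=2,        # shard the model across two GPUs
    gpu_memory_utilization=0.90,   # fraction of VRAM vLLM may claim for weights + KV cache
)

params = SamplingParams(temperature=0.6, max_tokens=512)
outputs = llm.generate(["Explain paged attention in one paragraph."], params)
print(outputs[0].outputs[0].text)
```

The same engine sits behind vLLM’s OpenAI-compatible HTTP server, which is what the Kubernetes deployment described below actually runs.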

Extended Capabilities: Quantization and Precision Tuning

Deploying a large LLM requires balancing accuracy, memory usage, and inference speed. Depending on the hardware and workload, you can fine-tune DeepSeek-R1’s serving settings in vLLM:

  • Full Precision (FP32) – Highest numerical fidelity, but requires roughly 2x the VRAM of FP16.
  • Half Precision (FP16) – Halves memory usage relative to FP32 while maintaining high accuracy. A good default on NVIDIA A100, H100, and A10 GPUs.
  • bfloat16 (BF16) – Same memory footprint as FP16 with better numerical stability; generally preferred on A100/H100-class GPUs.
  • 8-bit Quantization – Reduces memory usage by roughly 4x relative to FP32 while maintaining reasonable model quality.
  • 4-bit Quantization (e.g., AWQ or GPTQ checkpoints loaded via vLLM’s --quantization option) – Cuts VRAM usage by roughly 8x relative to FP32, allowing the model to run on smaller GPUs at the cost of minor accuracy degradation.

Using FP16 or 8-bit quantization is often the best tradeoff for production deployments, ensuring fast inference while keeping VRAM requirements manageable.
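As a rough sketch of how these settings map onto vLLM’s Python API (the checkpoint names are illustrative placeholders; the same choices are exposed on the server CLI through --dtype and --quantization):

```python
from vllm import LLM

# Pick ONE configuration depending on the GPU and memory budget.

# Half precision -- a common production default on A100/H100-class GPUs:
llm = LLM(model="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B", dtype="float16")

# bfloat16 -- same footprint as FP16 with a wider dynamic range:
# llm = LLM(model="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B", dtype="bfloat16")

# A pre-quantized 4-bit checkpoint loaded through the quantization option
# (the AWQ repo name here is a hypothetical example):
# llm = LLM(model="example-org/DeepSeek-R1-Distill-Qwen-32B-AWQ", quantization="awq")
```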

Choosing the Right GPU Instance Based on VRAM Needs

One of the most important decisions when deploying DeepSeek-R1 is selecting an instance with enough VRAM to handle the model efficiently. The required VRAM depends on:

  1. Model size (number of parameters)
  2. Precision used (FP32, FP16, 8-bit, 4-bit)
  3. Batch size (number of concurrent inference requests)

VRAM Estimates for DeepSeek-R1

[Image: VRAM estimates for DeepSeek-R1]

If your workload requires handling multiple concurrent users, you’ll need to increase VRAM accordingly. For instance:

  • Doubling the batch size roughly doubles the activation and KV-cache memory; the model weights themselves are loaded only once.
  • Longer context lengths grow the KV cache linearly, further increasing VRAM consumption.

For large-scale production inference, an NVIDIA H100 (80GB) running the model in FP16 offers a strong balance of throughput and memory headroom, and the deployment can scale across multiple GPUs if needed.
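To make these sizing factors concrete, here is a rough back-of-the-envelope estimator. The layer count and hidden size in the example call are placeholder values for an illustrative 32B-parameter model, and the KV-cache term assumes standard multi-head attention (ignoring grouped-query or MLA savings), so treat the output as an upper-bound sketch rather than a sizing guarantee:

```python
def estimate_vram_gb(
    n_params_b: float,       # model size in billions of parameters
    bytes_per_param: float,  # 4 = FP32, 2 = FP16/BF16, 1 = 8-bit, 0.5 = 4-bit
    n_layers: int,
    hidden_size: int,
    batch_size: int,
    context_len: int,
    kv_bytes: float = 2.0,   # bytes per KV-cache element (FP16)
) -> float:
    """Rough VRAM estimate in GB: weights + KV cache, plus ~10% runtime overhead."""
    weights = n_params_b * 1e9 * bytes_per_param
    # KV cache: 2 tensors (K and V) per layer, per token, each of width hidden_size.
    kv_cache = 2 * n_layers * hidden_size * kv_bytes * batch_size * context_len
    return (weights + kv_cache) * 1.10 / 1e9


# Illustrative 32B model in FP16, batch of 8 concurrent requests, 4K-token context.
gb = estimate_vram_gb(32, 2, n_layers=64, hidden_size=5120,
                      batch_size=8, context_len=4096)
print(f"~{gb:.0f} GB")
```

Even this crude arithmetic shows why batch size and context length, not just parameter count, drive the choice of GPU instance.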

Architecture and Deployment

The solution runs on a Kubernetes cluster with NVIDIA H100 GPUs. The DeepSeek-R1 model is served using vLLM within a dedicated namespace, ensuring resource isolation. A Persistent Volume Claim (PVC) is used to cache model weights, preventing unnecessary downloads each time a pod starts.

Autoscaling is enabled using Kubernetes’ Horizontal Pod Autoscaler (HPA), which monitors CPU and memory usage to dynamically adjust the number of running pods. This ensures that resources scale according to demand while keeping costs under control.
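For reference, here is a minimal sketch of what such an autoscaling policy could look like when created through the official kubernetes Python client. The deployment name vllm-deepseek-r1, the llm-serving namespace, and the 70% CPU target are placeholder values, not the exact configuration from the repository:

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster

# HPA targeting the (hypothetical) vLLM Deployment, scaling on average CPU utilization.
hpa = client.V2HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="vllm-deepseek-r1", namespace="llm-serving"),
    spec=client.V2HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V2CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="vllm-deepseek-r1"
        ),
        min_replicas=1,
        max_replicas=4,
        metrics=[
            client.V2MetricSpec(
                type="Resource",
                resource=client.V2ResourceMetricSource(
                    name="cpu",
                    target=client.V2MetricTarget(type="Utilization", average_utilization=70),
                ),
            )
        ],
    ),
)

client.AutoscalingV2Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="llm-serving", body=hpa
)
```

In practice the same object usually lives as a YAML manifest in the repository; the Python form is shown here only to keep the examples in one language.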

To expose the model to external clients, an Ingress controller routes requests efficiently while supporting TLS termination for secure API communication. Internal service discovery allows applications within the cluster to interact with the model seamlessly.
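Once the Ingress is in place, external clients can talk to vLLM’s OpenAI-compatible HTTP API. A minimal sketch, assuming a hypothetical hostname llm.example.com and whatever model name the server was started with:

```python
import requests

# Placeholder Ingress host; vLLM's OpenAI-compatible server exposes
# /v1/chat/completions, so any OpenAI-style client also works here.
API_URL = "https://llm.example.com/v1/chat/completions"

resp = requests.post(
    API_URL,
    json={
        "model": "deepseek-ai/DeepSeek-R1",
        "messages": [{"role": "user", "content": "Summarize paged attention in two sentences."}],
        "max_tokens": 256,
        "temperature": 0.6,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```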

Challenges and Optimizations

One of the main challenges in deploying large models is efficient memory management. vLLM’s paged attention mechanism reduces memory fragmentation, allowing for larger batch sizes and improved token throughput. Additionally, tuning max_num_batched_tokens helped balance inference speed and memory consumption, ensuring optimal performance across various request loads.

Another consideration was cold start times when scaling up. By leveraging PVC-backed model caching, new pods can start serving requests quickly without needing to reload the model from scratch. This significantly reduces downtime when scaling in response to increased traffic.
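A hedged sketch of how these knobs come together in the engine configuration; the numeric values and the /models PVC mount path are illustrative, not the tuned settings from this deployment:

```python
from vllm import LLM

llm = LLM(
    model="deepseek-ai/DeepSeek-R1",     # served model (illustrative)
    dtype="float16",
    tensor_parallel_size=8,              # spread the model across the node's GPUs
    max_num_batched_tokens=8192,         # cap on tokens processed per scheduler step
    gpu_memory_utilization=0.90,         # leave headroom for KV cache and CUDA graphs
    download_dir="/models",              # PVC mount path so weights persist across pod restarts
)
```

Pointing download_dir at the PVC mount is what lets newly scheduled pods skip the multi-hundred-gigabyte download and start serving quickly.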

Final Thoughts

This deployment successfully delivers a scalable and efficient DeepSeek-R1 serving solution on Kubernetes with vLLM. By optimizing GPU utilization, enabling autoscaling, and leveraging Kubernetes-native components, the model serves real-time inference workloads with minimal latency and high throughput.

For teams working on production LLM deployments, combining vLLM with Kubernetes provides a robust foundation for serving large models efficiently while maintaining flexibility in resource allocation and scaling.

If you’re looking to deploy large models like DeepSeek-R1, choosing the right precision, GPU instance, and inference engine is crucial for achieving a balance between speed, memory efficiency, and scalability. With vLLM’s advanced optimizations, production-scale model serving has never been easier.

Muntazir Fadhel


Muntazir Fadhel is an MLOps / Cloud Infrastructure Engineer specializing in designing scalable, secure, and automated ML pipelines. Proficient in CI/CD for model lifecycle management, big data processing, Kubernetes orchestration, and zero-downtime deployments. Adept at bridging data science and engineering to deliver reliable, production-grade ML systems that streamline workflows and ensure reproducibility across environments.
