<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>GPU on Nitin Gupta</title>
    <link>https://nitingupta.dev/tags/gpu/</link>
    <description>Recent content in GPU on Nitin Gupta</description>
    <generator>Hugo</generator>
    <language>en</language>
    <managingEditor>ngupta@nitingupta.dev (Nitin Gupta)</managingEditor>
    <webMaster>ngupta@nitingupta.dev (Nitin Gupta)</webMaster>
    <lastBuildDate>Mon, 08 Dec 2025 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://nitingupta.dev/tags/gpu/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Tensor Deduplication for Multi-Model Inference</title>
      <link>https://nitingupta.dev/post/tensor-dedup/</link>
      <pubDate>Mon, 08 Dec 2025 00:00:00 +0000</pubDate><author>ngupta@nitingupta.dev (Nitin Gupta)</author>
      <guid>https://nitingupta.dev/post/tensor-dedup/</guid>
      <description>&lt;h2 id=&#34;summary&#34;&gt;Summary&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Multi-model workloads are the norm: A/B tests, customer fine-tunes, safety variants, multi-stage pipelines. GPU memory scales linearly with model count, and VRAM is the limiting resource.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Tensor deduplication automatically identifies and shares bit-identical weight tensors across models, requiring no checkpoint modifications.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Results&lt;/strong&gt;: Across diffusion and LLM workloads, real-world savings range from &lt;strong&gt;3–32%&lt;/strong&gt;. DeepFloyd IF stages share 18.87 GB (32% reduction). Synthetic upper bound is 50%.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Overhead&lt;/strong&gt;: Hashing adds &amp;lt;1% to model load time. Zero runtime overhead since the forward pass is unchanged.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Compatibility&lt;/strong&gt;: Works with HuggingFace safetensors, GGUF, and Diffusers pipelines. No changes to training or checkpoints required.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;multi-model-memory-bloat&#34;&gt;Multi-Model Memory Bloat&lt;/h2&gt;&#xA;&lt;p&gt;Modern inference deployments rarely serve a single model. Production systems routinely load:&lt;/p&gt;</description>
    </item>
    <item>
      <title>Shared Backbones: Loading Weights Once, Serving Many Models</title>
      <link>https://nitingupta.dev/post/shared-backbones/</link>
      <pubDate>Sat, 29 Nov 2025 00:00:00 +0000</pubDate><author>ngupta@nitingupta.dev (Nitin Gupta)</author>
      <guid>https://nitingupta.dev/post/shared-backbones/</guid>
      <description>&lt;p&gt;I keep running into the same pattern when trying to self-host models (which is a lot of fun): we run several big models side by side, all of them valuable, all of them slightly different, and all of them wasting VRAM by reloading nearly the same weights.&lt;/p&gt;&#xA;&lt;p&gt;This post is my attempt to explore a specific idea:&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Can we load a shared backbone of weights once on a GPU, then load only the small, unique pieces per model that reuse that backbone?&lt;/strong&gt;&lt;/p&gt;</description>
    </item>
  </channel>
</rss>
