Tensor Deduplication for Multi-Model Inference

Summary

  • Problem: Multi-model workloads are the norm: A/B tests, customer fine-tunes, safety variants, multi-stage pipelines. GPU memory scales linearly with model count, and VRAM is the limiting resource.
  • Solution: Tensor deduplication automatically identifies and shares bit-identical weight tensors across models, requiring no checkpoint modifications (see the sketch after this list).
  • Results: Across diffusion and LLM workloads, real-world savings range from 3–32%. DeepFloyd IF stages share 18.87 GB (32% reduction). Synthetic upper bound is 50%.
  • Overhead: Hashing adds <1% to model load time. Zero runtime overhead since the forward pass is unchanged.
  • Compatibility: Works with HuggingFace safetensors, GGUF, and Diffusers pipelines. No changes to training or checkpoints required.
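
The core mechanism is simple enough to sketch. Here is a minimal content-addressed pool in C++ (TensorPool, intern, and the non-cryptographic hash are my inventions for illustration; a real loader would hash the raw safetensors/GGUF buffers with something like SHA-256):

#include <cstdint>
#include <functional>
#include <memory>
#include <string_view>
#include <unordered_map>
#include <vector>

using Tensor = std::vector<std::uint8_t>;  // a weight tensor as raw bytes

class TensorPool {
    // hash -> tensors with that hash; bytes are compared to rule out collisions
    std::unordered_map<std::size_t,
                       std::vector<std::shared_ptr<const Tensor>>> pool_;
public:
    // Returns a shared handle; bit-identical tensors collapse to one allocation.
    std::shared_ptr<const Tensor> intern(Tensor t) {
        std::size_t h = std::hash<std::string_view>{}(std::string_view(
            reinterpret_cast<const char*>(t.data()), t.size()));
        for (const auto& existing : pool_[h])
            if (*existing == t) return existing;   // already loaded: share it
        auto owned = std::make_shared<const Tensor>(std::move(t));
        pool_[h].push_back(owned);
        return owned;
    }
};

Every model's loader routes each weight buffer through intern, so a tensor that appears in several checkpoints is allocated once and shared until the last model holding it is unloaded.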

Multi-Model Memory Bloat

Modern inference deployments rarely serve a single model. Production systems routinely load:

[Read More]

Shared Backbones: Loading Weights Once, Serving Many Models

I keep running into the same pattern when trying to self-host models (which is a lot of fun): we run several big models side by side, all of them valuable, all of them slightly different, and all of them wasting VRAM by reloading nearly the same weights.

This post is my attempt to explore a specific idea:

Can we load a shared backbone of weights once on a GPU, then load only the small, unique pieces per model that reuse that backbone?
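
As a first concrete shape for that idea, a per-model view could be a small override map layered on a shared backbone (all names here are mine, just to fix the mental model):

#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

using Tensor  = std::vector<float>;
using Weights = std::unordered_map<std::string, Tensor>;

struct ModelView {
    std::shared_ptr<const Weights> backbone;  // loaded once, shared by all models
    Weights overrides;                        // only this model's unique tensors

    // Per-model tensors shadow the backbone; everything else is shared.
    const Tensor& lookup(const std::string& name) const {
        auto it = overrides.find(name);
        return it != overrides.end() ? it->second : backbone->at(name);
    }
};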

[Read More]

Subtle Errors in C++ Programs

I recently stumbled upon a subtle bug in some benchmark code, which reminded me yet again to avoid C++ if I can.

Here’s a buggy snippet from this code (simplified):

// BUGGY
#include <fcntl.h>
#include <sstream>
using namespace std;

ostringstream os;
int i = 1;
os << "foo-" << i << ".dat";
const char *filename = os.str().c_str();
int fd = open(filename, O_RDONLY);

You may expect the above code to try to open a file named foo-1.dat, but that's not what happens here.

In this snippet, os.str() creates a temporary string object, which is destroyed at the end of the statement, right after the call to c_str(). So filename ends up pointing to freed memory, which can of course contain arbitrary content (until you happen to hit a NUL byte).
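
One common fix (a minimal sketch; other variants work too) is to bind the result of os.str() to a named std::string so the buffer outlives any pointer taken from it:

// FIXED
#include <fcntl.h>
#include <sstream>
using namespace std;

ostringstream os;
int i = 1;
os << "foo-" << i << ".dat";
string filename = os.str();                // named copy keeps the buffer alive
int fd = open(filename.c_str(), O_RDONLY); // pointer only used within this statement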

[Read More]

Compressed RAM disk for Windows, The Virtual Way!

Recently, I developed a Linux kernel driver that creates generic RAM-based compressed block devices (called zram). Being RAM disks, they do not provide persistent storage, but there are many use cases where persistence is not required: /tmp, various caches under /var, swap disks, etc. These cases can benefit greatly from high-speed RAM disks, along with the savings that compression brings!

However, all this may seem completely Linux-centric. But with virtualization, zram can be used for Windows too! The trick is to expose zram as a 'raw disk' to Windows running inside a Virtual Machine (VM). I will be using VirtualBox as the example, but exposing raw disks should be supported by other virtualization solutions like VMware and KVM too.
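
To give a flavor of the host-side setup, here is roughly what it looks like with VirtualBox (device name and size are examples; older zram builds exposed different knobs):

# Create the zram device on the Linux host
sudo modprobe zram
echo 1G | sudo tee /sys/block/zram0/disksize

# Wrap the raw device in a VMDK descriptor that VirtualBox can attach
VBoxManage internalcommands createrawvmdk \
    -filename zram.vmdk -rawdisk /dev/zram0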

[Read More]

Difference Engine - Harnessing Memory Redundancy in Virtual Machines

Here are links to the paper: (PDF) (MP3)

Recently I came across this paper, published at OSDI '08. It's an extension of VMware's page sharing and shows some amazing, hard-to-believe results. VMware's page-sharing mechanism scans the memory of all VMs and maps pages with the same contents onto a single physical page. This saves memory when multiple hosted VMs run the same OS. The technique discussed in this paper goes further and finds pages that are nearly the same: for such pages, it keeps one base page and stores the similar pages as deltas against it. Pages that are not similar to any other page are simply compressed. Their benchmarks show up to 45% more memory savings than ESX page sharing under some (specially crafted) workloads.
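
To make that concrete, here is a toy sketch of the classification step (the 4 KB page size, XOR delta, and byte-count similarity test are my simplifications; the paper uses hash-based similarity detection and a real delta encoder):

// Identical pages are shared; nearly identical pages are stored as a delta
// against a reference page; everything else is simply compressed.
#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kPageSize = 4096;
using Page = std::array<std::uint8_t, kPageSize>;

// A page is "similar" to the reference if only a few bytes differ.
bool similar(const Page& ref, const Page& page, std::size_t max_diff = 256) {
    std::size_t diff = 0;
    for (std::size_t i = 0; i < kPageSize; ++i) diff += (ref[i] != page[i]);
    return diff <= max_diff;
}

// A trivial delta: XOR against the reference. For similar pages the result is
// mostly zeros, so it compresses extremely well; that is where the savings
// beyond plain page-sharing come from.
Page xor_delta(const Page& ref, const Page& page) {
    Page d{};
    for (std::size_t i = 0; i < kPageSize; ++i) d[i] = ref[i] ^ page[i];
    return d;
}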

[Read More]