Tensor Deduplication for Multi-Model Inference

Summary

  • Problem: Multi-model workloads are the norm: A/B tests, customer fine-tunes, safety variants, multi-stage pipelines. GPU memory scales linearly with model count, and VRAM is the limiting resource.
  • Solution: Tensor deduplication automatically identifies and shares bit-identical weight tensors across models, requiring no checkpoint modifications.
  • Results: Across diffusion and LLM workloads, real-world savings range from 3–32%. DeepFloyd IF stages share 18.87 GB (32% reduction). Synthetic upper bound is 50%.
  • Overhead: Hashing adds <1% to model load time. Zero runtime overhead since the forward pass is unchanged.
  • Compatibility: Works with HuggingFace safetensors, GGUF, and Diffusers pipelines. No changes to training or checkpoints required.
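The core mechanism in the summary above is simple: hash each tensor's raw bytes and keep only one copy per unique hash. A minimal sketch of that idea (hypothetical helper names; a real implementation would hash safetensors buffers at load time):

```python
import hashlib

def dedup_tensors(models):
    """Share bit-identical tensors across models by content hash.

    `models` maps model name -> {tensor name -> raw tensor bytes}.
    Returns per-model views that alias one shared copy per unique
    hash, plus the number of bytes saved.  (Illustrative sketch,
    not the actual implementation described in the post.)
    """
    pool = {}    # content hash -> single shared buffer
    shared = {}  # model name -> {tensor name -> shared buffer}
    saved = 0
    for model, tensors in models.items():
        shared[model] = {}
        for name, data in tensors.items():
            h = hashlib.sha256(data).hexdigest()
            if h in pool:
                saved += len(data)  # duplicate: reuse the existing copy
            else:
                pool[h] = data
            shared[model][name] = pool[h]
    return shared, saved

# Two "models" that share one 8-byte backbone tensor:
a = {"backbone": b"\x00" * 8, "head": b"\x01" * 4}
b = {"backbone": b"\x00" * 8, "head": b"\x02" * 4}
views, saved = dedup_tensors({"A": a, "B": b})
assert views["A"]["backbone"] is views["B"]["backbone"]  # one copy in memory
print(saved)  # 8 bytes saved by sharing the backbone
```

Because the shared buffers are read-only at inference time, the forward pass never notices the aliasing, which is why the runtime overhead is zero.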

Multi-Model Memory Bloat

Modern inference deployments rarely serve a single model. Production systems routinely load:

[Read More]

Shared Backbones: Loading Weights Once, Serving Many Models

I keep running into the same pattern when trying to self-host models (which is a lot of fun): we run several big models side by side, all of them valuable, all of them slightly different, and all of them wasting VRAM by reloading nearly the same weights.

This post is my attempt to explore a specific idea:

Can we load a shared backbone of weights once on a GPU, then load only the small, unique pieces per model that reuse that backbone?
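One way to picture the question, as a hedged sketch (all names hypothetical): the shared backbone is loaded once as a read-only mapping, and each model only materializes the tensors that differ.

```python
from types import MappingProxyType

def compose_model(backbone, overrides):
    """Compose a model's weights from a shared backbone plus its
    unique pieces.  `backbone` is loaded once (read-only); each
    model only materializes `overrides`, the tensors that differ."""
    weights = dict(backbone)   # cheap: copies references, not tensors
    weights.update(overrides)  # per-model unique tensors win
    return weights

# One backbone, one fine-tune that replaces a single layer:
backbone = MappingProxyType({"layer.0": b"base0", "layer.1": b"base1"})
tuned = compose_model(backbone, {"layer.1": b"finetuned"})
assert tuned["layer.0"] is backbone["layer.0"]  # shared, not copied
assert tuned["layer.1"] == b"finetuned"         # unique piece per model
```

The open question the post explores is how to do this with real GPU tensors and real checkpoints, where "slightly different" has to be detected rather than declared.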

[Read More]

Proactive Compaction

for the Linux kernel

This feature has now been accepted and merged into the upstream kernel and will be part of kernel release 5.9. This post has been updated to match the upstream version of this feature.

In my previous post, I described how the on-demand compaction scheme hurts hugepage allocation latencies on Linux. To improve the situation, I have been working on Proactive Compaction for the Linux kernel, which tries to reduce higher-order allocation latencies by compacting memory in the background.

[Read More]
linux  kernel  mm 

Linux kernel hugepage allocation latencies

A detailed analysis

Some drivers need to allocate almost all memory as hugepages to reduce (on-device or CPU) TLB pressure. However, on a running system, higher-order allocations can fail if memory is fragmented. The Linux kernel can do on-demand compaction as we request more hugepages, but this style of compaction incurs very high latency.

To show the effect of on-demand compaction on hugepage allocation latency, I created a test program, “frag”, which allocates almost all available system memory and then frees $\frac{3}{4}$ of the pages from each hugepage-aligned chunk. This allocation pattern leaves the address space ~300% fragmented w.r.t. order 9, i.e., the physical pages backing our VA space are spread over 3× more hugepage-aligned chunks than ideally required.
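The arithmetic behind that fragmentation figure can be checked with a toy model (illustrative only; the real “frag” program allocates actual memory): keeping 1/4 of the pages in every chunk leaves the survivors spread over 4× the ideal chunk count, i.e. 300% more chunks than ideal packing would need.

```python
def fragmentation(n_chunks=64, pages_per_chunk=512, keep_fraction=0.25):
    """Model the 'frag' allocation pattern: in every hugepage-sized
    chunk (order 9 = 512 base pages), keep a fraction of pages and
    free the rest, then compare how many chunks the surviving pages
    occupy vs. the ideal packing."""
    kept_per_chunk = int(pages_per_chunk * keep_fraction)
    total_kept = n_chunks * kept_per_chunk
    ideal_chunks = -(-total_kept // pages_per_chunk)  # ceiling division
    actual_chunks = n_chunks  # every chunk still holds some live pages
    return actual_chunks / ideal_chunks

print(fragmentation())  # 4.0: spread over 4x the ideal chunk count, i.e. 300% extra
```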

[Read More]
linux  kernel  mm 

A layered object store design in Elixir (Part VI)

Summary

We built an object store from scratch in Elixir using a layered design approach. The overall theme has been to avoid generalizing the design too much, which kept the implementation of each layer/module simple. We were also careful when adding any third-party dependencies, which has multiple advantages: a deeper understanding of your codebase and easier debugging (I hate unknown code-paths in backtraces).

For reference, here are links for all five parts along with their summaries:

[Read More]
elixir 

A layered object store design in Elixir (Part V)

The Web layer

Part I introduces the overall design of our object store. In this post we focus on the Web layer. This is the final layer of our object store, responsible for exposing it over the web. It will expose two endpoints: /upload for uploading a file and /file/:file_id for getting a file by ID. A typical GraphQL application would also expose a /graphql endpoint which plugs directly into your API layer, but I will not discuss that part and will stay focused on the object store side of things.

[Read More]
elixir 

A layered object store design in Elixir (Part IV)

The API layer

Part I introduces the overall design of our object store. In this post we focus on the API layer. All layers until now were concerned only with storing the input file, together with some file-format-specific transforms (like thumbnails). It is at the API layer that we store per-file system and user metadata. This metadata can be used to support application-specific business logic and security policies.

This layer will depend on all per-file-format modules: ImageStore, VideoStore, etc. We will use Postgres for storing per-file metadata, so we also depend on the postgrex package. A typical API layer would also expose a GraphQL interface, which forms the core of application-specific business logic. I am not going to include an example GraphQL interface here, but absinthe would be my preferred way of doing it, anytime.

[Read More]
elixir 

A layered object store design in Elixir (Part III)

ImageStore and VideoStore

Part I introduces the overall design of our object store. In this post we focus on the ImageStore and VideoStore modules.

ImageStore

The ImageStore module is responsible for storing images along with their thumbnails. It will use the FileStore layer to actually store files on disk. Before we define the module interfaces, let's look at our application requirements:

  • All images must be stored in the jpg format.
  • Images cannot be larger than 1920x1080. We do not want to store the user-provided version at all.
  • Thumbnails should use the same jpg format.
  • All thumbnails must have the same size of 256x256.

Note that we are going for highly application-specific requirements rather than a more general, configurable design. Most of the complexity I have seen in software stacks comes from the temptation to make them “reusable”. As you will see, the implementation is going to be so simple, with clearly defined interfaces, that it would be much easier for you to create such a module for each of your applications, with its specific requirements baked in.

[Read More]
elixir 

A layered object store design in Elixir (Part II)

The FileStore layer

Part I introduces the overall design of our object store. In this post we focus on its first layer, the FileStore.

The FileStore layer is responsible for actually storing the file in our object store. At this level, we are not concerned about what kind of file it is (image, video, document, or whatever else), nor do we have any notion of security. We just store whatever input path is given to us.
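The contract described above is tiny: take an input path, store the bytes under an opaque ID, hand the ID back. A rough stand-in sketch (in Python rather than the Elixir of the series; all names here are hypothetical):

```python
import pathlib
import shutil
import tempfile
import uuid

def store(input_path: str, root: str) -> str:
    """Minimal FileStore sketch: copy whatever file we are given
    into the store under an opaque ID, and return that ID."""
    file_id = uuid.uuid4().hex
    shutil.copyfile(input_path, pathlib.Path(root) / file_id)
    return file_id

def fetch(file_id: str, root: str) -> bytes:
    """Read a stored file back by its ID."""
    return (pathlib.Path(root) / file_id).read_bytes()

# Round-trip demo in a throwaway directory:
with tempfile.TemporaryDirectory() as root:
    src = pathlib.Path(root) / "input.bin"
    src.write_bytes(b"hello")
    fid = store(str(src), root)
    assert fetch(fid, root) == b"hello"
```

Notably absent, by design: file-type awareness and security, which belong to the layers above.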

[Read More]
elixir 

A layered object store design in Elixir (Part I)

Introduction

I recently designed an object store from scratch in Elixir. It has been serving me well as a backend for an app which needs to store all kinds of files: images, videos, documents. I wanted something simple to avoid dealing with off-the-shelf object stores which require complex configurations and to avoid cloud storage which is dead simple to use but can get very expensive, very quickly. For this project, simplicity was the key to make sure I can debug any failures quickly.

[Read More]
elixir