<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>GPU on Nitin Gupta</title>
    <link>https://nitingupta.dev/tags/gpu/</link>
    <description>Recent content in GPU on Nitin Gupta</description>
    <generator>Hugo</generator>
    <language>en</language>
    <managingEditor>ngupta@nitingupta.dev (Nitin Gupta)</managingEditor>
    <webMaster>ngupta@nitingupta.dev (Nitin Gupta)</webMaster>
    <lastBuildDate>Mon, 08 Dec 2025 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://nitingupta.dev/tags/gpu/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Tensor Deduplication for Multi-Model Inference</title>
      <link>https://nitingupta.dev/post/tensor-dedup/</link>
      <pubDate>Mon, 08 Dec 2025 00:00:00 +0000</pubDate><author>ngupta@nitingupta.dev (Nitin Gupta)</author>
      <guid>https://nitingupta.dev/post/tensor-dedup/</guid>
      <description>&lt;h2 id=&#34;summary&#34;&gt;Summary&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Multi-model workloads are the norm: A/B tests, customer fine-tunes, safety variants, multi-stage pipelines. GPU memory scales linearly with model count, and VRAM is the limiting resource.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Tensor deduplication automatically identifies and shares bit-identical weight tensors across models, requiring no checkpoint modifications.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Results&lt;/strong&gt;: Across diffusion and LLM workloads, real-world savings range from &lt;strong&gt;3–32%&lt;/strong&gt;. DeepFloyd IF stages share 18.87 GB (32% reduction). Synthetic upper bound is 50%.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Overhead&lt;/strong&gt;: Hashing adds &amp;lt;1% to model load time. Zero runtime overhead since the forward pass is unchanged.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Compatibility&lt;/strong&gt;: Works with HuggingFace safetensors, GGUF, and Diffusers pipelines. No changes to training or checkpoints required.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;multi-model-memory-bloat&#34;&gt;Multi-Model Memory Bloat&lt;/h2&gt;&#xA;&lt;p&gt;Modern inference deployments rarely serve a single model. Production systems routinely load:&lt;/p&gt;</description>
    </item>
    <item>
      <title>Shared Backbones: Loading Weights Once, Serving Many Models</title>
      <link>https://nitingupta.dev/post/shared-backbones/</link>
      <pubDate>Sat, 29 Nov 2025 00:00:00 +0000</pubDate><author>ngupta@nitingupta.dev (Nitin Gupta)</author>
      <guid>https://nitingupta.dev/post/shared-backbones/</guid>
      <description>&lt;p&gt;I keep running into the same pattern when trying to self-host models (which is a lot of fun): we run several big models side by side, all of them valuable, all of them slightly different, and all of them wasting VRAM by reloading nearly the same weights.&lt;/p&gt;&#xA;&lt;p&gt;This post is my attempt to explore a specific idea:&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Can we load a shared backbone of weights once on a GPU, then load only the small, unique pieces per model that reuse that backbone?&lt;/strong&gt;&lt;/p&gt;</description>
    </item>
  </channel>
</rss>
