<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Lmdb on Avinash&#39;s Blog</title>
    <link>https://avimallu.dev/tags/lmdb/</link>
    <description>Recent content in Lmdb on Avinash&#39;s Blog</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en-US</language>
    <copyright>© Avinash Mallya</copyright>
    <lastBuildDate>Tue, 10 Feb 2026 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://avimallu.dev/tags/lmdb/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Resolving I/O Bottlenecks for 100K Small Files with LMDB</title>
      <link>https://avimallu.dev/blog/005_ldmb_as_image_db/</link>
      <pubDate>Tue, 10 Feb 2026 00:00:00 +0000</pubDate>
      <guid>https://avimallu.dev/blog/005_ldmb_as_image_db/</guid>
      <description>&lt;h1 id=&#34;premise&#34;&gt;Premise&lt;/h1&gt;&#xA;&lt;p&gt;I was building a model for processing images (OCR, classification, object detection all bundled in), and I found myself with another problem - too many images to store efficiently as individual files on disk.&lt;/p&gt;&#xA;&lt;p&gt;I had several thousand images already.&#xA;I was expecting several thousand more.&#xA;My repository was tracking these images via DVC.&#xA;My computer was also slowing down massively because of the sheer number of files.&#xA;DVC itself was slowing down (after all, randomly accessing many files isn&amp;rsquo;t going to be fast).&#xA;I also needed to access files at random for training/evaluating the model (lots of shuffling).&#xA;Lastly, these images had their own associated metadata (labels, bounding boxes, &amp;ldquo;correct&amp;rdquo; text etc.), and they needed to be stored along with the images - or at least be easily linkable to them.&lt;/p&gt;</description>
      <content:encoded><![CDATA[<h1 id="premise">Premise</h1>
<p>I was building a model for processing images (OCR, classification, object detection all bundled in), and I found myself with another problem - too many images to store efficiently as individual files on disk.</p>
<p>I had several thousand images already.
I was expecting several thousand more.
My repository was tracking these images via DVC.
My computer was also slowing down massively because of the sheer number of files.
DVC itself was slowing down (after all, randomly accessing many files isn&rsquo;t going to be fast).
I also needed to access files at random for training/evaluating the model (lots of shuffling).
Lastly, these images had their own associated metadata (labels, bounding boxes, &ldquo;correct&rdquo; text etc.), and they needed to be stored along with the images - or at least be easily linkable to them.</p>
<p>I was primarily aiming for a &ldquo;simple&rdquo; solution, and didn&rsquo;t need a productionizable codebase.</p>
<h1 id="potential-solutions">Potential Solutions</h1>
<h2 id="partitioning">Partitioning</h2>
<p>A typical solution for &ldquo;too many files&rdquo; is to partition them by their name. It&rsquo;s ideal if their
name is a hash, so you can create a directory for the first character of the hash, a subdirectory for the
second character, and place the actual file inside that. So, for example, the directory changes from:</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln">1</span><span class="cl">files/
</span></span><span class="line"><span class="ln">2</span><span class="cl">├── a1b2c3d4e5.txt
</span></span><span class="line"><span class="ln">3</span><span class="cl">├── b7f8a9c0d1.txt
</span></span><span class="line"><span class="ln">4</span><span class="cl">├── b7e4d2f1a0.txt
</span></span><span class="line"><span class="ln">5</span><span class="cl">├── cf1a2b3c4d.txt
</span></span><span class="line"><span class="ln">6</span><span class="cl">├── c0d1e2f3a4.txt
</span></span><span class="line"><span class="ln">7</span><span class="cl">└── a1f5e6d7c8.txt</span></span></code></pre></div><p>to</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="ln"> 1</span><span class="cl">files/
</span></span><span class="line"><span class="ln"> 2</span><span class="cl">├── a/
</span></span><span class="line"><span class="ln"> 3</span><span class="cl">│   └── 1/
</span></span><span class="line"><span class="ln"> 4</span><span class="cl">│       ├── a1b2c3d4e5.txt
</span></span><span class="line"><span class="ln"> 5</span><span class="cl">│       └── a1f5e6d7c8.txt
</span></span><span class="line"><span class="ln"> 6</span><span class="cl">├── b/
</span></span><span class="line"><span class="ln"> 7</span><span class="cl">│   └── 7/
</span></span><span class="line"><span class="ln"> 8</span><span class="cl">│       ├── b7f8a9c0d1.txt
</span></span><span class="line"><span class="ln"> 9</span><span class="cl">│       └── b7e4d2f1a0.txt
</span></span><span class="line"><span class="ln">10</span><span class="cl">└── c/
</span></span><span class="line"><span class="ln">11</span><span class="cl">    ├── 0/
</span></span><span class="line"><span class="ln">12</span><span class="cl">    │   └── c0d1e2f3a4.txt
</span></span><span class="line"><span class="ln">13</span><span class="cl">    └── f/
</span></span><span class="line"><span class="ln">14</span><span class="cl">        └── cf1a2b3c4d.txt</span></span></code></pre></div><p>This isn&rsquo;t novel - <code>git</code> and <code>dvc</code> both store their objects this way.
This limits the number of entries per directory, so file system lookups take less time.</p>
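<p>A minimal sketch of how such a partitioned path could be computed is below. The helper names
<code>content_hash_name</code> and <code>partitioned_path</code> are hypothetical, and naming files by their SHA-256
digest is an assumption made for illustration, not necessarily what <code>git</code> or <code>dvc</code> do internally.</p>

```python
import hashlib
from pathlib import Path


def content_hash_name(content: bytes, suffix: str) -> str:
    """Name a file after the SHA-256 hex digest of its content (an assumed choice)."""
    return hashlib.sha256(content).hexdigest() + suffix


def partitioned_path(root: Path, file_name: str) -> Path:
    """Place file_name under root/FIRST_CHAR/SECOND_CHAR/, as in the tree above."""
    return root / file_name[0] / file_name[1] / file_name


if __name__ == "__main__":
    # Reproduces one entry from the example directory listing.
    print(partitioned_path(Path("files"), "a1b2c3d4e5.txt"))
```

<p>Before writing a file to the returned path, the parent directory would need to be created, for example with
<code>path.parent.mkdir(parents=True, exist_ok=True)</code>.</p>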
<p>This isn&rsquo;t a perfect solution - it requires that I store the images under their hash,
and handle the directory structure correctly. I would also need my own
mechanism to maintain the link between the hash and the metadata, which means creating
some sort of index. Lastly, DVC would still track files individually, which means that
its <code>push</code>, <code>diff</code> and <code>pull</code> commands would still be slow.</p>
<h2 id="separate-object-storage-maintain-only-the-index">Separate (object) storage, maintain only the index</h2>
<p>Another solution would be to offload the storage to a medium capable of handling a large number of files in any order,
while maintaining random access - for example, S3. My task would then reduce to just correctly maintaining the index so
that I know how many images are stored on S3, and can link their metadata via their hash.</p>
<p>If you&rsquo;ve ever retrieved a list of a large number of files stored on S3, you&rsquo;ll have encountered the limit
of 1000 objects per request that S3&rsquo;s list API enforces (and that <code>boto3</code> surfaces). You&rsquo;ll need to work around it with pagination, which, while standard,
is still more work. You&rsquo;ll also have realized that, even with pagination in place and everything else optimized, S3 takes quite some time to
return the full list.</p>
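<p>For reference, the pagination workaround can be sketched roughly as follows. This assumes a <code>boto3</code> S3 client
is passed in; the function name <code>iter_s3_keys</code> and the bucket name in the usage comment are hypothetical.</p>

```python
from typing import Any, Iterator


def iter_s3_keys(client: Any, bucket: str, prefix: str = "") -> Iterator[str]:
    """Yield every object key in a bucket, following pagination across requests."""
    # Each page holds at most 1000 objects; the paginator requests the next page for us.
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages without matching objects carry no "Contents" entry.
        for obj in page.get("Contents", []):
            yield obj["Key"]


# Usage (assumes boto3 is installed and AWS credentials are configured):
#   import boto3
#   keys = list(iter_s3_keys(boto3.client("s3"), "my-hypothetical-bucket"))
```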
<p>Using S3 also means I need an active internet connection to access any data. It introduces latency during training,
and wastes quite a bit of bandwidth when running multiple training sessions for the models. I could minimize these if I stored the
images in the same AWS region that I run the training in (say, on an EC2 instance), but that means I would need access to an
EC2 instance.</p>
<p>This still left me the task of maintaining my own index, which I really wanted to avoid, as it would mean additional
maintenance burden for a relatively nascent project that hasn&rsquo;t reached production status, while demanding production
code for an even more nascent pipeline.</p>
<h2 id="what-about-a-database">What about a&hellip; database?</h2>
<p>This feels much like reaching for your nose by looping your hand behind your head instead of touching it directly with
your fingers. A database is great for many things, but setting one up and maintaining it (even a local SQLite one) isn&rsquo;t
really a quick and painless process, and has many gotchas. For instance, most databases aren&rsquo;t optimized for storing a
large number of binary blobs.</p>
<p>I considered, and tested, using a more modern embeddable analytical database like DuckDB for this purpose (I&rsquo;m quite
biased towards using DuckDB and/or Polars to solve a large number of my data processing problems). I quickly
found out that storing large binary blobs in it causes it to choke (which is fair, it isn&rsquo;t really designed for that). Storing
the files elsewhere while maintaining just the index in it still had the original problem - I needed to write the mechanism
of maintaining the index.</p>
<blockquote>
<p>Note: HuggingFace now provides many image datasets (such as <a href="https://huggingface.co/datasets/ylecun/mnist">MNIST</a>) in
the Parquet format, with the images stored using Arrow&rsquo;s extension types (but still as binary blobs). My experience with
storing binary data in Parquet hasn&rsquo;t been great, but you could check this out to see if it meets your requirements.</p>
</blockquote>
<h1 id="the-solution-i-landed-on">The solution I landed on</h1>
<h2 id="what-about-a-different-kind-of-database">What about a&hellip; <em>different</em> kind of database?</h2>
<p>Let&rsquo;s get down to first principles. What did I want to do? I wanted to store images. With those images, I also wanted
to store their metadata. I wanted to access said data quickly. It became clearer to me that I was looking for a fast key-value
store, and I stumbled upon <a href="https://www.symas.com/mdb">LMDB</a>.</p>
<h2 id="lmdb">LMDB</h2>
<p>Wikipedia&rsquo;s <a href="https://en.wikipedia.org/wiki/Lightning_Memory-Mapped_Database">entry</a> on LMDB indicates that it&rsquo;s an incredibly
small (64kB) piece of software that does one thing really well - be a ridiculously fast key-value store. I won&rsquo;t pretend to
understand how it works; the writeup on the Wiki provides plenty of good detail. I&rsquo;ll focus, rather, on how I used it to
solve my problem.</p>
<h2 id="storing-and-retrieving-image-data-along-with-its-metadata">Storing and retrieving image data along with its metadata</h2>
<p>LMDB is rather barebones. It exposes few features - essentially the ability to write and read a particular key (stored as a bytestring),
which itself points to arbitrary bytes. I used its <a href="https://lmdb.readthedocs.io/en/release/">Python bindings</a>.</p>
<p>I wrote a tiny class (~200 LoC) that did the following:</p>
<ol>
<li>Read the file names and the data of the images that I had, along with their metadata as it currently was, into Python, in batches to avoid running out of memory.</li>
<li>Read the image files as bytes and serialize the metadata, and link them to the keys <code>f&quot;{file_name}_image&quot;</code> and <code>f&quot;{file_name}_metadata&quot;</code> respectively.</li>
<li>Store these as key-value pairs in the LMDB database, which is a <strong>single</strong> file.</li>
<li>Provide a method to read the keys to identify all the images present in the database.</li>
<li>Provide a method to retrieve an arbitrary set of images and their metadata quickly from the saved database.</li>
</ol>
<p>This has many advantages:</p>
<ol>
<li>LMDB is fast - really, really fast. Random access? Check. Fast retrieval of available images? Check.</li>
<li>DVC tracking becomes simple - maintain a single file, and just version control that. No slowdowns due to the sheer number of files - either for DVC, or my computer.</li>
<li>No index to maintain - the images and their metadata are stored in the same location, and linkable via a mere change in the suffix to their file name.</li>
<li>Local access, practically zero latency.</li>
</ol>
<p>Which solves&hellip; all of the problems I had! When new files come in, all I need to do is add them to the DB. LMDB has a few options available here - for instance, you can
refuse to overwrite a key that already exists, which keeps the database de-duplicated.</p>
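<p>The key scheme described in the steps above can be sketched as follows. This is an illustration only: serializing
the metadata as JSON is an assumption, the helper names <code>make_records</code>, <code>read_record</code> and
<code>image_names</code> are hypothetical, and a plain <code>dict</code> stands in for the LMDB database.</p>

```python
import json
from typing import Iterable, Mapping

IMAGE_SUFFIX = "_image"
METADATA_SUFFIX = "_metadata"


def make_records(file_name: str, image_bytes: bytes, metadata: dict) -> list[tuple[bytes, bytes]]:
    """Build the two key-value pairs stored for a single image."""
    return [
        (f"{file_name}{IMAGE_SUFFIX}".encode(), image_bytes),
        # JSON is an assumed serialization format; any bytes-producing format would do.
        (f"{file_name}{METADATA_SUFFIX}".encode(), json.dumps(metadata).encode()),
    ]


def read_record(store: Mapping[bytes, bytes], file_name: str) -> tuple[bytes, dict]:
    """Fetch an image and its metadata by swapping the key suffix."""
    image_bytes = store[f"{file_name}{IMAGE_SUFFIX}".encode()]
    metadata = json.loads(store[f"{file_name}{METADATA_SUFFIX}".encode()])
    return image_bytes, metadata


def image_names(keys: Iterable[bytes]) -> list[str]:
    """Recover the stored file names by keeping only the image keys."""
    names = []
    for key in keys:
        decoded = key.decode()
        if decoded.endswith(IMAGE_SUFFIX):
            names.append(decoded[: -len(IMAGE_SUFFIX)])
    return names
```

<p>With LMDB, the pairs returned by <code>make_records</code> would be written inside a write transaction instead of
being placed in a <code>dict</code>.</p>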
<h1 id="the-code">The code</h1>
<p>I&rsquo;ve provided sample code below that demonstrates storing just the images (not metadata) for the <a href="https://www.robots.ox.ac.uk/~vgg/data/flowers/102/">Oxford 102 Category Flower</a>
dataset, which has around 8000 images.</p>





<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-py" data-lang="py"><span class="line"><span class="ln">  1</span><span class="cl"><span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>
</span></span><span class="line"><span class="ln">  2</span><span class="cl"><span class="kn">import</span> <span class="nn">lmdb</span>
</span></span><span class="line"><span class="ln">  3</span><span class="cl">
</span></span><span class="line"><span class="ln">  4</span><span class="cl">
</span></span><span class="line"><span class="ln">  5</span><span class="cl"><span class="k">class</span> <span class="nc">ImageDB</span><span class="p">:</span>
</span></span><span class="line"><span class="ln">  6</span><span class="cl">    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">env_path</span><span class="p">:</span> <span class="n">Path</span><span class="p">,</span> <span class="n">max_size_as_mb</span><span class="p">:</span> <span class="nb">int</span><span class="p">):</span>
</span></span><span class="line"><span class="ln">  7</span><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">env_path</span> <span class="o">=</span> <span class="nb">str</span><span class="p">(</span><span class="n">env_path</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">  8</span><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">env</span> <span class="o">=</span> <span class="n">lmdb</span><span class="o">.</span><span class="n">open</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">env_path</span><span class="p">,</span> <span class="n">map_size</span><span class="o">=</span><span class="n">max_size_as_mb</span> <span class="o">*</span> <span class="p">(</span><span class="mi">2</span><span class="o">**</span><span class="mi">20</span><span class="p">))</span>
</span></span><span class="line"><span class="ln">  9</span><span class="cl">        <span class="bp">self</span><span class="o">.</span><span class="n">db</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">open_db</span><span class="p">()</span>
</span></span><span class="line"><span class="ln"> 10</span><span class="cl">
</span></span><span class="line"><span class="ln"> 11</span><span class="cl">    <span class="k">def</span> <span class="nf">save_image</span><span class="p">(</span>
</span></span><span class="line"><span class="ln"> 12</span><span class="cl">        <span class="bp">self</span><span class="p">,</span>
</span></span><span class="line"><span class="ln"> 13</span><span class="cl">        <span class="n">name</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
</span></span><span class="line"><span class="ln"> 14</span><span class="cl">        <span class="n">image_path</span><span class="p">:</span> <span class="n">Path</span><span class="p">,</span>
</span></span><span class="line"><span class="ln"> 15</span><span class="cl">    <span class="p">)</span> <span class="o">-&gt;</span> <span class="kc">None</span><span class="p">:</span>
</span></span><span class="line"><span class="ln"> 16</span><span class="cl">        <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">begin</span><span class="p">(</span><span class="n">write</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span> <span class="k">as</span> <span class="n">txn</span><span class="p">:</span>
</span></span><span class="line"><span class="ln"> 17</span><span class="cl">            <span class="n">txn</span><span class="o">.</span><span class="n">put</span><span class="p">(</span><span class="n">name</span><span class="o">.</span><span class="n">encode</span><span class="p">(),</span> <span class="n">image_path</span><span class="o">.</span><span class="n">read_bytes</span><span class="p">())</span>
</span></span><span class="line"><span class="ln"> 18</span><span class="cl">
</span></span><span class="line"><span class="ln"> 19</span><span class="cl">    <span class="k">def</span> <span class="nf">read_image</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">name</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">bytes</span><span class="p">:</span>
</span></span><span class="line"><span class="ln"> 20</span><span class="cl">        <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">begin</span><span class="p">(</span><span class="n">write</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span> <span class="k">as</span> <span class="n">txn</span><span class="p">:</span>
</span></span><span class="line"><span class="ln"> 21</span><span class="cl">            <span class="k">if</span> <span class="n">image_as_bytes</span> <span class="o">:=</span> <span class="n">txn</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">name</span><span class="o">.</span><span class="n">encode</span><span class="p">()):</span>
</span></span><span class="line"><span class="ln"> 22</span><span class="cl">                <span class="k">return</span> <span class="n">image_as_bytes</span>
</span></span><span class="line"><span class="ln"> 23</span><span class="cl">            <span class="k">else</span><span class="p">:</span>
</span></span><span class="line"><span class="ln"> 24</span><span class="cl">                <span class="k">raise</span> <span class="ne">KeyError</span><span class="p">(</span><span class="n">name</span><span class="p">)</span>
</span></span><span class="line"><span class="ln"> 25</span><span class="cl">
</span></span><span class="line"><span class="ln"> 26</span><span class="cl">    <span class="k">def</span> <span class="nf">save_images</span><span class="p">(</span>
</span></span><span class="line"><span class="ln"> 27</span><span class="cl">        <span class="bp">self</span><span class="p">,</span>
</span></span><span class="line"><span class="ln"> 28</span><span class="cl">        <span class="n">name_image</span><span class="p">:</span> <span class="nb">dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="n">Path</span><span class="p">],</span>
</span></span><span class="line"><span class="ln"> 29</span><span class="cl">    <span class="p">)</span> <span class="o">-&gt;</span> <span class="kc">None</span><span class="p">:</span>
</span></span><span class="line"><span class="ln"> 30</span><span class="cl">        <span class="c1"># Note: you might need to enforce a batch size here</span>
</span></span><span class="line"><span class="ln"> 31</span><span class="cl">        <span class="c1"># to avoid running out of memory because this loads</span>
</span></span><span class="line"><span class="ln"> 32</span><span class="cl">        <span class="c1"># all images sent to this function as bytes.</span>
</span></span><span class="line"><span class="ln"> 33</span><span class="cl">        <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">begin</span><span class="p">(</span><span class="n">write</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span> <span class="k">as</span> <span class="n">txn</span><span class="p">:</span>
</span></span><span class="line"><span class="ln"> 34</span><span class="cl">            <span class="n">item_tuples</span> <span class="o">=</span> <span class="p">[</span>
</span></span><span class="line"><span class="ln"> 35</span><span class="cl">                <span class="p">(</span><span class="n">k</span><span class="o">.</span><span class="n">encode</span><span class="p">(),</span> <span class="n">image_path</span><span class="o">.</span><span class="n">read_bytes</span><span class="p">())</span>
</span></span><span class="line"><span class="ln"> 36</span><span class="cl">                <span class="k">for</span> <span class="n">k</span><span class="p">,</span> <span class="n">image_path</span> <span class="ow">in</span> <span class="n">name_image</span><span class="o">.</span><span class="n">items</span><span class="p">()</span>
</span></span><span class="line"><span class="ln"> 37</span><span class="cl">            <span class="p">]</span>
</span></span><span class="line"><span class="ln"> 38</span><span class="cl">            <span class="n">cursor</span> <span class="o">=</span> <span class="n">txn</span><span class="o">.</span><span class="n">cursor</span><span class="p">()</span>
</span></span><span class="line"><span class="ln"> 39</span><span class="cl">            <span class="n">consumed</span><span class="p">,</span> <span class="n">added</span> <span class="o">=</span> <span class="n">cursor</span><span class="o">.</span><span class="n">putmulti</span><span class="p">(</span>
</span></span><span class="line"><span class="ln"> 40</span><span class="cl">                <span class="n">item_tuples</span><span class="p">,</span> <span class="n">dupdata</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="n">overwrite</span><span class="o">=</span><span class="kc">False</span>
</span></span><span class="line"><span class="ln"> 41</span><span class="cl">            <span class="p">)</span>
</span></span><span class="line"><span class="ln"> 42</span><span class="cl">            <span class="nb">print</span><span class="p">(</span>
</span></span><span class="line"><span class="ln"> 43</span><span class="cl">                <span class="sa">f</span><span class="s2">&#34;Saved </span><span class="si">{</span><span class="n">added</span><span class="si">:</span><span class="s2">,</span><span class="si">}</span><span class="s2"> out of </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">name_image</span><span class="p">)</span><span class="si">:</span><span class="s2">,</span><span class="si">}</span><span class="s2"> images to the DB (</span><span class="si">{</span><span class="n">consumed</span> <span class="o">-</span> <span class="n">added</span><span class="si">:</span><span class="s2">,</span><span class="si">}</span><span class="s2"> seem to already exist).&#34;</span>
</span></span><span class="line"><span class="ln"> 44</span><span class="cl">            <span class="p">)</span>
</span></span><span class="line"><span class="ln"> 45</span><span class="cl">
</span></span><span class="line"><span class="ln"> 46</span><span class="cl">    <span class="k">def</span> <span class="nf">load_images</span><span class="p">(</span>
</span></span><span class="line"><span class="ln"> 47</span><span class="cl">        <span class="bp">self</span><span class="p">,</span>
</span></span><span class="line"><span class="ln"> 48</span><span class="cl">        <span class="n">names</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">str</span><span class="p">],</span>
</span></span><span class="line"><span class="ln"> 49</span><span class="cl">    <span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="nb">bytes</span><span class="p">]:</span>
</span></span><span class="line"><span class="ln"> 50</span><span class="cl">        <span class="n">names_as_bytestrings</span> <span class="o">=</span> <span class="p">[</span><span class="n">x</span><span class="o">.</span><span class="n">encode</span><span class="p">()</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">names</span><span class="p">]</span>
</span></span><span class="line"><span class="ln"> 51</span><span class="cl">        <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">begin</span><span class="p">(</span><span class="n">write</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span> <span class="k">as</span> <span class="n">txn</span><span class="p">:</span>
</span></span><span class="line"><span class="ln"> 52</span><span class="cl">            <span class="n">cursor</span> <span class="o">=</span> <span class="n">txn</span><span class="o">.</span><span class="n">cursor</span><span class="p">()</span>
</span></span><span class="line"><span class="ln"> 53</span><span class="cl">            <span class="k">return</span> <span class="p">{</span>
</span></span><span class="line"><span class="ln"> 54</span><span class="cl">                <span class="n">k</span><span class="o">.</span><span class="n">decode</span><span class="p">():</span> <span class="n">image_as_bytes</span>
</span></span><span class="line"><span class="ln"> 55</span><span class="cl">                <span class="k">for</span> <span class="n">k</span><span class="p">,</span> <span class="n">image_as_bytes</span> <span class="ow">in</span> <span class="n">cursor</span><span class="o">.</span><span class="n">getmulti</span><span class="p">(</span><span class="n">names_as_bytestrings</span><span class="p">)</span>
</span></span><span class="line"><span class="ln"> 56</span><span class="cl">            <span class="p">}</span>
</span></span><span class="line"><span class="ln"> 57</span><span class="cl">
</span></span><span class="line"><span class="ln"> 58</span><span class="cl">    <span class="k">def</span> <span class="nf">delete_image</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">name</span><span class="p">:</span> <span class="nb">str</span><span class="p">):</span>
</span></span><span class="line"><span class="ln"> 59</span><span class="cl">        <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">begin</span><span class="p">(</span><span class="n">write</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span> <span class="k">as</span> <span class="n">txn</span><span class="p">:</span>
</span></span><span class="line"><span class="ln"> 60</span><span class="cl">            <span class="k">if</span> <span class="n">txn</span><span class="o">.</span><span class="n">delete</span><span class="p">(</span><span class="n">name</span><span class="o">.</span><span class="n">encode</span><span class="p">()):</span>
</span></span><span class="line"><span class="ln"> 61</span><span class="cl">                <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;Image </span><span class="si">{</span><span class="n">name</span><span class="si">}</span><span class="s2"> deleted successfully&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="ln"> 62</span><span class="cl">            <span class="k">else</span><span class="p">:</span>
</span></span><span class="line"><span class="ln"> 63</span><span class="cl">                <span class="k">raise</span> <span class="ne">KeyError</span><span class="p">(</span><span class="n">name</span><span class="p">)</span>
</span></span><span class="line"><span class="ln"> 64</span><span class="cl">
</span></span><span class="line"><span class="ln"> 65</span><span class="cl">    <span class="k">def</span> <span class="nf">retrieve_names</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">list</span><span class="p">[</span><span class="nb">str</span><span class="p">]:</span>
</span></span><span class="line"><span class="ln"> 66</span><span class="cl">        <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">begin</span><span class="p">(</span><span class="n">write</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span> <span class="k">as</span> <span class="n">txn</span><span class="p">:</span>
</span></span><span class="line"><span class="ln"> 67</span><span class="cl">            <span class="k">return</span> <span class="p">[</span><span class="n">x</span><span class="o">.</span><span class="n">decode</span><span class="p">()</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">txn</span><span class="o">.</span><span class="n">cursor</span><span class="p">()</span><span class="o">.</span><span class="n">iternext</span><span class="p">(</span><span class="n">keys</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">values</span><span class="o">=</span><span class="kc">False</span><span class="p">)]</span>
</span></span><span class="line"><span class="ln"> 68</span><span class="cl">
</span></span><span class="line"><span class="ln"> 69</span><span class="cl">
</span></span><span class="line"><span class="ln"> 70</span><span class="cl"><span class="k">if</span> <span class="vm">__name__</span> <span class="o">==</span> <span class="s2">&#34;__main__&#34;</span><span class="p">:</span>
</span></span><span class="line"><span class="ln"> 71</span><span class="cl">    <span class="n">db</span> <span class="o">=</span> <span class="n">ImageDB</span><span class="p">(</span><span class="n">Path</span><span class="p">(</span><span class="s2">&#34;./db/&#34;</span><span class="p">),</span> <span class="mi">512</span><span class="p">)</span>
</span></span><span class="line"><span class="ln"> 72</span><span class="cl">    <span class="n">name_image</span><span class="p">:</span> <span class="nb">dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="n">Path</span><span class="p">]</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">()</span>
</span></span><span class="line"><span class="ln"> 73</span><span class="cl">
</span></span><span class="line"><span class="ln"> 74</span><span class="cl">    <span class="c1"># Save the results</span>
</span></span><span class="line"><span class="ln"> 75</span><span class="cl">    <span class="k">for</span> <span class="n">image_path</span> <span class="ow">in</span> <span class="n">Path</span><span class="p">(</span><span class="s2">&#34;./data/jpg/&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">glob</span><span class="p">(</span><span class="s2">&#34;*.jpg&#34;</span><span class="p">):</span>
</span></span><span class="line"><span class="ln"> 76</span><span class="cl">        <span class="n">name_image</span><span class="p">[</span><span class="n">image_path</span><span class="o">.</span><span class="n">name</span><span class="p">]</span> <span class="o">=</span> <span class="n">image_path</span>
</span></span><span class="line"><span class="ln"> 77</span><span class="cl">        <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">name_image</span><span class="p">)</span> <span class="o">&gt;=</span> <span class="mi">1000</span><span class="p">:</span>
</span></span><span class="line"><span class="ln"> 78</span><span class="cl">            <span class="n">db</span><span class="o">.</span><span class="n">save_images</span><span class="p">(</span><span class="n">name_image</span><span class="p">)</span>
</span></span><span class="line"><span class="ln"> 79</span><span class="cl">            <span class="n">name_image</span><span class="o">.</span><span class="n">clear</span><span class="p">()</span>
</span></span><span class="line"><span class="ln"> 80</span><span class="cl">    <span class="c1"># Add last batch also</span>
</span></span><span class="line"><span class="ln"> 81</span><span class="cl">    <span class="n">db</span><span class="o">.</span><span class="n">save_images</span><span class="p">(</span><span class="n">name_image</span><span class="p">)</span>
</span></span><span class="line"><span class="ln"> 82</span><span class="cl">
</span></span><span class="line"><span class="ln"> 83</span><span class="cl">    <span class="k">del</span> <span class="n">name_image</span>
</span></span><span class="line"><span class="ln"> 84</span><span class="cl">    <span class="c1"># How many images have been stored?</span>
</span></span><span class="line"><span class="ln"> 85</span><span class="cl">    <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&#34;The DB has </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">db</span><span class="o">.</span><span class="n">retrieve_names</span><span class="p">())</span><span class="si">:</span><span class="s2">,</span><span class="si">}</span><span class="s2"> images stored&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="ln"> 86</span><span class="cl">
</span></span><span class="line"><span class="ln"> 87</span><span class="cl">    <span class="n">name_image</span><span class="p">:</span> <span class="nb">dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="nb">bytes</span><span class="p">]</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">()</span>
</span></span><span class="line"><span class="ln"> 88</span><span class="cl">    <span class="c1"># Load the results from the DB and check if they match the files on disk</span>
</span></span><span class="line"><span class="ln"> 89</span><span class="cl">    <span class="k">for</span> <span class="n">image_path</span> <span class="ow">in</span> <span class="n">Path</span><span class="p">(</span><span class="s2">&#34;./data/jpg&#34;</span><span class="p">)</span><span class="o">.</span><span class="n">glob</span><span class="p">(</span><span class="s2">&#34;*.jpg&#34;</span><span class="p">):</span>
</span></span><span class="line"><span class="ln"> 90</span><span class="cl">        <span class="n">name_image</span><span class="p">[</span><span class="n">image_path</span><span class="o">.</span><span class="n">name</span><span class="p">]</span> <span class="o">=</span> <span class="n">image_path</span><span class="o">.</span><span class="n">read_bytes</span><span class="p">()</span>
</span></span><span class="line"><span class="ln"> 91</span><span class="cl">        <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">name_image</span><span class="p">)</span> <span class="o">&gt;=</span> <span class="mi">1000</span><span class="p">:</span>
</span></span><span class="line"><span class="ln"> 92</span><span class="cl">            <span class="n">saved_name_image</span> <span class="o">=</span> <span class="n">db</span><span class="o">.</span><span class="n">load_images</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="n">name_image</span><span class="o">.</span><span class="n">keys</span><span class="p">()))</span>
</span></span><span class="line"><span class="ln"> 93</span><span class="cl">            <span class="k">assert</span> <span class="n">name_image</span> <span class="o">==</span> <span class="n">saved_name_image</span>
</span></span><span class="line"><span class="ln"> 94</span><span class="cl">            <span class="n">name_image</span><span class="o">.</span><span class="n">clear</span><span class="p">()</span>
</span></span><span class="line"><span class="ln"> 95</span><span class="cl">    <span class="c1"># Verify last batch also</span>
</span></span><span class="line"><span class="ln"> 96</span><span class="cl">    <span class="n">saved_name_image</span> <span class="o">=</span> <span class="n">db</span><span class="o">.</span><span class="n">load_images</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="n">name_image</span><span class="o">.</span><span class="n">keys</span><span class="p">()))</span>
</span></span><span class="line"><span class="ln"> 97</span><span class="cl">    <span class="k">assert</span> <span class="n">name_image</span> <span class="o">==</span> <span class="n">saved_name_image</span>
</span></span><span class="line"><span class="ln"> 98</span><span class="cl">
</span></span><span class="line"><span class="ln"> 99</span><span class="cl">    <span class="nb">print</span><span class="p">(</span><span class="s2">&#34;All images stored are byte identical to the original ones!&#34;</span><span class="p">)</span>
</span></span><span class="line"><span class="ln">100</span><span class="cl">
</span></span><span class="line"><span class="ln">101</span><span class="cl">    <span class="n">db</span><span class="o">.</span><span class="n">env</span><span class="o">.</span><span class="n">close</span><span class="p">()</span></span></span></code></pre></div><p>This should give you a good starting point for implementing additional features,
such as storing metadata, filtering required input by metadata (such as extracting
a specific label for evaluation) and so on.</p>
<h1 id="caveats">Caveats</h1>
<h2 id="avoid-pil-or-pay-the-small-price">Avoid PIL, or pay the (small) price</h2>
<p>One gotcha that I initially faced is that the images I saved weren&rsquo;t the same as
the images that I retrieved. This wasn&rsquo;t LMDB&rsquo;s fault: I was reading the images
from disk via PIL and storing the result as bytes in LMDB. PIL decodes the image
on load and encodes it again on save, so a roundtrip will not necessarily be
byte-identical, even for lossless file formats (other than bitmap images).</p>
<p>Don&rsquo;t decode and re-encode the image before you store it, or be prepared for the
stored data to not be byte-identical. Reading the raw file contents, as
<code>image_path.read_bytes()</code> does in the code above, avoids the problem.</p>
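<p>To see why a lossless roundtrip can still change the bytes, here is a minimal
sketch that uses the standard library&rsquo;s <code>zlib</code> as a stand-in for an image
codec (PNG compresses its pixel data with the same DEFLATE scheme). It does not
use PIL, and the <code>pixels</code> buffer is made-up data, so treat it as an analogy
rather than a reproduction of what PIL does internally.</p>
<div class="highlight"><pre><code class="language-python" data-lang="python">import zlib

# Stand-in for decoded pixel data (an assumption; any bytes would do).
pixels = bytes(range(256)) * 64

# The "original file": the pixels compressed with one encoder setting.
original_file = zlib.compress(pixels, 1)

# Decode, then re-encode with a different setting, as a library might on save.
decoded = zlib.decompress(original_file)
reencoded_file = zlib.compress(decoded, 9)

# The content survives the roundtrip, but the stored bytes do not match.
same_pixels = zlib.decompress(reencoded_file) == pixels
same_bytes = reencoded_file == original_file
print(same_pixels, same_bytes)  # True False
</code></pre></div>
<p>The decoded content is identical in both cases; only the encoded representation
differs, which is exactly what a byte-for-byte comparison picks up.</p>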
<h2 id="the-max_size_as_mb-argument">The <code>max_size_as_mb</code> argument</h2>
<p>LMDB has an unusual design. You need to specify the upper bound of the DB size
upon creation, and any write that would grow the DB past this bound will fail.
You can raise the bound later, with some caveats (on Windows, this will actually
allocate the full size).</p>
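<p>One way to pick a bound is to measure the files you plan to store and add some
headroom. The helper below is a hypothetical sketch, not part of the
<code>ImageDB</code> class above: the name <code>suggest_max_size_mb</code> and the
default <code>headroom</code> factor of 2 are my own assumptions. LMDB stores data in
fixed-size pages, so the DB will be somewhat larger than the sum of the raw file
sizes, and the headroom is only a rough allowance for that.</p>
<div class="highlight"><pre><code class="language-python" data-lang="python">import math
from pathlib import Path


def suggest_max_size_mb(image_dir, pattern="*.jpg", headroom=2.0):
    """Return a size in MiB covering the matching files times a headroom factor."""
    # Sum the on-disk size of every file that would be stored.
    total_bytes = sum(p.stat().st_size for p in Path(image_dir).glob(pattern))
    # Round up to whole MiB and never return less than 1.
    return max(1, math.ceil(total_bytes * headroom / (1024 * 1024)))
</code></pre></div>
<p>The result could then be passed where the example above hard-codes <code>512</code>,
assuming that argument is interpreted as a size in megabytes.</p>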
<h2 id="concurrency-and-lmdb">Concurrency and LMDB</h2>
<p>LMDB, while extremely fast, has some considerations with concurrency. See
<a href="https://lmdb.readthedocs.io/en/release/#threads">the documentation</a> for details.
It may not be suited for distributed workloads.</p>
<h1 id="alternatives">Alternatives</h1>
<p>This article covers a &ldquo;quick and dirty&rdquo; solution that I put together before
more purpose-built solutions were available. Some alternatives are:</p>
<ol>
<li>If you&rsquo;re comfortable operating directly on archives, a simple <code>tar</code> file will
do - you can build an offset index over it to get random access to the data.</li>
<li><a href="https://github.com/webdataset/webdataset">Nvidia&rsquo;s WebDataset</a>. Modern, open
source and purpose-built for large-scale deep learning.</li>
<li><a href="https://lancedb.com/">LanceDB</a>, which describes itself as &ldquo;designed for multimodal&rdquo;
and &ldquo;built for scale&rdquo;. It&rsquo;s built on top of Arrow, closely related to Parquet.</li>
<li>As mentioned, HuggingFace has multiple solutions to this, starting with Arrow backed
storage, and their own <a href="https://huggingface.co/docs/datasets/index"><code>datasets</code></a>.</li>
</ol>
<p>Use these if you want to scale to production level training.</p>
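<p>For the <code>tar</code> option in the list above, the offset index can be sketched
with the standard library&rsquo;s <code>tarfile</code> module. The function names
<code>build_index</code> and <code>read_member</code> are hypothetical, and the sketch
assumes an uncompressed archive, since byte offsets are only meaningful when the
data is stored as-is.</p>
<div class="highlight"><pre><code class="language-python" data-lang="python">import tarfile


def build_index(tar_path):
    """Map each regular member name to its (data offset, size) in the archive."""
    # Mode "r:" opens uncompressed archives only, so the offsets are real file offsets.
    with tarfile.open(tar_path, "r:") as tar:
        return {m.name: (m.offset_data, m.size) for m in tar if m.isfile()}


def read_member(tar_path, index, name):
    """Read one member by seeking straight to its data, without scanning the archive."""
    offset, size = index[name]
    with open(tar_path, "rb") as f:
        f.seek(offset)
        return f.read(size)
</code></pre></div>
<p>The index only needs to be built once per archive; after that, each random read
is a single seek and read on one large file instead of opening a separate small file.</p>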
]]></content:encoded>
    </item>
  </channel>
</rss>
