IronOcr Performance: High Peak Memory During Bulk OCR Processing

Overview

Running OCR over many PDF segments at once multiplies memory use: each task renders full-page bitmaps through an OcrInput, and creating a fresh IronTesseract engine per segment reloads the language model files into memory every time. At full processor concurrency this drove peak memory into the multi-GB range, with spikes large enough to cause failures in memory-limited environments. Capping concurrency, pooling engines, and gating tasks with a semaphore keeps the number of in-flight OCR jobs, and therefore their memory, bounded.


Cause

OCR is memory-heavy. Each OcrInput renders full-page bitmaps, and each IronTesseract engine loads language model files. Creating a new engine per segment reloaded those models repeatedly, and running one OCR task per CPU core (Environment.ProcessorCount) let many bitmap-heavy jobs run simultaneously. With nothing limiting how many tasks were active, peak memory scaled with concurrency.
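
To make the scaling concrete, the rough estimate below multiplies the size of one rendered page by the number of concurrent tasks. The page size, DPI, and pixel format are illustrative assumptions, not IronOcr's actual render settings, and it counts only a single page per task, so treat the result as an order-of-magnitude guide rather than a measurement.

```csharp
using System;

// Back-of-envelope estimate only; every input below is an illustrative assumption.
const double pageWidthInches = 8.5;   // assumed US Letter page
const double pageHeightInches = 11.0;
const int dpi = 300;                  // assumed render resolution
const int bytesPerPixel = 4;          // assumed 32-bit bitmap

long pixels = (long)(pageWidthInches * dpi) * (long)(pageHeightInches * dpi);
long bytesPerPage = pixels * bytesPerPixel;

// Compare a capped run (4 tasks) with an uncapped run on a 16-core machine.
foreach (int concurrentTasks in new[] { 4, 16 })
{
    double mib = bytesPerPage * (double)concurrentTasks / (1024 * 1024);
    Console.WriteLine($"{concurrentTasks} concurrent tasks: ~{mib:F0} MiB of page bitmaps");
}
```

Segments with several pages, intermediate image copies, and the language models loaded by each engine all add to this figure, which is why uncapped concurrency can reach the multi-GB range.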


Solution

  1. Recommended — Cap OCR concurrency. Clamp the number of simultaneous OCR tasks to a small ceiling. Each task renders full-page bitmaps, so fewer concurrent tasks directly lower peak memory. Tune the ceiling to the machine's capability.
    // Clamp concurrency to avoid memory saturation and CPU over-subscription.
    int concurrency = Math.Clamp(Environment.ProcessorCount / 2, 1, 4);
  2. Pool IronTesseract engines. Create exactly one engine per concurrent slot at startup and reuse them across all segments, instead of constructing a new engine per segment and reloading the language model each time.
    // Pre-create one engine per concurrent slot and reuse them across segments.
    var enginePool = new ConcurrentBag<IronTesseract>(
        Enumerable.Range(0, concurrency).Select(_ => new IronTesseract())
    );
  3. Gate work with a SemaphoreSlim. Create one semaphore for the whole batch, initialized to the concurrency limit and wrapped in using, and share it across all tasks. Each task calls WaitAsync() before it starts and Release() in a finally, so only the allowed number of segments is ever in flight at once.
    // Created once, outside the per-segment tasks, and shared by all of them.
    using var semaphore = new SemaphoreSlim(concurrency);

    // Inside each segment's task:
    await semaphore.WaitAsync();
    try
    {
        // Rent a pre-loaded engine from the pool.
        if (!enginePool.TryTake(out var ocr))
            ocr = new IronTesseract(); // Defensive fallback; should never be reached.

        try
        {
            using var input = new OcrInput();
            input.LoadPdf(segmentStream); // page-range segment produced upstream
            var ocrResult = await ocr.ReadAsync(input);
            ocrResult.SaveAsSearchablePdf(outputPath);
        }
        finally
        {
            enginePool.Add(ocr); // Return engine to pool for the next waiting segment.
        }
    }
    finally
    {
        semaphore.Release();
    }
  4. Dispose OcrInput per segment. Wrap it in using so its rendered page bitmaps are released as soon as the segment is read, before the next task takes the slot.
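
The four steps can be combined into a single batch driver, sketched below. To keep the example runnable without the library, StubEngine is a hypothetical stand-in for IronTesseract, segments are plain integers, and the count of 20 segments is arbitrary. In real code the body of ProcessSegmentAsync would hold the OcrInput block from step 3. The in-flight counter is there only to check that the cap holds.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

int concurrency = Math.Clamp(Environment.ProcessorCount / 2, 1, 4);

// One shared pool and one shared semaphore for the whole batch.
var enginePool = new ConcurrentBag<StubEngine>(
    Enumerable.Range(0, concurrency).Select(_ => new StubEngine()));
using var semaphore = new SemaphoreSlim(concurrency);

// Bookkeeping used only to observe the peak number of active segments.
var gate = new object();
int inFlight = 0, maxInFlight = 0;

async Task<string> ProcessSegmentAsync(int segmentId)
{
    await semaphore.WaitAsync();
    lock (gate) { inFlight++; maxInFlight = Math.Max(maxInFlight, inFlight); }
    try
    {
        // Rent an engine; the fallback mirrors the defensive branch in step 3.
        if (!enginePool.TryTake(out var engine))
            engine = new StubEngine();
        try
        {
            return await engine.ReadAsync(segmentId);
        }
        finally
        {
            enginePool.Add(engine); // return the engine for the next segment
        }
    }
    finally
    {
        lock (gate) { inFlight--; }
        semaphore.Release();
    }
}

// Start every segment at once; the semaphore decides how many actually run.
string[] results = await Task.WhenAll(
    Enumerable.Range(0, 20).Select(i => ProcessSegmentAsync(i)));

Console.WriteLine(
    $"Processed {results.Length} segments; peak in flight = {maxInFlight} (cap {concurrency})");

// Hypothetical stand-in for IronTesseract; it only simulates OCR latency.
sealed class StubEngine
{
    public async Task<string> ReadAsync(int segmentId)
    {
        await Task.Delay(20);
        return $"text-{segmentId}";
    }
}
```

Because the semaphore and the pool are both sized from the same concurrency value, a task that gets past WaitAsync() should always find an engine waiting, and the reported peak should never exceed the cap.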