IronOcr Performance: High Peak Memory During Bulk OCR Processing

Overview

Running OCR over many PDF segments at once multiplies memory use: each task renders full-page bitmaps through an OcrInput, and creating a fresh IronTesseract engine per segment reloads the language model files into memory every time. At full processor concurrency this drove peak memory into the multi-GB range, with spikes that caused failures in memory-limited environments. Capping concurrency, pooling engines, and gating tasks with a semaphore keeps the number of in-flight OCR jobs, and therefore their memory, bounded.

Cause

OCR is memory-heavy. Each OcrInput renders full-page bitmaps, and each IronTesseract engine loads language model files. Creating a new engine per segment reloaded those models repeatedly, and running one OCR task per CPU core (Environment.ProcessorCount) let many bitmap-heavy jobs run simultaneously. With nothing limiting how many tasks were active, peak memory scaled with concurrency.
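
As a rough illustration of why peak memory tracks concurrency, the sketch below estimates the size of one uncompressed page bitmap. The page dimensions, DPI, and bytes per pixel are assumptions chosen for illustration, not values taken from IronOcr, and PageMemoryEstimate is a hypothetical helper.

```csharp
using System;

public static class PageMemoryEstimate
{
    // Bytes for one uncompressed bitmap of a page rendered at the given DPI.
    // All inputs are illustrative assumptions, not IronOcr internals.
    public static long BitmapBytes(double widthInches, double heightInches, int dpi, int bytesPerPixel)
    {
        long widthPx = (long)Math.Round(widthInches * dpi);
        long heightPx = (long)Math.Round(heightInches * dpi);
        return widthPx * heightPx * bytesPerPixel;
    }
}
```

Under these assumptions, an A4 page (8.27 x 11.69 in) at 300 DPI and 4 bytes per pixel comes to roughly 35 MB, so 16 such pages rendered at once would hold over half a gigabyte in bitmaps alone, before counting engine and language model memory.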

Solution

Recommended: cap OCR concurrency. Clamp the number of simultaneous OCR tasks to a small ceiling. Each task renders full-page bitmaps, so fewer concurrent tasks directly lower peak memory. Tune the ceiling to the machine's available memory and core count.
```
// Clamp concurrency to avoid memory saturation and CPU over-subscription.
int concurrency = Math.Clamp(Environment.ProcessorCount / 2, 1, 4);
```
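
To see what this clamp yields on different machines, the same expression can be evaluated with the core count passed in as a parameter. This is a minimal sketch; OcrConcurrency and ForCores are hypothetical names used only for illustration.

```csharp
using System;

public static class OcrConcurrency
{
    // Same formula as above: half the cores, never fewer than 1, never more than 4.
    public static int ForCores(int processorCount) =>
        Math.Clamp(processorCount / 2, 1, 4);
}
```

With this formula a single-core machine still gets 1 slot (0 is clamped up), 4 cores give 2 slots, and 8 cores or more all give the ceiling of 4.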
Pool IronTesseract engines. Create exactly one engine per concurrent slot at startup and reuse them across all segments, instead of constructing a new engine per segment and reloading the language model each time.
```
// Pre-create one engine per concurrent slot and reuse them across segments.
var enginePool = new ConcurrentBag<IronTesseract>(
    Enumerable.Range(0, concurrency).Select(_ => new IronTesseract())
);
```
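
One possible way to make returning an engine harder to forget is to wrap the bag in a small pool type that hands out disposable leases. The sketch below is an assumption-laden illustration, not part of IronOcr: EnginePool<T>, Lease, Rent, and Available are hypothetical names, and the type is generic so it runs without the library.

```csharp
using System;
using System.Collections.Concurrent;

public sealed class EnginePool<T> where T : class
{
    private readonly ConcurrentBag<T> _items = new ConcurrentBag<T>();
    private readonly Func<T> _factory;

    // Pre-creates `size` items using the supplied factory.
    public EnginePool(int size, Func<T> factory)
    {
        _factory = factory;
        for (int i = 0; i < size; i++) _items.Add(factory());
    }

    // Number of items currently idle in the pool.
    public int Available => _items.Count;

    // Takes an idle item, or creates one if the pool is unexpectedly empty.
    public Lease Rent() =>
        new Lease(this, _items.TryTake(out var item) ? item : _factory());

    public sealed class Lease : IDisposable
    {
        private readonly EnginePool<T> _owner;
        public T Value { get; }

        internal Lease(EnginePool<T> owner, T value) { _owner = owner; Value = value; }

        // Returns the item to the pool; dispose each lease exactly once.
        public void Dispose() => _owner._items.Add(Value);
    }
}
```

A caller would then write something like `using var lease = pool.Rent();` and work with `lease.Value`, so the engine goes back to the pool when the lease leaves scope.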

Gate work with a SemaphoreSlim. Create one semaphore, initialized to the concurrency limit, in the scope that launches the segment tasks, and declare it with using so it is disposed once they have all completed. Each task awaits WaitAsync() before it starts and calls Release() in a finally block, so no more than the allowed number of segments is in flight at once.
```
// Created once, in the scope that launches all segment tasks.
using var semaphore = new SemaphoreSlim(concurrency);

// Inside each segment task:
await semaphore.WaitAsync();
try
{
    // Rent a pre-loaded engine from the pool.
    if (!enginePool.TryTake(out var ocr))
        ocr = new IronTesseract(); // Defensive fallback; should never be reached.

    try
    {
        using var input = new OcrInput();
        input.LoadPdf(segmentStream); // page-range segment produced upstream
        var ocrResult = await ocr.ReadAsync(input);
        ocrResult.SaveAsSearchablePdf(outputPath);
    }
    finally
    {
        enginePool.Add(ocr); // Return engine to pool for the next waiting segment.
    }
}
finally
{
    semaphore.Release();
}
```

Dispose OcrInput per segment. Wrap it in using so its rendered page bitmaps are released as soon as the segment is read, before the next task takes the slot.
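
The gating steps above can be sketched end to end as a fan-out over all segments. To keep the example runnable without IronOcr, the per-segment OCR work is represented by a caller-supplied delegate; BoundedFanOut and RunAsync are hypothetical names, and the peak counter exists only to show that the limit holds.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class BoundedFanOut
{
    // Runs `work` for every item with at most `limit` in flight at once.
    // Returns the highest number of items observed running simultaneously.
    public static async Task<int> RunAsync<T>(IEnumerable<T> items, int limit, Func<T, Task> work)
    {
        using var semaphore = new SemaphoreSlim(limit);
        int inFlight = 0, peak = 0;
        object gate = new object();

        // ToList() forces the lazy Select so every task is started.
        var tasks = items.Select(async item =>
        {
            await semaphore.WaitAsync();
            try
            {
                lock (gate) { inFlight++; peak = Math.Max(peak, inFlight); }
                await work(item); // stand-in for rent engine, OCR, save, return engine
            }
            finally
            {
                lock (gate) { inFlight--; }
                semaphore.Release();
            }
        }).ToList();

        // The semaphore is disposed only after every task has finished.
        await Task.WhenAll(tasks);
        return peak;
    }
}
```

In a real pipeline the delegate would contain the rent, read, save, and return sequence shown earlier, and the returned peak should never exceed the limit passed in.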