IronOcr Performance: High Peak Memory During Bulk OCR Processing
Overview
Running OCR over many PDF segments at once multiplies memory use: each task renders full-page bitmaps through an OcrInput, and creating a fresh IronTesseract engine per segment reloads the language model files into memory every time. At full processor concurrency this drove peak memory into the multi-GB range, with spikes large enough to cause failures in memory-limited environments. Capping concurrency, pooling engines, and gating tasks with a semaphore keeps the number of in-flight OCR jobs, and therefore their memory, bounded.
Cause
OCR is memory-heavy. Each OcrInput renders full-page bitmaps, and each IronTesseract engine loads language model files. Creating a new engine per segment reloaded those models repeatedly, and running one OCR task per CPU core (Environment.ProcessorCount) let many bitmap-heavy jobs run simultaneously. With nothing limiting how many tasks were active, peak memory scaled with concurrency.
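To make the scaling concrete, here is a rough back-of-the-envelope estimate. It is only a sketch: the page size (A4 at 300 DPI), the 32-bit pixel format, the 10 pages per segment, and the `MemoryEstimate` helper are all illustrative assumptions, not values measured from or documented by IronOcr, and the estimate ignores the engines' language models entirely.

```csharp
using System;

// Assumed, illustrative inputs (not measured from IronOcr).
const int widthPx = 2480;        // A4 width at 300 DPI
const int heightPx = 3508;       // A4 height at 300 DPI
const int bytesPerPixel = 4;     // 32-bit colour
const int pagesPerSegment = 10;  // hypothetical segment size

foreach (int tasks in new[] { 16, 4 })
{
    long bytes = MemoryEstimate.PeakBytes(tasks, pagesPerSegment, widthPx, heightPx, bytesPerPixel);
    Console.WriteLine($"{tasks} concurrent tasks: {bytes / (1024.0 * 1024 * 1024):F1} GiB of bitmaps");
}

// Hypothetical helper for this estimate only.
static class MemoryEstimate
{
    // Worst case: every in-flight task holds all of its rendered pages at once.
    public static long PeakBytes(int concurrency, int pagesPerSegment,
                                 int widthPx, int heightPx, int bytesPerPixel)
        => (long)concurrency * pagesPerSegment * widthPx * heightPx * bytesPerPixel;
}
```

Under these assumptions a single page bitmap is about 33 MiB, so 16 simultaneous tasks could hold roughly 5.2 GiB of bitmaps, while 4 tasks would hold roughly 1.3 GiB. The real figures depend on the render resolution and on how many pages a segment keeps in memory at once.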
Solution
- Recommended — Cap OCR concurrency. Clamp the number of simultaneous OCR tasks to a small ceiling. Each task renders full-page bitmaps, so fewer concurrent tasks directly lower peak memory. Tune the ceiling to the machine's capability.
// Clamp concurrency to avoid memory saturation and CPU over-subscription.
int concurrency = Math.Clamp(Environment.ProcessorCount / 2, 1, 4);
- Pool IronTesseract engines. Create exactly one engine per concurrent slot at startup and reuse them across all segments, instead of constructing a new engine per segment and reloading the language model each time.
// Pre-create one engine per concurrent slot and reuse them across segments.
var enginePool = new ConcurrentBag<IronTesseract>(
    Enumerable.Range(0, concurrency).Select(_ => new IronTesseract())
);
- Gate work with a SemaphoreSlim. Initialize the semaphore to the concurrency limit and wrap it in using. Each task calls WaitAsync() before it starts and Release() in a finally, so only the allowed number of segments is ever in flight at once.
// Create the semaphore once, outside the per-segment tasks, so that all tasks share it.
using var semaphore = new SemaphoreSlim(concurrency);

// Inside each segment task:
await semaphore.WaitAsync();
try
{
    // Rent a pre-loaded engine from the pool.
    if (!enginePool.TryTake(out var ocr))
        ocr = new IronTesseract(); // Defensive fallback; should never be reached.
    try
    {
        using var input = new OcrInput();
        input.LoadPdf(segmentStream); // page-range segment produced upstream
        var ocrResult = await ocr.ReadAsync(input);
        ocrResult.SaveAsSearchablePdf(outputPath);
    }
    finally
    {
        enginePool.Add(ocr); // Return engine to pool for the next waiting segment.
    }
}
finally
{
    semaphore.Release();
}
}
- Dispose OcrInput per segment. Wrap it in using so its rendered page bitmaps are released as soon as the segment is read, before the next task takes the slot.
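The steps above can be combined into a single driver that fans segments out with Task.WhenAll while the semaphore and the pool keep the in-flight count bounded. The following is a minimal sketch, not the article's production code. To stay runnable without IronOcr it swaps in a hypothetical `FakeEngine` stand-in for IronTesseract, and `BoundedOcr.RunAsync` is likewise an invented name; it returns the highest number of segments observed in flight so the bound can be checked.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Same clamp as in the first step.
int concurrency = Math.Clamp(Environment.ProcessorCount / 2, 1, 4);
int peak = await BoundedOcr.RunAsync(20, concurrency);
Console.WriteLine($"limit={concurrency}, peak in flight={peak}");

// Hypothetical stand-in for IronTesseract so the sketch runs without the library.
sealed class FakeEngine
{
    public async Task<string> ReadAsync(int segment)
    {
        await Task.Delay(20); // simulate OCR work
        return $"text for segment {segment}";
    }
}

static class BoundedOcr
{
    public static async Task<int> RunAsync(int segmentCount, int concurrency)
    {
        // One engine per concurrent slot, created once and reused.
        var pool = new ConcurrentBag<FakeEngine>(
            Enumerable.Range(0, concurrency).Select(_ => new FakeEngine()));
        // One semaphore shared by every segment task.
        using var gate = new SemaphoreSlim(concurrency);
        var sync = new object();
        int inFlight = 0, peak = 0;

        async Task ProcessAsync(int segment)
        {
            await gate.WaitAsync();
            try
            {
                lock (sync) { inFlight++; peak = Math.Max(peak, inFlight); }
                if (!pool.TryTake(out var engine))
                    engine = new FakeEngine(); // defensive fallback
                try { await engine.ReadAsync(segment); }
                finally { pool.Add(engine); } // return the engine to the pool
            }
            finally
            {
                lock (sync) { inFlight--; }
                gate.Release();
            }
        }

        // Start every segment; the semaphore decides how many actually run at once.
        await Task.WhenAll(Enumerable.Range(0, segmentCount).Select(ProcessAsync));
        return peak;
    }
}
```

In a real pipeline the body of `ProcessAsync` would hold the OcrInput and IronTesseract calls shown in the semaphore step. The reported peak should never exceed the concurrency limit, because each task increments the counter only after acquiring the semaphore and decrements it before releasing.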