Handling Large PDF Input in IronOCR Efficiently
Memory optimization while working with large PDF files
IronOCR is a powerful .NET OCR library that can extract text from PDFs and images with high accuracy. However, when working with large multi-page PDF documents, developers may run into performance or memory issues if the library isn't used optimally.
This article outlines best practices for processing large PDFs with IronOCR and IronPDF, helping you avoid OutOfMemoryException and maximize performance.
Problem with OcrInput.LoadPdf()
When handling large PDFs, a common mistake is to use:
ocrInput.LoadPdf("large.pdf");
This method loads all pages at once, which has several consequences:
- Memory usage spikes sharply.
- IronOCR’s internal imaging system processes all pages simultaneously.
- The DPI defaults to 200 unless set manually, which further increases memory use.
- Exceptions such as System.OutOfMemoryException, or resource deadlocks, may occur.
Recommended Approach
Instead of loading the entire PDF at once, you should process one page at a time using LoadPdfPage().
Step-by-Step Optimization
- Get PDF Page Count
  Use IronPdf.PdfDocument (or any open-source PDF library) to retrieve the number of pages in the document.
- Loop Through Pages
  Use a for loop to process each page individually.
- Load Individual Page
  Inside the loop, use OcrInput.LoadPdfPage("file.pdf", pageIndex, dpi) to load only one page at a time.
  - If the visual quality is good, set the DPI as low as 80 to conserve memory.
- Extract Text per Page
  Pass the input to IronTesseract.Read() to extract the text for that page.
- Build the Full Text
  Use a StringBuilder to concatenate the extracted text from each page.
Sample Code
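The steps above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the file name "large.pdf" and the DPI value of 100 are placeholders, the LoadPdfPage call uses the positional form given in this article (file, page index, DPI), and page indices are assumed to be zero-based. Check the parameter order against the IronOCR version you have installed.

```csharp
using System;
using System.Text;
using IronOcr;

class LargePdfOcr
{
    static void Main()
    {
        const string pdfPath = "large.pdf"; // placeholder path (assumption)
        const int dpi = 100;                // placeholder; lower it if pages stay legible

        // Step 1: get the page count, then release the PDF document immediately.
        int pageCount;
        using (var pdf = IronPdf.PdfDocument.FromFile(pdfPath))
        {
            pageCount = pdf.PageCount;
        }

        var ocr = new IronTesseract();
        var fullText = new StringBuilder();

        // Step 2: process one page per iteration (zero-based index assumed).
        for (int pageIndex = 0; pageIndex < pageCount; pageIndex++)
        {
            // Step 3: load only the current page. Disposing the input at the end
            // of each iteration frees its image data before the next page loads.
            using (var input = new OcrInput())
            {
                // Argument order follows the form shown in this article.
                input.LoadPdfPage(pdfPath, pageIndex, dpi);

                // Step 4: extract the text for this page.
                OcrResult result = ocr.Read(input);

                // Step 5: append it to the combined output.
                fullText.AppendLine(result.Text);
            }
        }

        Console.WriteLine(fullText.ToString());
    }
}
```

Only one page's image data is held at any time, so peak memory depends on the size of a single rendered page rather than on the length of the document.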
Tips for Better Performance
- Lower the DPI (e.g., 80–100) when visual quality allows. The pixel count of a rendered page scales with the square of the DPI, so this significantly reduces memory usage.
- Avoid reading the entire PDF at once unless it’s very small.
- Dispose of objects promptly by wrapping them in using statements to release memory early.
- Consider doing OCR in parallel only if your system has enough memory and CPU headroom (be cautious with large PDFs).
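To get a rough sense of how much the DPI setting matters, the sketch below estimates the size of one rendered page at several DPI values. It is a back-of-the-envelope calculation under stated assumptions (a US Letter page held as an uncompressed bitmap at 4 bytes per pixel); it does not describe how IronOCR actually stores images internally, and the PageBytes helper is hypothetical.

```csharp
using System;

class DpiMemoryEstimate
{
    // Approximate bytes for one rendered page, assuming an uncompressed bitmap.
    static long PageBytes(double widthInches, double heightInches, int dpi, int bytesPerPixel)
    {
        long widthPx = (long)Math.Round(widthInches * dpi);
        long heightPx = (long)Math.Round(heightInches * dpi);
        return widthPx * heightPx * bytesPerPixel;
    }

    static void Main()
    {
        const double width = 8.5, height = 11.0; // US Letter page (assumption)
        const int bytesPerPixel = 4;             // 32-bit color (assumption)

        foreach (int dpi in new[] { 200, 100, 80 })
        {
            long bytes = PageBytes(width, height, dpi, bytesPerPixel);
            Console.WriteLine($"{dpi} DPI: {bytes / (1024.0 * 1024.0):F1} MiB per page");
        }
    }
}
```

Under these assumptions, a page at 200 DPI needs about 6.25 times the memory of the same page at 80 DPI, since (200 / 80)² = 6.25.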
Conclusion
Handling large PDF files with IronOCR requires a mindful, page-by-page approach. By combining IronOCR with IronPDF (or another PDF utility), you can process even large documents efficiently without crashing your application.
Following these best practices not only improves stability but also improves speed and scalability in production systems.