Handling Large PDF File Input in IronOCR Efficiently

Memory optimization while working with large PDF files

 

IronOCR is a .NET OCR library that extracts text from PDFs and images. When it is used on large multi-page PDF documents, however, developers may run into performance or memory problems if pages are not loaded carefully.

This article outlines best practices for processing large PDFs with IronOCR and IronPDF, helping you avoid System.OutOfMemoryException and keep memory use predictable.


Problem with OcrInput.LoadPdf()

When handling large PDFs, a common mistake is to use:

ocrInput.LoadPdf("large.pdf");

 

This method loads all pages at once, which can lead to:

  • A massive memory spike

  • Simultaneous processing of all pages by IronOCR’s internal imaging system

  • Rendering at the default DPI of 200 unless a lower value is set manually, which further increases memory use

  • Exceptions such as System.OutOfMemoryException, or resource deadlocks
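To give a feel for the scale involved, here is a rough back-of-the-envelope estimate in C#. It is only a sketch under stated assumptions: the page size (US Letter, 8.5 × 11 in), the 4 bytes per pixel, and the 500-page document are illustrative values, and the helper name RasterBytes is hypothetical. IronOCR's actual internal image representation may differ.

```csharp
using System;

static class PageMemoryEstimate
{
    // Rough size in bytes of one uncompressed page bitmap.
    // bytesPerPixel = 4 is an assumption (e.g. 32-bit RGBA), not a documented IronOCR value.
    public static long RasterBytes(double widthInches, double heightInches, int dpi, int bytesPerPixel = 4)
    {
        long widthPx = (long)Math.Round(widthInches * dpi);
        long heightPx = (long)Math.Round(heightInches * dpi);
        return widthPx * heightPx * bytesPerPixel;
    }

    static void Main()
    {
        const int pages = 500; // hypothetical document length

        foreach (int dpi in new[] { 200, 80 })
        {
            long perPage = RasterBytes(8.5, 11.0, dpi);
            double perPageMiB = perPage / (1024.0 * 1024.0);
            double allPagesGiB = pages * perPage / (1024.0 * 1024.0 * 1024.0);
            Console.WriteLine($"{dpi} DPI: {perPageMiB:F1} MiB per page, {allPagesGiB:F2} GiB for {pages} pages");
        }
    }
}
```

Under these assumptions a single page costs roughly 14 MiB at 200 DPI and roughly 2 MiB at 80 DPI, because pixel count scales with the square of the DPI: (200 / 80)² = 6.25. Holding hundreds of such pages in memory at once is what makes the all-at-once approach risky.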


Recommended Approach

Instead of loading the entire PDF at once, process one page at a time using LoadPdfPage(), so that only a single page image is held in memory at any moment.

 

Step-by-Step Optimization

  1. Get PDF Page Count
    Use IronPdf.PdfDocument (or any open-source PDF library) to retrieve the number of pages in the document.

  2. Loop Through Pages
    Use a for loop to process each page individually.

  3. Load Individual Page
    Inside the loop, use OcrInput.LoadPdfPage("file.pdf", pageIndex, dpi) to load only one page at a time.

    • If the source pages are clean and legible, the DPI can be set as low as 80 to conserve memory.

  4. Extract Text per Page
    Pass the input to IronTesseract.Read() to extract the text for that page.

  5. Build the Full Text
    Use a StringBuilder to concatenate the extracted text from each page.


Sample Code

 

[Image: screenshot of the sample code (Screenshot 2025-07-03 135640)]
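The steps above can be sketched in C# roughly as follows. This is a minimal sketch, not a definitive implementation: the file name large.pdf and the DPI of 100 are placeholder values, the LoadPdfPage argument order follows the call quoted in step 3, and the page index is assumed to be zero-based. Check both against the IronOCR version you have installed.

```csharp
using System;
using System.Text;
using IronOcr;

class LargePdfOcr
{
    static void Main()
    {
        const string pdfPath = "large.pdf"; // hypothetical input path
        const int dpi = 100;                // assumed value; raise it if recognition quality drops

        // Step 1: read the page count with IronPDF, then release the document.
        int pageCount;
        using (var pdf = IronPdf.PdfDocument.FromFile(pdfPath))
        {
            pageCount = pdf.PageCount;
        }

        var ocr = new IronTesseract();
        var fullText = new StringBuilder();

        // Steps 2-5: load, read, and discard one page per iteration.
        // Assumption: page indices start at 0.
        for (int pageIndex = 0; pageIndex < pageCount; pageIndex++)
        {
            // A fresh OcrInput per page, disposed at the end of each iteration,
            // so that only one page image is alive at a time.
            using (var input = new OcrInput())
            {
                input.LoadPdfPage(pdfPath, pageIndex, dpi);
                OcrResult result = ocr.Read(input);
                fullText.AppendLine(result.Text);
            }
        }

        Console.WriteLine(fullText.ToString());
    }
}
```

Creating the OcrInput inside the loop, rather than reusing one instance, is a deliberate choice here: disposing it on every iteration keeps peak memory close to the cost of a single page instead of growing with the document.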

 
 

Tips for Better Performance

  • Lower the DPI (e.g., 80–100) when visual quality allows; this significantly reduces memory usage.

  • Avoid reading the entire PDF at once unless it’s very small.

  • Dispose of objects promptly with using statements so that memory is released early.

  • Consider running OCR in parallel only if your system has enough memory and CPU headroom; each concurrent page adds to peak memory, so be cautious with large PDFs.


Conclusion

Handling large PDF files with IronOCR calls for a page-by-page approach. By combining IronOCR with IronPDF (or another PDF utility), you can process even large documents efficiently without crashing your application.

Following these practices improves stability and helps speed and scalability in production systems.