[Public] Enhancing ReadDocumentAdvanced to Detect Dotted Border Tables in IronOCR
Summary
IronOCR's ReadDocumentAdvanced method specializes in extracting data from tables in documents. A known limitation, however, is that it cannot detect tables with non-solid borders, particularly those drawn with dotted or dashed lines (e.g., border: 1px dotted black). The text itself may still be extracted, but accessing the data in each cell requires IronOCR to detect the table object first.
This article outlines an enhancement to address this limitation and a possible workaround for extracting data from a table with dotted borders.
Background
Modern PDFs and scanned documents often use varied border styles to format tables, from solid to dotted or dashed lines. IronOCR detects solid-bordered tables well, but accessing table data on a simple dotted-border input, such as the sample below, throws the following exception:
Unhandled exception. System.InvalidOperationException: Sequence contains no elements
Sample input
This happens because the method does not detect dotted-line tables in a document, so the result contains no table objects and a call such as res.Tables.First() fails.
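The failure mode above can be guarded against in calling code. The following is a minimal sketch, not part of IronOCR: TableGuard and FirstCellTextOrNull are hypothetical names, and tables are modeled as plain sequences of cell strings so the snippet runs without the library. It shows that First() throws on an empty sequence, while FirstOrDefault() lets the caller detect the "no table found" case.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TableGuard
{
    // Hypothetical helper: returns the first cell text of the first table,
    // or null when no table (or no cell) is present.
    public static string FirstCellTextOrNull(IEnumerable<IEnumerable<string>> tables)
    {
        var firstTable = tables.FirstOrDefault();   // null instead of an exception
        return firstTable?.FirstOrDefault();
    }

    // Confirms that First() on an empty sequence throws InvalidOperationException,
    // which is the exception type reported in this article.
    public static bool FirstThrowsOnEmpty()
    {
        try
        {
            Enumerable.Empty<string>().First();
            return false;
        }
        catch (InvalidOperationException)
        {
            return true;
        }
    }
}
```

Assuming the result shape used in this article's example (Tables, CellInfos, CellText), the helper could be fed with res.Tables.Select(t => t.CellInfos.Select(c => c.CellText)); a null return then signals that no table was detected and that pre-processing may be needed.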
Workaround
The workaround below helps with:
- Making the dotted or dashed borders of a table solid internally
- Detecting Table objects in the structured output
One effective technique is to apply the Dilate() filter before processing. This pre-processing step closes the gaps between the dots in the border, so the OCR engine treats dotted borders as continuous lines, effectively converting the dotted lines into solid contours.
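To illustrate why dilation closes the gaps, here is a minimal, self-contained sketch of binary dilation on a single row of pixels. It is not IronOCR's implementation; DilationSketch, DilateRow, Parse, and Render are hypothetical names used only for this illustration, with '#' standing for an ink pixel and '.' for background.

```csharp
using System;
using System.Linq;

public static class DilationSketch
{
    // Binary dilation of one pixel row: an output pixel is ink
    // if any input pixel within `radius` positions of it is ink.
    public static bool[] DilateRow(bool[] row, int radius)
    {
        var result = new bool[row.Length];
        for (int i = 0; i < row.Length; i++)
        {
            int start = Math.Max(0, i - radius);
            int end = Math.Min(row.Length - 1, i + radius);
            for (int j = start; j <= end; j++)
            {
                if (row[j])
                {
                    result[i] = true;
                    break;
                }
            }
        }
        return result;
    }

    // '#' is ink, anything else is background.
    public static bool[] Parse(string s) => s.Select(c => c == '#').ToArray();

    public static string Render(bool[] row) =>
        new string(row.Select(p => p ? '#' : '.').ToArray());
}
```

With a radius of 1, the dotted row "#.#.#.#" becomes "#######", a solid line. A wider gap such as "#...#" only shrinks to "##.##" at the same radius, which suggests that sparser dotted borders may need stronger dilation before they read as continuous.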
Example Code:
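using System;
using System.Linq;
using IronOcr;
using IronSoftware.Drawing;
var ocr = new IronTesseract();
// Enable table detection in the structured output
ocr.Configuration.ReadDataTables = true;
var input = new OcrInput();
input.Load("image-20250408-144240.png");
// Dilate so the gaps between the border dots close and the border reads as a solid line
input.Dilate();
// Optional: export the filtered image to inspect the effect of Dilate()
input.SaveAsImages("export.png", AnyBitmap.ImageFormat.Png);
var res = ocr.ReadDocumentAdvanced(input);
// Print the text of the first cell in the first detected table
Console.WriteLine(res.Tables.First().CellInfos.First().CellText);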
The image below shows the input image after the Dilate() filter is applied.
This small adjustment significantly enhances IronOCR’s ability to extract full tables, even when unconventional border styles are used.