Extract all lines (previously full-page table)
Under normal circumstances, the StatementReader engine can analyse a page and detect whether it contains a table of transactions, where the table starts, and where it ends. When processing searchable PDF documents, you are most likely to get clean transactions, without any unwanted lines, with this option turned off.
With scanned pages, however, the text quality may be poor and the OCR process may introduce errors. In the worst case, the engine fails to detect a table on a page. When that happens, the Excel output contains a message informing you that no transactions were found on that page.
This is where the “Extract all lines” option is useful. Checking it instructs the engine to stop trying to detect tables and instead assume that every page contains a single table covering the whole page. This ensures that no transactions are missed, wherever they appear on the page. The drawback is that some unwanted lines may appear before and after the actual transactions, and you will have to remove them manually from the Excel output; this can be done quickly by sorting by the date or amount column.
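If you prefer to script the clean-up rather than sort in Excel, the idea can be sketched in Python as follows. This is a minimal illustration, not part of StatementReader: it assumes the output rows have already been loaded as dictionaries, and the column names ("Date", "Description", "Amount"), the date formats, and the sample rows are all hypothetical. A row is kept only if both its date and its amount can be parsed, which is the same signal that sorting those columns exposes by eye.

```python
from datetime import datetime

# Hypothetical date formats; adjust to whatever your statements use.
DATE_FORMATS = ("%d/%m/%Y", "%d %b %Y", "%Y-%m-%d")


def parse_date(text):
    """Return a datetime if text matches one of DATE_FORMATS, else None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
    return None


def parse_amount(text):
    """Return a float for strings like '1,234.56' or '-45.00', else None."""
    try:
        return float(text.replace(",", "").strip())
    except ValueError:
        return None


def keep_transactions(rows):
    """Keep only rows whose Date and Amount cells both parse."""
    return [
        row for row in rows
        if parse_date(row.get("Date", "")) is not None
        and parse_amount(row.get("Amount", "")) is not None
    ]


# Made-up sample rows: two real transactions surrounded by page residue.
rows = [
    {"Date": "Statement of account", "Description": "", "Amount": ""},
    {"Date": "03/01/2021", "Description": "Card payment", "Amount": "-45.00"},
    {"Date": "05/01/2021", "Description": "Salary", "Amount": "1,234.56"},
    {"Date": "Page 1 of 3", "Description": "", "Amount": ""},
]
cleaned = keep_transactions(rows)
```

A filter like this is deliberately strict, so it is worth reviewing the discarded rows before deleting them: a genuine transaction whose date or amount was misread by OCR would also be dropped.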
We recommend checking “Extract all lines” only when you have non-searchable PDF documents and you notice pages missing from the output.