November 3, 2024

Docling

The top trending repository on GitHub today is Docling version 2. Docling is an MIT-licensed document parser, OCR tool, and converter from the IBM Deep Search team. Their technical report details the use of two specialized models: DocLayNet for layout analysis and TableFormer for table structure recognition. They provide a simple CLI tool as well as Python bindings.

Running the CLI is pretty easy with uvx (I’ve recently started to switch from pip and pipx to uv):

uvx docling 2408.09869v3.pdf --to json --to md --to doctags --to text

This will generate all the (currently) supported output formats from the input PDF (which is the Docling technical report). The report contains a relatively complex table:

Docling table

Docling’s Markdown representation of the table is as follows. It is not perfect: it handled the first (Apple silicon) row correctly, but the second (Intel Xeon) row is not parsed correctly, as the values of its two sub-rows are merged into single cells:

| CPU | Thread budget | native backend | native backend | native backend | pypdfium backend | pypdfium backend | pypdfium backend |
|---|---|---|---|---|---|---|---|
| | Thread budget | TTS | Pages/s | Mem | TTS | Pages/s | Mem |
| Apple M3 Max | 4 | 177 s | 1.27 | 6.20 GB | 103 s | 2.18 | 2.56 GB |
| (16 cores) | 16 | 167 s | 1.34 | 6.20 GB | 92 s | 2.45 | 2.56 GB |
| Intel® Xeon E5-2690 | 4 16 | 375 s 244 s | 0.60 0.92 | 6.16 GB | 239 s 143 s | 0.94 1.57 | 2.42 GB |
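The merged Intel Xeon row can be repaired after the fact. Below is a minimal sketch of one way to do it, not something Docling provides: `split_merged_row` and the `VALUE` pattern are hypothetical names of my own, and the sketch assumes that a cell with a single value (the memory columns) was a vertically merged cell shared by both sub-rows.

```python
# Hypothetical helper, not part of Docling: split a table row whose cells
# contain the values of two stacked sub-rows.
import re

# A value is a number with an optional unit, e.g. "375 s", "0.60", "6.16 GB", "4".
VALUE = re.compile(r"\d+(?:\.\d+)?(?: (?:s|GB))?")


def split_merged_row(cells, parts=2):
    """Return `parts` rows from one row of merged cells.

    A cell made up of exactly `parts` values is split. Any other cell
    (a label, or a single shared value) is repeated in every row; this
    repetition is an assumption about the original table layout.
    """
    rows = [[] for _ in range(parts)]
    for cell in cells:
        values = VALUE.findall(cell)
        # Only split when the values account for the whole cell text.
        if len(values) != parts or " ".join(values) != cell:
            values = [cell] * parts
        for row, value in zip(rows, values):
            row.append(value)
    return rows


# Cell values copied from the Intel Xeon row above.
merged = ["Intel® Xeon E5-2690", "4 16", "375 s 244 s", "0.60 0.92",
          "6.16 GB", "239 s 143 s", "0.94 1.57", "2.42 GB"]
for row in split_merged_row(merged):
    print(row)
```

The CPU label ends up repeated in both rows, which differs from the Apple rows (where the second line carries the core count), but every numeric column lands in its own cell.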

The Python API allows for more control and access to functionality; check the Usage and Docling v2 sections in their documentation. It allows for interesting things, such as programmatic access to extracted tables (using uv in embedded script mode, which is super awesome):

#!/usr/bin/env -S uv run
# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "docling",
# ]
# ///

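from docling.document_converter import DocumentConverter

# Download and convert the Docling technical report from arXiv
converter = DocumentConverter()
result = converter.convert("https://arxiv.org/pdf/2408.09869")

# Export the first extracted table as a pandas DataFrame
table = result.document.tables[0]
df = table.export_to_dataframe()
print(df)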

This produces the following output (the same as the Markdown table, as expected), since the table is exported as a pandas DataFrame:

                    CPU. Thread budget.Thread budget native backend.TTS  ... pypdfium backend.TTS pypdfium backend.Pages/s pypdfium backend.Mem
0           Apple M3 Max                           4              177 s  ...                103 s                     2.18              2.56 GB
1             (16 cores)                          16              167 s  ...                 92 s                     2.45              2.56 GB
2  Intel(R) Xeon E5-2690                        4 16        375 s 244 s  ...          239 s 143 s                0.94 1.57              2.42 GB

[3 rows x 8 columns]
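The column names in that output look like the two header rows joined with a dot (that is my reading of the output, not something I checked in the Docling source). Under that assumption, here is a small sketch that splits them back into a group and a metric; `split_header` and `group_columns` are hypothetical helper names, and only the columns visible in the printed output are listed.

```python
# Column names copied from the DataFrame output above; the columns hidden
# behind "..." in that output are left out rather than guessed.
columns = [
    "CPU.",
    "Thread budget.Thread budget",
    "native backend.TTS",
    "pypdfium backend.TTS",
    "pypdfium backend.Pages/s",
    "pypdfium backend.Mem",
]


def split_header(name):
    """Split 'group.metric' into a (group, metric) tuple at the first dot."""
    group, _, metric = name.partition(".")
    return group, metric


def group_columns(names):
    """Map each top-level header to the list of sub-headers found under it."""
    groups = {}
    for name in names:
        group, metric = split_header(name)
        groups.setdefault(group, []).append(metric)
    return groups


for group, metrics in group_columns(columns).items():
    print(f"{group}: {metrics}")
```

Splitting at the first dot works for these names, but it would break for a top-level header that itself contains a dot.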