title: ai-cookbook/knowledge/docling at main · daveebbelaar/ai-cookbook
source: https://github.com/daveebbelaar/ai-cookbook/tree/main/knowledge/docling
author:
- "[[GitHub]]"
published:
created: 2025-04-07
description: Examples and tutorials to help developers build AI systems - ai-cookbook/knowledge/docling at main · daveebbelaar/ai-cookbook
tags:
- LLM
- RAG
Docling is a powerful, flexible open source document processing library that converts various document formats into a unified format. It has advanced document understanding capabilities powered by state-of-the-art AI models for layout analysis and table structure recognition.
The whole system runs locally on standard computers and is designed to be extensible - developers can add new models or modify the pipeline for specific needs. It's particularly useful for tasks like enterprise document search, passage retrieval, and knowledge extraction. With its advanced chunking and processing capabilities, it's the perfect tool for providing GenAI applications with knowledge through RAG (Retrieval Augmented Generation) pipelines.
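To give a feel for the basic workflow, here is a minimal conversion sketch using Docling's `DocumentConverter`; the file path is a placeholder and should point at one of your own documents.

```python
from docling.document_converter import DocumentConverter

# Convert a supported input (PDF, DOCX, HTML, images, ...) into a unified
# DoclingDocument, then export it to Markdown for downstream processing.
converter = DocumentConverter()
result = converter.convert("path/to/document.pdf")  # placeholder path
print(result.document.export_to_markdown())
```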
Install the dependencies:
pip install -r requirements.txt
Create a .env file with your OpenAI API key:
OPENAI_API_KEY=your_api_key_here
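If you want to load that key inside your own scripts, a common pattern is shown below; it assumes the python-dotenv package is installed, which may or may not be part of requirements.txt.

```python
import os

from dotenv import load_dotenv  # assumes python-dotenv is installed

# Read OPENAI_API_KEY from the .env file into the process environment.
load_dotenv()
api_key = os.getenv("OPENAI_API_KEY")
```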
Execute the files in order to build and query the document database:
python 1-extraction.py
python 2-chunking.py
python 3-embedding.py
python 4-search.py
streamlit run 5-chat.py
Then open your browser and navigate to http://localhost:8501 to interact with the document Q&A interface.
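As a rough, hypothetical illustration of the query step, the sketch below embeds a question with the OpenAI embeddings API and ranks pre-embedded chunks by cosine similarity; the model name, the `embed`/`search` helpers, and the in-memory NumPy store are all assumptions, and the repository's 4-search.py and 5-chat.py may work differently.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment


def embed(text: str) -> np.ndarray:
    """Return the embedding vector for a piece of text."""
    response = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(response.data[0].embedding)


def search(query: str, chunks: list[str], vectors: np.ndarray, top_k: int = 3) -> list[str]:
    """Rank pre-embedded chunks by cosine similarity to the query."""
    q = embed(query)
    scores = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(scores)[::-1][:top_k]]
```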
Format | Description
---|---
PDF | Native PDF documents with layout preservation
DOCX, XLSX, PPTX | Microsoft Office formats (2007+)
Markdown | Plain text with markup
HTML/XHTML | Web documents
Images | PNG, JPEG, TIFF, BMP
USPTO XML | Patent documents
PMC XML | PubMed Central articles
Check the Docling documentation for an up-to-date list of supported formats.
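If you only want to accept a subset of these formats, the converter can be restricted; this is a sketch, and the `report.docx` path is a placeholder.

```python
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter

# By default the converter accepts all supported formats; allowed_formats
# narrows it down, e.g. to PDF, Word, and HTML only.
converter = DocumentConverter(
    allowed_formats=[InputFormat.PDF, InputFormat.DOCX, InputFormat.HTML]
)
result = converter.convert("report.docx")  # placeholder path
print(result.document.export_to_markdown())
```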
The standard pipeline includes document parsing, layout analysis, table structure recognition, and optional OCR, before assembling the results into a unified document representation.
Docling leverages two primary specialized AI models for document understanding. At its core, the layout analysis model is built on the RT-DETR (Real-Time Detection Transformer) architecture, which excels at detecting and classifying page elements. This model processes pages at 72 dpi resolution and can analyze a single page in under a second on a standard CPU, having been trained on the comprehensive DocLayNet dataset.

The second key model is TableFormer, a table structure recognition system that can handle complex table layouts including partial borders, empty cells, spanning cells, and hierarchical headers. TableFormer typically processes tables in 2-6 seconds on CPU, making it efficient for practical use.

For documents requiring text extraction from images, Docling integrates EasyOCR as an optional component, which operates at 216 dpi for optimal quality but requires about 30 seconds per page. Both the layout analysis and TableFormer models were developed by IBM Research and are publicly available as pre-trained weights on Hugging Face under "ds4sd/docling-models".
For more detailed information about these models and their implementation, you can refer to the technical documentation.
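The stages described above can be toggled per input format; the sketch below assumes Docling's PDF pipeline options, with OCR disabled for born-digital PDFs, and uses a placeholder file path.

```python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# do_table_structure enables the TableFormer stage; do_ocr enables the
# optional EasyOCR stage (useful for scanned pages, but slower per page).
pipeline_options = PdfPipelineOptions()
pipeline_options.do_table_structure = True
pipeline_options.do_ocr = False  # skip OCR for native (born-digital) PDFs

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)
result = converter.convert("document.pdf")  # placeholder path
```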
When you're building a RAG (Retrieval Augmented Generation) application, you need to break down documents into smaller, meaningful pieces that can be easily searched and retrieved. But this isn't as simple as just splitting text every X words or characters.
What makes Docling's chunking unique is that it understands the actual structure of your document. It has two main approaches: a hierarchical chunker, which splits the document along its structural elements (sections, paragraphs, tables, lists), and a hybrid chunker, which builds on that hierarchy while also keeping chunks within the token limits of your embedding model.
Imagine you're building a system to answer questions about technical documents. With basic chunking (like splitting every 500 words), you might cut right through the middle of a table or separate a header from its content. Docling's smart chunking instead keeps logical units like tables intact and carries the relevant headings and metadata along with each chunk.
This means when your RAG system retrieves chunks, they'll have the proper context and structure, leading to more accurate and coherent responses from your language model.
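As an illustration, here is a sketch of structure-aware chunking with Docling's HybridChunker; the tokenizer name, the max_tokens value, and the file path are assumptions and should match the embedding model and documents you actually use.

```python
from docling.chunking import HybridChunker
from docling.document_converter import DocumentConverter

# Convert a document, then chunk it along its structure while respecting a
# token budget; chunk.meta carries heading and provenance context per chunk.
doc = DocumentConverter().convert("paper.pdf").document  # placeholder path
chunker = HybridChunker(tokenizer="sentence-transformers/all-MiniLM-L6-v2", max_tokens=512)

for chunk in chunker.chunk(dl_doc=doc):
    print(chunk.text[:80])
```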
For full documentation, visit the Docling documentation site.
For example notebooks and more detailed guides, check out the Docling GitHub repository.