Doc Parser

The Doc Parser tool extracts content from various document formats and intelligently splits them into chunks suitable for RAG (Retrieval-Augmented Generation) systems.

Overview

Doc Parser provides:
  • Multi-format Parsing: PDF, DOCX, PPTX, TXT, HTML, CSV, TSV, XLSX
  • Intelligent Chunking: Context-aware splitting with overlap
  • Table Extraction: Preserves table structure as markdown
  • Token Counting: Tracks token usage for each chunk
  • Caching: Parsed documents are cached for efficiency
  • Page Tracking: Maintains page number metadata

Registration

from qwen_agent.tools.base import BaseTool, register_tool

@register_tool('doc_parser')
class DocParser(BaseTool):
    ...
Tool Name: doc_parser
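
Because the tool is registered under this name, it can be referenced by name wherever tools are listed, such as an agent's function_list. A minimal sketch (the llm configuration here is illustrative and depends on your setup):

from qwen_agent.agents import Assistant

# 'doc_parser' resolves to the registered DocParser class via the tool registry
bot = Assistant(llm={'model': 'qwen-max'}, function_list=['doc_parser'])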

Parameters

url
string
required
Path to the document to parse. Can be:
  • Local file path: "/path/to/document.pdf"
  • HTTP(S) URL: "https://example.com/paper.pdf"
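
When an agent invokes the tool, the model supplies this parameter as a JSON string. A sketch of a direct call in the same form (the URL is illustrative):

params = '{"url": "https://example.com/paper.pdf"}'  # JSON arguments as produced by the model
result = DocParser().call(params=params)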

Parameter Schema

{
  "type": "object",
  "properties": {
    "url": {
      "description": "待解析的文件的路径,可以是一个本地路径或可下载的http(s)链接",
      "type": "string"
    }
  },
  "required": ["url"]
}

Configuration

max_ref_token
int
default:4000
Maximum total tokens for all chunks. If the document is smaller, it’s returned as a single chunk.
parser_page_size
int
default:500
Target size (in tokens) for each chunk. The chunking algorithm aims for this size but may vary.
path
string
Storage path for cached parsed documents. Defaults to $DEFAULT_WORKSPACE/tools/doc_parser.
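
All three options are passed through the cfg dictionary; the cache directory below is illustrative:

parser = DocParser(cfg={
    'max_ref_token': 4000,
    'parser_page_size': 500,
    'path': './doc_parser_cache',  # illustrative custom cache location
})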

Return Format

The tool returns a dictionary with the following structure:
{
  "url": "path/to/document.pdf",
  "title": "Document Title",
  "raw": [
    {
      "content": "[page: 1]\nFirst chunk content here...",
      "token": 234,
      "metadata": {
        "source": "path/to/document.pdf",
        "title": "Document Title",
        "chunk_id": 0
      }
    },
    {
      "content": "[page: 1]\nSecond chunk content...",
      "token": 456,
      "metadata": {
        "source": "path/to/document.pdf",
        "title": "Document Title",
        "chunk_id": 1
      }
    }
  ]
}

Chunk Structure

content
string
The text content of the chunk, including page markers like [page: 1].
token
int
Number of tokens in this chunk.
metadata
object
Metadata about the chunk:
  • source: Original file path or URL
  • title: Document title (extracted from first page or filename)
  • chunk_id: Sequential chunk identifier (0-indexed)
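
For downstream indexing, these fields map naturally onto a generic record format. A minimal sketch, assuming result is the dictionary returned by parser.call:

records = [
    {
        # Unique id built from the source path and the 0-indexed chunk_id
        'id': f"{c['metadata']['source']}#{c['metadata']['chunk_id']}",
        'text': c['content'],
        'tokens': c['token'],
    }
    for c in result['raw']
]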

Usage

Basic Parsing

from qwen_agent.tools import DocParser
import json

# Initialize the parser
parser = DocParser()

# Parse a document
result = parser.call(params=json.dumps({'url': 'document.pdf'}))

print(f"Title: {result['title']}")
print(f"Number of chunks: {len(result['raw'])}")
for chunk in result['raw']:
    print(f"Chunk {chunk['metadata']['chunk_id']}: {chunk['token']} tokens")

With Custom Chunk Size

parser = DocParser(cfg={
    'parser_page_size': 1000,  # Larger chunks
    'max_ref_token': 8000
})

result = parser.call(params=json.dumps({'url': 'large_document.pdf'}))

Parsing Remote Documents

result = parser.call(params=json.dumps({
    'url': 'https://arxiv.org/pdf/1706.03762.pdf'
}))

print(f"Parsed: {result['title']}")
print(f"Chunks: {len(result['raw'])}")

Accessing Chunk Content

result = parser.call(params=json.dumps({'url': 'report.pdf'}))

for chunk in result['raw']:
    print(f"--- Chunk {chunk['metadata']['chunk_id']} ---")
    print(chunk['content'][:200])  # First 200 characters
    print(f"Tokens: {chunk['token']}\n")

Chunking Algorithm

The Doc Parser uses an intelligent chunking algorithm:

Small Documents

If total tokens ≤ max_ref_token, the entire document is returned as one chunk.
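
A quick way to confirm this behavior (assuming result comes from parser.call on a small file):

if len(result['raw']) == 1:
    print('Document fits within max_ref_token: returned as a single chunk')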

Large Documents

For documents exceeding max_ref_token:
  1. Page-Based Chunking: Chunks respect page boundaries
  2. Paragraph-Aware: Tries not to split mid-paragraph
  3. Overlap: Last portion of previous chunk is included in next chunk (up to 150 characters)
  4. Sentence Splitting: Very long paragraphs are split at sentence boundaries
  5. Page Markers: Each chunk includes [page: N] markers

Example

Chunk 0: [page: 1] Introduction paragraph... First section...
Chunk 1: [page: 1] ...First section... (overlap) Second section... [page: 2] ...
Chunk 2: [page: 2] ...Second section... (overlap) Third section...
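
The sketch below imitates the paragraph-aware packing and overlap (steps 2 and 3) in plain Python to make the behavior concrete. It is a simplified illustration using character counts in place of tokens, not the library's actual implementation:

def chunk_paragraphs(paragraphs, chunk_size=500, overlap=150):
    """Greedily pack paragraphs into chunks, carrying a fixed-size overlap."""
    chunks, current = [], ''
    for para in paragraphs:
        if current and len(current) + len(para) > chunk_size:
            chunks.append(current)
            # Seed the next chunk with the tail of the previous one
            current = current[-overlap:]
        current += ('\n' if current else '') + para
    if current:
        chunks.append(current)
    return chunks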

Supported File Types

PDF

result = parser.call(params=json.dumps({'url': 'paper.pdf'}))
Features:
  • Text extraction with layout awareness
  • Table detection and conversion to markdown
  • Multi-column layout handling
  • Font size detection for headers

Word (DOCX)

result = parser.call(params=json.dumps({'url': 'report.docx'}))
Features:
  • Paragraph extraction
  • Table conversion to markdown
  • Entire document as single page

PowerPoint (PPTX)

result = parser.call(params=json.dumps({'url': 'presentation.pptx'}))
Features:
  • Each slide as separate page
  • Text frames and tables extracted
  • Slide order preserved

HTML

result = parser.call(params=json.dumps({'url': 'webpage.html'}))
Features:
  • BeautifulSoup parsing
  • Title extraction
  • Clean text without HTML tags

CSV / TSV / Excel

result = parser.call(params=json.dumps({'url': 'data.csv'}))
Features:
  • Tables converted to markdown format
  • Each sheet as separate page (Excel)
  • Preserves table structure

Plain Text

result = parser.call(params=json.dumps({'url': 'notes.txt'}))
Features:
  • Direct text processing
  • Paragraph splitting on newlines

Advanced Usage

Processing Multiple Documents

parser = DocParser()
documents = [
    'paper1.pdf',
    'paper2.pdf',
    'https://arxiv.org/pdf/1234.5678.pdf'
]

parsed_docs = []
for doc_path in documents:
    result = parser.call(params=json.dumps({'url': doc_path}))
    parsed_docs.append(result)
    print(f"Parsed {result['title']}: {len(result['raw'])} chunks")

Extracting Specific Pages

result = parser.call(params=json.dumps({'url': 'book.pdf'}))

# Find chunks from page 5
page_5_chunks = [
    chunk for chunk in result['raw']
    if '[page: 5]' in chunk['content']
]

for chunk in page_5_chunks:
    print(chunk['content'])

Token Budget Management

# Parse with strict token limit
parser = DocParser(cfg={
    'parser_page_size': 400,
    'max_ref_token': 2000
})

result = parser.call(params=json.dumps({'url': 'document.pdf'}))

# Calculate total tokens
total_tokens = sum(chunk['token'] for chunk in result['raw'])
print(f"Total tokens: {total_tokens}")

Caching Behavior

Parsed documents are automatically cached:
parser = DocParser()

# First call - parses document (slow)
import time
start = time.time()
result1 = parser.call(params=json.dumps({'url': 'large.pdf'}))
print(f"First parse: {time.time() - start:.2f}s")

# Second call - loads from cache (fast)
start = time.time()
result2 = parser.call(params=json.dumps({'url': 'large.pdf'}))
print(f"Cached load: {time.time() - start:.2f}s")
Cache key includes:
  • File URL/path hash
  • parser_page_size setting
Changing parser_page_size creates a new cache entry.
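
To force a re-parse after a document's content changes, delete its cached entries. A minimal sketch, assuming the cache lives under the default workspace location shown above:

import os
import shutil

# $DEFAULT_WORKSPACE/tools/doc_parser; adjust if DEFAULT_WORKSPACE or 'path' differs
cache_dir = os.path.join('workspace', 'tools', 'doc_parser')
if os.path.isdir(cache_dir):
    shutil.rmtree(cache_dir)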

Integration with Retrieval

Doc Parser is used internally by the Retrieval tool:
from qwen_agent.tools import Retrieval

# Retrieval uses DocParser automatically
retrieval = Retrieval(cfg={
    'parser_page_size': 500,  # Passed to DocParser
    'max_ref_token': 4000
})

results = retrieval.call(params=json.dumps({
    'query': 'neural networks',
    'files': ['paper.pdf']
}))

Data Classes

Chunk

from qwen_agent.tools.doc_parser import Chunk

chunk = Chunk(
    content="Text content here",
    metadata={'source': 'doc.pdf', 'title': 'My Doc', 'chunk_id': 0},
    token=123
)

chunk_dict = chunk.to_dict()

Record

from qwen_agent.tools.doc_parser import Record

record = Record(
    url='document.pdf',
    raw=[chunk1, chunk2],
    title='Document Title'
)

record_dict = record.to_dict()

Performance Tips

Small chunks (300-500 tokens):
  • ✅ Better for precise retrieval
  • ✅ More granular context
  • ❌ More chunks to process
  • ❌ Less context per chunk
Large chunks (800-1200 tokens):
  • ✅ More context preserved
  • ✅ Fewer chunks to manage
  • ❌ Less precise retrieval
  • ❌ May exceed LLM context limits
Caching:
  • Parse documents once during setup
  • Reuse the same DocParser instance
  • Cache location: $DEFAULT_WORKSPACE/tools/doc_parser
  • Clear the cache if document content changes
For very large documents:
parser = DocParser(cfg={
    'parser_page_size': 1000,  # Larger chunks
    'max_ref_token': 10000      # Higher limit
})

Example: Document Analysis

from qwen_agent.tools import DocParser
import json

def analyze_document(file_path):
    """Analyze document structure and content."""
    parser = DocParser()
    result = parser.call(params=json.dumps({'url': file_path}))
    
    print(f"Document: {result['title']}")
    print(f"Source: {result['url']}")
    print(f"Total chunks: {len(result['raw'])}")
    
    total_tokens = sum(chunk['token'] for chunk in result['raw'])
    print(f"Total tokens: {total_tokens}")
    
    avg_tokens = total_tokens / len(result['raw']) if result['raw'] else 0
    print(f"Average tokens per chunk: {avg_tokens:.1f}")
    
    # Find largest chunk
    largest = max(result['raw'], key=lambda c: c['token'])
    print(f"Largest chunk: {largest['token']} tokens (ID: {largest['metadata']['chunk_id']})")
    
    # Show first chunk preview
    if result['raw']:
        first_chunk = result['raw'][0]
        print(f"\nFirst chunk preview:")
        print(first_chunk['content'][:300] + "...")

# Analyze a document
analyze_document('research_paper.pdf')

Troubleshooting

Missing dependencies

Ensure the required dependencies are installed:

pip install "qwen-agent[rag]"

This installs parsers for all supported formats.

Parsing problems

Some documents have complex layouts. Try:
  • Checking whether the document is text-based (not scanned images)
  • Using a different file format (e.g., exporting the PDF to DOCX)
  • Adjusting parser_page_size

Poor table formatting

Tables are converted to markdown. If the formatting is poor:
  • Export the document in a more structured format (Excel for tables)
  • Manually convert tables to CSV

Very large documents

parser = DocParser(cfg={
    'parser_page_size': 300,  # Smaller chunks
    'max_ref_token': 2000     # Lower limit
})

Related Pages

  • Retrieval: High-level RAG tool that uses DocParser
  • Simple Doc Parser: Basic parsing without chunking
  • Storage: Caching system used by DocParser