SKILL.md

Extract text from images, scanned PDFs, and photographs using Optical Character Recognition (OCR). Supports multiple languages, structured output formats, and parsing of receipts and business cards.

Core Capabilities

Image OCR: Extract text from PNG, JPEG, TIFF, BMP images
PDF OCR: Process scanned PDFs page by page
Multi-language: Support for 100+ languages
Structured Output: Plain text, Markdown, JSON, or HTML
Table Detection: Extract tabular data to CSV/JSON
Batch Processing: Process multiple documents at once
Quality Assessment: Confidence scoring for OCR results

Quick Start

from scripts.ocr_processor import OCRProcessor

# Simple text extraction
processor = OCRProcessor("document.png")
text = processor.extract_text()
print(text)

# Extract to structured format
result = processor.extract_structured()
print(result['text'])
print(result['confidence'])
print(result['blocks'])  # Text blocks with positions

Core Workflow

1. Basic Text Extraction

from scripts.ocr_processor import OCRProcessor

# From image
processor = OCRProcessor("scan.png")
text = processor.extract_text()

# From PDF
processor = OCRProcessor("scanned.pdf")
text = processor.extract_text()  # All pages

# Specific pages
text = processor.extract_text(pages=[1, 2, 3])

2. Structured Extraction

# Get detailed results
result = processor.extract_structured()

# Result contains:
# - text: Full extracted text
# - blocks: Text blocks with bounding boxes
# - lines: Individual lines
# - words: Individual words with confidence
# - confidence: Overall confidence score
# - language: Detected language
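
How word-level OCR output becomes the `lines`, `words`, and `confidence` fields can be sketched as follows. This is a minimal illustration, not the skill's actual implementation: it assumes the input has the column layout that `pytesseract.image_to_data` returns as a dict, and the function name `build_structured` and the sample data are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def build_structured(data):
    """Group word-level OCR rows into lines and compute an overall confidence.

    `data` holds parallel lists keyed by 'text', 'conf', 'left', 'top',
    'width', 'height', 'block_num', 'par_num', and 'line_num'.
    """
    words = []
    lines = defaultdict(list)
    for i, text in enumerate(data["text"]):
        conf = float(data["conf"][i])
        if not text.strip() or conf < 0:  # a conf of -1 marks rows that are not words
            continue
        word = {
            "text": text,
            "confidence": conf,
            "bbox": [data["left"][i], data["top"][i],
                     data["width"][i], data["height"][i]],
        }
        words.append(word)
        key = (data["block_num"][i], data["par_num"][i], data["line_num"][i])
        lines[key].append(word)

    # Sorting the keys keeps lines in block/paragraph/line order.
    line_texts = [" ".join(w["text"] for w in ws) for _, ws in sorted(lines.items())]
    return {
        "text": "\n".join(line_texts),
        "lines": line_texts,
        "words": words,
        "confidence": round(mean(w["confidence"] for w in words), 1) if words else 0.0,
    }

# Made-up sample in the assumed layout (first row is a block-level row).
sample = {
    "text": ["", "Invoice", "2024", "Total", "due"],
    "conf": [-1, 96, 92, 88, 72],
    "left": [0, 10, 90, 10, 70],
    "top": [0, 10, 10, 40, 40],
    "width": [200, 70, 40, 50, 30],
    "height": [60, 12, 12, 12, 12],
    "block_num": [1, 1, 1, 1, 1],
    "par_num": [0, 1, 1, 1, 1],
    "line_num": [0, 1, 1, 2, 2],
}
result = build_structured(sample)
print(result["text"])
print(result["confidence"])
```

The overall score here is an unweighted mean of word confidences; the real `extract_structured()` may weight or aggregate differently.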

3. Export Formats

# Export to Markdown
processor.export_markdown("output.md")

# Export to JSON
processor.export_json("output.json")

# Export to searchable PDF
processor.export_searchable_pdf("searchable.pdf")

# Export to HTML
processor.export_html("output.html")

Language Support

# Specify language for better accuracy
processor = OCRProcessor("german_doc.png", lang='deu')

# Multiple languages
processor = OCRProcessor("mixed_doc.png", lang='eng+fra+deu')

# Auto-detect language
processor = OCRProcessor("document.png", lang='auto')

Supported Languages (Common)

| Code | Language | Code | Language |
|------|----------|------|----------|
| eng | English | fra | French |
| deu | German | spa | Spanish |
| ita | Italian | por | Portuguese |
| rus | Russian | chi_sim | Chinese (Simplified) |
| chi_tra | Chinese (Traditional) | jpn | Japanese |
| kor | Korean | ara | Arabic |
| hin | Hindi | nld | Dutch |
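
Multi-language strings such as `'eng+fra+deu'` are easy to mistype, so it can help to build them from validated codes. The helper below is a hypothetical sketch (`build_lang` is not part of the skill) that only knows the common codes listed in the table.

```python
# Common codes from the table above; a real installation may support more.
KNOWN_CODES = {
    "eng", "fra", "deu", "spa", "ita", "por", "rus",
    "chi_sim", "chi_tra", "jpn", "kor", "ara", "hin", "nld",
}

def build_lang(*codes):
    """Join language codes into the 'eng+fra' form, rejecting unknown codes."""
    cleaned = []
    for code in codes:
        code = code.strip().lower()
        if code not in KNOWN_CODES:
            raise ValueError(f"Unknown language code: {code!r}")
        if code not in cleaned:  # drop duplicates but keep the caller's order
            cleaned.append(code)
    if not cleaned:
        raise ValueError("At least one language code is required")
    return "+".join(cleaned)

lang = build_lang("eng", "FRA", "eng", "deu")
print(lang)
```

The resulting string could then be passed as the `lang` argument shown earlier.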

Image Preprocessing

Preprocessing improves OCR accuracy on low-quality images.

# Enable preprocessing
processor = OCRProcessor("noisy_scan.png")
processor.preprocess(
    deskew=True,        # Fix rotation
    denoise=True,       # Remove noise
    threshold=True,     # Binarize image
    contrast=1.5        # Enhance contrast
)
text = processor.extract_text()

Available Preprocessing Options

| Option | Description | Default |
|--------|-------------|---------|
| deskew | Correct skewed/rotated images | False |
| denoise | Remove noise and artifacts | False |
| threshold | Convert to black/white | False |
| threshold_method | 'otsu', 'adaptive', 'simple' | 'otsu' |
| contrast | Contrast factor (1.0 = no change) | 1.0 |
| sharpen | Sharpen factor (0 = none) | 0 |
| scale | Upscale factor for small text | 1.0 |
| remove_shadows | Remove shadow artifacts | False |
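
To make the `'otsu'` threshold method concrete, here is a small pure-Python sketch of Otsu's method on a flat list of grayscale values. It is illustrative only: the skill presumably relies on an image library for this step, and the function names and pixel data below are assumptions.

```python
def otsu_threshold(pixels):
    """Return the 0-255 threshold that maximizes between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * count for i, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # number of pixels at or below the candidate threshold
    sum_bg = 0  # intensity sum of those pixels
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue  # no background pixels yet
        w_fg = total - w_bg
        if w_fg == 0:
            break     # everything is background; no further split is possible
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

def binarize(pixels, threshold):
    """Map pixels above the threshold to white (255) and the rest to black (0)."""
    return [255 if p > threshold else 0 for p in pixels]

# Made-up image: 50 dark "ink" pixels and 50 bright "paper" pixels.
pixels = [20, 25, 30, 22, 28] * 10 + [200, 210, 220, 205, 215] * 10
t = otsu_threshold(pixels)
binary = binarize(pixels, t)
print(t, binary.count(0), binary.count(255))
```

Otsu's method works well when the histogram has two clear peaks; unevenly lit scans are the usual reason to try an adaptive method instead.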

Table Extraction

# Extract tables from document
tables = processor.extract_tables()

# Each table is a list of rows
for table in tables:
    for row in table:
        print(row)

# Export tables to CSV
processor.export_tables_csv("tables/")

# Export to JSON
processor.export_tables_json("tables.json")
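
Because each table is a list of rows, writing tables to disk needs little more than the standard `csv` module. The sketch below shows one possible approach; the `table_<n>.csv` naming scheme and the function name are assumptions, not the documented behavior of `export_tables_csv`.

```python
import csv
import tempfile
from pathlib import Path

def write_tables_csv(tables, output_dir):
    """Write each table (a list of rows) to table_<n>.csv and return the paths."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, table in enumerate(tables, start=1):
        path = out / f"table_{index}.csv"
        # newline="" lets the csv module control line endings itself.
        with path.open("w", newline="", encoding="utf-8") as handle:
            csv.writer(handle).writerows(table)
        paths.append(path)
    return paths

# Made-up tables in the list-of-rows shape described above.
tables = [
    [["Item", "Price"], ["Coffee", "3.50"]],
    [["Qty"], ["2"]],
]
with tempfile.TemporaryDirectory() as tmp:
    paths = write_tables_csv(tables, tmp)
    names = [p.name for p in paths]
    first_csv = paths[0].read_text(encoding="utf-8")

print(names)
print(first_csv)
```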

PDF Processing

Multi-Page PDFs

# Process all pages
processor = OCRProcessor("document.pdf")
full_text = processor.extract_text()

# Process specific pages
page_3 = processor.extract_text(pages=[3])

# Get per-page results
results = processor.extract_by_page()
for page_num, text in results.items():
    print(f"Page {page_num}: {len(text)} characters")

Create Searchable PDF

# Convert scanned PDF to searchable PDF
processor = OCRProcessor("scanned.pdf")
processor.export_searchable_pdf("searchable.pdf")

Batch Processing

from scripts.ocr_processor import batch_ocr

# Process directory of images
results = batch_ocr(
    input_dir="scans/",
    output_dir="extracted/",
    output_format="markdown",
    lang="eng",
    recursive=True
)

print(f"Processed: {results['success']} files")
print(f"Failed: {results['failed']} files")
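
The shape of a batch run (discover files, process each one, tally successes and failures) can be sketched with the standard library alone. Everything here is hypothetical: `run_batch`, `find_inputs`, the suffix list, and the stand-in `fake_extract` are illustrations and say nothing about how `batch_ocr` is implemented.

```python
import tempfile
from pathlib import Path

# Assumed input types, based on the formats listed under Core Capabilities.
SUPPORTED_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp", ".pdf"}

def find_inputs(input_dir, recursive=True):
    """Return supported files under input_dir in a stable, sorted order."""
    root = Path(input_dir)
    candidates = root.rglob("*") if recursive else root.glob("*")
    return sorted(
        p for p in candidates
        if p.is_file() and p.suffix.lower() in SUPPORTED_SUFFIXES
    )

def run_batch(input_dir, output_dir, extract, recursive=True):
    """Call extract(path) -> str for each input and tally the outcomes."""
    root, out_root = Path(input_dir), Path(output_dir)
    summary = {"success": 0, "failed": 0, "errors": {}}
    for src in find_inputs(root, recursive):
        # Mirror the input folder layout under the output directory.
        dest = (out_root / src.relative_to(root)).with_suffix(".md")
        try:
            text = extract(src)
            dest.parent.mkdir(parents=True, exist_ok=True)
            dest.write_text(text, encoding="utf-8")
            summary["success"] += 1
        except Exception as exc:  # one bad file should not stop the batch
            summary["failed"] += 1
            summary["errors"][src.name] = str(exc)
    return summary

def fake_extract(path):
    """Stand-in for a real OCR call; fails on one file to show the tally."""
    if path.name == "broken.jpg":
        raise RuntimeError("unreadable image")
    return f"# {path.stem}\n"

with tempfile.TemporaryDirectory() as tmp:
    scans = Path(tmp) / "scans"
    (scans / "nested").mkdir(parents=True)
    for name in ("a.png", "nested/b.tiff", "broken.jpg", "notes.txt"):
        (scans / name).write_bytes(b"")
    summary = run_batch(scans, Path(tmp) / "extracted", fake_extract)
    outputs = sorted(p.name for p in (Path(tmp) / "extracted").rglob("*.md"))

print(summary["success"], summary["failed"])
```

Recording the error message per file makes it possible to retry only the failures, for example with different preprocessing.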

Receipt/Document Parsing

Receipt Extraction

# Parse receipt structure
processor = OCRProcessor("receipt.jpg")
receipt_data = processor.parse_receipt()

# Returns structured data:
# - vendor: Store name
# - date: Transaction date
# - items: List of items with prices
# - subtotal: Subtotal amount
# - tax: Tax amount
# - total: Total amount
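
One simple way to derive such fields from raw OCR text is line-based pattern matching. The sketch below is a rough illustration under strong assumptions (amounts with two decimals, labels in English, the vendor on the first line); `parse_receipt_text` is a hypothetical name and real receipts will need more robust rules.

```python
import re

AMOUNT = r"(\d+[.,]\d{2})"
FIELD_PATTERNS = {
    "subtotal": re.compile(rf"^\s*sub\s*total\b[^\n]*?{AMOUNT}\s*$", re.I | re.M),
    "tax": re.compile(rf"^\s*(?:sales\s+)?tax\b[^\n]*?{AMOUNT}\s*$", re.I | re.M),
    "total": re.compile(rf"^\s*total\b[^\n]*?{AMOUNT}\s*$", re.I | re.M),
}
DATE_RE = re.compile(r"\b(\d{4}-\d{2}-\d{2}|\d{1,2}/\d{1,2}/\d{2,4})\b")
ITEM_RE = re.compile(rf"^(.+?)\s+{AMOUNT}$")
SUMMARY_LABEL_RE = re.compile(r"(?:sub\s*total|total|sales\s+tax|tax)\b", re.I)

def to_amount(raw):
    """Convert '4,50' or '4.50' to a float."""
    return float(raw.replace(",", "."))

def parse_receipt_text(text):
    """Pull vendor, date, items, and totals out of receipt text."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    data = {"vendor": lines[0] if lines else None, "date": None, "items": []}

    date_match = DATE_RE.search(text)
    if date_match:
        data["date"] = date_match.group(1)

    for field, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        data[field] = to_amount(match.group(1)) if match else None

    # Any other "name  amount" line is treated as a purchased item.
    for line in lines[1:]:
        match = ITEM_RE.match(line)
        if match and not SUMMARY_LABEL_RE.match(match.group(1)):
            data["items"].append(
                {"name": match.group(1), "price": to_amount(match.group(2))}
            )
    return data

# Made-up receipt text, as OCR might return it.
receipt_text = """CORNER CAFE
2024-03-15
Latte        4.50
Bagel        3.25
Subtotal     7.75
Tax 8%       0.62
Total        8.37
"""
receipt = parse_receipt_text(receipt_text)
print(receipt)
```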

Business Card Parsing

# Extract business card info
processor = OCRProcessor("card.jpg")
contact = processor.parse_business_card()

# Returns:
# - name: Person's name
# - title: Job title
# - company: Company name
# - email: Email addresses
# - phone: Phone numbers
# - address: Physical address
# - website: Website URLs
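
Fields with a recognizable shape (email, phone, website) can often be pulled from card text with regular expressions, while name, title, and company usually need layout or position heuristics. The sketch below covers only the first group; the patterns are deliberately loose and the function name is an assumption.

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d ().-]{6,}\d")
URL_RE = re.compile(r"(?:https?://|www\.)[^\s,]+", re.IGNORECASE)

def extract_contact_fields(text):
    """Return the emails, phone numbers, and websites found in text."""
    emails = EMAIL_RE.findall(text)
    # Blank out emails first so their domains are not reported as websites.
    without_emails = EMAIL_RE.sub(" ", text)
    return {
        "email": emails,
        "phone": PHONE_RE.findall(text),
        "website": URL_RE.findall(without_emails),
    }

# Made-up card text, as OCR might return it.
card_text = """Jane Doe
Senior Engineer
Example Corp
jane.doe@example.com
+1 (555) 010-2000
www.example.com"""
contact_fields = extract_contact_fields(card_text)
print(contact_fields)
```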

Configuration

processor = OCRProcessor("document.png")

# Configure OCR settings
processor.config.update({
    'psm': 3,           # Page segmentation mode
    'oem': 3,           # OCR engine mode
    'dpi': 300,         # DPI for processing
    'timeout': 30,      # Timeout in seconds
    'min_confidence': 60,  # Minimum word confidence
})

Page Segmentation Modes (PSM)

| Mode | Description |
|------|-------------|
| 0 | Orientation and script detection only |
| 1 | Automatic page segmentation with OSD |
| 3 | Fully automatic page segmentation (default) |
| 4 | Assume single column of text |
| 6 | Assume single uniform block of text |
| 7 | Treat image as single text line |
| 8 | Treat image as single word |
| 11 | Sparse text. Find as much text as possible |
| 12 | Sparse text with OSD |
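
The `psm`, `oem`, and `dpi` settings correspond to Tesseract command-line flags of the same names. As a sketch of how a settings dict might be turned into a flag string (the kind of value pytesseract accepts through its `config` argument), consider the hypothetical helper below; how the skill itself maps `processor.config` is not documented here.

```python
VALID_PSM = set(range(0, 14))  # Tesseract defines page segmentation modes 0-13
VALID_OEM = {0, 1, 2, 3}       # Tesseract defines OCR engine modes 0-3

def build_tesseract_config(settings):
    """Translate a settings dict into a Tesseract flag string."""
    psm = settings.get("psm", 3)
    oem = settings.get("oem", 3)
    if psm not in VALID_PSM:
        raise ValueError(f"psm must be between 0 and 13, got {psm}")
    if oem not in VALID_OEM:
        raise ValueError(f"oem must be between 0 and 3, got {oem}")
    parts = [f"--psm {psm}", f"--oem {oem}"]
    if "dpi" in settings:
        parts.append(f"--dpi {int(settings['dpi'])}")
    return " ".join(parts)

flags = build_tesseract_config({"psm": 6, "oem": 3, "dpi": 300})
print(flags)  # --psm 6 --oem 3 --dpi 300
```

Validating the mode numbers up front turns a confusing engine error into a clear `ValueError`.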

Quality Assessment

# Get confidence scores
result = processor.extract_structured()

# Overall confidence (0-100)
print(f"Confidence: {result['confidence']}%")

# Per-word confidence
for word in result['words']:
    print(f"{word['text']}: {word['confidence']}%")

# Filter low-confidence words
high_conf_words = [w for w in result['words'] if w['confidence'] > 80]
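
Beyond filtering, the per-word scores can be condensed into a short report that tells you whether a page needs manual review. The following sketch assumes words shaped like the `result['words']` entries above; `summarize_quality` and its default cutoff of 60 (borrowed from the `min_confidence` example setting) are illustrative choices.

```python
from statistics import mean

def summarize_quality(words, min_confidence=60):
    """Summarize word confidences and list the words that need review."""
    if not words:
        return {"mean": 0.0, "low_ratio": 0.0, "review": []}
    review = [w["text"] for w in words if w["confidence"] < min_confidence]
    return {
        "mean": round(mean(w["confidence"] for w in words), 1),
        "low_ratio": round(len(review) / len(words), 2),  # share of doubtful words
        "review": review,
    }

# Made-up words; "N0." mimics a typical misread of "No."
words = [
    {"text": "Invoice", "confidence": 96},
    {"text": "N0.", "confidence": 41},
    {"text": "1042", "confidence": 88},
    {"text": "due", "confidence": 75},
]
report = summarize_quality(words)
print(report)
```

A high `low_ratio` is a reasonable trigger for retrying with preprocessing or a different page segmentation mode.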

Output Formats

Markdown Export

processor.export_markdown("output.md")

Output includes:

Document title (if detected)
Structured headings
Paragraphs
Tables (as Markdown tables)
Page breaks for multi-page docs

JSON Export

processor.export_json("output.json")

Output structure:

{
  "source": "document.pdf",
  "pages": 5,
  "language": "eng",
  "confidence": 92.5,
  "text": "Full extracted text...",
  "blocks": [
    {
      "type": "paragraph",
      "text": "Block text...",
      "bbox": [x, y, width, height],
      "confidence": 95.2
    }
  ],
  "tables": [...]
}
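
A downstream script can consume this export with the standard `json` module. The sketch below lists blocks under a confidence threshold and converts each `[x, y, width, height]` box to corner coordinates, which many image libraries expect for cropping. The export content is made up, and the function name and threshold are assumptions.

```python
import json

def low_confidence_regions(export, threshold=80.0):
    """Return text and corner coordinates for blocks below the threshold."""
    regions = []
    for block in export.get("blocks", []):
        if block["confidence"] < threshold:
            x, y, width, height = block["bbox"]
            regions.append({
                "text": block["text"],
                "box": (x, y, x + width, y + height),  # (left, top, right, bottom)
            })
    return regions

# Made-up export that follows the structure shown above.
export = json.loads("""
{
  "source": "document.pdf",
  "pages": 1,
  "language": "eng",
  "confidence": 84.1,
  "text": "Quarterly report Revenve up 4%",
  "blocks": [
    {"type": "paragraph", "text": "Quarterly report",
     "bbox": [40, 60, 420, 32], "confidence": 95.2},
    {"type": "paragraph", "text": "Revenve up 4%",
     "bbox": [40, 120, 300, 24], "confidence": 73.0}
  ],
  "tables": []
}
""")
regions = low_confidence_regions(export)
print(regions)
```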

HTML Export

processor.export_html("output.html")

Creates styled HTML with:

Preserved layout approximation
Highlighted low-confidence regions
Embedded images (optional)
Print-friendly styling

CLI Usage

# Basic extraction
python ocr_processor.py image.png -o output.txt

# Extract to markdown
python ocr_processor.py document.pdf -o output.md --format markdown

# Specify language
python ocr_processor.py german.png --lang deu

# Batch processing
python ocr_processor.py scans/ -o extracted/ --batch

# With preprocessing
python ocr_processor.py noisy.png --preprocess --deskew --denoise

Error Handling

from scripts.ocr_processor import OCRProcessor, OCRError

try:
    processor = OCRProcessor("document.png")
    text = processor.extract_text()
except OCRError as e:
    print(f"OCR failed: {e}")
except FileNotFoundError:
    print("File not found")

Performance Tips

Image Quality: Higher resolution (300+ DPI) improves accuracy
Preprocessing: Use for low-quality scans
Language: Specifying language improves speed and accuracy
PSM Mode: Choose appropriate mode for document type
Large Files: Process PDFs page by page for memory efficiency

Limitations

Handwritten text: Limited accuracy
Complex layouts: May lose structure
Very low quality: Preprocessing helps but has limits
Non-Latin scripts: Require specific language packs

Dependencies

pytesseract>=0.3.10
Pillow>=10.0.0
PyMuPDF>=1.23.0
opencv-python>=4.8.0
numpy>=1.24.0

System Requirements

The Tesseract OCR engine must be installed separately; the Python packages above do not include it
Tesseract language data files must be installed for each non-English language used
