LLM-Driven Extraction of Unstructured Data — Built for API Deployments & ETL Pipeline Workflows
-
Updated
Apr 18, 2026 - Python
LLM-Driven Extraction of Unstructured Data — Built for API Deployments & ETL Pipeline Workflows
A fast, helpful, and open-source document parser
A system for agentic LLM-powered data processing and ETL
ExtractThinker is a Document Intelligence library for LLMs, offering ORM-style interaction for flexible and powerful document workflows.
OpenOCR: An Open-Source Toolkit for General-OCR Research and Applications, integrates a unified training and evaluation benchmark, commercial-grade OCR and Document Parsing systems, and faithful reproductions of the core implementations from a wide range of academic papers.
Local-first, open-source AI assistant for your data. Unify tasks, notes, docs, photos, and bookmarks. Private, self-hosted, and extensible via APIs.
📄 Production-ready MCP server for PDF processing - 5-10x faster with parallel processing and 94%+ test coverage
The fastest PDF library for Python and Rust. Text extraction, image extraction, markdown conversion, PDF creation & editing. 0.8ms mean, 5× faster than industry leaders, 100% pass rate on 3,830 PDFs. MIT/Apache-2.0.
在保留版面、公式与结构的前提下进行 PDF 翻译,适用于科研与技术文档
PowerPoint .NET library for reading, modifying, and generating PPTX presentations without Microsoft Office
Generic framework for historical document processing
Open-source batch OCR workbench — a free, local alternative to ABBYY FineReader. Powered by Ollama + GLM-OCR + PP-DocLayoutV3, ~0.5s/page on RTX 4090. Three-panel editor, layout-aware, PDF/image batch processing, Markdown/Word export. 批量OCR工作台,纯本地运行,免费平替ABBYY,适合书籍文档数字化。
TWIX is an open-source data extraction tool that reconstructs structured data from documents at scale, accurately and at low cost, by inferring the shared underlying visual template across documents
⚡ Cloud-native, AI-powered, document processing pipelines on AWS.
a PDF library for rust
Open-source document management platform leveraging AWS managed services. RESTful API for document storage, processing, full-text search, and metadata management. Multi-tenant serverless architecture with auto-scaling... deployed entirely in your AWS account.
Java PDF table extraction & OCR library. Extract structured tables from text-based and scanned PDFs using stream, lattice (OpenCV-style grid detection), and hybrid parsing.
Open-source document chat platform with semantic search, RAG (Retrieval Augmented Generation), and multi-provider AI support (OpenRouter, OpenAI, ImageRouter).
MCP server for seamless document format conversion and processing
Conversion of PDF documents to structured Markdown, optimized for Retrieval Augmented Generation (RAG) and other NLP tasks. Extract text, tables, and images with preserved formatting for enhanced information retrieval and processing.
Add a description, image, and links to the document-processing topic page so that developers can more easily learn about it.
To associate your repository with the document-processing topic, visit your repo's landing page and select "manage topics."