document-processing

Here are 751 public repositories matching this topic...

Zipstack / unstract

LLM-Driven Extraction of Unstructured Data — Built for API Deployments & ETL Pipeline Workflows

data-extraction document-processing etl-pipelines unstructured-data-extraction api-deployments open-source-data-pipeline

Updated Apr 18, 2026
Python

run-llama / liteparse

A fast, helpful, and open-source document parser

pdf ocr text-extraction ocr-recognition pdf-parser document-processing document-ocr

Updated Apr 17, 2026
TypeScript

ucbepic / docetl

A system for agentic LLM-powered data processing and ETL

python workflow data etl semantic-data elt data-pipelines agents document-analysis document-processing unstructured-data unstructured-data-analysis llm

Updated Mar 27, 2026
Python

enoch3712 / ExtractThinker

ExtractThinker is a Document Intelligence library for LLMs, offering ORM-style interaction for flexible and powerful document workflows.

python nlp pdf machine-learning ocr ai openai pdf-to-text document-processing document-image-analysis document-intelligence llm document-parsing langchain

Updated Aug 27, 2025
Python

Topdu / OpenOCR

OpenOCR: An open-source toolkit for general OCR research and applications. It integrates a unified training and evaluation benchmark, commercial-grade OCR and document parsing systems, and faithful reproductions of the core implementations from a wide range of academic papers.

ocr document-analysis document-processing scene-text-recognition scene-text-detection ocr-pytorch chineseocr document-parsing

Updated Mar 2, 2026
Python

eclaire-labs / eclaire

Local-first, open-source AI assistant for your data. Unify tasks, notes, docs, photos, and bookmarks. Private, self-hosted, and extensible via APIs.

open-source automation privacy ocr ai rest-api bookmarks self-hosted data-extraction note-taking web-archiving bookmark-manager task-management document-processing on-device-ai local-first personal-knowledge-management ai-assistant llm

Updated Apr 13, 2026
TypeScript

SylphxAI / pdf-reader-mcp

📄 Production-ready MCP server for PDF processing - 5-10x faster with parallel processing and 94%+ test coverage

nodejs pdf performance typescript mcp stdio pdf-reader parallel-processing pdf-parser document-processing pdf-parse ai-tools ai-agent model-context-protocol model-content-protocol llm-tool

Updated Apr 13, 2026
TypeScript

yfedoseev / pdf_oxide

The fastest PDF library for Python and Rust. Text extraction, image extraction, markdown conversion, PDF creation & editing. 0.8ms mean, 5× faster than industry leaders, 100% pass rate on 3,830 PDFs. MIT/Apache-2.0.

python markdown rust fast pdf text-extraction data-extraction pdf-generation pdf-to-text pdf-library pdf-parser document-processing rag pyo3 pdf-editor image-extraction llm pdf-to-markdown

Updated Apr 18, 2026
Rust

wxyhgk / retain-pdf

PDF translation that preserves layout, formulas, and structure, suited to scientific and technical documents

pdf ocr translation document-processing scientific-papers typst document-ai layout-preserving

Updated Apr 16, 2026
Python

ShapeCrawler / ShapeCrawler

PowerPoint .NET library for reading, modifying, and generating PPTX presentations without Microsoft Office

csharp dotnet presentation slides powerpoint openxml pptx document-processing office-open-xml

Updated Apr 18, 2026
C#

dhlab-epfl / dhSegment

Generic framework for historical document processing

tensorflow python3 segmentation historical-data document-processing

Updated Jul 9, 2021
Python

vorojar / Folio-OCR

Open-source batch OCR workbench: a free, local alternative to ABBYY FineReader. Powered by Ollama + GLM-OCR + PP-DocLayoutV3, ~0.5s/page on RTX 4090. Three-panel editor, layout-aware, PDF/image batch processing, Markdown/Word export. Runs entirely locally and is suited to digitizing books and documents.

privacy ocr offline book-digitization document-processing document-ocr layout-detection markdown-export pdf-ocr local-ai ollama batch-ocr glm-ocr abbyy-alternative

Updated Apr 2, 2026
Python

ucbepic / TWIX

TWIX is an open-source data extraction tool that reconstructs structured data from documents at scale, accurately and at low cost, by inferring the shared underlying visual template across documents

document-processing document-data-extraction

Updated Nov 26, 2025
Python

awslabs / project-lakechain

⚡ Cloud-native, AI-powered, document processing pipelines on AWS.

aws machine-learning natural-language-processing computer-vision serverless hacktoberfest document-processing aws-cdk generative-ai retrieval-augmented-generation

Updated Jan 22, 2026
TypeScript

bzsanti / oxidizePdf

A PDF library for Rust

rust pdf ocr encryption text-extraction data-extraction invoice crates-io rust-library pdf-generation pdf-reader pdf-manipulation pdfa pdf-library table-extraction pdf-parser digital-signatures document-processing

Updated Apr 13, 2026
Rust

formkiq / formkiq-core

Open-source document management platform leveraging AWS managed services. RESTful API for document storage, processing, full-text search, and metadata management. Multi-tenant serverless architecture with auto-scaling... deployed entirely in your AWS account.

aws ocr serverless headless cloud-storage document-database amazon-web-services dms document-management optical-character-recognition document-processing document-management-system document-api document-apis intelligent-document-processing document-layer

Updated Apr 18, 2026
Java

ExtractPDF4J / ExtractPDF4J

Java PDF table extraction & OCR library. Extract structured tables from text-based and scanned PDFs using stream, lattice (OpenCV-style grid detection), and hybrid parsing.

java cli ocr maven pdf-document pdf-extractor ocr-recognition document-processing pdf-processor pdf-document-processor pdf-extraction java17

Updated Mar 15, 2026
Java

watat83 / document-chat-system

Open-source document chat platform with semantic search, RAG (Retrieval Augmented Generation), and multi-provider AI support (OpenRouter, OpenAI, ImageRouter).

chatbot embeddings webapp document-management document-processing rag vector-search llms

Updated Apr 10, 2026
TypeScript

Tele-AI / doc-ops-mcp

MCP server for seamless document format conversion and processing

document-conversion file-converter pdf-conversion markdown-converter watermark document-processing document-converter docx-to-pdf pdf-processing docx2pdf document-rewriting

Updated Mar 30, 2026
TypeScript

iamarunbrahma / pdf-to-markdown

Conversion of PDF documents to structured Markdown, optimized for Retrieval Augmented Generation (RAG) and other NLP tasks. Extract text, tables, and images with preserved formatting for enhanced information retrieval and processing.

python information-retrieval document-conversion pdf-converter text-extraction pdf-parsing document-processing rag pdf-extraction retrieval-augmented-generation pdf-to-markdown

Updated Nov 22, 2024
Python
