PaperCortex adds semantic search, auto-classification, receipt extraction, bank statement matching, and DATEV export to Paperless-ngx — powered entirely by local AI through Ollama. Exposes everything as an MCP Server for Claude Code and AI agent integration. - MCP Server with 5 tools (search, classify, receipt, query, export) - Local Ollama embeddings for semantic document search - Receipt data extraction (vendor, amount, date, tax, line items) - DATEV Buchungsstapel CSV export for German accounting - Bank CSV transaction matching - Paperless-ngx REST API client - Docker deployment - Zero cloud dependencies — 100% self-hosted
2.7 KiB
2.7 KiB
Architecture
Overview
PaperCortex is structured as three layers:
- MCP Server Layer -- Exposes tools via the Model Context Protocol for AI agent integration.
- Intelligence Layer -- Embedding generation, classification, receipt extraction, and query answering.
- Data Layer -- Paperless-ngx API client and local SQLite vector store.
Components
MCP Server (src/mcp-server/)
The entry point for all AI agent interactions. Implements the MCP standard using @modelcontextprotocol/sdk and communicates via stdio transport.
Each tool is implemented as a separate handler module under src/mcp-server/tools/.
Embeddings (src/embeddings/)
- ollama.ts -- Client for the Ollama API. Handles embedding generation and LLM completions.
- store.ts -- SQLite-backed vector store using
better-sqlite3. Stores document embeddings and supports cosine similarity search.
Current implementation uses brute-force search, which is performant up to ~100k documents. For larger archives, consider migrating to sqlite-vss or a dedicated vector database.
Paperless Integration (src/paperless/)
- client.ts -- REST API client for Paperless-ngx. Supports document CRUD, search, tags, correspondents, and document types.
- types.ts -- TypeScript type definitions matching the Paperless-ngx API v3+ schema.
Receipt Processing (src/receipt/)
- extractor.ts -- Uses LLM to extract structured data from receipt OCR text.
- matcher.ts -- Matches extracted receipts against bank CSV transaction exports.
- datev.ts -- Generates DATEV Buchungsstapel format CSV for German accounting software.
Data Flow
Paperless-ngx --(REST API)--> PaperCortex --(Ollama API)--> Ollama
|
v
SQLite Vector DB
|
v
MCP Server (stdio)
|
v
Claude Code / AI Agents
Security Model
- All data stays local -- no external API calls except to Paperless-ngx and Ollama (both self-hosted).
- API tokens are read from environment variables, never hardcoded.
- The SQLite database is stored on the local filesystem with configurable path.
- MCP Server communicates via stdio (no network port required for MCP).
Future Considerations
- Webhook support -- Listen for Paperless-ngx webhooks to auto-process new documents.
- Plugin system -- Allow custom extractors and exporters.
- Web dashboard -- Optional UI for monitoring and manual review.
- Multi-user -- Support multiple Paperless-ngx instances and user isolation.