PaperCortex/docs/architecture.md
Rene Fichtmueller 2052d87ba1 feat: initial release — AI document intelligence for Paperless-ngx
PaperCortex adds semantic search, auto-classification, receipt extraction,
bank statement matching, and DATEV export to Paperless-ngx — powered
entirely by local AI through Ollama. Exposes everything as an MCP Server
for Claude Code and AI agent integration.

- MCP Server with 5 tools (search, classify, receipt, query, export)
- Local Ollama embeddings for semantic document search
- Receipt data extraction (vendor, amount, date, tax, line items)
- DATEV Buchungsstapel CSV export for German accounting
- Bank CSV transaction matching
- Paperless-ngx REST API client
- Docker deployment
- Zero cloud dependencies — 100% self-hosted
2026-03-26 06:28:48 +13:00

65 lines
2.7 KiB
Markdown

# Architecture
## Overview
PaperCortex is structured as three layers:
1. **MCP Server Layer** -- Exposes tools via the Model Context Protocol for AI agent integration.
2. **Intelligence Layer** -- Embedding generation, classification, receipt extraction, and query answering.
3. **Data Layer** -- Paperless-ngx API client and local SQLite vector store.
## Components
### MCP Server (`src/mcp-server/`)
The entry point for all AI agent interactions. Implements the MCP standard using `@modelcontextprotocol/sdk` and communicates via stdio transport.
Each tool is implemented as a separate handler module under `src/mcp-server/tools/`.
### Embeddings (`src/embeddings/`)
- **ollama.ts** -- Client for the Ollama API. Handles embedding generation and LLM completions.
- **store.ts** -- SQLite-backed vector store using `better-sqlite3`. Stores document embeddings and supports cosine similarity search.
Current implementation uses brute-force search, which is performant up to ~100k documents. For larger archives, consider migrating to `sqlite-vss` or a dedicated vector database.
### Paperless Integration (`src/paperless/`)
- **client.ts** -- REST API client for Paperless-ngx. Supports document CRUD, search, tags, correspondents, and document types.
- **types.ts** -- TypeScript type definitions matching the Paperless-ngx API v3+ schema.
### Receipt Processing (`src/receipt/`)
- **extractor.ts** -- Uses LLM to extract structured data from receipt OCR text.
- **matcher.ts** -- Matches extracted receipts against bank CSV transaction exports.
- **datev.ts** -- Generates DATEV Buchungsstapel format CSV for German accounting software.
## Data Flow
```
Paperless-ngx --(REST API)--> PaperCortex --(Ollama API)--> Ollama
|
v
SQLite Vector DB
|
v
MCP Server (stdio)
|
v
Claude Code / AI Agents
```
## Security Model
- All data stays local -- no external API calls except to Paperless-ngx and Ollama (both self-hosted).
- API tokens are read from environment variables, never hardcoded.
- The SQLite database is stored on the local filesystem with configurable path.
- MCP Server communicates via stdio (no network port required for MCP).
## Future Considerations
- **Webhook support** -- Listen for Paperless-ngx webhooks to auto-process new documents.
- **Plugin system** -- Allow custom extractors and exporters.
- **Web dashboard** -- Optional UI for monitoring and manual review.
- **Multi-user** -- Support multiple Paperless-ngx instances and user isolation.