feat: initial release — AI document intelligence for Paperless-ngx

PaperCortex adds semantic search, auto-classification, receipt extraction,
bank statement matching, and DATEV export to Paperless-ngx — powered
entirely by local AI through Ollama. Exposes everything as an MCP Server
for Claude Code and AI agent integration.

- MCP Server with 5 tools (search, classify, receipt, query, export)
- Local Ollama embeddings for semantic document search
- Receipt data extraction (vendor, amount, date, tax, line items)
- DATEV Buchungsstapel CSV export for German accounting
- Bank CSV transaction matching
- Paperless-ngx REST API client
- Docker deployment
- Zero cloud dependencies — 100% self-hosted
Rene Fichtmueller 2026-03-26 06:28:48 +13:00
commit 2052d87ba1
25 changed files with 3322 additions and 0 deletions

20
.env.example Normal file

@ -0,0 +1,20 @@
# PaperCortex Configuration
# Copy this file to .env and fill in your values
# Paperless-ngx connection
PAPERLESS_URL=http://localhost:8000
PAPERLESS_TOKEN=your-paperless-api-token-here
# Ollama connection
OLLAMA_URL=http://localhost:11434
OLLAMA_MODEL=qwen2.5:14b
OLLAMA_EMBEDDING_MODEL=nomic-embed-text
# Vector store
VECTOR_DB_PATH=./data/vectors.db
# MCP Server
MCP_SERVER_PORT=3100
# Logging
LOG_LEVEL=info

35
.gitignore vendored Normal file

@ -0,0 +1,35 @@
# Dependencies
node_modules/
# Build output
dist/
# Environment files
.env
.env.local
.env.*.local
# Data directory (vectors, cache)
data/
# OS files
.DS_Store
Thumbs.db
# IDE
.vscode/
.idea/
*.swp
*.swo
# Logs
logs/
*.log
npm-debug.log*
# Test coverage
coverage/
# Temporary files
tmp/
temp/

34
Dockerfile Normal file

@ -0,0 +1,34 @@
FROM node:22-alpine AS builder
WORKDIR /app
COPY package.json package-lock.json* ./
RUN npm ci
COPY tsconfig.json ./
COPY src/ ./src/
RUN npm run build
# --- Production image ---
FROM node:22-alpine
WORKDIR /app
# Create a dedicated group and add the runtime user to it explicitly
RUN addgroup -g 1001 -S papercortex && \
    adduser -S -u 1001 -G papercortex papercortex
COPY package.json package-lock.json* ./
RUN npm ci --omit=dev && npm cache clean --force
COPY --from=builder /app/dist ./dist
RUN mkdir -p /app/data && chown papercortex:papercortex /app/data
USER papercortex
ENV NODE_ENV=production
ENV VECTOR_DB_PATH=/app/data/vectors.db
EXPOSE 3100
CMD ["node", "dist/mcp-server/index.js"]

21
LICENSE Normal file

@ -0,0 +1,21 @@
MIT License
Copyright (c) 2026 PaperCortex Contributors
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

737
README.md Normal file

@ -0,0 +1,737 @@
<p align="center">
<img src="docs/assets/papercortex-logo.svg" alt="PaperCortex Logo" width="120" />
<h1 align="center">PaperCortex</h1>
<p align="center">
<strong>AI-Powered Document Intelligence for Paperless-ngx</strong><br/>
<em>Semantic search, auto-classification, receipt extraction, and accounting export — 100% local, 100% private.</em>
</p>
<p align="center">
<a href="#quick-start"><img src="https://img.shields.io/badge/Docker-one--command-2496ED?logo=docker&logoColor=white" alt="Docker"></a>
<a href="LICENSE"><img src="https://img.shields.io/badge/License-MIT-22c55e.svg" alt="MIT License"></a>
<img src="https://img.shields.io/badge/TypeScript-5.x-3178C6?logo=typescript&logoColor=white" alt="TypeScript">
<img src="https://img.shields.io/badge/Ollama-Local_AI-7C3AED?logo=ollama&logoColor=white" alt="Ollama">
<img src="https://img.shields.io/badge/MCP-Server-F97316" alt="MCP Server">
<img src="https://img.shields.io/badge/Paperless--ngx-Compatible-EF4444?logo=data:image/svg+xml;base64,..." alt="Paperless-ngx">
<img src="https://img.shields.io/badge/DATEV-Export-EAB308" alt="DATEV Export">
<img src="https://img.shields.io/badge/Privacy-First-10B981" alt="Privacy First">
</p>
<p align="center">
<a href="#quick-start">Quick Start</a> · <a href="#features">Features</a> · <a href="#mcp-server-tools">MCP Tools</a> · <a href="#receipt-intelligence">Receipts</a> · <a href="docs/architecture.md">Docs</a>
</p>
</p>
---
## What is PaperCortex?
**PaperCortex** turns your [Paperless-ngx](https://github.com/paperless-ngx/paperless-ngx) document archive into an intelligent, queryable knowledge base — powered entirely by local AI running on your own hardware.
If you use Paperless-ngx to store invoices, receipts, contracts, tax documents, letters, or any other scanned paperwork, PaperCortex adds the intelligence layer that Paperless-ngx is missing:
- **Ask questions in plain English** — "Show me all invoices from Amazon over 100 EUR in 2025"
- **Find documents by meaning**, not just keywords — searching for "office rent" finds "Bueromiete" and "monthly lease payment"
- **Auto-tag and classify** every new document the moment it arrives
- **Extract structured data from receipts** — vendor, date, amount, tax rate, line items
- **Match receipts to bank transactions** automatically
- **Export to DATEV** for your German tax advisor — or plain CSV for any accounting software
Everything runs locally through [Ollama](https://ollama.com). No document content ever leaves your network. No cloud APIs. No subscriptions. No data harvesting.
PaperCortex exposes all capabilities as an **[MCP (Model Context Protocol)](https://modelcontextprotocol.io) Server**, making it a first-class tool for [Claude Code](https://docs.anthropic.com/en/docs/claude-code), AI coding agents, and automated workflows.
---
## The Problem
Paperless-ngx is an outstanding document management system with 37,000+ GitHub stars. It handles scanning, OCR, storage, and basic tagging beautifully. But once your documents are in Paperless-ngx, finding and working with them has real limitations:
| What you want to do | Paperless-ngx alone | With PaperCortex |
|---|---|---|
| Find a document by what it's about | Keyword search only — misses synonyms, translations, related concepts | **Semantic search** understands meaning across languages |
| Classify incoming documents | Manual rules or basic auto-matching | **LLM-powered classification** understands document content |
| Extract data from a receipt | Read it yourself and type it in | **Automatic extraction** of vendor, amount, date, tax, line items |
| Answer "How much did I spend on X?" | Export everything, open spreadsheet, filter manually | **Natural language query** returns the answer with source references |
| Send receipt data to accounting | Manual data entry or copy-paste | **DATEV/CSV export in a single command**, ready for your tax advisor |
| Use documents in AI workflows | REST API only, no built-in MCP interface for AI agents | **Full MCP Server** for Claude Code and any MCP-compatible agent |
| Keep data private | Self-hosted (good!) | Self-hosted AI too — **zero cloud dependency** |
---
## Features
### Semantic Document Search
Traditional keyword search fails when you don't remember the exact words. PaperCortex generates vector embeddings for every document using local Ollama models and stores them in a lightweight SQLite vector database.
**Search by meaning, not by memory:**
- Search for `"electricity bill"` → finds documents containing "Stromrechnung", "utility payment", "power invoice"
- Search for `"office supplies"` → finds "Bueroausstattung", "paper and toner", "desk accessories order"
- Search for `"tax deductible travel"` → finds flight bookings, hotel receipts, train tickets, taxi invoices
**Supported embedding models:**
- `nomic-embed-text` (recommended — fast, accurate, 768 dimensions)
- `mxbai-embed-large` (higher accuracy, slower)
- Any Ollama-compatible embedding model
### Automatic Document Classification
Every new document arriving in Paperless-ngx gets analyzed by a local LLM that reads the OCR content and assigns:
- **Document type** — Invoice, Receipt, Contract, Letter, Statement, Tax Document, Certificate
- **Tags** — Contextual tags based on content (e.g., "office", "travel", "insurance", "subscription")
- **Correspondent** — Identifies the sender/vendor from document content
- **Date extraction** — Finds the document date (not just the scan date)
- **Language detection** — Identifies the document language
Classification runs asynchronously in the background. PaperCortex polls Paperless-ngx at a configurable interval (`POLL_INTERVAL`, 300 seconds by default), so new documents are typically processed within minutes of arriving.
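A minimal sketch of how an LLM classification reply could be validated before anything is written back to Paperless-ngx. The `Classification` shape, the JSON reply format, and the helpers `parseClassification` and `shouldAutoApply` are assumptions made for illustration and are not taken from the PaperCortex source. The document types come from the list above, and the 0.7 threshold mirrors the documented `CLASSIFICATION_CONFIDENCE` default.

```typescript
// Hypothetical result shape; the project's real type may differ.
interface Classification {
  documentType: string;
  tags: string[];
  correspondent: string | null;
  confidence: number; // expected range 0..1
}

// Document types listed in the README.
const DOCUMENT_TYPES = [
  "Invoice", "Receipt", "Contract", "Letter",
  "Statement", "Tax Document", "Certificate",
];

// Parse the raw LLM reply; return null when it is not usable.
function parseClassification(raw: string): Classification | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null; // the model did not return valid JSON
  }
  if (typeof data !== "object" || data === null) return null;
  const obj = data as Record<string, unknown>;

  const documentType = obj["documentType"];
  if (typeof documentType !== "string") return null;
  if (DOCUMENT_TYPES.indexOf(documentType) === -1) return null;

  const confidence = obj["confidence"];
  if (typeof confidence !== "number" || confidence < 0 || confidence > 1) return null;

  // Keep only string tags; silently drop anything else.
  const rawTags = obj["tags"];
  const tags = Array.isArray(rawTags)
    ? rawTags.filter((t): t is string => typeof t === "string")
    : [];

  const rawCorrespondent = obj["correspondent"];
  const correspondent = typeof rawCorrespondent === "string" ? rawCorrespondent : null;

  return { documentType, tags, correspondent, confidence };
}

// Only auto-apply results at or above the confidence threshold.
function shouldAutoApply(c: Classification, threshold = 0.7): boolean {
  return c.confidence >= threshold;
}
```

Rejecting unknown document types outright, rather than passing them through, keeps a hallucinated label from creating stray types in Paperless-ngx; whether the project does this is not stated in the README.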
### Receipt Intelligence
PaperCortex includes a dedicated receipt processing pipeline optimized for expense management:
**Data extraction from receipts and invoices:**
- Vendor / merchant name and address
- Date of purchase
- Total amount (gross and net)
- Tax rate and tax amount (supports multiple VAT rates)
- Currency
- Individual line items with quantities and prices
- Payment method
- Invoice/receipt number
**Works with:**
- Scanned paper receipts (via Paperless-ngx OCR)
- Digital PDF invoices
- Photographed receipts (mobile upload to Paperless-ngx)
- Multi-page invoices
- Receipts in German, English, French, Spanish, and other languages
### Bank Statement Matching
Import your bank statement as CSV and let PaperCortex automatically match transactions to receipts:
- **Fuzzy matching** on amount, date, and vendor name
- **Confidence scoring** — high/medium/low match indicators
- **Unmatched detection** — highlights receipts without matching transactions and vice versa
- **Multi-currency support** — handles EUR, USD, GBP, CHF, and 20+ other currencies
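The matching idea above can be sketched as a small scoring function. This is an illustrative assumption, not the project's matcher: the `Receipt` and `Transaction` shapes, the day windows (3 and 14 days), and the word-overlap vendor check are all hypothetical choices, and currency conversion is left out.

```typescript
// Hypothetical record shapes; dates are ISO strings (YYYY-MM-DD).
interface Receipt { vendor: string; date: string; amount: number }
interface Transaction { description: string; date: string; amount: number }

type Confidence = "high" | "medium" | "low" | "none";

// Whole days between two ISO dates, ignoring sign.
function dayDiff(a: string, b: string): number {
  const ms = Math.abs(Date.parse(a + "T00:00:00Z") - Date.parse(b + "T00:00:00Z"));
  return Math.round(ms / 86400000);
}

// True if any vendor word of 3+ characters occurs in the bank description.
function vendorMatches(vendor: string, description: string): boolean {
  const haystack = description.toLowerCase();
  const words = vendor
    .toLowerCase()
    .split(/[^a-z0-9äöüß]+/)
    .filter((w) => w.length >= 3);
  return words.some((w) => haystack.indexOf(w) !== -1);
}

// Score one receipt/transaction pair. Thresholds are illustrative only.
function matchConfidence(r: Receipt, t: Transaction): Confidence {
  // Compare in cents to avoid floating-point noise; debits are often negative.
  const receiptCents = Math.round(Math.abs(r.amount) * 100);
  const transactionCents = Math.round(Math.abs(t.amount) * 100);
  if (receiptCents !== transactionCents) return "none";

  const days = dayDiff(r.date, t.date);
  if (days > 14) return "none";

  const vendor = vendorMatches(r.vendor, t.description);
  if (vendor && days <= 3) return "high";
  if (vendor || days <= 3) return "medium";
  return "low";
}
```

Requiring an exact amount match first keeps the sketch simple; a real matcher might also tolerate small differences such as tips or card fees.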
### DATEV Export
For German businesses and freelancers, PaperCortex generates DATEV-compatible export files that your Steuerberater can import directly:
- **DATEV CSV format** (Buchungsstapel) — the standard German accounting import format
- **SKR03 / SKR04** account mapping
- **Automatic account assignment** based on document classification
- **Beleglink** — links each DATEV entry back to the original document in Paperless-ngx
- **Period exports** — monthly, quarterly, or annual
Also supports plain CSV export for use with any accounting software worldwide.
### Natural Language Queries
Ask questions about your document archive in plain language:
```
"How much did I spend on hotels in Q1 2025?"
"Show me all contracts expiring this year"
"What was my highest single expense last month?"
"Find all invoices from Deutsche Telekom"
"Which receipts don't have a matching bank transaction?"
"Summarize my office supply spending trend over the last 12 months"
```
PaperCortex translates natural language into document queries, retrieves relevant documents via semantic search, and uses the local LLM to synthesize answers with source references.
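The retrieve-then-synthesize step can be pictured as assembling a grounded prompt from the search hits. The sketch below is an assumption about how such a prompt might look; `RetrievedDoc`, `buildAnswerPrompt`, and the instruction wording are hypothetical and not taken from the PaperCortex source.

```typescript
// Hypothetical shape of a semantic search hit.
interface RetrievedDoc { id: number; title: string; excerpt: string }

// Build a prompt that asks the model to answer only from numbered sources.
function buildAnswerPrompt(
  question: string,
  docs: RetrievedDoc[],
  maxExcerptChars = 500,
): string {
  // Truncate excerpts so a few long documents cannot crowd out the rest.
  const sources = docs
    .map((d) => `[#${d.id}] ${d.title}\n${d.excerpt.slice(0, maxExcerptChars)}`)
    .join("\n\n");

  return [
    "Answer the question using only the sources below.",
    "Cite the document IDs you used, e.g. [#1234].",
    "If the sources do not contain the answer, say so.",
    "",
    "Sources:",
    sources,
    "",
    `Question: ${question}`,
  ].join("\n");
}
```

Carrying the Paperless-ngx document ID in each source label is what lets the answer point back to its source documents.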
### MCP Server Integration
PaperCortex implements the [Model Context Protocol (MCP)](https://modelcontextprotocol.io) — the open standard for connecting AI agents to external tools. This means any MCP-compatible AI agent can use your document archive as a knowledge source.
**Compatible with:**
- [Claude Code](https://docs.anthropic.com/en/docs/claude-code) (Anthropic)
- [Claude Desktop](https://claude.ai)
- Any MCP-compatible AI agent or IDE plugin
- Custom AI workflows via the MCP SDK
---
## Feature Comparison
| Feature | PaperCortex | paperless-ai | Veryfi | Taggun | Rossum |
|---|:---:|:---:|:---:|:---:|:---:|
| Fully self-hosted | :white_check_mark: | :white_check_mark: | :x: | :x: | :x: |
| Local AI (no cloud API) | :white_check_mark: | :x: OpenAI | :x: | :x: | :x: |
| Semantic search | :white_check_mark: | :x: | :x: | :x: | :x: |
| Auto-classification | :white_check_mark: | :white_check_mark: | :x: | :x: | :white_check_mark: |
| Receipt data extraction | :white_check_mark: | :x: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
| Bank statement matching | :white_check_mark: | :x: | :x: | :x: | :x: |
| DATEV export | :white_check_mark: | :x: | :x: | :x: | :x: |
| CSV accounting export | :white_check_mark: | :x: | :white_check_mark: | :x: | :white_check_mark: |
| MCP Server | :white_check_mark: | :x: | :x: | :x: | :x: |
| Natural language queries | :white_check_mark: | :x: | :x: | :x: | :x: |
| Multi-language documents | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
| Free and open source | :white_check_mark: | :white_check_mark: | :x: $$$ | :x: $$$ | :x: $$$$ |
| Privacy — data stays local | :white_check_mark: | :warning: API calls | :x: | :x: | :x: |
| Works with Paperless-ngx | :white_check_mark: | :white_check_mark: | :x: | :x: | :x: |
---
## Architecture
```
┌─────────────────────┐         ┌──────────────────────────┐         ┌────────────────────┐
│                     │         │                          │         │                    │
│  Claude Code /      │   MCP   │       PaperCortex        │  REST   │   Paperless-ngx    │
│  AI Agents /        ├────────►│                          ├────────►│                    │
│  Automation         │         │   ┌──────────────────┐   │   API   │  OCR + Storage +   │
│                     │         │   │  MCP Server      │   │         │  Tagging           │
└─────────────────────┘         │   │  (stdio / HTTP)  │   │         │                    │
                                │   └──────────────────┘   │         └────────────────────┘
                                │                          │
                                │   ┌──────────────────┐   │         ┌────────────────────┐
                                │   │  Intelligence    │   │         │                    │
                                │   │  Layer           │   │   LLM   │       Ollama       │
                                │   │                  ├────────────►│                    │
                                │   │  - Classifier    │   │   API   │  qwen2.5 / llama3  │
                                │   │  - Extractor     │   │         │  nomic-embed-text  │
                                │   │  - Query Engine  │   │         │                    │
                                │   └──────────────────┘   │         └────────────────────┘
                                │                          │
                                │   ┌──────────────────┐   │
                                │   │  Vector Store    │   │
                                │   │  (SQLite + HNSW) │   │
                                │   └──────────────────┘   │
                                │                          │
                                └──────────────────────────┘
```
### How It Works
1. **Documents arrive** in Paperless-ngx through scanning, email, or manual upload
2. **PaperCortex polls** the Paperless-ngx API for new and updated documents
3. **Embedding generation** — Ollama creates vector embeddings from OCR text
4. **Classification** — the local LLM analyzes content and assigns types, tags, and metadata
5. **Storage** — embeddings and extracted data are stored in a local SQLite vector database
6. **Query interface** — the MCP Server exposes search, classify, extract, query, and export tools
7. **AI agents connect** via MCP and interact with your documents using natural language
All processing happens on your hardware. The only network traffic is between PaperCortex and your local Paperless-ngx and Ollama instances.
---
## Quick Start
### Prerequisites
- **[Docker](https://docs.docker.com/get-docker/)** and Docker Compose
- **[Paperless-ngx](https://github.com/paperless-ngx/paperless-ngx)** — running instance with API access
- **[Ollama](https://ollama.com)** — running locally or on your network
**Pull the required Ollama models:**
```bash
ollama pull qwen2.5:14b # LLM for classification, extraction, queries
ollama pull nomic-embed-text # Embedding model for semantic search
```
### Option 1: Docker Compose (Recommended)
```bash
git clone https://github.com/renefichtmueller/PaperCortex.git
cd PaperCortex
cp .env.example .env
```
Edit `.env` with your configuration:
```env
PAPERLESS_URL=http://your-paperless-instance:8000
PAPERLESS_TOKEN=your-paperless-api-token
OLLAMA_URL=http://your-ollama-host:11434
OLLAMA_MODEL=qwen2.5:14b
OLLAMA_EMBEDDING_MODEL=nomic-embed-text
```
Start PaperCortex:
```bash
docker compose up -d
```
PaperCortex will begin indexing your existing documents automatically.
### Option 2: Manual Installation
```bash
git clone https://github.com/renefichtmueller/PaperCortex.git
cd PaperCortex
npm install
cp .env.example .env
# Edit .env with your settings
npm run build
npm start
```
### Option 3: npx (MCP Server only)
```bash
npx papercortex --paperless-url http://localhost:8000 --paperless-token YOUR_TOKEN
```
---
## MCP Server Tools
PaperCortex exposes five MCP tools that AI agents can call:
### `papercortex_search` — Semantic Document Search
Find documents by meaning, not just keywords.
```json
{
  "tool": "papercortex_search",
  "arguments": {
    "query": "electricity bills from last winter",
    "limit": 10,
    "date_from": "2024-12-01",
    "date_to": "2025-02-28"
  }
}
```
**Returns:** Ranked list of documents with relevance scores, titles, dates, and Paperless-ngx document IDs.
### `papercortex_classify` — Auto-Classification
Analyze a document and assign type, tags, and metadata.
```json
{
  "tool": "papercortex_classify",
  "arguments": {
    "document_id": 1234,
    "apply": true
  }
}
```
**Returns:** Suggested document type, tags, correspondent, and confidence scores. Set `apply: true` to write classifications back to Paperless-ngx.
### `papercortex_receipt` — Receipt Data Extraction
Extract structured financial data from receipts and invoices.
```json
{
  "tool": "papercortex_receipt",
  "arguments": {
    "document_id": 5678
  }
}
```
**Returns:**
```json
{
  "vendor": "Amazon EU S.a.r.l.",
  "date": "2025-03-15",
  "total_gross": 119.99,
  "total_net": 100.83,
  "tax_rate": 19,
  "tax_amount": 19.16,
  "currency": "EUR",
  "items": [
    { "description": "USB-C Hub", "quantity": 1, "price": 49.99 },
    { "description": "Monitor Arm", "quantity": 1, "price": 70.00 }
  ],
  "invoice_number": "INV-DE-2025-1234567"
}
```
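Because the amounts come from an LLM reading OCR text, a consistency check on the extracted figures is a cheap safeguard. The sketch below uses the field names from the example response above; the helper `checkReceiptTotals`, the one-cent tolerance, and the reading of `price` as a gross unit price are assumptions for illustration.

```typescript
// Field names follow the example response; the helper itself is hypothetical.
interface LineItem { description: string; quantity: number; price: number }
interface ReceiptData {
  total_gross: number;
  total_net: number;
  tax_amount: number;
  items: LineItem[];
}

// Convert a decimal amount to integer cents to avoid floating-point drift.
function toCents(amount: number): number {
  return Math.round(amount * 100);
}

// Return human-readable problems; an empty list means the totals agree.
function checkReceiptTotals(r: ReceiptData, toleranceCents = 1): string[] {
  const problems: string[] = [];
  const gross = toCents(r.total_gross);

  // Net plus tax should reproduce the gross total.
  const netPlusTax = toCents(r.total_net) + toCents(r.tax_amount);
  if (Math.abs(gross - netPlusTax) > toleranceCents) {
    problems.push(`net + tax (${netPlusTax}) != gross (${gross})`);
  }

  // Assumption: `price` is a gross unit price, so quantity * price sums to gross.
  if (r.items.length > 0) {
    const itemSum = r.items.reduce((sum, i) => sum + toCents(i.price) * i.quantity, 0);
    if (Math.abs(gross - itemSum) > toleranceCents) {
      problems.push(`line items (${itemSum}) != gross (${gross})`);
    }
  }
  return problems;
}
```

For the example response, 100.83 + 19.16 and 49.99 + 70.00 both equal 119.99, so the check reports no problems.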
### `papercortex_query` — Natural Language Questions
Ask questions about your entire document archive.
```json
{
  "tool": "papercortex_query",
  "arguments": {
    "question": "How much did I spend on business travel in Q1 2025?"
  }
}
```
**Returns:** A natural language answer with source document references and a breakdown of the calculation.
### `papercortex_export` — Accounting Export
Export extracted receipt data in accounting-ready formats.
```json
{
  "tool": "papercortex_export",
  "arguments": {
    "format": "datev",
    "date_from": "2025-01-01",
    "date_to": "2025-03-31",
    "account_plan": "SKR03"
  }
}
```
**Supported formats:** `datev` (German standard), `csv` (universal), `json` (programmatic).
---
## Claude Code Integration
### Register as MCP Server
Add to your `~/.claude.json` (user scope) or to a `.mcp.json` file in the project root (project scope):
```json
{
  "mcpServers": {
    "papercortex": {
      "command": "node",
      "args": ["./dist/mcp-server/index.js"],
      "cwd": "/path/to/PaperCortex",
      "env": {
        "PAPERLESS_URL": "http://localhost:8000",
        "PAPERLESS_TOKEN": "your-token",
        "OLLAMA_URL": "http://localhost:11434"
      }
    }
  }
}
```
### Example Conversations
Once connected, you can ask Claude Code about your documents naturally:
```
You: Search my documents for anything related to the office lease renewal
Claude: I found 4 relevant documents:
1. "Mietvertrag Verlängerung 2025" (Score: 0.94) — Document #1234
2. "Office Lease Agreement Amendment" (Score: 0.91) — Document #1235
3. "Nebenkostenabrechnung 2024" (Score: 0.78) — Document #1240
4. "Facilities Management Invoice" (Score: 0.72) — Document #1251
```
```
You: Extract the receipt data from document #5678 and export it for DATEV
Claude: Extracted receipt data:
Vendor: Deutsche Bahn AG
Date: 2025-03-20
Amount: 89.90 EUR (net: 84.02 EUR, 7% VAT: 5.88 EUR)
Description: ICE Frankfurt-Berlin, 1st class
DATEV export saved to: exports/datev_2025_03.csv
```
```
You: How much did I spend on cloud services this year?
Claude: Based on 23 matching documents, your cloud service spending in 2025:
- AWS: 2,340.00 EUR (12 invoices)
- Hetzner: 456.00 EUR (3 invoices)
- Cloudflare: 240.00 EUR (3 invoices)
- Vercel: 180.00 EUR (3 invoices)
- GitHub: 132.00 EUR (2 invoices)
Total: 3,348.00 EUR
```
---
## Receipt Workflow
### End-to-End Receipt Processing
```
┌──────────┐    ┌─────────────┐    ┌──────────────┐    ┌──────────┐    ┌──────────┐
│  Scan /  │    │ Paperless-  │    │ PaperCortex  │    │  Match   │    │  Export  │
│  Photo / ├───►│ ngx         ├───►│ Receipt      ├───►│  Bank    ├───►│  DATEV / │
│  Email   │    │ OCR+Store   │    │ Extraction   │    │  CSV     │    │  CSV     │
└──────────┘    └─────────────┘    └──────────────┘    └──────────┘    └──────────┘
```
### CLI Commands
```bash
# Process all unprocessed receipts
npm run receipt:process
# Extract data from a specific document
npm run receipt:extract -- --document-id 1234
# Import bank statement and match transactions
npm run receipt:match -- --bank-csv ./bank_export_2025_q1.csv
# Export matched data as DATEV
npm run receipt:export -- --format datev --period 2025-Q1
# Export as plain CSV
npm run receipt:export -- --format csv --period 2025-03
```
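The `--period` values used above (`2025-Q1`, `2025-03`) have to be turned into a concrete date range at some point. A possible sketch follows; the function `periodToRange` and the bare `YYYY` form for annual exports are assumptions, since the README shows only the quarterly and monthly spellings.

```typescript
// Inclusive date range as ISO strings (YYYY-MM-DD).
interface DateRange { from: string; to: string }

function pad2(n: number): string {
  return n < 10 ? "0" + n : String(n);
}

// Last calendar day of a month (month is 1-12): day 0 of the following month.
function lastDayOfMonth(year: number, month: number): number {
  return new Date(Date.UTC(year, month, 0)).getUTCDate();
}

// Accepts "YYYY", "YYYY-Qn" or "YYYY-MM"; returns null for anything else.
function periodToRange(period: string): DateRange | null {
  const m = /^([12]\d{3})(?:-(?:Q([1-4])|(0[1-9]|1[0-2])))?$/.exec(period);
  if (!m) return null;

  const year = Number(m[1]);
  let firstMonth = 1;
  let lastMonth = 12;
  if (m[2]) {
    // Quarter: Q1 covers months 1-3, Q2 covers 4-6, and so on.
    firstMonth = (Number(m[2]) - 1) * 3 + 1;
    lastMonth = firstMonth + 2;
  } else if (m[3]) {
    // Single month.
    firstMonth = Number(m[3]);
    lastMonth = firstMonth;
  }

  return {
    from: `${year}-${pad2(firstMonth)}-01`,
    to: `${year}-${pad2(lastMonth)}-${pad2(lastDayOfMonth(year, lastMonth))}`,
  };
}
```

With this mapping, `2025-Q1` covers 2025-01-01 through 2025-03-31, the same range as the `date_from`/`date_to` pair in the `papercortex_export` example.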
### DATEV Integration Details
The DATEV export generates a `Buchungsstapel` CSV file following the official DATEV format specification:
- **Header row** with advisor number, client number, fiscal year start, and export period
- **Transaction rows** with amount, debit/credit account, tax code, date, and booking text
- **Beleglink** — each row includes a reference to the source document in Paperless-ngx
- **Account mapping** — automatic assignment based on vendor and document type (configurable)
- **SKR03 and SKR04** chart of accounts supported
---
## Privacy and Security
### Why Local AI Matters
Your documents contain some of the most sensitive data in your life:
- **Tax returns** with income, deductions, and financial details
- **Contracts** with confidential terms and personal information
- **Medical bills** with health information
- **Bank statements** with account numbers and transaction history
- **Personal correspondence** with private content
Cloud-based document AI services require uploading this data to external servers for processing. Even with encryption and privacy policies, you are trusting a third party with your most sensitive information.
**PaperCortex takes a fundamentally different approach:**
- All AI processing runs on **your hardware** via Ollama
- Document content is sent only to **your local Ollama instance**
- Embeddings and extracted data are stored in a **local SQLite database**
- The only network traffic is between PaperCortex, your Paperless-ngx instance, and your Ollama server
- **No telemetry, no analytics, no external API calls**
**Your documents stay in your network. Period.**
### Security Best Practices
- Store the Paperless-ngx API token in environment variables, never in source code
- Run PaperCortex on the same network as Paperless-ngx and Ollama
- Use Docker networks to isolate services
- Regularly update Ollama and PaperCortex for security patches
---
## Configuration Reference
All configuration is done through environment variables. See `.env.example` for a complete template.
### Core Settings
| Variable | Default | Description |
|---|---|---|
| `PAPERLESS_URL` | `http://localhost:8000` | Paperless-ngx instance URL |
| `PAPERLESS_TOKEN` | *(required)* | Paperless-ngx API authentication token |
| `OLLAMA_URL` | `http://localhost:11434` | Ollama API endpoint |
| `OLLAMA_MODEL` | `qwen2.5:14b` | LLM model for classification and extraction |
| `OLLAMA_EMBEDDING_MODEL` | `nomic-embed-text` | Embedding model for semantic search |
| `VECTOR_DB_PATH` | `./data/vectors.db` | Path to the SQLite vector database |
### Processing Settings
| Variable | Default | Description |
|---|---|---|
| `POLL_INTERVAL` | `300` | Seconds between polling Paperless-ngx for new documents |
| `BATCH_SIZE` | `10` | Number of documents to process per batch |
| `EMBEDDING_DIMENSIONS` | `768` | Vector dimensions (must match embedding model) |
| `CLASSIFICATION_CONFIDENCE` | `0.7` | Minimum confidence to auto-apply classifications |
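A sketch of how the core and processing settings above could be read into a typed object with the documented defaults. The `Config` shape and the `loadConfig` and `numberFrom` helpers are hypothetical; only the variable names and default values come from the tables.

```typescript
// Subset of the documented settings; the loader itself is an assumption.
interface Config {
  paperlessUrl: string;
  paperlessToken: string;
  ollamaUrl: string;
  ollamaModel: string;
  embeddingModel: string;
  vectorDbPath: string;
  pollIntervalSeconds: number;
  batchSize: number;
  embeddingDimensions: number;
  classificationConfidence: number;
}

type Env = { [key: string]: string | undefined };

// Read a numeric variable, falling back to the documented default.
function numberFrom(env: Env, key: string, fallback: number): number {
  const raw = env[key];
  if (raw === undefined || raw.trim() === "") return fallback;
  const value = Number(raw);
  if (!isFinite(value)) throw new Error(`${key} must be a number, got "${raw}"`);
  return value;
}

function loadConfig(env: Env): Config {
  const token = env["PAPERLESS_TOKEN"];
  if (!token) throw new Error("PAPERLESS_TOKEN is required");
  return {
    paperlessUrl: env["PAPERLESS_URL"] || "http://localhost:8000",
    paperlessToken: token,
    ollamaUrl: env["OLLAMA_URL"] || "http://localhost:11434",
    ollamaModel: env["OLLAMA_MODEL"] || "qwen2.5:14b",
    embeddingModel: env["OLLAMA_EMBEDDING_MODEL"] || "nomic-embed-text",
    vectorDbPath: env["VECTOR_DB_PATH"] || "./data/vectors.db",
    pollIntervalSeconds: numberFrom(env, "POLL_INTERVAL", 300),
    batchSize: numberFrom(env, "BATCH_SIZE", 10),
    embeddingDimensions: numberFrom(env, "EMBEDDING_DIMENSIONS", 768),
    classificationConfidence: numberFrom(env, "CLASSIFICATION_CONFIDENCE", 0.7),
  };
}
```

In a Node.js process the usual call would be `loadConfig(process.env)`; taking the environment as a parameter keeps the function easy to test.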
### Export Settings
| Variable | Default | Description |
|---|---|---|
| `DATEV_ADVISOR_NUMBER` | *(optional)* | Steuerberater number for DATEV export header |
| `DATEV_CLIENT_NUMBER` | *(optional)* | Mandantennummer for DATEV export header |
| `DATEV_FISCAL_YEAR_START` | `01-01` | Fiscal year start (MM-DD) |
| `DEFAULT_ACCOUNT_PLAN` | `SKR03` | Default chart of accounts (`SKR03` or `SKR04`) |
| `EXPORT_DIR` | `./exports` | Directory for generated export files |
### MCP Server Settings
| Variable | Default | Description |
|---|---|---|
| `MCP_TRANSPORT` | `stdio` | MCP transport mode (`stdio` or `http`) |
| `MCP_PORT` | `3100` | Port for HTTP transport mode |
| `MCP_AUTH_TOKEN` | *(optional)* | Bearer token for HTTP transport authentication |
---
## Supported Models
PaperCortex works with any Ollama-compatible model. Recommended configurations:
### For Classification and Extraction
| Model | VRAM | Speed | Quality | Recommended For |
|---|---|---|---|---|
| `qwen2.5:7b` | 5 GB | Fast | Good | Raspberry Pi, low-end servers |
| `qwen2.5:14b` | 10 GB | Medium | Very Good | Most homelab setups |
| `qwen2.5:32b` | 20 GB | Slow | Excellent | High-accuracy requirements |
| `llama3.1:8b` | 5 GB | Fast | Good | Alternative to Qwen |
| `mistral:7b` | 5 GB | Fast | Good | European language focus |
### For Embeddings
| Model | Dimensions | Speed | Quality |
|---|---|---|---|
| `nomic-embed-text` | 768 | Very Fast | Very Good |
| `mxbai-embed-large` | 1024 | Fast | Excellent |
| `all-minilm` | 384 | Fastest | Good |
---
## Project Structure
```
PaperCortex/
├── src/
│   ├── mcp-server/              # MCP Server for AI agent integration
│   │   ├── index.ts             # Server entry point and tool registration
│   │   └── tools/
│   │       ├── search.ts        # Semantic document search tool
│   │       ├── classify.ts      # Auto-classification tool
│   │       ├── receipt.ts       # Receipt data extraction tool
│   │       ├── query.ts         # Natural language query tool
│   │       └── export.ts        # DATEV/CSV export tool
│   ├── embeddings/
│   │   ├── ollama.ts            # Ollama embedding API client
│   │   └── store.ts             # SQLite vector store with HNSW index
│   ├── paperless/
│   │   ├── client.ts            # Paperless-ngx REST API client
│   │   └── types.ts             # TypeScript type definitions
│   └── receipt/
│       ├── extractor.ts         # Receipt OCR content parsing and extraction
│       ├── matcher.ts           # Bank CSV transaction matching engine
│       └── datev.ts             # DATEV Buchungsstapel CSV formatter
├── docs/
│   ├── architecture.md          # Detailed architecture documentation
│   ├── setup.md                 # Step-by-step installation guide
│   └── receipts.md              # Receipt workflow documentation
├── docker-compose.yml           # Production deployment
├── Dockerfile                   # Container build
├── .env.example                 # Configuration template (no secrets!)
├── package.json
├── tsconfig.json
└── LICENSE                      # MIT
```
---
## Roadmap
- [x] Core MCP Server with 5 tools
- [x] Paperless-ngx API client
- [x] Ollama embedding generation
- [x] SQLite vector store
- [x] Receipt data extraction
- [x] DATEV export
- [x] Docker deployment
- [ ] Bank CSV matching engine
- [ ] Web dashboard UI
- [ ] Webhook support (instant processing on document arrival)
- [ ] Multi-user support with separate vector stores
- [ ] Additional export formats (SKR04 mapping, FiBu, CSV+)
- [ ] Ollama vision model support for direct image analysis
- [ ] Automated document workflow triggers
- [ ] Plugin system for custom extractors
- [ ] Prometheus metrics endpoint
---
## Contributing
Contributions are welcome! PaperCortex is early-stage and there are many ways to help:
### Getting Started
```bash
git clone https://github.com/renefichtmueller/PaperCortex.git
cd PaperCortex
npm install
cp .env.example .env
# Edit .env with your local Paperless-ngx and Ollama settings
npm run dev
```
### How to Contribute
1. **Fork** the repository
2. **Create** a feature branch (`git checkout -b feat/amazing-feature`)
3. **Write tests** for your changes
4. **Commit** using conventional commits (`feat:`, `fix:`, `docs:`, `refactor:`)
5. **Push** and open a Pull Request
### Areas Where Help is Needed
| Area | Description | Difficulty |
|---|---|---|
| **Bank CSV Parsers** | Add parsers for different bank export formats (Sparkasse, ING, N26, Revolut, etc.) | Easy |
| **Export Formats** | Additional accounting export formats beyond DATEV | Medium |
| **Web Dashboard** | Build a simple web UI for browsing indexed documents and extracted data | Medium |
| **Multi-language** | Improve extraction accuracy for non-English/German receipts | Medium |
| **Vision Models** | Use Ollama vision models to extract data directly from receipt images | Hard |
| **Webhooks** | React to Paperless-ngx document events in real-time | Medium |
---
## Frequently Asked Questions
**Q: Does PaperCortex modify my documents in Paperless-ngx?**
A: By default, PaperCortex only reads documents. When you use the `classify` tool with `apply: true`, it can write tags, document types, and correspondents back to Paperless-ngx. Extraction results and embeddings are stored in PaperCortex's own database.
**Q: How much disk space does the vector database need?**
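A: It depends on the embedding model. Assuming one vector per document stored as 32-bit floats, each embedding takes 4 bytes per dimension: about 1.5 KB at 384 dimensions and about 3 KB at the default 768 dimensions. A collection of 10,000 documents therefore needs roughly 15-30 MB of vector storage, plus metadata overhead.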
**Q: Can I use OpenAI instead of Ollama?**
A: PaperCortex is designed for local-first operation with Ollama. Support for OpenAI-compatible APIs (including local alternatives like LM Studio, vLLM, or LocalAI) is on the roadmap.
**Q: What Paperless-ngx version is required?**
A: PaperCortex works with Paperless-ngx 2.0 and later (REST API v3+).
**Q: Can I run PaperCortex on a Raspberry Pi?**
A: PaperCortex itself is lightweight. The bottleneck is Ollama — you'll need a model that fits in your available RAM. `qwen2.5:7b` works on 8GB devices.
**Q: Is DATEV export only for Germany?**
A: The DATEV format is the German standard, but PaperCortex also exports plain CSV that works with any accounting software worldwide.
---
## License
MIT License — see [LICENSE](LICENSE) for details.
Free to use, modify, and distribute. Commercial use welcome.
---
## Acknowledgments
Built on the shoulders of giants:
- **[Paperless-ngx](https://github.com/paperless-ngx/paperless-ngx)** — The incredible open-source document management system (37k+ stars)
- **[Ollama](https://ollama.com)** — Making local AI accessible to everyone
- **[Model Context Protocol](https://modelcontextprotocol.io)** — The open standard for AI tool integration by Anthropic
- **[better-sqlite3](https://github.com/WiseLibs/better-sqlite3)** — Fast, reliable SQLite bindings for Node.js
---
## Star History
If PaperCortex is useful to you, please consider giving it a star — it helps others discover the project!
---
<p align="center">
<strong>Your documents. Your AI. Your hardware.</strong><br/>
<em>No cloud required.</em>
</p>

36
docker-compose.yml Normal file

@ -0,0 +1,36 @@
services:
  papercortex:
    build: .
    container_name: papercortex
    restart: unless-stopped
    ports:
      - "3100:3100"
    volumes:
      - papercortex-data:/app/data
    env_file:
      - .env
    environment:
      - NODE_ENV=production
    depends_on:
      - ollama

  ollama:
    image: ollama/ollama:latest
    container_name: papercortex-ollama
    restart: unless-stopped
    ports:
      - "11434:11434"
    volumes:
      - ollama-models:/root/.ollama
    # Uncomment for NVIDIA GPU support:
    # deploy:
    #   resources:
    #     reservations:
    #       devices:
    #         - driver: nvidia
    #           count: all
    #           capabilities: [gpu]

volumes:
  papercortex-data:
  ollama-models:

64
docs/architecture.md Normal file

@ -0,0 +1,64 @@
# Architecture
## Overview
PaperCortex is structured as three layers:
1. **MCP Server Layer** -- Exposes tools via the Model Context Protocol for AI agent integration.
2. **Intelligence Layer** -- Embedding generation, classification, receipt extraction, and query answering.
3. **Data Layer** -- Paperless-ngx API client and local SQLite vector store.
## Components
### MCP Server (`src/mcp-server/`)
The entry point for all AI agent interactions. Implements the MCP standard using `@modelcontextprotocol/sdk` and communicates via stdio transport.
Each tool is implemented as a separate handler module under `src/mcp-server/tools/`.
### Embeddings (`src/embeddings/`)
- **ollama.ts** -- Client for the Ollama API. Handles embedding generation and LLM completions.
- **store.ts** -- SQLite-backed vector store using `better-sqlite3`. Stores document embeddings and supports cosine similarity search.
The current implementation uses brute-force search: every query is compared against all stored embeddings in a linear scan. This stays practical up to roughly 100k documents. For larger archives, consider migrating to `sqlite-vss` or a dedicated vector database.
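For reference, brute-force cosine similarity search amounts to scoring every stored vector against the query and keeping the best hits. The sketch below works on in-memory arrays; the `StoredVector` shape and the function names are assumptions, and the real `store.ts` reads its vectors from SQLite rather than from an array.

```typescript
// Hypothetical in-memory shapes for illustration.
interface StoredVector { documentId: number; embedding: number[] }
interface SearchHit { documentId: number; score: number }

// Cosine similarity: dot(a, b) / (|a| * |b|). Returns 0 for zero vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Linear scan: score every vector, sort by score descending, keep the top hits.
function bruteForceSearch(query: number[], vectors: StoredVector[], limit = 10): SearchHit[] {
  return vectors
    .map((v) => ({ documentId: v.documentId, score: cosineSimilarity(query, v.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, limit);
}
```

The cost grows linearly with the number of stored vectors and their dimension, which is why the approach eventually needs an index for very large archives.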
### Paperless Integration (`src/paperless/`)
- **client.ts** -- REST API client for Paperless-ngx. Supports document CRUD, search, tags, correspondents, and document types.
- **types.ts** -- TypeScript type definitions matching the Paperless-ngx API v3+ schema.
### Receipt Processing (`src/receipt/`)
- **extractor.ts** -- Uses LLM to extract structured data from receipt OCR text.
- **matcher.ts** -- Matches extracted receipts against bank CSV transaction exports.
- **datev.ts** -- Generates DATEV Buchungsstapel format CSV for German accounting software.
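
Extraction depends on the LLM returning parseable JSON, and local models sometimes wrap their answer in a Markdown code fence or add a sentence around it. The helper below is a hedged sketch of one way to tolerate that. `parseJsonFromLlm` is a hypothetical name and is not part of `extractor.ts`.

```typescript
// Built at runtime so this example does not contain a literal Markdown fence.
const FENCE = "`".repeat(3);
const FENCED_BLOCK = new RegExp(FENCE + "(?:json)?\\s*([\\s\\S]*?)" + FENCE, "i");

// Hypothetical helper: pull the first JSON object out of an LLM reply.
function parseJsonFromLlm(reply: string): unknown {
  // Prefer the body of a fenced block when one is present.
  const fenced = FENCED_BLOCK.exec(reply);
  const candidate = (fenced?.[1] ?? reply).trim();

  // Otherwise fall back to the span between the first "{" and the last "}".
  const start = candidate.indexOf("{");
  const end = candidate.lastIndexOf("}");
  if (start === -1 || end <= start) {
    throw new Error("No JSON object found in LLM reply");
  }
  return JSON.parse(candidate.slice(start, end + 1));
}
```

The result is typed as `unknown` on purpose: the caller still has to validate the shape before trusting any field.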
## Data Flow
```
Paperless-ngx --(REST API)--> PaperCortex --(Ollama API)--> Ollama
                                   |
                                   v
                           SQLite Vector DB
                                   |
                                   v
                          MCP Server (stdio)
                                   |
                                   v
                        Claude Code / AI Agents
```
## Security Model
- All data stays local -- no external API calls except to Paperless-ngx and Ollama (both self-hosted).
- API tokens are read from environment variables, never hardcoded.
- The SQLite database is stored on the local filesystem with configurable path.
- MCP Server communicates via stdio (no network port required for MCP).
## Future Considerations
- **Webhook support** -- Listen for Paperless-ngx webhooks to auto-process new documents.
- **Plugin system** -- Allow custom extractors and exporters.
- **Web dashboard** -- Optional UI for monitoring and manual review.
- **Multi-user** -- Support multiple Paperless-ngx instances and user isolation.

101
docs/receipts.md Normal file

@ -0,0 +1,101 @@
# Receipt Workflow
## Overview
PaperCortex provides a complete receipt-to-accounting pipeline:
1. **Scan** -- Upload receipts to Paperless-ngx (scan, email, photo)
2. **Extract** -- AI extracts structured data (vendor, date, amounts, line items)
3. **Match** -- Reconcile against bank CSV exports
4. **Export** -- Generate DATEV-compatible CSV for accounting software
## Receipt Extraction
### Via MCP Server (Claude Code)
```
Extract receipt data from document #1234
```
### Via CLI
```bash
npm run receipt:extract -- --document-id 1234
```
### Extracted Fields
| Field | Description | Example |
|---|---|---|
| vendor | Company name | "IKEA Deutschland GmbH" |
| vendorAddress | Full address | "Am Wanderweg 1, 65719 Hofheim" |
| vendorTaxId | Tax ID / VAT number | "DE 129 341 800" |
| date | Receipt date | "2024-03-15" |
| currency | ISO 4217 code | "EUR" |
| subtotal | Before tax | 84.03 |
| taxRate | Tax percentage | 19 |
| taxAmount | Tax amount | 15.97 |
| totalAmount | Total with tax | 100.00 |
| paymentMethod | How it was paid | "card" |
| lineItems | Individual items | Array of items |
| category | Expense category | "office_supplies" |
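
The amounts in the example column are internally consistent: 84.03 + 15.97 = 100.00, and 19% of 84.03 is 15.97 after rounding to cents. A check along these lines can flag extractions where a figure was misread. The sketch below is illustrative only: `ReceiptAmounts` and `checkAmounts` are hypothetical names, and the one-cent tolerance is an assumption.

```typescript
// Hypothetical subset of the extracted fields listed in the table above.
interface ReceiptAmounts {
  readonly subtotal: number;
  readonly taxRate: number; // percentage, e.g. 19
  readonly taxAmount: number;
  readonly totalAmount: number;
}

// Work in integer cents to avoid floating-point drift when adding amounts.
const toCents = (amount: number): number => Math.round(amount * 100);

// Return a list of human-readable problems; an empty list means the amounts agree.
function checkAmounts(receipt: ReceiptAmounts, toleranceCents = 1): string[] {
  const problems: string[] = [];
  const subtotal = toCents(receipt.subtotal);
  const tax = toCents(receipt.taxAmount);
  const total = toCents(receipt.totalAmount);

  if (Math.abs(subtotal + tax - total) > toleranceCents) {
    problems.push("subtotal + taxAmount does not equal totalAmount");
  }
  const expectedTax = Math.round((subtotal * receipt.taxRate) / 100);
  if (Math.abs(expectedTax - tax) > toleranceCents) {
    problems.push("taxAmount does not match taxRate applied to subtotal");
  }
  return problems;
}
```

A receipt that fails either check is a reasonable candidate for manual review rather than automatic export.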
## Bank Statement Matching
Match receipts against bank CSV exports to verify which receipts correspond to which bank transactions.
### Supported Bank Formats
- Sparkasse (semicolon-separated, German format)
- ING (semicolon-separated)
- DKB (semicolon-separated)
- Volksbank (semicolon-separated)
- Generic CSV
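
German bank exports typically write amounts with a comma as the decimal separator and a period as the thousands separator (`1.234,56`), and dates as `DD.MM.YYYY`. The sketch below shows one way to normalize both before matching. It assumes exactly those conventions (some exports use two-digit years, which this does not handle), and `parseGermanAmount` and `parseGermanDate` are hypothetical helpers rather than functions from `matcher.ts`.

```typescript
// Convert "1.234,56" or "-45,00" to a number.
// Assumes "." groups thousands and "," marks the decimals.
function parseGermanAmount(raw: string): number {
  const normalized = raw.trim().replace(/\./g, "").replace(",", ".");
  const value = Number(normalized);
  if (normalized === "" || Number.isNaN(value)) {
    throw new Error(`Not a German-format amount: "${raw}"`);
  }
  return value;
}

// Convert "15.03.2024" to the ISO form "2024-03-15" used by the extracted `date` field.
function parseGermanDate(raw: string): string {
  const match = /^(\d{2})\.(\d{2})\.(\d{4})$/.exec(raw.trim());
  if (!match) {
    throw new Error(`Not a DD.MM.YYYY date: "${raw}"`);
  }
  const [, day, month, year] = match;
  return `${year}-${month}-${day}`;
}
```

Normalizing to plain numbers and ISO dates up front means the matching step can compare receipts and transactions without knowing which bank produced the file.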
### Matching Algorithm
1. **Amount match** -- Exact or close amount (within 1.00 tolerance)
2. **Date proximity** -- Same day, within 3 days, or within 7 days
3. **Vendor name** -- Partial match in transaction description
Results include a confidence score (0.0 - 1.0) and match reasons.
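
The three signals above could be combined into a confidence score roughly as follows. Only the 1.00 amount tolerance and the same-day / 3-day / 7-day buckets come from the description above. The point weights, the first-word vendor comparison, and all type and function names are assumptions chosen for illustration.

```typescript
interface Receipt {
  readonly vendor: string;
  readonly date: string; // ISO date, e.g. "2024-03-15"
  readonly totalAmount: number;
}

interface Transaction {
  readonly description: string;
  readonly date: string; // ISO date
  readonly amount: number; // debits are often negative in bank exports
}

interface MatchResult {
  readonly confidence: number; // 0.0 - 1.0
  readonly reasons: readonly string[];
}

const DAY_MS = 24 * 60 * 60 * 1000;

function scoreMatch(receipt: Receipt, tx: Transaction): MatchResult {
  const reasons: string[] = [];
  let points = 0; // integer points out of 100, converted at the end

  // 1. Amount: compare absolute values so the sign of the booking does not matter.
  const diff = Math.abs(Math.abs(tx.amount) - receipt.totalAmount);
  if (diff < 0.005) {
    points += 50;
    reasons.push("exact amount");
  } else if (diff <= 1.0) {
    points += 30;
    reasons.push("amount within 1.00");
  }

  // 2. Date proximity in whole-day buckets.
  const days = Math.abs(Date.parse(tx.date) - Date.parse(receipt.date)) / DAY_MS;
  if (days < 1) {
    points += 30;
    reasons.push("same day");
  } else if (days <= 3) {
    points += 20;
    reasons.push("within 3 days");
  } else if (days <= 7) {
    points += 10;
    reasons.push("within 7 days");
  }

  // 3. Vendor: case-insensitive check for the first word of the vendor name.
  const firstWord = receipt.vendor.trim().split(/\s+/)[0]?.toLowerCase() ?? "";
  if (firstWord !== "" && tx.description.toLowerCase().includes(firstWord)) {
    points += 20;
    reasons.push("vendor name in description");
  }

  return { confidence: Math.min(100, points) / 100, reasons };
}
```

Accumulating integer points and dividing once at the end keeps the score free of floating-point rounding noise, and the `reasons` list makes each match explainable during manual review.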
## DATEV Export
### Format
PaperCortex generates DATEV Buchungsstapel (posting batch) format CSV, compatible with:
- DATEV Unternehmen Online
- lexoffice
- sevDesk
- FastBill
- Any DATEV-import-capable software
### Account Mapping (SKR03)
| Category | Account | Description |
|---|---|---|
| office_supplies | 4930 | Buerokosten |
| travel | 4660 | Reisekosten |
| food | 4650 | Bewirtungskosten |
| telephone | 4920 | Telefon |
| postage | 4910 | Porto |
| rent | 4210 | Miete |
| advertising | 4600 | Werbekosten |
| software | 4964 | Software |
| consulting | 4950 | Rechts- und Beratungskosten |
| default | 4900 | Sonstige Aufwendungen |
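
The table above amounts to a lookup with a fallback. A minimal sketch follows; the account numbers are copied from the table, while `SKR03_ACCOUNTS` and `accountForCategory` are hypothetical names and not necessarily how `datev.ts` structures the mapping.

```typescript
// Category -> SKR03 account, copied from the table above.
// A Map is used so that strings like "constructor" cannot hit inherited object properties.
const SKR03_ACCOUNTS: ReadonlyMap<string, number> = new Map([
  ["office_supplies", 4930],
  ["travel", 4660],
  ["food", 4650],
  ["telephone", 4920],
  ["postage", 4910],
  ["rent", 4210],
  ["advertising", 4600],
  ["software", 4964],
  ["consulting", 4950],
]);

const DEFAULT_ACCOUNT = 4900; // Sonstige Aufwendungen

// Unknown or missing categories fall back to the default account.
function accountForCategory(category: string | null | undefined): number {
  if (!category) return DEFAULT_ACCOUNT;
  return SKR03_ACCOUNTS.get(category) ?? DEFAULT_ACCOUNT;
}
```

Because the category comes from LLM output, the fallback matters: an unexpected label still produces a valid posting that an accountant can reassign later.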
### Export via CLI
```bash
# Export all receipts from March 2024 as DATEV CSV
npm run receipt:export -- --format datev --year 2024 --month 03
```
### Export via MCP Server
```
Export documents #100, #101, #102 as DATEV CSV
```

107
docs/setup.md Normal file

@ -0,0 +1,107 @@
# Setup Guide
## Prerequisites
- **Node.js** 20+ (or Docker)
- **Paperless-ngx** instance with API access
- **Ollama** with required models
## Step 1: Install Ollama Models
```bash
# Required: LLM for classification and extraction
ollama pull qwen2.5:14b
# Required: Embedding model for semantic search
ollama pull nomic-embed-text
```
Verify Ollama is running:
```bash
curl http://localhost:11434/api/tags
```
## Step 2: Get Paperless-ngx API Token
1. Open your Paperless-ngx web UI
2. Go to Settings > API
3. Generate a new API token
4. Copy the token for the next step
## Step 3: Configure PaperCortex
```bash
git clone https://github.com/YOUR_USERNAME/PaperCortex.git
cd PaperCortex
cp .env.example .env
```
Edit `.env` with your values:
```env
PAPERLESS_URL=http://localhost:8000
PAPERLESS_TOKEN=<your-api-token>
OLLAMA_URL=http://localhost:11434
```
## Step 4: Run
### Option A: Docker (Recommended)
```bash
docker compose up -d
```
### Option B: Manual
```bash
npm install
npm run build
npm start
```
### Option C: Development
```bash
npm install
npm run dev
```
## Step 5: Register MCP Server
Add to your Claude Code configuration (`~/.claude.json`):
```json
{
  "mcpServers": {
    "papercortex": {
      "command": "node",
      "args": ["/absolute/path/to/PaperCortex/dist/mcp-server/index.js"],
      "env": {
        "PAPERLESS_URL": "http://localhost:8000",
        "PAPERLESS_TOKEN": "your-token",
        "OLLAMA_URL": "http://localhost:11434"
      }
    }
  }
}
```
## Step 6: Populate Vector Store
Semantic search and question answering only see documents that have already been embedded. Bulk embedding of an existing archive is not automated yet and is planned for a future release. For now, the vector store is populated as documents are queried or classified.
## Troubleshooting
### "Connection refused" to Paperless-ngx
- Verify the URL in `.env` is reachable
- Check that the API token is valid
- Ensure Paperless-ngx is running
### "Connection refused" to Ollama
- Run `ollama serve` if not already running
- Check the port (default: 11434)
- Verify models are pulled: `ollama list`
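
`ollama list` and the `/api/tags` endpoint report model names with an explicit tag, so a model pulled as `nomic-embed-text` appears as `nomic-embed-text:latest`. A check for missing models has to account for that. The sketch below is a hypothetical helper (`missingModels` is not part of PaperCortex) that compares the configured model names against the names reported by Ollama.

```typescript
// Ollama treats a model name without a tag as ":latest".
function withDefaultTag(name: string): string {
  return name.includes(":") ? name : `${name}:latest`;
}

// Return the required models that are not present in the list reported by Ollama.
function missingModels(
  required: readonly string[],
  available: readonly string[],
): string[] {
  const installed = new Set(available.map(withDefaultTag));
  return required.filter((name) => !installed.has(withDefaultTag(name)));
}
```

For example, with `required` set to the values of `OLLAMA_MODEL` and `OLLAMA_EMBEDDING_MODEL`, a non-empty result names exactly the models that still need an `ollama pull`.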
### Slow first query
- The first embedding generation may take longer as Ollama loads the model into memory
- Subsequent queries will be faster once the model is loaded

57
package.json Normal file

@ -0,0 +1,57 @@
{
"name": "papercortex",
"version": "0.1.0",
"description": "Self-hosted AI intelligence layer for Paperless-ngx with semantic search, receipt extraction, and MCP Server integration",
"main": "dist/mcp-server/index.js",
"type": "module",
"scripts": {
"build": "tsc",
"start": "node dist/mcp-server/index.js",
"dev": "tsx watch src/mcp-server/index.ts",
"lint": "eslint src/",
"test": "vitest",
"test:coverage": "vitest --coverage",
"receipt:extract": "tsx src/receipt/extractor.ts",
"receipt:match": "tsx src/receipt/matcher.ts",
"receipt:export": "tsx src/receipt/datev.ts"
},
"keywords": [
"paperless-ngx",
"ollama",
"mcp",
"mcp-server",
"semantic-search",
"document-ai",
"receipt-extraction",
"datev",
"self-hosted",
"local-ai",
"embeddings",
"vector-search"
],
"author": "",
"license": "MIT",
"repository": {
"type": "git",
"url": ""
},
"engines": {
"node": ">=20.0.0"
},
"dependencies": {
"@modelcontextprotocol/sdk": "^1.12.0",
"better-sqlite3": "^11.8.0",
"csv-parse": "^5.6.0",
"csv-stringify": "^6.5.0",
"dotenv": "^16.4.0",
"zod": "^3.24.0"
},
"devDependencies": {
"@types/better-sqlite3": "^7.6.12",
"@types/node": "^22.10.0",
"eslint": "^9.17.0",
"tsx": "^4.19.0",
"typescript": "^5.7.0",
"vitest": "^3.0.0"
}
}

148
src/embeddings/ollama.ts Normal file

@ -0,0 +1,148 @@
/**
* Ollama embedding and LLM integration.
*
* Generates vector embeddings and LLM completions using a local Ollama instance.
* Functions do not mutate their arguments; each call performs a request and returns a new result object.
*
* @example
* ```ts
* const ollama = createOllamaClient({
*   baseUrl: "http://localhost:11434",
*   model: "qwen2.5:14b",
*   embeddingModel: "nomic-embed-text",
* });
* const embedding = await ollama.embed("Office rent invoice March 2024");
* const answer = await ollama.complete("Classify this document: ...");
* ```
*/
// ---------------------------------------------------------------------------
// Types
// ---------------------------------------------------------------------------
export interface OllamaConfig {
readonly baseUrl: string;
readonly model: string;
readonly embeddingModel: string;
readonly timeout?: number;
}
export interface EmbeddingResult {
readonly vector: readonly number[];
readonly model: string;
readonly dimensions: number;
}
export interface CompletionResult {
readonly text: string;
readonly model: string;
readonly totalDuration: number;
}
export interface OllamaClient {
/** Generate an embedding vector for the given text. */
embed(text: string): Promise<EmbeddingResult>;
/** Generate a chat/instruct completion. */
complete(prompt: string, systemPrompt?: string): Promise<CompletionResult>;
/** Check if the Ollama server is reachable and models are available. */
healthCheck(): Promise<{ ok: boolean; models: readonly string[] }>;
}
// ---------------------------------------------------------------------------
// Implementation
// ---------------------------------------------------------------------------
/**
* Create an Ollama client for embeddings and completions.
*/
export function createOllamaClient(config: OllamaConfig): OllamaClient {
const { baseUrl, model, embeddingModel, timeout = 120_000 } = config;
async function post<T>(path: string, body: unknown): Promise<T> {
const url = `${baseUrl.replace(/\/+$/, "")}${path}`;
const controller = new AbortController();
const timer = setTimeout(() => controller.abort(), timeout);
try {
const response = await fetch(url, {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify(body),
signal: controller.signal,
});
if (!response.ok) {
const text = await response.text().catch(() => "");
throw new Error(`Ollama API error: ${response.status} -- ${text}`);
}
return (await response.json()) as T;
} finally {
clearTimeout(timer);
}
}
return {
async embed(text) {
// TODO: implement chunking for texts exceeding model context window
// TODO: add retry logic with exponential backoff
interface OllamaEmbedResponse {
embedding: number[];
}
const result = await post<OllamaEmbedResponse>("/api/embeddings", {
model: embeddingModel,
prompt: text,
});
return {
vector: result.embedding,
model: embeddingModel,
dimensions: result.embedding.length,
};
},
async complete(prompt, systemPrompt) {
// TODO: implement streaming support for long completions
// TODO: add structured output parsing (JSON mode)
interface OllamaGenerateResponse {
response: string;
model: string;
total_duration: number;
}
const result = await post<OllamaGenerateResponse>("/api/generate", {
model,
prompt,
system: systemPrompt ?? "",
stream: false,
});
return {
text: result.response,
model: result.model,
totalDuration: result.total_duration,
};
},
async healthCheck() {
try {
const url = `${baseUrl.replace(/\/+$/, "")}/api/tags`;
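// Apply the configured timeout so a hung server cannot block the health check.
const response = await fetch(url, { signal: AbortSignal.timeout(timeout) });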
if (!response.ok) return { ok: false, models: [] };
interface OllamaTagsResponse {
models: Array<{ name: string }>;
}
const data = (await response.json()) as OllamaTagsResponse;
return {
ok: true,
models: data.models.map((m) => m.name),
};
} catch {
return { ok: false, models: [] };
}
},
};
}

231
src/embeddings/store.ts Normal file

@ -0,0 +1,231 @@
/**
* Local SQLite-backed vector store for document embeddings.
*
* Stores embedding vectors alongside document metadata in a SQLite database
* using better-sqlite3. Supports cosine similarity search for semantic
* document retrieval.
*
* @example
* ```ts
* const store = createVectorStore({ dbPath: "./data/vectors.db" });
* store.upsert({
*   documentId: 42,
*   vector: [...],
*   content: "...",
*   title: "...",
*   tags: [],
*   createdAt: "2024-03-15",
* });
* const results = store.search(queryVector, { limit: 10 });
* ```
*/
import Database from "better-sqlite3";
// ---------------------------------------------------------------------------
// Types
// ---------------------------------------------------------------------------
export interface VectorStoreConfig {
readonly dbPath: string;
}
export interface DocumentEmbedding {
readonly documentId: number;
readonly vector: readonly number[];
readonly content: string;
readonly title: string;
readonly tags: readonly string[];
readonly createdAt: string;
}
export interface SearchResult {
readonly documentId: number;
readonly title: string;
readonly content: string;
readonly score: number;
readonly tags: readonly string[];
}
export interface SearchOptions {
readonly limit?: number;
readonly minScore?: number;
readonly tagFilter?: readonly string[];
}
export interface VectorStore {
/** Insert or update a document embedding. */
upsert(embedding: DocumentEmbedding): void;
/** Search for similar documents using cosine similarity. */
search(queryVector: readonly number[], options?: SearchOptions): readonly SearchResult[];
/** Remove an embedding by document ID. */
remove(documentId: number): void;
/** Get the total count of stored embeddings. */
count(): number;
/** Check if a document has been embedded. */
has(documentId: number): boolean;
/** Close the database connection. */
close(): void;
}
// ---------------------------------------------------------------------------
// Helpers
// ---------------------------------------------------------------------------
/**
* Compute cosine similarity between two vectors.
* Returns a value between -1 and 1 (1 = identical direction).
*/
function cosineSimilarity(a: readonly number[], b: readonly number[]): number {
if (a.length !== b.length) {
throw new Error(
`Vector dimension mismatch: ${a.length} vs ${b.length}`,
);
}
let dotProduct = 0;
let normA = 0;
let normB = 0;
for (let i = 0; i < a.length; i++) {
dotProduct += a[i] * b[i];
normA += a[i] * a[i];
normB += b[i] * b[i];
}
const denominator = Math.sqrt(normA) * Math.sqrt(normB);
if (denominator === 0) return 0;
return dotProduct / denominator;
}
// ---------------------------------------------------------------------------
// Implementation
// ---------------------------------------------------------------------------
/**
* Create a local vector store backed by SQLite.
*
* TODO: Consider migrating to sqlite-vss or DuckDB for ANN search at scale.
* The current brute-force approach works well for <100k documents.
*/
export function createVectorStore(config: VectorStoreConfig): VectorStore {
const db = new Database(config.dbPath);
// Enable WAL mode for better concurrent read performance
db.pragma("journal_mode = WAL");
// Create tables if they don't exist
db.exec(`
CREATE TABLE IF NOT EXISTS embeddings (
document_id INTEGER PRIMARY KEY,
vector BLOB NOT NULL,
content TEXT NOT NULL,
title TEXT NOT NULL,
tags TEXT NOT NULL DEFAULT '[]',
created_at TEXT NOT NULL,
updated_at TEXT NOT NULL DEFAULT (datetime('now'))
);
CREATE INDEX IF NOT EXISTS idx_embeddings_created
ON embeddings (created_at);
`);
// Prepared statements for performance
const upsertStmt = db.prepare(`
INSERT INTO embeddings (document_id, vector, content, title, tags, created_at, updated_at)
VALUES (?, ?, ?, ?, ?, ?, datetime('now'))
ON CONFLICT(document_id) DO UPDATE SET
vector = excluded.vector,
content = excluded.content,
title = excluded.title,
tags = excluded.tags,
updated_at = datetime('now')
`);
const getAllStmt = db.prepare(`
SELECT document_id, vector, content, title, tags FROM embeddings
`);
const removeStmt = db.prepare(`
DELETE FROM embeddings WHERE document_id = ?
`);
const countStmt = db.prepare(`
SELECT COUNT(*) as count FROM embeddings
`);
const hasStmt = db.prepare(`
SELECT 1 FROM embeddings WHERE document_id = ? LIMIT 1
`);
return {
upsert(embedding) {
const vectorBlob = Buffer.from(new Float32Array(embedding.vector).buffer);
upsertStmt.run(
embedding.documentId,
vectorBlob,
embedding.content,
embedding.title,
JSON.stringify(embedding.tags),
embedding.createdAt,
);
},
search(queryVector, options = {}) {
const { limit = 10, minScore = 0.5, tagFilter } = options;
// TODO: Implement ANN (approximate nearest neighbor) for large datasets
// Current approach: brute-force scan -- fine for <100k documents
interface EmbeddingRow {
document_id: number;
vector: Buffer;
content: string;
title: string;
tags: string;
}
const rows = getAllStmt.all() as EmbeddingRow[];
const scored = rows
.map((row) => {
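// Copy the bytes before reinterpreting them: a Node.js Buffer can be a view into a
// larger pooled ArrayBuffer, so `row.vector.buffer` may contain unrelated data.
const storedVector = Array.from(new Float32Array(new Uint8Array(row.vector).buffer));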
const tags: string[] = JSON.parse(row.tags);
const score = cosineSimilarity(queryVector, storedVector);
return {
documentId: row.document_id,
title: row.title,
content: row.content,
score,
tags,
};
})
.filter((result) => result.score >= minScore)
.filter((result) => {
if (!tagFilter || tagFilter.length === 0) return true;
return tagFilter.some((tag) => result.tags.includes(tag));
})
.sort((a, b) => b.score - a.score)
.slice(0, limit);
return scored;
},
remove(documentId) {
removeStmt.run(documentId);
},
count() {
const row = countStmt.get() as { count: number };
return row.count;
},
has(documentId) {
return hasStmt.get(documentId) !== undefined;
},
close() {
db.close();
},
};
}

249
src/mcp-server/index.ts Normal file

@ -0,0 +1,249 @@
/**
* PaperCortex MCP Server entry point.
*
* Exposes document intelligence tools via the Model Context Protocol (MCP)
* for integration with Claude Code and other AI agents.
*
* @see https://modelcontextprotocol.io
*/
import { Server } from "@modelcontextprotocol/sdk/server/index.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import {
CallToolRequestSchema,
ListToolsRequestSchema,
} from "@modelcontextprotocol/sdk/types.js";
import { config } from "dotenv";
import { createOllamaClient } from "../embeddings/ollama.js";
import { createVectorStore } from "../embeddings/store.js";
import { createPaperlessClient } from "../paperless/client.js";
import { handleClassify } from "./tools/classify.js";
import { handleExport } from "./tools/export.js";
import { handleQuery } from "./tools/query.js";
import { handleReceipt } from "./tools/receipt.js";
import { handleSearch } from "./tools/search.js";
// ---------------------------------------------------------------------------
// Configuration
// ---------------------------------------------------------------------------
config(); // Load .env
function requireEnv(key: string): string {
const value = process.env[key];
if (!value) {
throw new Error(`Missing required environment variable: ${key}`);
}
return value;
}
// ---------------------------------------------------------------------------
// Service initialization
// ---------------------------------------------------------------------------
const paperless = createPaperlessClient({
baseUrl: requireEnv("PAPERLESS_URL"),
token: requireEnv("PAPERLESS_TOKEN"),
});
const ollama = createOllamaClient({
baseUrl: process.env["OLLAMA_URL"] ?? "http://localhost:11434",
model: process.env["OLLAMA_MODEL"] ?? "qwen2.5:14b",
embeddingModel: process.env["OLLAMA_EMBEDDING_MODEL"] ?? "nomic-embed-text",
});
const vectorStore = createVectorStore({
dbPath: process.env["VECTOR_DB_PATH"] ?? "./data/vectors.db",
});
// ---------------------------------------------------------------------------
// Shared context for tool handlers
// ---------------------------------------------------------------------------
export interface ToolContext {
readonly paperless: typeof paperless;
readonly ollama: typeof ollama;
readonly vectorStore: typeof vectorStore;
}
const ctx: ToolContext = { paperless, ollama, vectorStore };
// ---------------------------------------------------------------------------
// MCP Server setup
// ---------------------------------------------------------------------------
const server = new Server(
{
name: "papercortex",
version: "0.1.0",
},
{
capabilities: {
tools: {},
},
},
);
/**
* List all available PaperCortex tools.
*/
server.setRequestHandler(ListToolsRequestSchema, async () => ({
tools: [
{
name: "papercortex_search",
description:
"Semantic search across all documents in Paperless-ngx. " +
"Finds documents by meaning, not just keywords.",
inputSchema: {
type: "object" as const,
properties: {
query: {
type: "string",
description: "Natural language search query",
},
limit: {
type: "number",
description: "Maximum number of results (default: 10)",
},
tags: {
type: "array",
items: { type: "string" },
description: "Filter by tag names",
},
},
required: ["query"],
},
},
{
name: "papercortex_classify",
description:
"Auto-classify a document using local AI. " +
"Suggests tags, document type, and correspondent.",
inputSchema: {
type: "object" as const,
properties: {
documentId: {
type: "number",
description: "Paperless-ngx document ID",
},
applyTags: {
type: "boolean",
description: "Automatically apply suggested tags (default: false)",
},
},
required: ["documentId"],
},
},
{
name: "papercortex_receipt",
description:
"Extract structured data from a receipt document: " +
"vendor, date, amounts, tax, line items.",
inputSchema: {
type: "object" as const,
properties: {
documentId: {
type: "number",
description: "Paperless-ngx document ID of the receipt",
},
},
required: ["documentId"],
},
},
{
name: "papercortex_query",
description:
"Ask natural language questions about your documents. " +
'Example: "How much did I spend on office supplies in Q1 2024?"',
inputSchema: {
type: "object" as const,
properties: {
question: {
type: "string",
description: "Natural language question about your documents",
},
maxDocuments: {
type: "number",
description:
"Maximum documents to include in context (default: 5)",
},
},
required: ["question"],
},
},
{
name: "papercortex_export",
description:
"Export receipt data as DATEV-compatible CSV for German accounting, " +
"or as generic CSV.",
inputSchema: {
type: "object" as const,
properties: {
documentIds: {
type: "array",
items: { type: "number" },
description: "Document IDs to export",
},
format: {
type: "string",
enum: ["datev", "csv"],
description: "Export format (default: datev)",
},
},
required: ["documentIds"],
},
},
],
}));
/**
* Route tool calls to their respective handlers.
*/
server.setRequestHandler(CallToolRequestSchema, async (request) => {
const { name, arguments: args } = request.params;
try {
switch (name) {
case "papercortex_search":
return await handleSearch(ctx, args as Record<string, unknown>);
case "papercortex_classify":
return await handleClassify(ctx, args as Record<string, unknown>);
case "papercortex_receipt":
return await handleReceipt(ctx, args as Record<string, unknown>);
case "papercortex_query":
return await handleQuery(ctx, args as Record<string, unknown>);
case "papercortex_export":
return await handleExport(ctx, args as Record<string, unknown>);
default:
return {
content: [
{ type: "text" as const, text: `Unknown tool: ${name}` },
],
isError: true,
};
}
} catch (error) {
const message =
error instanceof Error ? error.message : "Unknown error occurred";
return {
content: [{ type: "text" as const, text: `Error: ${message}` }],
isError: true,
};
}
});
// ---------------------------------------------------------------------------
// Start server
// ---------------------------------------------------------------------------
async function main(): Promise<void> {
const transport = new StdioServerTransport();
await server.connect(transport);
console.error("PaperCortex MCP Server running on stdio");
}
main().catch((error) => {
console.error("Fatal error starting PaperCortex:", error);
process.exit(1);
});


@ -0,0 +1,117 @@
/**
* Auto-classification tool for the PaperCortex MCP Server.
*
* Uses local LLM to analyze document content and suggest appropriate
* tags, document types, and correspondents.
*/
import type { ToolContext } from "../index.js";
// ---------------------------------------------------------------------------
// Types
// ---------------------------------------------------------------------------
interface ClassifyArgs {
readonly documentId: number;
readonly applyTags?: boolean;
}
interface ClassificationResult {
readonly suggestedTags: readonly string[];
readonly suggestedType: string | null;
readonly suggestedCorrespondent: string | null;
readonly summary: string;
readonly language: string;
readonly confidence: number;
}
// ---------------------------------------------------------------------------
// Prompts
// ---------------------------------------------------------------------------
const CLASSIFY_SYSTEM_PROMPT = `You are a document classification assistant. Analyze the document content and provide structured classification.
Respond with valid JSON only:
{
"suggestedTags": ["tag1", "tag2"],
"suggestedType": "invoice|contract|receipt|letter|report|tax_document|bank_statement|insurance|warranty|manual|other",
"suggestedCorrespondent": "Company or person name",
"summary": "One sentence summary",
"language": "ISO 639-1 code",
"confidence": 0.0 to 1.0
}`;
// ---------------------------------------------------------------------------
// Handler
// ---------------------------------------------------------------------------
/**
* Handle a `papercortex_classify` tool call.
*
* 1. Fetch document content from Paperless-ngx.
* 2. Send content to Ollama for classification.
* 3. Optionally apply suggested tags back to Paperless-ngx.
*
* TODO: Match suggested tags against existing Paperless-ngx tags
* TODO: Create new tags automatically when confidence is high
* TODO: Learn from user corrections to improve classification
*/
export async function handleClassify(
ctx: ToolContext,
args: Record<string, unknown>,
): Promise<{ content: Array<{ type: "text"; text: string }> }> {
const { documentId, applyTags = false } = args as unknown as ClassifyArgs;
// Fetch document from Paperless-ngx
const document = await ctx.paperless.getDocument(documentId);
if (!document.content || document.content.trim().length === 0) {
return {
content: [
{
type: "text",
text: `Document #${documentId} has no text content. OCR may not have completed.`,
},
],
};
}
// Classify using Ollama
const prompt = `Classify this document:\n\nTitle: ${document.title}\n\nContent:\n${document.content.slice(0, 4000)}`;
const completion = await ctx.ollama.complete(prompt, CLASSIFY_SYSTEM_PROMPT);
let classification: ClassificationResult;
try {
classification = JSON.parse(completion.text) as ClassificationResult;
} catch {
return {
content: [
{
type: "text",
text: `Classification failed: LLM did not return valid JSON.\nRaw response: ${completion.text.slice(0, 500)}`,
},
],
};
}
// Optionally apply tags
let appliedNote = "";
if (applyTags && classification.suggestedTags.length > 0) {
// TODO: Look up tag IDs from Paperless-ngx, create missing tags
appliedNote =
"\n\nNote: Tag application is not yet implemented. " +
"Tags need to be matched against existing Paperless-ngx tags.";
}
const output =
`Classification for Document #${documentId} "${document.title}":\n\n` +
`Type: ${classification.suggestedType ?? "unknown"}\n` +
`Correspondent: ${classification.suggestedCorrespondent ?? "unknown"}\n` +
`Tags: ${classification.suggestedTags.join(", ") || "none"}\n` +
`Language: ${classification.language}\n` +
`Summary: ${classification.summary}\n` +
`Confidence: ${(classification.confidence * 100).toFixed(0)}%` +
appliedNote;
return { content: [{ type: "text", text: output }] };
}


@ -0,0 +1,116 @@
/**
* DATEV/CSV export tool for the PaperCortex MCP Server.
*
* Exports receipt data in accounting-compatible formats.
*/
import { createReceiptExtractor } from "../../receipt/extractor.js";
import { createDatevExporter } from "../../receipt/datev.js";
import type { ToolContext } from "../index.js";
// ---------------------------------------------------------------------------
// Types
// ---------------------------------------------------------------------------
interface ExportArgs {
readonly documentIds: readonly number[];
readonly format?: "datev" | "csv";
}
// ---------------------------------------------------------------------------
// Handler
// ---------------------------------------------------------------------------
/**
* Handle a `papercortex_export` tool call.
*
* 1. Extract receipt data from all specified documents.
* 2. Format as DATEV or generic CSV.
* 3. Return the CSV content.
*
* TODO: Add file output option (save to disk)
* TODO: Add date range filtering
* TODO: Add DATEV header metadata (consultant/client numbers from config)
*/
export async function handleExport(
ctx: ToolContext,
args: Record<string, unknown>,
): Promise<{ content: Array<{ type: "text"; text: string }> }> {
const { documentIds, format = "datev" } = args as unknown as ExportArgs;
if (!documentIds || documentIds.length === 0) {
return {
content: [
{
type: "text",
text: "Error: at least one document ID is required for export.",
},
],
};
}
// Extract receipt data from all documents
const extractor = createReceiptExtractor({
ollama: ctx.ollama,
paperless: ctx.paperless,
});
const receipts = await extractor.extractBatch(documentIds);
if (format === "datev") {
// TODO: Read consultant/client numbers from configuration
const exporter = createDatevExporter({
consultantNumber: 0,
clientNumber: 0,
});
const receiptsForExport = receipts.map((r) => ({
documentId: r.documentId,
vendor: r.vendor,
date: r.date,
totalAmount: r.totalAmount,
taxRate: r.taxRate,
category: r.category,
}));
const csv = exporter.generateCsv(receiptsForExport);
return {
content: [
{
type: "text",
text:
`DATEV export for ${receipts.length} receipt(s):\n\n` +
"```csv\n" +
csv +
"\n```\n\n" +
"Copy this CSV content into a file and import into your " +
"DATEV-compatible accounting software.",
},
],
};
}
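// Generic CSV format (semicolon-separated). Fields containing the delimiter,
// a double quote, or a line break are quoted so they cannot break a row.
const escapeField = (value: unknown): string => {
const text = String(value ?? "");
return /[";\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
};
const header = "Document ID;Vendor;Date;Amount;Tax Rate;Tax Amount;Currency;Category";
const rows = receipts.map((r) =>
[
r.documentId,
r.vendor,
r.date,
r.totalAmount.toFixed(2),
r.taxRate,
r.taxAmount?.toFixed(2),
r.currency,
r.category,
]
.map(escapeField)
.join(";"),
);
const csv = [header, ...rows].join("\n");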
return {
content: [
{
type: "text",
text:
`CSV export for ${receipts.length} receipt(s):\n\n` +
"```csv\n" +
csv +
"\n```",
},
],
};
}


@ -0,0 +1,110 @@
/**
* Natural language query tool for the PaperCortex MCP Server.
*
* Answers questions about documents using RAG (Retrieval-Augmented Generation):
* retrieves relevant documents via semantic search, then generates an answer
* using the local LLM with document context.
*/
import type { ToolContext } from "../index.js";
// ---------------------------------------------------------------------------
// Types
// ---------------------------------------------------------------------------
interface QueryArgs {
readonly question: string;
readonly maxDocuments?: number;
}
// ---------------------------------------------------------------------------
// Prompts
// ---------------------------------------------------------------------------
const QUERY_SYSTEM_PROMPT = `You are a document analysis assistant. Answer the user's question based ONLY on the provided document excerpts. If the documents don't contain enough information to answer, say so clearly.
Be precise with numbers, dates, and amounts. Cite document IDs when referencing specific documents.`;
// ---------------------------------------------------------------------------
// Handler
// ---------------------------------------------------------------------------
/**
* Handle a `papercortex_query` tool call.
*
* Uses RAG (Retrieval-Augmented Generation):
* 1. Embed the question and retrieve relevant documents.
* 2. Build a context from retrieved documents.
* 3. Generate an answer using the local LLM.
*
* TODO: Add conversation history for follow-up questions
* TODO: Add source citation with page numbers
* TODO: Implement query decomposition for complex questions
*/
export async function handleQuery(
ctx: ToolContext,
args: Record<string, unknown>,
): Promise<{ content: Array<{ type: "text"; text: string }> }> {
const { question, maxDocuments = 5 } = args as unknown as QueryArgs;
if (!question || question.trim().length === 0) {
return {
content: [{ type: "text", text: "Error: question cannot be empty." }],
};
}
// Step 1: Retrieve relevant documents
const queryEmbedding = await ctx.ollama.embed(question);
const relevantDocs = ctx.vectorStore.search(queryEmbedding.vector, {
limit: maxDocuments,
minScore: 0.3,
});
if (relevantDocs.length === 0) {
return {
content: [
{
type: "text",
text:
`I couldn't find any relevant documents to answer: "${question}"\n\n` +
"The vector store may need to be populated first, or your documents " +
"may not contain information related to this question.",
},
],
};
}
// Step 2: Build context from retrieved documents
const context = relevantDocs
.map(
(doc) =>
`--- Document #${doc.documentId}: ${doc.title} (relevance: ${doc.score.toFixed(2)}) ---\n` +
doc.content.slice(0, 2000),
)
.join("\n\n");
// Step 3: Generate answer with context
const prompt =
`Based on the following documents, answer this question: "${question}"\n\n` +
`Documents:\n${context}`;
const completion = await ctx.ollama.complete(prompt, QUERY_SYSTEM_PROMPT);
const sourcesNote = relevantDocs
.map(
(doc) =>
` - Document #${doc.documentId}: ${doc.title} (score: ${doc.score.toFixed(2)})`,
)
.join("\n");
return {
content: [
{
type: "text",
text:
`${completion.text}\n\n` +
`---\nSources (${relevantDocs.length} documents):\n${sourcesNote}`,
},
],
};
}
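The context-building and prompt steps of the handler above can be tried without Ollama or a vector store. The following is a minimal sketch, assuming a hypothetical `RetrievedDoc` shape that mirrors the fields the handler reads from search results; `buildContext` and `buildPrompt` are illustrative names and are not part of PaperCortex.

```typescript
// Hypothetical shape mirroring the fields handleQuery reads from each search hit.
interface RetrievedDoc {
  readonly documentId: number;
  readonly title: string;
  readonly score: number;
  readonly content: string;
}

/** Join retrieved documents into one context string, truncating each excerpt. */
function buildContext(docs: readonly RetrievedDoc[], maxCharsPerDoc = 2000): string {
  return docs
    .map(
      (doc) =>
        `--- Document #${doc.documentId}: ${doc.title} (relevance: ${doc.score.toFixed(2)}) ---\n` +
        doc.content.slice(0, maxCharsPerDoc),
    )
    .join("\n\n");
}

/** Wrap the question and the context into the user prompt sent to the LLM. */
function buildPrompt(question: string, ragContext: string): string {
  return (
    `Based on the following documents, answer this question: "${question}"\n\n` +
    `Documents:\n${ragContext}`
  );
}

const sampleDocs: RetrievedDoc[] = [
  { documentId: 12, title: "Lease 2024", score: 0.81234, content: "Monthly rent: 950 EUR" },
  { documentId: 40, title: "Utility bill", score: 0.42, content: "x".repeat(5000) },
];
const sampleContext = buildContext(sampleDocs, 100);
const samplePrompt = buildPrompt("What is the monthly rent?", sampleContext);
console.log(samplePrompt);
```

Truncating per document keeps the prompt size bounded by `maxDocuments * maxCharsPerDoc`, at the cost of dropping text that appears late in long documents.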


@ -0,0 +1,76 @@
/**
* Receipt extraction tool for the PaperCortex MCP Server.
*
* Extracts structured receipt data from Paperless-ngx documents
* using local LLM analysis.
*/
import { createReceiptExtractor } from "../../receipt/extractor.js";
import type { ToolContext } from "../index.js";
// ---------------------------------------------------------------------------
// Types
// ---------------------------------------------------------------------------
interface ReceiptArgs {
readonly documentId: number;
}
// ---------------------------------------------------------------------------
// Handler
// ---------------------------------------------------------------------------
/**
* Handle a `papercortex_receipt` tool call.
*
* 1. Fetch document from Paperless-ngx.
* 2. Extract receipt data using LLM.
* 3. Return structured receipt information.
*
* TODO: Cache extraction results to avoid re-processing
* TODO: Add confidence thresholds and human review flags
* TODO: Store extracted data back as Paperless-ngx custom fields
*/
export async function handleReceipt(
ctx: ToolContext,
args: Record<string, unknown>,
): Promise<{ content: Array<{ type: "text"; text: string }> }> {
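const { documentId } = args as unknown as ReceiptArgs;
if (!Number.isInteger(documentId) || documentId <= 0) {
return {
content: [{ type: "text", text: "Error: documentId must be a positive integer." }],
};
}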
const extractor = createReceiptExtractor({
ollama: ctx.ollama,
paperless: ctx.paperless,
});
const receipt = await extractor.extract(documentId);
// Format line items table
const lineItemsTable =
receipt.lineItems.length > 0
? receipt.lineItems
.map(
(item, i) =>
` ${i + 1}. ${item.description} | ` +
`${item.quantity}x ${item.unitPrice.toFixed(2)} = ${item.totalPrice.toFixed(2)}`,
)
.join("\n")
: " No line items extracted";
const output =
`Receipt Data for Document #${documentId}:\n\n` +
`Vendor: ${receipt.vendor}\n` +
`Address: ${receipt.vendorAddress ?? "N/A"}\n` +
`Tax ID: ${receipt.vendorTaxId ?? "N/A"}\n` +
`Date: ${receipt.date}\n` +
`Currency: ${receipt.currency}\n` +
`\nAmounts:\n` +
` Subtotal: ${receipt.subtotal?.toFixed(2) ?? "N/A"}\n` +
` Tax (${receipt.taxRate ?? "?"}%): ${receipt.taxAmount?.toFixed(2) ?? "N/A"}\n` +
` Total: ${receipt.totalAmount.toFixed(2)}\n` +
`\nPayment: ${receipt.paymentMethod ?? "N/A"}\n` +
`Category: ${receipt.category ?? "uncategorized"}\n` +
`Confidence: ${(receipt.confidence * 100).toFixed(0)}%\n` +
`\nLine Items:\n${lineItemsTable}`;
return { content: [{ type: "text", text: output }] };
}
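The TODO about confidence thresholds and human review flags could be approached with a small consistency check on the extracted amounts. This is only a sketch under stated assumptions: `ReceiptAmounts` is a hypothetical subset of the extracted receipt, `reviewReasons` is an illustrative name, and the `0.7` threshold and `0.02` tolerance are arbitrary defaults rather than values taken from PaperCortex.

```typescript
// Hypothetical subset of the extracted receipt fields needed for the check.
interface ReceiptAmounts {
  readonly subtotal: number | null;
  readonly taxAmount: number | null;
  readonly totalAmount: number;
  readonly confidence: number;
}

/**
 * Return the reasons a receipt should be reviewed by a human.
 * An empty array means no issue was detected by these simple rules.
 */
function reviewReasons(
  receipt: ReceiptAmounts,
  minConfidence = 0.7, // illustrative default
  tolerance = 0.02, // illustrative rounding tolerance in currency units
): string[] {
  const reasons: string[] = [];
  if (receipt.confidence < minConfidence) {
    reasons.push("low_confidence");
  }
  if (receipt.totalAmount <= 0) {
    // The extractor falls back to 0 when the LLM returns no numeric total.
    reasons.push("missing_total");
  }
  if (receipt.subtotal !== null && receipt.taxAmount !== null) {
    const expected = receipt.subtotal + receipt.taxAmount;
    if (Math.abs(expected - receipt.totalAmount) > tolerance) {
      reasons.push("amounts_do_not_add_up");
    }
  }
  return reasons;
}

const cleanReceipt = reviewReasons({ subtotal: 100, taxAmount: 19, totalAmount: 119, confidence: 0.9 });
const suspiciousReceipt = reviewReasons({ subtotal: 100, taxAmount: 19, totalAmount: 129, confidence: 0.4 });
console.log(cleanReceipt, suspiciousReceipt);
```

Returning reasons instead of a boolean lets the tool output say why a receipt was flagged.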


@ -0,0 +1,87 @@
/**
* Semantic search tool for the PaperCortex MCP Server.
*
* Performs vector similarity search across all embedded documents,
* returning the most semantically relevant results.
*/
import type { ToolContext } from "../index.js";
// ---------------------------------------------------------------------------
// Types
// ---------------------------------------------------------------------------
interface SearchArgs {
readonly query: string;
readonly limit?: number;
readonly tags?: readonly string[];
}
// ---------------------------------------------------------------------------
// Handler
// ---------------------------------------------------------------------------
/**
* Handle a `papercortex_search` tool call.
*
* 1. Generate an embedding for the search query via Ollama.
* 2. Search the local vector store for similar documents.
* 3. Return ranked results with scores and metadata.
*
* TODO: Add hybrid search (combine vector + keyword for better recall)
* TODO: Add date range filtering
* TODO: Add result caching for repeated queries
*/
export async function handleSearch(
ctx: ToolContext,
args: Record<string, unknown>,
): Promise<{ content: Array<{ type: "text"; text: string }> }> {
const { query, limit = 10, tags } = args as unknown as SearchArgs;
if (!query || query.trim().length === 0) {
return {
content: [{ type: "text", text: "Error: search query cannot be empty." }],
};
}
// Generate embedding for the query
const queryEmbedding = await ctx.ollama.embed(query);
// Search vector store
const results = ctx.vectorStore.search(queryEmbedding.vector, {
limit,
minScore: 0.4,
tagFilter: tags ? [...tags] : undefined,
});
if (results.length === 0) {
return {
content: [
{
type: "text",
text: `No documents found matching "${query}". The vector store may need to be populated first.`,
},
],
};
}
// Format results
const formatted = results
.map(
(r, i) =>
`${i + 1}. [Document #${r.documentId}] (score: ${r.score.toFixed(3)})\n` +
` Title: ${r.title}\n` +
` Tags: ${r.tags.length > 0 ? r.tags.join(", ") : "none"}\n` +
` Preview: ${r.content.slice(0, 200).replace(/\n/g, " ")}${r.content.length > 200 ? "..." : ""}`,
)
.join("\n\n");
return {
content: [
{
type: "text",
text: `Found ${results.length} documents matching "${query}":\n\n${formatted}`,
},
],
};
}
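The vector store behind `ctx.vectorStore.search` is not part of this file. To make the `limit` and `minScore` options concrete, here is a minimal in-memory sketch that assumes cosine similarity as the scoring function; that choice, and the names `StoredVector`, `SearchHit`, `cosineSimilarity`, and `searchVectors`, are assumptions for illustration and may differ from the real store.

```typescript
interface StoredVector {
  readonly documentId: number;
  readonly vector: readonly number[];
}

interface SearchHit {
  readonly documentId: number;
  readonly score: number;
}

/** Cosine similarity of two equal-length vectors; 0 if either has zero length. */
function cosineSimilarity(a: readonly number[], b: readonly number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vector dimensions differ");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  const denominator = Math.sqrt(normA) * Math.sqrt(normB);
  return denominator === 0 ? 0 : dot / denominator;
}

/** Score every stored vector, drop hits below minScore, return the top `limit`. */
function searchVectors(
  query: readonly number[],
  stored: readonly StoredVector[],
  options: { readonly limit: number; readonly minScore: number },
): SearchHit[] {
  return stored
    .map((entry) => ({
      documentId: entry.documentId,
      score: cosineSimilarity(query, entry.vector),
    }))
    .filter((hit) => hit.score >= options.minScore)
    .sort((left, right) => right.score - left.score)
    .slice(0, options.limit);
}

const storedVectors: StoredVector[] = [
  { documentId: 1, vector: [1, 0] },
  { documentId: 2, vector: [0, 1] },
  { documentId: 3, vector: [1, 1] },
];
const hits = searchVectors([1, 0], storedVectors, { limit: 10, minScore: 0.4 });
console.log(hits.map((hit) => `#${hit.documentId}: ${hit.score.toFixed(3)}`));
```

A linear scan like this is fine for a few thousand documents; larger archives would usually need an index, which is outside the scope of this sketch.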

src/paperless/client.ts

@ -0,0 +1,182 @@
/**
* Paperless-ngx REST API client.
*
* Provides typed access to documents, correspondents, tags, and document types.
* All methods return immutable result objects.
*
* @example
* ```ts
* const client = createPaperlessClient({
* baseUrl: "http://localhost:8000",
* token: "your-api-token",
* });
* const docs = await client.getDocuments({ query: "invoice" });
* ```
*/
import type {
Correspondent,
DocumentSearchParams,
DocumentType,
PaginatedResponse,
PaperlessConfig,
PaperlessDocument,
Tag,
} from "./types.js";
// ---------------------------------------------------------------------------
// Client interface
// ---------------------------------------------------------------------------
export interface PaperlessClient {
/** Fetch a single document by ID. */
getDocument(id: number): Promise<PaperlessDocument>;
/** Search / list documents with optional filters. */
getDocuments(
params?: DocumentSearchParams,
): Promise<PaginatedResponse<PaperlessDocument>>;
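/** Fetch correspondents (first page of the paginated list; follow `next` for more). */
getCorrespondents(): Promise<PaginatedResponse<Correspondent>>;
/** Fetch tags (first page of the paginated list; follow `next` for more). */
getTags(): Promise<PaginatedResponse<Tag>>;
/** Fetch document types (first page of the paginated list; follow `next` for more). */
getDocumentTypes(): Promise<PaginatedResponse<DocumentType>>;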
/** Download the original file content of a document. */
downloadDocument(id: number): Promise<ArrayBuffer>;
/** Replace the full tag list on a document; returns the updated document. */
updateDocumentTags(
id: number,
tagIds: readonly number[],
): Promise<PaperlessDocument>;
}
// ---------------------------------------------------------------------------
// Implementation
// ---------------------------------------------------------------------------
/**
* Create a new Paperless-ngx API client.
*
* @param config - Connection configuration (URL + token).
* @returns A {@link PaperlessClient} instance.
*/
export function createPaperlessClient(config: PaperlessConfig): PaperlessClient {
const { baseUrl, token, timeout = 30_000 } = config;
const headers: Record<string, string> = {
Authorization: `Token ${token}`,
"Content-Type": "application/json",
Accept: "application/json; version=3",
};
/**
* Internal fetch wrapper with timeout and error handling.
*/
async function request<T>(
path: string,
options: RequestInit = {},
): Promise<T> {
const url = `${baseUrl.replace(/\/+$/, "")}/api${path}`;
const controller = new AbortController();
const timer = setTimeout(() => controller.abort(), timeout);
try {
const response = await fetch(url, {
...options,
headers: { ...headers, ...((options.headers as Record<string, string>) ?? {}) },
signal: controller.signal,
});
if (!response.ok) {
const body = await response.text().catch(() => "");
throw new Error(
`Paperless API error: ${response.status} ${response.statusText} -- ${body}`,
);
}
return (await response.json()) as T;
} finally {
clearTimeout(timer);
}
}
/**
* Build query string from search params.
*/
function buildQuery(params?: DocumentSearchParams): string {
if (!params) return "";
const entries = Object.entries(params).filter(
([, v]) => v !== undefined && v !== null,
);
if (entries.length === 0) return "";
const searchParams = new URLSearchParams();
for (const [key, value] of entries) {
if (Array.isArray(value)) {
searchParams.set(key, value.join(","));
} else {
searchParams.set(key, String(value));
}
}
return `?${searchParams.toString()}`;
}
return {
async getDocument(id) {
return request<PaperlessDocument>(`/documents/${id}/`);
},
async getDocuments(params) {
return request<PaginatedResponse<PaperlessDocument>>(
`/documents/${buildQuery(params)}`,
);
},
async getCorrespondents() {
return request<PaginatedResponse<Correspondent>>("/correspondents/");
},
async getTags() {
return request<PaginatedResponse<Tag>>("/tags/");
},
async getDocumentTypes() {
return request<PaginatedResponse<DocumentType>>("/document_types/");
},
async downloadDocument(id) {
const url = `${baseUrl.replace(/\/+$/, "")}/api/documents/${id}/download/`;
const controller = new AbortController();
const timer = setTimeout(() => controller.abort(), timeout);
try {
const response = await fetch(url, {
headers: { Authorization: `Token ${token}` },
signal: controller.signal,
});
if (!response.ok) {
throw new Error(
`Paperless download error: ${response.status} ${response.statusText}`,
);
}
return await response.arrayBuffer();
} finally {
clearTimeout(timer);
}
},
async updateDocumentTags(id, tagIds) {
return request<PaperlessDocument>(`/documents/${id}/`, {
method: "PATCH",
body: JSON.stringify({ tags: [...tagIds] }),
});
},
};
}
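To see what `getDocuments` actually sends, it helps to run the query-string logic in isolation. The sketch below re-implements `buildQuery` as a standalone function under the assumption that parameters are plain strings, numbers, or number arrays; `QueryValue` is a hypothetical helper type introduced only for this example.

```typescript
// Hypothetical value type covering the shapes used by DocumentSearchParams.
type QueryValue = string | number | readonly number[] | undefined | null;

/** Serialize filter parameters the same way the client does: arrays become comma lists. */
function buildQuery(params: Record<string, QueryValue>): string {
  const searchParams = new URLSearchParams();
  for (const [key, value] of Object.entries(params)) {
    if (value === undefined || value === null) continue; // skip unset filters
    searchParams.set(key, typeof value === "object" ? value.join(",") : String(value));
  }
  const queryString = searchParams.toString();
  return queryString.length > 0 ? `?${queryString}` : "";
}

const requestPath = `/documents/${buildQuery({
  query: "invoice",
  tags__id__all: [1, 2],
  page: undefined,
})}`;
console.log(requestPath); // → /documents/?query=invoice&tags__id__all=1%2C2
```

`URLSearchParams` percent-encodes the comma as `%2C`; the server is expected to decode it back to `1,2` when it parses the query string.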

src/paperless/types.ts

@ -0,0 +1,126 @@
/**
* TypeScript type definitions for the Paperless-ngx REST API.
*
* Based on Paperless-ngx API v3+.
* @see https://docs.paperless-ngx.com/api/
*/
// ---------------------------------------------------------------------------
// Pagination
// ---------------------------------------------------------------------------
/** Generic paginated response envelope from Paperless-ngx. */
export interface PaginatedResponse<T> {
readonly count: number;
readonly next: string | null;
readonly previous: string | null;
readonly results: readonly T[];
}
// ---------------------------------------------------------------------------
// Core entities
// ---------------------------------------------------------------------------
export interface PaperlessDocument {
readonly id: number;
readonly correspondent: number | null;
readonly document_type: number | null;
readonly storage_path: number | null;
readonly title: string;
readonly content: string;
readonly tags: readonly number[];
readonly created: string;
readonly created_date: string;
readonly modified: string;
readonly added: string;
readonly archive_serial_number: number | null;
readonly original_file_name: string;
readonly archived_file_name: string | null;
readonly owner: number | null;
readonly notes: readonly DocumentNote[];
readonly custom_fields: readonly CustomFieldValue[];
}
export interface DocumentNote {
readonly id: number;
readonly note: string;
readonly created: string;
readonly user: number;
}
export interface CustomFieldValue {
readonly field: number;
readonly value: string | number | boolean | null;
}
export interface Correspondent {
readonly id: number;
readonly slug: string;
readonly name: string;
readonly match: string;
readonly matching_algorithm: number;
readonly is_insensitive: boolean;
readonly document_count: number;
readonly last_correspondence: string | null;
}
export interface DocumentType {
readonly id: number;
readonly slug: string;
readonly name: string;
readonly match: string;
readonly matching_algorithm: number;
readonly is_insensitive: boolean;
readonly document_count: number;
}
export interface Tag {
readonly id: number;
readonly slug: string;
readonly name: string;
readonly color: string;
readonly text_color: string;
readonly match: string;
readonly matching_algorithm: number;
readonly is_insensitive: boolean;
readonly is_inbox_tag: boolean;
readonly document_count: number;
}
export interface StoragePath {
readonly id: number;
readonly slug: string;
readonly name: string;
readonly path: string;
readonly match: string;
readonly matching_algorithm: number;
readonly is_insensitive: boolean;
readonly document_count: number;
}
// ---------------------------------------------------------------------------
// Search & filter
// ---------------------------------------------------------------------------
export interface DocumentSearchParams {
readonly query?: string;
readonly correspondent__id?: number;
readonly document_type__id?: number;
readonly tags__id__all?: readonly number[];
readonly tags__id__none?: readonly number[];
readonly created__date__gt?: string;
readonly created__date__lt?: string;
readonly ordering?: string;
readonly page?: number;
readonly page_size?: number;
}
// ---------------------------------------------------------------------------
// API client configuration
// ---------------------------------------------------------------------------
export interface PaperlessConfig {
readonly baseUrl: string;
readonly token: string;
readonly timeout?: number;
}

src/receipt/datev.ts

@ -0,0 +1,171 @@
/**
* DATEV export formatter.
*
* Generates DATEV-compatible CSV files for import into German accounting
* software (DATEV Unternehmen Online, lexoffice, sevDesk, etc.).
*
* Implements the DATEV "Buchungsstapel" (posting batch) format v7.0+.
*
* @see https://developer.datev.de/datev/platform/en/dtvf/formate
*
* @example
* ```ts
* const exporter = createDatevExporter({ consultantNumber: 12345, clientNumber: 67890 });
* const csv = exporter.generateCsv(receipts); // receipts: readonly ReceiptForExport[]
* writeFileSync("./export.csv", csv);
* ```
*/
import { stringify } from "csv-stringify/sync";
// ---------------------------------------------------------------------------
// Types
// ---------------------------------------------------------------------------
export interface DatevConfig {
/** DATEV consultant number (Beraternummer). */
readonly consultantNumber: number;
/** DATEV client number (Mandantennummer). */
readonly clientNumber: number;
/** Fiscal year start (1-12, default: 1 for January). */
readonly fiscalYearStart?: number;
/** Default debit account length (SKR03/SKR04). */
readonly accountLength?: 4 | 5;
}
export interface DatevBookingEntry {
readonly amount: number;
readonly debitAccount: string;
readonly creditAccount: string;
readonly taxCode: string;
readonly date: string;
readonly description: string;
readonly documentNumber: string;
readonly costCenter?: string;
}
export interface ReceiptForExport {
readonly documentId: number;
readonly vendor: string;
readonly date: string;
readonly totalAmount: number;
readonly taxRate: number | null;
readonly category: string | null;
}
export interface DatevExporter {
/** Generate DATEV CSV from receipt data. */
generateCsv(receipts: readonly ReceiptForExport[]): string;
/** Map a receipt to a DATEV booking entry. */
mapToBooking(receipt: ReceiptForExport): DatevBookingEntry;
}
// ---------------------------------------------------------------------------
// Constants
// ---------------------------------------------------------------------------
/**
* Map expense categories to SKR03 accounts.
* TODO: Add SKR04 mapping support
* TODO: Make configurable via user settings
*/
const SKR03_ACCOUNT_MAP: Record<string, string> = {
office_supplies: "4930",
travel: "4660",
food: "4650",
telephone: "4920",
postage: "4910",
insurance: "4360",
rent: "4210",
advertising: "4600",
software: "4964",
hardware: "4980",
consulting: "4950",
training: "4945",
vehicle: "4500",
default: "4900",
};
/**
* Map tax rates to DATEV tax codes (Steuerschluessel).
*/
const TAX_CODE_MAP: Record<number, string> = {
19: "9", // 19% Vorsteuer (input VAT, standard rate)
7: "8", // 7% Vorsteuer (input VAT, reduced rate)
0: "0", // Tax-free
};
// ---------------------------------------------------------------------------
// Implementation
// ---------------------------------------------------------------------------
/**
* Create a DATEV-format exporter for receipt data.
*
* TODO: Implement DATEV header line with metadata (consultant, client, date range)
* TODO: Add validation for account numbers against SKR03/SKR04
* TODO: Support DATEV XML format (Buchungsdaten v5.0)
*/
export function createDatevExporter(config: DatevConfig): DatevExporter {
const {
consultantNumber: _consultantNumber,
clientNumber: _clientNumber,
fiscalYearStart: _fiscalYearStart = 1,
accountLength: _accountLength = 4,
} = config;
function mapToBooking(receipt: ReceiptForExport): DatevBookingEntry {
const category = receipt.category ?? "default";
const debitAccount =
SKR03_ACCOUNT_MAP[category] ?? SKR03_ACCOUNT_MAP["default"];
const taxRate = receipt.taxRate ?? 19;
const taxCode = TAX_CODE_MAP[taxRate] ?? TAX_CODE_MAP[19];
// Convert ISO date (YYYY-MM-DD) to DDMM, the Belegdatum format DATEV expects
const dateParts = receipt.date.split("-");
const datevDate =
dateParts.length === 3
? `${dateParts[2]}${dateParts[1]}`
: receipt.date;
return {
amount: receipt.totalAmount,
debitAccount,
creditAccount: "1200", // Bank account (SKR03 default)
taxCode,
date: datevDate,
description: receipt.vendor.slice(0, 60), // DATEV max 60 chars
documentNumber: `PC-${receipt.documentId}`,
costCenter: undefined,
};
}
function generateCsv(receipts: readonly ReceiptForExport[]): string {
const bookings = receipts.map(mapToBooking);
// DATEV Buchungsstapel columns
const rows = bookings.map((b) => [
b.amount.toFixed(2).replace(".", ","), // Umsatz (amount with comma)
"H", // Soll/Haben (H = Haben/credit on Konto, the bank account; the expense Gegenkonto is debited)
b.taxCode, // BU-Schluessel (tax code)
b.debitAccount, // Gegenkonto (offset account)
b.date, // Belegdatum (document date)
b.documentNumber, // Belegfeld 1 (document number)
"", // Belegfeld 2
b.description, // Buchungstext (description)
"", // Umsatzsteuer-ID
b.creditAccount, // Konto (account)
b.costCenter ?? "", // Kostenstelle (cost center)
]);
return stringify(rows, {
delimiter: ";",
quoted: true,
record_delimiter: "\r\n",
});
}
return { generateCsv, mapToBooking };
}
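The two string conversions used by the exporter (ISO date to `DDMM`, decimal point to decimal comma) are easy to get subtly wrong, so here is a standalone sketch of both. `toDatevDate` and `toDatevAmount` are illustrative names, not functions from this file. Unlike `mapToBooking`, which passes an unrecognized date through unchanged, this version throws on anything that is not `YYYY-MM-DD`; that stricter behavior is a design choice of the sketch.

```typescript
/** Convert an ISO date (YYYY-MM-DD) to the four-digit DDMM form. */
function toDatevDate(isoDate: string): string {
  const match = /^(\d{4})-(\d{2})-(\d{2})$/.exec(isoDate);
  if (!match) {
    throw new Error(`Expected YYYY-MM-DD, got "${isoDate}"`);
  }
  const [, , month, day] = match;
  return `${day}${month}`;
}

/** Format an amount with two decimals and a decimal comma. */
function toDatevAmount(amount: number): string {
  return amount.toFixed(2).replace(".", ",");
}

console.log(toDatevDate("2026-03-26")); // → 2603
console.log(toDatevAmount(1234.5)); // → 1234,50
```

Failing loudly on a malformed date avoids writing a row that the accounting software would reject or, worse, misread at import time.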

src/receipt/extractor.ts

@ -0,0 +1,170 @@
/**
* Receipt data extraction using local LLM via Ollama.
*
* Extracts structured data from receipt documents: vendor, date, amounts,
* tax breakdown, line items, and payment method. Uses the Paperless-ngx
* OCR content and enriches it with LLM analysis.
*
* @example
* ```ts
* const extractor = createReceiptExtractor({ ollama, paperless });
* const receipt = await extractor.extract(documentId);
* console.log(receipt.vendor, receipt.totalAmount, receipt.taxAmount);
* ```
*/
import type { OllamaClient } from "../embeddings/ollama.js";
import type { PaperlessClient } from "../paperless/client.js";
// ---------------------------------------------------------------------------
// Types
// ---------------------------------------------------------------------------
export interface ReceiptData {
readonly documentId: number;
readonly vendor: string;
readonly vendorAddress: string | null;
readonly vendorTaxId: string | null;
readonly date: string;
readonly currency: string;
readonly subtotal: number | null;
readonly taxRate: number | null;
readonly taxAmount: number | null;
readonly totalAmount: number;
readonly paymentMethod: string | null;
readonly lineItems: readonly LineItem[];
readonly category: string | null;
readonly confidence: number;
readonly rawText: string;
}
export interface LineItem {
readonly description: string;
readonly quantity: number;
readonly unitPrice: number;
readonly totalPrice: number;
readonly taxRate: number | null;
}
export interface ReceiptExtractorConfig {
readonly ollama: OllamaClient;
readonly paperless: PaperlessClient;
}
export interface ReceiptExtractor {
/** Extract structured receipt data from a Paperless-ngx document. */
extract(documentId: number): Promise<ReceiptData>;
/** Batch-extract receipts from multiple documents. */
extractBatch(documentIds: readonly number[]): Promise<readonly ReceiptData[]>;
}
// ---------------------------------------------------------------------------
// Prompts
// ---------------------------------------------------------------------------
const EXTRACTION_SYSTEM_PROMPT = `You are a receipt data extraction assistant. Given the OCR text of a receipt, extract structured data in JSON format.
Extract the following fields:
- vendor: Company/store name
- vendorAddress: Full address if visible
- vendorTaxId: Tax ID / VAT number if visible (e.g., USt-IdNr, Steuernummer)
- date: Date in ISO 8601 format (YYYY-MM-DD)
- currency: ISO 4217 currency code (e.g., EUR, USD)
- subtotal: Amount before tax (null if not distinguishable)
- taxRate: Tax rate as a percentage number (e.g., 19 for 19%, not 0.19)
- taxAmount: Tax amount
- totalAmount: Total amount including tax
- paymentMethod: Payment method if visible (cash, card, etc.)
- lineItems: Array of { description, quantity, unitPrice, totalPrice, taxRate }
- category: Suggested expense category (office_supplies, travel, food, etc.)
- confidence: Your confidence in the extraction (0.0 to 1.0)
Respond ONLY with valid JSON. No explanation, no markdown.`;
// ---------------------------------------------------------------------------
// Implementation
// ---------------------------------------------------------------------------
/**
* Create a receipt data extractor.
*
* TODO: Add support for image-based receipts (pass images to multimodal LLM)
* TODO: Add receipt template matching for common vendors
* TODO: Add currency conversion support
*/
export function createReceiptExtractor(
config: ReceiptExtractorConfig,
): ReceiptExtractor {
const { ollama, paperless } = config;
async function extractSingle(documentId: number): Promise<ReceiptData> {
// Fetch the document content from Paperless-ngx
const document = await paperless.getDocument(documentId);
const ocrText = document.content;
if (!ocrText || ocrText.trim().length === 0) {
throw new Error(
`Document ${documentId} has no OCR content. Ensure Paperless-ngx has processed the document.`,
);
}
// Send to Ollama for structured extraction
const prompt = `Extract receipt data from the following OCR text:\n\n---\n${ocrText}\n---`;
const completion = await ollama.complete(prompt, EXTRACTION_SYSTEM_PROMPT);
// Parse LLM response
// TODO: Add robust JSON extraction (handle markdown code blocks, partial JSON)
// TODO: Validate against Zod schema for type safety
let parsed: Record<string, unknown>;
try {
parsed = JSON.parse(completion.text);
} catch {
throw new Error(
`Failed to parse receipt extraction result for document ${documentId}. ` +
`LLM response was not valid JSON.`,
);
}
return {
documentId,
vendor: String(parsed.vendor ?? "Unknown"),
vendorAddress: parsed.vendorAddress ? String(parsed.vendorAddress) : null,
vendorTaxId: parsed.vendorTaxId ? String(parsed.vendorTaxId) : null,
date: String(parsed.date ?? new Date().toISOString().split("T")[0]),
currency: String(parsed.currency ?? "EUR"),
subtotal: typeof parsed.subtotal === "number" ? parsed.subtotal : null,
taxRate: typeof parsed.taxRate === "number" ? parsed.taxRate : null,
taxAmount: typeof parsed.taxAmount === "number" ? parsed.taxAmount : null,
totalAmount: typeof parsed.totalAmount === "number" ? parsed.totalAmount : 0,
paymentMethod: parsed.paymentMethod ? String(parsed.paymentMethod) : null,
lineItems: Array.isArray(parsed.lineItems)
? parsed.lineItems.map((item: Record<string, unknown>) => ({
description: String(item.description ?? ""),
quantity: Number(item.quantity ?? 1),
unitPrice: Number(item.unitPrice ?? 0),
totalPrice: Number(item.totalPrice ?? 0),
taxRate: typeof item.taxRate === "number" ? item.taxRate : null,
}))
: [],
category: parsed.category ? String(parsed.category) : null,
confidence: typeof parsed.confidence === "number" ? parsed.confidence : 0.5,
rawText: ocrText,
};
}
return {
extract: extractSingle,
async extractBatch(documentIds) {
// TODO: Add concurrency control (process N at a time)
// TODO: Add progress reporting callback
const results: ReceiptData[] = [];
for (const id of documentIds) {
const result = await extractSingle(id);
results.push(result);
}
return results;
},
};
}
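The TODO about robust JSON extraction matters in practice, because local models often wrap their answer in a markdown code fence or add a sentence before it, and a bare `JSON.parse` then fails. One possible approach, sketched here with the hypothetical helper `extractJsonObject`, is to parse only the span from the first `{` to the last `}`. It does not repair truncated JSON and would be confused by braces in trailing prose.

```typescript
/**
 * Pull a JSON object out of an LLM reply that may contain surrounding prose
 * or a markdown code fence. Throws if no parsable object is found.
 */
function extractJsonObject(text: string): Record<string, unknown> {
  const start = text.indexOf("{");
  const end = text.lastIndexOf("}");
  if (start === -1 || end <= start) {
    throw new Error("No JSON object found in LLM response");
  }
  const parsed: unknown = JSON.parse(text.slice(start, end + 1));
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    throw new Error("LLM response did not contain a JSON object");
  }
  return parsed as Record<string, unknown>;
}

// The fence marker is built at runtime so this example contains no literal fence.
const fence = "`".repeat(3);
const reply =
  `Here is the extracted data:\n${fence}json\n` +
  `{"vendor": "ACME GmbH", "totalAmount": 119}\n${fence}`;
const extracted = extractJsonObject(reply);
console.log(extracted["vendor"], extracted["totalAmount"]);
```

The result is still untyped, so the field-by-field coercion in `extractSingle` (or a schema validator, as the other TODO suggests) remains necessary.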

src/receipt/matcher.ts

@ -0,0 +1,231 @@
/**
* Bank CSV transaction matching for receipts.
*
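* Matches extracted receipt data against bank CSV exports to reconcile
* transactions. Targets common German bank export formats (Sparkasse,
* Volksbank, ING, DKB); format-specific column mappings are not implemented
* yet, so a generic semicolon-delimited mapping is used for all of them.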
*
* @example
* ```ts
* const matcher = createTransactionMatcher();
* const bankTxns = matcher.parseBankCsv("./bank_export.csv");
* const matches = matcher.matchReceipts(receipts, bankTxns);
* ```
*/
import { parse } from "csv-parse/sync";
import { readFileSync } from "node:fs";
// ---------------------------------------------------------------------------
// Types
// ---------------------------------------------------------------------------
export interface BankTransaction {
readonly date: string;
readonly description: string;
readonly amount: number;
readonly currency: string;
readonly iban: string | null;
readonly bic: string | null;
readonly reference: string | null;
readonly rawLine: string;
}
export interface ReceiptMatchCandidate {
readonly documentId: number;
readonly vendor: string;
readonly date: string;
readonly totalAmount: number;
readonly currency: string;
}
export interface MatchResult {
readonly receipt: ReceiptMatchCandidate;
readonly transaction: BankTransaction;
readonly confidence: number;
readonly matchReasons: readonly string[];
}
export interface UnmatchedItem {
readonly type: "receipt" | "transaction";
readonly item: ReceiptMatchCandidate | BankTransaction;
}
export interface MatchSummary {
readonly matched: readonly MatchResult[];
readonly unmatchedReceipts: readonly ReceiptMatchCandidate[];
readonly unmatchedTransactions: readonly BankTransaction[];
readonly matchRate: number;
}
export interface TransactionMatcher {
/** Parse a bank CSV export file into structured transactions. */
parseBankCsv(filePath: string, format?: BankCsvFormat): readonly BankTransaction[];
/** Match receipts against bank transactions. */
matchReceipts(
receipts: readonly ReceiptMatchCandidate[],
transactions: readonly BankTransaction[],
): MatchSummary;
}
export type BankCsvFormat = "auto" | "sparkasse" | "ing" | "dkb" | "volksbank" | "generic";
// ---------------------------------------------------------------------------
// Implementation
// ---------------------------------------------------------------------------
/**
* Create a transaction matcher for bank CSV reconciliation.
*
* TODO: Add ML-based fuzzy matching for vendor names
* TODO: Add support for MT940/CAMT.053 bank statement formats
* TODO: Add date tolerance configuration (match within N days)
*/
export function createTransactionMatcher(): TransactionMatcher {
/**
* Parse bank CSV with auto-detected or specified format.
*/
function parseBankCsv(
filePath: string,
format: BankCsvFormat = "auto",
): readonly BankTransaction[] {
const raw = readFileSync(filePath, "utf-8");
// TODO: Implement format auto-detection based on header patterns
// TODO: Add support for different CSV delimiters (semicolon for German exports)
// TODO: Handle different date formats (DD.MM.YYYY, YYYY-MM-DD, MM/DD/YYYY)
void format; // Not used yet; avoids an unused-variable error until format-specific parsing exists
const records = parse(raw, {
columns: true,
skip_empty_lines: true,
delimiter: ";",
relaxColumnCount: true,
}) as Record<string, string>[];
return records.map((record): BankTransaction => {
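// Generic column mapping -- override per format
// TODO: Implement format-specific column mappings
// German exports use "." for thousands and "," for decimals. Only normalise
// when a decimal comma is present, so a plain "1234.56" is left intact.
const rawAmount = record["Betrag"] ?? record["Amount"] ?? "0";
const normalizedAmount = rawAmount.includes(",")
? rawAmount.replace(/\./g, "").replace(",", ".")
: rawAmount;
return {
date: record["Buchungstag"] ?? record["Date"] ?? record["Datum"] ?? "",
description:
record["Verwendungszweck"] ??
record["Description"] ??
record["Buchungstext"] ??
"",
amount: parseFloat(normalizedAmount),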
currency: record["Waehrung"] ?? record["Currency"] ?? "EUR",
iban: record["IBAN"] ?? null,
bic: record["BIC"] ?? null,
reference: record["Kundenreferenz"] ?? record["Reference"] ?? null,
rawLine: JSON.stringify(record),
};
});
}
/**
* Match receipts against bank transactions by amount, date proximity, and vendor name.
*/
function matchReceipts(
receipts: readonly ReceiptMatchCandidate[],
transactions: readonly BankTransaction[],
): MatchSummary {
const matched: MatchResult[] = [];
const matchedReceiptIds = new Set<number>();
const matchedTxnIndices = new Set<number>();
// TODO: Implement smarter matching with vendor name fuzzy matching
// TODO: Add configurable date tolerance window
// TODO: Handle split transactions (one receipt, multiple bank entries)
for (const receipt of receipts) {
let bestMatch: { index: number; confidence: number; reasons: string[] } | null =
null;
for (let i = 0; i < transactions.length; i++) {
if (matchedTxnIndices.has(i)) continue;
const txn = transactions[i];
const reasons: string[] = [];
let confidence = 0;
// Amount matching (exact or close)
const amountDiff = Math.abs(Math.abs(txn.amount) - receipt.totalAmount);
if (amountDiff < 0.01) {
confidence += 0.5;
reasons.push("exact_amount_match");
} else if (amountDiff < 1.0) {
confidence += 0.3;
reasons.push("close_amount_match");
}
// Date matching (dates that `Date` cannot parse yield NaN and add no confidence)
const receiptDate = new Date(receipt.date).getTime();
const txnDate = new Date(txn.date).getTime();
const daysDiff = Math.abs(receiptDate - txnDate) / (1000 * 60 * 60 * 24);
if (daysDiff < 1) {
confidence += 0.3;
reasons.push("same_day");
} else if (daysDiff < 3) {
confidence += 0.15;
reasons.push("within_3_days");
} else if (daysDiff < 7) {
confidence += 0.05;
reasons.push("within_7_days");
}
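// Vendor name in description (skip empty vendors, which would match every description)
const vendorPrefix = receipt.vendor.trim().toLowerCase().slice(0, 8);
if (
vendorPrefix.length > 0 &&
txn.description.toLowerCase().includes(vendorPrefix)
) {
confidence += 0.2;
reasons.push("vendor_in_description");
}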
if (
confidence > 0.5 &&
(!bestMatch || confidence > bestMatch.confidence)
) {
bestMatch = { index: i, confidence, reasons };
}
}
if (bestMatch) {
matched.push({
receipt,
transaction: transactions[bestMatch.index],
confidence: bestMatch.confidence,
matchReasons: bestMatch.reasons,
});
matchedReceiptIds.add(receipt.documentId);
matchedTxnIndices.add(bestMatch.index);
}
}
const unmatchedReceipts = receipts.filter(
(r) => !matchedReceiptIds.has(r.documentId),
);
const unmatchedTransactions = transactions.filter(
(_, i) => !matchedTxnIndices.has(i),
);
return {
matched,
unmatchedReceipts,
unmatchedTransactions,
matchRate:
receipts.length > 0 ? matched.length / receipts.length : 0,
};
}
return { parseBankCsv, matchReceipts };
}
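The date comparison in `matchReceipts` passes the bank's date string straight to `new Date(...)`. German exports usually write dates as `DD.MM.YYYY`, which is not a format `Date` is required to parse, so the result is implementation-dependent. A normalization step along the following lines could run in `parseBankCsv` before matching. This is a sketch only: `toIsoDate` and `daysBetween` are hypothetical helpers, and two-digit years or `MM/DD/YYYY` dates are deliberately not handled.

```typescript
/** Normalize DD.MM.YYYY or YYYY-MM-DD to YYYY-MM-DD; return null for anything else. */
function toIsoDate(value: string): string | null {
  const trimmed = value.trim();
  const german = /^(\d{2})\.(\d{2})\.(\d{4})$/.exec(trimmed);
  if (german) {
    const [, day, month, year] = german;
    return `${year}-${month}-${day}`;
  }
  return /^\d{4}-\d{2}-\d{2}$/.test(trimmed) ? trimmed : null;
}

/** Absolute difference in days between two dates, or null if either is unparseable. */
function daysBetween(first: string, second: string): number | null {
  const isoFirst = toIsoDate(first);
  const isoSecond = toIsoDate(second);
  if (isoFirst === null || isoSecond === null) return null;
  // Pin both dates to UTC midnight so the local time zone cannot shift the result.
  const milliseconds = Math.abs(
    Date.parse(`${isoFirst}T00:00:00Z`) - Date.parse(`${isoSecond}T00:00:00Z`),
  );
  return milliseconds / 86_400_000;
}

console.log(toIsoDate("26.03.2026")); // → 2026-03-26
console.log(daysBetween("2026-03-24", "26.03.2026")); // → 2
```

Returning `null` for unknown formats makes the failure visible to the caller instead of silently producing `NaN` in the confidence calculation.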

src/skill/SKILL.md

@ -0,0 +1,72 @@
# PaperCortex -- Document Intelligence Skill
> A Claude Code skill for interacting with your Paperless-ngx document archive through AI-powered semantic search, classification, receipt extraction, and accounting export.
## Prerequisites
- PaperCortex MCP Server running (see project README)
- Paperless-ngx instance with API access
- Ollama with `qwen2.5:14b` and `nomic-embed-text` models
## Available Tools
### papercortex_search
Search documents by meaning, not just keywords.
```
Search for: "office lease agreements from last year"
Search for: "tax-relevant receipts over 500 EUR"
Search for: "correspondence with insurance companies"
```
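Under the hood, an MCP client turns a request like the ones above into a JSON-RPC `tools/call` message. The sketch below shows roughly what such a message could look like for this tool. The argument names `query`, `limit`, and `tags` follow the `SearchArgs` interface in the search handler; `ToolCallRequest` and `buildSearchCall` are hypothetical names used only for this illustration, and Claude Code builds these messages itself.

```typescript
// Minimal shape of an MCP "tools/call" JSON-RPC request.
interface ToolCallRequest {
  readonly jsonrpc: "2.0";
  readonly id: number;
  readonly method: "tools/call";
  readonly params: {
    readonly name: string;
    readonly arguments: Record<string, unknown>;
  };
}

/** Build a papercortex_search call; `tags` is only included when provided. */
function buildSearchCall(
  id: number,
  query: string,
  limit = 10,
  tags?: readonly string[],
): ToolCallRequest {
  return {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: {
      name: "papercortex_search",
      arguments: { query, limit, ...(tags ? { tags: [...tags] } : {}) },
    },
  };
}

const searchCall = buildSearchCall(1, "office lease agreements");
console.log(JSON.stringify(searchCall, null, 2));
```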
### papercortex_classify
Auto-classify a document with AI-suggested tags, type, and correspondent.
```
Classify document #1234
Classify document #1234 and apply suggested tags
```
### papercortex_receipt
Extract structured data from receipt documents.
```
Extract receipt from document #5678
```
Returns: vendor, date, amounts, tax breakdown, line items, category.
### papercortex_query
Ask natural language questions about your document archive.
```
"How much did I spend on office supplies in Q1 2024?"
"Which invoices are still unpaid?"
"Summarize all contracts expiring this year"
```
### papercortex_export
Export receipt data for accounting software.
```
Export documents #100, #101, #102 as DATEV CSV
Export documents #200, #201 as generic CSV
```
## Workflow Examples
### Monthly Bookkeeping
1. Search for all receipts from the current month
2. Extract data from each receipt
3. Export as DATEV CSV
4. Import into accounting software
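Steps 2 and 3 of this workflow can be written down as a list of tool calls, given the document IDs found in step 1. This is a sketch under assumptions: `planBookkeepingCalls` and `ToolCall` are hypothetical names, `documentId` matches the receipt handler's argument, and the `documentIds` and `format` arguments of `papercortex_export` are guesses, since the export tool's input schema is not shown here.

```typescript
interface ToolCall {
  readonly name: string;
  readonly arguments: Record<string, unknown>;
}

/** One extraction call per receipt, followed by a single export call. */
function planBookkeepingCalls(documentIds: readonly number[]): ToolCall[] {
  const extractionCalls: ToolCall[] = documentIds.map((documentId) => ({
    name: "papercortex_receipt",
    arguments: { documentId },
  }));
  const exportCall: ToolCall = {
    name: "papercortex_export",
    // Hypothetical argument names; check the export tool's schema before relying on them.
    arguments: { documentIds: [...documentIds], format: "datev" },
  };
  return [...extractionCalls, exportCall];
}

const bookkeepingPlan = planBookkeepingCalls([100, 101, 102]);
console.log(bookkeepingPlan.map((call) => call.name));
```

Extraction runs once per document because `papercortex_receipt` takes a single `documentId`.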
### Document Organization
1. Find unclassified documents (no tags)
2. Auto-classify each document
3. Review and approve suggested tags
### Expense Analysis
1. Query: "What were my top 5 expense categories last quarter?"
2. Drill into specific categories with follow-up queries
3. Export relevant receipts for documentation

tsconfig.json

@ -0,0 +1,24 @@
{
"compilerOptions": {
"target": "ES2022",
"module": "ESNext",
"moduleResolution": "bundler",
"lib": ["ES2022"],
"outDir": "./dist",
"rootDir": "./src",
"strict": true,
"esModuleInterop": true,
"skipLibCheck": true,
"forceConsistentCasingInFileNames": true,
"resolveJsonModule": true,
"declaration": true,
"declarationMap": true,
"sourceMap": true,
"noUnusedLocals": true,
"noUnusedParameters": true,
"noImplicitReturns": true,
"noFallthroughCasesInSwitch": true
},
"include": ["src/**/*"],
"exclude": ["node_modules", "dist", "**/*.test.ts"]
}