feat: initial release — AI document intelligence for Paperless-ngx

PaperCortex adds semantic search, auto-classification, receipt extraction, bank statement matching, and DATEV export to Paperless-ngx — powered entirely by local AI through Ollama. Exposes everything as an MCP Server for Claude Code and AI agent integration. - MCP Server with 5 tools (search, classify, receipt, query, export) - Local Ollama embeddings for semantic document search - Receipt data extraction (vendor, amount, date, tax, line items) - DATEV Buchungsstapel CSV export for German accounting - Bank CSV transaction matching - Paperless-ngx REST API client - Docker deployment - Zero cloud dependencies — 100% self-hosted
2026-03-26 06:28:48 +13:00 · 2026-03-26 06:28:48 +13:00 · 2052d87ba1
commit 2052d87ba1
25 changed files with 3322 additions and 0 deletions
--- a/.env.example
+++ b/.env.example
@ -0,0 +1,20 @@
 # PaperCortex Configuration
 # Copy this file to .env and fill in your values
 # Paperless-ngx connection
 PAPERLESS_URL=http://localhost:8000
 PAPERLESS_TOKEN=your-paperless-api-token-here
 # Ollama connection
 OLLAMA_URL=http://localhost:11434
 OLLAMA_MODEL=qwen2.5:14b
 OLLAMA_EMBEDDING_MODEL=nomic-embed-text
 # Vector store
 VECTOR_DB_PATH=./data/vectors.db
 # MCP Server
 MCP_SERVER_PORT=3100
 # Logging
 LOG_LEVEL=info
--- a/.gitignore
+++ b/.gitignore
@ -0,0 +1,35 @@
 # Dependencies
 node_modules/
 # Build output
 dist/
 # Environment files
 .env
 .env.local
 .env.*.local
 # Data directory (vectors, cache)
 data/
 # OS files
 .DS_Store
 Thumbs.db
 # IDE
 .vscode/
 .idea/
 *.swp
 *.swo
 # Logs
 logs/
 *.log
 npm-debug.log*
 # Test coverage
 coverage/
 # Temporary files
 tmp/
 temp/
--- a/34
+++ b/34
@ -0,0 +1,34 @@
 FROM node:22-alpine AS builder
 WORKDIR /app
 COPY package.json package-lock.json* ./
 RUN npm ci
 COPY tsconfig.json ./
 COPY src/ ./src/
 RUN npm run build
 # --- Production image ---
 FROM node:22-alpine
 WORKDIR /app
 RUN addgroup -g 1001 -S papercortex && \
    adduser -S papercortex -u 1001
 COPY package.json package-lock.json* ./
 RUN npm ci --omit=dev && npm cache clean --force
 COPY --from=builder /app/dist ./dist
 RUN mkdir -p /app/data && chown papercortex:papercortex /app/data
 USER papercortex
 ENV NODE_ENV=production
 ENV VECTOR_DB_PATH=/app/data/vectors.db
 EXPOSE 3100
 CMD ["node", "dist/mcp-server/index.js"]
--- a/21
+++ b/21
@ -0,0 +1,21 @@
 MIT License
 Copyright (c) 2026 PaperCortex Contributors
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal
 in the Software without restriction, including without limitation the rights
 to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 copies of the Software, and to permit persons to whom the Software is
 furnished to do so, subject to the following conditions:
 The above copyright notice and this permission notice shall be included in all
 copies or substantial portions of the Software.
 THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
 AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
 SOFTWARE.
--- a/README.md
+++ b/README.md
@ -0,0 +1,737 @@
 <p align="center">
  <img src="docs/assets/papercortex-logo.svg" alt="PaperCortex Logo" width="120" />
  <h1 align="center">PaperCortex</h1>
  <p align="center">
    <strong>AI-Powered Document Intelligence for Paperless-ngx</strong><br/>
    <em>Semantic search, auto-classification, receipt extraction, and accounting export — 100% local, 100% private.</em>
  </p>
  <p align="center">
    <a href="#-quick-start"><img src="https://img.shields.io/badge/Docker-one--command-2496ED?logo=docker&logoColor=white" alt="Docker"></a>
    <a href="LICENSE"><img src="https://img.shields.io/badge/License-MIT-22c55e.svg" alt="MIT License"></a>
    <img src="https://img.shields.io/badge/TypeScript-5.x-3178C6?logo=typescript&logoColor=white" alt="TypeScript">
    <img src="https://img.shields.io/badge/Ollama-Local_AI-7C3AED?logo=ollama&logoColor=white" alt="Ollama">
    <img src="https://img.shields.io/badge/MCP-Server-F97316" alt="MCP Server">
    <img src="https://img.shields.io/badge/Paperless--ngx-Compatible-EF4444?logo=data:image/svg+xml;base64,..." alt="Paperless-ngx">
    <img src="https://img.shields.io/badge/DATEV-Export-EAB308" alt="DATEV Export">
    <img src="https://img.shields.io/badge/Privacy-First-10B981" alt="Privacy First">
  </p>
  <p align="center">
    <a href="#-quick-start">Quick Start</a> · <a href="#-features">Features</a> · <a href="#-mcp-server-tools">MCP Tools</a> · <a href="#-receipt-intelligence">Receipts</a> · <a href="#-documentation">Docs</a>
  </p>
 </p>
 ---
 ## What is PaperCortex?
 **PaperCortex** turns your [Paperless-ngx](https://github.com/paperless-ngx/paperless-ngx) document archive into an intelligent, queryable knowledge base — powered entirely by local AI running on your own hardware.
 If you use Paperless-ngx to store invoices, receipts, contracts, tax documents, letters, or any other scanned paperwork, PaperCortex adds the intelligence layer that Paperless-ngx is missing:
 - **Ask questions in plain English** — "Show me all invoices from Amazon over 100 EUR in 2025"
 - **Find documents by meaning**, not just keywords — searching for "office rent" finds "Bueromiete" and "monthly lease payment"
 - **Auto-tag and classify** every new document the moment it arrives
 - **Extract structured data from receipts** — vendor, date, amount, tax rate, line items
 - **Match receipts to bank transactions** automatically
 - **Export to DATEV** for your German tax advisor — or plain CSV for any accounting software
 Everything runs locally through [Ollama](https://ollama.com). No document content ever leaves your network. No cloud APIs. No subscriptions. No data harvesting.
 PaperCortex exposes all capabilities as an **[MCP (Model Context Protocol)](https://modelcontextprotocol.io) Server**, making it a first-class tool for [Claude Code](https://docs.anthropic.com/en/docs/claude-code), AI coding agents, and automated workflows.
 ---
 ## The Problem
 Paperless-ngx is an outstanding document management system with 37,000+ GitHub stars. It handles scanning, OCR, storage, and basic tagging beautifully. But once your documents are in Paperless-ngx, finding and working with them has real limitations:
 | What you want to do | Paperless-ngx alone | With PaperCortex |
 |---|---|---|
 | Find a document by what it's about | Keyword search only — misses synonyms, translations, related concepts | **Semantic search** understands meaning across languages |
 | Classify incoming documents | Manual rules or basic auto-matching | **LLM-powered classification** understands document content |
 | Extract data from a receipt | Read it yourself and type it in | **Automatic extraction** of vendor, amount, date, tax, line items |
 | Answer "How much did I spend on X?" | Export everything, open spreadsheet, filter manually | **Natural language query** returns the answer instantly |
 | Send receipt data to accounting | Manual data entry or copy-paste | **One-click DATEV/CSV export** ready for your tax advisor |
 | Use documents in AI workflows | No API integration for AI agents | **Full MCP Server** for Claude Code and any MCP-compatible agent |
 | Keep data private | Self-hosted (good!) | Self-hosted AI too — **zero cloud dependency** |
 ---
 ## Features
 ### Semantic Document Search
 Traditional keyword search fails when you don't remember the exact words. PaperCortex generates vector embeddings for every document using local Ollama models and stores them in a lightweight SQLite vector database.
 **Search by meaning, not by memory:**
 - Search for `"electricity bill"` → finds documents containing "Stromrechnung", "utility payment", "power invoice"
 - Search for `"office supplies"` → finds "Bueroausstattung", "paper and toner", "desk accessories order"
 - Search for `"tax deductible travel"` → finds flight bookings, hotel receipts, train tickets, taxi invoices
 **Supported embedding models:**
 - `nomic-embed-text` (recommended — fast, accurate, 768 dimensions)
 - `mxbai-embed-large` (higher accuracy, slower)
 - Any Ollama-compatible embedding model
 ### Automatic Document Classification
 Every new document arriving in Paperless-ngx gets analyzed by a local LLM that reads the OCR content and assigns:
 - **Document type** — Invoice, Receipt, Contract, Letter, Statement, Tax Document, Certificate
 - **Tags** — Contextual tags based on content (e.g., "office", "travel", "insurance", "subscription")
 - **Correspondent** — Identifies the sender/vendor from document content
 - **Date extraction** — Finds the document date (not just the scan date)
 - **Language detection** — Identifies the document language
 Classification runs asynchronously in the background. New documents are processed within minutes of arriving in Paperless-ngx.
 ### Receipt Intelligence
 PaperCortex includes a dedicated receipt processing pipeline optimized for expense management:
 **Data extraction from receipts and invoices:**
 - Vendor / merchant name and address
 - Date of purchase
 - Total amount (gross and net)
 - Tax rate and tax amount (supports multiple VAT rates)
 - Currency
 - Individual line items with quantities and prices
 - Payment method
 - Invoice/receipt number
 **Works with:**
 - Scanned paper receipts (via Paperless-ngx OCR)
 - Digital PDF invoices
 - Photographed receipts (mobile upload to Paperless-ngx)
 - Multi-page invoices
 - Receipts in German, English, French, Spanish, and other languages
 ### Bank Statement Matching
 Import your bank statement as CSV and let PaperCortex automatically match transactions to receipts:
 - **Fuzzy matching** on amount, date, and vendor name
 - **Confidence scoring** — high/medium/low match indicators
 - **Unmatched detection** — highlights receipts without matching transactions and vice versa
 - **Multi-currency support** — handles EUR, USD, GBP, CHF, and 20+ currencies
 ### DATEV Export
 For German businesses and freelancers, PaperCortex generates DATEV-compatible export files that your Steuerberater can import directly:
 - **DATEV CSV format** (Buchungsstapel) — the standard German accounting import format
 - **SKR03 / SKR04** account mapping
 - **Automatic account assignment** based on document classification
 - **Beleglink** — links each DATEV entry back to the original document in Paperless-ngx
 - **Period exports** — monthly, quarterly, or annual
 Also supports plain CSV export for use with any accounting software worldwide.
 ### Natural Language Queries
 Ask questions about your document archive in plain language:
 ```
 "How much did I spend on hotels in Q1 2025?"
 "Show me all contracts expiring this year"
 "What was my highest single expense last month?"
 "Find all invoices from Deutsche Telekom"
 "Which receipts don't have a matching bank transaction?"
 "Summarize my office supply spending trend over the last 12 months"
 ```
 PaperCortex translates natural language into document queries, retrieves relevant documents via semantic search, and uses the local LLM to synthesize answers with source references.
 ### MCP Server Integration
 PaperCortex implements the [Model Context Protocol (MCP)](https://modelcontextprotocol.io) — the open standard for connecting AI agents to external tools. This means any MCP-compatible AI agent can use your document archive as a knowledge source.
 **Compatible with:**
 - [Claude Code](https://docs.anthropic.com/en/docs/claude-code) (Anthropic)
 - [Claude Desktop](https://claude.ai)
 - Any MCP-compatible AI agent or IDE plugin
 - Custom AI workflows via the MCP SDK
 ---
 ## Feature Comparison
 | Feature | PaperCortex | paperless-ai | Veryfi | Taggun | Rossum |
 |---|:---:|:---:|:---:|:---:|:---:|
 | Fully self-hosted | :white_check_mark: | :white_check_mark: | :x: | :x: | :x: |
 | Local AI (no cloud API) | :white_check_mark: | :x: OpenAI | :x: | :x: | :x: |
 | Semantic search | :white_check_mark: | :x: | :x: | :x: | :x: |
 | Auto-classification | :white_check_mark: | :white_check_mark: | :x: | :x: | :white_check_mark: |
 | Receipt data extraction | :white_check_mark: | :x: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
 | Bank statement matching | :white_check_mark: | :x: | :x: | :x: | :x: |
 | DATEV export | :white_check_mark: | :x: | :x: | :x: | :x: |
 | CSV accounting export | :white_check_mark: | :x: | :white_check_mark: | :x: | :white_check_mark: |
 | MCP Server | :white_check_mark: | :x: | :x: | :x: | :x: |
 | Natural language queries | :white_check_mark: | :x: | :x: | :x: | :x: |
 | Multi-language documents | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
 | Free and open source | :white_check_mark: | :white_check_mark: | :x: $$$  | :x: $$$ | :x: $$$$ |
 | Privacy — data stays local | :white_check_mark: | :warning: API calls | :x: | :x: | :x: |
 | Works with Paperless-ngx | :white_check_mark: | :white_check_mark: | :x: | :x: | :x: |
 ---
 ## Architecture
 ```
 ┌─────────────────────┐         ┌──────────────────────────┐         ┌────────────────────┐
 │                     │         │                          │         │                    │
 │  Claude Code /      │  MCP    │      PaperCortex         │  REST   │   Paperless-ngx    │
 │  AI Agents /        ├────────►│                          ├────────►│                    │
 │  Automation         │         │  ┌──────────────────┐    │   API   │  OCR + Storage +   │
 │                     │         │  │  MCP Server       │    │         │  Tagging           │
 └─────────────────────┘         │  │  (stdio / HTTP)   │    │         │                    │
                                │  └──────────────────┘    │         └────────────────────┘
                                │                          │
                                │  ┌──────────────────┐    │         ┌────────────────────┐
                                │  │  Intelligence     │    │         │                    │
                                │  │  Layer            │    │  LLM    │   Ollama           │
                                │  │                   ├────────────►│                    │
                                │  │  - Classifier     │    │  API    │  qwen2.5 / llama3  │
                                │  │  - Extractor      │    │         │  nomic-embed-text  │
                                │  │  - Query Engine   │    │         │                    │
                                │  └──────────────────┘    │         └────────────────────┘
                                │                          │
                                │  ┌──────────────────┐    │
                                │  │  Vector Store     │    │
                                │  │  (SQLite + HNSW)  │    │
                                │  └──────────────────┘    │
                                │                          │
                                └──────────────────────────┘
 ```
 ### How It Works
 1. **Documents arrive** in Paperless-ngx through scanning, email, or manual upload
 2. **PaperCortex polls** the Paperless-ngx API for new and updated documents
 3. **Embedding generation** — Ollama creates vector embeddings from OCR text
 4. **Classification** — the local LLM analyzes content and assigns types, tags, and metadata
 5. **Storage** — embeddings and extracted data are stored in a local SQLite vector database
 6. **Query interface** — the MCP Server exposes search, classify, extract, query, and export tools
 7. **AI agents connect** via MCP and interact with your documents using natural language
 All processing happens on your hardware. The only network traffic is between PaperCortex and your local Paperless-ngx and Ollama instances.
 ---
 ## Quick Start
 ### Prerequisites
 - **[Docker](https://docs.docker.com/get-docker/)** and Docker Compose
 - **[Paperless-ngx](https://github.com/paperless-ngx/paperless-ngx)** — running instance with API access
 - **[Ollama](https://ollama.com)** — running locally or on your network
 **Pull the required Ollama models:**
 ```bash
 ollama pull qwen2.5:14b          # LLM for classification, extraction, queries
 ollama pull nomic-embed-text      # Embedding model for semantic search
 ```
 ### Option 1: Docker Compose (Recommended)
 ```bash
 git clone https://github.com/renefichtmueller/PaperCortex.git
 cd PaperCortex
 cp .env.example .env
 ```
 Edit `.env` with your configuration:
 ```env
 PAPERLESS_URL=http://your-paperless-instance:8000
 PAPERLESS_TOKEN=your-paperless-api-token
 OLLAMA_URL=http://your-ollama-host:11434
 OLLAMA_MODEL=qwen2.5:14b
 OLLAMA_EMBEDDING_MODEL=nomic-embed-text
 ```
 Start PaperCortex:
 ```bash
 docker compose up -d
 ```
 PaperCortex will begin indexing your existing documents automatically.
 ### Option 2: Manual Installation
 ```bash
 git clone https://github.com/renefichtmueller/PaperCortex.git
 cd PaperCortex
 npm install
 cp .env.example .env
 # Edit .env with your settings
 npm run build
 npm start
 ```
 ### Option 3: npx (MCP Server only)
 ```bash
 npx papercortex --paperless-url http://localhost:8000 --paperless-token YOUR_TOKEN
 ```
 ---
 ## MCP Server Tools
 PaperCortex exposes five MCP tools that AI agents can call:
 ### `papercortex_search` — Semantic Document Search
 Find documents by meaning, not just keywords.
 ```json
 {
  "tool": "papercortex_search",
  "arguments": {
    "query": "electricity bills from last winter",
    "limit": 10,
    "date_from": "2024-12-01",
    "date_to": "2025-02-28"
  }
 }
 ```
 **Returns:** Ranked list of documents with relevance scores, titles, dates, and Paperless-ngx document IDs.
 ### `papercortex_classify` — Auto-Classification
 Analyze a document and assign type, tags, and metadata.
 ```json
 {
  "tool": "papercortex_classify",
  "arguments": {
    "document_id": 1234,
    "apply": true
  }
 }
 ```
 **Returns:** Suggested document type, tags, correspondent, and confidence scores. Set `apply: true` to write classifications back to Paperless-ngx.
 ### `papercortex_receipt` — Receipt Data Extraction
 Extract structured financial data from receipts and invoices.
 ```json
 {
  "tool": "papercortex_receipt",
  "arguments": {
    "document_id": 5678
  }
 }
 ```
 **Returns:**
 ```json
 {
  "vendor": "Amazon EU S.a.r.l.",
  "date": "2025-03-15",
  "total_gross": 119.99,
  "total_net": 100.83,
  "tax_rate": 19,
  "tax_amount": 19.16,
  "currency": "EUR",
  "items": [
    { "description": "USB-C Hub", "quantity": 1, "price": 49.99 },
    { "description": "Monitor Arm", "quantity": 1, "price": 70.00 }
  ],
  "invoice_number": "INV-DE-2025-1234567"
 }
 ```
 ### `papercortex_query` — Natural Language Questions
 Ask questions about your entire document archive.
 ```json
 {
  "tool": "papercortex_query",
  "arguments": {
    "question": "How much did I spend on business travel in Q1 2025?"
  }
 }
 ```
 **Returns:** A natural language answer with source document references and a breakdown of the calculation.
 ### `papercortex_export` — Accounting Export
 Export extracted receipt data in accounting-ready formats.
 ```json
 {
  "tool": "papercortex_export",
  "arguments": {
    "format": "datev",
    "date_from": "2025-01-01",
    "date_to": "2025-03-31",
    "account_plan": "SKR03"
  }
 }
 ```
 **Supported formats:** `datev` (German standard), `csv` (universal), `json` (programmatic).
 ---
 ## Claude Code Integration
 ### Register as MCP Server
 Add to your `~/.claude.json` or project `.claude/settings.json`:
 ```json
 {
  "mcpServers": {
    "papercortex": {
      "command": "node",
      "args": ["./dist/mcp-server/index.js"],
      "cwd": "/path/to/PaperCortex",
      "env": {
        "PAPERLESS_URL": "http://localhost:8000",
        "PAPERLESS_TOKEN": "your-token",
        "OLLAMA_URL": "http://localhost:11434"
      }
    }
  }
 }
 ```
 ### Example Conversations
 Once connected, you can ask Claude Code about your documents naturally:
 ```
 You: Search my documents for anything related to the office lease renewal
 Claude: I found 4 relevant documents:
  1. "Mietvertrag Verlängerung 2025" (Score: 0.94) — Document #1234
  2. "Office Lease Agreement Amendment" (Score: 0.91) — Document #1235
  3. "Nebenkostenabrechnung 2024" (Score: 0.78) — Document #1240
  4. "Facilities Management Invoice" (Score: 0.72) — Document #1251
 ```
 ```
 You: Extract the receipt data from document #5678 and export it for DATEV
 Claude: Extracted receipt data:
  Vendor: Deutsche Bahn AG
  Date: 2025-03-20
  Amount: 89.90 EUR (net: 75.55 EUR, 19% VAT: 14.35 EUR)
  Description: ICE Frankfurt-Berlin, 1st class
  DATEV export saved to: exports/datev_2025_03.csv
 ```
 ```
 You: How much did I spend on cloud services this year?
 Claude: Based on 23 matching documents, your cloud service spending in 2025:
  - AWS: 2,340.00 EUR (12 invoices)
  - Hetzner: 456.00 EUR (3 invoices)
  - Cloudflare: 240.00 EUR (3 invoices)
  - Vercel: 180.00 EUR (3 invoices)
  - GitHub: 132.00 EUR (2 invoices)
  Total: 3,348.00 EUR
 ```
 ---
 ## Receipt Workflow
 ### End-to-End Receipt Processing
 ```
 ┌──────────┐    ┌─────────────┐    ┌──────────────┐    ┌──────────┐    ┌──────────┐
 │  Scan /  │    │ Paperless-  │    │ PaperCortex  │    │  Match   │    │  Export  │
 │  Photo / ├───►│ ngx         ├───►│ Receipt      ├───►│  Bank    ├───►│  DATEV / │
 │  Email   │    │ OCR+Store   │    │ Extraction   │    │  CSV     │    │  CSV     │
 └──────────┘    └─────────────┘    └──────────────┘    └──────────┘    └──────────┘
 ```
 ### CLI Commands
 ```bash
 # Process all unprocessed receipts
 npm run receipt:process
 # Extract data from a specific document
 npm run receipt:extract -- --document-id 1234
 # Import bank statement and match transactions
 npm run receipt:match -- --bank-csv ./bank_export_2025_q1.csv
 # Export matched data as DATEV
 npm run receipt:export -- --format datev --period 2025-Q1
 # Export as plain CSV
 npm run receipt:export -- --format csv --period 2025-03
 ```
 ### DATEV Integration Details
 The DATEV export generates a `Buchungsstapel` CSV file following the official DATEV format specification:
 - **Header row** with advisor number, client number, fiscal year start, and export period
 - **Transaction rows** with amount, debit/credit account, tax code, date, and booking text
 - **Beleglink** — each row includes a reference to the source document in Paperless-ngx
 - **Account mapping** — automatic assignment based on vendor and document type (configurable)
 - **SKR03 and SKR04** chart of accounts supported
 ---
 ## Privacy and Security
 ### Why Local AI Matters
 Your documents contain some of the most sensitive data in your life:
 - **Tax returns** with income, deductions, and financial details
 - **Contracts** with confidential terms and personal information
 - **Medical bills** with health information
 - **Bank statements** with account numbers and transaction history
 - **Personal correspondence** with private content
 Cloud-based document AI services require uploading this data to external servers for processing. Even with encryption and privacy policies, you are trusting a third party with your most sensitive information.
 **PaperCortex takes a fundamentally different approach:**
 - All AI processing runs on **your hardware** via Ollama
 - Document content is sent only to **your local Ollama instance**
 - Embeddings and extracted data are stored in a **local SQLite database**
 - The only network traffic is between PaperCortex, your Paperless-ngx instance, and your Ollama server
 - **No telemetry, no analytics, no external API calls**
 **Your documents stay in your network. Period.**
 ### Security Best Practices
 - Store the Paperless-ngx API token in environment variables, never in source code
 - Run PaperCortex on the same network as Paperless-ngx and Ollama
 - Use Docker networks to isolate services
 - Regularly update Ollama and PaperCortex for security patches
 ---
 ## Configuration Reference
 All configuration is done through environment variables. See `.env.example` for a complete template.
 ### Core Settings
 | Variable | Default | Description |
 |---|---|---|
 | `PAPERLESS_URL` | `http://localhost:8000` | Paperless-ngx instance URL |
 | `PAPERLESS_TOKEN` | *(required)* | Paperless-ngx API authentication token |
 | `OLLAMA_URL` | `http://localhost:11434` | Ollama API endpoint |
 | `OLLAMA_MODEL` | `qwen2.5:14b` | LLM model for classification and extraction |
 | `OLLAMA_EMBEDDING_MODEL` | `nomic-embed-text` | Embedding model for semantic search |
 | `VECTOR_DB_PATH` | `./data/vectors.db` | Path to the SQLite vector database |
 ### Processing Settings
 | Variable | Default | Description |
 |---|---|---|
 | `POLL_INTERVAL` | `300` | Seconds between polling Paperless-ngx for new documents |
 | `BATCH_SIZE` | `10` | Number of documents to process per batch |
 | `EMBEDDING_DIMENSIONS` | `768` | Vector dimensions (must match embedding model) |
 | `CLASSIFICATION_CONFIDENCE` | `0.7` | Minimum confidence to auto-apply classifications |
 ### Export Settings
 | Variable | Default | Description |
 |---|---|---|
 | `DATEV_ADVISOR_NUMBER` | *(optional)* | Steuerberater number for DATEV export header |
 | `DATEV_CLIENT_NUMBER` | *(optional)* | Mandantennummer for DATEV export header |
 | `DATEV_FISCAL_YEAR_START` | `01-01` | Fiscal year start (MM-DD) |
 | `DEFAULT_ACCOUNT_PLAN` | `SKR03` | Default chart of accounts (`SKR03` or `SKR04`) |
 | `EXPORT_DIR` | `./exports` | Directory for generated export files |
 ### MCP Server Settings
 | Variable | Default | Description |
 |---|---|---|
 | `MCP_TRANSPORT` | `stdio` | MCP transport mode (`stdio` or `http`) |
 | `MCP_PORT` | `3100` | Port for HTTP transport mode |
 | `MCP_AUTH_TOKEN` | *(optional)* | Bearer token for HTTP transport authentication |
 ---
 ## Supported Models
 PaperCortex works with any Ollama-compatible model. Recommended configurations:
 ### For Classification and Extraction
 | Model | VRAM | Speed | Quality | Recommended For |
 |---|---|---|---|---|
 | `qwen2.5:7b` | 5 GB | Fast | Good | Raspberry Pi, low-end servers |
 | `qwen2.5:14b` | 10 GB | Medium | Very Good | Most homelab setups |
 | `qwen2.5:32b` | 20 GB | Slow | Excellent | High-accuracy requirements |
 | `llama3.1:8b` | 5 GB | Fast | Good | Alternative to Qwen |
 | `mistral:7b` | 5 GB | Fast | Good | European language focus |
 ### For Embeddings
 | Model | Dimensions | Speed | Quality |
 |---|---|---|---|
 | `nomic-embed-text` | 768 | Very Fast | Very Good |
 | `mxbai-embed-large` | 1024 | Fast | Excellent |
 | `all-minilm` | 384 | Fastest | Good |
 ---
 ## Project Structure
 ```
 PaperCortex/
 ├── src/
 │   ├── mcp-server/              # MCP Server for AI agent integration
 │   │   ├── index.ts             # Server entry point and tool registration
 │   │   └── tools/
 │   │       ├── search.ts        # Semantic document search tool
 │   │       ├── classify.ts      # Auto-classification tool
 │   │       ├── receipt.ts       # Receipt data extraction tool
 │   │       ├── query.ts         # Natural language query tool
 │   │       └── export.ts        # DATEV/CSV export tool
 │   ├── embeddings/
 │   │   ├── ollama.ts            # Ollama embedding API client
 │   │   └── store.ts             # SQLite vector store with HNSW index
 │   ├── paperless/
 │   │   ├── client.ts            # Paperless-ngx REST API client
 │   │   └── types.ts             # TypeScript type definitions
 │   └── receipt/
 │       ├── extractor.ts         # Receipt OCR content parsing and extraction
 │       ├── matcher.ts           # Bank CSV transaction matching engine
 │       └── datev.ts             # DATEV Buchungsstapel CSV formatter
 ├── docs/
 │   ├── architecture.md          # Detailed architecture documentation
 │   ├── setup.md                 # Step-by-step installation guide
 │   └── receipts.md              # Receipt workflow documentation
 ├── docker-compose.yml           # Production deployment
 ├── Dockerfile                   # Container build
 ├── .env.example                 # Configuration template (no secrets!)
 ├── package.json
 ├── tsconfig.json
 └── LICENSE                      # MIT
 ```
 ---
 ## Roadmap
 - [x] Core MCP Server with 5 tools
 - [x] Paperless-ngx API client
 - [x] Ollama embedding generation
 - [x] SQLite vector store
 - [x] Receipt data extraction
 - [x] DATEV export
 - [x] Docker deployment
 - [ ] Bank CSV matching engine
 - [ ] Web dashboard UI
 - [ ] Webhook support (instant processing on document arrival)
 - [ ] Multi-user support with separate vector stores
 - [ ] Additional export formats (SKR04 mapping, FiBu, CSV+)
 - [ ] Ollama vision model support for direct image analysis
 - [ ] Automated document workflow triggers
 - [ ] Plugin system for custom extractors
 - [ ] Prometheus metrics endpoint
 ---
 ## Contributing
 Contributions are welcome! PaperCortex is early-stage and there are many ways to help:
 ### Getting Started
 ```bash
 git clone https://github.com/renefichtmueller/PaperCortex.git
 cd PaperCortex
 npm install
 cp .env.example .env
 # Edit .env with your local Paperless-ngx and Ollama settings
 npm run dev
 ```
 ### How to Contribute
 1. **Fork** the repository
 2. **Create** a feature branch (`git checkout -b feat/amazing-feature`)
 3. **Write tests** for your changes
 4. **Commit** using conventional commits (`feat:`, `fix:`, `docs:`, `refactor:`)
 5. **Push** and open a Pull Request
 ### Areas Where Help is Needed
 | Area | Description | Difficulty |
 |---|---|---|
 | **Bank CSV Parsers** | Add parsers for different bank export formats (Sparkasse, ING, N26, Revolut, etc.) | Easy |
 | **Export Formats** | Additional accounting export formats beyond DATEV | Medium |
 | **Web Dashboard** | Build a simple web UI for browsing indexed documents and extracted data | Medium |
 | **Multi-language** | Improve extraction accuracy for non-English/German receipts | Medium |
 | **Vision Models** | Use Ollama vision models to extract data directly from receipt images | Hard |
 | **Webhooks** | React to Paperless-ngx document events in real-time | Medium |
 ---
 ## Frequently Asked Questions
 **Q: Does PaperCortex modify my documents in Paperless-ngx?**
 A: By default, PaperCortex only reads documents. When you use the `classify` tool with `apply: true`, it can write tags, document types, and correspondents back to Paperless-ngx. Extraction results and embeddings are stored in PaperCortex's own database.
 **Q: How much disk space does the vector database need?**
 A: Roughly 1-2 KB per document for embeddings. A collection of 10,000 documents needs about 10-20 MB of vector storage.
 **Q: Can I use OpenAI instead of Ollama?**
 A: PaperCortex is designed for local-first operation with Ollama. Support for OpenAI-compatible APIs (including local alternatives like LM Studio, vLLM, or LocalAI) is on the roadmap.
 **Q: What Paperless-ngx version is required?**
 A: PaperCortex works with Paperless-ngx 2.0 and later (REST API v3+).
 **Q: Can I run PaperCortex on a Raspberry Pi?**
 A: PaperCortex itself is lightweight. The bottleneck is Ollama — you'll need a model that fits in your available RAM. `qwen2.5:7b` works on 8GB devices.
 **Q: Is DATEV export only for Germany?**
 A: The DATEV format is the German standard, but PaperCortex also exports plain CSV that works with any accounting software worldwide.
 ---
 ## License
 MIT License — see [LICENSE](LICENSE) for details.
 Free to use, modify, and distribute. Commercial use welcome.
 ---
 ## Acknowledgments
 Built on the shoulders of giants:
 - **[Paperless-ngx](https://github.com/paperless-ngx/paperless-ngx)** — The incredible open-source document management system (37k+ stars)
 - **[Ollama](https://ollama.com)** — Making local AI accessible to everyone
 - **[Model Context Protocol](https://modelcontextprotocol.io)** — The open standard for AI tool integration by Anthropic
 - **[better-sqlite3](https://github.com/WiseLibs/better-sqlite3)** — Fast, reliable SQLite bindings for Node.js
 ---
 ## Star History
 If PaperCortex is useful to you, please consider giving it a star — it helps others discover the project!
 ---
 <p align="center">
  <strong>Your documents. Your AI. Your hardware.</strong><br/>
  <em>No cloud required.</em>
 </p>
--- a/docker-compose.yml
+++ b/docker-compose.yml
@ -0,0 +1,36 @@
 services:
  papercortex:
    build: .
    container_name: papercortex
    restart: unless-stopped
    ports:
      - "3100:3100"
    volumes:
      - papercortex-data:/app/data
    env_file:
      - .env
    environment:
      - NODE_ENV=production
    depends_on:
      - ollama
  ollama:
    image: ollama/ollama:latest
    container_name: papercortex-ollama
    restart: unless-stopped
    ports:
      - "11434:11434"
    volumes:
      - ollama-models:/root/.ollama
    # Uncomment for NVIDIA GPU support:
    # deploy:
    #   resources:
    #     reservations:
    #       devices:
    #         - driver: nvidia
    #           count: all
    #           capabilities: [gpu]
 volumes:
  papercortex-data:
  ollama-models:
--- a/docs/architecture.md
+++ b/docs/architecture.md
@ -0,0 +1,64 @@
 # Architecture
 ## Overview
 PaperCortex is structured as three layers:
 1. **MCP Server Layer** -- Exposes tools via the Model Context Protocol for AI agent integration.
 2. **Intelligence Layer** -- Embedding generation, classification, receipt extraction, and query answering.
 3. **Data Layer** -- Paperless-ngx API client and local SQLite vector store.
 ## Components
 ### MCP Server (`src/mcp-server/`)
 The entry point for all AI agent interactions. Implements the MCP standard using `@modelcontextprotocol/sdk` and communicates via stdio transport.
 Each tool is implemented as a separate handler module under `src/mcp-server/tools/`.
 ### Embeddings (`src/embeddings/`)
 - **ollama.ts** -- Client for the Ollama API. Handles embedding generation and LLM completions.
 - **store.ts** -- SQLite-backed vector store using `better-sqlite3`. Stores document embeddings and supports cosine similarity search.
 Current implementation uses brute-force search, which is performant up to ~100k documents. For larger archives, consider migrating to `sqlite-vss` or a dedicated vector database.
 ### Paperless Integration (`src/paperless/`)
 - **client.ts** -- REST API client for Paperless-ngx. Supports document CRUD, search, tags, correspondents, and document types.
 - **types.ts** -- TypeScript type definitions matching the Paperless-ngx API v3+ schema.
 ### Receipt Processing (`src/receipt/`)
 - **extractor.ts** -- Uses LLM to extract structured data from receipt OCR text.
 - **matcher.ts** -- Matches extracted receipts against bank CSV transaction exports.
 - **datev.ts** -- Generates DATEV Buchungsstapel format CSV for German accounting software.
 ## Data Flow
 ```
 Paperless-ngx  --(REST API)-->  PaperCortex  --(Ollama API)-->  Ollama
                                     |
                                     v
                              SQLite Vector DB
                                     |
                                     v
                              MCP Server (stdio)
                                     |
                                     v
                              Claude Code / AI Agents
 ```
 ## Security Model
 - All data stays local -- no external API calls except to Paperless-ngx and Ollama (both self-hosted).
 - API tokens are read from environment variables, never hardcoded.
 - The SQLite database is stored on the local filesystem with configurable path.
 - MCP Server communicates via stdio (no network port required for MCP).
 ## Future Considerations
 - **Webhook support** -- Listen for Paperless-ngx webhooks to auto-process new documents.
 - **Plugin system** -- Allow custom extractors and exporters.
 - **Web dashboard** -- Optional UI for monitoring and manual review.
 - **Multi-user** -- Support multiple Paperless-ngx instances and user isolation.
--- a/docs/receipts.md
+++ b/docs/receipts.md
@ -0,0 +1,101 @@
 # Receipt Workflow
 ## Overview
 PaperCortex provides a complete receipt-to-accounting pipeline:
 1. **Scan** -- Upload receipts to Paperless-ngx (scan, email, photo)
 2. **Extract** -- AI extracts structured data (vendor, date, amounts, line items)
 3. **Match** -- Reconcile against bank CSV exports
 4. **Export** -- Generate DATEV-compatible CSV for accounting software
 ## Receipt Extraction
 ### Via MCP Server (Claude Code)
 ```
 Extract receipt data from document #1234
 ```
 ### Via CLI
 ```bash
 npm run receipt:extract -- --document-id 1234
 ```
 ### Extracted Fields
 | Field | Description | Example |
 |---|---|---|
 | vendor | Company name | "IKEA Deutschland GmbH" |
 | vendorAddress | Full address | "Am Wanderweg 1, 65719 Hofheim" |
 | vendorTaxId | Tax ID / VAT number | "DE 129 341 800" |
 | date | Receipt date | "2024-03-15" |
 | currency | ISO 4217 code | "EUR" |
 | subtotal | Before tax | 84.03 |
 | taxRate | Tax percentage | 19 |
 | taxAmount | Tax amount | 15.97 |
 | totalAmount | Total with tax | 100.00 |
 | paymentMethod | How it was paid | "card" |
 | lineItems | Individual items | Array of items |
 | category | Expense category | "office_supplies" |
 ## Bank Statement Matching
 Match receipts against bank CSV exports to verify which receipts correspond to which bank transactions.
 ### Supported Bank Formats
 - Sparkasse (semicolon-separated, German format)
 - ING (semicolon-separated)
 - DKB (semicolon-separated)
 - Volksbank (semicolon-separated)
 - Generic CSV
 ### Matching Algorithm
 1. **Amount match** -- Exact or close amount (within 1.00 tolerance)
 2. **Date proximity** -- Same day, within 3 days, or within 7 days
 3. **Vendor name** -- Partial match in transaction description
 Results include a confidence score (0.0 - 1.0) and match reasons.
 ## DATEV Export
 ### Format
 PaperCortex generates DATEV Buchungsstapel (posting batch) format CSV, compatible with:
 - DATEV Unternehmen Online
 - lexoffice
 - sevDesk
 - FastBill
 - Any DATEV-import-capable software
 ### Account Mapping (SKR03)
 | Category | Account | Description |
 |---|---|---|
 | office_supplies | 4930 | Buerokosten |
 | travel | 4660 | Reisekosten |
 | food | 4650 | Bewirtungskosten |
 | telephone | 4920 | Telefon |
 | postage | 4910 | Porto |
 | rent | 4210 | Miete |
 | advertising | 4600 | Werbekosten |
 | software | 4964 | Software |
 | consulting | 4950 | Rechts- und Beratungskosten |
 | default | 4900 | Sonstige Aufwendungen |
 ### Export via CLI
 ```bash
 # Export all receipts from March 2024 as DATEV CSV
 npm run receipt:export -- --format datev --year 2024 --month 03
 ```
 ### Export via MCP Server
 ```
 Export documents #100, #101, #102 as DATEV CSV
 ```
--- a/docs/setup.md
+++ b/docs/setup.md
@ -0,0 +1,107 @@
 # Setup Guide
 ## Prerequisites
 - **Node.js** 20+ (or Docker)
 - **Paperless-ngx** instance with API access
 - **Ollama** with required models
 ## Step 1: Install Ollama Models
 ```bash
 # Required: LLM for classification and extraction
 ollama pull qwen2.5:14b
 # Required: Embedding model for semantic search
 ollama pull nomic-embed-text
 ```
 Verify Ollama is running:
 ```bash
 curl http://localhost:11434/api/tags
 ```
 ## Step 2: Get Paperless-ngx API Token
 1. Open your Paperless-ngx web UI
 2. Go to Settings > API
 3. Generate a new API token
 4. Copy the token for the next step
 ## Step 3: Configure PaperCortex
 ```bash
 git clone https://github.com/YOUR_USERNAME/PaperCortex.git
 cd PaperCortex
 cp .env.example .env
 ```
 Edit `.env` with your values:
 ```env
 PAPERLESS_URL=http://localhost:8000
 PAPERLESS_TOKEN=<your-api-token>
 OLLAMA_URL=http://localhost:11434
 ```
 ## Step 4: Run
 ### Option A: Docker (Recommended)
 ```bash
 docker compose up -d
 ```
 ### Option B: Manual
 ```bash
 npm install
 npm run build
 npm start
 ```
 ### Option C: Development
 ```bash
 npm install
 npm run dev
 ```
 ## Step 5: Register MCP Server
 Add to your Claude Code configuration (`~/.claude.json`):
 ```json
 {
  "mcpServers": {
    "papercortex": {
      "command": "node",
      "args": ["/absolute/path/to/PaperCortex/dist/mcp-server/index.js"],
      "env": {
        "PAPERLESS_URL": "http://localhost:8000",
        "PAPERLESS_TOKEN": "your-token",
        "OLLAMA_URL": "http://localhost:11434"
      }
    }
  }
 }
 ```
 ## Step 6: Populate Vector Store
 On first run, you need to embed your existing documents. This will be automated in a future release. For now, the vector store is populated as documents are queried or classified.
 ## Troubleshooting
 ### "Connection refused" to Paperless-ngx
 - Verify the URL in `.env` is reachable
 - Check that the API token is valid
 - Ensure Paperless-ngx is running
 ### "Connection refused" to Ollama
 - Run `ollama serve` if not already running
 - Check the port (default: 11434)
 - Verify models are pulled: `ollama list`
 ### Slow first query
 - The first embedding generation may take longer as Ollama loads the model into memory
 - Subsequent queries will be faster once the model is loaded
--- a/package.json
+++ b/package.json
@ -0,0 +1,57 @@
 {
  "name": "papercortex",
  "version": "0.1.0",
  "description": "Self-hosted AI intelligence layer for Paperless-ngx with semantic search, receipt extraction, and MCP Server integration",
  "main": "dist/mcp-server/index.js",
  "type": "module",
  "scripts": {
    "build": "tsc",
    "start": "node dist/mcp-server/index.js",
    "dev": "tsx watch src/mcp-server/index.ts",
    "lint": "eslint src/",
    "test": "vitest",
    "test:coverage": "vitest --coverage",
    "receipt:extract": "tsx src/receipt/extractor.ts",
    "receipt:match": "tsx src/receipt/matcher.ts",
    "receipt:export": "tsx src/receipt/datev.ts"
  },
  "keywords": [
    "paperless-ngx",
    "ollama",
    "mcp",
    "mcp-server",
    "semantic-search",
    "document-ai",
    "receipt-extraction",
    "datev",
    "self-hosted",
    "local-ai",
    "embeddings",
    "vector-search"
  ],
  "author": "",
  "license": "MIT",
  "repository": {
    "type": "git",
    "url": ""
  },
  "engines": {
    "node": ">=20.0.0"
  },
  "dependencies": {
    "@modelcontextprotocol/sdk": "^1.12.0",
    "better-sqlite3": "^11.8.0",
    "csv-parse": "^5.6.0",
    "csv-stringify": "^6.5.0",
    "dotenv": "^16.4.0",
    "zod": "^3.24.0"
  },
  "devDependencies": {
    "@types/better-sqlite3": "^7.6.12",
    "@types/node": "^22.10.0",
    "eslint": "^9.17.0",
    "tsx": "^4.19.0",
    "typescript": "^5.7.0",
    "vitest": "^3.0.0"
  }
 }
--- a/src/embeddings/ollama.ts
+++ b/src/embeddings/ollama.ts
@ -0,0 +1,148 @@
 /**
 * Ollama embedding and LLM integration.
 *
 * Generates vector embeddings and LLM completions using a local Ollama instance.
 * All functions are pure and return new objects -- no mutation.
 *
 * @example
 * ```ts
 * const ollama = createOllamaClient({ baseUrl: "http://localhost:11434" });
 * const embedding = await ollama.embed("Office rent invoice March 2024");
 * const answer = await ollama.complete("Classify this document: ...");
 * ```
 */
 // ---------------------------------------------------------------------------
 // Types
 // ---------------------------------------------------------------------------
 export interface OllamaConfig {
  readonly baseUrl: string;
  readonly model: string;
  readonly embeddingModel: string;
  readonly timeout?: number;
 }
 export interface EmbeddingResult {
  readonly vector: readonly number[];
  readonly model: string;
  readonly dimensions: number;
 }
 export interface CompletionResult {
  readonly text: string;
  readonly model: string;
  readonly totalDuration: number;
 }
 export interface OllamaClient {
  /** Generate an embedding vector for the given text. */
  embed(text: string): Promise<EmbeddingResult>;
  /** Generate a chat/instruct completion. */
  complete(prompt: string, systemPrompt?: string): Promise<CompletionResult>;
  /** Check if the Ollama server is reachable and models are available. */
  healthCheck(): Promise<{ ok: boolean; models: readonly string[] }>;
 }
 // ---------------------------------------------------------------------------
 // Implementation
 // ---------------------------------------------------------------------------
 /**
 * Create an Ollama client for embeddings and completions.
 */
 export function createOllamaClient(config: OllamaConfig): OllamaClient {
  const { baseUrl, model, embeddingModel, timeout = 120_000 } = config;
  async function post<T>(path: string, body: unknown): Promise<T> {
    const url = `${baseUrl.replace(/\/+$/, "")}${path}`;
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), timeout);
    try {
      const response = await fetch(url, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify(body),
        signal: controller.signal,
      });
      if (!response.ok) {
        const text = await response.text().catch(() => "");
        throw new Error(`Ollama API error: ${response.status} -- ${text}`);
      }
      return (await response.json()) as T;
    } finally {
      clearTimeout(timer);
    }
  }
  return {
    async embed(text) {
      // TODO: implement chunking for texts exceeding model context window
      // TODO: add retry logic with exponential backoff
      interface OllamaEmbedResponse {
        embedding: number[];
      }
      const result = await post<OllamaEmbedResponse>("/api/embeddings", {
        model: embeddingModel,
        prompt: text,
      });
      return {
        vector: result.embedding,
        model: embeddingModel,
        dimensions: result.embedding.length,
      };
    },
    async complete(prompt, systemPrompt) {
      // TODO: implement streaming support for long completions
      // TODO: add structured output parsing (JSON mode)
      interface OllamaGenerateResponse {
        response: string;
        model: string;
        total_duration: number;
      }
      const result = await post<OllamaGenerateResponse>("/api/generate", {
        model,
        prompt,
        system: systemPrompt ?? "",
        stream: false,
      });
      return {
        text: result.response,
        model: result.model,
        totalDuration: result.total_duration,
      };
    },
    async healthCheck() {
      try {
        const url = `${baseUrl.replace(/\/+$/, "")}/api/tags`;
        const response = await fetch(url);
        if (!response.ok) return { ok: false, models: [] };
        interface OllamaTagsResponse {
          models: Array<{ name: string }>;
        }
        const data = (await response.json()) as OllamaTagsResponse;
        return {
          ok: true,
          models: data.models.map((m) => m.name),
        };
      } catch {
        return { ok: false, models: [] };
      }
    },
  };
 }
--- a/src/embeddings/store.ts
+++ b/src/embeddings/store.ts
@ -0,0 +1,231 @@
 /**
 * Local SQLite-backed vector store for document embeddings.
 *
 * Stores embedding vectors alongside document metadata in a SQLite database
 * using better-sqlite3. Supports cosine similarity search for semantic
 * document retrieval.
 *
 * @example
 * ```ts
 * const store = createVectorStore({ dbPath: "./data/vectors.db" });
 * await store.upsert({ documentId: 42, vector: [...], content: "..." });
 * const results = await store.search(queryVector, { limit: 10 });
 * ```
 */
 import Database from "better-sqlite3";
 // ---------------------------------------------------------------------------
 // Types
 // ---------------------------------------------------------------------------
 export interface VectorStoreConfig {
  readonly dbPath: string;
 }
 export interface DocumentEmbedding {
  readonly documentId: number;
  readonly vector: readonly number[];
  readonly content: string;
  readonly title: string;
  readonly tags: readonly string[];
  readonly createdAt: string;
 }
 export interface SearchResult {
  readonly documentId: number;
  readonly title: string;
  readonly content: string;
  readonly score: number;
  readonly tags: readonly string[];
 }
 export interface SearchOptions {
  readonly limit?: number;
  readonly minScore?: number;
  readonly tagFilter?: readonly string[];
 }
 export interface VectorStore {
  /** Insert or update a document embedding. */
  upsert(embedding: DocumentEmbedding): void;
  /** Search for similar documents using cosine similarity. */
  search(queryVector: readonly number[], options?: SearchOptions): readonly SearchResult[];
  /** Remove an embedding by document ID. */
  remove(documentId: number): void;
  /** Get the total count of stored embeddings. */
  count(): number;
  /** Check if a document has been embedded. */
  has(documentId: number): boolean;
  /** Close the database connection. */
  close(): void;
 }
 // ---------------------------------------------------------------------------
 // Helpers
 // ---------------------------------------------------------------------------
 /**
 * Compute cosine similarity between two vectors.
 * Returns a value between -1 and 1 (1 = identical direction).
 */
 function cosineSimilarity(a: readonly number[], b: readonly number[]): number {
  if (a.length !== b.length) {
    throw new Error(
      `Vector dimension mismatch: ${a.length} vs ${b.length}`,
    );
  }
  let dotProduct = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dotProduct += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denominator = Math.sqrt(normA) * Math.sqrt(normB);
  if (denominator === 0) return 0;
  return dotProduct / denominator;
 }
 // ---------------------------------------------------------------------------
 // Implementation
 // ---------------------------------------------------------------------------
 /**
 * Create a local vector store backed by SQLite.
 *
 * TODO: Consider migrating to sqlite-vss or DuckDB for ANN search at scale.
 * The current brute-force approach works well for <100k documents.
 */
 export function createVectorStore(config: VectorStoreConfig): VectorStore {
  const db = new Database(config.dbPath);
  // Enable WAL mode for better concurrent read performance
  db.pragma("journal_mode = WAL");
  // Create tables if they don't exist
  db.exec(`
    CREATE TABLE IF NOT EXISTS embeddings (
      document_id INTEGER PRIMARY KEY,
      vector BLOB NOT NULL,
      content TEXT NOT NULL,
      title TEXT NOT NULL,
      tags TEXT NOT NULL DEFAULT '[]',
      created_at TEXT NOT NULL,
      updated_at TEXT NOT NULL DEFAULT (datetime('now'))
    );
    CREATE INDEX IF NOT EXISTS idx_embeddings_created
      ON embeddings (created_at);
  `);
  // Prepared statements for performance
  const upsertStmt = db.prepare(`
    INSERT INTO embeddings (document_id, vector, content, title, tags, created_at, updated_at)
    VALUES (?, ?, ?, ?, ?, ?, datetime('now'))
    ON CONFLICT(document_id) DO UPDATE SET
      vector = excluded.vector,
      content = excluded.content,
      title = excluded.title,
      tags = excluded.tags,
      updated_at = datetime('now')
  `);
  const getAllStmt = db.prepare(`
    SELECT document_id, vector, content, title, tags FROM embeddings
  `);
  const removeStmt = db.prepare(`
    DELETE FROM embeddings WHERE document_id = ?
  `);
  const countStmt = db.prepare(`
    SELECT COUNT(*) as count FROM embeddings
  `);
  const hasStmt = db.prepare(`
    SELECT 1 FROM embeddings WHERE document_id = ? LIMIT 1
  `);
  return {
    upsert(embedding) {
      const vectorBlob = Buffer.from(new Float32Array(embedding.vector).buffer);
      upsertStmt.run(
        embedding.documentId,
        vectorBlob,
        embedding.content,
        embedding.title,
        JSON.stringify(embedding.tags),
        embedding.createdAt,
      );
    },
    search(queryVector, options = {}) {
      const { limit = 10, minScore = 0.5, tagFilter } = options;
      // TODO: Implement ANN (approximate nearest neighbor) for large datasets
      // Current approach: brute-force scan -- fine for <100k documents
      interface EmbeddingRow {
        document_id: number;
        vector: Buffer;
        content: string;
        title: string;
        tags: string;
      }
      const rows = getAllStmt.all() as EmbeddingRow[];
      const scored = rows
        .map((row) => {
          const storedVector = Array.from(new Float32Array(row.vector.buffer));
          const tags: string[] = JSON.parse(row.tags);
          const score = cosineSimilarity(queryVector, storedVector);
          return {
            documentId: row.document_id,
            title: row.title,
            content: row.content,
            score,
            tags,
          };
        })
        .filter((result) => result.score >= minScore)
        .filter((result) => {
          if (!tagFilter || tagFilter.length === 0) return true;
          return tagFilter.some((tag) => result.tags.includes(tag));
        })
        .sort((a, b) => b.score - a.score)
        .slice(0, limit);
      return scored;
    },
    remove(documentId) {
      removeStmt.run(documentId);
    },
    count() {
      const row = countStmt.get() as { count: number };
      return row.count;
    },
    has(documentId) {
      return hasStmt.get(documentId) !== undefined;
    },
    close() {
      db.close();
    },
  };
 }
--- a/src/mcp-server/index.ts
+++ b/src/mcp-server/index.ts
@ -0,0 +1,249 @@
 /**
 * PaperCortex MCP Server entry point.
 *
 * Exposes document intelligence tools via the Model Context Protocol (MCP)
 * for integration with Claude Code and other AI agents.
 *
 * @see https://modelcontextprotocol.io
 */
 import { Server } from "@modelcontextprotocol/sdk/server/index.js";
 import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
 import {
  CallToolRequestSchema,
  ListToolsRequestSchema,
 } from "@modelcontextprotocol/sdk/types.js";
 import { config } from "dotenv";
 import { createOllamaClient } from "../embeddings/ollama.js";
 import { createVectorStore } from "../embeddings/store.js";
 import { createPaperlessClient } from "../paperless/client.js";
 import { handleClassify } from "./tools/classify.js";
 import { handleExport } from "./tools/export.js";
 import { handleQuery } from "./tools/query.js";
 import { handleReceipt } from "./tools/receipt.js";
 import { handleSearch } from "./tools/search.js";
 // ---------------------------------------------------------------------------
 // Configuration
 // ---------------------------------------------------------------------------
 config(); // Load .env
 function requireEnv(key: string): string {
  const value = process.env[key];
  if (!value) {
    throw new Error(`Missing required environment variable: ${key}`);
  }
  return value;
 }
 // ---------------------------------------------------------------------------
 // Service initialization
 // ---------------------------------------------------------------------------
 const paperless = createPaperlessClient({
  baseUrl: requireEnv("PAPERLESS_URL"),
  token: requireEnv("PAPERLESS_TOKEN"),
 });
 const ollama = createOllamaClient({
  baseUrl: process.env["OLLAMA_URL"] ?? "http://localhost:11434",
  model: process.env["OLLAMA_MODEL"] ?? "qwen2.5:14b",
  embeddingModel: process.env["OLLAMA_EMBEDDING_MODEL"] ?? "nomic-embed-text",
 });
 const vectorStore = createVectorStore({
  dbPath: process.env["VECTOR_DB_PATH"] ?? "./data/vectors.db",
 });
 // ---------------------------------------------------------------------------
 // Shared context for tool handlers
 // ---------------------------------------------------------------------------
 export interface ToolContext {
  readonly paperless: typeof paperless;
  readonly ollama: typeof ollama;
  readonly vectorStore: typeof vectorStore;
 }
 const ctx: ToolContext = { paperless, ollama, vectorStore };
 // ---------------------------------------------------------------------------
 // MCP Server setup
 // ---------------------------------------------------------------------------
 const server = new Server(
  {
    name: "papercortex",
    version: "0.1.0",
  },
  {
    capabilities: {
      tools: {},
    },
  },
 );
 /**
 * List all available PaperCortex tools.
 */
 server.setRequestHandler(ListToolsRequestSchema, async () => ({
  tools: [
    {
      name: "papercortex_search",
      description:
        "Semantic search across all documents in Paperless-ngx. " +
        "Finds documents by meaning, not just keywords.",
      inputSchema: {
        type: "object" as const,
        properties: {
          query: {
            type: "string",
            description: "Natural language search query",
          },
          limit: {
            type: "number",
            description: "Maximum number of results (default: 10)",
          },
          tags: {
            type: "array",
            items: { type: "string" },
            description: "Filter by tag names",
          },
        },
        required: ["query"],
      },
    },
    {
      name: "papercortex_classify",
      description:
        "Auto-classify a document using local AI. " +
        "Suggests tags, document type, and correspondent.",
      inputSchema: {
        type: "object" as const,
        properties: {
          documentId: {
            type: "number",
            description: "Paperless-ngx document ID",
          },
          applyTags: {
            type: "boolean",
            description: "Automatically apply suggested tags (default: false)",
          },
        },
        required: ["documentId"],
      },
    },
    {
      name: "papercortex_receipt",
      description:
        "Extract structured data from a receipt document: " +
        "vendor, date, amounts, tax, line items.",
      inputSchema: {
        type: "object" as const,
        properties: {
          documentId: {
            type: "number",
            description: "Paperless-ngx document ID of the receipt",
          },
        },
        required: ["documentId"],
      },
    },
    {
      name: "papercortex_query",
      description:
        "Ask natural language questions about your documents. " +
        'Example: "How much did I spend on office supplies in Q1 2024?"',
      inputSchema: {
        type: "object" as const,
        properties: {
          question: {
            type: "string",
            description: "Natural language question about your documents",
          },
          maxDocuments: {
            type: "number",
            description:
              "Maximum documents to include in context (default: 5)",
          },
        },
        required: ["question"],
      },
    },
    {
      name: "papercortex_export",
      description:
        "Export receipt data as DATEV-compatible CSV for German accounting, " +
        "or as generic CSV.",
      inputSchema: {
        type: "object" as const,
        properties: {
          documentIds: {
            type: "array",
            items: { type: "number" },
            description: "Document IDs to export",
          },
          format: {
            type: "string",
            enum: ["datev", "csv"],
            description: "Export format (default: datev)",
          },
        },
        required: ["documentIds"],
      },
    },
  ],
 }));
 /**
 * Route tool calls to their respective handlers.
 */
 server.setRequestHandler(CallToolRequestSchema, async (request) => {
  const { name, arguments: args } = request.params;
  try {
    switch (name) {
      case "papercortex_search":
        return await handleSearch(ctx, args as Record<string, unknown>);
      case "papercortex_classify":
        return await handleClassify(ctx, args as Record<string, unknown>);
      case "papercortex_receipt":
        return await handleReceipt(ctx, args as Record<string, unknown>);
      case "papercortex_query":
        return await handleQuery(ctx, args as Record<string, unknown>);
      case "papercortex_export":
        return await handleExport(ctx, args as Record<string, unknown>);
      default:
        return {
          content: [
            { type: "text" as const, text: `Unknown tool: ${name}` },
          ],
          isError: true,
        };
    }
  } catch (error) {
    const message =
      error instanceof Error ? error.message : "Unknown error occurred";
    return {
      content: [{ type: "text" as const, text: `Error: ${message}` }],
      isError: true,
    };
  }
 });
 // ---------------------------------------------------------------------------
 // Start server
 // ---------------------------------------------------------------------------
 async function main(): Promise<void> {
  const transport = new StdioServerTransport();
  await server.connect(transport);
  console.error("PaperCortex MCP Server running on stdio");
 }
 main().catch((error) => {
  console.error("Fatal error starting PaperCortex:", error);
  process.exit(1);
 });
--- a/src/mcp-server/tools/classify.ts
+++ b/src/mcp-server/tools/classify.ts
@ -0,0 +1,117 @@
 /**
 * Auto-classification tool for the PaperCortex MCP Server.
 *
 * Uses local LLM to analyze document content and suggest appropriate
 * tags, document types, and correspondents.
 */
 import type { ToolContext } from "../index.js";
 // ---------------------------------------------------------------------------
 // Types
 // ---------------------------------------------------------------------------
 interface ClassifyArgs {
  readonly documentId: number;
  readonly applyTags?: boolean;
 }
 interface ClassificationResult {
  readonly suggestedTags: readonly string[];
  readonly suggestedType: string | null;
  readonly suggestedCorrespondent: string | null;
  readonly summary: string;
  readonly language: string;
  readonly confidence: number;
 }
 // ---------------------------------------------------------------------------
 // Prompts
 // ---------------------------------------------------------------------------
 const CLASSIFY_SYSTEM_PROMPT = `You are a document classification assistant. Analyze the document content and provide structured classification.
 Respond with valid JSON only:
 {
  "suggestedTags": ["tag1", "tag2"],
  "suggestedType": "invoice|contract|receipt|letter|report|tax_document|bank_statement|insurance|warranty|manual|other",
  "suggestedCorrespondent": "Company or person name",
  "summary": "One sentence summary",
  "language": "ISO 639-1 code",
  "confidence": 0.0 to 1.0
 }`;
 // ---------------------------------------------------------------------------
 // Handler
 // ---------------------------------------------------------------------------
 /**
 * Handle a `papercortex_classify` tool call.
 *
 * 1. Fetch document content from Paperless-ngx.
 * 2. Send content to Ollama for classification.
 * 3. Optionally apply suggested tags back to Paperless-ngx.
 *
 * TODO: Match suggested tags against existing Paperless-ngx tags
 * TODO: Create new tags automatically when confidence is high
 * TODO: Learn from user corrections to improve classification
 */
 export async function handleClassify(
  ctx: ToolContext,
  args: Record<string, unknown>,
 ): Promise<{ content: Array<{ type: "text"; text: string }> }> {
  const { documentId, applyTags = false } = args as unknown as ClassifyArgs;
  // Fetch document from Paperless-ngx
  const document = await ctx.paperless.getDocument(documentId);
  if (!document.content || document.content.trim().length === 0) {
    return {
      content: [
        {
          type: "text",
          text: `Document #${documentId} has no text content. OCR may not have completed.`,
        },
      ],
    };
  }
  // Classify using Ollama
  const prompt = `Classify this document:\n\nTitle: ${document.title}\n\nContent:\n${document.content.slice(0, 4000)}`;
  const completion = await ctx.ollama.complete(prompt, CLASSIFY_SYSTEM_PROMPT);
  let classification: ClassificationResult;
  try {
    classification = JSON.parse(completion.text) as ClassificationResult;
  } catch {
    return {
      content: [
        {
          type: "text",
          text: `Classification failed: LLM did not return valid JSON.\nRaw response: ${completion.text.slice(0, 500)}`,
        },
      ],
    };
  }
  // Optionally apply tags
  let appliedNote = "";
  if (applyTags && classification.suggestedTags.length > 0) {
    // TODO: Look up tag IDs from Paperless-ngx, create missing tags
    appliedNote =
      "\n\nNote: Tag application is not yet implemented. " +
      "Tags need to be matched against existing Paperless-ngx tags.";
  }
  const output =
    `Classification for Document #${documentId} "${document.title}":\n\n` +
    `Type: ${classification.suggestedType ?? "unknown"}\n` +
    `Correspondent: ${classification.suggestedCorrespondent ?? "unknown"}\n` +
    `Tags: ${classification.suggestedTags.join(", ") || "none"}\n` +
    `Language: ${classification.language}\n` +
    `Summary: ${classification.summary}\n` +
    `Confidence: ${(classification.confidence * 100).toFixed(0)}%` +
    appliedNote;
  return { content: [{ type: "text", text: output }] };
 }
--- a/src/mcp-server/tools/export.ts
+++ b/src/mcp-server/tools/export.ts
@ -0,0 +1,116 @@
 /**
 * DATEV/CSV export tool for the PaperCortex MCP Server.
 *
 * Exports receipt data in accounting-compatible formats.
 */
 import { createReceiptExtractor } from "../../receipt/extractor.js";
 import { createDatevExporter } from "../../receipt/datev.js";
 import type { ToolContext } from "../index.js";
 // ---------------------------------------------------------------------------
 // Types
 // ---------------------------------------------------------------------------
 interface ExportArgs {
  readonly documentIds: readonly number[];
  readonly format?: "datev" | "csv";
 }
 // ---------------------------------------------------------------------------
 // Handler
 // ---------------------------------------------------------------------------
 /**
 * Handle a `papercortex_export` tool call.
 *
 * 1. Extract receipt data from all specified documents.
 * 2. Format as DATEV or generic CSV.
 * 3. Return the CSV content.
 *
 * TODO: Add file output option (save to disk)
 * TODO: Add date range filtering
 * TODO: Add DATEV header metadata (consultant/client numbers from config)
 */
 export async function handleExport(
  ctx: ToolContext,
  args: Record<string, unknown>,
 ): Promise<{ content: Array<{ type: "text"; text: string }> }> {
  const { documentIds, format = "datev" } = args as unknown as ExportArgs;
  if (!documentIds || documentIds.length === 0) {
    return {
      content: [
        {
          type: "text",
          text: "Error: at least one document ID is required for export.",
        },
      ],
    };
  }
  // Extract receipt data from all documents
  const extractor = createReceiptExtractor({
    ollama: ctx.ollama,
    paperless: ctx.paperless,
  });
  const receipts = await extractor.extractBatch(documentIds);
  if (format === "datev") {
    // TODO: Read consultant/client numbers from configuration
    const exporter = createDatevExporter({
      consultantNumber: 0,
      clientNumber: 0,
    });
    const receiptsForExport = receipts.map((r) => ({
      documentId: r.documentId,
      vendor: r.vendor,
      date: r.date,
      totalAmount: r.totalAmount,
      taxRate: r.taxRate,
      category: r.category,
    }));
    const csv = exporter.generateCsv(receiptsForExport);
    return {
      content: [
        {
          type: "text",
          text:
            `DATEV export for ${receipts.length} receipt(s):\n\n` +
            "```csv\n" +
            csv +
            "\n```\n\n" +
            "Copy this CSV content into a file and import into your " +
            "DATEV-compatible accounting software.",
        },
      ],
    };
  }
  // Generic CSV format
  const header = "Document ID;Vendor;Date;Amount;Tax Rate;Tax Amount;Currency;Category";
  const rows = receipts.map(
    (r) =>
      `${r.documentId};${r.vendor};${r.date};${r.totalAmount.toFixed(2)};` +
      `${r.taxRate ?? ""};${r.taxAmount?.toFixed(2) ?? ""};${r.currency};${r.category ?? ""}`,
  );
  const csv = [header, ...rows].join("\n");
  return {
    content: [
      {
        type: "text",
        text:
          `CSV export for ${receipts.length} receipt(s):\n\n` +
          "```csv\n" +
          csv +
          "\n```",
      },
    ],
  };
 }
--- a/src/mcp-server/tools/query.ts
+++ b/src/mcp-server/tools/query.ts
@ -0,0 +1,110 @@
 /**
 * Natural language query tool for the PaperCortex MCP Server.
 *
 * Answers questions about documents using RAG (Retrieval-Augmented Generation):
 * retrieves relevant documents via semantic search, then generates an answer
 * using the local LLM with document context.
 */
 import type { ToolContext } from "../index.js";
 // ---------------------------------------------------------------------------
 // Types
 // ---------------------------------------------------------------------------
 interface QueryArgs {
  readonly question: string;
  readonly maxDocuments?: number;
 }
 // ---------------------------------------------------------------------------
 // Prompts
 // ---------------------------------------------------------------------------
 const QUERY_SYSTEM_PROMPT = `You are a document analysis assistant. Answer the user's question based ONLY on the provided document excerpts. If the documents don't contain enough information to answer, say so clearly.
 Be precise with numbers, dates, and amounts. Cite document IDs when referencing specific documents.`;
 // ---------------------------------------------------------------------------
 // Handler
 // ---------------------------------------------------------------------------
 /**
 * Handle a `papercortex_query` tool call.
 *
 * Uses RAG (Retrieval-Augmented Generation):
 * 1. Embed the question and retrieve relevant documents.
 * 2. Build a context from retrieved documents.
 * 3. Generate an answer using the local LLM.
 *
 * TODO: Add conversation history for follow-up questions
 * TODO: Add source citation with page numbers
 * TODO: Implement query decomposition for complex questions
 */
 export async function handleQuery(
  ctx: ToolContext,
  args: Record<string, unknown>,
 ): Promise<{ content: Array<{ type: "text"; text: string }> }> {
  const { question, maxDocuments = 5 } = args as unknown as QueryArgs;
  if (!question || question.trim().length === 0) {
    return {
      content: [{ type: "text", text: "Error: question cannot be empty." }],
    };
  }
  // Step 1: Retrieve relevant documents
  const queryEmbedding = await ctx.ollama.embed(question);
  const relevantDocs = ctx.vectorStore.search(queryEmbedding.vector, {
    limit: maxDocuments,
    minScore: 0.3,
  });
  if (relevantDocs.length === 0) {
    return {
      content: [
        {
          type: "text",
          text:
            `I couldn't find any relevant documents to answer: "${question}"\n\n` +
            "The vector store may need to be populated first, or your documents " +
            "may not contain information related to this question.",
        },
      ],
    };
  }
  // Step 2: Build context from retrieved documents
  const context = relevantDocs
    .map(
      (doc) =>
        `--- Document #${doc.documentId}: ${doc.title} (relevance: ${doc.score.toFixed(2)}) ---\n` +
        doc.content.slice(0, 2000),
    )
    .join("\n\n");
  // Step 3: Generate answer with context
  const prompt =
    `Based on the following documents, answer this question: "${question}"\n\n` +
    `Documents:\n${context}`;
  const completion = await ctx.ollama.complete(prompt, QUERY_SYSTEM_PROMPT);
  const sourcesNote = relevantDocs
    .map(
      (doc) =>
        `  - Document #${doc.documentId}: ${doc.title} (score: ${doc.score.toFixed(2)})`,
    )
    .join("\n");
  return {
    content: [
      {
        type: "text",
        text:
          `${completion.text}\n\n` +
          `---\nSources (${relevantDocs.length} documents):\n${sourcesNote}`,
      },
    ],
  };
 }
--- a/src/mcp-server/tools/receipt.ts
+++ b/src/mcp-server/tools/receipt.ts
@ -0,0 +1,76 @@
 /**
 * Receipt extraction tool for the PaperCortex MCP Server.
 *
 * Extracts structured receipt data from Paperless-ngx documents
 * using local LLM analysis.
 */
 import { createReceiptExtractor } from "../../receipt/extractor.js";
 import type { ToolContext } from "../index.js";
 // ---------------------------------------------------------------------------
 // Types
 // ---------------------------------------------------------------------------
 interface ReceiptArgs {
  readonly documentId: number;
 }
 // ---------------------------------------------------------------------------
 // Handler
 // ---------------------------------------------------------------------------
 /**
 * Handle a `papercortex_receipt` tool call.
 *
 * 1. Fetch document from Paperless-ngx.
 * 2. Extract receipt data using LLM.
 * 3. Return structured receipt information.
 *
 * TODO: Cache extraction results to avoid re-processing
 * TODO: Add confidence thresholds and human review flags
 * TODO: Store extracted data back as Paperless-ngx custom fields
 */
 export async function handleReceipt(
  ctx: ToolContext,
  args: Record<string, unknown>,
 ): Promise<{ content: Array<{ type: "text"; text: string }> }> {
  const { documentId } = args as unknown as ReceiptArgs;
  const extractor = createReceiptExtractor({
    ollama: ctx.ollama,
    paperless: ctx.paperless,
  });
  const receipt = await extractor.extract(documentId);
  // Format line items table
  const lineItemsTable =
    receipt.lineItems.length > 0
      ? receipt.lineItems
          .map(
            (item, i) =>
              `  ${i + 1}. ${item.description} | ` +
              `${item.quantity}x ${item.unitPrice.toFixed(2)} = ${item.totalPrice.toFixed(2)}`,
          )
          .join("\n")
      : "  No line items extracted";
  const output =
    `Receipt Data for Document #${documentId}:\n\n` +
    `Vendor: ${receipt.vendor}\n` +
    `Address: ${receipt.vendorAddress ?? "N/A"}\n` +
    `Tax ID: ${receipt.vendorTaxId ?? "N/A"}\n` +
    `Date: ${receipt.date}\n` +
    `Currency: ${receipt.currency}\n` +
    `\nAmounts:\n` +
    `  Subtotal: ${receipt.subtotal?.toFixed(2) ?? "N/A"}\n` +
    `  Tax (${receipt.taxRate ?? "?"}%): ${receipt.taxAmount?.toFixed(2) ?? "N/A"}\n` +
    `  Total: ${receipt.totalAmount.toFixed(2)}\n` +
    `\nPayment: ${receipt.paymentMethod ?? "N/A"}\n` +
    `Category: ${receipt.category ?? "uncategorized"}\n` +
    `Confidence: ${(receipt.confidence * 100).toFixed(0)}%\n` +
    `\nLine Items:\n${lineItemsTable}`;
  return { content: [{ type: "text", text: output }] };
 }
--- a/src/mcp-server/tools/search.ts
+++ b/src/mcp-server/tools/search.ts
@ -0,0 +1,87 @@
 /**
 * Semantic search tool for the PaperCortex MCP Server.
 *
 * Performs vector similarity search across all embedded documents,
 * returning the most semantically relevant results.
 */
 import type { ToolContext } from "../index.js";
 // ---------------------------------------------------------------------------
 // Types
 // ---------------------------------------------------------------------------
 interface SearchArgs {
  readonly query: string;
  readonly limit?: number;
  readonly tags?: readonly string[];
 }
 // ---------------------------------------------------------------------------
 // Handler
 // ---------------------------------------------------------------------------
 /**
 * Handle a `papercortex_search` tool call.
 *
 * 1. Generate an embedding for the search query via Ollama.
 * 2. Search the local vector store for similar documents.
 * 3. Return ranked results with scores and metadata.
 *
 * TODO: Add hybrid search (combine vector + keyword for better recall)
 * TODO: Add date range filtering
 * TODO: Add result caching for repeated queries
 */
 export async function handleSearch(
  ctx: ToolContext,
  args: Record<string, unknown>,
 ): Promise<{ content: Array<{ type: "text"; text: string }> }> {
  const { query, limit = 10, tags } = args as unknown as SearchArgs;
  if (!query || query.trim().length === 0) {
    return {
      content: [{ type: "text", text: "Error: search query cannot be empty." }],
    };
  }
  // Generate embedding for the query
  const queryEmbedding = await ctx.ollama.embed(query);
  // Search vector store
  const results = ctx.vectorStore.search(queryEmbedding.vector, {
    limit,
    minScore: 0.4,
    tagFilter: tags ? [...tags] : undefined,
  });
  if (results.length === 0) {
    return {
      content: [
        {
          type: "text",
          text: `No documents found matching "${query}". The vector store may need to be populated first.`,
        },
      ],
    };
  }
  // Format results
  const formatted = results
    .map(
      (r, i) =>
        `${i + 1}. [Document #${r.documentId}] (score: ${r.score.toFixed(3)})\n` +
        `   Title: ${r.title}\n` +
        `   Tags: ${r.tags.length > 0 ? r.tags.join(", ") : "none"}\n` +
        `   Preview: ${r.content.slice(0, 200).replace(/\n/g, " ")}...`,
    )
    .join("\n\n");
  return {
    content: [
      {
        type: "text",
        text: `Found ${results.length} documents matching "${query}":\n\n${formatted}`,
      },
    ],
  };
 }
--- a/src/paperless/client.ts
+++ b/src/paperless/client.ts
@ -0,0 +1,182 @@
 /**
 * Paperless-ngx REST API client.
 *
 * Provides typed access to documents, correspondents, tags, and document types.
 * All methods return immutable result objects.
 *
 * @example
 * ```ts
 * const client = createPaperlessClient({
 *   baseUrl: "http://localhost:8000",
 *   token: "your-api-token",
 * });
 * const docs = await client.getDocuments({ query: "invoice" });
 * ```
 */
 import type {
  Correspondent,
  DocumentSearchParams,
  DocumentType,
  PaginatedResponse,
  PaperlessConfig,
  PaperlessDocument,
  Tag,
 } from "./types.js";
 // ---------------------------------------------------------------------------
 // Client interface
 // ---------------------------------------------------------------------------
 export interface PaperlessClient {
  /** Fetch a single document by ID. */
  getDocument(id: number): Promise<PaperlessDocument>;
  /** Search / list documents with optional filters. */
  getDocuments(
    params?: DocumentSearchParams,
  ): Promise<PaginatedResponse<PaperlessDocument>>;
  /** Fetch all correspondents. */
  getCorrespondents(): Promise<PaginatedResponse<Correspondent>>;
  /** Fetch all tags. */
  getTags(): Promise<PaginatedResponse<Tag>>;
  /** Fetch all document types. */
  getDocumentTypes(): Promise<PaginatedResponse<DocumentType>>;
  /** Download the original file content of a document. */
  downloadDocument(id: number): Promise<ArrayBuffer>;
  /** Update tags on a document (immutable -- returns the updated doc). */
  updateDocumentTags(
    id: number,
    tagIds: readonly number[],
  ): Promise<PaperlessDocument>;
 }
 // ---------------------------------------------------------------------------
 // Implementation
 // ---------------------------------------------------------------------------
 /**
 * Create a new Paperless-ngx API client.
 *
 * @param config - Connection configuration (URL + token).
 * @returns A {@link PaperlessClient} instance.
 */
 export function createPaperlessClient(config: PaperlessConfig): PaperlessClient {
  const { baseUrl, token, timeout = 30_000 } = config;
  const headers: Record<string, string> = {
    Authorization: `Token ${token}`,
    "Content-Type": "application/json",
    Accept: "application/json; version=3",
  };
  /**
   * Internal fetch wrapper with timeout and error handling.
   */
  async function request<T>(
    path: string,
    options: RequestInit = {},
  ): Promise<T> {
    const url = `${baseUrl.replace(/\/+$/, "")}/api${path}`;
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), timeout);
    try {
      const response = await fetch(url, {
        ...options,
        headers: { ...headers, ...((options.headers as Record<string, string>) ?? {}) },
        signal: controller.signal,
      });
      if (!response.ok) {
        const body = await response.text().catch(() => "");
        throw new Error(
          `Paperless API error: ${response.status} ${response.statusText} -- ${body}`,
        );
      }
      return (await response.json()) as T;
    } finally {
      clearTimeout(timer);
    }
  }
  /**
   * Build query string from search params.
   */
  function buildQuery(params?: DocumentSearchParams): string {
    if (!params) return "";
    const entries = Object.entries(params).filter(
      ([, v]) => v !== undefined && v !== null,
    );
    if (entries.length === 0) return "";
    const searchParams = new URLSearchParams();
    for (const [key, value] of entries) {
      if (Array.isArray(value)) {
        searchParams.set(key, value.join(","));
      } else {
        searchParams.set(key, String(value));
      }
    }
    return `?${searchParams.toString()}`;
  }
  return {
    async getDocument(id) {
      return request<PaperlessDocument>(`/documents/${id}/`);
    },
    async getDocuments(params) {
      return request<PaginatedResponse<PaperlessDocument>>(
        `/documents/${buildQuery(params)}`,
      );
    },
    async getCorrespondents() {
      return request<PaginatedResponse<Correspondent>>("/correspondents/");
    },
    async getTags() {
      return request<PaginatedResponse<Tag>>("/tags/");
    },
    async getDocumentTypes() {
      return request<PaginatedResponse<DocumentType>>("/document_types/");
    },
    async downloadDocument(id) {
      const url = `${baseUrl.replace(/\/+$/, "")}/api/documents/${id}/download/`;
      const controller = new AbortController();
      const timer = setTimeout(() => controller.abort(), timeout);
      try {
        const response = await fetch(url, {
          headers: { Authorization: `Token ${token}` },
          signal: controller.signal,
        });
        if (!response.ok) {
          throw new Error(
            `Paperless download error: ${response.status} ${response.statusText}`,
          );
        }
        return await response.arrayBuffer();
      } finally {
        clearTimeout(timer);
      }
    },
    async updateDocumentTags(id, tagIds) {
      return request<PaperlessDocument>(`/documents/${id}/`, {
        method: "PATCH",
        body: JSON.stringify({ tags: [...tagIds] }),
      });
    },
  };
 }
--- a/src/paperless/types.ts
+++ b/src/paperless/types.ts
@ -0,0 +1,126 @@
 /**
 * TypeScript type definitions for the Paperless-ngx REST API.
 *
 * Based on Paperless-ngx API v3+.
 * @see https://docs.paperless-ngx.com/api/
 */
 // ---------------------------------------------------------------------------
 // Pagination
 // ---------------------------------------------------------------------------
 /** Generic paginated response envelope from Paperless-ngx. */
 export interface PaginatedResponse<T> {
  readonly count: number;
  readonly next: string | null;
  readonly previous: string | null;
  readonly results: readonly T[];
 }
 // ---------------------------------------------------------------------------
 // Core entities
 // ---------------------------------------------------------------------------
 export interface PaperlessDocument {
  readonly id: number;
  readonly correspondent: number | null;
  readonly document_type: number | null;
  readonly storage_path: number | null;
  readonly title: string;
  readonly content: string;
  readonly tags: readonly number[];
  readonly created: string;
  readonly created_date: string;
  readonly modified: string;
  readonly added: string;
  readonly archive_serial_number: number | null;
  readonly original_file_name: string;
  readonly archived_file_name: string | null;
  readonly owner: number | null;
  readonly notes: readonly DocumentNote[];
  readonly custom_fields: readonly CustomFieldValue[];
 }
 export interface DocumentNote {
  readonly id: number;
  readonly note: string;
  readonly created: string;
  readonly user: number;
 }
 export interface CustomFieldValue {
  readonly field: number;
  readonly value: string | number | boolean | null;
 }
 export interface Correspondent {
  readonly id: number;
  readonly slug: string;
  readonly name: string;
  readonly match: string;
  readonly matching_algorithm: number;
  readonly is_insensitive: boolean;
  readonly document_count: number;
  readonly last_correspondence: string | null;
 }
 export interface DocumentType {
  readonly id: number;
  readonly slug: string;
  readonly name: string;
  readonly match: string;
  readonly matching_algorithm: number;
  readonly is_insensitive: boolean;
  readonly document_count: number;
 }
 export interface Tag {
  readonly id: number;
  readonly slug: string;
  readonly name: string;
  readonly color: string;
  readonly text_color: string;
  readonly match: string;
  readonly matching_algorithm: number;
  readonly is_insensitive: boolean;
  readonly is_inbox_tag: boolean;
  readonly document_count: number;
 }
 export interface StoragePath {
  readonly id: number;
  readonly slug: string;
  readonly name: string;
  readonly path: string;
  readonly match: string;
  readonly matching_algorithm: number;
  readonly is_insensitive: boolean;
  readonly document_count: number;
 }
 // ---------------------------------------------------------------------------
 // Search & filter
 // ---------------------------------------------------------------------------
 export interface DocumentSearchParams {
  readonly query?: string;
  readonly correspondent__id?: number;
  readonly document_type__id?: number;
  readonly tags__id__all?: readonly number[];
  readonly tags__id__none?: readonly number[];
  readonly created__date__gt?: string;
  readonly created__date__lt?: string;
  readonly ordering?: string;
  readonly page?: number;
  readonly page_size?: number;
 }
 // ---------------------------------------------------------------------------
 // API client configuration
 // ---------------------------------------------------------------------------
 export interface PaperlessConfig {
  readonly baseUrl: string;
  readonly token: string;
  readonly timeout?: number;
 }
--- a/src/receipt/datev.ts
+++ b/src/receipt/datev.ts
@ -0,0 +1,171 @@
 /**
 * DATEV export formatter.
 *
 * Generates DATEV-compatible CSV files for import into German accounting
 * software (DATEV Unternehmen Online, lexoffice, sevDesk, etc.).
 *
 * Implements the DATEV "Buchungsstapel" (posting batch) format v7.0+.
 *
 * @see https://developer.datev.de/datev/platform/en/dtvf/formate
 *
 * @example
 * ```ts
 * const exporter = createDatevExporter({ consultantNumber: 12345, clientNumber: 67890 });
 * const csv = exporter.generateCsv(receiptData);
 * writeFileSync("./export.csv", csv);
 * ```
 */
 import { stringify } from "csv-stringify/sync";
 // ---------------------------------------------------------------------------
 // Types
 // ---------------------------------------------------------------------------
 export interface DatevConfig {
  /** DATEV consultant number (Beraternummer). */
  readonly consultantNumber: number;
  /** DATEV client number (Mandantennummer). */
  readonly clientNumber: number;
  /** Fiscal year start (1-12, default: 1 for January). */
  readonly fiscalYearStart?: number;
  /** Default debit account length (SKR03/SKR04). */
  readonly accountLength?: 4 | 5;
 }
 export interface DatevBookingEntry {
  readonly amount: number;
  readonly debitAccount: string;
  readonly creditAccount: string;
  readonly taxCode: string;
  readonly date: string;
  readonly description: string;
  readonly documentNumber: string;
  readonly costCenter?: string;
 }
 export interface ReceiptForExport {
  readonly documentId: number;
  readonly vendor: string;
  readonly date: string;
  readonly totalAmount: number;
  readonly taxRate: number | null;
  readonly category: string | null;
 }
 export interface DatevExporter {
  /** Generate DATEV CSV from receipt data. */
  generateCsv(receipts: readonly ReceiptForExport[]): string;
  /** Map a receipt to a DATEV booking entry. */
  mapToBooking(receipt: ReceiptForExport): DatevBookingEntry;
 }
 // ---------------------------------------------------------------------------
 // Constants
 // ---------------------------------------------------------------------------
 /**
 * Map expense categories to SKR03 accounts.
 * TODO: Add SKR04 mapping support
 * TODO: Make configurable via user settings
 */
 const SKR03_ACCOUNT_MAP: Record<string, string> = {
  office_supplies: "4930",
  travel: "4660",
  food: "4650",
  telephone: "4920",
  postage: "4910",
  insurance: "4360",
  rent: "4210",
  advertising: "4600",
  software: "4964",
  hardware: "4980",
  consulting: "4950",
  training: "4945",
  vehicle: "4500",
  default: "4900",
 };
 /**
 * Map tax rates to DATEV tax codes (Steuerschluessel).
 */
 const TAX_CODE_MAP: Record<number, string> = {
  19: "9",   // 19% USt (standard)
  7: "8",    // 7% USt (reduced)
  0: "0",    // Tax-free
 };
 // ---------------------------------------------------------------------------
 // Implementation
 // ---------------------------------------------------------------------------
 /**
 * Create a DATEV-format exporter for receipt data.
 *
 * TODO: Implement DATEV header line with metadata (consultant, client, date range)
 * TODO: Add validation for account numbers against SKR03/SKR04
 * TODO: Support DATEV XML format (Buchungsdaten v5.0)
 */
 export function createDatevExporter(config: DatevConfig): DatevExporter {
  const {
    consultantNumber: _consultantNumber,
    clientNumber: _clientNumber,
    fiscalYearStart: _fiscalYearStart = 1,
    accountLength: _accountLength = 4,
  } = config;
  function mapToBooking(receipt: ReceiptForExport): DatevBookingEntry {
    const category = receipt.category ?? "default";
    const debitAccount =
      SKR03_ACCOUNT_MAP[category] ?? SKR03_ACCOUNT_MAP["default"];
    const taxRate = receipt.taxRate ?? 19;
    const taxCode = TAX_CODE_MAP[taxRate] ?? TAX_CODE_MAP[19];
    // Parse date to DD.MM format for DATEV
    const dateParts = receipt.date.split("-");
    const datevDate =
      dateParts.length === 3
        ? `${dateParts[2]}${dateParts[1]}`
        : receipt.date;
    return {
      amount: receipt.totalAmount,
      debitAccount,
      creditAccount: "1200", // Bank account (SKR03 default)
      taxCode,
      date: datevDate,
      description: receipt.vendor.slice(0, 60), // DATEV max 60 chars
      documentNumber: `PC-${receipt.documentId}`,
      costCenter: undefined,
    };
  }
  function generateCsv(receipts: readonly ReceiptForExport[]): string {
    const bookings = receipts.map(mapToBooking);
    // DATEV Buchungsstapel columns
    const rows = bookings.map((b) => [
      b.amount.toFixed(2).replace(".", ","), // Umsatz (amount with comma)
      "S",                                    // Soll/Haben (S = Soll/Debit)
      b.taxCode,                              // BU-Schluessel (tax code)
      b.debitAccount,                         // Gegenkonto (offset account)
      b.date,                                 // Belegdatum (document date)
      b.documentNumber,                       // Belegfeld 1 (document number)
      "",                                     // Belegfeld 2
      b.description,                          // Buchungstext (description)
      "",                                     // Umsatzsteuer-ID
      b.creditAccount,                        // Konto (account)
      b.costCenter ?? "",                     // Kostenstelle (cost center)
    ]);
    return stringify(rows, {
      delimiter: ";",
      quoted: true,
      record_delimiter: "\r\n",
    });
  }
  return { generateCsv, mapToBooking };
 }
--- a/src/receipt/extractor.ts
+++ b/src/receipt/extractor.ts
@ -0,0 +1,170 @@
 /**
 * Receipt data extraction using local LLM via Ollama.
 *
 * Extracts structured data from receipt documents: vendor, date, amounts,
 * tax breakdown, line items, and payment method. Uses the Paperless-ngx
 * OCR content and enriches it with LLM analysis.
 *
 * @example
 * ```ts
 * const extractor = createReceiptExtractor({ ollama, paperless });
 * const receipt = await extractor.extract(documentId);
 * console.log(receipt.vendor, receipt.totalAmount, receipt.taxAmount);
 * ```
 */
 import type { OllamaClient } from "../embeddings/ollama.js";
 import type { PaperlessClient } from "../paperless/client.js";
 // ---------------------------------------------------------------------------
 // Types
 // ---------------------------------------------------------------------------
 export interface ReceiptData {
  readonly documentId: number;
  readonly vendor: string;
  readonly vendorAddress: string | null;
  readonly vendorTaxId: string | null;
  readonly date: string;
  readonly currency: string;
  readonly subtotal: number | null;
  readonly taxRate: number | null;
  readonly taxAmount: number | null;
  readonly totalAmount: number;
  readonly paymentMethod: string | null;
  readonly lineItems: readonly LineItem[];
  readonly category: string | null;
  readonly confidence: number;
  readonly rawText: string;
 }
 export interface LineItem {
  readonly description: string;
  readonly quantity: number;
  readonly unitPrice: number;
  readonly totalPrice: number;
  readonly taxRate: number | null;
 }
 export interface ReceiptExtractorConfig {
  readonly ollama: OllamaClient;
  readonly paperless: PaperlessClient;
 }
 export interface ReceiptExtractor {
  /** Extract structured receipt data from a Paperless-ngx document. */
  extract(documentId: number): Promise<ReceiptData>;
  /** Batch-extract receipts from multiple documents. */
  extractBatch(documentIds: readonly number[]): Promise<readonly ReceiptData[]>;
 }
 // ---------------------------------------------------------------------------
 // Prompts
 // ---------------------------------------------------------------------------
 const EXTRACTION_SYSTEM_PROMPT = `You are a receipt data extraction assistant. Given the OCR text of a receipt, extract structured data in JSON format.
 Extract the following fields:
 - vendor: Company/store name
 - vendorAddress: Full address if visible
 - vendorTaxId: Tax ID / VAT number if visible (e.g., USt-IdNr, Steuernummer)
 - date: Date in ISO 8601 format (YYYY-MM-DD)
 - currency: ISO 4217 currency code (e.g., EUR, USD)
 - subtotal: Amount before tax (null if not distinguishable)
 - taxRate: Tax percentage as decimal (e.g., 19 for 19%)
 - taxAmount: Tax amount
 - totalAmount: Total amount including tax
 - paymentMethod: Payment method if visible (cash, card, etc.)
 - lineItems: Array of { description, quantity, unitPrice, totalPrice, taxRate }
 - category: Suggested expense category (office_supplies, travel, food, etc.)
 - confidence: Your confidence in the extraction (0.0 to 1.0)
 Respond ONLY with valid JSON. No explanation, no markdown.`;
 // ---------------------------------------------------------------------------
 // Implementation
 // ---------------------------------------------------------------------------
 /**
 * Create a receipt data extractor.
 *
 * TODO: Add support for image-based receipts (pass images to multimodal LLM)
 * TODO: Add receipt template matching for common vendors
 * TODO: Add currency conversion support
 */
 export function createReceiptExtractor(
  config: ReceiptExtractorConfig,
 ): ReceiptExtractor {
  const { ollama, paperless } = config;
  async function extractSingle(documentId: number): Promise<ReceiptData> {
    // Fetch the document content from Paperless-ngx
    const document = await paperless.getDocument(documentId);
    const ocrText = document.content;
    if (!ocrText || ocrText.trim().length === 0) {
      throw new Error(
        `Document ${documentId} has no OCR content. Ensure Paperless-ngx has processed the document.`,
      );
    }
    // Send to Ollama for structured extraction
    const prompt = `Extract receipt data from the following OCR text:\n\n---\n${ocrText}\n---`;
    const completion = await ollama.complete(prompt, EXTRACTION_SYSTEM_PROMPT);
    // Parse LLM response
    // TODO: Add robust JSON extraction (handle markdown code blocks, partial JSON)
    // TODO: Validate against Zod schema for type safety
    let parsed: Record<string, unknown>;
    try {
      parsed = JSON.parse(completion.text);
    } catch {
      throw new Error(
        `Failed to parse receipt extraction result for document ${documentId}. ` +
        `LLM response was not valid JSON.`,
      );
    }
    return {
      documentId,
      vendor: String(parsed.vendor ?? "Unknown"),
      vendorAddress: parsed.vendorAddress ? String(parsed.vendorAddress) : null,
      vendorTaxId: parsed.vendorTaxId ? String(parsed.vendorTaxId) : null,
      date: String(parsed.date ?? new Date().toISOString().split("T")[0]),
      currency: String(parsed.currency ?? "EUR"),
      subtotal: typeof parsed.subtotal === "number" ? parsed.subtotal : null,
      taxRate: typeof parsed.taxRate === "number" ? parsed.taxRate : null,
      taxAmount: typeof parsed.taxAmount === "number" ? parsed.taxAmount : null,
      totalAmount: typeof parsed.totalAmount === "number" ? parsed.totalAmount : 0,
      paymentMethod: parsed.paymentMethod ? String(parsed.paymentMethod) : null,
      lineItems: Array.isArray(parsed.lineItems)
        ? parsed.lineItems.map((item: Record<string, unknown>) => ({
            description: String(item.description ?? ""),
            quantity: Number(item.quantity ?? 1),
            unitPrice: Number(item.unitPrice ?? 0),
            totalPrice: Number(item.totalPrice ?? 0),
            taxRate: typeof item.taxRate === "number" ? item.taxRate : null,
          }))
        : [],
      category: parsed.category ? String(parsed.category) : null,
      confidence: typeof parsed.confidence === "number" ? parsed.confidence : 0.5,
      rawText: ocrText,
    };
  }
  return {
    extract: extractSingle,
    async extractBatch(documentIds) {
      // TODO: Add concurrency control (process N at a time)
      // TODO: Add progress reporting callback
      const results: ReceiptData[] = [];
      for (const id of documentIds) {
        const result = await extractSingle(id);
        results.push(result);
      }
      return results;
    },
  };
 }
--- a/src/receipt/matcher.ts
+++ b/src/receipt/matcher.ts
@ -0,0 +1,231 @@
 /**
 * Bank CSV transaction matching for receipts.
 *
 * Matches extracted receipt data against bank CSV exports to reconcile
 * transactions. Supports common German bank export formats (Sparkasse,
 * Volksbank, ING, DKB).
 *
 * @example
 * ```ts
 * const matcher = createTransactionMatcher();
 * const bankTxns = await matcher.parseBankCsv("./bank_export.csv");
 * const matches = matcher.matchReceipts(receipts, bankTxns);
 * ```
 */
 import { parse } from "csv-parse/sync";
 import { readFileSync } from "node:fs";
 // ---------------------------------------------------------------------------
 // Types
 // ---------------------------------------------------------------------------
 export interface BankTransaction {
  readonly date: string;
  readonly description: string;
  readonly amount: number;
  readonly currency: string;
  readonly iban: string | null;
  readonly bic: string | null;
  readonly reference: string | null;
  readonly rawLine: string;
 }
 export interface ReceiptMatchCandidate {
  readonly documentId: number;
  readonly vendor: string;
  readonly date: string;
  readonly totalAmount: number;
  readonly currency: string;
 }
 export interface MatchResult {
  readonly receipt: ReceiptMatchCandidate;
  readonly transaction: BankTransaction;
  readonly confidence: number;
  readonly matchReasons: readonly string[];
 }
 export interface UnmatchedItem {
  readonly type: "receipt" | "transaction";
  readonly item: ReceiptMatchCandidate | BankTransaction;
 }
 export interface MatchSummary {
  readonly matched: readonly MatchResult[];
  readonly unmatchedReceipts: readonly ReceiptMatchCandidate[];
  readonly unmatchedTransactions: readonly BankTransaction[];
  readonly matchRate: number;
 }
 export interface TransactionMatcher {
  /** Parse a bank CSV export file into structured transactions. */
  parseBankCsv(filePath: string, format?: BankCsvFormat): readonly BankTransaction[];
  /** Match receipts against bank transactions. */
  matchReceipts(
    receipts: readonly ReceiptMatchCandidate[],
    transactions: readonly BankTransaction[],
  ): MatchSummary;
 }
 export type BankCsvFormat = "auto" | "sparkasse" | "ing" | "dkb" | "volksbank" | "generic";
 // ---------------------------------------------------------------------------
 // Implementation
 // ---------------------------------------------------------------------------
 /**
 * Create a transaction matcher for bank CSV reconciliation.
 *
 * TODO: Add ML-based fuzzy matching for vendor names
 * TODO: Add support for MT940/CAMT.053 bank statement formats
 * TODO: Add date tolerance configuration (match within N days)
 */
 export function createTransactionMatcher(): TransactionMatcher {
  /**
   * Parse bank CSV with auto-detected or specified format.
   */
  function parseBankCsv(
    filePath: string,
    format: BankCsvFormat = "auto",
  ): readonly BankTransaction[] {
    const raw = readFileSync(filePath, "utf-8");
    // TODO: Implement format auto-detection based on header patterns
    // TODO: Add support for different CSV delimiters (semicolon for German exports)
    // TODO: Handle different date formats (DD.MM.YYYY, YYYY-MM-DD, MM/DD/YYYY)
    const _format = format; // Acknowledge format parameter for future use
    const records = parse(raw, {
      columns: true,
      skip_empty_lines: true,
      delimiter: ";",
      relaxColumnCount: true,
    }) as Record<string, string>[];
    return records.map((record): BankTransaction => {
      // Generic column mapping -- override per format
      // TODO: Implement format-specific column mappings
      return {
        date: record["Buchungstag"] ?? record["Date"] ?? record["Datum"] ?? "",
        description:
          record["Verwendungszweck"] ??
          record["Description"] ??
          record["Buchungstext"] ??
          "",
        amount: parseFloat(
          (record["Betrag"] ?? record["Amount"] ?? "0")
            .replace(/\./g, "")
            .replace(",", "."),
        ),
        currency: record["Waehrung"] ?? record["Currency"] ?? "EUR",
        iban: record["IBAN"] ?? null,
        bic: record["BIC"] ?? null,
        reference: record["Kundenreferenz"] ?? record["Reference"] ?? null,
        rawLine: JSON.stringify(record),
      };
    });
  }
  /**
   * Match receipts against bank transactions by amount and date proximity.
   */
  function matchReceipts(
    receipts: readonly ReceiptMatchCandidate[],
    transactions: readonly BankTransaction[],
  ): MatchSummary {
    const matched: MatchResult[] = [];
    const matchedReceiptIds = new Set<number>();
    const matchedTxnIndices = new Set<number>();
    // TODO: Implement smarter matching with vendor name fuzzy matching
    // TODO: Add configurable date tolerance window
    // TODO: Handle split transactions (one receipt, multiple bank entries)
    for (const receipt of receipts) {
      let bestMatch: { index: number; confidence: number; reasons: string[] } | null =
        null;
      for (let i = 0; i < transactions.length; i++) {
        if (matchedTxnIndices.has(i)) continue;
        const txn = transactions[i];
        const reasons: string[] = [];
        let confidence = 0;
        // Amount matching (exact or close)
        const amountDiff = Math.abs(Math.abs(txn.amount) - receipt.totalAmount);
        if (amountDiff < 0.01) {
          confidence += 0.5;
          reasons.push("exact_amount_match");
        } else if (amountDiff < 1.0) {
          confidence += 0.3;
          reasons.push("close_amount_match");
        }
        // Date matching
        const receiptDate = new Date(receipt.date).getTime();
        const txnDate = new Date(txn.date).getTime();
        const daysDiff = Math.abs(receiptDate - txnDate) / (1000 * 60 * 60 * 24);
        if (daysDiff < 1) {
          confidence += 0.3;
          reasons.push("same_day");
        } else if (daysDiff < 3) {
          confidence += 0.15;
          reasons.push("within_3_days");
        } else if (daysDiff < 7) {
          confidence += 0.05;
          reasons.push("within_7_days");
        }
        // Vendor name in description
        if (
          txn.description
            .toLowerCase()
            .includes(receipt.vendor.toLowerCase().slice(0, 8))
        ) {
          confidence += 0.2;
          reasons.push("vendor_in_description");
        }
        if (
          confidence > 0.5 &&
          (!bestMatch || confidence > bestMatch.confidence)
        ) {
          bestMatch = { index: i, confidence, reasons };
        }
      }
      if (bestMatch) {
        matched.push({
          receipt,
          transaction: transactions[bestMatch.index],
          confidence: bestMatch.confidence,
          matchReasons: bestMatch.reasons,
        });
        matchedReceiptIds.add(receipt.documentId);
        matchedTxnIndices.add(bestMatch.index);
      }
    }
    const unmatchedReceipts = receipts.filter(
      (r) => !matchedReceiptIds.has(r.documentId),
    );
    const unmatchedTransactions = transactions.filter(
      (_, i) => !matchedTxnIndices.has(i),
    );
    return {
      matched,
      unmatchedReceipts,
      unmatchedTransactions,
      matchRate:
        receipts.length > 0 ? matched.length / receipts.length : 0,
    };
  }
  return { parseBankCsv, matchReceipts };
 }
--- a/src/skill/SKILL.md
+++ b/src/skill/SKILL.md
@ -0,0 +1,72 @@
 # PaperCortex -- Document Intelligence Skill
 > A Claude Code skill for interacting with your Paperless-ngx document archive through AI-powered semantic search, classification, receipt extraction, and accounting export.
 ## Prerequisites
 - PaperCortex MCP Server running (see project README)
 - Paperless-ngx instance with API access
 - Ollama with `qwen2.5:14b` and `nomic-embed-text` models
 ## Available Tools
 ### papercortex_search
 Search documents by meaning, not just keywords.
 ```
 Search for: "office lease agreements from last year"
 Search for: "tax-relevant receipts over 500 EUR"
 Search for: "correspondence with insurance companies"
 ```
 ### papercortex_classify
 Auto-classify a document with AI-suggested tags, type, and correspondent.
 ```
 Classify document #1234
 Classify document #1234 and apply suggested tags
 ```
 ### papercortex_receipt
 Extract structured data from receipt documents.
 ```
 Extract receipt from document #5678
 ```
 Returns: vendor, date, amounts, tax breakdown, line items, category.
 ### papercortex_query
 Ask natural language questions about your document archive.
 ```
 "How much did I spend on office supplies in Q1 2024?"
 "Which invoices are still unpaid?"
 "Summarize all contracts expiring this year"
 ```
 ### papercortex_export
 Export receipt data for accounting software.
 ```
 Export documents #100, #101, #102 as DATEV CSV
 Export documents #200, #201 as generic CSV
 ```
 ## Workflow Examples
 ### Monthly Bookkeeping
 1. Search for all receipts from the current month
 2. Extract data from each receipt
 3. Export as DATEV CSV
 4. Import into accounting software
 ### Document Organization
 1. Find unclassified documents (no tags)
 2. Auto-classify each document
 3. Review and approve suggested tags
 ### Expense Analysis
 1. Query: "What were my top 5 expense categories last quarter?"
 2. Drill into specific categories with follow-up queries
 3. Export relevant receipts for documentation
--- a/tsconfig.json
+++ b/tsconfig.json
@ -0,0 +1,24 @@
 {
  "compilerOptions": {
    "target": "ES2022",
    "module": "ESNext",
    "moduleResolution": "bundler",
    "lib": ["ES2022"],
    "outDir": "./dist",
    "rootDir": "./src",
    "strict": true,
    "esModuleInterop": true,
    "skipLibCheck": true,
    "forceConsistentCasingInFileNames": true,
    "resolveJsonModule": true,
    "declaration": true,
    "declarationMap": true,
    "sourceMap": true,
    "noUnusedLocals": true,
    "noUnusedParameters": true,
    "noImplicitReturns": true,
    "noFallthroughCasesInSwitch": true
  },
  "include": ["src/**/*"],
  "exclude": ["node_modules", "dist", "**/*.test.ts"]
 }