Compare commits


5 Commits

| Author | SHA1 | Message | Date |
| --- | --- | --- | --- |
| Rene Fichtmueller | 830ab57c3c | sync: record magatama ui cache runpod tooltip changelog fix | 2026-05-06 17:24:54 +02:00 |
| Rene Fichtmueller | 77a4aab592 | sync: record magatama training count source fix | 2026-05-06 16:27:14 +02:00 |
| Rene Fichtmueller | 9bc84a89ee | sync: record runpod serverless materialization check | 2026-05-06 13:07:26 +02:00 |
| Rene Fichtmueller | b5d9b4df03 | sync: record runpod status truthfulness hardening | 2026-05-06 12:18:17 +02:00 |
| Rene Fichtmueller | 364cd392c7 | sync: record magatama runpod attack-paths atlas exposure fixes | 2026-05-06 12:05:15 +02:00 |
6 changed files with 558 additions and 1 deletions


@ -1,6 +1,6 @@
# Current TIP Sync State
Updated: 2026-05-06 10:28 UTC
Updated: 2026-05-06 15:24 UTC
## Active Policy
@ -27,6 +27,119 @@ When work touches TIP, Magatama, LLM Gateway, bridges, auth, or shared Erik infr
## Latest Work
- MAGATAMA frontend/runtime consistency was repaired again on 2026-05-06:
- dashboard and core were rebuilt locally and redeployed to Erik.
- live processes restarted successfully:
- `magatama-dashboard`
- `magatama`
- public `api/llm/status` now shows the true lane-export totals for `magatamallm`:
- `collectedExamples = 15620`
- `effectiveExamples = 15620`
- `evalExamples = 1736`
- `totalExamples = 17356`
- `newSinceLastTraining = 15620`
- root cause for the stale `1097` display:
- the RunPod start SSE path still logged the legacy deduplicated `fixes.jsonl` corpus.
- this was changed so RunPod launches no longer present the legacy `1097` count as the active training truth.
- after dataset refresh the UI now emits the lane manifest totals instead.
- RunPod completion handling was hardened:
- worker `COMPLETED` is no longer trusted blindly.
- MAGATAMA now scans RunPod worker logs for real training failures (`Traceback`, `SyntaxError`, non-zero exit, etc.) before treating the run as successful.
- if the worker logs show a hidden failure, MAGATAMA records this as `completed_with_worker_failure` instead of pretending the run succeeded.
- the public findings state is currently empty:
- `GET /api/findings?limit=1` returned `{"findings":[],"total":0}`
- this is now rendered with an explicit empty-state row instead of a visually blank table.
- Attack Paths empty-state is now intentionally explicit rather than looking broken.
- Frontend cache and scope handling were hardened:
- cache version bumped to `2026-05-06b`
- stale legacy `magatama_api_cache:*` entries are cleared
- per-endpoint TTLs added
- invalid or empty scope selections are normalized instead of silently leaving the UI in misleading empty views
- Switchblade rack port hover was materially improved:
- port chips now carry `data-tooltip`
- custom tooltip CSS is live on Erik
- the old browser-native “question mark only” behavior should be replaced by a readable hover bubble
- Changelog self-healing was added in core:
- stale cached changelog data older than 6h now forces a rebuild from git history
- verified live via dashboard proxy on Erik:
- `generatedAt = 2026-05-06T15:18:42.708Z`
- latest visible entries include `2026-04-30` items again instead of appearing frozen at `30.05`
- MAGATAMA training + Attack Paths + Atlas exposure were corrected again on 2026-05-06:
- the RunPod serverless training start failure was not a RunPod outage.
- root cause was missing training scripts on Erik (`training_full_refresh.ts` and related helpers were absent under `/opt/magatama/scripts`).
- Codex synced the full local `magatama/scripts/` tree to Erik, added a safe fallback in `scripts/model_registry_build.ts`, and synced the local `training-data/model-registry/` directory.
- verified on Erik:
- `pnpm training:refresh-all` now succeeds.
- fresh dataset totals after dedupe:
- `magatamallm`: `92,742` raw → `17,356` effective (`15,620 train / 1,736 eval`)
- `fo_blogllm`: `32` total (`28 train / 4 eval`)
- `tip_llm`: `40` total (`36 train / 4 eval`)
- important nuance:
- Codex did **not** execute the final Hugging Face publish step from Erik in this chat.
- local/script/build failures are fixed; external dataset publish still depends on the selected dataset source and explicit publish intent.
- MAGATAMA Attack Paths UX is no longer a misleading blank panel:
- the page now distinguishes between:
- no live attack paths
- historical fallback paths
- empty selected scope (`0 assets in scope`)
- when a user narrows the scope to a rack/location with zero scoped assets, the graph explicitly says so instead of looking broken.
- live dashboard HTML on Erik now contains:
- `Im aktuellen Scope liegen 0 Assets.`
- `Erweitere Standort oder Datacenter / Rack, damit MAGATAMA korrelierbare Assets und Pfade darstellen kann.`
- `Ohne offene mehrstufige Korrelationen bleibt die Graph-Sicht bewusst leer.`
- MAGATAMA code/training hardening was extended:
- `scripts/test_runpod_adapter.py` no longer loads tokenizer/model with `trust_remote_code=True`.
- `scripts/ollama_adapter_bridge.py` no longer loads tokenizer/model with `trust_remote_code=True`.
- this removed the live CODE finding around `HuggingFace trust_remote_code` on Erik.
- Atlas exposure logic was tightened to stop reopening noisy LAN management findings:
- generic `atlas-exposure` findings now stay open only when the exposure is critical enough to deserve operational tracking.
- internal RFC1918 management/service ports discovered by the broad atlas scan are no longer promoted into open Guard findings just because they exist on the LAN.
- host-specific posture for Proxmox / Erik / Mac Studio remains the job of explicit host-audit logic.
- after rebuild + deploy + health sync:
- live Postgres open findings returned to `0`.
- Follow-up hardening on the same block:
- the earlier RunPod error path in MAGATAMA dashboard was made more truthful.
- dataset preparation now distinguishes:
- local `training:refresh-all` failure
- optional Hugging Face publish failure
- URL-based dataset mode with no external publish required
- the training SSE flow now explicitly tells the operator whether RunPod is using:
- Hugging Face dataset source
- or MAGATAMA URL-bundle dataset source
- this avoids misleading `RunPod not reachable` wording when the actual failure is in dataset preparation.
- follow-up serverless verification on 2026-05-06 narrowed the remaining fault further:
- MAGATAMA submit logic now verifies that a RunPod job really exists under `/status/{jobId}` instead of trusting `/run`.
- payloads were aligned more closely with the official Axolotl serverless schema:
- `model_type=AutoModelForCausalLM`
- `tokenizer_type=AutoTokenizer`
- dataset `split: train`
- optimizer `adamw_torch_fused`
- verified full run attempt:
- job id `9bc4b16b-755b-465b-aadf-b46f2fe467a3-e2`
- disappeared as `not_found_after_submit` (`404 job not found`)
- verified canary after payload fix:
- job id `a4ac6951-7ed7-43cb-80d8-5ab61533c2da-e2`
- immediately materialized as `IN_QUEUE`
- then still disappeared on later reconcile as `not_found_after_submit`
- current conclusion:
- the old MAGATAMA bug is fixed.
- the remaining problem is now likely on the RunPod endpoint/release side: jobs are accepted and briefly queued, but do not survive long enough to produce a durable serverless status lifecycle.
- operational rule:
- do not treat `submitted` or a brief `IN_QUEUE` as proof of a usable serverless training run.
- only trust the run once it reaches `IN_PROGRESS` or a durable terminal state with artifact evidence.
- follow-up training count fix on 2026-05-06 corrected the Training UI source-of-truth:
- MAGATAMA had still shown `1097` because the dashboard was counting the legacy deduplicated fix corpus instead of the current lane-specific RunPod export.
- dashboard now prefers `training-data/runpod/magatamallm/manifest.json` for the visible MagatamaLLM training count.
- synced current lane export to Erik and restarted `magatama-dashboard`.
- verified public API now returns:
- `collectedExamples = 1367`
- `effectiveExamples = 1367`
- `evalExamples = 152`
- `totalExamples = 1519`
- `newSinceLastTraining = 1367`
- if the browser still shows `1097`, treat it as stale cached UI and hard reload.
- MAGATAMA was repaired end-to-end to a clean operational baseline:
- live guard host-audits for Erik, Mac Studio, and Proxmox were corrected and rerun.
- open findings were reduced all the way to `0` in Postgres.


@ -0,0 +1,152 @@
# 2026-05-06 — MAGATAMA RunPod / Attack Paths / Atlas Exposure Fixes
## Scope
This handoff captures the follow-up fixes made after MAGATAMA had already been cleaned to zero findings earlier in the day. Three practical issues remained:
1. RunPod serverless training start was failing from MAGATAMA UI.
2. Attack Paths looked empty/broken to the operator.
3. Atlas exposure findings reopened as noisy internal LAN management alerts.
## What Was Actually Broken
### 1. RunPod training did not fail because of RunPod
User-facing message:
- `RunPod nicht erreichbar`
Real root cause on Erik:
- `/opt/magatama/package.json` already referenced `training:refresh-all` and `training:refresh-all:publish`
- but `/opt/magatama/scripts/training_full_refresh.ts` and related scripts were missing remotely
Additional follow-up break:
- `scripts/model_registry_build.ts` assumed `training-data/model-registry/external-sources.json` always existed remotely
### 2. Attack Paths page looked dead
The page was not broken, but it was misleading:
- selected system scope in the screenshot had `0 Assets in Scope`
- at the same time there were either:
- no multi-step correlated live paths, or
- no open correlated findings
Before the fix the empty canvas looked like a defect instead of an honest empty-state.
### 3. Atlas exposure reopened 28 Guard findings
Live breakdown before the final policy fix:
- `guard | atlas-exposure | high | 9`
- `guard | atlas-exposure | low | 19`
Examples:
- `Exposure: Open ports on 192.168.178.213`
- `Exposure: Open ports on 192.168.178.2`
- `Exposure: Open ports on 192.168.178.5`
These were not “internet exposed” incidents in the meaningful operational sense; they were generic LAN/internal management ports discovered by Atlas.
## Changes Made
### RunPod training pipeline
Synced to Erik:
- full local `/Users/renefichtmueller/Desktop/Claude Code/magatama/scripts/` tree into `/opt/magatama/scripts/`
- local `training-data/model-registry/` into `/opt/magatama/training-data/model-registry/`
Patched:
- `magatama/scripts/model_registry_build.ts`
Behavior change:
- missing external metadata files now fall back safely instead of crashing the refresh step
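The fallback behavior can be sketched as follows. This is a minimal Python illustration, not the actual TypeScript in `scripts/model_registry_build.ts`; the helper name `load_json_or_default` and the default shape `{"sources": []}` are assumptions made for the example.

```python
import json
from pathlib import Path
from typing import Any


def load_json_or_default(path: Path, default: Any) -> Any:
    """Return the parsed JSON at `path`, or `default` when the file is missing.

    Hypothetical helper: only the missing-file case falls back, so a
    corrupt file still fails loudly instead of being silently ignored.
    """
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except FileNotFoundError:
        return default


if __name__ == "__main__":
    # Assumed location and default shape, for illustration only.
    registry = Path("training-data/model-registry/external-sources.json")
    print(load_json_or_default(registry, {"sources": []}))
```

Limiting the fallback to the missing-file case is a design choice of this sketch: it keeps the refresh step alive on a fresh host without hiding a damaged metadata file.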
Verified on Erik:
- `pnpm training:refresh-all` now succeeds
Fresh effective dataset totals:
- `magatamallm`: `92,742 raw -> 17,356 effective`
- `fo_blogllm`: `32 total`
- `tip_llm`: `40 total`
Important note:
- Codex did **not** perform the final external Hugging Face publish step in this chat.
- Local refresh/build path is fixed.
### Attack Paths UI
Patched:
- `magatama/packages/core/src/routes/attack-paths.ts`
- `magatama/packages/dashboard/public/index-v2.html`
Behavior change:
- if no live paths exist, MAGATAMA can still show historical correlated paths when available
- if the user-selected scope contains `0` assets, the graph now says so explicitly
- if there are simply no open multi-step correlations, the page says that honestly
Live strings now present on Erik:
- `Im aktuellen Scope liegen 0 Assets.`
- `Erweitere Standort oder Datacenter / Rack, damit MAGATAMA korrelierbare Assets und Pfade darstellen kann.`
- `Ohne offene mehrstufige Korrelationen bleibt die Graph-Sicht bewusst leer.`
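The three cases above can be sketched as one decision function. This is a hedged Python illustration of the described behavior, not the code in `attack-paths.ts` or `index-v2.html`; the function name, the `mode` labels, and the order of the checks are assumptions. Only the two message strings are taken from the live HTML.

```python
from typing import Any, Dict, List

# Strings reported as present in the live dashboard HTML.
ZERO_ASSETS = "Im aktuellen Scope liegen 0 Assets."
NO_CORRELATIONS = (
    "Ohne offene mehrstufige Korrelationen bleibt die Graph-Sicht bewusst leer."
)


def attack_paths_view(
    scoped_assets: int, live_paths: List[Any], historical_paths: List[Any]
) -> Dict[str, Any]:
    """Pick what the Attack Paths page shows (hypothetical decision order)."""
    if scoped_assets == 0:
        # An empty scope is stated explicitly instead of drawing a blank canvas.
        return {"mode": "empty_scope", "paths": [], "message": ZERO_ASSETS}
    if live_paths:
        return {"mode": "live", "paths": live_paths, "message": None}
    if historical_paths:
        # No live paths: fall back to historical correlated paths when available.
        return {"mode": "historical", "paths": historical_paths, "message": None}
    return {"mode": "no_correlations", "paths": [], "message": NO_CORRELATIONS}
```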
### trust_remote_code hardening
Patched:
- `magatama/scripts/test_runpod_adapter.py`
- `magatama/scripts/ollama_adapter_bridge.py`
Behavior change:
- local adapter/tokenizer/model loading no longer uses `trust_remote_code=True`
Reason:
- this was causing a live MAGATAMA CODE finding on Erik:
- `HuggingFace trust_remote_code`
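A check of this kind can be approximated with a small source scan. The sketch below is an assumption about how one might verify the scripts, not the detector MAGATAMA actually uses; the function name is hypothetical.

```python
import re
from typing import List

# Matches `trust_remote_code=True` with optional spaces around `=`.
PATTERN = re.compile(r"trust_remote_code\s*=\s*True")


def find_trust_remote_code(source: str) -> List[int]:
    """Return the 1-based line numbers that enable trust_remote_code."""
    return [
        number
        for number, line in enumerate(source.splitlines(), start=1)
        if PATTERN.search(line)
    ]
```

Note that this line-based scan also flags commented-out occurrences, which is acceptable for a conservative check but would need an AST-based pass to avoid false positives.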
### Atlas exposure policy
Patched:
- `magatama/packages/core/src/routes/health-atlas.ts`
Behavior change:
- generic Atlas portscan findings on RFC1918/internal assets are no longer automatically promoted into open Guard findings unless the exposure is critical enough to deserve operational tracking
- host-audit remains the authoritative place for explicit posture on Erik / Proxmox / Mac Studio
This removed the noisy LAN exposure findings without simply faking closure; the policy itself was corrected.
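The policy can be sketched as follows. This is a Python illustration under stated assumptions, not the logic in `health-atlas.ts`: the function names are hypothetical, "critical enough" is modeled as a severity of exactly `critical`, and findings on non-RFC1918 addresses are assumed to stay open unchanged.

```python
import ipaddress

# The three private IPv4 ranges defined by RFC 1918.
RFC1918_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def is_rfc1918(ip: str) -> bool:
    """True when `ip` falls inside one of the RFC 1918 ranges."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in RFC1918_NETWORKS)


def stays_open(ip: str, severity: str) -> bool:
    """Decide whether a generic atlas-exposure finding remains open.

    Assumption: internal assets only keep a finding open at `critical`
    severity; everything outside RFC 1918 is left untouched.
    """
    if not is_rfc1918(ip):
        return True
    return severity == "critical"
```

Under these assumptions the `high` and `low` findings listed earlier for `192.168.178.x` hosts would no longer stay open, which matches the reported return to zero open findings.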
## Live Verification
After rebuild, deploy, restart, and health-triggered sync:
- `open findings = 0` in Postgres on Erik
- `scripts/test_runpod_adapter.py` on Erik no longer contains `trust_remote_code=True`
- dashboard empty-state strings for Attack Paths are present in the live HTML path
## Operational Meaning
- MAGATAMA is no longer reopening Guard noise for normal internal management ports discovered by the broad Atlas scan
- Attack Paths no longer looks “broken” when scope or data legitimately yields no graph
- RunPod dataset refresh/build is back to a working state on Erik
## TIP Policy Reminder
- TIPLLM only for robot/crawler planning
- Erik controller/light only
- heavy crawlers on Proxmox / Pis


@ -0,0 +1,65 @@
# 2026-05-06 — MAGATAMA RunPod serverless materialization check
## Summary
MAGATAMA's RunPod submit path was hardened and re-tested against the queue-based Axolotl serverless endpoint `dheii186pfcuq7`.
## What changed
- Payload alignment was tightened toward the official Axolotl serverless schema:
- added `model_type=AutoModelForCausalLM`
- added `tokenizer_type=AutoTokenizer`
- switched dataset split declaration to `split: train`
- switched optimizer from `adamw_8bit` to `adamw_torch_fused`
- Both submit paths now distinguish between:
- `/run` accepted
- `/status/{job}` actually exists
- Updated files:
- `magatama/packages/dashboard/src/server.ts`
- `magatama/scripts/submit_runpod_training.ts`
## Verified behavior
### Full run attempt
- Submitted `magatamallm` 500-step run.
- Returned job id: `9bc4b16b-755b-465b-aadf-b46f2fe467a3-e2`
- Reconcile result shortly after:
- `not_found_after_submit`
- HTTP `404`
- `job not found`
### Canary run after payload/schema fix
- Submitted `magatamallm` seed-only canary.
- Returned job id: `a4ac6951-7ed7-43cb-80d8-5ab61533c2da-e2`
- Immediate submit-side verification saw real queue materialization:
- `runpod_status: IN_QUEUE`
- Reconcile roughly 45 seconds later still observed:
- `not_found_after_submit`
- HTTP `404`
- `job not found`
## Conclusion
The old MAGATAMA bug (blindly trusting `/run`) is fixed.
The remaining problem is now narrower and likely external to MAGATAMA itself:
- RunPod serverless currently accepts the submit and briefly materializes the job as `IN_QUEUE`,
- but the job disappears before a durable status/progress/completion lifecycle can be observed.
This means the endpoint/release is still not trustworthy enough for a full production training launch until it can keep a job alive beyond the initial queue stage.
## Operational rule
Do **not** treat `submitted` or even a brief `IN_QUEUE` as proof of a usable serverless training run.
A MAGATAMA serverless training run is only trustworthy when at least one of these is true:
- status progresses to `IN_PROGRESS`, or
- a durable terminal state is observed with artifact evidence.
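The rule can be expressed as a small classifier over the status observations collected for one job. This is a hedged Python sketch, not the code in `server.ts` or `submit_runpod_training.ts`; the function name, the observation format, the `trusted` and `unverified` labels, and the set of terminal states are assumptions. `not_found_after_submit` is the label used in the results above.

```python
from typing import List, Optional, Tuple

# Assumed terminal job states for the serverless endpoint.
TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT"}

# One observation: (HTTP status code of the /status call, job status or None).
Observation = Tuple[int, Optional[str]]


def assess_run(observations: List[Observation], has_artifacts: bool = False) -> str:
    """Classify a run from its /status/{job} observations, oldest first."""
    statuses = [status for code, status in observations if code == 200 and status]
    if "IN_PROGRESS" in statuses:
        return "trusted"
    if has_artifacts and any(status in TERMINAL_STATES for status in statuses):
        return "trusted"
    if observations and observations[-1][0] == 404:
        # The job was accepted but can no longer be found.
        return "not_found_after_submit"
    # `submitted` or a brief IN_QUEUE alone proves nothing.
    return "unverified"
```

Applied to the canary above, `[(200, "IN_QUEUE"), (404, None)]` yields `not_found_after_submit`, and a lone `IN_QUEUE` observation stays `unverified`.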
## Open next step
- Inspect the actual RunPod serverless endpoint/release configuration and worker-side logs in RunPod UI.
- Only launch the full MagatamaLLM run after a canary survives beyond queue materialization.


@ -0,0 +1,50 @@
# 2026-05-06 — MAGATAMA RunPod Status Truthfulness
## Why this was needed
After the script/registry repair, MAGATAMA could refresh the local RunPod datasets again, but the operator-facing status flow was still too coarse:
- failures in local dataset preparation
- failures in optional Hugging Face publish
- and actual RunPod availability
were too easy to confuse.
This produced the impression that “RunPod is broken” even when the real problem was just dataset preparation on Erik.
## Changes
Patched:
- `magatama/packages/dashboard/src/server.ts`
Behavior now:
- dataset source is normalized to either:
- `huggingface`
- `url`
- local dataset refresh (`training:refresh-all`) is wrapped with a dedicated error:
- `Dataset-Refresh fehlgeschlagen: ...`
- Hugging Face publish is wrapped with a dedicated error:
- `HuggingFace-Publish fehlgeschlagen: ...`
- if Hugging Face mode is selected but `HF_TOKEN` is missing, this is reported directly
- after successful preparation, the SSE stream now explicitly states:
- Hugging Face dataset source in use
- or URL-bundle dataset source in use, with no external publish required
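The preparation flow described above can be sketched as follows. This is a Python illustration, not the TypeScript in `server.ts`: the function and class names, the `hf` alias, the default of `url` for unknown values, the order of the checks, and the `HF_TOKEN` message wording are assumptions. The two German error prefixes are the ones listed above.

```python
from typing import Callable, Optional


class PreparationError(RuntimeError):
    """Raised with a step-specific message when dataset preparation fails."""


def normalize_dataset_source(raw: Optional[str]) -> str:
    """Map any input to `huggingface` or `url` (assumed default: `url`)."""
    value = (raw or "").strip().lower()
    return "huggingface" if value in {"huggingface", "hf"} else "url"


def prepare_dataset(
    source: str,
    refresh: Callable[[], None],
    publish: Optional[Callable[[], None]] = None,
    hf_token: Optional[str] = None,
) -> str:
    """Run refresh and optional publish, naming the step that failed."""
    if source == "huggingface" and not hf_token:
        raise PreparationError("HF_TOKEN is missing")  # hypothetical wording
    try:
        refresh()
    except Exception as exc:
        raise PreparationError(f"Dataset-Refresh fehlgeschlagen: {exc}") from exc
    if source == "huggingface":
        if publish is not None:  # publish only on explicit intent
            try:
                publish()
            except Exception as exc:
                raise PreparationError(
                    f"HuggingFace-Publish fehlgeschlagen: {exc}"
                ) from exc
        return "Hugging Face dataset source in use"
    return "URL-bundle dataset source in use, no external publish required"
```

Keeping `publish` optional mirrors the note that this change does not itself force a Hugging Face publish.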
## Live effect
The dashboard process was rebuilt and restarted on Erik after this change.
Result:
- RunPod preparation status is more honest
- operators can distinguish:
- data refresh problem
- optional external publish problem
- actual RunPod training job submission/polling problem
## Notes
- This does not itself force a Hugging Face publish.
- It only makes the control plane truthful about what step is happening and what actually failed.


@ -0,0 +1,40 @@
# 2026-05-06 — MAGATAMA training count source fix
## Summary
MAGATAMA training UI was still showing `1097` because the dashboard counted the legacy deduplicated fix corpus instead of the current lane-specific RunPod export.
## Root cause
- Dashboard training summary read `getTrainingCorpusStats()` from `gitea-learning-pool/magatamallm/fixes.jsonl`.
- The live state on Erik still had a very large raw `fixes.jsonl` and the old dedupe-derived effective count path.
- The actual current training source for RunPod is the lane export under:
- `training-data/runpod/magatamallm/magatamallm-sft-train.jsonl`
- `training-data/runpod/magatamallm/magatamallm-sft-eval.jsonl`
- `training-data/runpod/magatamallm/manifest.json`
## Fix
- `packages/dashboard/src/server.ts` now prefers the lane manifest for `magatamallm` training counts.
- Live summary now uses:
- `train = 1367`
- `eval = 152`
- `totalAfterDedupe = 1519`
- `duplicatesRemoved = 1368`
- Synced the current local `training-data/runpod/magatamallm/` directory to Erik.
- Restarted `magatama-dashboard`.
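The "prefer the lane manifest" logic can be sketched like this. It is a Python illustration, not the dashboard's TypeScript; the function name and the manifest field names (`train`, `eval`, `totalAfterDedupe`) are assumed from the summary values above, and the mapping onto the API fields is inferred from the verified output below.

```python
import json
from pathlib import Path
from typing import Any, Callable, Dict


def training_counts(
    manifest_path: Path, legacy_count: Callable[[], int]
) -> Dict[str, Any]:
    """Prefer the lane manifest; fall back to the legacy corpus count."""
    try:
        manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        # No usable manifest: report the legacy deduplicated count instead.
        count = legacy_count()
        return {
            "collectedExamples": count,
            "effectiveExamples": count,
            "evalExamples": 0,
            "totalExamples": count,
            "collectionsPath": "legacy",
        }
    train = int(manifest["train"])
    evaluation = int(manifest["eval"])
    return {
        "collectedExamples": train,
        "effectiveExamples": train,
        "evalExamples": evaluation,
        "totalExamples": int(manifest.get("totalAfterDedupe", train + evaluation)),
        "collectionsPath": str(manifest_path),
    }
```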
## Verified live
Public API now returns:
- `training.collectedExamples = 1367`
- `training.effectiveExamples = 1367`
- `training.evalExamples = 152`
- `training.totalExamples = 1519`
- `training.newSinceLastTraining = 1367`
- `training.collectionsPath = /opt/magatama/training-data/runpod/magatamallm/manifest.json`
## Operator note
If the UI still shows `1097`, it is a browser cache/stale page issue. Hard reload the MAGATAMA dashboard.


@ -0,0 +1,137 @@
# MAGATAMA UI / Cache / RunPod / Tooltip / Changelog Fix
Date: 2026-05-06
Author: Codex
## Scope
Addressed the current MAGATAMA operator complaints in one block:
- training UI still showed `1097`
- findings page looked blank
- attack paths looked empty/broken
- Switchblade port hover only showed a help cursor / question mark
- changelog looked stale
## What Was Fixed
### 1. Training truth source
`magatamallm` RunPod launches still logged the legacy deduplicated `fixes.jsonl` count (`1097`) during SSE startup.
This was corrected so RunPod launches now:
- still dedupe the legacy fix corpus where needed
- but no longer present that count as the operator-facing training truth
- instead emit the lane-specific RunPod manifest totals after dataset refresh
Live verified via public MAGATAMA API:
- `collectedExamples = 15620`
- `effectiveExamples = 15620`
- `evalExamples = 1736`
- `totalExamples = 17356`
- `newSinceLastTraining = 15620`
### 2. RunPod completion truthfulness
RunPod worker jobs could return `COMPLETED` even though the logs contained real training failures.
MAGATAMA now inspects worker logs for markers such as:
- `Traceback`
- `SyntaxError`
- non-zero exit status
- explicit train/fine-tune failure text
If such evidence exists, the run is recorded as `completed_with_worker_failure` instead of being treated as a clean success.
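A log inspection of this kind can be sketched as follows. This is a hedged Python illustration, not the implementation in `server.ts`; the function name, the exact regular expressions, and the lowercase return labels other than `completed_with_worker_failure` are assumptions, and the explicit train/fine-tune failure text is left out because its wording is not given here.

```python
import re

# Assumed failure markers, modeled on the list above.
FAILURE_PATTERNS = (
    re.compile(r"Traceback"),
    re.compile(r"SyntaxError"),
    # "exit code 1", "exit status: 2", ... but not "exit code 0".
    re.compile(r"exit (?:code|status)\D{0,3}[1-9]\d*", re.IGNORECASE),
)


def classify_run(runpod_status: str, worker_log: str) -> str:
    """Downgrade a COMPLETED job when its log shows a real failure."""
    if runpod_status != "COMPLETED":
        return runpod_status.lower()
    if any(pattern.search(worker_log) for pattern in FAILURE_PATTERNS):
        return "completed_with_worker_failure"
    return "completed"
```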
### 3. Findings page no longer looks broken when empty
The live findings API currently returns:
- `findings = []`
- `total = 0`
The UI now renders an explicit empty-state row when there are no open findings or when filters hide everything, instead of leaving the table visually blank.
### 4. Attack Paths empty-state clarified
Attack Paths previously looked broken when the selected scope had zero assets.
The UI now explicitly states:
- the current scope has `0 assets`
- operators should widen location/datacenter/rack scope
- the graph stays intentionally empty when no correlated multi-step paths exist
### 5. Frontend cache + scope hardening
Frontend cache handling was improved:
- cache version bumped to `2026-05-06b`
- stale legacy `magatama_api_cache:*` entries are cleared
- per-endpoint TTLs were introduced
- invalid scope selections are normalized
- empty scoped selections reset rather than silently trapping the UI in misleading empty views
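The cache changes can be modeled with a plain dictionary standing in for browser storage. This is a Python sketch of the idea only; the real code is frontend JavaScript in `index-v2.html`, and the key layout, TTL values, and function names here are assumptions. Only the version string and the `magatama_api_cache:` prefix come from the notes above.

```python
from typing import Any, Dict, Optional

CACHE_VERSION = "2026-05-06b"
PREFIX = "magatama_api_cache:"
# Hypothetical per-endpoint TTLs in seconds.
TTL_SECONDS = {"/api/findings": 30, "/api/llm/status": 60}
DEFAULT_TTL = 120


def cache_key(endpoint: str) -> str:
    """Build a versioned key so older entries never match (assumed layout)."""
    return f"{PREFIX}{CACHE_VERSION}:{endpoint}"


def purge_stale(store: Dict[str, Any]) -> None:
    """Drop cache entries written under any other cache version."""
    current = f"{PREFIX}{CACHE_VERSION}:"
    for key in list(store):
        if key.startswith(PREFIX) and not key.startswith(current):
            del store[key]


def put_cached(store: Dict[str, Any], endpoint: str, data: Any, now: float) -> None:
    store[cache_key(endpoint)] = {"storedAt": now, "data": data}


def get_cached(store: Dict[str, Any], endpoint: str, now: float) -> Optional[Any]:
    """Return cached data unless the endpoint's TTL has expired."""
    entry = store.get(cache_key(endpoint))
    if entry is None:
        return None
    if now - entry["storedAt"] > TTL_SECONDS.get(endpoint, DEFAULT_TTL):
        return None
    return entry["data"]
```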
### 6. Switchblade port hover improved
The old port chips relied only on browser-native `title` behavior.
Now:
- port chips carry `data-tooltip`
- custom tooltip CSS is shipped live
- usage/state text should appear as a real hover bubble
Live Erik file check confirmed:
- `data-tooltip` markers present
- tooltip CSS present
### 7. Changelog self-healing
The public changelog cache in MAGATAMA core previously returned cached data indefinitely, as long as the cached payload was structurally valid.
Now:
- cached changelog older than 6 hours triggers a rebuild from git history
Live verified on Erik through dashboard proxy:
- `generatedAt = 2026-05-06T15:18:42.708Z`
- latest entries include fresh `2026-04-30` material again
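The staleness rule can be sketched as a single check on the cached `generatedAt` value. This is a Python illustration, not the TypeScript in `changelog.ts`; the function name and the choice to rebuild on a missing or unparseable timestamp are assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, Optional

MAX_AGE = timedelta(hours=6)


def needs_rebuild(cache: Optional[Dict[str, Any]], now: datetime) -> bool:
    """True when the cached changelog is missing, unreadable, or older than 6h."""
    generated_at = cache.get("generatedAt") if cache else None
    if not generated_at:
        return True
    try:
        # Accept the trailing "Z" used by JavaScript's toISOString().
        timestamp = datetime.fromisoformat(str(generated_at).replace("Z", "+00:00"))
    except ValueError:
        return True
    if timestamp.tzinfo is None:
        timestamp = timestamp.replace(tzinfo=timezone.utc)
    return now - timestamp > MAX_AGE
```

Here `now` is expected to be timezone-aware, for example `datetime.now(timezone.utc)`.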
## Files Touched In MAGATAMA
- `packages/dashboard/public/index-v2.html`
- `packages/dashboard/src/server.ts`
- `packages/core/src/routes/changelog.ts`
## Deployment Status
Built locally and redeployed to Erik:
- dashboard dist synced
- core dist synced
- `index-v2.html` synced
- PM2 restarted:
- `magatama-dashboard`
- `magatama`
## Important Live Evidence
- public `api/llm/status` shows lane-export counts, not `1097`
- public `api/findings?limit=1` returns empty findings cleanly
- Erik live dashboard file contains:
- `API_CACHE_VERSION = '2026-05-06b'`
- `data-tooltip`
- `Im aktuellen Scope liegen 0 Assets.`
- `Klicken für Details`
## Open Truths
- current live findings are genuinely `0`; this is not a hidden frontend-only failure
- Attack Paths can still be empty if there are truly no scoped assets or no correlated attack stories
- RunPod serverless still needs endpoint-side reliability; the MAGATAMA-side truthfulness improvements do not by themselves fix a broken RunPod release/worker pipeline