sync: record live attack-path guidance fix

sync: record lane-specific runpod adoption versioning
sync: record magatamallm local training verification
2026-05-07 06:40:04 +02:00 · 2026-05-07 01:36:36 +02:00 · 2026-05-07 01:16:25 +02:00
4 changed files with 498 additions and 1 deletion
--- a/sync/CURRENT.md
+++ b/sync/CURRENT.md
@@ -1,6 +1,6 @@
 # Current TIP Sync State

-Updated: 2026-05-06 22:55 UTC
+Updated: 2026-05-07 02:58 UTC

 ## Active Policy

@@ -27,6 +27,163 @@ When work touches TIP, Magatama, LLM Gateway, bridges, auth, or shared Erik infr

 ## Latest Work

+- MAGATAMA live follow-up on 2026-05-07:
+  - local Mac training API was rechecked after the lane-specific automation changes.
+  - current live truth:
+    - LaunchAgent `org.fichtmueller.magatama-train-api` is present and running
+    - process listens on `*:3214`
+    - localhost health now responds when checked outside sandbox restrictions:
+      - `GET http://127.0.0.1:3214/health`
+      - response:
+        - `status = ok`
+        - `service = magatama-train-api`
+        - `running = false`
+        - `pid = null`
+        - `updated_at = 2026-05-07T04:14:23Z`
+      - interpretation:
+        - the training API itself is healthy and reachable
+        - it is currently idle, not broken
+        - the actual next proof point must come from a fresh lane run that writes lane-specific `*-last_run.json`
+  - live Attack Paths UI bug was fixed and deployed to Erik:
+    - root cause:
+      - the `Open Fix Guidance` button inside the attack-path side panel only triggered a dummy toast and never opened a real finding/ticket detail
+    - fix:
+      - `magatama/packages/dashboard/public/index-v2.html`
+      - new helper:
+        - `openFixGuidanceForNode(nodeId)`
+      - behavior:
+        - if the clicked graph node maps to a real finding ID, MAGATAMA now opens the existing ticket/finding detail drawer via `openTicket(id)`
+        - if the node is only a synthetic path node with no backing finding, MAGATAMA now shows an explicit warning instead of pretending to open guidance
+    - live deployment:
+      - updated `index-v2.html` was rsynced to:
+        - `/opt/magatama/packages/dashboard/public/index-v2.html`
+      - `pm2 restart magatama-dashboard` executed on Erik
+      - deployed file on Erik verified with:
+        - `openFixGuidanceForNode`
+        - `Open Fix Guidance`
+  - operator consequence:
+    - Attack Paths no longer contain a placebo “Open Fix Guidance” action
+    - clicking it should now open the actual MAGATAMA finding/ticket guidance path when the graph node represents a real finding
+
+- MAGATAMA training automation was hardened locally on 2026-05-07 for all three lanes:
+  - target lanes:
+    - `magatamallm`
+    - `fo_blogllm`
+    - `tip_llm`
+  - core root cause confirmed:
+    - RunPod dataset refresh / lane export already worked
+    - RunPod jobs often reached `COMPLETED`
+    - but model adoption/version truth still depended on a single shared:
+      - `~/magatama-llm/fine-tuning/last_run.json`
+    - this made lane status and successful return/adoption ambiguous across models
+    - the training modal could also collapse late stream/adoption failures into a generic `network error`
+  - local code fixes now in place:
+    - `magatama/packages/fine-tuner/training_api.py`
+      - lane-specific last-run files added:
+        - `~/magatama-llm/fine-tuning/magatamallm-last_run.json`
+        - `~/magatama-llm/fine-tuning/fo_blogllm-last_run.json`
+        - `~/magatama-llm/fine-tuning/tip_llm-last_run.json`
+      - legacy `last_run.json` remains only as backward-compatible mirror for `magatamallm`
+      - successful RunPod adoption now creates:
+        - a release alias per lane, e.g. `<active-alias>-rN`
+      - active alias switching sequence is now:
+        - candidate model imported
+        - smoke-tested
+        - release alias created
+        - stable active alias repointed to that release alias
+      - adoption report now includes:
+        - `version_counter`
+        - `release_alias`
+    - `magatama/packages/fine-tuner/train.py`
+      - local metrics writing now also respects lane-specific last-run files via `TRAINING_LANE`
+    - `magatama/packages/dashboard/src/server.ts`
+      - `/api/llm/status` now reads lane-specific last-run metadata first
+      - `release_alias` is preferred as visible model version when present
+      - RunPod SSE catch now distinguishes:
+        - real generic training failure
+        - `COMPLETED` but no artifact / failed adoption
+      - the latter is now rendered as a truthful return/adoption failure, not a vague dataset/network issue
+    - `magatama/packages/dashboard/public/index-v2.html`
+      - training modal now suppresses misleading late generic `network error` if the server already emitted a terminal training status
+      - if the stream ends without a final terminal server event, the UI now explicitly says the registry/adoption state must be checked
+      - if the backend reports:
+        - completed without artifact
+        - completed without HF model
+        - completed but adoption failed
+        the modal now shows that exact reason
+  - local verification:
+    - `python3 -m py_compile` passed for:
+      - `training_api.py`
+      - `train.py`
+    - dashboard build passed:
+      - `pnpm -C packages/dashboard build`
+  - current operational blocker:
+    - live deployment to Erik was **not yet completed in this step**
+    - direct SSH checks returned:
+      - `Connection refused`
+      - then `Operation timed out`
+    - because of that, the new lane-specific automation logic is locally ready, but not yet confirmed live on Erik for the currently running:
+      - `tip_llm`
+      - `fo_blogllm`
+  - practical consequence:
+    - the code path is now prepared for full automation:
+      - pull from lane-specific training pool
+      - train on RunPod
+      - verify artifact existence
+      - adopt locally
+      - create new release alias/version
+      - repoint stable active alias
+      - show truthful status in UI
+    - but the current live Erik run still needs redeploy + verification once SSH is reachable again
+
+- MAGATAMA local MagatamaLLM training state was re-verified on 2026-05-07:
+  - result:
+    - the lane export / dataset refresh worked
+    - a new locally adopted MagatamaLLM model did **not** land
+    - active MAGATAMA provider remains the older alias:
+      - `ollama:magatama-coder:latest`
+  - live/public evidence:
+    - `GET https://magatama.fichtmueller.org/api/llm/status`
+      - `activeProvider = ollama:magatama-coder:latest`
+      - `autoFixProvider = ollama:magatama-coder:latest`
+      - `training.lastTrainingAt = 2026-05-06T22:43:20Z`
+      - `training.modelVersion = magatama-coder:latest`
+      - `training.activeRun = null`
+    - this means the UI timestamp currently reflects the latest dataset/training-state update, not proof of a newly adopted local model.
+  - local Mac evidence:
+    - `ollama list` still shows:
+      - `magatama-coder:latest` → modified `3 weeks ago`
+      - `magatama-llm-v2-0:latest` → modified `11 days ago`
+    - no newer Magatama candidate/import alias appeared locally
+  - registry/adoption evidence:
+    - Erik lane manifest exists and is fresh:
+      - `/opt/magatama/training-data/runpod/magatamallm/manifest.json`
+      - `generatedAt = 2026-05-06T22:45:15.944Z`
+      - `train = 15679`
+      - `eval = 1743`
+      - `total = 17422`
+    - but Erik had no populated local adoption/registry state files in:
+      - `/opt/magatama/training-data/model-registry/models.json`
+      - `/opt/magatama/training-data/model-registry/runs.json`
+      - `/opt/magatama/training-data/model-registry/active.json`
+      - `/opt/magatama/data/llm-status.json`
+    - local repo only had historical `training-data/model-registry/training-runs.json`
+  - historical run evidence:
+    - recent `magatamallm` training-run records still show:
+      - `submitted`
+      - then `not_found_after_submit`
+      - or other non-adopted / worker-failure states
+    - there is still no verified “completed_and_adopted” proof for a new MagatamaLLM local model.
+  - operational conclusion:
+    - current truth:
+      - dataset/lane preparation works
+      - local model adoption is still the missing step
+      - MAGATAMA currently has **no** verified model knowledge beyond the already active `magatama-coder:latest` alias
+    - next fix block remains:
+      - make RunPod/local completion count only when adoption succeeds
+      - persist adoption report + model registry state
+      - update active alias and version only after smoke-tested import succeeds
+
 - MAGATAMA Switchblade port intelligence is now truly flowing end-to-end on 2026-05-06:
  - live root cause:
    - Switchblade itself already had the rich SG350 data (`description`, LLDP neighbor, peer port, octets), but MAGATAMA had still shown mostly flat port chips.
--- a/sync/history/2026-05-07-magatama-attack-path-fix-guidance-live-deploy.md
+++ b/sync/history/2026-05-07-magatama-attack-path-fix-guidance-live-deploy.md
@@ -0,0 +1,76 @@
+# MAGATAMA Attack-Path Fix Guidance Live Deploy
+
+Date: 2026-05-07 UTC
+
+## Scope
+
+- MAGATAMA attack-path side panel
+- local Mac training API reachability/truth check
+
+## Findings
+
+### 1. `Open Fix Guidance` was a placebo button
+
+The Attack Paths detail sidebar rendered a real CTA labeled `Open Fix Guidance`, but the click handler only executed:
+
+- `toast('Fix guidance opened','info')`
+
+No real drawer, ticket, or finding guidance path opened from that action.
+
+### 2. Local training API was not dead; it was just idle
+
+The local training API service for MAGATAMA lane automation is managed by:
+
+- `org.fichtmueller.magatama-train-api`
+
+Live checks showed:
+
+- LaunchAgent state: running
+- port listener on `*:3214`
+- health response on localhost when checked outside sandbox restrictions:
+  - `status = ok`
+  - `service = magatama-train-api`
+  - `running = false`
+  - `pid = null`
+
+Interpretation:
+
+- the API process is healthy and reachable
+- it is currently idle between runs
+- the remaining proof point for automation is a fresh lane training run that writes back lane-specific run metadata and completes local adoption/version switching
+
+## Fix Applied
+
+File:
+
+- `magatama/packages/dashboard/public/index-v2.html`
+
+Changes:
+
+- added `openFixGuidanceForNode(nodeId)`
+- `showNodeDetail(n)` now wires the CTA to the new helper instead of a toast
+- if the graph node maps to a real finding:
+  - MAGATAMA opens the existing finding/ticket detail via `openTicket(id)`
+- if the node is synthetic and has no backing finding:
+  - MAGATAMA now shows a clear warning toast instead of pretending guidance opened
+
+## Live Deployment
+
+Updated file copied to Erik:
+
+- `/opt/magatama/packages/dashboard/public/index-v2.html`
+
+Dashboard restarted:
+
+- `pm2 restart magatama-dashboard`
+
+Remote file verification confirmed presence of:
+
+- `openFixGuidanceForNode`
+- `Open Fix Guidance`
+
+## Operational Result
+
+- Attack Paths no longer expose a fake remediation CTA
+- the CTA now routes into the actual MAGATAMA guidance/detail path when the node represents a real finding
+- local training API health is confirmed, but lane-specific successful return/adoption still needs validation with a fresh real training run
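The "idle, not broken" reading of the health response can be sketched as a small classifier. This is a minimal sketch: the field names (`status`, `service`, `running`, `pid`) come from the observed response, while the function name `classify_train_api_health` and the labels `idle`, `training`, and `unhealthy` are assumptions made for illustration, not part of the MAGATAMA code.

```python
from typing import Any, Mapping


def classify_train_api_health(payload: Mapping[str, Any]) -> str:
    """Map a /health payload to 'idle', 'training', or 'unhealthy'.

    Hypothetical helper: the label names are illustrative assumptions.
    """
    # Anything other than status == "ok" is treated as an unhealthy service.
    if payload.get("status") != "ok":
        return "unhealthy"
    # A healthy service with a live run reports running = true and a pid.
    if payload.get("running") and payload.get("pid") is not None:
        return "training"
    # Healthy, reachable, and no run in progress.
    return "idle"


# Payload shape as recorded in the live check (updated_at omitted).
observed = {
    "status": "ok",
    "service": "magatama-train-api",
    "running": False,
    "pid": None,
}
print(classify_train_api_health(observed))
```

Keeping "idle" distinct from "unhealthy" matters here because the open question is adoption after a fresh lane run, not reachability of the API.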
--- a/sync/history/2026-05-07-magatama-lane-specific-runpod-adoption-versioning.md
+++ b/sync/history/2026-05-07-magatama-lane-specific-runpod-adoption-versioning.md
@@ -0,0 +1,170 @@
+# MAGATAMA Lane-Specific RunPod Adoption + Versioning
+
+Date: 2026-05-07
+
+## Scope
+
+Harden MAGATAMA training automation for:
+
+- `magatamallm`
+- `fo_blogllm`
+- `tip_llm`
+
+Goal:
+
+- lane-specific training pools remain isolated
+- RunPod `COMPLETED` counts only when model return/adoption is real
+- active lane model gets a new release/version marker after successful adoption
+- dashboard status and errors remain truthful
+
+## Problem
+
+The data/build side of training already worked:
+
+- lane-specific RunPod datasets were built
+- RunPod jobs were submitted
+- registry often showed `IN_PROGRESS` / `COMPLETED`
+
+But the end of the chain remained weak:
+
+1. adoption/version truth still depended on one shared:
+   - `~/magatama-llm/fine-tuning/last_run.json`
+2. multiple lanes could therefore overwrite the same success marker
+3. the modal could degrade late-stream adoption failures into a generic `network error`
+4. the user requirement was stricter:
+   - training pool -> RunPod -> artifact -> local import -> version bump -> active alias switch
+   - all fully automatic
+
+## Code changes made locally
+
+### 1. Lane-specific last-run metadata
+
+File:
+
+- `magatama/packages/fine-tuner/training_api.py`
+
+Added:
+
+- `lane_last_run_file(lane)`
+
+Resulting files:
+
+- `~/magatama-llm/fine-tuning/magatamallm-last_run.json`
+- `~/magatama-llm/fine-tuning/fo_blogllm-last_run.json`
+- `~/magatama-llm/fine-tuning/tip_llm-last_run.json`
+
+Compatibility:
+
+- `magatamallm` still mirrors to legacy:
+  - `~/magatama-llm/fine-tuning/last_run.json`
+
+### 2. Automatic release alias / version step
+
+File:
+
+- `magatama/packages/fine-tuner/training_api.py`
+
+Added:
+
+- `next_release_metadata(lane, active_model)`
+- release alias creation
+
+New adoption sequence:
+
+1. RunPod artifact imported to candidate model
+2. candidate smoke tests pass
+3. release alias is created:
+   - example shape: `<active-alias>-rN`
+4. stable active alias is repointed to that release alias
+
+This means the lane now receives a concrete new release/version marker after successful adoption.
+
+### 3. Dashboard lane status truth
+
+File:
+
+- `magatama/packages/dashboard/src/server.ts`
+
+Changed:
+
+- `/api/llm/status` now reads lane-specific last-run metadata first
+- `release_alias` is preferred as visible model version
+- this prevents one lane from falsely inheriting another lane's last successful run marker
+
+### 4. Truthful RunPod terminal failure messaging
+
+Files:
+
+- `magatama/packages/dashboard/src/server.ts`
+- `magatama/packages/dashboard/public/index-v2.html`
+
+Changed:
+
+- if RunPod says `COMPLETED` but:
+  - no model artifact exists
+  - no HF repo appears
+  - adoption fails
+
+the UI now reports that exact reason instead of collapsing into a vague generic failure
+
+Frontend hardening:
+
+- avoid showing a misleading late `network error` after the server already emitted a terminal training event
+- if the stream dies without a terminal event, the modal says so explicitly
+
+### 5. Local training metrics future-proofed
+
+File:
+
+- `magatama/packages/fine-tuner/train.py`
+
+Changed:
+
+- metrics now also respect lane-specific last-run files via `TRAINING_LANE`
+
+## Local verification
+
+Passed:
+
+- `python3 -m py_compile .../training_api.py .../train.py`
+- `pnpm -C .../packages/dashboard build`
+
+## Live deployment state
+
+Not yet completed in this step.
+
+Reason:
+
+- direct Erik access failed during this block:
+  - `ssh: connect to host 82.165.222.127 port 22: Connection refused`
+  - later also `Operation timed out`
+
+Therefore:
+
+- the automation fix is locally ready
+- but not yet verified live against the currently running:
+  - `tip_llm`
+  - `fo_blogllm`
+
+## Operational next step
+
+Once Erik SSH is reachable again:
+
+1. deploy updated:
+   - `training_api.py`
+   - `train.py`
+   - dashboard build / server bundle
+2. restart:
+   - `magatama-dashboard`
+   - Mac-side training API if used
+3. verify lane-specific status:
+   - `tip_llm`
+   - `fo_blogllm`
+   - `magatamallm`
+4. verify that a successful RunPod training now results in:
+   - artifact found
+   - adoption report present
+   - lane-specific `*-last_run.json`
+   - release alias incremented
+   - stable alias repointed
+
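The lane-specific file naming and the `<active-alias>-rN` release step can be illustrated as follows. This is a sketch under stated assumptions, not the code in `training_api.py`: the helper names `lane_last_run_path` and `next_release_alias` are hypothetical, and deriving `N` from the highest existing release alias is an assumed rule, since the notes only give the alias shape.

```python
import re
from pathlib import Path
from typing import List, Tuple

# Directory and lane names are taken from the notes above.
FINE_TUNING_DIR = Path.home() / "magatama-llm" / "fine-tuning"
LANES = ("magatamallm", "fo_blogllm", "tip_llm")


def lane_last_run_path(lane: str, base_dir: Path = FINE_TUNING_DIR) -> Path:
    """Return the per-lane last-run file, e.g. tip_llm-last_run.json."""
    if lane not in LANES:
        raise ValueError(f"unknown lane: {lane}")
    return base_dir / f"{lane}-last_run.json"


def next_release_alias(active_alias: str, existing: List[str]) -> Tuple[int, str]:
    """Compute the next (version_counter, release_alias) pair.

    Assumption: N is one more than the highest '<active-alias>-rN' seen so far.
    """
    pattern = re.compile(re.escape(active_alias) + r"-r(\d+)")
    counters = []
    for alias in existing:
        match = pattern.fullmatch(alias)
        if match:
            counters.append(int(match.group(1)))
    version_counter = max(counters, default=0) + 1
    return version_counter, f"{active_alias}-r{version_counter}"


print(lane_last_run_path("tip_llm"))
print(next_release_alias("example-alias", ["example-alias-r1", "example-alias-r2"]))
```

Because each lane resolves to its own file, a run in one lane cannot overwrite another lane's success marker, which is the ambiguity the shared `last_run.json` caused.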
--- a/sync/history/2026-05-07-magatamallm-local-training-verification.md
+++ b/sync/history/2026-05-07-magatamallm-local-training-verification.md
@@ -0,0 +1,94 @@
+# 2026-05-07 – MagatamaLLM Local Training Verification
+
+## Question
+
+Did the recent local / MAGATAMA-side MagatamaLLM training actually succeed and increase the active model’s knowledge?
+
+## Answer
+
+No. The dataset refresh succeeded, but a newer locally adopted MagatamaLLM model was **not** verified.
+
+## Evidence
+
+### 1. Public MAGATAMA status
+
+`GET https://magatama.fichtmueller.org/api/llm/status`
+
+Observed:
+- `activeProvider = ollama:magatama-coder:latest`
+- `autoFixProvider = ollama:magatama-coder:latest`
+- `training.lastTrainingAt = 2026-05-06T22:43:20Z`
+- `training.modelVersion = magatama-coder:latest`
+- `training.activeRun = null`
+
+Interpretation:
+- the dashboard timestamp reflects the latest dataset/training-state update
+- it does **not** prove that a new local model was imported and activated
+
+### 2. Local Ollama state on the Mac
+
+`ollama list`
+
+Relevant entries:
+- `magatama-coder:latest` → modified `3 weeks ago`
+- `magatama-llm-v2-0:latest` → modified `11 days ago`
+
+Interpretation:
+- no newly imported Magatama candidate/adopted model is visible locally
+- the active alias still points to an older model image
+
+### 3. Dataset/lane export did work
+
+Fresh Erik manifest exists:
+- `/opt/magatama/training-data/runpod/magatamallm/manifest.json`
+
+Observed:
+- `generatedAt = 2026-05-06T22:45:15.944Z`
+- `train = 15679`
+- `eval = 1743`
+- `total = 17422`
+
+Interpretation:
+- the lane export / pool sync is healthy
+- training input exists and was rebuilt
+
+### 4. Adoption/registry proof is missing
+
+On Erik, these expected local state files were absent:
+- `/opt/magatama/training-data/model-registry/models.json`
+- `/opt/magatama/training-data/model-registry/runs.json`
+- `/opt/magatama/training-data/model-registry/active.json`
+- `/opt/magatama/data/llm-status.json`
+
+Interpretation:
+- no trustworthy proof that a new model artifact was imported, registered, and activated
+
+### 5. Historical run records still show failed/non-adopted outcomes
+
+Local `training-data/model-registry/training-runs.json` still contains recent `magatamallm` runs such as:
+- `submitted`
+- `not_found_after_submit`
+
+There is still no verified “completed_and_adopted” proof for a new MagatamaLLM local model.
+
+## Conclusion
+
+Current state:
+- pool refresh works
+- lane export works
+- active alias/version switching after training is still not proven
+
+Therefore:
+- MagatamaLLM has **not** yet gained a verified newer local knowledge state from the recent run attempts
+- MAGATAMA is still operating on the older active alias `magatama-coder:latest`
+
+## Next Required Fix
+
+The remaining training-automation gap is still the enforcement of this sequence:
+
+1. run completes
+2. artifact existence is verified
+3. artifact is adopted/imported locally
+4. smoke tests pass
+5. active alias + model version are updated
+6. only then mark training as successful
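The "only then mark training as successful" rule amounts to a gate that stops at the first failing step. The sketch below assumes each step can be expressed as a boolean check. The names `AdoptionResult` and `run_adoption_gate` and the step labels are hypothetical and do not describe the existing implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence, Tuple

# A step is a label plus a zero-argument check that returns True on success.
Step = Tuple[str, Callable[[], bool]]


@dataclass
class AdoptionResult:
    successful: bool
    failed_step: Optional[str]
    completed_steps: List[str] = field(default_factory=list)


def run_adoption_gate(steps: Sequence[Step]) -> AdoptionResult:
    """Run the steps in order; success requires every step to pass."""
    completed: List[str] = []
    for name, check in steps:
        if not check():
            # Stop at the first failure so later steps never run.
            return AdoptionResult(False, name, completed)
        completed.append(name)
    return AdoptionResult(True, None, completed)


# Example: the run completed, but no artifact was found.
steps: List[Step] = [
    ("run_completed", lambda: True),
    ("artifact_exists", lambda: False),
    ("artifact_imported", lambda: True),
    ("smoke_tests_pass", lambda: True),
    ("alias_and_version_updated", lambda: True),
]
result = run_adoption_gate(steps)
print(result.successful, result.failed_step)
```

Returning the name of the failed step gives the dashboard an exact reason to display, rather than a generic failure.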
Author	SHA1	Message	Date
Rene Fichtmueller	01d0365fbf	sync: record live attack-path guidance fix	2026-05-07 06:40:04 +02:00
Rene Fichtmueller	61328b0607	sync: record lane-specific runpod adoption versioning	2026-05-07 01:36:36 +02:00
Rene Fichtmueller	a6278a5041	sync: record magatamallm local training verification	2026-05-07 01:16:25 +02:00