sync: record live attack-path guidance fix

sync: record lane-specific runpod adoption versioning
sync: record magatamallm local training verification
2026-05-07 06:40:04 +02:00 · 2026-05-07 01:36:36 +02:00 · 2026-05-07 01:16:25 +02:00
4 changed files with 498 additions and 1 deletion
--- a/sync/CURRENT.md
+++ b/sync/CURRENT.md
@@ -1,6 +1,6 @@
 # Current TIP Sync State
-Updated: 2026-05-06 22:55 UTC
+Updated: 2026-05-07 02:58 UTC
 ## Active Policy
@@ -27,6 +27,163 @@ When work touches TIP, Magatama, LLM Gateway, bridges, auth, or shared Erik infr
 ## Latest Work
 - MAGATAMA live follow-up on 2026-05-07:
  - local Mac training API was rechecked after the lane-specific automation changes.
  - current live truth:
    - LaunchAgent `org.fichtmueller.magatama-train-api` is present and running
    - process listens on `*:3214`
    - localhost health now responds when checked outside sandbox restrictions:
      - `GET http://127.0.0.1:3214/health`
      - response:
        - `status = ok`
        - `service = magatama-train-api`
        - `running = false`
        - `pid = null`
        - `updated_at = 2026-05-07T04:14:23Z`
      - interpretation:
        - the training API itself is healthy and reachable
        - it is currently idle, not broken
        - the actual next proof point must come from a fresh lane run that writes lane-specific `*-last_run.json`
  - live Attack Paths UI bug was fixed and deployed to Erik:
    - root cause:
      - the `Open Fix Guidance` button inside the attack-path side panel only triggered a dummy toast and never opened a real finding/ticket detail
    - fix:
      - `magatama/packages/dashboard/public/index-v2.html`
      - new helper:
        - `openFixGuidanceForNode(nodeId)`
      - behavior:
        - if the clicked graph node maps to a real finding ID, MAGATAMA now opens the existing ticket/finding detail drawer via `openTicket(id)`
        - if the node is only a synthetic path node with no backing finding, MAGATAMA now shows an explicit warning instead of pretending to open guidance
    - live deployment:
      - updated `index-v2.html` was rsynced to:
        - `/opt/magatama/packages/dashboard/public/index-v2.html`
      - `pm2 restart magatama-dashboard` executed on Erik
      - deployed file on Erik verified with:
        - `openFixGuidanceForNode`
        - `Open Fix Guidance`
  - operator consequence:
    - Attack Paths no longer contain a placebo “Open Fix Guidance” action
    - clicking it should now open the actual MAGATAMA finding/ticket guidance path when the graph node represents a real finding
 - MAGATAMA training automation was hardened locally on 2026-05-07 for all three lanes:
  - target lanes:
    - `magatamallm`
    - `fo_blogllm`
    - `tip_llm`
  - core root cause confirmed:
    - RunPod dataset refresh / lane export already worked
    - RunPod jobs often reached `COMPLETED`
    - but model adoption/version truth still depended on a single shared:
      - `~/magatama-llm/fine-tuning/last_run.json`
    - this made lane status and successful return/adoption ambiguous across models
    - the training modal could also collapse late stream/adoption failures into a generic `network error`
  - local code fixes now in place:
    - `magatama/packages/fine-tuner/training_api.py`
      - lane-specific last-run files added:
        - `~/magatama-llm/fine-tuning/magatamallm-last_run.json`
        - `~/magatama-llm/fine-tuning/fo_blogllm-last_run.json`
        - `~/magatama-llm/fine-tuning/tip_llm-last_run.json`
      - legacy `last_run.json` remains only as backward-compatible mirror for `magatamallm`
      - successful RunPod adoption now creates:
        - a release alias per lane, e.g. `<active-alias>-rN`
      - active alias switching sequence is now:
        - candidate model imported
        - smoke-tested
        - release alias created
        - stable active alias repointed to that release alias
      - adoption report now includes:
        - `version_counter`
        - `release_alias`
    - `magatama/packages/fine-tuner/train.py`
      - local metrics writing now also respects lane-specific last-run files via `TRAINING_LANE`
    - `magatama/packages/dashboard/src/server.ts`
      - `/api/llm/status` now reads lane-specific last-run metadata first
      - `release_alias` is preferred as visible model version when present
      - RunPod SSE catch now distinguishes:
        - real generic training failure
        - `COMPLETED` but no artifact / failed adoption
      - the latter is now rendered as a truthful return/adoption failure, not a vague dataset/network issue
    - `magatama/packages/dashboard/public/index-v2.html`
      - training modal now suppresses misleading late generic `network error` if the server already emitted a terminal training status
      - if the stream ends without a final terminal server event, the UI now explicitly says the registry/adoption state must be checked
      - if the backend reports:
        - completed without artifact
        - completed without HF model
        - completed but adoption failed
        the modal now shows that exact reason
  - local verification:
    - `python3 -m py_compile` passed for:
      - `training_api.py`
      - `train.py`
    - dashboard build passed:
      - `pnpm -C packages/dashboard build`
  - current operational blocker:
    - live deployment to Erik was **not yet completed in this step**
    - direct SSH checks returned:
      - `Connection refused`
      - then `Operation timed out`
    - because of that, the new lane-specific automation logic is locally ready, but not yet confirmed live on Erik for the currently running:
      - `tip_llm`
      - `fo_blogllm`
  - practical consequence:
    - the code path is now prepared for full automation:
      - pull from lane-specific training pool
      - train on RunPod
      - verify artifact existence
      - adopt locally
      - create new release alias/version
      - repoint stable active alias
      - show truthful status in UI
    - but the current live Erik run still needs redeploy + verification once SSH is reachable again
 - MAGATAMA local MagatamaLLM training state was re-verified on 2026-05-07:
  - result:
    - the lane export / dataset refresh worked
    - a new locally adopted MagatamaLLM model did **not** land
    - active MAGATAMA provider remains the older alias:
      - `ollama:magatama-coder:latest`
  - live/public evidence:
    - `GET https://magatama.fichtmueller.org/api/llm/status`
      - `activeProvider = ollama:magatama-coder:latest`
      - `autoFixProvider = ollama:magatama-coder:latest`
      - `training.lastTrainingAt = 2026-05-06T22:43:20Z`
      - `training.modelVersion = magatama-coder:latest`
      - `training.activeRun = null`
    - this means the UI timestamp currently reflects the latest dataset/training-state update, not proof of a newly adopted local model.
  - local Mac evidence:
    - `ollama list` still shows:
      - `magatama-coder:latest` → modified `3 weeks ago`
      - `magatama-llm-v2-0:latest` → modified `11 days ago`
    - no newer Magatama candidate/import alias appeared locally
  - registry/adoption evidence:
    - Erik lane manifest exists and is fresh:
      - `/opt/magatama/training-data/runpod/magatamallm/manifest.json`
      - `generatedAt = 2026-05-06T22:45:15.944Z`
      - `train = 15679`
      - `eval = 1743`
      - `total = 17422`
    - but Erik had no populated local adoption/registry state files in:
      - `/opt/magatama/training-data/model-registry/models.json`
      - `/opt/magatama/training-data/model-registry/runs.json`
      - `/opt/magatama/training-data/model-registry/active.json`
      - `/opt/magatama/data/llm-status.json`
    - local repo only had historical `training-data/model-registry/training-runs.json`
  - historical run evidence:
    - recent `magatamallm` training-run records still show:
      - `submitted`
      - then `not_found_after_submit`
      - or other non-adopted / worker-failure states
    - there is still no verified “completed_and_adopted” proof for a new MagatamaLLM local model.
  - operational conclusion:
    - current truth:
      - dataset/lane preparation works
      - local model adoption is still the missing step
      - MAGATAMA is **not** currently running anything newer than the already active `magatama-coder:latest` alias
    - next fix block remains:
      - make RunPod/local completion count only when adoption succeeds
      - persist adoption report + model registry state
      - update active alias and version only after smoke-tested import succeeds
 - MAGATAMA Switchblade port intelligence was confirmed flowing end-to-end on 2026-05-06:
  - live root cause:
    - Switchblade itself already had the rich SG350 data (`description`, LLDP neighbor, peer port, octets), but MAGATAMA still showed mostly flat port chips.
--- a/sync/history/2026-05-07-magatama-attack-path-fix-guidance-live-deploy.md
+++ b/sync/history/2026-05-07-magatama-attack-path-fix-guidance-live-deploy.md
@@ -0,0 +1,76 @@
 # MAGATAMA Attack-Path Fix Guidance Live Deploy
 Date: 2026-05-07 UTC
 ## Scope
 - MAGATAMA attack-path side panel
 - local Mac training API reachability/truth check
 ## Findings
 ### 1. `Open Fix Guidance` was a placebo button
 The Attack Paths detail sidebar rendered a real CTA labeled `Open Fix Guidance`, but the click handler only executed:
 - `toast('Fix guidance opened','info')`
 No real drawer, ticket, or finding guidance path opened from that action.
 ### 2. Local training API was not dead; it was just idle
 The local training API service for MAGATAMA lane automation is managed by:
 - `org.fichtmueller.magatama-train-api`
 Live checks showed:
 - LaunchAgent state: running
 - port listener on `*:3214`
 - health response on localhost when checked outside sandbox restrictions:
  - `status = ok`
  - `service = magatama-train-api`
  - `running = false`
  - `pid = null`
 Interpretation:
 - the API process is healthy and reachable
 - it is currently idle between runs
 - the remaining proof point for automation is a fresh lane training run that writes back lane-specific run metadata and completes local adoption/version switching
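The interpretation above can be written down as a small check. This is a minimal sketch, assuming the health payload carries exactly the fields listed above (`status`, `service`, `running`, `pid`). The function name `interpret_health` and the returned labels are illustrative and not taken from the MAGATAMA codebase.

```python
from typing import Any, Mapping


def interpret_health(payload: Mapping[str, Any]) -> str:
    """Classify a magatama-train-api health payload.

    Returns "unhealthy", "idle", or "training". The labels are
    hypothetical; only the field names come from the observed response.
    """
    if payload.get("status") != "ok":
        return "unhealthy"
    # `running = false` with `pid = null` means the API is reachable
    # but no training job is currently active.
    if payload.get("running") and payload.get("pid") is not None:
        return "training"
    return "idle"


# The response recorded in the live check above.
observed = {
    "status": "ok",
    "service": "magatama-train-api",
    "running": False,
    "pid": None,
}
print(interpret_health(observed))
```

For the recorded response this yields `idle`, which matches the finding that the service is healthy but between runs.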
 ## Fix Applied
 File:
 - `magatama/packages/dashboard/public/index-v2.html`
 Changes:
 - added `openFixGuidanceForNode(nodeId)`
 - `showNodeDetail(n)` now wires the CTA to the new helper instead of a toast
 - if the graph node maps to a real finding:
  - MAGATAMA opens the existing finding/ticket detail via `openTicket(id)`
 - if the node is synthetic and has no backing finding:
  - MAGATAMA now shows a clear warning toast instead of pretending guidance opened
 ## Live Deployment
 Updated file copied to Erik:
 - `/opt/magatama/packages/dashboard/public/index-v2.html`
 Dashboard restarted:
 - `pm2 restart magatama-dashboard`
 Remote file verification confirmed presence of:
 - `openFixGuidanceForNode`
 - `Open Fix Guidance`
 ## Operational Result
 - Attack Paths no longer expose a fake remediation CTA
 - the CTA now routes into the actual MAGATAMA guidance/detail path when the node represents a real finding
 - local training API health is confirmed, but lane-specific successful return/adoption still needs validation with a fresh real training run
--- a/sync/history/2026-05-07-magatama-lane-specific-runpod-adoption-versioning.md
+++ b/sync/history/2026-05-07-magatama-lane-specific-runpod-adoption-versioning.md
@@ -0,0 +1,170 @@
 # MAGATAMA Lane-Specific RunPod Adoption + Versioning
 Date: 2026-05-07
 ## Scope
 Harden MAGATAMA training automation for:
 - `magatamallm`
 - `fo_blogllm`
 - `tip_llm`
 Goal:
 - lane-specific training pools remain isolated
 - RunPod `COMPLETED` counts only when model return/adoption is real
 - active lane model gets a new release/version marker after successful adoption
 - dashboard status and errors remain truthful
 ## Problem
 The data/build side of training already worked:
 - lane-specific RunPod datasets were built
 - RunPod jobs were submitted
 - registry often showed `IN_PROGRESS` / `COMPLETED`
 But the end of the chain remained weak:
 1. adoption/version truth still depended on one shared:
   - `~/magatama-llm/fine-tuning/last_run.json`
 2. multiple lanes could therefore overwrite the same success marker
 3. the modal could degrade late-stream adoption failures into a generic `network error`
 4. the user requirement was stricter:
   - training pool -> RunPod -> artifact -> local import -> version bump -> active alias switch
   - all fully automatic
 ## Code changes made locally
 ### 1. Lane-specific last-run metadata
 File:
 - `magatama/packages/fine-tuner/training_api.py`
 Added:
 - `lane_last_run_file(lane)`
 Resulting files:
 - `~/magatama-llm/fine-tuning/magatamallm-last_run.json`
 - `~/magatama-llm/fine-tuning/fo_blogllm-last_run.json`
 - `~/magatama-llm/fine-tuning/tip_llm-last_run.json`
 Compatibility:
 - `magatamallm` still mirrors to legacy:
  - `~/magatama-llm/fine-tuning/last_run.json`
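The file layout above could be produced roughly as follows. This is a sketch only: the body of the real `lane_last_run_file(lane)` is not shown in these notes, so the extra `base_dir` parameter, the `write_last_run` helper, and the JSON formatting are assumptions made for illustration.

```python
import json
from pathlib import Path
from typing import Any, Dict, List

LANES = ("magatamallm", "fo_blogllm", "tip_llm")
LEGACY_MIRROR_LANE = "magatamallm"  # only this lane mirrors to last_run.json
DEFAULT_DIR = Path.home() / "magatama-llm" / "fine-tuning"


def lane_last_run_file(lane: str, base_dir: Path = DEFAULT_DIR) -> Path:
    """Return the lane-specific last-run path, e.g. tip_llm-last_run.json."""
    if lane not in LANES:
        raise ValueError(f"unknown training lane: {lane!r}")
    return base_dir / f"{lane}-last_run.json"


def write_last_run(lane: str, report: Dict[str, Any],
                   base_dir: Path = DEFAULT_DIR) -> List[Path]:
    """Write the report for one lane (hypothetical helper).

    Only `magatamallm` is also mirrored to the legacy shared file, so the
    other lanes can no longer overwrite its success marker.
    """
    targets = [lane_last_run_file(lane, base_dir)]
    if lane == LEGACY_MIRROR_LANE:
        targets.append(base_dir / "last_run.json")
    base_dir.mkdir(parents=True, exist_ok=True)
    body = json.dumps(report, indent=2, sort_keys=True)
    for path in targets:
        path.write_text(body, encoding="utf-8")
    return targets
```

Rejecting unknown lane names early keeps a typo from silently creating a fourth status file that the dashboard would never read.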
 ### 2. Automatic release alias / version step
 File:
 - `magatama/packages/fine-tuner/training_api.py`
 Added:
 - `next_release_metadata(lane, active_model)`
 - release alias creation
 New adoption sequence:
 1. RunPod artifact imported to candidate model
 2. candidate smoke tests pass
 3. release alias is created:
   - example shape: `<active-alias>-rN`
 4. stable active alias is repointed to that release alias
 This means the lane now receives a concrete new release/version marker after successful adoption.
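One way to derive the `<active-alias>-rN` name is to scan the aliases that already exist and take the next counter. The sketch below is an assumption about how such a step might work, not the actual body of `next_release_metadata(lane, active_model)`. The helper name `next_release_alias` is hypothetical, and `active_alias` is assumed to be the bare model name without a tag such as `:latest`.

```python
import re
from typing import Iterable, Tuple


def next_release_alias(active_alias: str, existing: Iterable[str]) -> Tuple[str, int]:
    """Return (release_alias, version_counter) for the next release.

    `existing` is any iterable of model names already present locally.
    Names that do not match `<active_alias>-r<digits>` exactly are ignored.
    """
    pattern = re.compile(re.escape(active_alias) + r"-r(\d+)")
    highest = 0
    for name in existing:
        match = pattern.fullmatch(name)
        if match:
            highest = max(highest, int(match.group(1)))
    counter = highest + 1
    return f"{active_alias}-r{counter}", counter
```

Taking the highest existing counter rather than counting entries keeps the sequence monotonic even if an older release alias has been deleted.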
 ### 3. Dashboard lane status truth
 File:
 - `magatama/packages/dashboard/src/server.ts`
 Changed:
 - `/api/llm/status` now reads lane-specific last-run metadata first
 - `release_alias` is preferred as visible model version
 - this prevents one lane from falsely inheriting another lane's last successful run marker
 ### 4. Truthful RunPod terminal failure messaging
 Files:
 - `magatama/packages/dashboard/src/server.ts`
 - `magatama/packages/dashboard/public/index-v2.html`
 Changed:
 - if RunPod says `COMPLETED` but:
  - no model artifact exists
  - no HF repo appears
  - adoption fails
 the UI now reports that exact reason instead of collapsing into a vague generic failure
 Frontend hardening:
 - avoid showing a misleading late `network error` after the server already emitted a terminal training event
 - if the stream dies without a terminal event, the modal says so explicitly
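The distinction described here boils down to a short decision chain. The dashboard code itself is TypeScript and is not reproduced in these notes; the following Python sketch only illustrates the ordering of the checks, and the reason labels and the function name `classify_terminal` are hypothetical.

```python
def classify_terminal(runpod_status: str,
                      artifact_found: bool,
                      hf_model_found: bool,
                      adoption_ok: bool) -> str:
    """Map a terminal RunPod outcome to a specific reason label.

    Intended for terminal states only. A `COMPLETED` job counts as a
    success only when every later step also succeeded.
    """
    if runpod_status != "COMPLETED":
        return "training_failed"
    if not artifact_found:
        return "completed_without_artifact"
    if not hf_model_found:
        return "completed_without_hf_model"
    if not adoption_ok:
        return "completed_but_adoption_failed"
    return "completed_and_adopted"
```

Checking the steps in pipeline order means the reported reason always names the first step that failed, which is the one an operator needs to look at.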
 ### 5. Local training metrics future-proofed
 File:
 - `magatama/packages/fine-tuner/train.py`
 Changed:
 - metrics now also respect lane-specific last-run files via `TRAINING_LANE`
 ## Local verification
 Passed:
 - `python3 -m py_compile .../training_api.py .../train.py`
 - `pnpm -C .../packages/dashboard build`
 ## Live deployment state
 Not yet completed in this step.
 Reason:
 - direct Erik access failed during this block:
  - `ssh: connect to host 82.165.222.127 port 22: Connection refused`
  - later also `Operation timed out`
 Therefore:
 - the automation fix is locally ready
 - but not yet verified live against the currently running:
  - `tip_llm`
  - `fo_blogllm`
 ## Operational next step
 Once Erik SSH is reachable again:
 1. deploy updated:
   - `training_api.py`
   - `train.py`
   - dashboard build / server bundle
 2. restart:
   - `magatama-dashboard`
   - Mac-side training API if used
 3. verify lane-specific status:
   - `tip_llm`
   - `fo_blogllm`
   - `magatamallm`
 4. verify that a successful RunPod training now results in:
   - artifact found
   - adoption report present
   - lane-specific `*-last_run.json`
   - release alias incremented
   - stable alias repointed
--- a/sync/history/2026-05-07-magatamallm-local-training-verification.md
+++ b/sync/history/2026-05-07-magatamallm-local-training-verification.md
@@ -0,0 +1,94 @@
 # 2026-05-07 – MagatamaLLM Local Training Verification
 ## Question
 Did the recent local / MAGATAMA-side MagatamaLLM training actually succeed and increase the active model’s knowledge?
 ## Answer
 No. The dataset refresh succeeded, but a newer locally adopted MagatamaLLM model was **not** verified.
 ## Evidence
 ### 1. Public MAGATAMA status
 `GET https://magatama.fichtmueller.org/api/llm/status`
 Observed:
 - `activeProvider = ollama:magatama-coder:latest`
 - `autoFixProvider = ollama:magatama-coder:latest`
 - `training.lastTrainingAt = 2026-05-06T22:43:20Z`
 - `training.modelVersion = magatama-coder:latest`
 - `training.activeRun = null`
 Interpretation:
 - the dashboard timestamp reflects the latest dataset/training-state update
 - it does **not** prove that a new local model was imported and activated
 ### 2. Local Ollama state on the Mac
 `ollama list`
 Relevant entries:
 - `magatama-coder:latest` → modified `3 weeks ago`
 - `magatama-llm-v2-0:latest` → modified `11 days ago`
 Interpretation:
 - no newly imported Magatama candidate/adopted model is visible locally
 - the active alias still points to an older model image
 ### 3. Dataset/lane export did work
 Fresh Erik manifest exists:
 - `/opt/magatama/training-data/runpod/magatamallm/manifest.json`
 Observed:
 - `generatedAt = 2026-05-06T22:45:15.944Z`
 - `train = 15679`
 - `eval = 1743`
 - `total = 17422`
 Interpretation:
 - the lane export / pool sync is healthy
 - training input exists and was rebuilt
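The manifest counts can be cross-checked mechanically. This is a minimal sketch that assumes the manifest exposes `train`, `eval`, and `total` as top-level integers, as the observed values suggest; the function name `manifest_problems` is illustrative.

```python
from typing import Any, List, Mapping


def manifest_problems(manifest: Mapping[str, Any]) -> List[str]:
    """Return a list of consistency problems; an empty list means OK."""
    problems: List[str] = []
    for key in ("train", "eval", "total"):
        value = manifest.get(key)
        if not isinstance(value, int) or value < 0:
            problems.append(f"{key} must be a non-negative integer, got {value!r}")
    # Only compare the sums once all three counts are known to be valid.
    if not problems and manifest["train"] + manifest["eval"] != manifest["total"]:
        problems.append("train + eval does not equal total")
    return problems


# Values observed in the Erik lane manifest above.
observed = {
    "generatedAt": "2026-05-06T22:45:15.944Z",
    "train": 15679,
    "eval": 1743,
    "total": 17422,
}
print(manifest_problems(observed))
```

For the observed manifest the list is empty, since 15679 + 1743 = 17422.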
 ### 4. Adoption/registry proof is missing
 On Erik, these expected local state files were absent:
 - `/opt/magatama/training-data/model-registry/models.json`
 - `/opt/magatama/training-data/model-registry/runs.json`
 - `/opt/magatama/training-data/model-registry/active.json`
 - `/opt/magatama/data/llm-status.json`
 Interpretation:
 - no trustworthy proof that a new model artifact was imported, registered, and activated
 ### 5. Historical run records still show failed/non-adopted outcomes
 Local `training-data/model-registry/training-runs.json` still contains recent `magatamallm` runs such as:
 - `submitted`
 - `not_found_after_submit`
 There is still no verified “completed_and_adopted” proof for a new MagatamaLLM local model.
 ## Conclusion
 Current state:
 - pool refresh works
 - lane export works
 - active alias/version switching after training is still not proven
 Therefore:
 - MagatamaLLM has **not** yet gained a verified newer local knowledge state from the recent run attempts
 - MAGATAMA is still operating on the older active alias `magatama-coder:latest`
 ## Next Required Fix
 The remaining training-automation gap is still:
 1. run completes
 2. artifact existence is verified
 3. artifact is adopted/imported locally
 4. smoke tests pass
 5. active alias + model version are updated
 6. only then mark training as successful
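The six steps above can be sketched as a single gate in which success is reported only after every earlier step has passed. Everything here is an assumption for illustration: the function `finalize_run`, its callback parameters, and the result keys do not come from `training_api.py`.

```python
from typing import Any, Callable, Dict, Optional


def finalize_run(run_completed: bool,
                 artifact_exists: Callable[[], bool],
                 import_artifact: Callable[[], Optional[str]],
                 smoke_test: Callable[[str], bool],
                 repoint_alias: Callable[[str], None]) -> Dict[str, Any]:
    """Walk the adoption steps in order and stop at the first failure.

    The active alias is touched only after the smoke test passed, so a
    failed run can never replace the currently active model.
    """
    def failed(step: str) -> Dict[str, Any]:
        return {"success": False, "failed_step": step, "model": None}

    if not run_completed:
        return failed("run_completes")
    if not artifact_exists():
        return failed("artifact_verified")
    candidate = import_artifact()  # hypothetical: returns a local model name
    if candidate is None:
        return failed("artifact_imported")
    if not smoke_test(candidate):
        return failed("smoke_tests")
    repoint_alias(candidate)
    return {"success": True, "failed_step": None, "model": candidate}
```

Passing the side-effecting steps in as callables keeps the ordering logic testable without RunPod, Ollama, or the filesystem.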
Author	SHA1	Message	Date
Rene Fichtmueller	01d0365fbf	sync: record live attack-path guidance fix	2026-05-07 06:40:04 +02:00
Rene Fichtmueller	61328b0607	sync: record lane-specific runpod adoption versioning	2026-05-07 01:36:36 +02:00
Rene Fichtmueller	a6278a5041	sync: record magatamallm local training verification	2026-05-07 01:16:25 +02:00