From b5d9b4df0381fe7815dcfcc7d7e8d1e359d8c576 Mon Sep 17 00:00:00 2001
From: Rene Fichtmueller
Date: Wed, 6 May 2026 12:18:17 +0200
Subject: [PATCH] sync: record runpod status truthfulness hardening

---
 sync/CURRENT.md                               | 12 ++++-
 ...-06-magatama-runpod-status-truthfulness.md | 50 +++++++++++++++++++
 2 files changed, 61 insertions(+), 1 deletion(-)
 create mode 100644 sync/history/2026-05-06-magatama-runpod-status-truthfulness.md

diff --git a/sync/CURRENT.md b/sync/CURRENT.md
index 5348cea..918afe1 100644
--- a/sync/CURRENT.md
+++ b/sync/CURRENT.md
@@ -1,6 +1,6 @@
 # Current TIP Sync State
 
-Updated: 2026-05-06 12:02 UTC
+Updated: 2026-05-06 12:21 UTC
 
 ## Active Policy
 
@@ -60,6 +60,16 @@ When work touches TIP, Magatama, LLM Gateway, bridges, auth, or shared Erik infr
   - host-specific posture for Proxmox / Erik / Mac Studio remains the job of explicit host-audit logic.
 - after rebuild + deploy + health sync:
   - live Postgres open findings returned to `0`.
+- Follow-up hardening on the same block:
+  - the earlier RunPod error path in MAGATAMA dashboard was made more truthful.
+  - dataset preparation now distinguishes:
+    - local `training:refresh-all` failure
+    - optional Hugging Face publish failure
+    - URL-based dataset mode with no external publish required
+  - the training SSE flow now explicitly tells the operator whether RunPod is using:
+    - Hugging Face dataset source
+    - or MAGATAMA URL-bundle dataset source
+  - this avoids misleading `RunPod not reachable` wording when the actual failure is in dataset preparation.
 - MAGATAMA was repaired end-to-end to a clean operational baseline:
   - live guard host-audits for Erik, Mac Studio, and Proxmox were corrected and rerun.
diff --git a/sync/history/2026-05-06-magatama-runpod-status-truthfulness.md b/sync/history/2026-05-06-magatama-runpod-status-truthfulness.md
new file mode 100644
index 0000000..3752e8c
--- /dev/null
+++ b/sync/history/2026-05-06-magatama-runpod-status-truthfulness.md
@@ -0,0 +1,50 @@
+# 2026-05-06 — MAGATAMA RunPod Status Truthfulness
+
+## Why this was needed
+
+After the script/registry repair, MAGATAMA could refresh the local RunPod datasets again, but the operator-facing status flow was still too coarse:
+
+- failures in local dataset preparation
+- failures in optional Hugging Face publish
+- and actual RunPod availability
+
+were too easy to confuse.
+
+This produced the impression that “RunPod is broken” even when the real problem was just dataset preparation on Erik.
+
+## Changes
+
+Patched:
+
+- `magatama/packages/dashboard/src/server.ts`
+
+Behavior now:
+
+- dataset source is normalized to either:
+  - `huggingface`
+  - `url`
+- local dataset refresh (`training:refresh-all`) is wrapped with a dedicated error:
+  - `Dataset-Refresh fehlgeschlagen: ...`
+- Hugging Face publish is wrapped with a dedicated error:
+  - `HuggingFace-Publish fehlgeschlagen: ...`
+- if Hugging Face mode is selected but `HF_TOKEN` is missing, this is reported directly
+- after successful preparation, the SSE stream now explicitly states:
+  - Hugging Face dataset source in use
+  - or URL-bundle dataset source in use, with no external publish required
+
+## Live effect
+
+The dashboard process was rebuilt and restarted on Erik after this change.
+
+Result:
+
+- RunPod preparation status is more honest
+- operators can distinguish:
+  - data refresh problem
+  - optional external publish problem
+  - actual RunPod training job submission/polling problem
+
+## Notes
+
+- This does not itself force a Hugging Face publish.
+- It only makes the control plane truthful about what step is happening and what actually failed.
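
The "Behavior now" list in the history note can be illustrated with a minimal TypeScript sketch. This is not the code from `server.ts`, which this patch does not include. `DatasetSource`, `PrepareSteps`, `normalizeDatasetSource`, `prepareDataset`, the fallback to `url` for unrecognized values, the step ordering, the status strings, and the wording of the missing-token message are all assumptions made for illustration. Only the two error prefixes, the `huggingface` / `url` values, and the `HF_TOKEN` name come from the note. The steps are synchronous here for brevity; the real dashboard flow is presumably asynchronous.

```typescript
// Hypothetical sketch; names and messages not quoted in the note are illustrative.
type DatasetSource = "huggingface" | "url";

interface PrepareSteps {
  refreshAll: () => void;           // stands in for the local `training:refresh-all` run
  publishToHuggingFace: () => void; // stands in for the optional Hugging Face publish
}

// Assumption: anything other than "huggingface" falls back to "url".
function normalizeDatasetSource(raw: string | undefined): DatasetSource {
  return (raw ?? "").trim().toLowerCase() === "huggingface" ? "huggingface" : "url";
}

function errorText(err: unknown): string {
  return err instanceof Error ? err.message : String(err);
}

// Returns the status line an SSE stream could send after successful preparation.
function prepareDataset(
  rawSource: string | undefined,
  hfToken: string | undefined,
  steps: PrepareSteps,
): string {
  const source = normalizeDatasetSource(rawSource);

  // Step 1: local refresh, wrapped so its failure is not mistaken for a RunPod outage.
  try {
    steps.refreshAll();
  } catch (err) {
    throw new Error(`Dataset-Refresh fehlgeschlagen: ${errorText(err)}`);
  }

  // URL-bundle mode needs no external publish.
  if (source === "url") {
    return "Dataset source: MAGATAMA URL bundle (no external publish required)";
  }

  // Step 2: Hugging Face mode requires a token; report its absence directly.
  if (!hfToken) {
    throw new Error("HF_TOKEN is missing but Hugging Face dataset mode is selected");
  }
  try {
    steps.publishToHuggingFace();
  } catch (err) {
    throw new Error(`HuggingFace-Publish fehlgeschlagen: ${errorText(err)}`);
  }
  return "Dataset source: Hugging Face";
}
```

Wrapping each step in its own `try`/`catch` means the operator sees which stage failed, so a failure before job submission is never reported as `RunPod not reachable`.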