sync: record runpod status truthfulness hardening
parent 364cd392c7
commit b5d9b4df03
@@ -1,6 +1,6 @@
# Current TIP Sync State

-Updated: 2026-05-06 12:02 UTC
+Updated: 2026-05-06 12:21 UTC

## Active Policy

@@ -60,6 +60,16 @@ When work touches TIP, Magatama, LLM Gateway, bridges, auth, or shared Erik infr
- host-specific posture for Proxmox / Erik / Mac Studio remains the job of explicit host-audit logic.
- after rebuild + deploy + health sync:
  - live Postgres open findings returned to `0`.
- Follow-up hardening on the same block:
  - the earlier RunPod error path in MAGATAMA dashboard was made more truthful.
  - dataset preparation now distinguishes:
    - local `training:refresh-all` failure
    - optional Hugging Face publish failure
    - URL-based dataset mode with no external publish required
  - the training SSE flow now explicitly tells the operator whether RunPod is using:
    - Hugging Face dataset source
    - or MAGATAMA URL-bundle dataset source
  - this avoids misleading `RunPod not reachable` wording when the actual failure is in dataset preparation.

- MAGATAMA was repaired end-to-end to a clean operational baseline:
  - live guard host-audits for Erik, Mac Studio, and Proxmox were corrected and rerun.

@@ -0,0 +1,50 @@
# 2026-05-06 — MAGATAMA RunPod Status Truthfulness

## Why this was needed

After the script/registry repair, MAGATAMA could refresh the local RunPod datasets again, but the operator-facing status flow was still too coarse. Three different conditions were too easy to confuse:

- failures in local dataset preparation
- failures in optional Hugging Face publish
- actual RunPod availability problems

This produced the impression that “RunPod is broken” even when the real problem was just dataset preparation on Erik.

## Changes

Patched:

- `magatama/packages/dashboard/src/server.ts`

Behavior now:

- dataset source is normalized to either:
  - `huggingface`
  - `url`
- local dataset refresh (`training:refresh-all`) is wrapped with a dedicated error:
  - `Dataset-Refresh fehlgeschlagen: ...` (German for "dataset refresh failed")
- Hugging Face publish is wrapped with a dedicated error:
  - `HuggingFace-Publish fehlgeschlagen: ...` (German for "Hugging Face publish failed")
- if Hugging Face mode is selected but `HF_TOKEN` is missing, this is reported directly
- after successful preparation, the SSE stream now explicitly states:
  - Hugging Face dataset source in use
  - or URL-bundle dataset source in use, with no external publish required
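
The behavior listed above could be sketched roughly as follows. This is a minimal illustration, not the code in `server.ts`: the names `normalizeDatasetSource`, `stepError`, `PrepareDeps`, and `prepareDataset` are hypothetical, the fallback to `url` for unrecognized input is an assumption, and so is the order of the `HF_TOKEN` check relative to the refresh. Only the two error prefixes and the two dataset source values come from the notes.

```typescript
// Hypothetical sketch; none of these names are taken from server.ts.
type DatasetSource = "huggingface" | "url";

// Normalize free-form input to one of the two supported sources.
// Assumption: anything other than "huggingface" falls back to "url".
function normalizeDatasetSource(raw: string | undefined): DatasetSource {
  const value = (raw ?? "").trim().toLowerCase();
  return value === "huggingface" ? "huggingface" : "url";
}

// Build an error whose message names the step that failed.
function stepError(prefix: string, err: unknown): Error {
  const detail = err instanceof Error ? err.message : String(err);
  return new Error(`${prefix}: ${detail}`);
}

// The real steps are injected so the sketch stays self-contained.
interface PrepareDeps {
  refreshAll: () => Promise<void>;   // stands in for `training:refresh-all`
  publishToHub: () => Promise<void>; // stands in for the Hugging Face publish
  hfToken: string | undefined;       // value of HF_TOKEN, if any
}

// Returns the status line that would be sent on the SSE stream on success.
async function prepareDataset(
  rawSource: string | undefined,
  deps: PrepareDeps,
): Promise<string> {
  const source = normalizeDatasetSource(rawSource);

  try {
    await deps.refreshAll();
  } catch (err) {
    throw stepError("Dataset-Refresh fehlgeschlagen", err);
  }

  if (source === "url") {
    return "URL-bundle dataset source in use, no external publish required";
  }

  // Report the missing token directly instead of letting the publish fail.
  if (!deps.hfToken) {
    throw new Error("Hugging Face mode selected but HF_TOKEN is missing");
  }
  try {
    await deps.publishToHub();
  } catch (err) {
    throw stepError("HuggingFace-Publish fehlgeschlagen", err);
  }
  return "Hugging Face dataset source in use";
}
```

Because every failure is tagged where it is thrown, whatever catches the error downstream never has to guess which step failed.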

## Live effect

The dashboard process was rebuilt and restarted on Erik after this change.

Result:

- RunPod preparation status is more honest
- operators can distinguish:
  - data refresh problem
  - optional external publish problem
  - actual RunPod training job submission/polling problem
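
One way to keep these three problem classes separate on the wire is to tag each failure with its stage before it is written to the SSE stream. The sketch below is an assumption about how that could look, not a description of the dashboard's actual event format: the `Stage` values, the `training-error` event name, and the payload fields are all hypothetical. Only the `event:` / `data:` framing with a terminating blank line is standard Server-Sent Events syntax.

```typescript
// Hypothetical stage tags for the three problem classes listed above.
type Stage = "dataset-refresh" | "hf-publish" | "runpod-job";

interface StageFailure {
  stage: Stage;
  message: string;
}

// Operator-facing labels, one per stage.
const STAGE_LABEL: Record<Stage, string> = {
  "dataset-refresh": "data refresh problem",
  "hf-publish": "optional external publish problem",
  "runpod-job": "RunPod training job submission/polling problem",
};

// Serialize a failure as one SSE frame: an event name, a single data line,
// and the blank line that terminates the frame. JSON.stringify escapes
// newlines, so the payload always fits on one data line.
function failureFrame(failure: StageFailure): string {
  const payload = {
    stage: failure.stage,
    label: STAGE_LABEL[failure.stage],
    message: failure.message,
  };
  return `event: training-error\ndata: ${JSON.stringify(payload)}\n\n`;
}
```

A client can then switch on `stage` instead of parsing message text, so a dataset failure is not shown as a RunPod outage.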

## Notes

- This does not itself force a Hugging Face publish.
- It only makes the control plane truthful about what step is happening and what actually failed.