| title | type | target_audience | score |
| --- | --- | --- | --- |
| Why DOM Readings Lie: What Your Transceiver Is Not Telling You | technology_deep_dive | technical | 9/10 |

DOM data is the first place engineers look when a link is misbehaving, and it is frequently the last place they find the actual cause. The problem is not that Digital Optical Monitoring is useless — it is that the values it exposes are proxies for physical conditions, and the relationship between the proxy and the condition breaks down in specific, predictable ways that most engineers never learn because the link usually works and the discrepancy never surfaces. When the link is marginal, those discrepancies become the difference between a correct diagnosis and two hours of misguided troubleshooting.

Start with the measurement window. SFF-8636 and CMIS specifications define DOM registers as rolling averages over an implementation-defined interval. Most module vendors use windows between 100 ms and 500 ms, but nothing in either standard mandates a specific value, and vendors do not generally publish what window their modules use. What this means in practice is that a burst error event lasting 10 ms, long enough to drop 267,000 frames on a 100G path running at line rate with an average frame size of roughly 470 bytes, produces a transient in instantaneous RX power that may reduce the averaged register value by less than 0.1 dB. The register reads as completely normal. Meanwhile, the switch's post-FEC counters may also look normal because RS-FEC corrected the burst. The pre-FEC BER counter, if the platform exposes it, will show elevated symbol errors for the polling interval that contains the burst and then return to baseline. An engineer looking at DOM thirty seconds after the event sees nothing. The link is declared healthy. The event repeats every few hours at peak utilization.
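To make the averaging effect concrete, here is a minimal sketch of how far a windowed-average power reading moves when a short burst falls inside one window. It assumes a simple boxcar average in the linear (mW) domain and a worst-case burst in which received power drops to zero; real modules may average differently, and the function name and parameters are illustrative rather than taken from any specification.

```python
import math

def averaged_dip_db(burst_ms: float, window_ms: float,
                    burst_power_fraction: float = 0.0) -> float:
    """Change (in dB) of a boxcar-averaged power reading when optical power
    falls to `burst_power_fraction` of nominal for `burst_ms` inside one
    averaging window of `window_ms`. Assumes averaging in linear units."""
    if not 0 < burst_ms <= window_ms:
        raise ValueError("burst must be positive and fit inside the window")
    duty = burst_ms / window_ms
    # Time-weighted mean of nominal power (1.0) and power during the burst.
    mean_linear = (1.0 - duty) + duty * burst_power_fraction
    if mean_linear <= 0.0:
        return float("-inf")  # no light for the entire window
    return 10.0 * math.log10(mean_linear)

# A 10 ms total loss of light seen through three assumed window lengths.
for window in (100, 250, 500):
    print(f"{window} ms window: {averaged_dip_db(10, window):+.2f} dB")
```

Under these assumptions the 500 ms window moves by about 0.09 dB and the 100 ms window by about 0.46 dB, so whether the event is visible at all depends on a window length the vendor typically does not publish.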

TX bias current is the DOM parameter that tells the truth about module aging, and almost nobody monitors it. TX power is what engineers watch, but TX power is actively regulated by the module's automatic power control (APC) circuit, which adjusts bias current to maintain a target output level as the laser ages. The result is that TX power remains stable and within spec even as the laser diode degrades, because the control loop is doing its job, right up until the bias current hits the maximum value the driver circuit can supply, at which point TX power collapses. By the time TX power deviates from its nominal value, the module has been on a failure trajectory for months. The bias current trend over time is the leading indicator. A VCSEL-based 25G SFP28 that shipped at 6 mA of bias current and is now running at 14 mA against a high alarm threshold of 17 mA has already used 8 mA of its 11 mA of headroom and likely has less than a year of life remaining under steady operating temperature. TX power still reads nominal. DOM says the module is healthy.
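One way to turn a bias current history into a replacement estimate is a straight-line fit extrapolated to the alarm threshold. The sketch below assumes you already have timestamped bias samples from your own polling; the function name is hypothetical, and the intermediate samples and the 24-month age are invented for illustration, since the text gives only the 6 mA starting point, the 14 mA current value, and the 17 mA threshold. Laser wear-out often accelerates near end of life, so a linear fit should be read as an optimistic upper bound.

```python
from typing import Optional, Sequence

def months_until_threshold(months: Sequence[float],
                           bias_ma: Sequence[float],
                           alarm_ma: float) -> Optional[float]:
    """Fit a least-squares line through (month, bias mA) samples and return
    how many months after the last sample the line reaches `alarm_ma`.
    Returns None when the fitted trend is flat or falling."""
    n = len(months)
    if n < 2 or n != len(bias_ma):
        raise ValueError("need at least two paired samples")
    mean_t = sum(months) / n
    mean_b = sum(bias_ma) / n
    sxx = sum((t - mean_t) ** 2 for t in months)
    if sxx == 0:
        raise ValueError("samples must span more than one point in time")
    slope = sum((t - mean_t) * (b - mean_b)
                for t, b in zip(months, bias_ma)) / sxx
    if slope <= 0:
        return None
    intercept = mean_b - slope * mean_t
    crossing = (alarm_ma - intercept) / slope
    # Never report a negative remaining time if the threshold is already passed.
    return max(0.0, crossing - months[-1])

# Hypothetical history: 6 mA at install, 14 mA after 24 months, alarm at 17 mA.
remaining = months_until_threshold([0, 6, 12, 18, 24], [6, 8, 10, 12, 14], 17)
print(f"estimated months to alarm threshold: {remaining:.1f}")
```

With this assumed history the fit gives about nine months of remaining headroom, which is the kind of number that belongs in a maintenance plan long before TX power moves.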

Temperature compensation is a specific mechanism that makes thermal alarms misleading on modern modules. QSFP28 and QSFP-DD modules implement a lookup table that adjusts the reported TX power and RX power values based on the measured die temperature, because optical output and receiver sensitivity are temperature-dependent. The compensation makes the power readings appear stable across the module's operating temperature range. What it masks is that a module reporting 68°C in its temperature register is operating in a region where VCSEL degradation rate accelerates by roughly a factor of two for every 10°C above 60°C, based on published Arrhenius model data from major VCSEL vendors. The DOM temperature alarm does not fire because 68°C is within the module's specified operating range. The TX power register looks fine because the compensation table adjusted it. The engineer sees no flags. The module is being consumed at roughly twice the rate of a module running at 55°C in a well-cooled cage.
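The doubling rule quoted above is easy to put numbers on. This sketch applies it literally: the rate doubles for every 10°C above 60°C, and anything at or below 60°C is treated as the baseline rate of 1.0. That clamp is a simplifying assumption, since the rule of thumb says nothing about cooler temperatures, and the function name is illustrative.

```python
def relative_wear_rate(temp_c: float,
                       reference_c: float = 60.0,
                       doubling_step_c: float = 10.0) -> float:
    """Relative degradation rate under a 'doubles every N degrees above the
    reference' rule of thumb. Temperatures at or below the reference are
    treated as the baseline rate (assumption: the rule is only stated for
    temperatures above the reference)."""
    excess = max(0.0, temp_c - reference_c)
    return 2.0 ** (excess / doubling_step_c)

hot = relative_wear_rate(68.0)
cool = relative_wear_rate(55.0)
print(f"68 C vs 55 C wear ratio: {hot / cool:.2f}x")
```

Read this way, a module at 68°C ages about 1.7 times faster than one at 55°C, and every additional degree compounds, which is why the temperature trend deserves attention even when no alarm is raised.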

DOM cannot measure what happens outside the module. This is obvious when stated directly but it is routinely forgotten during troubleshooting. RX power is measured at the photodetector inside the module, after the light has passed through the receiver lens, the wavelength filter, and the mode conditioner on multimode variants. It does not know whether the 0.8 dB of loss between the transmitting module and the receiving module comes from a fiber bend, a dirty connector, a mismatched fiber type, or a partially engaged MPO. It reports a number. The number is correct as a measurement of optical power at that point in the optical path. The interpretation of what caused that power level is entirely left to the engineer, who frequently blames the module when the answer is the connector.

The RX power low warning threshold in DOM is set by the module manufacturer at the point where the optical link is approaching receiver sensitivity limits. On a QSFP28 100GBASE-LR4 module that value is typically around -11 dBm against a receiver sensitivity of -13.5 dBm. An RX power reading of -11.5 dBm triggers a warning, and the instinct is to replace the transceiver. But the relevant question is whether the -11.5 dBm represents a degraded module or a degraded fiber path. If the module was receiving -9.5 dBm at commissioning and now receives -11.5 dBm, 2 dB of loss has appeared somewhere upstream of the photodetector. Provided the far-end TX power is unchanged, that loss is in the path. Fiber loss does not spontaneously increase over time unless something physical changed: a bend radius violation introduced during a cable tray reorganization, connector contamination, or physical damage to the patch cord. Nothing changed inside the receiving module. The fiber changed. A correct diagnosis requires comparing current DOM values against commissioning baselines, not against the manufacturer's alarm thresholds.
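The baseline comparison can be sketched in a few lines. This example assumes you recorded both the far-end TX power and the local RX power at commissioning; the far-end value is an addition for illustration (1.0 dBm is a made-up number), as are the class name, function names, and the 0.5 dB tolerance. The RX values are the ones used in the paragraph above.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class OpticalSnapshot:
    """One reading of both ends of a link, in dBm."""
    remote_tx_dbm: float
    local_rx_dbm: float

def path_loss_db(snap: OpticalSnapshot) -> float:
    """Loss between the far-end transmitter and the local photodetector."""
    return snap.remote_tx_dbm - snap.local_rx_dbm

def compare_to_baseline(baseline: OpticalSnapshot,
                        current: OpticalSnapshot,
                        tolerance_db: float = 0.5) -> List[str]:
    """Separate 'the transmitter got weaker' from 'the path got lossier'."""
    findings = []
    tx_drop = baseline.remote_tx_dbm - current.remote_tx_dbm
    added_loss = path_loss_db(current) - path_loss_db(baseline)
    if tx_drop > tolerance_db:
        findings.append(
            f"far-end TX power down {tx_drop:.1f} dB: suspect the transmitting module")
    if added_loss > tolerance_db:
        findings.append(
            f"path loss up {added_loss:.1f} dB: inspect connectors, bends, patch cords")
    return findings

# Commissioning vs. today; the far-end TX value is hypothetical.
baseline = OpticalSnapshot(remote_tx_dbm=1.0, local_rx_dbm=-9.5)
current = OpticalSnapshot(remote_tx_dbm=1.0, local_rx_dbm=-11.5)
for finding in compare_to_baseline(baseline, current):
    print(finding)
```

With these inputs the only finding is 2.0 dB of added path loss, which points at the fiber plant rather than at either transceiver.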

The correct way to use DOM data involves understanding which registers have physical meaning and which are derived or estimated. The temperature register is a direct measurement from a thermistor on the module substrate, and it is the most reliable DOM value. The TX bias current register is a direct measurement from the driver circuit, and it is the best aging indicator. The TX power register reflects the laser's monitor photodetector and is generally accurate, but because the APC loop holds it constant it says little about laser health. The RX power register reflects the receiver photodetector and is accurate, but it is a local measurement at the end of the optical path, not a characterization of the path itself. Supply voltage registers are accurate and useful for identifying power rail problems on the line card. The supply voltage dropping below 3.2 V on a nominal 3.3 V module is a real failure indicator that shows up in DOM before any optical parameter deviates.

Flexoptix EEPROM programming makes it possible to reconfigure module alarm and warning thresholds to match the actual optical power budget of the specific deployment rather than the generic thresholds the manufacturer ships with. A module deployed on a 15 km LR4 path with 2.5 dB of measured fiber loss and 4.5 dB of margin has very different appropriate alarm thresholds than the same module on a 2 km path with 0.8 dB of loss and 6.2 dB of margin. Platform-specific programming also ensures that the DOM data appears correctly in the management plane of the target switch platform, which matters because some platforms apply alarm masks differently depending on the vendor ID in the module EEPROM. Generic modules from the field sometimes have alarm thresholds set to the absolute minimum the standard requires, which generates false alarms on healthy links and trains engineers to ignore DOM warnings — which is exactly the behavior you do not want when a real marginal link appears.

The engineers who get the most diagnostic value from DOM are the ones who treat it as a trending tool rather than an instantaneous health indicator. Polling TX bias current and cage temperature weekly, graphing the trends over months, and setting actionable thresholds based on those trends rather than on the manufacturer's alarm register gives you actual predictive value. A bias current that has increased by 20 percent over six months on a module that is eighteen months old is a replacement candidate at the next maintenance window, not when the link fails at 3 AM on a Tuesday.