Adding diverse topic coverage:
- blog-008 (buying_guide): OEM vs compatible real cost numbers
- blog-009 (migration_guide): 100G→400G what actually breaks
- blog-010 (technology_deep_dive): QSFP-DD vs OSFP form factor reality
- blog-011 (tutorial): transceiver procurement checklist

All follow FO rules: no markdown headers in body, no bullet lists, one thesis, engineer voice, ~1000 words. Total training set: 11 articles.
| title | type | target_audience | score |
|---|---|---|---|
| 100G to 400G Migration: What Actually Breaks and Why | migration_guide | technical | 9/10 |
Every 100G to 400G migration story starts the same way. The planning phase looks clean. The vendor presentations are reassuring. The lab tests pass. Then you push the first production links and something unexpected happens. Not catastrophic — just wrong. The errors you didn't plan for, the connectors that "worked fine" at 100G but don't at 400G, the fiber path you've run for six years that's suddenly marginal.
This is not a guide about what to buy. It's about what breaks, why it breaks at 400G when it was fine at 100G, and what to verify before you're doing it at 2 AM with traffic on it.
The first thing that breaks is your assumptions about fiber. Standard 100G SR4 is specified for 100 meters over OM4, but extended-reach SR4 modules and the comfortable margins at 100G let a lot of plants run longer than that without trouble. 400G SR8 over OM4 is also specified for 100 meters, and it is much less forgiving past that point. Your 120-meter cross-connect that's been solid for four years is now out of spec in a way that matters. You didn't change the fiber. You didn't change the topology. You changed the optic and the speed and suddenly the link is marginal. This affects more deployments than people admit. Before migrating any fabric, pull your cable plant documentation and verify every run against the new reach specs. If your cable plant documentation is "we think it's around X meters," measure it.
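The reach check is easy to script once the lengths are measured. Here is a minimal sketch, assuming the cable plant records can be exported as a list of runs with a fiber type and a measured length. The run names, the `CableRun` type, and the `audit_runs` helper are all hypothetical, and the reach table should be replaced with the datasheet values for the modules you are actually buying.

```python
from dataclasses import dataclass

# Reach limits in meters, keyed by (optic, fiber type). These are the figures
# used in this article; confirm them against the datasheet for your modules.
REACH_M = {
    ("400G-SR8", "OM4"): 100,
    ("400G-DR4", "SMF"): 500,
    ("400G-FR4", "SMF"): 2000,
}


@dataclass
class CableRun:
    name: str        # hypothetical label from the cable plant records
    fiber: str       # "OM4" or "SMF"
    length_m: float  # measured length, not an estimate


def audit_runs(runs, optic, headroom_m=0.0):
    """Return (name, length, limit) for every run the optic cannot cover.

    A limit of None means the optic has no entry for that fiber type at all.
    """
    failures = []
    for run in runs:
        limit = REACH_M.get((optic, run.fiber))
        if limit is None or run.length_m > limit - headroom_m:
            failures.append((run.name, run.length_m, limit))
    return failures


runs = [
    CableRun("row3-xconnect", "OM4", 120.0),
    CableRun("row1-leaf2", "OM4", 45.0),
]

for name, length, limit in audit_runs(runs, "400G-SR8"):
    print(f"{name}: {length} m against a limit of {limit} m")
```

The `headroom_m` argument is there so you can flag runs that are technically in spec but close to the edge, which is where the marginal links tend to come from.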
The second thing that breaks is MPO connectivity, both connector type and polarity. This one has ended careers. 100G SR4 uses MPO-12 connectors. So does 400G DR4. But 400G SR8 uses MPO-16. If your migration path goes through an intermediate step (100G SR4 to 400G SR4 to 400G SR8), you're changing the connector type along the way. And if you're using breakout cables to connect to servers or legacy switches, the polarity matters. Method B and Method C polarity schemes route fibers differently. An MPO trunk that worked fine with your 100G SR4 may or may not work with your 400G modules, depending on its polarity. Test with the actual module before deploying. Don't assume the previous polarity map is valid.
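To see why the trunk type matters, it helps to trace where each fiber position lands. The sketch below is a deliberately simplified model: one MPO-12 trunk connected directly between two four-lane parallel optics, with transmit on positions 1-4 and receive on positions 9-12. That lane layout and the helper names are assumptions for illustration. Real channels add patch cords and cassettes, each of which can introduce its own flip, so this does not replace testing with the actual module.

```python
TX_POSITIONS = (1, 2, 3, 4)      # assumed transmit lanes on an MPO-12 parallel optic
RX_POSITIONS = {9, 10, 11, 12}   # assumed receive lanes


def far_position(position, trunk_type):
    """Map a near-end fiber position (1-12) to the far-end position."""
    if trunk_type == "A":   # straight-through: 1 -> 1
        return position
    if trunk_type == "B":   # fully reversed: 1 -> 12
        return 13 - position
    if trunk_type == "C":   # adjacent pairs flipped: 1 -> 2, 2 -> 1
        return position + 1 if position % 2 == 1 else position - 1
    raise ValueError(f"unknown trunk type: {trunk_type}")


def trunk_connects_tx_to_rx(trunk_type):
    """True if every transmit lane lands on a receive lane at the far end."""
    return all(far_position(p, trunk_type) in RX_POSITIONS for p in TX_POSITIONS)


for trunk_type in ("A", "B", "C"):
    print(trunk_type, trunk_connects_tx_to_rx(trunk_type))
```

In this model only the fully reversed trunk puts transmit onto receive without help from a patch cord, which is why a polarity map inherited from a duplex design cannot be assumed to carry over to parallel optics.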
The loss budget changes significantly at 400G, and this is where most marginal fiber plants get exposed. At 100G LR4 (1310 nm window, single-mode), you have 6.3 dB of loss budget. The typical link: 2 LC connectors at 0.3 dB each (0.6 dB) plus 10 km of single-mode fiber at 0.35 dB/km (3.5 dB), for 4.1 dB of loss and 2.2 dB of margin. That's fine. At 400G FR4 (same window, same fiber), the channel loss budget is 4.0 dB, and FR4 covers 2 km, not 10 km. If you're doing 400G FR4 over campus fiber with multiple patch panels, you might be at 3-4 dB of connector loss alone, before counting the fiber run. You don't have the margin your old 100G LR4 gave you.
Clean the connectors. I mean actually clean them, not "we cleaned them a few years ago." Dirty fiber connectors account for a disproportionate share of 400G link issues because the power budget margin is tighter. At 100G you were getting away with connectors that measured as much as 1 dB instead of the spec 0.3 dB. At 400G, that 0.7 dB of extra loss per connector across 4 patch panel connections in a path is 2.8 dB of unexpected loss. On a path with a little over 2 dB of margin, the dirty connectors alone put you over budget.
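The arithmetic is worth keeping in a script so you can rerun it per path. This is a rough sketch, not a link-engineering tool: the function names are made up, the 0.35 dB/km attenuation is the figure used earlier, and the 6.3 dB budget is just the 100G LR4 number quoted above standing in as a placeholder. Substitute the budget from your module's datasheet.

```python
def path_loss_db(connector_losses_db, fiber_km, fiber_db_per_km=0.35):
    """Total path loss: every connector plus fiber attenuation."""
    return sum(connector_losses_db) + fiber_km * fiber_db_per_km


def margin_db(budget_db, connector_losses_db, fiber_km, fiber_db_per_km=0.35):
    """Remaining margin; a negative value means the path is over budget."""
    return budget_db - path_loss_db(connector_losses_db, fiber_km, fiber_db_per_km)


BUDGET_DB = 6.3          # placeholder: the 100G LR4 figure quoted earlier
FIBER_KM = 10.0

clean = [0.3] * 4        # four patch panel connections at spec
dirty = [1.0] * 4        # the same four connections, each 0.7 dB worse

extra = sum(dirty) - sum(clean)
print(f"extra connector loss: {extra:.1f} dB")
print(f"margin, clean: {margin_db(BUDGET_DB, clean, FIBER_KM):.1f} dB")
print(f"margin, dirty: {margin_db(BUDGET_DB, dirty, FIBER_KM):.1f} dB")
```

Feeding it measured connector losses from an OTDR or light-source test, rather than spec values, is what makes the output worth trusting.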
The fiber type question comes up on every migration and the answer is the same: single-mode for anything beyond short-reach multi-mode distances, with DR4 for runs up to 500 meters and FR4/LR4 for longer runs. Multi-mode works fine for short-reach 400G within a data center. What doesn't work is trying to push 400G LR4 modules over multi-mode fiber. That's not because the optic will fail; it'll launch light just fine. It's because the modal dispersion in multi-mode fiber will destroy the signal quality at 400G speeds. Running single-mode optics over multi-mode fiber was forgivable at 10G, barely workable at 100G in some cases, and not workable at 400G.
The switch configuration side of 400G migrations has its own landmines. The most common is a change in auto-negotiation behavior. At 100G, auto-neg is either on or off and usually works either way. At 400G on QSFP-DD, link training and auto-negotiation are more complex, and platforms ship with different defaults. When you replace a 100G top-of-rack switch with a 400G one, the server NICs now facing 400G-capable ports may not train properly if auto-neg is configured inconsistently on the two ends. Test the actual NIC firmware on the actual server against the 400G switch you're deploying, not against the vendor's interoperability matrix. That matrix was built in a lab with specific firmware versions that may not match what you're running.
The breakout question also changes at 400G. 100G switches commonly offered 4x25G breakout from QSFP28 ports. 400G switches offer 4x100G breakout from QSFP-DD ports. If you're connecting legacy 100G servers to a 400G spine, you're either running them on dedicated 100G ToR switches (wasteful) or using breakout cables from the 400G switch. The 4x100G breakout works well when supported. The 2x200G breakout from some platforms is less universally supported in the ecosystem. Know your breakout requirements before committing to a platform.
The coherent optics question for 400G only applies to specific topologies — DCI, long-haul backbone, anything with DWDM. For data center fabric, the answer is non-coherent: DR4 for intra-DC, FR4 for campus, LR4 for longer campus runs. Coherent 400ZR is for WAN extension over DWDM infrastructure, not for general fabric. If someone is suggesting 400ZR for your data center spine, they're either wrong about the use case or your network has unusual topology requirements.
The DOM monitoring gap in many networks becomes visible during 400G migration. At 100G, you might have been running without per-port power monitoring because the margins were comfortable. At 400G, you need to know your TX power, RX power, pre-FEC BER, and temperature for every port. Not weekly in a report. In your monitoring system, with alerts. The first 400G link degradation you catch proactively through monitoring will justify the setup time. Without it, you find out from users reporting slow transfers or packet loss.
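What "in your monitoring system, with alerts" means in practice is a per-port check against thresholds. Here is a minimal sketch, assuming you already have a collector that hands you DOM readings per port. The `DomReading` type, the port name, and every threshold value are illustrative placeholders, not vendor numbers; the real alarm and warning thresholds should come from the module itself or its datasheet.

```python
from dataclasses import dataclass


@dataclass
class DomReading:
    port: str
    tx_dbm: float
    rx_dbm: float
    pre_fec_ber: float
    temp_c: float


# Placeholder thresholds for illustration only.
THRESHOLDS = {
    "tx_min_dbm": -6.0,
    "rx_min_dbm": -8.0,
    "pre_fec_ber_max": 1e-5,
    "temp_max_c": 70.0,
}


def check_reading(reading, thresholds=THRESHOLDS):
    """Return a list of alert strings for one port; empty means healthy."""
    alerts = []
    if reading.tx_dbm < thresholds["tx_min_dbm"]:
        alerts.append(f"{reading.port}: TX power low ({reading.tx_dbm} dBm)")
    if reading.rx_dbm < thresholds["rx_min_dbm"]:
        alerts.append(f"{reading.port}: RX power low ({reading.rx_dbm} dBm)")
    if reading.pre_fec_ber > thresholds["pre_fec_ber_max"]:
        alerts.append(f"{reading.port}: pre-FEC BER high ({reading.pre_fec_ber:.1e})")
    if reading.temp_c > thresholds["temp_max_c"]:
        alerts.append(f"{reading.port}: temperature high ({reading.temp_c} C)")
    return alerts


sample = DomReading("Ethernet1/1", tx_dbm=-1.2, rx_dbm=-9.5, pre_fec_ber=3e-5, temp_c=52.0)
for alert in check_reading(sample):
    print(alert)
```

Pre-FEC BER is the reading to watch most closely, since it can drift for a long time before FEC stops hiding the problem from users.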
The migration sequence that works: start with a single spine-leaf pair, run at full line rate for two weeks before migrating the rest, collect baseline DOM data during that period, identify any outliers in fiber paths or connector quality, fix them before scaling. The migration sequence that creates problems: migrate the whole fabric in a weekend because the maintenance window is approved. You don't know which paths are marginal until the traffic is on them, and "marginal" at 400G can mean intermittent errors instead of clean failures.
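The baseline period only pays off if something actually compares the ports against each other. One crude way to do that, sketched below, is to flag any port whose median RX power sits well below the median of its peers. The `rx_outliers` helper, the port names, and the 2 dB tolerance are assumptions for illustration. The comparison only makes sense across links of similar length and optic type, so group the ports before running it.

```python
from statistics import median


def rx_outliers(baseline, tolerance_db=2.0):
    """Flag ports whose median RX power is well below the group median.

    baseline maps a port name to its RX power samples in dBm, collected
    during the soak period.
    """
    per_port = {port: median(samples) for port, samples in baseline.items()}
    group_median = median(per_port.values())
    return sorted(
        port for port, value in per_port.items()
        if group_median - value > tolerance_db
    )


baseline = {
    "Ethernet1/1": [-2.1, -2.0, -2.2],
    "Ethernet1/2": [-2.4, -2.3, -2.5],
    "Ethernet1/3": [-5.9, -6.1, -6.0],
}

print(rx_outliers(baseline))
```

A port flagged here is not necessarily failing. It is the one whose fiber path and connectors deserve inspection before the rest of the fabric moves.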
None of this is a reason to delay a 400G migration. The technology is mature, the compatible optics are available, the switch ecosystem is solid. The thing to avoid is rushing it. The fiber plant surprises are real. The connector cleaning is necessary. The DOM monitoring is non-optional. Do those three things and most 400G migrations are unremarkable. Skip them and you'll remember the migration for years, for the wrong reasons.