transceiver-db/blog-085-ai-inference-cluster-optics-requirements.md at reconcile-2026-06-04

Rene Fichtmueller 772ce2074d feat: add blog training articles 056-100 for fo-blog-v3 fine-tuning

45 expert articles covering: Cisco/Juniper/Arista optic compatibility mechanics,
100G/400G/800G optics selection, DWDM/ROADM/WSS architecture, fiber standards,
coherent pluggables, AI cluster optics, carrier timing, EEPROM programming,
market pricing 2026, hyperscale procurement, transceiver failure analysis, and more.

2026-04-07 08:59:16 +02:00


The Standard Stack: Why 400G SR4 at the ToR

For GPU-to-ToR (Top-of-Rack) connectivity, 400G SR4 over OM4 multimode fiber has emerged as the near-universal choice in 2025–2026 deployments, and the reasons are worth stating explicitly rather than accepting as given.

GPU servers connect to the network through NVIDIA ConnectX-7 or ConnectX-8 NICs (InfiniBand/Ethernet dual-mode) or Broadcom Thor-2 NICs. These NICs use QSFP-DD or OSFP host connectors, and at the 400G generation, 400G SR4 covers the server-to-ToR distance in any realistic rack configuration: its rated reach on OM4 is 100 m, while a run from server NIC to ToR switch is typically under 10 m.

The price of 400G SR4 from compatible vendors dropped to $150–250 per module in 2025. A 64-GPU training cluster at 400G per GPU requires 128 modules (64 server-side, 64 switch-side), so the total NIC-to-ToR optics cost is roughly $19,000–32,000, a small fraction of the overall cluster cost when each H100 GPU costs $30,000–40,000.
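The arithmetic behind that estimate can be sketched as follows. This is a minimal illustration using only the figures quoted in the paragraph above; the function name `nic_to_tor_optics_cost` is hypothetical, and the exact bounds come out at $19,200 and $32,000 before rounding.

```python
def nic_to_tor_optics_cost(gpus, price_low, price_high, modules_per_link=2):
    """Return (module count, low cost, high cost) for GPU-to-ToR optics.

    Assumes one 400G link per GPU and one module at each end of the link
    (server side and switch side), as described in the text.
    """
    modules = gpus * modules_per_link
    return modules, modules * price_low, modules * price_high


modules, low, high = nic_to_tor_optics_cost(64, 150, 250)
print(f"{modules} modules, ${low:,} to ${high:,}")
```

Changing `gpus` scales the estimate linearly, which is why the per-module price matters far more at cluster sizes in the thousands.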

Active Optical Cables (AOCs) at 400G are an alternative for fixed-length runs: an AOC integrates the transceivers into the cable ends, eliminating the separable optical connector at each transceiver. AOCs are slightly cheaper than transceiver-plus-passive-cable for the same length, but they're not field-repairable if one end fails. For short in-rack runs to ToR switches in production clusters, the preference has shifted toward passive direct-attach copper (DAC) at 1–3 m (no optical components, lowest latency, lowest cost) and 400G SR4 active optics for runs beyond 3–5 m.

When DR4 Makes Sense for Spine

The spine layer in an AI cluster (the switches that interconnect the ToR switches in the leaf-spine fabric) typically uses single-mode optics because inter-rack cabling distances can exceed the 100 m reach of SR4 on OM4 multimode fiber.

400G DR4 (single-mode, 4 lanes at 100G PAM4, up to 500 m on OS2) is the standard spine optic for medium to large clusters. The 500 m reach covers any reasonable datacenter floor layout and many runs between adjacent buildings. DR4 is a parallel single-mode interface in the style of PSM4 (4 transmit and 4 receive fibers in an MPO-12 connector), which means the fiber infrastructure between ToR and spine switches uses 8-fiber MPO trunk cables.

FR4 (single-mode, 4-wavelength CWDM, up to 2 km) is an option for clusters spread across wider geographies — campus interconnects or edge AI deployments where the compute nodes are distributed. FR4 costs roughly 40–60% more than DR4 for the same 400G capacity, so the additional cost needs to be justified by the actual distance requirement.
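The reach figures quoted so far lend themselves to a simple lookup. The sketch below is illustrative only: the table and the function `pick_medium` are hypothetical, the 3 m DAC cutoff is an assumption taken from the low end of the 3–5 m range mentioned earlier, and it ignores link budget, connector loss, and cost.

```python
# (maximum reach in metres, medium), in ascending order of reach.
# Values are the nominal figures quoted in the text, not a link-budget analysis.
REACH_TABLE = [
    (3, "400G passive DAC"),      # assumed cutoff; the text gives 1-3 m for DAC
    (100, "400G SR4 over OM4"),
    (500, "400G DR4 over OS2"),
    (2000, "400G FR4 over OS2"),
]


def pick_medium(distance_m):
    """Return the shortest-reach medium that covers distance_m, or None."""
    if distance_m <= 0:
        raise ValueError("distance must be positive")
    for max_reach_m, medium in REACH_TABLE:
        if distance_m <= max_reach_m:
            return medium
    return None  # beyond 2 km: outside the optics discussed here


for d in (2, 8, 300, 1500):
    print(d, "m ->", pick_medium(d))
```

Picking the shortest-reach option that still covers the run matches the cost ordering in the text, where each step up in reach costs more per port.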

For clusters using all-NVLink (NVSwitch-based all-to-all connectivity for training), the GPU-to-NVSwitch fabric is handled by NVIDIA's proprietary NVLink cables, not standard Ethernet optics. The Ethernet fabric in these configurations handles the "north-south" traffic (storage, user connections, parameter servers) rather than the all-reduce gradient traffic that dominates AI training bandwidth. The optics requirements for the management/external fabric are therefore less demanding than for the training fabric.

InfiniBand vs. Ethernet from an Optics Perspective

The InfiniBand versus Ethernet debate for AI cluster networking involves many considerations — latency, software stack, operational complexity — but from a pure optics perspective, the differences are modest.

HDR InfiniBand (200G) uses QSFP56, either as a single 200G port or split into 2x100G (HDR100); 400G per port arrives with NDR. The optics for InfiniBand at these speeds are physically identical to Ethernet optics (same form factors, same fiber types, same wavelengths). The distinction is in how they're programmed: an InfiniBand HCA uses the same SR4 optic as an Ethernet NIC, but the EEPROM may declare the module as InfiniBand-capable via the InfiniBand-related fields of its management memory map (SFF-8636 for QSFP56-class modules; QSFP-DD and OSFP modules use CMIS instead).

NDR InfiniBand (400G) and XDR InfiniBand (800G) use OSFP or QSFP-DD form factors. The physical optics market has largely converged for both protocols at these speeds.

The practical OpEx difference: InfiniBand switches (Mellanox/NVIDIA QM9700, for example) are more restrictive about optic compatibility than Ethernet switches. NVIDIA requires Mellanox-qualified or NVIDIA-tested optics for supported configurations, and the list of approved compatible vendors is shorter than for Ethernet. Engineers planning InfiniBand-based clusters should verify optic compatibility against the specific switch model before procurement.

What 800G Changes at the Rack Level

800G is starting to appear in production AI clusters, primarily in hyperscale training deployments. The transition from 400G to 800G at the ToR level has specific fiber infrastructure implications.

800G SR8 requires one MPO-16 or two MPO-12 connectors per port, compared to 400G SR4's single MPO-12. In a fully wired 64-port 800G ToR switch, the fiber count entering the switch increases proportionally: a 400G ToR switch with 64 ports requires 64 MPO-12 connectors (8 active fibers each, 512 in total), while the same chassis running 800G SR8 requires 128 MPO-12 or 64 MPO-16 connectors (1,024 active fibers). This doubles the fiber density at the top of the rack and requires pre-wiring the floor with 16-fiber (MPO-16) trunks rather than 12-fiber (MPO-12) trunks.
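The connector and fiber counts above can be reproduced with a few lines. This is a sketch under stated assumptions: the `OPTICS` table and `tor_fiber_plan` are hypothetical names, and the active-fiber counts assume one fiber per lane per direction (4+4 for SR4, 8+8 for SR8).

```python
# Per-port fiber requirements. "connectors" maps a connector type to how many
# of that connector one port needs.
OPTICS = {
    "400G-SR4": {"active_fibers": 8, "connectors": {"MPO-12": 1}},
    "800G-SR8": {"active_fibers": 16, "connectors": {"MPO-16": 1, "MPO-12": 2}},
}


def tor_fiber_plan(ports, optic, connector):
    """Return connector and active-fiber totals for a fully wired switch."""
    spec = OPTICS[optic]
    return {
        "connectors": ports * spec["connectors"][connector],
        "active_fibers": ports * spec["active_fibers"],
    }


print(tor_fiber_plan(64, "400G-SR4", "MPO-12"))
print(tor_fiber_plan(64, "800G-SR8", "MPO-12"))
print(tor_fiber_plan(64, "800G-SR8", "MPO-16"))
```

Note that the active fiber count doubles regardless of connector choice; MPO-16 only halves the number of physical connectors at the faceplate.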

For clusters being built from scratch in 2026, designing for 800G fiber infrastructure while deploying 400G today is the correct approach. The incremental cost of running 16-fiber trunk cables rather than 12-fiber trunk cables is modest at installation time, and avoiding a complete re-cabling when upgrading to 800G more than repays the upfront investment.

The GPU NIC side of 800G is also advancing. NVIDIA's B100 and B200 GPU servers use ConnectX-8 NICs at 400G Ethernet per port (two ports per NIC = 800G per GPU), not single 800G ports. The GPU fabric bandwidth is achieved by port bonding rather than single 800G pluggables, which means the current generation of AI servers still maps well to 400G switch ports and 400G SR4 optics.

Practical Procurement Guidance

For AI cluster procurement in 2026, the practical recommendations are straightforward: use 400G SR4 OM4 for all server-to-ToR connections, use 400G DR4 OS2 for ToR-to-spine connections, plan the fiber plant for 16-fiber (MPO-16) capacity even if deploying optics on 12-fiber (MPO-12) connectors today, and verify InfiniBand optic compatibility against the switch model if using InfiniBand fabric.

The compatible transceiver market is well-established for 400G SR4 and DR4. Multiple vendors (Innolight, Eoptolink, Coherent, Flexoptix) supply these in large quantities with competitive pricing and good technical documentation. Total optic cost for a 1,000-GPU cluster in a standard leaf-spine architecture is approximately $400,000–600,000. Budget accordingly, and verify compatible-vendor pricing before locking into a BOM with only OEM optics.
