Comparative Analysis of Scale-Out RoCE Network Traffic Patterns and Loads in Training Large Language Models#
As large-scale AI workloads continue to grow, understanding network behavior becomes critical. This blog analyzes scale-out RoCE traffic patterns and loads in large language model training, helping you uncover bottlenecks, improve performance, and design more scalable ROCm-based systems.
The proliferation of large language models (LLMs) has transformed the landscape of artificial intelligence, enabling sophisticated applications in natural language processing, multimodal reasoning, and generative tasks. Training these models at unprecedented scales demands distributed computing environments that span tens to hundreds of thousands of GPUs, which high-performance networks interconnect to manage the very large data exchanges that synchronization and computation require. Remote direct memory access over converged Ethernet (RoCE) has emerged as a pivotal technology in this domain, providing RDMA functionalities over Ethernet infrastructures, which offer a balance of performance, scalability, and cost-effectiveness compared to traditional InfiniBand (IB) networks. This blog delivers an exhaustive comparison of the scale-out architectures, RoCE network traffic patterns, and resultant loads during the training phases of four flagship LLMs: OpenAI’s ChatGPT 4.0 (built on GPT-4), Meta’s Llama 3, DeepSeek AI’s DeepSeek-V2, and xAI’s Grok 4.0.
Synthesizing insights from a broad array of technical reports, peer-reviewed papers, engineering blogs, and industry analyses up to November 11, 2025, this expanded analysis scrutinizes essential dimensions including cluster topologies, GPU densities, parallelism paradigms, communication primitives, traffic burstiness, flow entropy, bandwidth consumption, latency constraints, jitter profiles, and resilience mechanisms. While GPT-4 and DeepSeek-V2 predominantly leveraged InfiniBand for its low-latency guarantees, Llama 3 adopted a hybrid RoCE and IB approach, and Grok 4.0 fully embraced Ethernet-based RoCE through NVIDIA’s Spectrum-X platform. Despite these variances, the core traffic patterns—which exhibit bursty elephant flows from collective operations like AllReduce and All-to-All—remain consistent across models, rooted in shared distributed training frameworks such as PyTorch, JAX, and NCCL.
Key findings underscore that hyperscale deployments, as seen in Grok 4.0’s 200,000-GPU Colossus cluster, engender network loads orders of magnitude higher than smaller-scale efforts like DeepSeek-V2 (estimated 2,000-50,000 GPUs), with per-server bandwidth peaking at terabits per second and aggregate fabric demands in the petabits range. RoCE’s integration in Meta and xAI’s infrastructures highlights its economic advantages—potentially reducing costs by 30-50% over IB—while exposing exigencies for advanced congestion control, adaptive routing, and telemetry to mitigate low-entropy hashing and incast phenomena. This edition augments the discourse with granular quantitative comparisons, extended discussions on optimization strategies, historical context, and forward-looking recommendations, supported by an enriched set of citations and references for scholarly rigor.
Understanding Network Demands in Distributed LLM Training#
The trajectory of LLM development has been marked by exponential growth in model parameters, dataset volumes, and computational requirements, necessitating scale-out paradigms that distribute workloads across vast GPU arrays. From the early days of transformer-based models like BERT in 2018 to contemporary behemoths exceeding a trillion parameters, training processes have evolved to incorporate multi-faceted parallelism, generating network traffic that tests the limits of modern interconnects. RoCE, evolving from its inception in 2010 as an extension of RDMA protocols, has matured into a cornerstone for AI infrastructures, facilitating zero-copy data transfers with minimal CPU involvement over ubiquitous Ethernet fabrics.
This post builds upon prior analyses by delving deeper into the intricacies of RoCE deployment in LLM training, contrasting it with InfiniBand where applicable. We explore how traffic patterns—dominated by periodic, high-volume bursts—manifest differently across models due to architectural nuances, such as MoE gating in GPT-4 and DeepSeek-V2 versus dense transformers in Llama 3. Historical context reveals that early LLM trainings, like GPT-3 in 2020, relied heavily on IB for its lossless guarantees, but economic pressures and Ethernet advancements have propelled RoCE adoption, as evidenced by Meta’s 2024 disclosures. Our scope encompasses:
Scale-out foundations: Detailed cluster metrics, including power consumption and fault rates.
Traffic dynamics: In-depth flow characterizations, entropy analyses, and primitive breakdowns.
Load quantification: Bandwidth, latency, utilization, and MFU metrics with comparative benchmarks.
Operational challenges: Congestion, scaling bottlenecks, and RoCE-specific mitigations.
Comparative frameworks: Tabular syntheses and qualitative insights.
Evolutionary perspectives: From legacy HPC to AI-centric fabrics.
Prospective optimizations: Emerging technologies like in-network computing.
NB: Sources span authoritative outlets, ensuring a balanced view as of November 11, 2025.
Background on Distributed LLM Training and RoCE#
Evolution of Distributed Training Paradigms#
The genesis of distributed LLM training traces to the need to surmount memory and compute constraints in single-node setups. Data parallelism, pioneered in frameworks like TensorFlow circa 2015, replicates models across devices, synchronizing through ring-based AllReduce algorithms. Tensor parallelism, refined in Megatron-LM (2019), shards tensors to enable larger models, entailing AllGather/ReduceScatter operations that scale quadratically with degree. Pipeline parallelism, as in GPipe (2018), stages layers to overlap computations, introducing point-to-point traffic with bubble inefficiencies. Switch Transformers (2021) popularized MoE architectures, adding All-to-All for expert routing, which amplifies burstiness but reduces active parameters.
In practice, hybrids like ZeRO (2020) optimize memory through sharding, while libraries such as NCCL (2017) implement hierarchical collectives to minimize global communications. Traffic bursts align with mini-batch cycles: forward passes incur lighter loads, backward passes spike with gradients, and MoE adds routing overheads—up to 20% of cycles in models like DeepSeek-V2. Historical scaling, from GPT-2’s modest clusters to GPT-4’s 25,000 GPUs, has intensified these patterns, with inter-node data volumes reaching petabytes per run.
RoCE Protocol: Technical Underpinnings and AI Adaptations#
The InfiniBand Trade Association (IBTA) standardized RoCEv2 in 2014, encapsulating RDMA in UDP/IP and leveraging Ethernet’s ecosystem for scalability. Core features include PFC for lossless queues, ECN for congestion signaling, and QP multiplexing to handle low-entropy flows. In AI contexts, RoCE supports multi-rail configurations, where servers combine multiple NICs (e.g., 9 x 400G in Grok’s Colossus) for aggregate throughputs exceeding 3 Tb/s.
Compared to IB, RoCE offers 20-50% cost savings through commodity switches but demands tuning: PFC headroom to avert pauses, UDF hashing for ECMP balance, and DCQCN variants for ECN. LLM-specific adaptations include receiver-initiated RDMA to offload senders and in-network telemetry for proactive load balancing, as deployed in Meta’s RoCE clusters. Traffic archetypes—elephants (>100 MB tensors), bursts (millisecond-scale), low entropy (fixed graphs)—exacerbate incasts, where many-to-one flows overwhelm receivers, necessitating deep buffers (e.g., 100 MB per port).
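To make the incast pressure concrete, the following is a minimal sketch of how quickly a port buffer fills during a many-to-one burst. It assumes every sender transmits at full line rate and the congested egress port drains at that same rate; the function name and the example sender counts are illustrative and do not come from any of the deployments discussed here.

```python
def incast_fill_time_us(senders: int, link_gbps: float, buffer_mb: float) -> float:
    """Microseconds until an egress-port buffer fills during an N-to-1 incast.

    Assumption: all senders run at line rate and the egress port drains at
    line rate, so the queue grows at (senders - 1) * line rate.
    """
    if senders < 2:
        return float("inf")  # one sender cannot outrun the drain rate
    excess_bytes_per_s = (senders - 1) * link_gbps * 1e9 / 8
    return buffer_mb * 1e6 / excess_bytes_per_s * 1e6


if __name__ == "__main__":
    # 400G links and a 100 MB per-port buffer, as in the example above.
    for n in (2, 8, 32):
        print(f"{n:>2} senders: {incast_fill_time_us(n, 400, 100):8.1f} us")
```

Under these assumptions even a deep 100 MB buffer lasts well under a millisecond once eight or more senders converge, which is why buffer depth alone cannot replace congestion control.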
Quantifying Network Loads in LLM Contexts#
Loads escalate with model scale: a 1T-parameter model might transfer 1-2 TB per iteration across 10,000 GPUs, with peaks at 80% of link capacity. Keeping latency low (<5 μs intra-rack, <20 μs inter-rack) sustains high MFU (Model FLOPs Utilization); jitter from congestion can halve utilization. Historical benchmarks show IB achieving 95% efficiency in HPC, while RoCE, once tuned, reaches about 90% in AI workloads. Failures, occurring every few hours in large clusters, trigger resync floods, underscoring the need for fault-tolerant designs.
Model Overviews and Training Infrastructures#
ChatGPT 4.0 (GPT-4)#
Architectural intricacies: GPT-4, a 1.76T-parameter MoE model with 8 experts (~220B each), integrates vision through multimodal embeddings. OpenAI trained the model on trillions of tokens from web, code, and images. Its MoE gating employs top-2 routing, balancing load across experts while minimizing activation sparsity.
Scale-out deployment: Azure’s supercluster ran the training with ~25,000 A100 GPUs over 90-100 days, equating to ~2.15e25 FLOPs. Power draw: ~5-10 MW, with 8-GPU nodes in dense racks.
Network fabric: IB (200G HDR) in a hierarchical Clos, offering ~200 Gb/s per link and sub-μs latencies. No RoCE; IB’s credit-based flow control preferred for stability.
Parallelism details: DP across replicas, TP for tensors (degree 8-16), PP (4-8 stages), MoE All-to-All. NCCL hierarchies reduce global traffic by 50%.
Llama 3#
Architectural depth: 405B dense parameters. Meta pre-trained the model on 15T tokens with GQA and scaled RMSNorm for efficiency.
Scale-out deployment: Dual 24,576-H100 clusters (~49,000 total), months-long runs at ~400 TFLOPS/GPU, ~20 MW power.
Network fabric: Hybrid: 400G RoCE (Ethernet) and IB (Quantum-2). RoCE uses 2-stage Clos, 1:2 oversubscription, QP scaling for entropy.
Parallelism details: ZeRO DP, PP/TP hybrids; topology-aware rank assignment cuts cross-zone traffic 30%.
DeepSeek-V2#
Architectural innovations: 236B MoE (16 experts, 2-3 active), MLA for KV compression. DeepSeek trained the model on 8.1T tokens.
Scale-out deployment: 2,000-50,000 H800 GPUs, weeks-long, ~300K GPU-hours/trillion tokens, cost ~$5-6M.
Network fabric: IB fat-tree, 400G links; no RoCE, focusing on reliability.
Parallelism details: 16-way PP, 8-way expert, ZeRO-1 DP; device-limited routing bounds All-to-All.
Grok 4.0#
Architectural focus: ~1.7T parameters, multimodal with tool-use, RLHF on proprietary data.
Scale-out deployment: Colossus with 200,000 H100s, 246M GPU-hours, 99% uptime, gigawatt power.
Network fabric: Spectrum-X RoCE Ethernet, 3.6 Tb/s/server, flat multi-rail with adaptive routing.
Parallelism details: Hyperscale DP/TP/PP/MoE; JAX for resilience.
Comparative Analysis of Scale-Out Architectures#
Scale-out metrics reveal divergent strategies: Grok’s 200,000-GPU cluster prioritizes sheer volume, while DeepSeek’s much smaller cluster prioritizes efficiency.
Network Topologies: Llama 3 and Grok 4.0 use Clos topologies in their RoCE-based Ethernet networks, while GPT-4 and DeepSeek-V2 employ fat-tree topologies with InfiniBand.
Node densities remain uniform, but Grok’s multi-rail mitigates intra-rack bottlenecks.
| Model | GPUs | Power (MW) | Topology | Oversubscription | Cost Estimate |
|---|---|---|---|---|---|
| GPT-4 | 25,000 A100 | 5-10 | Tree IB | Low | $100M+ |
| Llama 3 | 49,000 H100 | 20 | Tree Hybrid | 1:2 | $500M+ |
| DeepSeek-V2 | 2,000-50,000 H800 | 5-15 | Tree IB | Minimal | $5-6M |
| Grok 4.0 | 200,000 H100 | 1000+ | Rail Hybrid | None | $Billion+ |
Historical evolution: Cluster sizes have grown from roughly 1,000 GPUs in 2020 to Grok’s 200,000-GPU hyperscale deployment in 2025.
RoCE Network Traffic Patterns#
LLM training generates highly structured, periodic network traffic whose characteristics are largely determined by the parallelism strategy and model architecture. Across all four models examined, elephant flows—transfers exceeding 100 MB representing gradient tensors or activation shards—account for approximately 80% of total bytes on the fabric, while the remaining 20% consists of small control and synchronization messages. These large flows arrive in tight, millisecond-scale bursts synchronized to mini-batch boundaries: light traffic during the forward pass, a sharp gradient spike during the backward pass, and, in MoE models, a second burst from expert routing.
Burstiness and flow entropy. A persistent challenge in RoCE deployments is low flow entropy: because computation graphs are static, the same source-destination GPU pairs communicate over the same QPs throughout training, so ECMP hashes them consistently onto the same switch paths. The resulting hash collisions can leave 40–60% of available paths underutilized. Meta’s Llama 3 deployment addressed this by scaling QP counts per connection, improving effective path diversity by approximately 40%. Grok 4.0’s Spectrum-X fabric uses adaptive per-packet routing to sidestep the ECMP entropy problem at the cost of increased fabric complexity.
Communication primitive breakdown. The dominant collectives are:
AllReduce (data and tensor parallelism): Ring- or tree-based gradient aggregation, generating sustained bidirectional traffic whose aggregate volume grows in proportion to the number of participating GPUs. In Llama 3, AllReduce across 24,576-GPU partitions produces 400 Gb/s link-level peaks.
AllGather / ReduceScatter (tensor parallelism sharding): Symmetric to AllReduce but decomposed for ZeRO-style optimizer sharding; prevalent in Llama 3 and DeepSeek-V2.
All-to-All (MoE expert routing): Each GPU sends and receives a token subset from every other GPU in the expert group. In GPT-4 and DeepSeek-V2, this contributes up to 20% of per-iteration traffic with high burstiness and unpredictable incast patterns.
Point-to-Point (pipeline parallelism): Activation and gradient tensors forwarded between pipeline stages. Traffic is bounded and directional, making it the most ECMP-friendly pattern.
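The per-GPU cost of the two heaviest primitives in this list can be estimated with the standard bandwidth-optimal formulas. The sketch below assumes a ring AllReduce built from a reduce-scatter followed by an all-gather, and an All-to-All in which each GPU splits its payload evenly across the group; the function names and example sizes are illustrative only.

```python
def ring_allreduce_bytes_per_gpu(tensor_bytes: float, n_gpus: int) -> float:
    """Bytes each GPU sends in a ring AllReduce.

    Reduce-scatter and all-gather each move (n - 1) / n of the tensor,
    so the total is 2 * (n - 1) / n * tensor_bytes.
    """
    if n_gpus < 2:
        return 0.0
    return 2 * (n_gpus - 1) / n_gpus * tensor_bytes


def all_to_all_bytes_per_gpu(payload_bytes: float, n_gpus: int) -> float:
    """Bytes each GPU sends when its payload is split evenly over n GPUs
    (the share destined for itself never touches the network)."""
    if n_gpus < 2:
        return 0.0
    return (n_gpus - 1) / n_gpus * payload_bytes


if __name__ == "__main__":
    one_gb = 1e9
    for n in (8, 1024):
        sent = ring_allreduce_bytes_per_gpu(one_gb, n)
        print(f"ring AllReduce, {n} GPUs: {sent / 1e9:.3f} GB sent per GPU")
```

Per-GPU volume approaches twice the tensor size as the group grows, so it is the aggregate fabric load, not the per-link load, that scales with GPU count.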
DeepSeek-V2 overlaps All-to-All with computation through device-limited expert routing, substantially reducing the exposed communication window. Grok 4.0 relies on JAX’s runtime to pipeline primitives across its 200,000-GPU fabric, where even modest per-GPU inefficiencies multiply to petabit-scale effects.
Network Loads and Performance Metrics#
Bandwidth#
Per-link and per-server bandwidth requirements scale dramatically with cluster size and model architecture. GPT-4’s InfiniBand HDR links sustain approximately 200 Gb/s per link under AllReduce load, with inter-node traffic bounded by the 25,000-GPU cluster’s hierarchical Clos topology. Llama 3’s hybrid RoCE/IB fabric sees 300 Gb/s burst peaks on 400G Ethernet links during ReduceScatter phases, with aggregate fabric demand reaching hundreds of petabits per day across its dual 24,576-GPU clusters. DeepSeek-V2 operates at similar per-link rates (~200 Gb/s) on a much smaller footprint, keeping total fabric load 5–10× lower than Llama 3. Grok 4.0’s Colossus cluster is in a class of its own: nine 400G NICs per server yield 3.6 Tb/s per host, with aggregate fabric demand in the petabits-per-second range at full utilization.
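The Colossus figures quoted above follow from simple arithmetic, sketched here. The NIC count and speed come from the text; the 8-GPUs-per-server density is an assumption carried over from the node description elsewhere in this blog, and the result is an upper bound on host injection bandwidth rather than a measured load.

```python
def server_bandwidth_tbps(nics_per_server: int, nic_gbps: float) -> float:
    """Aggregate NIC bandwidth of one server in Tb/s."""
    return nics_per_server * nic_gbps / 1000


def fabric_injection_pbps(total_gpus: int, gpus_per_server: int,
                          nics_per_server: int, nic_gbps: float) -> float:
    """Upper bound on total host injection bandwidth in Pb/s."""
    servers = total_gpus // gpus_per_server
    return servers * server_bandwidth_tbps(nics_per_server, nic_gbps) / 1000


if __name__ == "__main__":
    per_server = server_bandwidth_tbps(9, 400)
    # gpus_per_server=8 is an assumption, not a published Colossus figure.
    fabric = fabric_injection_pbps(200_000, 8, 9, 400)
    print(f"per server: {per_server} Tb/s, fabric upper bound: {fabric} Pb/s")
```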
Latency and Jitter#
Collective operation latency directly determines Model FLOP Utilization (MFU). Intra-rack latency must remain below 5 μs and inter-rack below 20 μs to avoid stalling GPU pipelines. Properly tuned RoCE (DCQCN congestion control, PFC headroom calibrated per-queue, ECN thresholds set below buffer saturation) achieves these targets and matches InfiniBand at 90–95% efficiency. Jitter is the more insidious problem: a transient congestion event causing 50–100 μs additional latency on a single collective can stall an entire pipeline stage across thousands of GPUs, effectively halving MFU for that iteration.
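A toy model helps show how per-collective jitter accumulates. The sketch below assumes an iteration with a fixed compute time and a number of sequential, non-overlapped collectives, each paying a base latency plus jitter. All parameter values are hypothetical and chosen only to illustrate the effect.

```python
def compute_efficiency(compute_ms: float, collectives: int,
                       base_latency_us: float, jitter_us: float = 0.0) -> float:
    """Fraction of an iteration spent computing.

    Assumption: every collective is fully exposed (no overlap with compute),
    so exposed time is collectives * (base latency + jitter).
    """
    exposed_ms = collectives * (base_latency_us + jitter_us) / 1000
    return compute_ms / (compute_ms + exposed_ms)


if __name__ == "__main__":
    # Hypothetical iteration: 100 ms of compute, 1,000 exposed collectives.
    clean = compute_efficiency(100, 1000, base_latency_us=20)
    noisy = compute_efficiency(100, 1000, base_latency_us=20, jitter_us=100)
    print(f"no jitter: {clean:.2f}, with 100 us jitter: {noisy:.2f}")
```

In this hypothetical setup, 100 μs of added latency per collective cuts the compute fraction from roughly 0.83 to roughly 0.45, which is the scale of loss described above. Real frameworks overlap much of this communication, so the model is a worst-case bound.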
Model FLOP Utilization#
Reported MFU values across the four deployments range from 38–50% of theoretical peak:
| Model | MFU (reported) | Primary efficiency constraint |
|---|---|---|
| GPT-4 | ~38–45% | MoE routing overhead, IB credit latency |
| Llama 3 | ~45–50% | Cross-zone RoCE/IB boundary overhead |
| DeepSeek-V2 | ~40–45% | Device-limited All-to-All, smaller cluster |
| Grok 4.0 | ~38–42% | Hyperscale stragglers, fault recovery |
Hardware faults—occurring every few hours in clusters of tens of thousands of GPUs—trigger checkpoint reloads and gradient resynchronization floods that temporarily saturate the fabric and spike utilization well above steady-state levels.
Challenges and Optimizations in RoCE Environments#
Congestion Control#
Incast is the dominant failure mode in RoCE LLM fabrics: AllReduce and All-to-All drive many-to-one traffic patterns at switch aggregation points, filling shallow buffers in microseconds and triggering PFC pause frames that propagate backward through the fabric. DCQCN (Data Center Quantized Congestion Notification) mitigates this by using ECN marks to ramp down sender rates before buffers saturate, but tuning is non-trivial—ECN thresholds too high delay reaction; too low cause unnecessary rate reductions. Meta’s SIGCOMM 2024 paper documents per-queue PFC headroom calibration as critical for Llama 3’s RoCE stability. In-network reduction offloads collective operations onto switch ASICs, reducing the volume of traffic reaching the host by up to 5×, and represents the most impactful single optimization available in supported hardware.
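The sender-side reaction in DCQCN can be sketched in a few lines. This is a simplified model of the reaction-point rules from the published DCQCN design: it covers the rate cut, the alpha update, and fast recovery, and omits the additive and hyper-increase phases and all timers. The class name and the `g = 1/256` default are illustrative; production NIC firmware exposes its own parameters.

```python
from dataclasses import dataclass


@dataclass
class DcqcnSender:
    """Simplified DCQCN reaction point (no timers, no additive increase)."""
    line_rate_gbps: float
    g: float = 1 / 256        # alpha gain; assumed value, tune per deployment
    alpha: float = 1.0        # congestion estimate, starts pessimistic
    current_rate: float = 0.0
    target_rate: float = 0.0

    def __post_init__(self) -> None:
        self.current_rate = self.line_rate_gbps
        self.target_rate = self.line_rate_gbps

    def on_cnp(self) -> None:
        """Congestion notification received: remember the rate, then cut it."""
        self.target_rate = self.current_rate
        self.current_rate *= 1 - self.alpha / 2
        self.alpha = (1 - self.g) * self.alpha + self.g

    def on_quiet_period(self) -> None:
        """No CNP during the alpha timer window: decay the estimate."""
        self.alpha *= 1 - self.g

    def on_fast_recovery(self) -> None:
        """Rate-increase event: move halfway back toward the target rate."""
        self.current_rate = (self.target_rate + self.current_rate) / 2


if __name__ == "__main__":
    sender = DcqcnSender(line_rate_gbps=400)
    sender.on_cnp()
    print("after CNP:", sender.current_rate)
    sender.on_fast_recovery()
    print("after fast recovery:", sender.current_rate)
```

The ECN marking threshold on the switch decides how often `on_cnp` fires, which is why threshold tuning and sender behavior have to be considered together.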
ECMP and Path Diversity#
Standard ECMP hashes on a five-tuple (source IP, destination IP, protocol, source port, destination port). Because RoCEv2 always uses UDP destination port 4791 and each QP keeps the same endpoints and source port throughout training, flows consistently hash to the same paths, leaving parallel paths idle. Solutions include:
QP proliferation: Opening multiple QPs per peer (as in Llama 3) spreads flows across more hash buckets.
User-defined field (UDF) hashing: Extending the ECMP hash to include RoCE/IB-specific fields such as Queue Pair Number (QPN) or Destination QKey.
Adaptive routing: Per-packet or per-flowlet rerouting based on real-time congestion signals, as deployed in Grok 4.0’s Spectrum-X fabric. This eliminates entropy dependence at the cost of potential packet reordering, which RoCEv2 receivers must handle through reorder buffers.
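The effect of QP proliferation on path diversity can be illustrated with a small simulation. The sketch below uses CRC32 as a stand-in for a switch hash (real ASICs use vendor-specific functions), and the IP addresses and source-port numbering are hypothetical. Only the RoCEv2 UDP destination port, 4791, is a fixed protocol value.

```python
import zlib

ROCE_V2_UDP_PORT = 4791  # fixed UDP destination port for RoCEv2


def ecmp_path(src_ip: str, dst_ip: str, src_port: int, n_paths: int) -> int:
    """Pick an uplink from a five-tuple hash (CRC32 is a stand-in hash)."""
    key = f"{src_ip}|{dst_ip}|17|{src_port}|{ROCE_V2_UDP_PORT}".encode()
    return zlib.crc32(key) % n_paths


def paths_used(n_peers: int, qps_per_peer: int, n_paths: int) -> int:
    """Count distinct uplinks hit by one host talking to n_peers."""
    used = set()
    for peer in range(n_peers):
        for qp in range(qps_per_peer):
            src_port = 49152 + qp  # hypothetical: one UDP source port per QP
            used.add(ecmp_path("10.0.0.1", f"10.0.1.{peer + 1}", src_port, n_paths))
    return len(used)


if __name__ == "__main__":
    for qps in (1, 4, 16):
        print(f"{qps:>2} QPs per peer -> {paths_used(4, qps, 16)} of 16 paths used")
```

With one QP per peer, a host with four peers can never use more than four uplinks regardless of how many exist. Adding QPs raises that ceiling, although hash collisions mean the gain is statistical rather than guaranteed.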
Scaling Bottlenecks#
Cross-zone traffic—where tensor-parallel or pipeline-parallel communication must traverse fat-tree spine layers—disproportionately loads high-radix switches. Topology-aware rank assignment, which places communicating GPU pairs within the same rack or pod where possible, reduces cross-zone traffic by up to 30% in Llama 3’s deployment. At Grok’s 200,000-GPU scale, even a 1% load imbalance translates to terabits of misrouted traffic, making telemetry-driven scheduler feedback loops essential.
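As a toy illustration of why rank placement matters, the sketch below compares two placements of tensor-parallel groups across racks and counts how many peer pairs cross a rack boundary. The placement functions and cluster sizes are hypothetical and far simpler than a production scheduler.

```python
from itertools import combinations


def tp_groups(n_ranks: int, tp_degree: int) -> list[list[int]]:
    """Consecutive ranks form one tensor-parallel group."""
    return [list(range(s, s + tp_degree)) for s in range(0, n_ranks, tp_degree)]


def striped_placement(n_ranks: int, n_racks: int) -> dict[int, int]:
    """Topology-unaware: ranks are dealt round-robin across racks."""
    return {r: r % n_racks for r in range(n_ranks)}


def packed_placement(n_ranks: int, gpus_per_rack: int) -> dict[int, int]:
    """Topology-aware: consecutive ranks fill one rack before the next."""
    return {r: r // gpus_per_rack for r in range(n_ranks)}


def cross_rack_fraction(groups: list[list[int]], rank_to_rack: dict[int, int]) -> float:
    """Fraction of intra-group peer pairs that span two racks."""
    total = cross = 0
    for group in groups:
        for a, b in combinations(group, 2):
            total += 1
            cross += rank_to_rack[a] != rank_to_rack[b]
    return cross / total if total else 0.0


if __name__ == "__main__":
    groups = tp_groups(n_ranks=64, tp_degree=8)
    print("striped:", cross_rack_fraction(groups, striped_placement(64, 8)))
    print("packed: ", cross_rack_fraction(groups, packed_placement(64, 8)))
```

The toy case is deliberately extreme. Real jobs mix data, tensor, and pipeline groups, so the achievable reduction is partial, in line with the figure of up to 30% cited above.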
RoCE versus InfiniBand: Operational Trade-offs#
InfiniBand’s credit-based flow control provides inherently lossless, low-jitter delivery without PFC-induced pause propagation, which is why GPT-4 and DeepSeek-V2 retained it despite higher cost. RoCE’s advantages—30–50% lower hardware cost, commodity switch ecosystem, and integration with standard IP tooling—are compelling at scale, but they require operational investment that IB abstracts away: explicit DCQCN tuning, PFC storm prevention, and UDF configuration. Hybrid deployments (Llama 3) balance this by using IB for latency-sensitive intra-rack collectives and RoCE for inter-rack bulk transfers.
Implications for AMD Instinct GPU Deployments#
The network traffic patterns and optimization strategies documented across these four deployments are directly applicable to clusters built on AMD Instinct GPUs running the ROCm software stack. AMD Instinct MI300X GPUs, with 192 GB of HBM3 memory per accelerator and interconnects supporting up to 800 Gb/s per node through AMD Infinity Fabric and RoCE-capable NICs, are deployed in scale-out configurations that generate the same classes of traffic analyzed here.
AllReduce and collective efficiency with ROCm. The ROCm Communication Collectives Library
(RCCL) implements ring- and tree-based AllReduce, AllGather, and ReduceScatter on AMD hardware,
and the rccl-tests benchmark suite can validate multi-node topologies. RCCL’s integration
with ROCm’s HIP runtime enables the same NCCL-compatible API surface that PyTorch and JAX use,
meaning frameworks validated on NVIDIA hardware migrate directly to AMD clusters. The elephant
flow and burst characteristics described for Llama 3-scale deployments apply equally to
MI300X-based clusters running equivalent workloads.
RoCE fabric considerations for Instinct clusters. AMD Instinct MI300X nodes typically pair
with 400G Ethernet NICs (e.g., Broadcom Thor2 or NVIDIA ConnectX-7) in multi-rail configurations.
The congestion control challenges—PFC storm risk, ECMP entropy collapse under static computation
graphs, and incast at AllReduce aggregation points—are hardware-agnostic and require the same
DCQCN tuning, QP proliferation, and topology-aware rank assignment described for Meta’s and xAI’s
deployments. AMD’s open ROCm toolchain makes per-collective profiling straightforward through
rocprof and rocprofiler-sdk, enabling operators to identify which collectives dominate
network load and target optimizations accordingly.
Design recommendations for AMD-based AI networks. Based on the cross-model analysis in this blog, operators deploying AMD Instinct clusters at scale should:
Enable UDF hashing on Ethernet switches to include QPN in the ECMP hash, preventing entropy collapse on RCCL’s fixed QP assignments.
Calibrate PFC headroom per queue following DCQCN guidelines, paying particular attention to the MI300X’s high-bandwidth HBM injection rates which can fill switch buffers faster than CPU-based servers.
Use topology-aware rank assignment in RCCL by setting NCCL_TOPO_FILE or equivalent ROCm topology descriptors to keep tensor-parallel groups within the same rack.
Profile AllReduce vs. All-to-All ratios for your specific model architecture; MoE models on MI300X clusters will exhibit the same All-to-All burstiness documented for GPT-4 and DeepSeek-V2, and may benefit from device-limited routing strategies analogous to DeepSeek’s approach.
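For the PFC headroom recommendation, a first-order estimate can be sketched as follows. This is a simplified rule of thumb, not a vendor formula: it counts the bytes still in flight after a pause frame is issued (round-trip propagation plus sender response time) and adds one maximum-size frame in each direction. The 5 ns/m propagation figure, the function name, and all example values are assumptions. Switch vendors publish their own per-platform calculations that should take precedence.

```python
def pfc_headroom_bytes(link_gbps: float, cable_m: float, mtu_bytes: int,
                       response_ns: float = 0.0, ns_per_m: float = 5.0) -> float:
    """First-order PFC headroom estimate for one lossless queue.

    Assumptions: ~5 ns/m propagation delay, one frame already being
    serialized at each end when the pause is generated and received.
    """
    bytes_per_ns = link_gbps / 8  # 1 Gb/s carries 0.125 bytes per ns
    in_flight = (2 * cable_m * ns_per_m + response_ns) * bytes_per_ns
    return in_flight + 2 * mtu_bytes


if __name__ == "__main__":
    # Hypothetical 400G link over 100 m of fiber with 9216-byte frames.
    print(f"{pfc_headroom_bytes(400, 100, 9216):,.0f} bytes")
```

Because the in-flight term scales with link speed, headroom sized for 100G links is too small at 400G over the same cable length.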
As AMD Instinct deployments scale to tens of thousands of GPUs, the lessons from Llama 3 and Grok 4.0’s RoCE operations—particularly adaptive routing and in-network telemetry—will become increasingly relevant to maintaining high MFU at cluster scale.
Comparative Insights, Historical Context, and Future Directions#
Comparative Synthesis#
The four models represent a spectrum of scale-cost-reliability trade-offs. GPT-4 and DeepSeek-V2 prioritized operational stability over cost, retaining InfiniBand despite its premium. Meta and xAI accepted RoCE’s operational complexity in exchange for significant cost savings and the ability to procure commodity Ethernet at hyperscale. DeepSeek-V2’s efficiency-first approach—achieving competitive model quality at 1/10th Grok’s infrastructure spend—demonstrates that network design choices are inseparable from model architecture choices: device-limited MoE routing was specifically designed to bound the All-to-All traffic that would otherwise make a smaller IB cluster infeasible.
Historical Context#
The shift from InfiniBand to RoCE in AI fabrics mirrors an earlier transition in HPC: InfiniBand dominated from roughly 2005–2018 because Ethernet lacked the congestion control primitives needed for RDMA. RoCEv2 standardization (IBTA, 2014) and the subsequent development of DCQCN made Ethernet-based RDMA viable, but it was the economics of hyperscale (Meta, Microsoft, and xAI each deploying tens of thousands of GPUs) that made the operational investment worthwhile. GPT-3 in 2020 trained on InfiniBand; Llama 3 in 2024 used a hybrid; Grok 4.0 in 2025 went fully Ethernet. The trajectory is clear, though InfiniBand retains a performance advantage in smaller, latency-sensitive deployments.
Future Directions#
Several emerging technologies are poised to reshape RoCE deployments:
Ultra Ethernet Consortium (UEC): UEC 1.0, released in June 2025 by the consortium founded in 2023, defines end-to-end congestion control, packet trimming, and multipathing primitives purpose-built for AI workloads, addressing the core limitations of RoCEv2 without abandoning Ethernet economics.
In-network computing: Switch-based AllReduce offload (as in NVIDIA’s SHARP or Broadcom’s equivalent) removes collective bottlenecks from the host data path. As collective volumes grow with model scale, in-network reduction will become a standard rather than an optional feature.
800G and beyond: Next-generation Ethernet at 800 Gb/s per port will double per-server bandwidth headroom, partially deferring the need for multi-rail configurations currently required at Grok’s scale.
Optical circuit switching: Reconfigurable optical fabrics can provide lossless, zero-queuing bandwidth between pod pairs during AllReduce bursts, complementing packet-switched fabrics for predictable collective traffic.
Summary#
The comparative analysis of RoCE network traffic patterns across GPT-4, Llama 3, DeepSeek-V2, and Grok 4.0 reveals both the universality of the underlying communication patterns and the divergence in how leading AI organizations have chosen to address them. AllReduce, All-to-All, and pipeline point-to-point flows dominate every deployment; the differences lie in scale, topology, and the operational sophistication applied to congestion control and path diversity.
Llama 3 and Grok 4.0 demonstrate that RoCE at hyperscale is operationally viable when paired with rigorous DCQCN tuning, topology-aware scheduling, and adaptive routing. DeepSeek-V2 shows that co-designing model architecture—specifically MoE routing locality—with network constraints can achieve competitive results at a fraction of the infrastructure investment. GPT-4’s retention of InfiniBand reflects a deliberate reliability-over-cost decision that remains valid for deployments where operational simplicity outweighs procurement savings.
For data center architects and AI infrastructure engineers, the actionable takeaways are: (1) entropy mitigation through QP proliferation or adaptive routing is non-negotiable at scale; (2) in-network reduction is the highest-leverage optimization available on supported hardware; (3) topology-aware rank assignment can reduce cross-zone traffic by 20–30% at zero hardware cost; and (4) hybrid RoCE/IB configurations remain a pragmatic choice for organizations with mixed latency and throughput requirements. As Ultra Ethernet matures and 800G port speeds arrive, the economic case for RoCE will only strengthen, making these operational lessons foundational knowledge for the next generation of AI infrastructure.
Additional Resources#
InfiniBand VS. RoCE v2: Which is Best Network Architecture for AI
[PDF] Rail-only: A Low-Cost High-Performance Network for Training LLMs
GPT-4 Architecture, Infrastructure, Training Dataset, Costs, Vision
As OpenAI releases GPT-4, Microsoft details Azure AI infrastructure
From bare metal to a 70B model: infrastructure set-up and scripts
What AI Infrastructure Requires & The InfiniBand vs. RoCE Debate
Scaling Llama 3 Training with Efficient Parallelism Strategies
Guest Post: Building Meta’s GenAI Infrastructure - Hammerspace
[PDF] RDMA over Ethernet for Distributed AI Training at Meta Scale
Efficient Pre-training of Llama 3-like model architectures using torchtitan on Amazon SageMaker
InfiniBand vs Ethernet for AI Clusters: Effective GPU Networks in 2025
DeepSeek-V2: A Strong, Economical, and Efficient Mixture - GitHub
RoCE Vs InfiniBand: Key Differences, Performance & Use Cases
Grok 4 Accelerates AI Arms Race: Progress and Unresolved Perils
Disclaimers#
Third-party content is licensed to you directly by the third party that owns the content and is not licensed to you by AMD. ALL LINKED THIRD-PARTY CONTENT IS PROVIDED “AS IS” WITHOUT A WARRANTY OF ANY KIND. USE OF SUCH THIRD-PARTY CONTENT IS DONE AT YOUR SOLE DISCRETION AND UNDER NO CIRCUMSTANCES WILL AMD BE LIABLE TO YOU FOR ANY THIRD-PARTY CONTENT. YOU ASSUME ALL RISK AND ARE SOLELY RESPONSIBLE FOR ANY DAMAGES THAT MAY ARISE FROM YOUR USE OF THIRD-PARTY CONTENT.