Chaojun Hou

Chaojun Hou is currently a PMTS Software Development Engineer on the Training at Scale team. He is a cloud-native architect focused on scheduling and reliability for AI at scale. His work centers on two areas: resource scheduling and online/offline workload colocation, which aim to maximize utilization while protecting online QoS, and fault-tolerant design for large-scale model training, spanning compute, storage, networking, observability, and resilience. He has extensive hands-on experience across the cloud stack and is adept at the systems and deep-learning engineering optimizations that move AI from prototype to production.

Posts by Chaojun Hou