Posts by Jorge Parada

From Build to Benchmark: ONNX Model Serving with Triton Inference Server on AMD GPUs

Triton Inference Server is an open-source platform designed to streamline AI inferencing. It supports the deployment, scaling, and inference of trained models from multiple frameworks, including ONNX Runtime, TensorFlow, PyTorch, and others. It runs across cloud, data center, and edge environments, making it adaptable for diverse AI workloads.

Read more ...


Scale LLM Inference with Multi-Node Infrastructure

Horizontal scaling of compute resources has become critical in modern computing as data volumes and computational demands keep growing. Unlike vertical scaling, which enhances an individual system’s resources, horizontal scaling expands a system’s capacity by adding more instances or nodes that work in parallel. This helps keep the service highly available and its latency low, which is essential for handling diverse workloads and delivering a good user experience.

Read more ...