Posts tagged Inference

Inferencing and serving with vLLM on AMD GPUs

vLLM is a high-performance, memory-efficient serving engine for large language models (LLMs). It uses PagedAttention and continuous batching to process LLM requests with high throughput. PagedAttention optimizes memory utilization by partitioning the Key-Value (KV) cache into manageable blocks. The KV cache stores previously computed keys and values, so the model only needs to compute attention for the current token. These blocks are then managed through a lookup table, much like memory pages in an operating system.
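To make the page-table analogy concrete, here is a minimal, illustrative sketch of a block-based KV-cache manager. It is not vLLM's actual implementation; names such as `BLOCK_SIZE`, `PagedKVCache`, and `append_token` are hypothetical. It only shows the bookkeeping: each sequence gets a lookup table mapping its logical token positions to fixed-size physical blocks drawn from a shared pool.

```python
BLOCK_SIZE = 16  # tokens stored per physical KV-cache block (illustrative value)


class PagedKVCache:
    """Toy block manager: maps each sequence's tokens to physical cache blocks."""

    def __init__(self, num_physical_blocks: int):
        self.free_blocks = list(range(num_physical_blocks))  # pool of free "pages"
        self.block_tables: dict[int, list[int]] = {}          # seq id -> physical block ids
        self.seq_lengths: dict[int, int] = {}                 # seq id -> tokens cached so far

    def append_token(self, seq_id: int) -> tuple[int, int]:
        """Reserve a slot for one new token; return (physical block id, offset in block)."""
        length = self.seq_lengths.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % BLOCK_SIZE == 0:  # current block is full (or this is the first token)
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted")
            table.append(self.free_blocks.pop())  # allocate a new physical block
        self.seq_lengths[seq_id] = length + 1
        return table[-1], length % BLOCK_SIZE

    def free_sequence(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the pool for reuse."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lengths.pop(seq_id, None)


if __name__ == "__main__":
    cache = PagedKVCache(num_physical_blocks=4)
    for _ in range(20):                    # decode 20 tokens for sequence 0
        block_id, offset = cache.append_token(seq_id=0)
    print(cache.block_tables[0])           # 20 tokens span ceil(20/16) = 2 blocks
    cache.free_sequence(0)                 # blocks become available for other sequences
```

Because blocks are allocated on demand and returned to the pool when a sequence finishes, memory is not reserved for a sequence's maximum possible length up front, which is what lets the engine batch many requests concurrently.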

Read more ...