mini-sglang is a minimal, educational re-implementation of an LLM inference serving engine inspired by SGLang. It provides the full stack from the HTTP API server down to GPU kernels: a FastAPI-based HTTP/ZMQ server, a continuous-batching scheduler with a paged KV cache (in naive and radix-tree prefix-sharing variants), a pluggable attention-backend system (FlashAttention, FlashInfer, TensorRT-LLM), tensor-parallel distributed inference via NCCL, and support for multiple model architectures (LLaMA, Qwen2, Qwen3, Qwen3-MoE). The codebase is structured as a Python package with C++/CUDA/Triton extensions compiled via JIT and AOT loaders.
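The radix-tree prefix-sharing idea mentioned above can be sketched as a trie keyed by token IDs, so that requests with a common prompt prefix reuse the same cached path. This is a toy illustration only, not mini-sglang's actual data structures; all class and method names here are invented for the example, and a real engine would attach KV-cache page indices and eviction logic to each node.

```python
class RadixNode:
    """One node per token; children keyed by token id (hypothetical sketch)."""
    def __init__(self):
        self.children = {}   # token_id -> RadixNode
        self.ref_count = 0   # how many live requests pin this node

class RadixPrefixCache:
    """Toy prefix-sharing index over token sequences.

    In a real paged-KV-cache engine each node would also hold the GPU
    page(s) storing that token's key/value vectors, so a matched prefix
    means those pages can be reused instead of recomputed.
    """
    def __init__(self):
        self.root = RadixNode()

    def match_prefix(self, tokens):
        """Return the length of the longest cached prefix of `tokens`."""
        node, matched = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            matched += 1
        return matched

    def insert(self, tokens):
        """Insert `tokens`, creating only the unshared suffix; pin the path."""
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, RadixNode())
            node.ref_count += 1
```

For example, two chat requests sharing the same system prompt would hit `match_prefix` for the whole shared prefix and only extend the trie (and, in the real engine, allocate fresh KV pages) for their differing suffixes.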