Tiny-vLLM: High-Performance C++ LLM Inference Engine
Show HN feature reveals Tiny-vLLM, a lightweight C++ and CUDA inference engine designed to outperform Python-based alter…
12 articles about 'LLM inference'
Developers evaluate local LLM frameworks for Mac, prioritizing inference speed and memory efficiency over raw model size…
Cloudflare unveils disaggregated prefill architecture and custom Infire inference engine to run large language models ef…
Arm Holdings announces a new neural processing unit architecture optimized for running large language models directly on…
MIT researchers unveil a new sparse attention mechanism that dramatically reduces LLM inference costs while preserving m…
Google's new Gemma 4 open-weight models leverage speculative decoding to deliver up to 3x faster inference with no quali…
A practical breakdown of AI workloads the RTX 5060 Ti 16GB can handle, from local LLMs to voice recognition and agent fr…
A practical guide to dramatically boosting LLM inference speed using vLLM and NVIDIA TensorRT-LLM frameworks.
A new study reveals Mixture-of-Experts models activate only a fraction of parameters during inference, slashing compute …
Developers debate the best cloud platforms for running Zhipu AI's GLM5.1, raising questions about reliability, speed, an…
A new paper from Google DeepMind and Turing Award winner David Patterson reveals the staggering hardware costs of LLM in…
When you have multiple unrelated questions for an LLM, splitting them into parallel requests almost always beats batchin…