NHacker Next
When does fragmentation occur in the CUDA caching allocator? (docs.pytorch.org)
keynha 4 days ago [-]
In LLM serving, I treat the failure mode at the end of the article (long-lived blocks interleaved with short-lived ones, which expandable segments still can't merge across) as the steady state, not an edge case: weights and graph buffers sit forever while the per-request KV cache churns. So I've stopped relying on the caching allocator for KV at all. vLLM reserves one big region at startup and pages fixed-size KV blocks itself, so the allocator never sees the churn. Same fragmentation problem, solved one layer up.
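To make the "reserve once, page fixed-size blocks yourself" idea concrete, here is a minimal sketch in pure Python. It is not vLLM's actual code: `BlockPool`, `append_tokens`, and `release` are hypothetical names, and the list of block ids stands in for one large device buffer that would be allocated a single time at startup.

```python
class BlockPool:
    """Fixed-size block pool over a single up-front reservation (illustrative only)."""

    def __init__(self, num_blocks: int, block_tokens: int):
        self.block_tokens = block_tokens
        # Stand-in for one big device buffer carved into num_blocks equal blocks.
        # The free list is a stack of block ids; block 0 is handed out first.
        self.free = list(range(num_blocks - 1, -1, -1))
        # Per-request block table: request id -> block ids, in token order.
        self.tables: dict[str, list[int]] = {}

    def append_tokens(self, req: str, total_tokens: int) -> list[int]:
        """Grow req's block table until it can hold total_tokens tokens."""
        table = self.tables.get(req, [])
        needed = -(-total_tokens // self.block_tokens)  # ceiling division
        extra = needed - len(table)
        if extra > len(self.free):
            raise MemoryError("KV block pool exhausted")
        for _ in range(extra):
            table.append(self.free.pop())
        self.tables[req] = table
        return table

    def release(self, req: str) -> None:
        """Return all of req's blocks to the pool when the request finishes."""
        self.free.extend(self.tables.pop(req, []))


pool = BlockPool(num_blocks=4, block_tokens=16)
a = list(pool.append_tokens("a", 20))  # 20 tokens need 2 blocks
b = list(pool.append_tokens("b", 16))  # 16 tokens need 1 block
pool.release("a")                      # frees blocks on either side of b's
c = list(pool.append_tokens("c", 48))  # 3 blocks, not adjacent, still fits
```

Because every block is the same size and a request's blocks need not be adjacent, any free block satisfies any request, so freed space is always reusable regardless of what long-lived data sits next to it. The cost is an extra level of indirection (the block table) on every KV access.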