Run larger context windows and heavier LLMs on your Mac by applying different quantization precision to keys vs values in the attention mechanism’s KV cache. KVSplit enables you to:
- Reduce memory usage by up to 72% with minimal quality loss
- Run 2-3x longer contexts in the same memory budget
- Maintain or improve inference speed compared to FP16
- Optimize for Apple Silicon with full Metal support
Configuration | VRAM @ 8K tokens | Tokens/sec | Perplexity Change |
---|---|---|---|
FP16 (base) | 176.00 MB (100%) | 54,360 | — |
K8V8 (8-bit) | 93.50 MB (53%) | 51,503 | +0.03% |
K8V4 | 71.50 MB (41%) | 57,438 | +0.86% |
K4V8 | 71.50 MB (41%) | 58,690 | +6.06% |
K4V4 (4-bit) | 49.50 MB (28%) | 55,193 | +6.15% |
Configuration | 128 tokens | 2048 tokens | 4096 tokens | 8192 tokens |
---|---|---|---|---|
FP16 (baseline) | 5.50 MB | 44.00 MB | 88.00 MB | 176.00 MB |
K8V8 (8-bit) | 2.92 MB | 23.38 MB | 46.75 MB | 93.50 MB |
K8V4 (mixed) | 2.23 MB | 17.88 MB | 35.75 MB | 71.50 MB |
K4V8 (mixed) | 2.23 MB | 17.88 MB | 35.75 MB | 71.50 MB |
K4V4 (4-bit) | 1.55 MB | 12.38 MB | 24.75 MB | 49.50 MB |
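These sizes follow a simple linear rule: the cache stores one key vector and one value vector per token per layer, so memory scales with context length and with the chosen bit widths. Below is a minimal sketch of that arithmetic; the model dimensions (22 layers, 4 KV heads, head dimension 64, which reproduce the FP16 row above) are assumptions, and llama.cpp's block quantization formats add scale metadata and alignment overhead that this estimate ignores, so measured sizes run slightly higher.

```python
# Back-of-the-envelope KV cache sizing (a sketch, not KVSplit's accounting).
# Assumed dimensions roughly match a TinyLlama-class model; adjust for yours.

def kv_cache_mb(n_tokens, n_layers=22, n_kv_heads=4, head_dim=64,
                key_bits=16, val_bits=16):
    """Approximate KV cache size in MB for a given context length."""
    per_tensor = n_layers * n_kv_heads * head_dim   # values per token, K or V
    key_bytes = per_tensor * key_bits / 8
    val_bytes = per_tensor * val_bits / 8
    return n_tokens * (key_bytes + val_bytes) / (1024 ** 2)

for name, (kb, vb) in {"FP16": (16, 16), "K8V4": (8, 4), "K4V4": (4, 4)}.items():
    print(f"{name}: {kv_cache_mb(8192, key_bits=kb, val_bits=vb):.1f} MB @ 8K tokens")
```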
- Independent quantization of keys and values in the KV cache
- Optimized for Apple Silicon with Metal support
- Comprehensive benchmarking suite with perplexity measurement
- Memory usage and performance analysis tools
- Publication-quality visualization tools
- Easy setup and usage
- macOS (tested on Apple Silicon)
- Homebrew package manager
- Xcode Command Line Tools
# Clone the repository
git clone https://github.com/dipampaul17/KVSplit.git
cd KVSplit
# Run the installer script
chmod +x scripts/install_kvsplit.sh
./scripts/install_kvsplit.sh
The installer will:
- Set up the project structure
- Clone and build llama.cpp with Metal support
- Configure for differentiated KV cache quantization
- Download a small test model (optional)
- Set up Python environment for visualization
Want to see the benefits immediately? Run a quick comparison with your model:
# Run quick comparison with different configurations
python scripts/quick_compare.py --model models/your-model.gguf
This will show you a side-by-side comparison of FP16, K8V8, K8V4, K4V8, and K4V4 with memory usage, speed, and quality metrics.
Configuration | VRAM @ 8K tokens | Memory Savings | Quality Impact |
---|---|---|---|
FP16 (base) | 176.00 MB | — | — |
K8V8 (8-bit) | 93.50 MB | 47% | +0.03% |
K8V4 | 71.50 MB | 59% | +0.86% |
K4V8 | 71.50 MB | 59% | +6.06% |
K4V4 (4-bit) | 49.50 MB | 72% | +6.15% |
Using KVSplit doesn't just save memory: in our benchmarks, the mixed-precision configurations also ran up to 8% faster than FP16!
Configuration | Tokens/sec (8K ctx) | Speedup vs FP16 |
---|---|---|
FP16 | 54,360 | — |
K8V8 | 51,503 | -5.3% |
K8V4 | 57,438 | +5.7% |
K4V8 | 58,690 | +8.0% |
K4V4 | 55,193 | +1.5% |
kvsplit/
├── llama.cpp/ # Optimized llama.cpp build
├── models/ # LLM model files
├── scripts/ # Utility scripts
│ ├── benchmark_kvsplit.py # Comprehensive benchmark tool
│ ├── install_kvsplit.sh # One-command installer
│ ├── quick_compare.py # Quick comparison utility
│ ├── capture_memory.sh # GIF creation for memory visualization
│ └── visualize_results.py # Generate publication-quality plots
├── results/ # Benchmark results (CSV/JSON)
├── plots/ # Generated visualizations
└── README.md # This file
KV cache memory is dominated by storing key and value vectors for each token. Our research has revealed a critical insight: keys are significantly more sensitive to quantization than values.
- Asymmetric Impact: Keys require higher precision than values for maintaining quality
- Sweet Spot: K8V4 (8-bit keys, 4-bit values) provides optimal balance
- Only 0.86% perplexity degradation vs. FP16
- 59% memory reduction
- Faster inference than FP16
- Confirmation: K4V8 configuration shows 7x more quality degradation than K8V4, despite using the same total bits
This asymmetry allows for more efficient memory usage without compromising model quality, enabling longer context windows and larger models on consumer hardware.
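The mechanism is visible even in a toy model. Quantization error in V only perturbs the attention output linearly, since values are averaged under the softmax weights, whereas error in K perturbs the logits feeding the softmax, and keys in transformers are known to carry large per-channel outliers that inflate the quantization scale. The experiment below is an illustrative sketch, not KVSplit's implementation: it uses naive round-to-nearest quantization with a single scale and synthetic data with injected key outliers, while llama.cpp uses block-wise Q4/Q8 formats.

```python
# Toy demonstration of key-vs-value quantization sensitivity (illustrative
# only; real KV cache quantization in llama.cpp uses block-wise formats).
import numpy as np

def quantize(x, bits):
    """Naive symmetric round-to-nearest quantization with one global scale."""
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1)
    return np.round(x / scale) * scale

def attention(q, K, V):
    """Single-head scaled dot-product attention for one query."""
    logits = K @ q / np.sqrt(q.shape[0])
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ V

rng = np.random.default_rng(0)
head_dim, n_ctx = 64, 1024
q = rng.standard_normal(head_dim)
K = rng.standard_normal((n_ctx, head_dim))
V = rng.standard_normal((n_ctx, head_dim))
K[:, :4] *= 10.0  # inject per-channel key outliers, as observed in real models

ref = attention(q, K, V)
for k_bits, v_bits in [(8, 8), (8, 4), (4, 8), (4, 4)]:
    out = attention(q, quantize(K, k_bits), quantize(V, v_bits))
    err = np.linalg.norm(out - ref) / np.linalg.norm(ref)
    print(f"K{k_bits}V{v_bits}: relative output error {err:.3f}")
```

With the injected outliers, K4V8 shows far more output error than K8V4 despite spending the same total bits, mirroring the perplexity asymmetry in the tables above.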
# Baseline (FP16)
./llama.cpp/build/bin/llama-cli -m models/your-model.gguf -p "Your prompt" \
-t 8 --flash-attn
# ⭐ RECOMMENDED: 8-bit keys, 4-bit values (K8V4)
# Best balance of quality and memory savings
./llama.cpp/build/bin/llama-cli -m models/your-model.gguf -p "Your prompt" \
-t 8 --flash-attn --kvq 8
# 4-bit keys, 8-bit values (K4V8)
# Shows why key precision matters more than value precision
./llama.cpp/build/bin/llama-cli -m models/your-model.gguf -p "Your prompt" \
-t 8 --flash-attn --kvq-key 4 --kvq-val 8
# 4-bit keys and values (K4V4)
# Maximum memory savings (72% reduction) with acceptable quality
./llama.cpp/build/bin/llama-cli -m models/your-model.gguf -p "Your prompt" \
-t 8 --flash-attn --kvq 4
# Run with a 32K context (would require ~1.4GB in FP16, only ~400MB with K8V4)
./llama.cpp/build/bin/llama-cli -m models/your-model.gguf \
-c 32768 -n 4096 -t 8 --flash-attn --kvq 8 \
-f your-long-document.txt
Flag | Description | Recommendation |
---|---|---|
`-t 8` | Number of threads | 8 is optimal for most Apple Silicon chips |
`--flash-attn` | Enables optimized attention | Recommended for Apple Silicon |
`--kvq N` | Sets both key and value bits to N | Use `--kvq 8` for K8V4 configuration |
`--kvq-key N` | Sets key bits only | Key precision has major quality impact |
`--kvq-val N` | Sets value bits only | Value precision has minor quality impact |
`-c N` | Context size in tokens | Longer contexts benefit more from KVSplit |
`-n N` | Number of tokens to generate | Adjust based on your needs |
`-f FILE` | Input file | For processing documents |
`-m MODEL` | Model path | Path to your .gguf model file |
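If you script many runs, a small wrapper keeps the configurations straight. The helper below is a hypothetical convenience sketch, not part of KVSplit; it only assembles the documented flags above and assumes the default build path created by the installer.

```python
# Hypothetical wrapper around llama-cli using the flags documented above.
import subprocess

CONFIGS = {
    "FP16": [],
    "K8V4": ["--kvq", "8"],           # recommended configuration
    "K4V8": ["--kvq-key", "4", "--kvq-val", "8"],
    "K4V4": ["--kvq", "4"],           # maximum memory savings
}

def run_llama(model, prompt, config="K8V4", ctx=4096, n_predict=256, threads=8):
    """Launch llama-cli with a named KV quantization configuration."""
    cmd = ["./llama.cpp/build/bin/llama-cli", "-m", model, "-p", prompt,
           "-c", str(ctx), "-n", str(n_predict), "-t", str(threads),
           "--flash-attn", *CONFIGS[config]]
    subprocess.run(cmd, check=True)

run_llama("models/your-model.gguf", "Your prompt", config="K8V4")
```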
For comprehensive performance analysis, use our full benchmark suite:
# Run the full benchmark suite (all configurations and sequence lengths)
python scripts/benchmark_kvsplit.py
# Run a specific configuration test
python scripts/benchmark_kvsplit.py --config K8V4 --seq-len 4096
# Generate publication-quality visualizations
python scripts/visualize_results.py
The benchmarking script provides thorough measurements of:
- 📊 Memory Usage: VRAM and KV cache specifically
- ⚡ Performance: Tokens per second across different sequence lengths
- 🎯 Quality: Perplexity measurement using llama-perplexity
- 📈 Scaling: How memory usage and performance scale with sequence length
Results are saved in CSV/JSON formats with automatic summary statistics, and the visualization script generates publication-quality plots showing key insights.
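For quick post-processing of those files, something like the snippet below works; the output path and column names here are illustrative assumptions, since the actual schema is whatever benchmark_kvsplit.py writes.

```python
# Hypothetical results summary: adjust the path and column names to match
# the CSV that benchmark_kvsplit.py actually emits.
import pandas as pd

df = pd.read_csv("results/benchmark.csv")   # assumed output location
summary = (
    df.groupby("config")                    # assumed column names
      .agg(vram_mb=("vram_mb", "mean"), tok_s=("tokens_per_sec", "mean"))
      .sort_values("vram_mb")
)
print(summary.to_string())
```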
You can visualize memory savings with our capture tool:
# Capture memory reduction in Activity Monitor
./scripts/capture_memory.sh
- Metal Performance: Fully optimized for Apple’s Metal framework
- Memory Efficiency: Critical for memory-constrained M1/M2/M3 devices
- Activity Monitor: Use our `capture_memory.sh` script to visualize real-time memory reductions
- Alignment: 256B page alignment in llama.cpp means actual memory savings might differ slightly from theoretical calculations (see the sketch below)
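The alignment effect itself is simple arithmetic: every buffer size rounds up to the next 256-byte boundary. A minimal sketch:

```python
# Minimal sketch of the 256B alignment effect noted above: buffer sizes
# round up to the next page boundary, so measured memory can slightly
# exceed the raw per-token estimate.
def aligned(nbytes: int, page: int = 256) -> int:
    """Round a size in bytes up to the next page boundary."""
    return (nbytes + page - 1) // page * page

print(aligned(1000))  # -> 1024
```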
- Differentiated Precision: Independent key and value bit precision (K8V4, K4V8, etc)
- Apple Silicon Optimization: Full Metal support for M1/M2/M3 chips
- Comprehensive Benchmarking: Memory, speed, and quality metrics
- Publication-Quality Visualization: Beautiful plots for analysis
- Simple User Interface: One-command install and quick comparison tools
- Memory Visualization: Tools to capture and visualize memory savings
This project implements ideas from recent research including:
- “More for Keys, Less for Values: Adaptive KV Cache Quantization” (2024)
- “Unifying KV Cache Compression for Large Language Models with LeanKV” (2025)
- Best Overall: 🌟 K8V4 🌟 (8-bit keys, 4-bit values)
  - 59% memory reduction with only 0.86% quality loss
  - Improved inference speed (+5.7% vs FP16)
  - Great balance of quality and efficiency
- Absolute Maximum Memory Savings: K4V4 (4-bit keys and values)
  - 72% memory reduction with ~6% quality loss
  - Good for memory-constrained devices
  - Acceptable for less sensitive applications
- Best for Very Long Contexts: K8V4 or K4V4
  - Memory savings compound with context length
  - Run 2-3x longer contexts in the same memory budget
License: MIT
Contributions are welcome! Please open an issue or submit a pull request.