Get Started with NTK
NTK is a single Rust binary. The setup takes under 5 minutes.
NTK requires Claude Code (the CLI) installed and configured. Layers 1+2 run pure Rust and work out-of-the-box. Layer 3 inference is optional and configured later in Step 4 — Ollama is installed then, not now.
# Check Claude Code is installed
claude --version
The installer enumerates every discrete GPU on the machine (NVIDIA + AMD, any number and mix), lists them, and asks which release variant to install: NVIDIA (-gpu CUDA / Metal build), AMD (-cpu build with post-install Vulkan guidance), or CPU only. When piped non-interactively, the choice is auto-selected from detection; override with NTK_INSTALL_PLATFORM=nvidia|amd|cpu.
curl -fsSL https://ntk.valraw.com/install.sh | sh
$tmp = "$env:TEMP\ntk-install.ps1" irm https://ntk.valraw.com/install.ps1 -OutFile $tmp powershell -ExecutionPolicy Bypass -File $tmp
The one-liner irm ... | iex works only when you preset
$env:NTK_INSTALL_PLATFORM = 'nvidia' | 'amd' | 'cpu' —
otherwise the GPU picker prompt is lost in the pipe buffering.
Or install from source:
# Auto-picks the right --features flag for your hardware: ./scripts/build.sh # Linux / macOS .\scripts\build.ps1 # Windows # Or vanilla Cargo: cargo build --release # CPU / Ollama (works everywhere) cargo build --release --features cuda # NVIDIA (needs CUDA Toolkit) cargo build --release --features metal # Apple Silicon
Note: the cuda feature requires the CUDA Toolkit (nvcc on PATH). Install it separately — Cargo does not ship SDKs. There is no AMD feature flag; AMD users build the CPU variant and configure a Vulkan llama-server via ntk model setup.
This patches ~/.claude/settings.json to add the PostToolUse hook, copies the hook script, and creates ~/.ntk/config.json with sensible defaults. That's the full scope of ntk init — model backend and GPU choice are handled in Step 4.
ntk init -g # Claude Code (default) ntk init -g --opencode # OpenCode ntk init -g --cursor # Cursor — MCP via ~/.cursor/mcp.json ntk init -g --zed # Zed — MCP via context_servers ntk init -g --continue # Continue — MCP via mcpServers[]
The -g flag patches the global settings. Safe to re-run (idempotent). Cursor, Zed, and Continue register NTK as a Model Context Protocol server (ntk mcp-server) that the agent calls on-demand — no ntk start daemon required for MCP mode.
The model setup wizard detects every NVIDIA and AMD GPU on the system (multi-GPU is fully supported — 2× NVIDIA, 1× NVIDIA + 1× AMD, etc.), lets you pick one, and installs / configures the backend. The chosen vendor is saved to config.model.gpu_vendor and enforced at runtime — picking AMD on a machine that also has NVIDIA actually routes inference to AMD. It also installs Ollama on demand when you pick that backend — so there is no separate Ollama step.
ntk model setup # backend + GPU wizard # Once the backend is set, pull the default model: ntk model pull # phi3:mini Q5_K_M (~2GB) # Or choose a specific quantization: ntk model pull --quant q4_k_m # smaller, faster ntk model pull --quant q6_k # higher quality
AMD users: the wizard detects your card (Polaris, Vega, RDNA) via the Windows driver registry or Linux sysfs — ROCm is not required. Pick the llama.cpp backend and point it at a Vulkan-enabled llama-server (prebuilt binaries on llama.cpp releases).
Skip Step 4 entirely to run in L1+L2 mode (pure Rust, no neural inference, <5ms latency).
ntk start # CPU mode - opens live TUI dashboard ntk start --gpu # GPU acceleration (CUDA/Metal auto-detected) # Daemon already running? ntk start attaches to the live TUI without restarting. # Ctrl+C exits the TUI - daemon stays running. # For a quick non-interactive snapshot: ntk dashboard # status + gain + bar chart → stdout, then exit
ntk status # daemon status + model info ntk test-compress file # test on a captured output ntk gain # view token savings
Edit ~/.ntk/config.json or place a .ntk.json in your project root for per-project overrides:
{
"compression": {
"inference_threshold_tokens": 300,
"context_max_messages": 3,
"tokenizer": "cl100k_base"
},
"model": {
"provider": "ollama",
"quantization": "q5_k_m",
"gpu_layers": -1,
"gpu_vendor": null,
"backend_chain": [],
"fallback_to_layer1_on_timeout": true
},
"exclusions": {
"commands": ["cat", "echo"]
},
"security": { "audit_log": false },
"l3_cache": { "enabled": true, "ttl_days": 7 }
}
Tuning inference_threshold_tokens for your hardware
L3 first-call latency scales with GPU class. The L3 cache absorbs repeats (< 150 ms), so the threshold is set by how much interactivity you tolerate on a cold call:
| Hardware | Cold L3 latency | Threshold |
|---|---|---|
| CPU-only (AVX2) | 30–60 s | 2000 or disable L3 |
| Entry GPU (RX 580, GTX 1060) | 10–15 s (Vulkan) | 600 |
| Mid (RTX 3060, RX 6700 XT, M1) | 2–5 s | 300 (default) |
| High (RTX 4070+, M2 Pro+) | < 1.5 s | 200 |
| Flagship (RTX 4090, M4 Pro+) | < 700 ms | 100 |
Measure yours: ntk bench --l3 --runs 3. Pick a threshold where p95 ≤ 2 s for interactive comfort.
ntk stop ntk init --uninstall # removes hook from settings.json rm -rf ~/.ntk # remove config & data