NTK - Neural Token Killer

1

Prerequisites

NTK requires Claude Code (the CLI) installed and configured. Layers 1+2 run pure Rust and work out-of-the-box. Layer 3 inference is optional and configured later in Step 4 — Ollama is installed then, not now.

bash

# Check Claude Code is installed
claude --version

2

Install NTK

The installer enumerates every discrete GPU on the machine (NVIDIA + AMD, any number and mix), lists them, and asks which release variant to install: NVIDIA (-gpu CUDA / Metal build), AMD (-cpu build with post-install Vulkan guidance), or CPU only. When piped non-interactively, the choice is auto-selected from detection; override with NTK_INSTALL_PLATFORM=nvidia|amd|cpu.

bash - Linux / macOS

curl -fsSL https://ntk.valraw.com/install.sh | sh

PowerShell - Windows (recommended)

$tmp = "$env:TEMP\ntk-install.ps1"
irm https://ntk.valraw.com/install.ps1 -OutFile $tmp
powershell -ExecutionPolicy Bypass -File $tmp

The one-liner irm ... | iex works only when you preset $env:NTK_INSTALL_PLATFORM = 'nvidia' | 'amd' | 'cpu' — otherwise the GPU picker prompt is lost in the pipe buffering.

Or install from source:

bash - from source

# Auto-picks the right --features flag for your hardware:
./scripts/build.sh        # Linux / macOS
.\scripts\build.ps1       # Windows

# Or vanilla Cargo:
cargo build --release                      # CPU / Ollama (works everywhere)
cargo build --release --features cuda      # NVIDIA (needs CUDA Toolkit)
cargo build --release --features metal     # Apple Silicon

Note: the cuda feature requires the CUDA Toolkit (nvcc on PATH). Install it separately — Cargo does not ship SDKs. There is no AMD feature flag; AMD users build the CPU variant and configure a Vulkan llama-server via ntk model setup.

3

Initialize hook

This patches ~/.claude/settings.json to add the PostToolUse hook, copies the hook script, and creates ~/.ntk/config.json with sensible defaults. That's the full scope of ntk init — model backend and GPU choice are handled in Step 4.

bash

ntk init -g                # Claude Code (default)
ntk init -g --opencode     # OpenCode
ntk init -g --cursor       # Cursor — MCP via ~/.cursor/mcp.json
ntk init -g --zed          # Zed — MCP via context_servers
ntk init -g --continue     # Continue — MCP via mcpServers[]

The -g flag patches the global settings. Safe to re-run (idempotent). Cursor, Zed, and Continue register NTK as a Model Context Protocol server (ntk mcp-server) that the agent calls on-demand — no ntk start daemon required for MCP mode.

4

Configure Layer 3 (optional)

The model setup wizard detects every NVIDIA and AMD GPU on the system (multi-GPU is fully supported — 2× NVIDIA, 1× NVIDIA + 1× AMD, etc.), lets you pick one, and installs / configures the backend. The chosen vendor is saved to config.model.gpu_vendor and enforced at runtime — picking AMD on a machine that also has NVIDIA actually routes inference to AMD. It also installs Ollama on demand when you pick that backend — so there is no separate Ollama step.

bash

ntk model setup         # backend + GPU wizard

# Once the backend is set, pull the default model:
ntk model pull          # phi3:mini Q5_K_M (~2GB)

# Or choose a specific quantization:
ntk model pull --quant q4_k_m  # smaller, faster
ntk model pull --quant q6_k    # higher quality

AMD users: the wizard detects your card (Polaris, Vega, RDNA) via the Windows driver registry or Linux sysfs — ROCm is not required. Pick the llama.cpp backend and point it at a Vulkan-enabled llama-server (prebuilt binaries on llama.cpp releases).

Skip Step 4 entirely to run in L1+L2 mode (pure Rust, no neural inference, <5ms latency).

5

Start daemon

bash

ntk start               # CPU mode - opens live TUI dashboard
ntk start --gpu         # GPU acceleration (CUDA/Metal auto-detected)

# Daemon already running? ntk start attaches to the live TUI without restarting.
# Ctrl+C exits the TUI - daemon stays running.

# For a quick non-interactive snapshot:
ntk dashboard           # status + gain + bar chart → stdout, then exit

6

Verify installation

bash

ntk status              # daemon status + model info
ntk test-compress file  # test on a captured output
ntk gain                # view token savings

Configuration

Edit ~/.ntk/config.json or place a .ntk.json in your project root for per-project overrides:

~/.ntk/config.json

{
  "compression": {
    "inference_threshold_tokens": 300,
    "context_max_messages": 3,
    "tokenizer": "cl100k_base"
  },
  "model": {
    "provider": "ollama",
    "quantization": "q5_k_m",
    "gpu_layers": -1,
    "gpu_vendor": null,
    "backend_chain": [],
    "fallback_to_layer1_on_timeout": true
  },
  "exclusions": {
    "commands": ["cat", "echo"]
  },
  "security": { "audit_log": false },
  "l3_cache": { "enabled": true, "ttl_days": 7 }
}

Tuning `inference_threshold_tokens` for your hardware

L3 first-call latency scales with GPU class. The L3 cache absorbs repeats (< 150 ms), so the threshold is set by how much interactivity you tolerate on a cold call:

Hardware	Cold L3 latency	Threshold
CPU-only (AVX2)	30–60 s	`2000` or disable L3
Entry GPU (RX 580, GTX 1060)	10–15 s (Vulkan)	`600`
Mid (RTX 3060, RX 6700 XT, M1)	2–5 s	`300` (default)
High (RTX 4070+, M2 Pro+)	< 1.5 s	`200`
Flagship (RTX 4090, M4 Pro+)	< 700 ms	`100`

Measure yours: ntk bench --l3 --runs 3. Pick a threshold where p95 ≤ 2 s for interactive comfort.

Uninstall

bash

ntk stop
ntk init --uninstall    # removes hook from settings.json
rm -rf ~/.ntk           # remove config & data

Scenario	Fixture	Ratio
Repetitive logs	docker_logs_repetitive	97%
TypeScript React trace	typescript_react_trace	92%
Unstructured log	generic_long_log	91%
Gradle build	gradle_build_verbose	90%
Terraform plan	terraform_plan_apply	88%
Node.js trace	node_express_trace	83%
Ruby on Rails trace	ruby_rails_trace	83%
pnpm install	pnpm_install_large	78%
cargo build	cargo_build_verbose	76%
FastAPI / Starlette trace	fastapi_starlette_trace	73%
PHP Symfony trace	php_symfony_trace	69%
cargo test	cargo_test_failures	68%
Kotlin Android trace	kotlin_android_trace	63%
Java stack trace	stack_trace_java	62%
Python Django trace	python_django_trace	62%
C# ASP.NET Core trace	csharp_aspnet_trace	61%
.NET build warnings	dotnet_build_warnings	61%
Go panic	go_panic_trace	57%
Git diff	git_diff_large	57%
Maven build	maven_build_tests	55%
Bazel build	bazel_build_verbose	34%
Webpack stats	webpack_stats	33%
TypeScript errors	tsc_errors_node_modules	33%
docker compose logs	docker_compose_logs	30%
pytest verbose	pytest_verbose_large	23%
go test verbose	go_test_verbose	16%
Next.js build routes	nextjs_build_routes	9%
Elixir Phoenix trace	elixir_phoenix_trace	7%
kubectl get events	kubectl_get_events	6%
RSpec documentation	rspec_documentation	2%
ESLint stylish	eslint_stylish	2%
kubectl describe pod	kubectl_describe_pod	spec_loader only
Swift / UIKit crash	swift_uikit_crash	spec_loader only

Category	RTK alone	NTK incremental	NTK+RTK combined
Tests	N/A	N/A	N/A
Build	N/A	N/A	N/A
Git	N/A	N/A	N/A
GitHub CLI	N/A	N/A	N/A
Package Managers	N/A	N/A	N/A
Infrastructure	N/A	N/A	N/A

Hardware	Backend	p50 latency	p95 latency	Notes
NVIDIA RTX 5060 Ti	CUDA	~30ms	~50ms	Ada Lovelace, full offload
NVIDIA RTX 3060	CUDA	~50ms	~80ms	12GB VRAM, full offload
Apple M2 MacBook Pro	Metal	~80ms	~150ms	Unified memory, via Ollama Metal
Intel Xeon 4th Gen	AMX	~150ms	~250ms	Sapphire Rapids, AMX tiles
Intel Core i7-12700	AVX2	~300ms	~500ms	12-core desktop, AVX2
Intel Core i5-8250U	AVX2	~600ms	~900ms	4-core laptop, baseline CPU

Neural
Token Killer

Four layers, one result

Installs in 30 seconds

Built for developer workflows

NTK needs your help to evolve

Get Started with NTK

Tuning `inference_threshold_tokens` for your hardware

Command Reference

Savings by Command Type

How savings are measured

Layer 3 Latency by Hardware

Neural Token Killer

Four layers, one result

Installs in 30 seconds

Built for developer workflows

NTK needs your help to evolve

Get Started with NTK

Tuning inference_threshold_tokens for your hardware

Command Reference

Savings by Command Type

How savings are measured

Layer 3 Latency by Hardware

Neural
Token Killer

Tuning `inference_threshold_tokens` for your hardware