Neural
Token Killer

Semantic compression proxy daemon for Claude Code. Reduces tool output by up to 92% (measured on docker logs; stack traces 56–83%) before it reaches the LLM context. Four progressive compression layers, optional local neural inference, model-agnostic context injection.

GitHub
install curl -fsSL https://ntk.valraw.com/install.sh | sh
0
% max savings (L1+L2)
<20ms
L1+L2 latency
0
compression layers
0 deps
runtime required
How it works

Four layers, one result

Each layer activates only when needed, keeping latency near zero for small outputs.

L1
Fast Filter

ANSI strip, cargo/rustc progress, TS wavy underlines, git index/+++/--- metadata, template dedup, stack-trace collapse, test failure extraction. Always on. <1ms.

L2
Tokenizer-Aware

cl100k_base BPE token counting, path shortening. Always on. <5ms.

L3
Local Inference

Ollama/Phi-3 Mini with type-specific prompts. Only triggers when output >300 tokens.

L4
Context Injection

Reads the Claude Code transcript and prepends your most recent intent to the L3 prompt. Plain-string prefix, works with any LLM.

Experimental: An opt-in YAML rule engine (RFC-0001) lets the community ship language collapses without writing Rust. 13 rulesets ship today (Python, Java, Go, Node, Ruby, PHP, .NET, Kotlin, Rust, Swift, Elixir, Docker, kubectl) and a JS reference runtime mirrors the Rust engine for non-Rust agents. Activate with NTK_SPEC_RULES=….

Installs in 30 seconds

One command installs the binary, patches Claude Code's settings.json with the PostToolUse hook, and creates the config file.

ntk start - 127.0.0.1:8765
███╗ ██╗████████╗██╗ ██╗
████╗ ██║╚══██╔══╝██║ ██╔╝
██╔██╗ ██║ ██║ █████╔╝
Neural Token Killer
v0.2.0 · 127.0.0.1:8765
phi3:mini q5_k_m · candle [GPU]
● RUNNING
72,243
tokens saved
85%
avg ratio
847
compressions
Layer Distribution
L1
42%
L2
35%
L3
23%
Recent
L3 cargo test −94%
L1 git status −62%
L2 tsc --noEmit −83%
Features

Built for developer workflows

Sub-millisecond L1+L2
Regex + tokenizer layers add <5ms overhead to every Bash tool call.
Local AI, zero cloud
Phi-3 Mini runs 100% on your machine. No API keys, no data sent to the cloud.
RTK compatible
Works alongside RTK. RTK filters first, NTK semantically summarizes the result.
Type-aware compression
Different prompts for test output, build errors, logs, diffs. Extracts exactly what matters.
GPU acceleration
Enumerates every discrete GPU — NVIDIA via nvidia-smi, AMD via ROCm / Windows driver registry / Linux sysfs (Polaris & Vega included), plus Metal, AMX, AVX-512. Multi-vendor selection in ntk model setup — pick NVIDIA or AMD explicitly; the daemon scopes CUDA_VISIBLE_DEVICES / HIP_VISIBLE_DEVICES accordingly and never silently prefers NVIDIA. L3 drops from ~800ms to <100ms on GPU.
Live TUI dashboard
ntk start opens a full-screen dashboard with real-time metrics, per-layer stats, recent commands, and the active model with GPU/CPU mode. Updates every 500ms. Ctrl+C stops the daemon gracefully.
Cross-platform
Single Rust binary. Works on Windows, macOS, and Linux without any runtime dependencies.
Built-in benchmark harness
Every /compress response exposes per-layer token counts and latency. Set NTK_LOG_COMPRESSIONS=1 to persist every compression (input, L1/L2/L3 outputs) to ~/.ntk/logs/ for auditing. Included scripts in bench/ replay 12 fixture outputs (including Python Django / Node Express / Go panic / PHP Symfony stack traces), parse Claude Code transcripts, and produce a markdown report with token savings and estimated cost delta.
Layer 4: context injection
NTK reads the Claude Code session transcript to grab your most recent request, then prepends it to the L3 summarizer prompt so the compressed output prioritizes information relevant to your actual goal. Four prompt formats supported (Prefix / XmlWrap / Goal / Json), runtime-overridable via NTK_L4_FORMAT. Works with any LLM — plain-string prefix, zero model coupling.
Open initiative

NTK needs your help to evolve

This project started solo and the surface area (languages, frameworks, GPUs, editors) outgrew one-person maintenance. There is a concrete, pre-scoped list of starter tasks — most land as a single PR under an hour.

Add a fixture
Capture a real log from a tool or language we don't cover (Ruby, Elixir, Scala, Swift, Flutter, Terraform, kubectl). Two files: one .txt + one .meta.json.
Extend the stack-trace filter
Still open: Erlang/OTP, Scala/Akka, Elixir/Phoenix, Swift/iOS, Flutter/Dart, Clojure. Playbook in .claude/skills/add-stack-trace-language.md.
Port the hook to your editor
Claude Code, OpenCode, Cursor, Zed, and Continue work today via ntk init -g [--opencode|--cursor|--zed|--continue]. Aider is blocked on upstream. Windsurf, Cline, and other agents would need a fresh adapter — see docs/editor-integrations.md.
Benchmark on your GPU
AMD / Apple Silicon / Intel AMX / RTX 40-50 p50+p95 numbers are partly estimates. If you own the hardware, run ntk model bench and PR the measurement in.
Translate the docs
docs/app.js has EN + PT blocks. Adding ES, FR, DE, JA is a pure copy-then-translate task — no code required.
Break an invariant
Find an input that produces a misleading compression (lost error, empty exemplar, non-idempotent output) — that's a proptest we should add. See tests/proptest/compression_invariants.rs.
Read CONTRIBUTING.md Browse open issues
Installation guide

Get Started with NTK

NTK is a single Rust binary. The setup takes under 5 minutes.

1
Prerequisites

NTK requires Claude Code (the CLI) installed and configured. Layers 1+2 run pure Rust and work out-of-the-box. Layer 3 inference is optional and configured later in Step 4 — Ollama is installed then, not now.

bash
# Check Claude Code is installed
claude --version
2
Install NTK

The installer enumerates every discrete GPU on the machine (NVIDIA + AMD, any number and mix), lists them, and asks which release variant to install: NVIDIA (-gpu CUDA / Metal build), AMD (-cpu build with post-install Vulkan guidance), or CPU only. When piped non-interactively, the choice is auto-selected from detection; override with NTK_INSTALL_PLATFORM=nvidia|amd|cpu.

bash - Linux / macOS
curl -fsSL https://ntk.valraw.com/install.sh | sh
PowerShell - Windows (recommended)
$tmp = "$env:TEMP\ntk-install.ps1"
irm https://ntk.valraw.com/install.ps1 -OutFile $tmp
powershell -ExecutionPolicy Bypass -File $tmp

The one-liner irm ... | iex works only when you preset $env:NTK_INSTALL_PLATFORM = 'nvidia' | 'amd' | 'cpu' — otherwise the GPU picker prompt is lost in the pipe buffering.

Or install from source:

bash - from source
# Auto-picks the right --features flag for your hardware:
./scripts/build.sh        # Linux / macOS
.\scripts\build.ps1       # Windows

# Or vanilla Cargo:
cargo build --release                      # CPU / Ollama (works everywhere)
cargo build --release --features cuda      # NVIDIA (needs CUDA Toolkit)
cargo build --release --features metal     # Apple Silicon

Note: the cuda feature requires the CUDA Toolkit (nvcc on PATH). Install it separately — Cargo does not ship SDKs. There is no AMD feature flag; AMD users build the CPU variant and configure a Vulkan llama-server via ntk model setup.

3
Initialize hook

This patches ~/.claude/settings.json to add the PostToolUse hook, copies the hook script, and creates ~/.ntk/config.json with sensible defaults. That's the full scope of ntk init — model backend and GPU choice are handled in Step 4.

bash
ntk init -g                # Claude Code (default)
ntk init -g --opencode     # OpenCode
ntk init -g --cursor       # Cursor — MCP via ~/.cursor/mcp.json
ntk init -g --zed          # Zed — MCP via context_servers
ntk init -g --continue     # Continue — MCP via mcpServers[]

The -g flag patches the global settings. Safe to re-run (idempotent). Cursor, Zed, and Continue register NTK as a Model Context Protocol server (ntk mcp-server) that the agent calls on-demand — no ntk start daemon required for MCP mode.

4
Configure Layer 3 (optional)

The model setup wizard detects every NVIDIA and AMD GPU on the system (multi-GPU is fully supported — 2× NVIDIA, 1× NVIDIA + 1× AMD, etc.), lets you pick one, and installs / configures the backend. The chosen vendor is saved to config.model.gpu_vendor and enforced at runtime — picking AMD on a machine that also has NVIDIA actually routes inference to AMD. It also installs Ollama on demand when you pick that backend — so there is no separate Ollama step.

bash
ntk model setup         # backend + GPU wizard

# Once the backend is set, pull the default model:
ntk model pull          # phi3:mini Q5_K_M (~2GB)

# Or choose a specific quantization:
ntk model pull --quant q4_k_m  # smaller, faster
ntk model pull --quant q6_k    # higher quality

AMD users: the wizard detects your card (Polaris, Vega, RDNA) via the Windows driver registry or Linux sysfs — ROCm is not required. Pick the llama.cpp backend and point it at a Vulkan-enabled llama-server (prebuilt binaries on llama.cpp releases).

Skip Step 4 entirely to run in L1+L2 mode (pure Rust, no neural inference, <5ms latency).

5
Start daemon
bash
ntk start               # CPU mode - opens live TUI dashboard
ntk start --gpu         # GPU acceleration (CUDA/Metal auto-detected)

# Daemon already running? ntk start attaches to the live TUI without restarting.
# Ctrl+C exits the TUI - daemon stays running.

# For a quick non-interactive snapshot:
ntk dashboard           # status + gain + bar chart → stdout, then exit
6
Verify installation
bash
ntk status              # daemon status + model info
ntk test-compress file  # test on a captured output
ntk gain                # view token savings
Configuration

Edit ~/.ntk/config.json or place a .ntk.json in your project root for per-project overrides:

~/.ntk/config.json
{
  "compression": {
    "inference_threshold_tokens": 300,
    "context_max_messages": 3,
    "tokenizer": "cl100k_base"
  },
  "model": {
    "provider": "ollama",
    "quantization": "q5_k_m",
    "gpu_layers": -1,
    "gpu_vendor": null,
    "backend_chain": [],
    "fallback_to_layer1_on_timeout": true
  },
  "exclusions": {
    "commands": ["cat", "echo"]
  },
  "security": { "audit_log": false },
  "l3_cache": { "enabled": true, "ttl_days": 7 }
}

Tuning inference_threshold_tokens for your hardware

L3 first-call latency scales with GPU class. The L3 cache absorbs repeats (< 150 ms), so the threshold is set by how much interactivity you tolerate on a cold call:

Hardware Cold L3 latency Threshold
CPU-only (AVX2)30–60 s2000 or disable L3
Entry GPU (RX 580, GTX 1060)10–15 s (Vulkan)600
Mid (RTX 3060, RX 6700 XT, M1)2–5 s300 (default)
High (RTX 4070+, M2 Pro+)< 1.5 s200
Flagship (RTX 4090, M4 Pro+)< 700 ms100

Measure yours: ntk bench --l3 --runs 3. Pick a threshold where p95 ≤ 2 s for interactive comfort.

Uninstall
bash
ntk stop
ntk init --uninstall    # removes hook from settings.json
rm -rf ~/.ntk           # remove config & data
Reference

Command Reference

All NTK commands. Prefix every command with ntk.

Daemon
start
Start the compression daemon on port 8765, opening the live TUI dashboard. If daemon is already running, attaches to the live TUI without restarting.
start --gpu
Start with GPU acceleration (CUDA/ROCm/Metal auto-detected).
stop
Stop the daemon.
status
Show daemon status, loaded model, GPU info, and uptime.
dashboard
Combined static snapshot: daemon status + session gain + ASCII bar chart. Prints to stdout and exits immediately - safe for scripts and CI.
Setup & Init
init -g
Initialize globally: patch settings.json, create ~/.ntk/config.json, then auto-launch model setup wizard.
init --show
Display current hook installation status.
init --uninstall
Remove the PostToolUse hook from settings.json.
init --auto-patch
Non-interactive mode for CI/CD pipelines.
init --hook-only
Install hook script only - skip config.json creation and model setup wizard.
init -g --opencode
Install PostToolUse hook in OpenCode's settings.json instead of Claude Code.
init -g --cursor
Register NTK as an MCP server in ~/.cursor/mcp.json. Agent calls it on-demand via the compress_output tool.
init -g --zed
Register NTK in Zed's context_servers (MCP). Same binary as Cursor, different config shape.
init -g --continue
Register NTK in Continue's mcpServers array (~/.continue/config.json).
mcp-server
JSON-RPC over stdio. Launched by MCP clients (Cursor / Zed / Continue); not a standalone command.
Model
model pull
Download phi3:mini (default Q5_K_M, ~2GB) via Ollama.
model pull --quant q4_k_m
Download a specific quantization (q4_k_m, q5_k_m, q6_k).
model setup
Interactive backend selector (Ollama / Candle / llama.cpp) with multi-GPU detection and selection. Installs Ollama on demand. Run after ntk init, or anytime to reconfigure.
model test
Test model latency and output quality with a sample prompt.
model test --debug
Verbose test: shows thread config, mlock status, system prompt preview, timing breakdown, and performance analysis with CPU-tier-aware targets (mobile ≥5 tok/s, desktop ≥10, high-end ≥15, GPU ≥40).
model bench
Benchmark CPU vs GPU inference latency.
model list
List available models in the configured backend.
model install-server
Re-download the llama-server binary for the current OS × gpu_vendor (picks CUDA / Vulkan / SYCL / Metal automatically). Does not re-download the GGUF model.
Compression
test-compress <file>
Run the full pipeline on a captured output file and print result.
test
Run correctness tests on all compression layers. No daemon required.
test --l3
Include Layer 3 inference in the test run.
bench
Benchmark all compression layers (default: 5 runs per payload).
bench --runs <N>
Set number of benchmark runs per payload for stable measurements.
bench --submit
Emit a structured JSON report (hardware + per-payload numbers). Attach to a GitHub bench issue.
test-compress --verbose
Per-layer breakdown — applied rules, tokens, latency, and 20-line preview for each stage.
diff <file> --layer l1|l2
Unified diff between raw input and a specific layer's output. Use when a PR changes the compression ratio.
config
Show the active merged configuration (global + project overrides).
config --file <path>
Show configuration from a specific file path.
Metrics & Analytics
metrics
Session metrics table in stdout (plain text).
graph
ASCII bar chart + sparkline of savings over time.
gain
Token savings summary (RTK-compatible output format).
history
Last 50 compressed commands with token counts and layer used.
discover
Analyze Claude Code session for missed compression opportunities.
tail
Show the last 10 compression events and exit. Reads ~/.ntk/metrics.db directly — works even when the daemon is offline.
tail -f
Stream events live (polls SQLite every 2 s). Use --command NAME to filter by command prefix.
prune --older-than 30
Delete records older than N days from both compression_records and l3_cache, then VACUUM to reclaim disk.

RTK users: prefix with rtk ntk <cmd> to also compress NTK's own output.

Token savings

Savings by Command Type

Measured against real-world command output captured during development sessions.

92%
Max savings (docker logs)
L1+L2 only, no L3
~55%
Avg stack-trace compression
Java / Python / Go / Node / PHP
<20ms
L1+L2 overhead
always on, zero impact
12
Fixture scenarios
multi-language coverage

Measured on 34 deterministic fixtures in bench/fixtures/ with NTK v0.3.0. All numbers are L1+L2 only (L3 not triggered in this run). Aggregate token reduction across the corpus: 57.4% (40 631 → 17 313 tokens). Average per-fixture ratio: 49.3%. Reproduce with pwsh bench/run_all.ps1 or bash bench/run_all.sh.

Scenario Fixture Ratio Visual
Repetitive logs docker_logs_repetitive 97%
TypeScript React trace typescript_react_trace 92%
Unstructured log generic_long_log 91%
Gradle build gradle_build_verbose 90%
Terraform plan terraform_plan_apply 88%
Node.js trace node_express_trace 83%
Ruby on Rails trace ruby_rails_trace 83%
pnpm install pnpm_install_large 78%
cargo build cargo_build_verbose 76%
FastAPI / Starlette trace fastapi_starlette_trace 73%
PHP Symfony trace php_symfony_trace 69%
cargo test cargo_test_failures 68%
Kotlin Android trace kotlin_android_trace 63%
Java stack trace stack_trace_java 62%
Python Django trace python_django_trace 62%
C# ASP.NET Core trace csharp_aspnet_trace 61%
.NET build warnings dotnet_build_warnings 61%
Go panic go_panic_trace 57%
Git diff git_diff_large 57%
Maven build maven_build_tests 55%
Bazel build bazel_build_verbose 34%
Webpack stats webpack_stats 33%
TypeScript errors tsc_errors_node_modules 33%
docker compose logs docker_compose_logs 30%
pytest verbose pytest_verbose_large 23%
go test verbose go_test_verbose 16%
Next.js build routes nextjs_build_routes 9%
Elixir Phoenix trace elixir_phoenix_trace 7%
kubectl get events kubectl_get_events 6%
RSpec documentation rspec_documentation 2%
ESLint stylish eslint_stylish 2%
kubectl describe pod kubectl_describe_pod spec_loader only
Swift / UIKit crash swift_uikit_crash spec_loader only

Combined savings formula: 1 − (1 − rtk%) × (1 − ntk_incremental%)

Pending: combined NTK+RTK measurements have not been recorded against the current fixture library. Run pwsh bench/run_all.ps1 with RTK filtering on to populate. Numbers below are estimates derived from single-layer measurements; treat as illustrative until the combined bench lands.

Category RTK alone NTK incremental NTK+RTK combined Visual
Tests N/A N/A N/A
Build N/A N/A N/A
Git N/A N/A N/A
GitHub CLI N/A N/A N/A
Package Managers N/A N/A N/A
Infrastructure N/A N/A N/A

How savings are measured

Token counts use cl100k_base (tiktoken-rs), the same tokenizer as Claude. All figures come from bench/microbench.csv produced by pwsh bench/run_all.ps1 (or bash bench/run_all.sh) on the 12 fixtures under bench/fixtures/. L3 activates only when post-L1+L2 output exceeds 300 tokens, so small outputs incur zero neural inference latency. L4 (context injection) is prepended to the L3 prompt when the Claude Code hook forwards transcript_path.

GPU Acceleration

Layer 3 Latency by Hardware

Phi-3 Mini Q5_K_M (3.8B). GPU drops p95 latency from ~900ms to under 100ms.

Hardware Backend p50 latency p95 latency Notes
NVIDIA RTX 5060 Ti CUDA ~30ms ~50ms Ada Lovelace, full offload
NVIDIA RTX 3060 CUDA ~50ms ~80ms 12GB VRAM, full offload
Apple M2 MacBook Pro Metal ~80ms ~150ms Unified memory, via Ollama Metal
Intel Xeon 4th Gen AMX ~150ms ~250ms Sapphire Rapids, AMX tiles
Intel Core i7-12700 AVX2 ~300ms ~500ms 12-core desktop, AVX2
Intel Core i5-8250U AVX2 ~600ms ~900ms 4-core laptop, baseline CPU

L3 only activates when output exceeds 300 tokens post-L1+L2. Small outputs always use L1+L2 (<5ms).