Neural Token Killer

Semantic compression proxy daemon for Claude Code. Reduces tool output by 70–99% before it reaches the LLM context, using three progressive compression layers with optional local neural inference.

☆ GitHub
install: curl -fsSL https://ntk.valraw.com/install.sh | sh

99%
max savings
<1ms
L1+L2 latency
3
compression layers
0
runtime deps
How it works

Three layers, one result

Each layer activates only when needed, keeping latency near zero for small outputs.

L1
Fast Filter

ANSI removal, line deduplication, test failure extraction. Always on. <1ms.

L2
Tokenizer-Aware

cl100k_base BPE token counting, path shortening. Always on. <5ms.

L3
Local Inference

Ollama/Phi-3 Mini with type-specific prompts. Only triggers when output >300 tokens.
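
The gating logic described above can be sketched as follows. This is a minimal illustration, not NTK's actual code: the ANSI stripping covers only CSI sequences, the deduplication is a simple seen-set, and the token count is a crude chars/4 heuristic standing in for real cl100k_base BPE counting.

```rust
use std::collections::HashSet;

/// L1 sketch: strip ANSI CSI escape sequences and drop duplicate lines.
fn layer1(input: &str) -> String {
    let mut out = String::new();
    let mut chars = input.chars().peekable();
    while let Some(c) = chars.next() {
        if c == '\x1b' {
            // CSI sequence: ESC '[' parameters, terminated by a byte in 0x40..=0x7E.
            if chars.peek() == Some(&'[') {
                chars.next();
                while let Some(&n) = chars.peek() {
                    chars.next();
                    if ('\x40'..='\x7e').contains(&n) {
                        break;
                    }
                }
            }
        } else {
            out.push(c);
        }
    }
    // Deduplicate repeated lines while preserving first-seen order.
    let mut seen = HashSet::new();
    out.lines()
        .filter(|l| seen.insert(l.to_string()))
        .collect::<Vec<_>>()
        .join("\n")
}

/// L2 stand-in: NTK counts real cl100k_base tokens; ~4 chars/token is a rough proxy.
fn approx_tokens(s: &str) -> usize {
    s.len() / 4
}

/// L3 gate: local inference fires only when post-L1+L2 output exceeds 300 tokens.
fn needs_l3(post_l2: &str) -> bool {
    approx_tokens(post_l2) > 300
}

fn main() {
    let raw = "\x1b[32mok\x1b[0m line\nok line\nok line";
    let cleaned = layer1(raw);
    assert_eq!(cleaned, "ok line");
    assert!(!needs_l3(&cleaned)); // small output: L1+L2 only, no inference
}
```

Because the layers are ordered cheapest-first, most outputs never reach the expensive neural step at all.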

Live demo

Installs in 30 seconds

One command installs the binary, patches Claude Code's settings.json with the PostToolUse hook, and creates the config file.

ntk start → live TUI on 127.0.0.1:8765

Neural Token Killer
v0.2.0 · 127.0.0.1:8765
phi3:mini q5_k_m · candle [GPU]
● RUNNING

72,243 tokens saved · 85% avg ratio · 847 compressions

Layer Distribution: L1 42% · L2 35% · L3 23%

Recent:
L3  cargo test    −94%
L1  git status    −62%
L2  tsc --noEmit  −83%
Features

Built for developer workflows

⚡
Sub-millisecond L1+L2
Regex + tokenizer layers add <5ms overhead to every Bash tool call.
🧠
Local AI, zero cloud
Phi-3 Mini runs 100% on your machine. No API keys, no data sent to the cloud.
🔌
RTK compatible
Works alongside RTK. RTK filters first, NTK semantically summarizes the result.
🎯
Type-aware compression
Different prompts for test output, build errors, logs, diffs. Extracts exactly what matters.
🖥
GPU acceleration
Auto-detects CUDA (NVIDIA), ROCm (AMD), Metal (Apple Silicon), AMX, AVX-512. L3 latency drops from 800ms to <100ms on GPU.
📊
Live TUI dashboard
ntk start opens a full-screen dashboard with real-time metrics, per-layer stats, recent commands, and the active model with GPU/CPU mode. Updates every 500ms. Ctrl+C stops the daemon gracefully.
🌐
Cross-platform
Single Rust binary. Works on Windows, macOS, and Linux without any runtime dependencies.
Installation guide

Get Started with NTK

NTK is a single Rust binary. The setup takes under 5 minutes.

1
Prerequisites

NTK requires Claude Code (the CLI) installed and configured. Layer 3 inference requires Ollama (recommended) or a compatible backend. CPU-only mode works without any AI backend.

bash
# Check Claude Code is installed
claude --version

# Install Ollama (optional, recommended for L3)
curl -fsSL https://ollama.ai/install.sh | sh
ollama serve &
2
Install NTK

The install script downloads the latest release binary for your OS and architecture:

bash - Linux / macOS
curl -fsSL https://ntk.valraw.com/install.sh | sh
PowerShell - Windows
irm https://ntk.valraw.com/install.ps1 | iex

Or install from source:

bash - from source
cargo install ntk
3
Initialize hook

This patches ~/.claude/settings.json to add the PostToolUse hook, copies the hook script, creates ~/.ntk/config.json, and automatically launches the model setup wizard to configure your inference backend:

bash
ntk init -g

The -g flag patches the global settings; omit it for per-project setup. The operation is idempotent, so it is safe to run multiple times. Use --hook-only to skip the model wizard.
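
For reference, PostToolUse hooks in settings.json follow Claude Code's hook schema: a matcher selects the tool, and a command receives the tool result. The excerpt below is illustrative only; the exact matcher and ntk command that the installer registers may differ.

```json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Bash",
        "hooks": [
          { "type": "command", "command": "ntk hook" }
        ]
      }
    ]
  }
}
```

ntk init merges an entry of this shape into the existing file rather than overwriting it, which is what makes re-running it safe.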

4
Install model

Pull Phi-3 Mini (~2GB) via Ollama for Layer 3 semantic compression:

bash
ntk model pull          # pulls phi3:mini Q5_K_M

# Or choose a specific quantization:
ntk model pull --quant q4_k_m  # smaller, faster
ntk model pull --quant q6_k   # higher quality

Skip this step to run in L1+L2 only mode (no neural inference, <5ms latency).

5
Start daemon
bash
ntk start               # CPU mode - opens live TUI dashboard
ntk start --gpu         # GPU acceleration (CUDA/Metal auto-detected)

# Daemon already running? ntk start attaches to the live TUI without restarting.
# Ctrl+C exits the TUI - daemon stays running.

# For a quick non-interactive snapshot:
ntk dashboard           # status + gain + bar chart → stdout, then exit
6
Verify installation
bash
ntk status              # daemon status + model info
ntk test-compress <file> # test on a captured output
ntk gain                # view token savings
Configuration

Edit ~/.ntk/config.json or place a .ntk.json in your project root for per-project overrides:

~/.ntk/config.json
{
  "compression": {
    "inference_threshold_tokens": 300
  },
  "model": {
    "provider": "ollama",
    "quantization": "q5_k_m",
    "gpu_layers": -1,
    "fallback_to_layer1_on_timeout": true
  },
  "exclusions": {
    "commands": ["cat", "echo"]
  },
  "telemetry": { "enabled": true }
}
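
The precedence (a project .ntk.json overriding ~/.ntk/config.json) can be sketched as a field-wise overlay. The struct fields and merge semantics below are assumptions for illustration, not NTK's actual code; field names mirror the config keys above.

```rust
/// Hypothetical subset of the global config (~/.ntk/config.json).
#[derive(Debug)]
struct Config {
    inference_threshold_tokens: u32,
    provider: String,
}

/// Optional per-project overrides (.ntk.json); None means "inherit global".
#[derive(Default)]
struct Overrides {
    inference_threshold_tokens: Option<u32>,
    provider: Option<String>,
}

/// Field-wise overlay: any value set in the project file wins.
fn merge(global: &Config, project: &Overrides) -> Config {
    Config {
        inference_threshold_tokens: project
            .inference_threshold_tokens
            .unwrap_or(global.inference_threshold_tokens),
        provider: project
            .provider
            .clone()
            .unwrap_or_else(|| global.provider.clone()),
    }
}

fn main() {
    let global = Config { inference_threshold_tokens: 300, provider: "ollama".into() };
    let project = Overrides { inference_threshold_tokens: Some(500), ..Default::default() };
    let merged = merge(&global, &project);
    assert_eq!(merged.inference_threshold_tokens, 500); // project wins
    assert_eq!(merged.provider, "ollama");              // inherited from global
}
```

This is the merged view that ntk config prints (see the Command Reference below).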
Uninstall
bash
ntk stop
ntk init --uninstall    # removes hook from settings.json
rm -rf ~/.ntk           # remove config & data
Reference

Command Reference

All NTK commands. Prefix every command with ntk.

Daemon
start
Start the compression daemon on port 8765 and open the live TUI dashboard. If the daemon is already running, attaches to the live TUI without restarting.
start --gpu
Start with GPU acceleration (CUDA/ROCm/Metal auto-detected).
stop
Stop the daemon.
status
Show daemon status, loaded model, GPU info, and uptime.
dashboard
Combined static snapshot: daemon status + session gain + ASCII bar chart. Prints to stdout and exits immediately, making it safe for scripts and CI.
Setup & Init
init -g
Initialize globally: patch settings.json, create ~/.ntk/config.json, then auto-launch model setup wizard.
init --show
Display current hook installation status.
init --uninstall
Remove the PostToolUse hook from settings.json.
init --auto-patch
Non-interactive mode for CI/CD pipelines.
init --hook-only
Install hook script only - skip config.json creation and model setup wizard.
Model
model pull
Download phi3:mini (default Q5_K_M, ~2GB) via Ollama.
model pull --quant q4_k_m
Download a specific quantization (q4_k_m, q5_k_m, q6_k).
model setup
Interactive backend selector (Ollama / Candle / llama.cpp) with GPU/CPU hardware detection. Runs automatically after ntk init.
model test
Test model latency and output quality with a sample prompt.
model test --debug
Verbose test: shows thread config, mlock status, system prompt preview, timing breakdown, and performance analysis with CPU-tier-aware targets (mobile ≥5 tok/s, desktop ≥10, high-end ≥15, GPU ≥40).
model bench
Benchmark CPU vs GPU inference latency.
model list
List available models in the configured backend.
Compression
test-compress <file>
Run the full pipeline on a captured output file and print result.
test
Run correctness tests on all compression layers. No daemon required.
test --l3
Include Layer 3 inference in the test run.
bench
Benchmark all compression layers (default: 5 runs per payload).
bench --runs <N>
Set number of benchmark runs per payload for stable measurements.
config
Show the active merged configuration (global + project overrides).
config --file <path>
Show configuration from a specific file path.
Metrics & Analytics
metrics
Session metrics table in stdout (plain text).
graph
ASCII bar chart + sparkline of savings over time.
gain
Token savings summary (RTK-compatible output format).
history
Last 50 compressed commands with token counts and layer used.
discover
Analyze Claude Code session for missed compression opportunities.

💡 RTK users: prefix with rtk ntk <cmd> to also compress NTK's own output.

Token savings

Savings by Command Type

Measured against real-world command output captured during development sessions.

99%
Max savings (vitest)
L1+L2+L3
🧠
~85%
Avg NTK+RTK combined
across all commands
<1ms
L1+L2 overhead
always on, zero impact
📉
Faster responses
less context = less latency
Category          Commands                        NTK Savings
Tests             vitest, playwright, cargo test  90–99%
Build             next build, tsc, cargo build    70–87%
Git               status, log, diff, add, commit  59–80%
GitHub CLI        gh pr, gh run, gh issue         26–87%
Package Managers  pnpm, npm, npx, cargo           70–90%
Files / Search    ls, grep, find, read            60–75%
Infrastructure    docker, kubectl                 85%
Network           curl, wget                      65–70%

Combined savings formula: 1 − (1 − rtk%) × (1 − ntk_incremental%)

Category          RTK alone  NTK incremental  NTK+RTK combined
Tests             90%        ~90%             ~99%
Build             83%        ~24%             ~87%
Git               70%        ~33%             ~79%
GitHub CLI        80%        ~35%             ~87%
Package Managers  85%        ~33%             ~90%
Infrastructure    85%        0%               ~85%
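
The combined-savings formula can be sanity-checked row by row. A tiny sketch, using values from the table above:

```rust
/// Combined savings per the formula: 1 − (1 − rtk) × (1 − ntk_incremental).
fn combined(rtk: f64, ntk_incremental: f64) -> f64 {
    1.0 - (1.0 - rtk) * (1.0 - ntk_incremental)
}

fn main() {
    // Git row: RTK 70%, NTK incremental ~33% → ≈79.9% combined.
    assert!((combined(0.70, 0.33) - 0.799).abs() < 1e-9);
    // Infrastructure row: NTK adds nothing on top of RTK's 85%.
    assert!((combined(0.85, 0.0) - 0.85).abs() < 1e-9);
}
```

Intuitively, each tool only compresses what the previous one left behind, so the residuals multiply.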

How savings are measured

Token counts use cl100k_base (tiktoken-rs), the same tokenizer as Claude/GPT. Measurements are taken on real captured outputs from development sessions: cargo test suites, tsc compiler errors, vitest runs, git operations, and docker logs. Layer 3 activates only when post-L1+L2 output exceeds 300 tokens, so small outputs incur zero neural inference latency.

GPU Acceleration

Layer 3 Latency by Hardware

Phi-3 Mini Q5_K_M (3.8B). GPU drops p95 latency from ~900ms to under 100ms.

Hardware              Backend  p50 latency  p95 latency  Notes
NVIDIA RTX 5060 Ti    CUDA     ~30ms        ~50ms        Blackwell, full offload
NVIDIA RTX 3060       CUDA     ~50ms        ~80ms        12GB VRAM, full offload
Apple M2 MacBook Pro  Metal    ~80ms        ~150ms       Unified memory, via Ollama Metal
Intel Xeon 4th Gen    AMX      ~150ms       ~250ms       Sapphire Rapids, AMX tiles
Intel Core i7-12700   AVX2     ~300ms       ~500ms       12-core desktop, AVX2
Intel Core i5-8250U   AVX2     ~600ms       ~900ms       4-core laptop, baseline CPU

L3 only activates when output exceeds 300 tokens post-L1+L2. Small outputs always use L1+L2 (<5ms).