Edge Inference

Frontier models, locally served.

Edge Inference is the serving layer for organizations that cannot — or will not — send their data to a third-party API. It runs frontier-class language and vision models on commodity hardware, in environments that may never reach the public internet, with serving-grade latency and tenant-grade observability.

▍ The product

What an operator sees.

Node Dashboard — the operator’s view of an edge node in the field. Hardware, model, throughput, latency, and a tail of recent requests. No upstream traffic required to populate any of it.

Edge Node·node/edge-fwd-base-07 · sub-7 · zone-bravo

Air-gapped

MODEL

claude-edge-7b · int8 · MoE

HARDWARE

2× L40S · 96 GB

UPTIME

47d 12h

LAST SYNC

↑ 04-29 09:00 UTC

REQ / SEC

182

P50 LATENCY

182 ms

P99 LATENCY

612 ms

GPU UTIL.

74 %

Latency · last 60 minp50 / p99 (ms)

window 60m · step 1m

Recent requests

tail · 4

TS	ROUTE	MODEL	LAT (ms)	CODE
10:41:08.214	/v1/chat/completions	claude-edge-7b	168	200
10:41:08.087	/v1/chat/completions	claude-edge-7b	204	200
10:41:07.951	/v1/embeddings	embed-edge-l	32	200
10:41:07.762	/v1/chat/completions	vision-edge-3b	511	200

Illustrative interface — values are design fixtures, not benchmarks

▍ The problem

The most important inference is the one happening closest to the operator.

In a forward operating base, a substation, or a cath lab, network reliability is not assumed and data sovereignty is not negotiable. Sending payloads to a hyperscaler is not an option. Edge Inference makes frontier-class capability available where the work is — without the SaaS contract, without the egress, and without a posture downgrade when connectivity drops.

▍ Capabilities

What ships in the box.

Quantized model bundles

Signed, versioned model bundles (4-bit, 8-bit, mixed-precision) optimized for the hardware footprint you actually have — not the one a vendor demoed on.

GPU scheduler

Preemptive, mixed-precision scheduling across heterogeneous GPUs. Predictable tail latency under contention.

Local serving runtime

Batched, KV-cached, structured-output capable. Compatible with the OpenAI and Anthropic message schemas your applications already speak.

Cold-start optimization

Warm in seconds, not minutes. Built for field hardware that gets power-cycled, not data-center metal that runs forever.

Telemetry buffer

Local-first observability with bounded buffers that sync upstream only when reachable — and only what your policy allows.

Air-gapped updates

Signed model and runtime updates via offline media. Verifiable, reproducible, reversible.

▍ Use in sector

Concrete deployments.

Reference scenarios — drawn from active design-partner conversations and prior operator engagements.

DEFENSE
Frontier-class reasoning on a Toughbook in a SCIF, serving a multi-modal agent that reads classified imagery without it ever crossing a network boundary.
ENERGY
On-substation models for fault classification and protection coordination, running on hardened industrial GPUs with hours of offline operation budgeted.
HEALTHCARE
Hospital-resident models for clinical documentation and decision support, with PHI never leaving the customer’s VLAN.
MARITIME / LOGISTICS
Vessel-onboard inference for routing, exception handling, and customs documentation during the long offline stretches between ports.

Default deployment posture

Air-gapped

NVIDIA · AMD · Apple Silicon

Heterogeneous

Reproducible, reversible updates

Signed bundles

OpenAI · Anthropic schemas

API-compatible

← Previous

Agent Infrastructure

Decision Intelligence