
Private AI Stack
Sized for your workload
How much private compute you need is set by the inference throughput you run, not a fixed product tier. A single Spark sits on a desk. Four units make a node, a 512 GB cluster on one switch. Past that, nodes scale out for as much aggregate throughput as the workload demands.
One-node DGX Spark cluster
The team build: four units pooled into a single 512 GB cluster, sized for a department's day-to-day experimentation and inference.
512 GB
Unified memory
pooled across 4 units
~405B-class
Model size
pooled inference
4× 200G
Switched fabric
any-to-any
~960W
Total draw
standard wall circuit
Topology
Powered from a standard 10A wall circuit. 10GbE management per unit over your existing network. No dedicated power or networking infrastructure required.
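The node figures above follow from per-unit numbers. A minimal sketch of that arithmetic, assuming the per-unit draw is simply the ~960 W total divided by four and that the wall circuit is 230 V (neither is stated above; both are labelled as assumptions in the code):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SparkUnit:
    memory_gb: int = 128     # unified memory per unit (from the spec above)
    fabric_gbps: int = 200   # one 200G link per unit (from the spec above)
    draw_w: float = 240.0    # ASSUMPTION: ~960 W total divided by 4 units


def node_summary(units: int = 4, unit: SparkUnit = SparkUnit(),
                 circuit_amps: float = 10.0, volts: float = 230.0) -> dict:
    """Aggregate memory, fabric links, and power for a pool of units.

    volts is an ASSUMPTION (230 V mains); adjust it for your site.
    """
    total_w = units * unit.draw_w
    circuit_w = circuit_amps * volts
    return {
        "memory_gb": units * unit.memory_gb,
        "fabric": f"{units}x {unit.fabric_gbps}G",
        "draw_w": total_w,
        "circuit_headroom_w": circuit_w - total_w,
    }


summary = node_summary()
print(summary)
```

Changing `units` to 1 or 8 gives the same figures for a single Spark or a two-node cluster under the same assumptions.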
In the node
NVIDIA DGX Spark ×4
Compute: GB10 Grace Blackwell, 128 GB unified memory and 4 TB NVMe per unit. Four units pooled into one node for distributed serving of a single large model.
MikroTik CRS804-4DDQ-hRM
Interconnect: Four 400G QSFP56-DD ports run at 200G to match ConnectX-7. One per unit, fully populated for a node. A second node adds its own switch and routes across.
QSFP56-DD → QSFP112 DAC
Cabling: One identical short-run passive-copper cable per unit, switch to ConnectX-7. Nothing exotic to source.
Indicative figures only. They include a safety margin and will move with supplier pricing, exchange rates, and import costs. Final pricing is confirmed on a written quote. As an all-in example, a single-node deployment lands around R1.0M to R1.2M: ~R900k hardware plus the 13 to 20 day setup below. Final scope depends on your throughput targets, identity provider, and compliance environment.
Model strategy
A portfolio, not one model.
A private AI platform runs a portfolio of models behind shared routing, RAG, guardrails, observability, and workload-specific endpoints. A node is four DGX Spark units; capacity scales from a single Spark to a multi-node cluster, with the model chosen to fit the workload.
Single DGX Spark
1 unit · 128 GB · up to ~200B (NVIDIA)
- Gemma 4
- Qwen3.6-35B-A3B
- Qwen3-Coder-Next
- Qwen3 Embedding 8B · BGE-M3
- Parakeet · Voxtral · Kokoro
Embeddings, reranking, private chatbot, summarisation, policy Q&A, voice pre/post-processing.
One DGX Spark node
4 units · 512 GB · 405B-class
- NVIDIA Nemotron 3 Super
- Qwen3.6-35B-A3B
- Qwen3-Coder-Next
- Gemma 4 31B
- Mistral Large 3 (distributed)
Private enterprise assistant, developer guardrails, repo analysis, agentic coding, RAG, multimodal document understanding, model bake-offs.
Multi-node cluster
8+ units · distributed inference
- Kimi K2.6
- DeepSeek V4 Pro
- GLM-5.1
- MiMo-V2.5-Pro
- NVIDIA Nemotron 3 Ultra
Frontier coding agents, long-horizon autonomous workflows, 1M-token reasoning, multi-agent orchestration, regulated high-value workloads.
NVIDIA documents one DGX Spark for models up to ~200B and a dual-Spark link up to ~405B. Larger frontier models run via NVFP4 variants, sharding, and multi-node serving, all validated per workload before client production.
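As a rough check on those size limits, weight memory can be estimated as parameters times bits per weight divided by eight. The sketch below assumes 4-bit weights (NVFP4 is a 4-bit format) and an 80% usable-memory fraction; that fraction is an illustrative assumption, not an NVIDIA figure. It ignores quantisation scale factors, KV cache growth, and activations, so it is not a substitute for per-workload validation.

```python
import math


def weight_gb(params_billion: float, bits_per_weight: float = 4.0) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes).

    Ignores quantisation scale factors, KV cache, and activations.
    """
    return params_billion * bits_per_weight / 8.0


def units_needed(params_billion: float, bits_per_weight: float = 4.0,
                 unit_gb: float = 128.0, usable_fraction: float = 0.8) -> int:
    """Smallest unit count whose usable pooled memory holds the weights.

    usable_fraction is an ASSUMPTION: the share of unified memory left for
    weights after KV cache, runtime, and OS overhead.
    """
    usable_per_unit = unit_gb * usable_fraction
    return math.ceil(weight_gb(params_billion, bits_per_weight) / usable_per_unit)


for size in (200, 405):
    print(size, weight_gb(size), units_needed(size))
```

Under these assumptions a ~200B model fits on one unit and a ~405B model on two, which lines up with the documented limits; the same 405B model at 16-bit weights would need eight units.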
See the full model registry
Serving architecture
Every request passes the guardrail layer.
No model is reached directly. Requests enter through one gateway, clear policy and guardrails, then route to the endpoint and node sized for the workload, with audit and observability on every path.
Policy & guardrail layer
Model router
Workload endpoints
DGX Spark node pool
Tier 1-2 workloads
Multi-node cluster
Frontier models
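The request path described above (gateway, then policy and guardrails, then router, then endpoint and pool) can be sketched as a small pipeline. Everything named here is hypothetical and for illustration only: the routing table, the endpoint names, and the single blocked phrase stand in for real policy and are not the platform's actual configuration.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    team: str
    workload: str            # e.g. "rag" or "coding-agent" (hypothetical labels)
    prompt: str
    audit: list = field(default_factory=list)


# Hypothetical routing table: workload -> (endpoint, compute pool).
ROUTES = {
    "rag": ("rag-endpoint", "spark-node-pool"),
    "coding-agent": ("agent-endpoint", "multi-node-cluster"),
}

# Toy policy only; a real guardrail layer is far richer than a phrase list.
BLOCKED_PHRASES = ("ignore previous instructions",)


def guardrail(req: Request) -> None:
    """Reject requests that fail policy; always leave an audit entry."""
    lowered = req.prompt.lower()
    if any(phrase in lowered for phrase in BLOCKED_PHRASES):
        req.audit.append("guardrail:blocked")
        raise PermissionError("request blocked by policy")
    req.audit.append("guardrail:passed")


def route(req: Request) -> tuple:
    """Pick the endpoint and pool sized for the workload."""
    endpoint, pool = ROUTES[req.workload]
    req.audit.append(f"router:{endpoint}@{pool}")
    return endpoint, pool


def handle(req: Request) -> tuple:
    """Single entry point: guardrails always run before routing."""
    req.audit.append("gateway:received")
    guardrail(req)
    return route(req)
```

Because `handle` is the only public entry point and it calls `guardrail` before `route`, no code path reaches an endpoint without a policy decision and an audit trail.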
Professional services
Stack Configuration
The hardware arrives configured at the DGX OS level. From there, six configuration steps turn it into a governed, observable platform wired into your network. Roughly 13 to 20 days for a standard stack.
Monitoring
Prometheus and Grafana across the hardware, inference, and model-cost layers, tracking utilisation, TTFT, latency, and per-team token spend, with alerts.
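To make the tracked quantities concrete, here is a minimal sketch of how TTFT and per-team token totals can be derived from raw request events. It uses plain Python rather than the Prometheus client, and the event fields (`start`, `first_token`, `tokens`) and the 2-second alert threshold are assumptions made for illustration.

```python
import statistics
from collections import defaultdict


def ttft_seconds(request_start: float, first_token_at: float) -> float:
    """Time to first token: delay between request start and first output token."""
    return first_token_at - request_start


def summarise(events: list) -> dict:
    """Group events by team; report median TTFT and total tokens per team."""
    by_team = defaultdict(list)
    for event in events:
        by_team[event["team"]].append(event)
    return {
        team: {
            "median_ttft_s": statistics.median(
                ttft_seconds(e["start"], e["first_token"]) for e in rows
            ),
            "tokens": sum(e["tokens"] for e in rows),
        }
        for team, rows in by_team.items()
    }


def ttft_alerts(summary: dict, threshold_s: float = 2.0) -> list:
    """Teams whose median TTFT exceeds the (assumed) alert threshold."""
    return sorted(t for t, s in summary.items() if s["median_ttft_s"] > threshold_s)
```

In a deployed stack the same quantities would normally be exported as metrics and the threshold expressed as an alert rule; the sketch only shows what is being measured.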
Model usage per user
Built on DeepEval's cost and efficiency metrics, tracking spend and tokens per user and per task, with insights into which models complete the work economically and which burn budget.
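The per-user, per-task accounting can be illustrated with a small aggregation. This sketch does not use DeepEval's API; it is plain Python, and the model names, record fields, and per-1,000-token rates are hypothetical values chosen only to show the calculation.

```python
from collections import defaultdict

# Hypothetical internal cost rates per 1,000 tokens, in rand.
RATE_PER_1K = {"small-model": 0.5, "large-model": 4.0}


def record_cost(record: dict) -> float:
    """Cost of one request at the assumed rate for its model."""
    return record["tokens"] / 1000 * RATE_PER_1K[record["model"]]


def usage_report(records: list) -> dict:
    """Sum tokens, cost, and completed tasks per (user, task)."""
    report = defaultdict(lambda: {"tokens": 0, "cost": 0.0, "completed": 0})
    for r in records:
        row = report[(r["user"], r["task"])]
        row["tokens"] += r["tokens"]
        row["cost"] += record_cost(r)
        row["completed"] += 1 if r["completed"] else 0
    return dict(report)


def cost_per_completion(records: list, model: str) -> float:
    """Total spend on a model divided by the tasks it actually completed."""
    rows = [r for r in records if r["model"] == model]
    done = sum(1 for r in rows if r["completed"])
    cost = sum(record_cost(r) for r in rows)
    return cost / done if done else float("inf")
```

Cost per completed task is the number that separates models that finish the work economically from those that burn budget: a model with cheap tokens but a low completion rate can still come out more expensive.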
Authentication
An API gateway in front of every endpoint, per-team keys with rotation, LDAP/AD or SSO integration, and TLS everywhere.
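Per-team keys with rotation can be sketched as follows. This is an illustrative in-memory model, not the gateway's actual implementation: the `TeamKeys` class and the one-hour grace period are assumptions. It stores only key digests and keeps the previous key valid for a grace window so clients can switch over without downtime.

```python
import hashlib
import hmac
import secrets
import time
from typing import Optional


class TeamKeys:
    """Stores only SHA-256 digests; a rotated key stays valid for a grace period."""

    def __init__(self, grace_seconds: float = 3600.0):
        self.grace_seconds = grace_seconds
        self._keys = {}  # team -> list of (digest, expires_at or None)

    @staticmethod
    def _digest(key: str) -> str:
        return hashlib.sha256(key.encode()).hexdigest()

    def rotate(self, team: str, now: Optional[float] = None) -> str:
        """Issue a new key and schedule the current one to expire after the grace period."""
        now = time.time() if now is None else now
        current = [
            (digest, exp if exp is not None else now + self.grace_seconds)
            for digest, exp in self._keys.get(team, [])
        ]
        new_key = secrets.token_urlsafe(32)
        self._keys[team] = current + [(self._digest(new_key), None)]
        return new_key

    def verify(self, team: str, key: str, now: Optional[float] = None) -> bool:
        """Constant-time digest comparison against the team's unexpired keys."""
        now = time.time() if now is None else now
        candidate = self._digest(key)
        return any(
            hmac.compare_digest(candidate, digest) and (exp is None or now < exp)
            for digest, exp in self._keys.get(team, [])
        )
```

The plaintext key is returned once from `rotate` and never stored, so a leaked key table does not expose usable credentials.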
Model registry
A central registry of models, versions, and quantisations, with staged promotion to production, one-step rollback, and provenance for every deployed weight.
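Staged promotion and one-step rollback can be modelled with a small data structure. The sketch below is a hypothetical in-memory registry for illustration; the field names and the choice of a SHA-256 digest plus a source string as provenance are assumptions, not a description of a specific registry product.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ModelVersion:
    name: str
    version: str
    quantisation: str
    sha256: str   # provenance: digest of the deployed weights
    source: str   # provenance: where the weights were obtained


@dataclass
class Registry:
    staging: dict = field(default_factory=dict)     # name -> ModelVersion
    production: dict = field(default_factory=dict)  # name -> ModelVersion
    history: dict = field(default_factory=dict)     # name -> earlier production versions

    def stage(self, mv: ModelVersion) -> None:
        """Register a candidate without touching production."""
        self.staging[mv.name] = mv

    def promote(self, name: str) -> ModelVersion:
        """Move the staged version to production, remembering what it replaced."""
        mv = self.staging.pop(name)
        if name in self.production:
            self.history.setdefault(name, []).append(self.production[name])
        self.production[name] = mv
        return mv

    def rollback(self, name: str) -> ModelVersion:
        """One step back: restore the most recently replaced production version."""
        previous = self.history[name].pop()
        self.production[name] = previous
        return previous
```

Keeping the replaced version in `history` at promotion time is what makes rollback a single call rather than a redeployment.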
Guardrails
PII detection and redaction pre- and post-model, prompt-injection detection, output filtering, tool allow-lists, and red-team testing.
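Two of these controls, redaction on both sides of the model and tool allow-lists, are easy to show in miniature. The regular expressions below are deliberately simple illustrations and would miss many real PII formats; the team names and tool names are hypothetical.

```python
import re

# Illustrative patterns only; production PII detection needs far broader coverage.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}

# Hypothetical per-team tool allow-lists.
TOOL_ALLOW_LIST = {
    "legal": {"search_docs"},
    "dev": {"search_docs", "run_tests"},
}


def redact(text: str) -> str:
    """Replace each detected PII span with a bracketed label."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


def tool_allowed(team: str, tool: str) -> bool:
    """Deny by default: unknown teams and unlisted tools are rejected."""
    return tool in TOOL_ALLOW_LIST.get(team, set())


def guarded_call(model, prompt: str) -> str:
    """Redact before the model sees the prompt and again on its output."""
    return redact(model(redact(prompt)))
```

Here `model` is any callable taking and returning a string, so the wrapper can be exercised with a stub before it is placed in front of a real endpoint.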
Network integration over VPN
The stack is wired into your corporate network over site-to-site or client VPN. Private endpoints only, firewall rules scoped per team, nothing exposed to the public internet.
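Per-team firewall scoping amounts to checking a source address and a target endpoint against that team's rule. A minimal sketch using Python's standard `ipaddress` module, with hypothetical subnets and endpoint names (real rules would live in the firewall or VPN gateway, not in application code):

```python
import ipaddress

# Hypothetical per-team rules: source subnets allowed to reach named endpoints.
TEAM_RULES = {
    "legal": {"subnets": ["10.20.0.0/24"], "endpoints": {"rag-endpoint"}},
    "dev": {"subnets": ["10.30.0.0/23"], "endpoints": {"rag-endpoint", "agent-endpoint"}},
}


def is_allowed(team: str, source_ip: str, endpoint: str) -> bool:
    """Allow only private sources inside the team's subnets, to the team's endpoints."""
    rule = TEAM_RULES.get(team)
    if rule is None:
        return False  # deny by default for unknown teams
    ip = ipaddress.ip_address(source_ip)
    if not ip.is_private:
        return False  # nothing from the public internet
    in_subnet = any(ip in ipaddress.ip_network(subnet) for subnet in rule["subnets"])
    return in_subnet and endpoint in rule["endpoints"]
```

The rule set is deny-by-default: a request passes only if the team is known, the source is a private address inside one of its subnets, and the endpoint is on its list.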
Deployment
From order to operational in 4.5 to 8 weeks.
Hardware is purchased from a trusted vendor.
- 01 / PHASE · 2-4 weeks
Procurement
All hardware ordered at once
DGX Spark units, switch and cables.
- 02 / PHASE · 2.5-4 weeks
Configuration
Begins on hardware arrival
Six workstreams, ~13-20 days, run on-site once everything is racked.
- 03 / PHASE · 4.5-8 weeks
Operational
Working, governed, observed
Total order-to-operational for a standard deployment on an existing network.
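The total is the sum of the two sequential phases, since configuration begins only when the hardware arrives. A short sketch of that arithmetic, assuming a five-day working week when converting the ~13-20 configuration days into weeks (the working-week length is an assumption, not stated above):

```python
# Phase durations in weeks (low, high), taken from the phases above.
PHASES_WEEKS = {
    "procurement": (2.0, 4.0),
    "configuration": (2.5, 4.0),
}


def total_weeks(phases: dict) -> tuple:
    """Sequential phases: the overall range is the sum of the lows and of the highs."""
    low = sum(lo for lo, _ in phases.values())
    high = sum(hi for _, hi in phases.values())
    return low, high


def days_to_weeks(days: float, working_days_per_week: int = 5) -> float:
    """Convert working days to weeks. ASSUMPTION: a five-day working week."""
    return days / working_days_per_week


print(total_weeks(PHASES_WEEKS))
print(days_to_weeks(13), days_to_weeks(20))
```

On that assumption, 13 to 20 working days is 2.6 to 4 weeks, which the configuration phase rounds to 2.5-4 weeks.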
Want to size the right setup for your organisation?
Start with a conversation. We will work through your workloads, the data-residency constraints, and the budget envelope, and come back with a concrete spec.