NPU LabsNPU LABS
A stack of four NVIDIA DGX Spark units wired to a MikroTik switch on a desk.

Private AI Stack

Sized for your work load

How much private compute you need is set by the inference throughput you run, not a fixed product tier. A single Spark sits on a desk. Four units make a node, a 512 GB cluster on one switch. Past that, nodes scale out for as much aggregate throughput as the workload demands.

RecommendedIndicative hardware ~R900k

One-node DGX Spark cluster

The team build: four units pooled into a single 512 GB cluster, sized for a department's day-to-day experimentation and inference.

512 GB

Unified memory

pooled across 4 units

~400B-class

Model size

pooled inference

4× 200G

Switched fabric

any-to-any

~960W

Total draw

standard wall circuit

Topology

DGX Spark 1DGX Spark 2DGX Spark 3DGX Spark 4CRS804 switchone node · 4 units

Powered from a standard 10A wall circuit. 10GbE management per unit over your existing network. No dedicated power or networking infrastructure required.

In the node

NVIDIA DGX Spark ×4

Compute

GB10 Grace Blackwell, 128 GB unified memory and 4 TB NVMe per unit. Four units pooled into one node for distributed serving of a single large model.

MikroTik CRS804-4DDQ-hRM

Interconnect

Four 400G QSFP56-DD ports run at 200G to match ConnectX-7. One per unit, fully populated for a node. A second node adds its own switch and routes across.

QSFP56-DD → QSFP112 DAC

Cabling

One identical short-run passive-copper cable per unit, switch to ConnectX-7. Nothing exotic to source.

Indicative figures only. They include a safety margin and will move with supplier pricing, exchange rates, and import costs. Final pricing is confirmed on a written quote. As an all-in example, a single-node deployment lands around R1.0M to R1.2M: ~R900k hardware plus the 13 to 20 day setup below. Final scope depends on your throughput targets, identity provider, and compliance environment.

Model strategy

A portfolio, not one model.

A private AI platform runs a portfolio of model routing, RAG, guardrails, observability, and workload-specific endpoints. A node is four DGX Spark units; capacity scales from a single Spark to a multi-node cluster, with the model chosen to fit the workload.

Tier 1

Single DGX Spark

1 unit · 128 GB · up to ~200B (NVIDIA)

  • Gemma 4
  • Qwen3.6-35B-A3B
  • Qwen3-Coder-Next
  • Qwen3 Embedding 8B · BGE-M3
  • Parakeet · Voxtral · Kokoro

Embeddings, reranking, private chatbot, summarisation, policy Q&A, voice pre/post-processing.

Tier 2Default

One DGX Spark node

4 units · 512 GB · 405B-class

  • NVIDIA Nemotron 3 Super
  • Qwen3.6-35B-A3B
  • Qwen3-Coder-Next
  • Gemma 4 31B
  • Mistral Large 3 (distributed)

Private enterprise assistant, developer guardrails, repo analysis, agentic coding, RAG, multimodal document understanding, model bake-offs.

Tier 3

Multi-node cluster

8+ units · distributed inference

  • Kimi K2.6
  • DeepSeek V4 Pro
  • GLM-5.1
  • MiMo-V2.5-Pro
  • NVIDIA Nemotron 3 Ultra

Frontier coding agents, long-horizon autonomous workflows, 1M-token reasoning, multi-agent orchestration, regulated high-value workloads.

NVIDIA documents one DGX Spark for models up to ~200B and a dual-Spark link up to ~405B. Larger frontier models run via NVFP4 variants, sharding, and multi-node serving, all validated per workload before client production.

See the full model registry

Serving architecture

Every request passes the guardrail layer.

No model is reached directly. Requests enter through one gateway, clear policy and guardrails, then route to the endpoint and node sized for the workload, with audit and observability on every path.

Client application
API gateway

Policy & guardrail layer

Audit logs
PII detection & redactionPrompt-injection detectionData classificationTool allow-lists

Model router

RAG pipelineObservability

Workload endpoints

General assistantDeveloper agentFrontier reasoningEmbedding & rerankingSpeech-to-textText-to-speech

DGX Spark node pool

Tier 1-2 workloads

Multi-node cluster

Frontier models

Professional services

Stack Configuration

The hardware arrives configured at the DGX OS level. From there, six configuration steps turn it into a governed, observable platform wired into your network. Roughly 13 to 20 days for a standard stack.

3-4 days

Monitoring

Prometheus and Grafana across the hardware, inference, and model-cost layers, tracking utilisation, TTFT, latency, and per-team token spend, with alerts.

2-3 days

Model usage per user

Built on DeepEval's cost and efficiency metrics, tracking spend and tokens per user and per task, with insights into which models complete the work economically and which burn budget.

2-3 days

Authentication

An API gateway in front of every endpoint, per-team keys with rotation, LDAP/AD or SSO integration, and TLS everywhere.

2-3 days

Model registry

A central registry of models, versions, and quantisations, with staged promotion to production, one-step rollback, and provenance for every deployed weight.

3-5 days

Guardrails

PII detection and redaction pre- and post-model, prompt-injection detection, output filtering, tool allow-lists, and red-team testing.

1-2 days

Network integration to VPNs

The lab wired into your corporate network over site-to-site or client VPN. Private endpoints only, firewall rules scoped per team, nothing exposed to the public internet.

Deployment

From order to operational in 4.5 to 8 weeks.

Hardware is purchased from a trusted vendor.

  1. 01 / PHASE2-4 weeks

    Procurement

    All hardware ordered at once

    DGX Spark units, switch and cables.

  2. 02 / PHASE2.5-4 weeks

    Configuration

    Begins on hardware arrival

    Six workstreams, ~13-20 days, run on-site once everything is racked.

  3. 03 / PHASE4.5-8 weeks

    Operational

    Working, governed, observed

    Total order-to-operational for a standard deployment on an existing network.

Want to size the right setup for your organisation?

Start with a conversation. We will work through your workloads, the data-residency constraints, and the budget envelope, and come back with a concrete spec.