Neurox Workload cluster

The Neurox Workload management cluster is where GPU workloads run on GPU nodes. When deployed standalone, it requires neither ingress nor persistent disk. Typically, the Neurox Workload components are installed together with the Neurox Control plane components in a single combined Kubernetes cluster.

This page outlines the requirements for deploying standalone Neurox Workload components into additional Kubernetes GPU clusters. Neurox Workload can autodetect many Cloud Service Provider (CSP) environments, automatically surfacing metadata such as region or availability zone and identifying the models of the attached GPUs.

Multi-Cluster setup

One of the best features of Neurox is the ability to monitor multiple Neurox Workload clusters from a single Neurox Control plane. Common use cases include joining GPU clusters from different cloud providers, or even on-premises clusters.

Please see our pricing plans to determine how many Neurox Workload clusters may be joined into a Neurox Control cluster.

Cluster requirements

Kubernetes and kubectl CLI 1.29+
Helm CLI 3.8+
4 CPUs
8 GB of RAM
At least 1 GPU node
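
The CLI minimums above can be checked before installing anything. The following is a minimal sketch, not part of the Neurox tooling: the helper names `version_ge` and `check_min` are hypothetical, and it assumes GNU `sort -V` is available for version-aware comparison (a plain string comparison would wrongly rank 3.14 below 3.8).

```shell
#!/usr/bin/env bash
# Preflight sketch: compare installed CLI versions against the minimums listed above.

# version_ge A B: succeeds when version A >= version B (relies on GNU sort -V).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# check_min NAME FOUND REQUIRED: prints one OK/FAIL line for a tool.
check_min() {
  if version_ge "$2" "$3"; then
    echo "OK: $1 $2 (need $3+)"
  else
    echo "FAIL: $1 $2 (need $3+)"
    return 1
  fi
}

# Only query tools that are actually on PATH; report results without aborting.
if command -v kubectl >/dev/null 2>&1; then
  kubectl_ver=$(kubectl version --client -o json |
    sed -n 's/.*"gitVersion": *"v\([0-9.]*\).*/\1/p' | head -n1)
  check_min kubectl "$kubectl_ver" 1.29 || true
fi
if command -v helm >/dev/null 2>&1; then
  helm_ver=$(helm version --short | sed 's/^v\([0-9.]*\).*/\1/')
  check_min helm "$helm_ver" 3.8 || true
fi
```

This only checks the client tools. The Kubernetes server version, CPU, and RAM still need to be confirmed against the cluster itself.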

Prerequisites

You will need both the NVIDIA GPU Operator and the Kube Prometheus Stack to run the Neurox Workload chart.

NVIDIA GPU Operator

Required to run GPU workloads. Install with:

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --create-namespace -n gpu-operator gpu-operator nvidia/gpu-operator --version=v25.3.0

For more information on how to configure the NVIDIA GPU Operator, see: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#procedure
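
Once the operator is running, nodes with GPUs should advertise an allocatable `nvidia.com/gpu` resource. The sketch below lists those nodes; the function name `gpu_nodes` is hypothetical, and the check is a general Kubernetes technique rather than anything specific to Neurox.

```shell
#!/usr/bin/env bash
# gpu_nodes: print "NODE COUNT" for every node that advertises allocatable GPUs.
# Nodes without the nvidia.com/gpu resource show "<none>" and are filtered out.
gpu_nodes() {
  kubectl get nodes --no-headers \
    -o 'custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' |
    awk '$2 + 0 > 0 { print $1, $2 }'
}
```

Run `kubectl get pods -n gpu-operator` first to confirm the operator pods are up, then call `gpu_nodes`. Empty output means no node is advertising GPUs yet, which can simply mean the operator is still initializing.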

Kube Prometheus Stack

Required to gather metrics. Install with:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# This is the minimum required configuration. Feel free to enable components if you need them.
helm install --create-namespace -n monitoring kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --set alertmanager.enabled=false \
  --set grafana.enabled=false \
  --set prometheus.enabled=false

For more information on how to configure kube-prometheus-stack, see: https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack
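
A quick way to confirm the chart installed is to check that the Prometheus Operator CRDs exist. This is a hedged sketch: `missing_crds` is a hypothetical helper, and this page does not say which CRDs Neurox Workload actually consumes, so the CRD names in the usage note are only the standard ones shipped by the Prometheus Operator.

```shell
#!/usr/bin/env bash
# missing_crds CRD...: print each named CRD that is not present in the cluster.
# Prints nothing when every CRD is found.
missing_crds() {
  for crd in "$@"; do
    kubectl get crd "$crd" >/dev/null 2>&1 || echo "$crd"
  done
}
```

For example, `missing_crds servicemonitors.monitoring.coreos.com podmonitors.monitoring.coreos.com prometheusrules.monitoring.coreos.com` should print nothing after a successful install. `helm status kube-prometheus-stack -n monitoring` shows the state of the release itself.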

Credentials

Your Neurox subdomain
Your Neurox Workload auth secret (provided by Neurox Control)
Your Neurox registry username and password
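
Since the install commands read these credentials from shell variables, it can help to fail early when one is unset. The following is a minimal sketch; `require_vars` is a hypothetical helper, and the variable names in the usage note are the ones used by the install script on this page.

```shell
#!/usr/bin/env bash
# require_vars NAME...: succeed only if every named variable is set and non-empty.
# The body runs in a subshell so its temporary variables do not leak to the caller.
require_vars() (
  missing=0
  for name in "$@"; do
    # Indirect lookup: read the value of the variable whose name is in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  [ "$missing" -eq 0 ]
)
```

A typical call before installing would be `require_vars NEUROX_DOMAIN WORKLOAD_AUTH_SECRET NEUROX_USERNAME NEUROX_PASSWORD || exit 1`.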

Install

To join a Neurox Workload cluster to an existing Neurox Control cluster, open your Neurox Control portal and go to Clusters > New Cluster. The portal generates a complete install script, including the auth secret, that you can copy and paste.

The example below is based on the output of the generated install script:

CLUSTER_NAME=iks-mlops-us-south-dal12 # customize this

NEUROX_DOMAIN=random-words.goneurox.com
WORKLOAD_AUTH_SECRET=yourworkloadauthsecret
NEUROX_HELM_REGISTRY=oci://ghcr.io/neuroxhq/helm-charts
NEUROX_IMAGE_REGISTRY=registry.neurox.com
NEUROX_USERNAME=random-words-goneurox-com
NEUROX_PASSWORD=yourregistrypassword

kubectl create ns neurox
# Quote the variables so secrets containing special characters are passed intact.
kubectl create secret generic -n neurox neurox-control-auth \
  --from-literal=shared-secret="${WORKLOAD_AUTH_SECRET}"
kubectl create secret docker-registry -n neurox neurox-image-registry \
  --docker-server="${NEUROX_IMAGE_REGISTRY}" \
  --docker-username="${NEUROX_USERNAME}" \
  --docker-password="${NEUROX_PASSWORD}"

helm install neurox-workload "${NEUROX_HELM_REGISTRY}/neurox-workload" \
  --namespace neurox \
  --set global.workloadCluster.name="${CLUSTER_NAME}" \
  --set global.controlHost="${NEUROX_DOMAIN}"
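
After the install command returns, the release and its pods can be checked from the command line. This is a sketch under stated assumptions: `verify_workload` is a hypothetical helper, the defaults mirror the namespace and release name used above, and waiting on all pods assumes the chart creates no run-to-completion pods, which never report Ready.

```shell
#!/usr/bin/env bash
# verify_workload [NAMESPACE] [RELEASE] [TIMEOUT]
# Confirms the Helm release exists, then waits for the namespace's pods to be Ready.
# The body runs in a subshell, so "exit" only ends the function, not the caller.
verify_workload() (
  ns=${1:-neurox}
  release=${2:-neurox-workload}
  timeout=${3:-300s}
  if ! helm status "$release" -n "$ns" >/dev/null 2>&1; then
    echo "release $release not found in namespace $ns" >&2
    exit 1
  fi
  kubectl wait --for=condition=Ready pods --all -n "$ns" --timeout="$timeout"
)
```

Calling `verify_workload` with no arguments checks the `neurox-workload` release in the `neurox` namespace. If the wait times out, `kubectl get pods -n neurox` shows which pods are not yet ready.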

