AI Infrastructure Reference

ClusterMAX Tier Reference

This page provides a ClusterMAX summary in one place so teams can compare providers without hunting across screenshots and social posts.

Source: clustermax.ai

Tiers listed: 6

Recommended tiers: 4

Providers listed: 82

ClusterMAX snapshot version: 2026-03-11

Last updated: 2026-03-12

What ClusterMAX Is

ClusterMAX is a tiered ranking of GPU cloud providers based on practical operational quality for AI workloads. Rather than relying on marketing claims alone, the tiers help teams quickly separate providers that are typically production-ready from those still maturing.

What It Does

  • Gives infra teams a fast shortlist of providers worth evaluating first for real workloads.
  • Provides a common language for discussing provider maturity across engineering, finance, and leadership.
  • Reduces selection risk by highlighting operational differences, not just raw GPU specs.
  • Helps teams decide where to pilot, where to scale, and which providers need deeper due diligence.

ClusterMAX Detailed Assessment Criteria

Based on the ClusterMAX 2.0 assessment worksheet structure, use the matrix below to score providers against weighted criteria, recording evidence in the notes column. The Fit column is left blank for each provider's observed score; a scoring sketch follows the table.

| Category | Criteria | Fit | Weight (points) | Target Score (of 5) | Notes / Evidence |
|---|---|---|---|---|---|
| Security | Relevant attestation (SOC2 Type 1, ISO 27001, etc.) | | 10 | 5 | Annual audit completion and certificate scope. |
| Security | Specific compliance for global customers | | 8 | 4 | GDPR, CCPA, and regional requirements documented. |
| Security | Secure backend network (InfiniBand PKeys, VLANs) | | 9 | 5 | Segmentation and tenant isolation enforced. |
| Security | Driver and firmware update process | | 7 | 4 | Monthly patching with rollback plan. |
| Lifecycle | Ease of onboarding/offboarding (no hidden costs) | | 6 | 5 | Contract and migration steps are clear. |
| Lifecycle | Ease of cluster creation | | 5 | 4 | Template-based deployment available. |
| Lifecycle | Ease of cluster expansion | | 4 | 4 | Autoscaling and capacity expansion path. |
| Lifecycle | Ease of cluster use (speed/latency) | | 6 | 4 | Job start latency and data path efficiency. |
| Lifecycle | Quality of support experience | | 7 | 5 | 24/7 Tier 1/2 response quality. |
| Orchestration | Cluster setup with reasonable defaults (OS, packages) | | 5 | 5 | Standard base image with GPU-ready defaults. |
| Orchestration | Ease of adding/removing users, groups and permissions | | 4 | 4 | Identity management and team boundaries. |
| Orchestration | Enforce RBAC on compute and storage resources | | 5 | 5 | Granular role controls applied. |
| Orchestration | Integration with external IAM provider for SSO | | 4 | 5 | SAML/OIDC integration in production path. |
| Orchestration | SLURM configuration (modules, pyxis/enroot, NCCL, topology) | | 8 | 5 | Full-stack HPC scheduler setup. |
| Orchestration | Kubernetes configuration (kubeconfig, CNI, GPU/operator) | | 8 | 5 | Production-safe cluster operator setup. |
| Orchestration | NCCL-tests/TorchTitan runs at expected MFU/bandwidth | | 7 | 5 | Performance targets hit in reproducible runs. |
| Orchestration | K8s multi-node disaggregated serving support (llm-d) | | 6 | 4 | Framework support for multi-node inference. |
| Storage | POSIX-compliant filesystem (Weka, VAST, DDN) | | 7 | 5 | Shared training filesystem deployed. |
| Storage | S3-compatible object storage available | | 6 | 5 | Internal/external gateway compatibility. |
| Storage | Mounts for /home and /data (or default RWM storage class) | | 5 | 5 | Persistent volumes available by default. |
| Storage | Local drives/distributed local FS for caching (/lvol) | | 4 | 4 | Fast local cache path provisioned. |
| Storage | Storage scalability and performance | | 6 | 5 | Sustained throughput at cluster scale. |
| Networking | InfiniBand or RoCEv2 available | | 8 | 5 | High-speed fabric available for distributed jobs. |
| Networking | MPI implementation available (HPC-X) | | 6 | 5 | Preinstalled and tested against baseline workloads. |
| Networking | Default NCCL configuration is reasonable | | 5 | 4 | Environment vars and topology settings are sane. |
| Networking | nccl-tests/all_reduce_benchmark runs at full bandwidth | | 7 | 5 | Bandwidth matches expected fabric profile. |
| Networking | Multinode TorchTitan training runs at expected MFU | | 7 | 5 | MFU targets observed in real workloads. |
| Networking | SHARP support for improved NCCL performance | | 4 | 3 | Supported natively or via extra setup. |
| Networking | NCCL monitoring plugin available | | 3 | 4 | Plugin captures performance and failure signals. |
| Networking | NCCL straggler detection available | | 3 | 4 | Detect and surface slow-rank behavior. |
| Reliability | Hardware uptime SLA (e.g., 99.9% compute) | | 9 | 5 | Published uptime commitment with penalties. |
| Reliability | 24x7 support, 15-minute response SLA | | 8 | 5 | P1 response timeline and escalation path. |
| Reliability | No link flapping on interconnect network | | 7 | 5 | Flap and packet-loss incidents tracked. |
| Reliability | No filesystems unmounting randomly | | 6 | 5 | Mount stability under load and failover. |
| Reliability | WAN connection stability and speed | | 5 | 4 | Cross-region reliability and throughput. |
| Reliability | Full suite of Passive Health Checks (DCGM, XIDs, ECC errors) | | 8 | 5 | Continuous health signals integrated. |
| Reliability | Full suite of Active Health Checks (lightweight, aggressive tests) | | 7 | 4 | Routine synthetic checks and soak tests. |
| Monitoring | Grafana/equivalent dashboard accessible (high/low-level views) | | 6 | 5 | Cluster + workload observability at multiple layers. |
| Monitoring | Easy to configure custom alerting | | 4 | 4 | Alert routing and severity controls available. |
| Monitoring | SLURM integration (sacct, job stats) | | 5 | 5 | Job-level metrics and historical records exposed. |
| Monitoring | Kubernetes integration (kube-state-metrics, dcgm-exporter) | | 5 | 5 | GPU and pod-level telemetry in one view. |
| Monitoring | DCGM information available (SM Active, TFLOPs, PCIe AER) | | 8 | 5 | Critical GPU health and utilization counters exposed. |
| Pricing | Lower prices per GPU-hour | | 10 | 5 | Compared against direct peers and hyperscalers. |
| Pricing | Consumption models (1 month, 1 year, etc.) | | 7 | 5 | Flexible commitment periods and discounts. |
| Pricing | Individual charges vs. bundled (storage, compute, network) | | 6 | 4 | Transparent line items and predictable overages. |
| Pricing | Expansion and extension of existing contracts | | 5 | 5 | Amendment and scaling terms are practical. |
| Partnerships | AMD or NVIDIA investment | | 4 | 4 | Signals long-term ecosystem alignment. |
| Partnerships | NVIDIA NCP/exemplar cloud performance certification | | 6 | 5 | Program participation and validity date. |
| Partnerships | AMD Cloud Alliance status | | 4 | 3 | Active collaboration signals for AMD roadmap. |
| Partnerships | Knowledge of security updates (e.g., follow Wiz advisories) | | 5 | 4 | Patch triage and advisory response cadence. |
| Partnerships | SchedMD partnership (SLURM) | | 3 | 4 | Access to expert scheduler support channels. |
| Partnerships | Participation in industry events and ecosystem support | | 4 | 5 | Conference participation and community engagement. |
| Availability | Total GPU quantity and cluster scale experience | | 8 | 5 | Capacity and proven large-cluster operations. |
| Availability | On-demand availability, utilization, capacity blocks | | 7 | 4 | Reservation and on-demand balance for growth. |
| Availability | Latest GPU models available (H100, B200, MI300X) | | 9 | 5 | Latest production GPU SKUs accessible. |
| Availability | Roadmap for future GPUs (B300, GB200, MI400) | | 7 | 4 | Confirmed access to next-generation hardware. |
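The worksheet implies a points-based aggregation: each criterion's fit (0 to 5) multiplied by its weight, summed, and compared against the maximum possible points. The exact ClusterMAX formula is not published on this page, so the sketch below is a minimal assumed linear weighting; the criteria subset is taken from the matrix, while the example `fits` values are hypothetical.

```python
# Minimal scoring sketch for the assessment worksheet above.
# Assumption: a provider's score is sum(weight * fit), normalized by
# the maximum possible points, sum(weight * MAX_FIT). The actual
# ClusterMAX aggregation may differ.

MAX_FIT = 5  # each criterion's fit is scored 0-5

# (category, criterion, weight) -- weights copied from the matrix above.
CRITERIA = [
    ("Security", "Relevant attestation (SOC2 Type 1, ISO 27001, etc.)", 10),
    ("Networking", "InfiniBand or RoCEv2 available", 8),
    ("Reliability", "Hardware uptime SLA (e.g., 99.9% compute)", 9),
    ("Pricing", "Lower prices per GPU-hour", 10),
]

def weighted_score(fits: dict[str, int]) -> float:
    """Return a provider's score as a percentage of maximum points."""
    earned = sum(weight * fits.get(name, 0) for _, name, weight in CRITERIA)
    possible = sum(weight * MAX_FIT for _, _, weight in CRITERIA)
    return 100.0 * earned / possible

# Hypothetical fit scores for a single provider.
fits = {
    "Relevant attestation (SOC2 Type 1, ISO 27001, etc.)": 5,
    "InfiniBand or RoCEv2 available": 5,
    "Hardware uptime SLA (e.g., 99.9% compute)": 4,
    "Lower prices per GPU-hour": 3,
}
print(f"Provider score: {weighted_score(fits):.1f}%")  # 84.3%
```

The same loop extends naturally to all 56 criteria; flagging rows where fit falls below the Target Score column is a useful complement to the single percentage.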

Full ClusterMAX Tier List

Covers every tier, including the underperforming and unavailable groups, for complete market visibility.

Silver

12 providers · Recommended in External Tools

Competitive providers with proven capabilities and growing production readiness.

Bronze

19 providers · Recommended in External Tools

Emerging and specialized providers with solid traction and differentiated offerings.

Underperforming

21 providers · Reference only

Providers listed in underperforming tier snapshots.

Unavailable

24 providers · Reference only

Providers listed in unavailable/not-ready tier snapshots.