AI Infrastructure Reference

ClusterMAX Tier Reference

This page provides a ClusterMAX summary in one place so teams can compare providers without hunting across screenshots and social posts.

Source: clustermax.ai

Tiers listed: 6

Recommended tiers: 4

Providers listed: 82

ClusterMAX snapshot version: 2026-03-11

Last updated: 2026-03-12

What ClusterMAX Is

ClusterMAX is a tiered ranking of GPU cloud providers based on practical operational quality for AI workloads. Rather than relying on marketing claims alone, the tiers help teams quickly separate providers that are typically production-ready from those still maturing.

What It Does

  • Gives infra teams a fast shortlist of providers worth evaluating first for real workloads.
  • Provides a common language for discussing provider maturity across engineering, finance, and leadership.
  • Reduces selection risk by highlighting operational differences, not just raw GPU specs.
  • Helps teams decide where to pilot, where to scale, and which providers need deeper due diligence.

ClusterMAX Detailed Assessment Criteria

Based on the ClusterMAX 2.0 assessment worksheet structure, use the matrix below to score providers against weighted criteria, recording evidence in the notes column. The Fit column is left blank for each provider's observed score; a scoring sketch follows the table.

| Category | Criteria | Fit | Weight (points) | Target Score (of 5) | Notes / Evidence |
|---|---|---|---|---|---|
| Security | Relevant attestation (SOC2 Type 1, ISO 27001, etc.) | | 10 | 5 | Annual audit completion and certificate scope. |
| Security | Specific compliance for global customers | | 8 | 4 | GDPR, CCPA, and regional requirements documented. |
| Security | Secure backend network (InfiniBand PKeys, VLANs) | | 9 | 5 | Segmentation and tenant isolation enforced. |
| Security | Driver and firmware update process | | 7 | 4 | Monthly patching with rollback plan. |
| Lifecycle | Ease of onboarding/offboarding (no hidden costs) | | 6 | 5 | Contract and migration steps are clear. |
| Lifecycle | Ease of cluster creation | | 5 | 4 | Template-based deployment available. |
| Lifecycle | Ease of cluster expansion | | 4 | 4 | Autoscaling and capacity expansion path. |
| Lifecycle | Ease of cluster use (speed/latency) | | 6 | 4 | Job start latency and data path efficiency. |
| Lifecycle | Quality of support experience | | 7 | 5 | 24/7 Tier 1/2 response quality. |
| Orchestration | Cluster setup with reasonable defaults (OS, packages) | | 5 | 5 | Standard base image with GPU-ready defaults. |
| Orchestration | Ease of adding/removing users, groups and permissions | | 4 | 4 | Identity management and team boundaries. |
| Orchestration | Enforce RBAC on compute and storage resources | | 5 | 5 | Granular role controls applied. |
| Orchestration | Integration with external IAM provider for SSO | | 4 | 5 | SAML/OIDC integration in production path. |
| Orchestration | SLURM configuration (modules, pyxis/enroot, NCCL, topology) | | 8 | 5 | Full-stack HPC scheduler setup. |
| Orchestration | Kubernetes configuration (kubeconfig, CNI, GPU/operator) | | 8 | 5 | Production-safe cluster operator setup. |
| Orchestration | NCCL-tests/TorchTitan runs at expected MFU/bandwidth | | 7 | 5 | Performance targets hit in reproducible runs. |
| Orchestration | K8s multi-node disaggregated serving support (llm-d) | | 6 | 4 | Framework support for multi-node inference. |
| Storage | POSIX-compliant filesystem (Weka, VAST, DDN) | | 7 | 5 | Shared training filesystem deployed. |
| Storage | S3-compatible object storage available | | 6 | 5 | Internal/external gateway compatibility. |
| Storage | Mounts for /home and /data (or default RWM storage class) | | 5 | 5 | Persistent volumes available by default. |
| Storage | Local drives/distributed local FS for caching (/lvol) | | 4 | 4 | Fast local cache path provisioned. |
| Storage | Storage scalability and performance | | 6 | 5 | Sustained throughput at cluster scale. |
| Networking | InfiniBand or RoCEv2 available | | 8 | 5 | High-speed fabric available for distributed jobs. |
| Networking | MPI implementation available (HPC-X) | | 6 | 5 | Preinstalled and tested against baseline workloads. |
| Networking | Default NCCL configuration is reasonable | | 5 | 4 | Environment vars and topology settings are sane. |
| Networking | nccl-tests/all_reduce_benchmark runs at full bandwidth | | 7 | 5 | Bandwidth matches expected fabric profile. |
| Networking | Multinode TorchTitan training runs at expected MFU | | 7 | 5 | MFU targets observed in real workloads. |
| Networking | SHARP support for improved NCCL performance | | 4 | 3 | Supported natively or via extra setup. |
| Networking | NCCL monitoring plugin available | | 3 | 4 | Plugin captures performance and failure signals. |
| Networking | NCCL straggler detection available | | 3 | 4 | Detect and surface slow-rank behavior. |
| Reliability | Hardware uptime SLA (e.g., 99.9% compute) | | 9 | 5 | Published uptime commitment with penalties. |
| Reliability | 24x7 support, 15-minute response SLA | | 8 | 5 | P1 response timeline and escalation path. |
| Reliability | No link flapping on interconnect network | | 7 | 5 | Flap and packet-loss incidents tracked. |
| Reliability | No filesystems unmounting randomly | | 6 | 5 | Mount stability under load and failover. |
| Reliability | WAN connection stability and speed | | 5 | 4 | Cross-region reliability and throughput. |
| Reliability | Full suite of Passive Health Checks (DCGM, XIDs, ECC errors) | | 8 | 5 | Continuous health signals integrated. |
| Reliability | Full suite of Active Health Checks (lightweight, aggressive tests) | | 7 | 4 | Routine synthetic checks and soak tests. |
| Monitoring | Grafana/equivalent dashboard accessible (high/low-level views) | | 6 | 5 | Cluster + workload observability at multiple layers. |
| Monitoring | Easy to configure custom alerting | | 4 | 4 | Alert routing and severity controls available. |
| Monitoring | SLURM integration (sacct, job stats) | | 5 | 5 | Job-level metrics and historical records exposed. |
| Monitoring | Kubernetes integration (kube-state-metrics, dcgm-exporter) | | 5 | 5 | GPU and pod-level telemetry in one view. |
| Monitoring | DCGM information available (SM Active, TFLOPs, PCIe AER) | | 8 | 5 | Critical GPU health and utilization counters exposed. |
| Pricing | Lower prices per GPU-hour | | 10 | 5 | Compared against direct peers and hyperscalers. |
| Pricing | Consumption models (1 month, 1 year, etc.) | | 7 | 5 | Flexible commitment periods and discounts. |
| Pricing | Individual charges vs. bundled (storage, compute, network) | | 6 | 4 | Transparent line items and predictable overages. |
| Pricing | Expansion and extension of existing contracts | | 5 | 5 | Amendment and scaling terms are practical. |
| Partnerships | AMD or NVIDIA investment | | 4 | 4 | Signals long-term ecosystem alignment. |
| Partnerships | NVIDIA NCP/exemplar cloud performance certification | | 6 | 5 | Program participation and validity date. |
| Partnerships | AMD Cloud Alliance status | | 4 | 3 | Active collaboration signals for AMD roadmap. |
| Partnerships | Knowledge of security updates (e.g., follow Wiz advisories) | | 5 | 4 | Patch triage and advisory response cadence. |
| Partnerships | SchedMD partnership (SLURM) | | 3 | 4 | Access to expert scheduler support channels. |
| Partnerships | Participation in industry events and ecosystem support | | 4 | 5 | Conference participation and community engagement. |
| Availability | Total GPU quantity and cluster scale experience | | 8 | 5 | Capacity and proven large-cluster operations. |
| Availability | On-demand availability, utilization, capacity blocks | | 7 | 4 | Reservation and on-demand balance for growth. |
| Availability | Latest GPU models available (H100, B200, MI300X) | | 9 | 5 | Latest production GPU SKUs accessible. |
| Availability | Roadmap for future GPUs (B300, GB200, MI400) | | 7 | 4 | Confirmed access to next-generation hardware. |
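The worksheet implies a points-based aggregation: each criterion's fit (0 to 5) multiplied by its weight, summed, and compared against the maximum possible points. The exact ClusterMAX formula is not published on this page, so the sketch below is a minimal assumed linear weighting; the criteria subset is taken from the matrix, while the example `fits` values are hypothetical.

```python
# Minimal scoring sketch for the assessment worksheet above.
# Assumption: a provider's score is sum(weight * fit), normalized by
# the maximum possible points, sum(weight * MAX_FIT). The actual
# ClusterMAX aggregation may differ.

MAX_FIT = 5  # each criterion's fit is scored 0-5

# (category, criterion, weight) -- weights copied from the matrix above.
CRITERIA = [
    ("Security", "Relevant attestation (SOC2 Type 1, ISO 27001, etc.)", 10),
    ("Networking", "InfiniBand or RoCEv2 available", 8),
    ("Reliability", "Hardware uptime SLA (e.g., 99.9% compute)", 9),
    ("Pricing", "Lower prices per GPU-hour", 10),
]

def weighted_score(fits: dict[str, int]) -> float:
    """Return a provider's score as a percentage of maximum points."""
    earned = sum(weight * fits.get(name, 0) for _, name, weight in CRITERIA)
    possible = sum(weight * MAX_FIT for _, _, weight in CRITERIA)
    return 100.0 * earned / possible

# Hypothetical fit scores for a single provider.
fits = {
    "Relevant attestation (SOC2 Type 1, ISO 27001, etc.)": 5,
    "InfiniBand or RoCEv2 available": 5,
    "Hardware uptime SLA (e.g., 99.9% compute)": 4,
    "Lower prices per GPU-hour": 3,
}
print(f"Provider score: {weighted_score(fits):.1f}%")  # 84.3%
```

The same loop extends naturally to all 56 criteria; flagging rows where fit falls below the Target Score column is a useful complement to the single percentage.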

Full ClusterMAX Tier List

Covers every tier, including the underperforming and unavailable groups, for complete market visibility.

Silver

12 providers · Recommended in External Tools

Competitive providers with proven capabilities and growing production readiness.

Bronze

19 providers · Recommended in External Tools

Emerging and specialized providers with solid traction and differentiated offerings.

Underperforming

21 providers · Reference only

Providers listed in underperforming tier snapshots.

Unavailable

24 providers · Reference only

Providers listed in unavailable/not-ready tier snapshots.