HPC Operations & Support
Managed HPC service with 24/7 monitoring, proactive maintenance, and incident response. A Mevasis HPC support contract keeps your supercomputer in production while your team focuses on science.
The most expensive moments for an HPC system are when researchers cannot get into the queue at the time they need it, or when a job runs for hours before failing silently. Our HPC operations and support service, also known as managed HPC, takes day-to-day ownership of the system so your team can focus on science and engineering, not sysadmin work.
What “managed” actually means here
This is not a break-fix contract that wakes up only when something is on fire. We operate the system proactively and on evidence:
- 24/7 metric monitoring — catching problems before users notice them
- Regular health checks and trend analysis — fixing small drifts before they become incidents
- A predictable SLA — response times are contractual, not aspirational
- Monthly capacity and efficiency reports — concrete deliverables for management
Service scope
24/7 monitoring and alert management
- Hardware: GPU temperatures, fan/PSU health, NVMe wear, ECC error rates
- Network: InfiniBand link quality, packet loss, switch memory/CPU load
- Storage: parallel filesystem health, metadata server load, capacity alerts
- Scheduler: queue backlog, failed-job rate, fairshare anomalies
- Mevasis NOC triages critical alerts within 15 minutes
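As an illustration of how the 15-minute triage target could be checked against alert records, here is a minimal Python sketch. The `Alert` record, its field names, and the `triage_breaches` helper are hypothetical and are not part of any Mevasis tooling; a real monitoring stack will have its own schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# 15-minute triage target for critical alerts, as stated in the service scope.
TRIAGE_TARGET = timedelta(minutes=15)


@dataclass
class Alert:
    # Hypothetical record layout, assumed for this sketch only.
    name: str
    severity: str                          # e.g. "critical" or "warning"
    raised_at: datetime
    triaged_at: Optional[datetime] = None  # None means not yet triaged


def triage_breaches(alerts, now):
    """Return critical alerts whose triage took, or is taking, longer than the target."""
    breaches = []
    for alert in alerts:
        if alert.severity != "critical":
            continue
        # For alerts that are still untriaged, measure elapsed time up to `now`.
        end = alert.triaged_at if alert.triaged_at is not None else now
        if end - alert.raised_at > TRIAGE_TARGET:
            breaches.append(alert)
    return breaches
```

A check like this could feed the monthly report, so that the triage target is measured rather than assumed.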
Regular maintenance
- Monthly maintenance windows: patches, firmware updates, disk health checks
- Quarterly on-site visits: dust removal, cable checks, fan/PSU swaps
- Annual capacity and performance audit — is the system still running like it did on day one?
- Security patches: CVE-prioritised vulnerability management
Incident response
- Tier-1 / Tier-2 / Tier-3 escalation, calibrated to your SLA
- For hardware failures, same-day replacement from our spares stock
- We manage manufacturer warranties and run RMA processes on your behalf
- Post-incident root cause analysis (RCA) reports
Software and licence management
- Bright Cluster Manager / xCAT / Warewulf updates
- SLURM queue policy and fairshare tuning
- BeeGFS / Lustre configuration optimisation
- Licence tracking (commercial and open-source); proactive renewal of expiring licences
- User and quota administration (request-based workflow)
Performance optimisation
- Workload-specific tuning (compiler flags, MPI parameters, NUMA topology)
- Identification and removal of I/O bottlenecks
- GPU utilisation analysis fed back into user training
- Monthly efficiency report: utilisation, queue statistics, top users
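To make the contents of the monthly efficiency report more concrete, the following Python sketch computes utilisation, a queue statistic, the failed-job rate, and top users from a list of job records. The record fields (`user`, `node_hours`, `wait_hours`, `state`) and the function name are assumptions for illustration; in practice the data would come from the scheduler's accounting records, and the actual report format may differ.

```python
from collections import Counter
from statistics import median


def efficiency_report(jobs, total_node_hours, top_n=3):
    """Summarise one reporting period from hypothetical job records.

    jobs: list of dicts with keys "user", "node_hours", "wait_hours", "state".
    total_node_hours: node-hours the cluster could have delivered in the period.
    """
    if not jobs:
        raise ValueError("no jobs in reporting period")

    used = sum(job["node_hours"] for job in jobs)
    failed = sum(1 for job in jobs if job["state"] == "FAILED")

    # Accumulate consumed node-hours per user to rank the heaviest users.
    per_user = Counter()
    for job in jobs:
        per_user[job["user"]] += job["node_hours"]

    return {
        "utilisation": used / total_node_hours,
        "median_wait_hours": median(job["wait_hours"] for job in jobs),
        "failed_job_rate": failed / len(jobs),
        "top_users": per_user.most_common(top_n),
    }
```

Median wait time is used here rather than the mean because a few very long waits would otherwise dominate the figure; that choice is a design assumption of this sketch.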
SLA tiers
| Tier | Response time | Coverage | Best for |
|---|---|---|---|
| Standard | 8 hours | Business hours, remote support | Research labs, small teams |
| Business Critical | 4 hours | 24/7, remote + on-site | Production systems, university centres |
| Mission Critical | 1 hour | 24/7, dedicated engineer | Industrial HPC, simulation-bound processes |
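The response times in the table can be expressed as a small lookup, sketched below in Python. The function names are hypothetical, and the sketch counts wall-clock time for every tier; how the Standard tier's clock runs outside business hours is not specified in the table, so treating it as wall-clock is an assumption.

```python
from datetime import datetime, timedelta

# Response times per tier, taken from the SLA table above.
SLA_RESPONSE = {
    "Standard": timedelta(hours=8),
    "Business Critical": timedelta(hours=4),
    "Mission Critical": timedelta(hours=1),
}


def response_deadline(tier, opened_at):
    """Latest acceptable first-response time (wall-clock assumption)."""
    return opened_at + SLA_RESPONSE[tier]


def met_sla(tier, opened_at, responded_at):
    """True if the first response arrived on or before the deadline."""
    return responded_at <= response_deadline(tier, opened_at)
```

For example, a Mission Critical incident opened at 02:00 would need a first response by 03:00 under this model.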
Customer outcome
- Significant savings vs. an in-house HPC sysadmin team — especially around 24/7 cover
- A measurable drop in user complaints (typically 70% or more in steady-state customers)
- Long, reliable operation even after warranty thanks to spares stock and disciplined maintenance
- For management: monthly reportable uptime and utilisation metrics
Frequently asked questions
Another integrator built our system. Can we move support to you?
We have an internal team and only need on-call escalation, not full outsourcing.
How fast does an on-site engineer arrive in an emergency?
Will we be locked into your management tooling?
Next stage
Even a stable system eventually needs to grow. The HPC Capacity Expansion page explains how we scale the system without taking production offline.