The most expensive moments for an HPC system are those when researchers cannot get a job into the queue when they need to, or when a job runs for hours before failing silently. Our HPC operations and support service, also known as managed HPC, takes day-to-day ownership of the system so your team can focus on science and engineering rather than sysadmin work.
What “managed” actually means here
This is not a break-fix contract that wakes up only when something is on fire. We operate the system proactively and on evidence:
- 24/7 metric monitoring — catching problems before users notice them
- Regular health checks and trend analysis — fixing small drifts before they become incidents
- A predictable SLA — response times are contractual, not aspirational
- Monthly capacity and efficiency reports — concrete deliverables for management
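As an illustration of what trend analysis can look like in practice, the sketch below fits a straight line to daily storage-usage samples and estimates how many days remain before a capacity threshold is crossed. The sample readings, the 90% threshold, and the function name `days_until_threshold` are assumptions made for this example; they are not part of the service definition.

```python
from typing import Optional, Sequence


def days_until_threshold(
    usage_pct: Sequence[float], threshold_pct: float = 90.0
) -> Optional[float]:
    """Estimate the days left until usage reaches threshold_pct.

    usage_pct holds one sample per day, oldest first. Returns None when
    there are fewer than two samples or usage is flat or falling.
    """
    n = len(usage_pct)
    if n < 2:
        return None
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(usage_pct) / n
    # Ordinary least-squares slope: percentage points gained per day.
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, usage_pct))
    den = sum((x - mean_x) ** 2 for x in xs)
    slope = num / den
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    # Solve slope * day + intercept = threshold, then count from the last sample.
    crossing_day = (threshold_pct - intercept) / slope
    return max(0.0, crossing_day - (n - 1))


if __name__ == "__main__":
    samples = [70.0, 70.5, 71.0, 71.5, 72.0]  # hypothetical daily readings
    print(days_until_threshold(samples))
```

A linear fit is the simplest possible model. Real usage often grows in bursts, so a forecast like this is a prompt for a closer look, not a prediction to rely on.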
Service scope
24/7 monitoring and alert management
- Hardware: GPU temperatures, fan/PSU health, NVMe wear, ECC error rates
- Network: InfiniBand link quality, packet loss, switch memory/CPU load
- Storage: parallel filesystem health, metadata server load, capacity alerts
- Scheduler: queue backlog, failed-job rate, fairshare anomalies
- Mevasis NOC triages critical alerts within 15 minutes
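The monitoring categories above can be pictured as a set of thresholds applied to incoming readings. The following is a minimal sketch, assuming hypothetical metric names and limits; real thresholds depend on the hardware vendor's specifications and on the site, and the names `THRESHOLDS`, `Alert`, and `classify` are invented for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical thresholds: metric name -> (warning, critical).
# These numbers are illustrative only, not recommended limits.
THRESHOLDS: Dict[str, Tuple[float, float]] = {
    "gpu_temp_celsius": (80.0, 90.0),
    "nvme_wear_percent": (70.0, 90.0),
    "ib_packet_loss_percent": (0.1, 1.0),
    "failed_job_rate_percent": (5.0, 20.0),
}


@dataclass
class Alert:
    node: str
    metric: str
    value: float
    severity: str  # "warning" or "critical"


def classify(node: str, readings: Dict[str, float]) -> List[Alert]:
    """Turn raw readings from one node into alerts; unknown metrics are skipped."""
    alerts: List[Alert] = []
    for metric, value in readings.items():
        if metric not in THRESHOLDS:
            continue
        warning, critical = THRESHOLDS[metric]
        if value >= critical:
            alerts.append(Alert(node, metric, value, "critical"))
        elif value >= warning:
            alerts.append(Alert(node, metric, value, "warning"))
    return alerts


if __name__ == "__main__":
    for alert in classify("node01", {"gpu_temp_celsius": 92.0, "nvme_wear_percent": 75.0}):
        print(alert)
```

In this sketch only alerts with severity "critical" would be routed for the 15-minute triage; warnings would feed the regular health checks.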
Regular maintenance
- Monthly maintenance windows: patches, firmware updates, disk health checks
- Quarterly on-site visits: dust removal, cable checks, fan/PSU swaps
- Annual capacity and performance audit — is the system still running like it did on day one?
- Security patches: CVE-prioritised vulnerability management
Incident response
- Tier-1 / Tier-2 / Tier-3 escalation, calibrated to your SLA
- For hardware failures, same-day replacement from our spares stock
- We manage manufacturer warranties and run RMA processes on your behalf
- Post-incident root cause analysis (RCA) reports
Software and licence management
- Bright Cluster Manager / xCAT / Warewulf updates
- SLURM queue policy and fairshare tuning
- BeeGFS / Lustre configuration optimisation
- Licence tracking (commercial and open-source); proactive renewal of expiring licences
- User and quota administration (request-based workflow)
Performance optimisation
- Workload-specific tuning (compiler flags, MPI parameters, NUMA topology)
- Identification and removal of I/O bottlenecks
- GPU utilisation analysis fed back into user training
- Monthly efficiency report: utilisation, queue statistics, top users
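To make the monthly efficiency report concrete, here is a rough sketch of how utilisation, failed-job rate, and top users could be derived from job accounting records. The `Job` record layout and the function `efficiency_report` are assumptions for illustration; in practice the input would come from the scheduler's accounting data.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Sequence, Tuple


class Job(NamedTuple):
    user: str
    cores: int
    hours: float  # wall-clock hours the job held its cores
    failed: bool


def efficiency_report(
    jobs: Sequence[Job], total_cores: int, period_hours: float, top_n: int = 3
) -> Dict[str, object]:
    """Summarise one reporting period from a list of finished jobs."""
    per_user: Dict[str, float] = defaultdict(float)
    for job in jobs:
        per_user[job.user] += job.cores * job.hours
    used = sum(per_user.values())
    available = total_cores * period_hours
    failed = sum(1 for job in jobs if job.failed)
    # Rank users by consumed core-hours, largest first.
    top: List[Tuple[str, float]] = sorted(
        per_user.items(), key=lambda kv: kv[1], reverse=True
    )[:top_n]
    return {
        "utilisation": used / available if available else 0.0,
        "failed_job_rate": failed / len(jobs) if jobs else 0.0,
        "top_users": top,
    }


if __name__ == "__main__":
    sample = [
        Job("alice", 64, 10.0, False),
        Job("bob", 32, 5.0, True),
        Job("alice", 16, 10.0, False),
    ]
    print(efficiency_report(sample, total_cores=128, period_hours=10.0))
```

Core-hours are used here for simplicity; a GPU cluster would more likely report GPU-hours, and the same structure applies.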
SLA tiers
| Tier | Response time | Coverage | Best for |
|---|---|---|---|
| Standard | 8 hours | Business hours, remote support | Research labs, small teams |
| Business Critical | 4 hours | 24/7, remote + on-site | Production systems, university centres |
| Mission Critical | 1 hour | 24/7, dedicated engineer | Industrial HPC, simulation-bound processes |
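The response times in the table can be expressed as a simple deadline check. The sketch below is illustrative only: it assumes the SLA clock runs continuously from the moment a ticket is opened, which matches the two 24/7 tiers, whereas the Standard tier would additionally need a business-hours calendar that is not modelled here. The function names are hypothetical.

```python
from datetime import datetime, timedelta

# Response targets in hours, taken from the SLA tier table.
RESPONSE_HOURS = {
    "Standard": 8,
    "Business Critical": 4,
    "Mission Critical": 1,
}


def response_deadline(tier: str, opened_at: datetime) -> datetime:
    """Latest acceptable first response, assuming a continuously running clock."""
    return opened_at + timedelta(hours=RESPONSE_HOURS[tier])


def met_sla(tier: str, opened_at: datetime, responded_at: datetime) -> bool:
    """True when the first response arrived on or before the deadline."""
    return responded_at <= response_deadline(tier, opened_at)


if __name__ == "__main__":
    opened = datetime(2024, 1, 1, 9, 0)
    print(response_deadline("Mission Critical", opened))
    print(met_sla("Business Critical", opened, datetime(2024, 1, 1, 12, 59)))
```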
Customer outcome
- Significant savings vs. an in-house HPC sysadmin team — especially around 24/7 cover
- A measurable drop in user complaints (typically 70% or more for customers in steady-state operation)
- Long, reliable operation even after warranty thanks to spares stock and disciplined maintenance
- For management: monthly reportable uptime and utilisation metrics
Frequently asked questions
Another integrator built our system. Can we move support to you?
Yes. The first step is an independent takeover audit: documentation, monitoring stack, current configuration. Gaps and risk areas are reported and, where needed, corrected before the SLA begins.
We have an internal team and only need on-call escalation, not full outsourcing. Is that possible?
That is exactly the hybrid operations model we offer: day-to-day stays with your team; out-of-hours monitoring, critical incident response, and expert escalation come from Mevasis. It is the optimal model for many university centres.
How fast does an on-site engineer arrive in an emergency?
Under the Mission Critical SLA, an engineer is on site within 4 hours in the Istanbul, Ankara, and Izmir metro regions, and within 8 to 24 hours in other regions. We carry spares stock across the country.
Will we be locked into your management tooling?
No. We operate your stack, on your terms. Monitoring is typically Prometheus + Grafana + Alertmanager (open-source) — owned by you, operated by us. No vendor lock-in.
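Because the monitoring stack is open-source, your own team can query it directly. As a small sketch, the code below builds a request URL for the Prometheus HTTP API instant-query endpoint (`/api/v1/query`) and extracts instance names from a response body. The host `prometheus.example.org` is a placeholder, the helper names are invented for this example, and the query `up == 0` (scrape targets that are currently down) is just one possible check.

```python
import json
from typing import List
from urllib.parse import urlencode


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def instances_from_response(body: str) -> List[str]:
    """Return the instance label of every series in an instant-query response."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError("Prometheus query failed")
    return [
        series["metric"].get("instance", "unknown")
        for series in payload["data"]["result"]
    ]


if __name__ == "__main__":
    # Placeholder host; replace with the address of your own Prometheus server.
    print(build_query_url("http://prometheus.example.org:9090", "up == 0"))
```

The response can be fetched with any HTTP client; it is left out here so the sketch runs without network access.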
Next stage
Even a stable system eventually needs to grow. The HPC Capacity Expansion page explains how we scale the system without taking production offline.