HPC Operations & Support

The most expensive moments on an HPC system are when researchers cannot get into the queue when they need to, or when a job runs for hours and then fails silently. Our HPC operations and support service, also known as managed HPC, takes day-to-day ownership of the system so your team can focus on science and engineering rather than sysadmin work.

What “managed” actually means here

This is not a break-fix contract that wakes up only when something is on fire. We operate the system proactively and on evidence:

24/7 metric monitoring — catching problems before users notice them
Regular health checks and trend analysis — fixing small drifts before they become incidents
A predictable SLA — response times are contractual, not aspirational
Monthly capacity and efficiency reports — concrete deliverables for management

Service scope

24/7 monitoring and alert management

Hardware: GPU temperatures, fan/PSU health, NVMe wear, ECC error rates
Network: InfiniBand link quality, packet loss, switch memory/CPU load
Storage: parallel filesystem health, metadata server load, capacity alerts
Scheduler: queue backlog, failed-job rate, fairshare anomalies
Mevasis NOC triages critical alerts within 15 minutes
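
As an illustration of the scheduler signals listed above, a failed-job rate can be derived from job end states. The following is a minimal sketch, not our production tooling: the state names follow Slurm accounting conventions, while the set of states counted as failures and the 10 % alert threshold are assumptions chosen for the example.

```python
from collections import Counter

# Which end states count as "failed" is an assumption; adjust to local policy.
FAILURE_STATES = {"FAILED", "NODE_FAIL", "OUT_OF_MEMORY", "TIMEOUT"}


def failed_job_rate(states):
    """Return the fraction of finished jobs that ended in a failure state."""
    # Accounting output may read e.g. "CANCELLED by 1001"; keep the first word only.
    counts = Counter(s.split()[0] for s in states if s.strip())
    total = sum(counts.values())
    if total == 0:
        return 0.0
    failed = sum(counts[s] for s in FAILURE_STATES)
    return failed / total


def should_alert(states, threshold=0.10):
    """Hypothetical rule: alert when the failure rate exceeds the threshold."""
    return failed_job_rate(states) > threshold


# Example input: ten finished jobs, two of which ended in a failure state.
sample = [
    "COMPLETED", "COMPLETED", "FAILED", "CANCELLED by 1001", "TIMEOUT",
    "COMPLETED", "COMPLETED", "COMPLETED", "COMPLETED", "COMPLETED",
]
rate = failed_job_rate(sample)
```

In practice the threshold would be tuned per queue, since a test partition tolerates far more failures than a production one.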

Regular maintenance

Monthly maintenance windows: patches, firmware updates, disk health checks
Quarterly on-site visits: dust removal, cable checks, fan/PSU swaps
Annual capacity and performance audit — is the system still running like it did on day one?
Security patches: CVE-prioritised vulnerability management

Incident response

Tier-1 / Tier-2 / Tier-3 escalation, calibrated to your SLA
For hardware failures, same-day replacement from our spares stock
We manage manufacturer warranties and run RMA processes on your behalf
Post-incident root cause analysis (RCA) reports

Software and licence management

Bright Cluster Manager / xCAT / Warewulf updates
SLURM queue policy and fairshare tuning
BeeGFS / Lustre configuration optimisation
Licence tracking (commercial and open-source); proactive renewal of expiring licences
User and quota administration (request-based workflow)

Performance optimisation

Workload-specific tuning (compiler flags, MPI parameters, NUMA topology)
Identification and removal of I/O bottlenecks
GPU utilisation analysis fed back into user training
Monthly efficiency report: utilisation, queue statistics, top users
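
The utilisation and top-user figures in such a report reduce to simple arithmetic over job accounting records. The sketch below assumes a hypothetical record format of (user, cores, elapsed hours) and defines utilisation as core-hours used divided by core-hours available in the period; a real report would draw on the scheduler's accounting database and may define utilisation differently.

```python
from collections import defaultdict


def efficiency_report(jobs, total_cores, period_hours, top_n=3):
    """Summarise utilisation and top users from (user, cores, hours) records."""
    core_hours_by_user = defaultdict(float)
    for user, cores, hours in jobs:
        core_hours_by_user[user] += cores * hours

    used = sum(core_hours_by_user.values())
    available = total_cores * period_hours

    # Rank users by consumed core-hours, largest first.
    top = sorted(core_hours_by_user.items(), key=lambda kv: kv[1], reverse=True)
    return {
        "utilisation": used / available if available else 0.0,
        "core_hours_used": used,
        "top_users": top[:top_n],
    }


# Hypothetical one-day sample on a 256-core system.
jobs = [
    ("alice", 64, 10.0),
    ("bob", 128, 5.0),
    ("alice", 32, 20.0),
    ("carol", 16, 2.5),
]
report = efficiency_report(jobs, total_cores=256, period_hours=24)
```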

SLA tiers

Tier	Response time	Coverage	Best for
Standard	8 hours	Business hours, remote support	Research labs, small teams
Business Critical	4 hours	24/7, remote + on-site	Production systems, university centres
Mission Critical	1 hour	24/7, dedicated engineer	Industrial HPC, simulation-bound processes
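
To make the response times in the table concrete, the sketch below computes a response deadline per tier and checks whether a first response missed it. It is a simplified illustration: the function names are hypothetical, and it treats every tier as wall-clock time, so it does not model the business-hours coverage of the Standard tier.

```python
from datetime import datetime, timedelta

# Response targets taken from the SLA table above.
RESPONSE_TARGETS = {
    "Standard": timedelta(hours=8),
    "Business Critical": timedelta(hours=4),
    "Mission Critical": timedelta(hours=1),
}


def response_deadline(tier, opened_at):
    """Return the latest time by which a first response is due (wall-clock)."""
    return opened_at + RESPONSE_TARGETS[tier]


def is_breached(tier, opened_at, first_response_at):
    """Return True if the first response came after the deadline."""
    return first_response_at > response_deadline(tier, opened_at)


# Hypothetical incident opened at 02:00 and answered at 03:30.
opened = datetime(2024, 1, 15, 2, 0)
responded = datetime(2024, 1, 15, 3, 30)
deadline = response_deadline("Mission Critical", opened)
```

With these sample timestamps, the response would miss a Mission Critical target but fall within a Business Critical one.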

Customer outcome

Significant savings vs. an in-house HPC sysadmin team — especially around 24/7 cover
A measurable drop in user complaints (typically 70% or more for customers in steady state)
Long, reliable operation even after warranty thanks to spares stock and disciplined maintenance
For management: monthly reportable uptime and utilisation metrics

Frequently asked questions

Another integrator built our system. Can we move support to you?

Yes. The first step is an independent takeover audit: documentation, monitoring stack, current configuration. Gaps and risk areas are reported and, where needed, corrected before the SLA begins.

We have an internal team and only need on-call escalation, not full outsourcing. Is that possible?

That is exactly the hybrid operations model we offer: day-to-day stays with your team; out-of-hours monitoring, critical incident response, and expert escalation come from Mevasis. It is the optimal model for many university centres.

How fast does an on-site engineer arrive in an emergency?

Under the Mission Critical SLA, an engineer is on site within 4 hours in the Istanbul, Ankara, and Izmir metro regions, and within 8–24 hours in other regions. We carry spares stock across the country.

Will we be locked into your management tooling?

No. We operate your stack, on your terms. Monitoring is typically Prometheus + Grafana + Alertmanager (open-source) — owned by you, operated by us. No vendor lock-in.

Next stage

Even a stable system eventually needs to grow. The HPC Capacity Expansion page explains how we scale the system without taking production offline.