Case Studies
Anonymized examples from real engagements. Client names, locations, and identifying details are omitted. The technical work is representative of what we handle.
Mail reputation containment in shared hosting
Environment
Shared hosting platform — cPanel, Exim, ~400 domains across 3 servers, shared IP pool.
Problem
The hosting provider noticed a spike in bounce-backs and delivery failures across multiple customer domains. Two IPs from the shared pool appeared on major blacklists. Customer complaints were escalating.
Investigation
We audited outbound mail logs across all three servers. Identified one account sending high volumes through a compromised WordPress contact form plugin. The sending pattern had triggered Spamhaus and Barracuda listings. Other accounts on the same IPs were collateral damage.
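The per-account volume audit can be sketched roughly like this (the log lines and field layout are illustrative stand-ins for an Exim mainlog, and all addresses are hypothetical; real-world triage usually leans on exigrep/eximstats):

```python
import re
from collections import Counter

# Illustrative Exim-style mainlog lines; a real mainlog carries more fields.
LOG = """\
2024-05-01 10:00:01 1aaaa-000001-AA <= bounces@victim-site.example A=dovecot_login:shop@victim-site.example
2024-05-01 10:00:02 1aaaa-000002-AA <= news@other-customer.example A=dovecot_login:news@other-customer.example
2024-05-01 10:00:03 1aaaa-000003-AA <= bounces@victim-site.example A=dovecot_login:shop@victim-site.example
2024-05-01 10:00:04 1aaaa-000004-AA <= bounces@victim-site.example A=dovecot_login:shop@victim-site.example
"""

def outbound_counts(log_text):
    """Count accepted outbound messages ('<=' lines) per authenticated account."""
    counts = Counter()
    for line in log_text.splitlines():
        if " <= " not in line:
            continue  # only message-arrival lines matter here
        m = re.search(r"A=\w+:(\S+)", line)  # authenticated user, if any
        if m:
            counts[m.group(1)] += 1
    return counts

top = outbound_counts(LOG).most_common(1)[0]
print(top)  # ('shop@victim-site.example', 3) — the dominant sender stands out
```

On a real node the same aggregation across all three servers is what surfaced the single compromised account.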
Root Cause
A customer running an outdated WordPress plugin with a known unauthenticated mail injection vulnerability. The form was being exploited to relay spam through the local MTA using the server's default IP.
Fix
Suspended the compromised account and disabled the vulnerable plugin. Cleaned the mail queue of remaining spam. Submitted delisting requests with evidence of containment. Implemented per-account outbound rate limits and added the plugin path to the server's malware scanner signatures.
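The per-account limit can be expressed as an Exim ACL condition along these lines (a sketch only: the limit value and ACL placement are illustrative, and cPanel exposes an equivalent control through WHM's hourly email limit setting):

```
# Sketch of an Exim ACL entry (e.g. in the RCPT ACL) capping each
# authenticated account at 250 messages per hour; values illustrative.
deny    authenticated = *
        ratelimit     = 250 / 1h / strict / $authenticated_id
        message       = Outbound rate limit exceeded for this account
```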
Outcome
Blacklist removal confirmed within 18 hours. Delivery rates returned to normal across all affected customer domains. The provider added outbound rate monitoring as a standard alert.
DNS dependency failure causing domain outages
Environment
Hosting environment with delegated nameservers, a custom DNS layout, phased IP migration work, and downstream customers relying on a cluster-style nameserver pair. The upstream nameserver records happened to be hosted on a shared Windows Plesk server owned and managed by the downstream customer’s own in-house team. That server was not considered part of the production DNS infrastructure by anyone involved.
Problem
A downstream customer of our client reported a production outage. That customer represented a five-figure annual contract, so the incident carried immediate commercial and operational pressure. The initial assumption was that the outage stemmed from a nameserver-related IP change our team had completed several days earlier. The assumption was understandable: the earlier work sat in the same technical area, and no fresh changes coincided with the outage. The downstream customer's in-house team investigated first and traced the issue toward that earlier change, which placed the focus on our work from the outset.
Investigation
We rechecked the earlier DNS migration first and found no implementation fault: the expected records and server-side changes were all correct. So we restarted the investigation from the external DNS path and traced the authority chain step by step. That exposed a deeper dependency. The delegated nameserver hostnames used by the affected domains relied on a DNS zone hosted on the downstream customer's own shared Windows Plesk server. That server was functioning normally, and its own hosted sites neither used the nameservers in question nor showed any symptoms, so neither the customer's in-house team nor the first round of troubleshooting had examined it. Yet the zone on that server still held the records required for the customer's own nameserver resolution. Following the full dependency chain made it clear the real break was not in the suspected migration but in an upstream nameserver path running through a server the customer's own team had overlooked.
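The shape of the failure can be modelled in a few lines (a toy resolver over hand-written zone data, not a real DNS client; all names and the address are hypothetical):

```python
# Toy model of the broken delegation: resolving the customer's domain
# requires an A record for their nameserver host, and that A record lived
# in a zone on the overlooked Plesk server. All names are hypothetical.
ZONES = {
    "customer.example": {"NS": ["ns1.provider.example"]},
    "provider.example": {"A": {}},  # ns1's A record removed during IP cleanup
}

def resolve_ns_address(domain):
    """Follow the delegation one step and return the nameserver's IP, if any."""
    ns_host = ZONES[domain]["NS"][0]             # e.g. ns1.provider.example
    parent_zone = ns_host.split(".", 1)[1]       # zone that must hold its A record
    return ZONES[parent_zone]["A"].get(ns_host)  # None => authority chain broken

print(resolve_ns_address("customer.example"))  # None: resolution dead-ends here

# Restoring the record (the registrar-side child-nameserver fix) repairs the chain:
ZONES["provider.example"]["A"]["ns1.provider.example"] = "203.0.113.10"
print(resolve_ns_address("customer.example"))  # '203.0.113.10'
```

The point of the model: the Plesk server itself is healthy, yet deleting one A record from its zone silently severs resolution for an entirely different piece of infrastructure.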
Root Cause
The outage was not caused by the earlier nameserver IP migration. The actual cause was a hidden DNS dependency: the upstream nameserver records lived in a zone on the downstream customer's own Windows Plesk server, and nobody, including the in-house team that managed it, recognised the connection. During phased IP cleanup, legacy IP addresses were removed from that server. The server and its own hosted sites were completely unaffected, but the customer's main infrastructure depended on those nameserver records for resolution. Once the IPs were removed, the authority chain broke and domains across their core infrastructure stopped resolving. Because the Plesk server showed no symptoms, it was never checked, which is why the customer's own team had already investigated and pointed at the wrong cause.
Fix
We logged into the registrar panel and updated the child nameserver (glue) IP addresses for the upstream nameserver hostnames. That restored the authority chain required for resolvers to reach the delegated nameserver layer again. We then monitored external resolution until responses stabilised and the remaining impact cleared through normal DNS propagation.
Outcome
Domain resolution was restored without rolling back the earlier migration. The incident was traced to a hidden dependency on the customer's own infrastructure rather than an implementation mistake in the suspected change. Just as importantly, the investigation corrected an initial conclusion that had pointed at the wrong root cause. The result was service restoration for a high-value downstream customer and a clearer map of the nameserver dependency chain for future IP migration and cleanup work.
Live storage migration under production load
Environment
VMware vSphere cluster — 60+ production VMs, shared SAN storage, mixed Linux and Windows workloads.
Problem
The SAN controller backing the primary datastore was approaching end-of-support. The provider needed to migrate all VM storage to a new backend without customer-visible downtime during business hours.
Investigation
We assessed the storage layout, identified VMs with high I/O profiles that needed priority handling, and mapped snapshot dependencies. Tested storage vMotion behavior under load in a staging subset to validate performance impact thresholds.
Root Cause
Not a fault — planned migration driven by hardware lifecycle. The challenge was executing it live with no maintenance window.
Fix
Performed staged storage vMotion in controlled batches during lower-traffic periods. Prioritized high-I/O VMs during overnight windows. Monitored latency and throughput continuously during each batch. Verified datastore integrity and snapshot chain health after each migration.
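The batch planning behind that schedule can be sketched like so (VM names, the IOPS threshold, and the batch size are all illustrative, not from the actual engagement):

```python
# Sketch of the batch planner: high-I/O VMs go to overnight windows, the
# rest into small daytime low-traffic batches. All numbers illustrative.
VMS = [  # (name, average IOPS observed in the assessment)
    ("db-01", 9000), ("db-02", 7500), ("web-01", 300),
    ("web-02", 250), ("app-01", 1200), ("mail-01", 400),
]

def plan_batches(vms, iops_threshold=5000, batch_size=2):
    overnight = [name for name, iops in vms if iops >= iops_threshold]
    daytime = [name for name, iops in vms if iops < iops_threshold]
    # Fixed-size batches bound the concurrent storage vMotion load
    # placed on the source SAN at any one time.
    day_batches = [daytime[i:i + batch_size]
                   for i in range(0, len(daytime), batch_size)]
    return overnight, day_batches

overnight, day_batches = plan_batches(VMS)
print(overnight)    # ['db-01', 'db-02']
print(day_batches)  # [['web-01', 'web-02'], ['app-01', 'mail-01']]
```

Capping batch size is the key design choice: it keeps datastore latency measurable and reversible per batch rather than risking a cluster-wide I/O spike.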
Outcome
All 60+ VMs migrated to the new SAN backend over a 5-day window with zero customer-reported downtime. Old SAN was decommissioned on schedule.
WordPress performance degradation under normal traffic
Environment
Shared hosting — CloudLinux, LiteSpeed, MySQL 5.7, ~150 accounts on a single node.
Problem
A customer reported their WordPress site loading in 12–15 seconds despite normal traffic volumes. The hosting provider suspected a plugin issue, but disabling plugins had no visible effect.
Investigation
We profiled the request lifecycle and found the bottleneck was not in PHP execution but in MySQL query time. Slow query log showed a single query on the wp_options table taking 3–4 seconds per page load. The table had grown to 4.2 million rows due to an abandoned transient caching plugin that never cleaned up expired entries.
Root Cause
The wp_options table had millions of orphaned autoload=yes transient rows. Every page load triggered a bulk SELECT on autoloaded options, overwhelming the buffer pool and causing disk reads on every request.
Fix
Cleaned orphaned transients from wp_options. Removed the abandoned plugin. Optimized the table to reclaim space. Verified autoload row count was back to a healthy range. Added a cron check to alert on wp_options row count growth.
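The cleanup reduces to bulk DELETEs on wp_options, sketched here against a miniature SQLite stand-in for the table (production MySQL syntax is near-identical; a backup before any bulk DELETE, and scoping to expired/orphaned rows on the live system, are assumed):

```python
import sqlite3

# Miniature stand-in for wp_options; the real table has more columns.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE wp_options (option_name TEXT, option_value TEXT, autoload TEXT)")
db.executemany(
    "INSERT INTO wp_options VALUES (?, ?, 'yes')",
    [("siteurl", "https://example.com"),
     ("_transient_feed_abc", "stale-data"),
     ("_transient_timeout_feed_abc", "1600000000"),  # expired long ago
     ("_transient_feed_def", "stale-data")],         # orphan: no timeout row at all
)

# Remove the abandoned plugin's transient leftovers. '_' is a LIKE
# wildcard, so it must be escaped to match the literal prefix.
db.execute("DELETE FROM wp_options WHERE option_name LIKE '\\_transient\\_%' ESCAPE '\\'")
db.commit()

remaining = db.execute("SELECT COUNT(*) FROM wp_options WHERE autoload = 'yes'").fetchone()[0]
print(remaining)  # 1: only the legitimate autoloaded option is left
```

The same autoload count query, run from cron, is what drives the follow-up growth alert.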
Outcome
Page load time dropped from 12 seconds to 1.8 seconds. MySQL query time for the options load went from 3.4 seconds to 12 milliseconds.
Hypervisor fault causing VM instability
Environment
KVM-based virtualization platform — 40 VMs across 3 hypervisor nodes, Ceph storage backend.
Problem
Multiple VMs on one hypervisor node experienced I/O stalls and intermittent unresponsiveness. The issue appeared to be random — not tied to a specific VM or workload. Rebooting affected VMs provided temporary relief.
Investigation
We checked hypervisor-level logs and identified repeated libvirt timeout errors correlating with the VM stalls. Ceph OSD logs showed elevated latency on one specific OSD that mapped to a disk in the hypervisor’s local storage node. SMART data confirmed early-stage read errors on the underlying drive.
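Spotting the bad OSD came down to comparing per-OSD latency against the cluster baseline; the shape of that check, with made-up numbers standing in for `ceph osd perf` output, looks like:

```python
from statistics import median

# Illustrative per-OSD commit latencies in ms (hypothetical values).
osd_latency_ms = {"osd.0": 4, "osd.1": 5, "osd.2": 310, "osd.3": 6, "osd.4": 5}

def latency_outliers(latencies, factor=10):
    """Flag OSDs whose latency exceeds `factor` times the cluster median."""
    baseline = median(latencies.values())
    return [osd for osd, ms in latencies.items() if ms > factor * baseline]

print(latency_outliers(osd_latency_ms))  # ['osd.2']
```

A single OSD two orders of magnitude above the median, correlated with libvirt timeouts and SMART read errors, is what pinned the stalls to one physical drive.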
Root Cause
A failing physical disk in the Ceph cluster was triggering slow reads on one OSD. When VMs with placement groups on that OSD issued I/O, the Ceph recovery and retry logic added enough latency to cause visible stalls at the guest level.
Fix
Marked the failing OSD out of the cluster to trigger data rebalancing. Replaced the physical drive. Re-added the OSD after a full scrub of the rebuilt placement groups. Monitored cluster health and I/O latency for 72 hours to confirm stability.
Outcome
VM I/O stalls stopped immediately after OSD removal. Full cluster rebalance completed within 4 hours. No data loss.