The after-hours stuck-service problem
A 3CX server is not one program. It is a cluster of services running side by side: the media server that mixes call audio, the audio providers, the event-notification layer, the AI transcription service, and so on. Most of the time they all sit there running. Occasionally one wedges: a service crashes, hangs, or exits and doesn't come back. On its own, that's not exotic; software does this.
The painful part is when it happens. A service that stalls at 2:14am doesn't announce itself politely at the start of the next business day. Depending on which one stopped, your client might get one-way audio, missing recordings, no AI summaries, or a dead feature — and they find out before you do. The classic remediation is embarrassingly simple: restart the service and it comes right back. But "restart the service" requires a human to (a) see the alert, (b) be awake, (c) get to a laptop, and (d) log in. At 2am, all four are unlikely. So a thirty-second fix becomes a multi-hour outage whose entire duration is just waiting for a person.
The obvious reaction — "so just auto-restart everything" — is exactly the trap. Some 3CX services cannot be restarted without consequences: bounce the phone system or the call manager and you drop every call in progress. An auto-remediation that occasionally hangs up on live callers is worse than the disease. The art is doing this safely: fix the things that are safe to fix, automatically, and refuse to touch the things that aren't.
How safe auto-remediation works
Sikurd already polls every 3CX instance on a tight cycle and pulls the live service list on each poll. When a service shows up as stopped, Sikurd raises a critical "service stopped" alert. Self-Healing adds one thing on top of that alert: under tightly controlled conditions, it issues the restart for you. Those conditions are the whole point, so here's exactly what they are.
1. A one-poll grace period for transient blips
Sikurd never restarts a service the first time it sees it down. It waits for a second consecutive observation (the same service, still stopped, on the next poll) before it acts. A genuinely transient blip, meaning a service that flickers and recovers on its own within a poll cycle, clears without Sikurd lifting a finger. Only a service that is actually stuck, confirmed across two polls in a row, becomes a restart candidate. This single rule eliminates the most common false positive: reacting to a momentary flap that would have cleared itself.
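To make the two-poll rule concrete, here is a minimal sketch of a consecutive-observation check. It is an illustration only, not Sikurd's actual code: the class name `StuckServiceDetector`, the `observe` method, and the instance and service identifiers are all hypothetical.

```python
from collections import defaultdict


class StuckServiceDetector:
    """Flags a service only after it is seen stopped on consecutive polls."""

    def __init__(self, required_consecutive: int = 2):
        self.required = required_consecutive
        # (instance_id, service) -> number of polls in a row seen stopped
        self.down_streak = defaultdict(int)

    def observe(self, instance_id: str, service: str, running: bool) -> bool:
        """Record one poll result; return True if the service is a restart candidate."""
        key = (instance_id, service)
        if running:
            # Any healthy observation resets the streak, so a blip never accumulates.
            self.down_streak.pop(key, None)
            return False
        self.down_streak[key] += 1
        return self.down_streak[key] >= self.required


detector = StuckServiceDetector()
first = detector.observe("pbx-1", "3CX Media Server", running=False)   # first sighting
second = detector.observe("pbx-1", "3CX Media Server", running=False)  # confirmed stuck
```

With these inputs, `first` is `False` and `second` is `True`. A healthy poll between the two would have reset the streak, so the service would need two fresh consecutive failures to qualify.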
2. A whitelist of low-blast-radius services only
This is the safety rail that matters most. Self-Healing will only ever restart a fixed, deliberately narrow list of services that are safe to bounce:
- 3CX Media Server — the call-audio mixer. A restart briefly blips audio on calls already in progress, but it does not drop them.
- 3CX Audio Providers (the numbered ones — 01, 02, …) — supply hold music and prompts; restart-safe.
- 3CX Event Notification Manager — the alerting/eventing layer; restarting it has no call impact.
- 3CX System Services (the numbered ones) — background system workers.
- 3CX AI — transcription and summary; purely background, completely safe to restart.
Anything not on that list is left alone. If it's down, you still get the alert — Sikurd just won't try to fix it itself. The list is short on purpose, and widening it is a deliberate decision rather than something that creeps in by default.
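The "anything not on the list is left alone" behavior is a default-deny check, which can be sketched as below. The name patterns are assumptions made for illustration; the exact service names a 3CX instance reports may differ, and `is_restart_safe` is a hypothetical helper rather than part of any Sikurd or 3CX API.

```python
import re

# Hypothetical display-name patterns for the restart-safe services described above.
# The real names reported by 3CX may be spelled or numbered differently.
RESTART_SAFE_PATTERNS = [
    re.compile(r"3CX Media Server"),
    re.compile(r"3CX Audio Provider \d+"),
    re.compile(r"3CX Event Notification Manager"),
    re.compile(r"3CX System Service \d+"),
    re.compile(r"3CX AI"),
]


def is_restart_safe(service_name: str) -> bool:
    """Default deny: only a full match against the allowlist permits a restart."""
    return any(p.fullmatch(service_name) for p in RESTART_SAFE_PATTERNS)


candidates = ["3CX Media Server", "3CX Audio Provider 02", "3CX Call Manager"]
safe = [name for name in candidates if is_restart_safe(name)]
```

Here `safe` ends up containing the media server and the audio provider but not the call manager. Using `fullmatch` rather than a substring search is the design choice that matters: a new or unexpectedly named service fails the check and falls back to alert-only.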
3. Cooldowns and an hourly cap
Even for whitelisted services, Self-Healing won't hammer a flapping process:
- 10-minute cooldown per service: The same service on the same instance can't be restarted more than once every ten minutes. A service that keeps flapping doesn't get bounced on every poll.
- Maximum 3 attempts per hour: After three restart attempts in an hour, Sikurd gives up, on purpose. A service that won't survive three restarts has a real problem, and a human should look at it.
- Then it stops and escalates: Past the cap, the restart loop ends and the original alert stays open, routing through your normal escalation and PSA paths. Automation tried; now it's a person's turn.
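The cooldown and the hourly cap can be sketched as a small per-service limiter. This is a sketch under stated assumptions: the article does not say whether the hour is a rolling window or a fixed clock hour, so a rolling window is assumed here, and `RestartLimiter` and its return values are hypothetical names.

```python
from collections import defaultdict, deque

COOLDOWN_SECONDS = 10 * 60     # at most one restart per service every ten minutes
MAX_ATTEMPTS_PER_HOUR = 3      # after three attempts, stop and escalate
HOUR_SECONDS = 60 * 60         # assumption: a rolling one-hour window


class RestartLimiter:
    def __init__(self):
        # (instance_id, service) -> timestamps of recent restart attempts
        self.attempts = defaultdict(deque)

    def decide(self, instance_id: str, service: str, now: float) -> str:
        """Return 'restart', 'wait' (in cooldown), or 'escalate' (cap reached)."""
        history = self.attempts[(instance_id, service)]
        # Forget attempts that have aged out of the one-hour window.
        while history and now - history[0] >= HOUR_SECONDS:
            history.popleft()
        if len(history) >= MAX_ATTEMPTS_PER_HOUR:
            return "escalate"
        if history and now - history[-1] < COOLDOWN_SECONDS:
            return "wait"
        # Only an actual restart attempt is recorded against the cap.
        history.append(now)
        return "restart"


limiter = RestartLimiter()
timeline = [limiter.decide("pbx-1", "3CX AI", t) for t in (0, 300, 600, 1200, 1800)]
```

For that timeline the decisions are restart, wait, restart, restart, escalate: the poll at 300 seconds falls inside the cooldown, and the poll at 1800 seconds finds three attempts already spent within the hour.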
4. Opt-in, role-gated, and fully audited
Self-Healing is off by default. You turn it on per tenant from the dashboard settings, and only an Owner, Admin, or Super Admin can flip it — issuing service restarts is a privileged action, so Members and read-only roles can't enable it. Every restart attempt, whether it succeeds or 3CX rejects it, is written to the audit trail as a SERVICE_AUTO_RESTART event tied to the instance, right alongside the underlying alert. You can always see which service Sikurd restarted, on which PBX, when, whether it worked, and any error returned. Nothing happens behind your back.
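As a rough sketch of the role gate and the audit record, the snippet below checks who may toggle the feature and builds one audit entry per attempt. Only the event name `SERVICE_AUTO_RESTART` and the role names come from the description above; the field names, role string format, and both functions are assumptions for illustration, not Sikurd's actual schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

# Roles permitted to enable or disable Self-Healing (string format is assumed).
ROLES_ALLOWED_TO_TOGGLE = {"OWNER", "ADMIN", "SUPER_ADMIN"}


def can_toggle_self_healing(role: str) -> bool:
    """Normalize the role label and check it against the privileged set."""
    return role.strip().upper().replace(" ", "_") in ROLES_ALLOWED_TO_TOGGLE


@dataclass(frozen=True)
class AuditEvent:
    event_type: str
    instance_id: str
    service: str
    succeeded: bool
    error: Optional[str]
    timestamp: str


def build_restart_audit_event(instance_id: str, service: str, succeeded: bool,
                              error: Optional[str] = None,
                              now: Optional[datetime] = None) -> AuditEvent:
    """Build an audit record for every attempt, successful or rejected."""
    now = now or datetime.now(timezone.utc)
    return AuditEvent("SERVICE_AUTO_RESTART", instance_id, service,
                      succeeded, error, now.isoformat())


event = build_restart_audit_event("pbx-1", "3CX AI", succeeded=False,
                                  error="restart rejected")
```

The point of the sketch is that a failed attempt produces a record just like a successful one, with the error carried along, so the trail answers which service, which PBX, when, and with what outcome.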
Why it won't drop calls
The single most important thing to understand about Self-Healing is what it refuses to do. The services that carry and route live calls — the phone system, the call manager, the queue manager, the configuration server — are deliberately excluded from the whitelist. They are never auto-restarted, full stop, even when Self-Healing is enabled and even when one of them is the service that's down.
That's not an oversight; it's the design. Restarting any of those mid-call drops calls or breaks the PBX, and no amount of "it would have fixed itself faster" justifies hanging up on a customer's caller. So those services are handled the old-fashioned, safe way: Sikurd alerts loudly, escalates through your on-call and PSA workflows, and waits for a human who can judge the right moment to act. The whitelist is the line between "safe to automate" and "needs a person," and Self-Healing never crosses it.
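Put together, the line between "safe to automate" and "needs a person" reduces to a short decision in which the allowlist check comes before anything else. The following is a simplified sketch with hypothetical names (`Action`, `decide`); it leaves out the cooldown and hourly cap, and it assumes the critical alert is raised in every case regardless of the outcome.

```python
from enum import Enum


class Action(Enum):
    ALERT_ONLY = "alert_only"        # alert and escalate; never restart
    WAIT_FOR_NEXT_POLL = "wait"      # seen down once; could be a transient blip
    RESTART = "restart"              # confirmed stuck and safe to bounce


def decide(self_healing_enabled: bool, restart_safe: bool,
           consecutive_down_polls: int) -> Action:
    """Decide what to do with a stopped service, on top of the alert itself."""
    # Call-carrying services fail the allowlist check, so they stop here
    # no matter how long they have been down.
    if not self_healing_enabled or not restart_safe:
        return Action.ALERT_ONLY
    if consecutive_down_polls < 2:
        return Action.WAIT_FOR_NEXT_POLL
    return Action.RESTART


call_manager = decide(True, restart_safe=False, consecutive_down_polls=5)
media_server = decide(True, restart_safe=True, consecutive_down_polls=2)
```

In this sketch `call_manager` resolves to `Action.ALERT_ONLY` even after five failed polls, while `media_server` resolves to `Action.RESTART` once the second poll confirms it is stuck.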
The result is auto-remediation you can actually leave switched on. The services that are safe to bounce get bounced automatically the moment they're confirmed stuck; the services that aren't get a human. You catch the 2am media-service blip before the client notices, and you never trade that convenience for a dropped call.
Where Self-Healing fits
Self-Healing is the automatic-action layer that sits on top of monitoring. The detection comes from the same poll loop that powers Sikurd's broader alerting; Self-Healing is what turns a subset of those alerts into a fix instead of a notification.
- How to monitor 3CX trunk health — the registration-and-availability side of monitoring.
- 3CX monitoring for MSPs — the broader monitoring and alerting playbook Self-Healing plugs into.
- Best tools for managing multiple 3CX servers — where fleet-wide automation fits in the market.