Self-Healing 3CX Services: Safely Auto-Restarting Stuck Services

A media service hangs at 2am. Nobody's awake. By the time someone sees the alert, the client has already noticed. Self-Healing fixes the easy ones automatically — with a whitelist, cooldowns, and a hard rule that it never touches anything that could drop a call.

The after-hours stuck-service problem

A 3CX server is not one program. It's a cluster of services running side by side: the media server that mixes call audio, the audio providers, the event-notification layer, the AI transcription service, and so on. Most of the time they all sit there running. Occasionally one wedges: a service crashes, hangs, or exits and doesn't come back. On its own, that's not exotic; software does this.

The painful part is when it happens. A service that stalls at 2:14am doesn't announce itself politely at the start of the next business day. Depending on which one stopped, your client might get one-way audio, missing recordings, no AI summaries, or a dead feature — and they find out before you do. The classic remediation is embarrassingly simple: restart the service and it comes right back. But "restart the service" requires a human to (a) see the alert, (b) be awake, (c) get to a laptop, and (d) log in. At 2am, all four are unlikely. So a thirty-second fix becomes a multi-hour outage whose entire duration is just waiting for a person.

The obvious reaction — "so just auto-restart everything" — is exactly the trap. Some 3CX services cannot be restarted without consequences: bounce the phone system or the call manager and you drop every call in progress. An auto-remediation that occasionally hangs up on live callers is worse than the disease. The art is doing this safely: fix the things that are safe to fix, automatically, and refuse to touch the things that aren't.

How safe auto-remediation works

Sikurd already polls every 3CX instance on a tight cycle and pulls the live service list on each poll. When a service shows up as stopped, Sikurd raises a critical "service stopped" alert. Self-Healing adds one thing on top of that alert: under tightly controlled conditions, it issues the restart for you. Those conditions are the whole point, so here's exactly what they are.

1. A one-poll grace period for transient blips

Sikurd never restarts a service the first time it sees it down. It waits for a second consecutive observation (the same service, still stopped, on the next poll) before it acts. A genuinely transient blip (a service that flickers and recovers on its own within a poll cycle) self-heals without Sikurd lifting a finger. Only a service that is actually stuck, confirmed across two polls in a row, becomes a restart candidate. This single rule eliminates the most common false positive: reacting to a momentary flap that would have cleared itself.
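
The two-poll rule can be sketched as a small streak counter. This is a minimal illustration, not Sikurd's actual code: the class name, the observe method, and the CONFIRM_POLLS constant are all hypothetical, and the only fact taken from the rule above is that a restart needs two consecutive "stopped" observations.

```python
from collections import defaultdict

# Assumption: two consecutive "stopped" polls confirm a stuck service.
CONFIRM_POLLS = 2


class StuckServiceDetector:
    """Tracks consecutive 'stopped' observations per (instance, service)."""

    def __init__(self, confirm_polls: int = CONFIRM_POLLS) -> None:
        self.confirm_polls = confirm_polls
        self._streak = defaultdict(int)

    def observe(self, instance: str, service: str, running: bool) -> bool:
        """Record one poll result; return True once the service is confirmed stuck."""
        key = (instance, service)
        if running:
            # Any healthy observation resets the streak, so a blip never accumulates.
            self._streak.pop(key, None)
            return False
        self._streak[key] += 1
        return self._streak[key] >= self.confirm_polls
```

A service that is down, then up, then down again never reaches a streak of two, so it is never treated as a restart candidate.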

2. A whitelist of low-blast-radius services only

This is the safety rail that matters most. Self-Healing will only ever restart a fixed, deliberately narrow list of services that are safe to bounce:

  • 3CX Media Server — the call-audio mixer. A restart briefly blips audio on calls already in progress, but it does not drop them.
  • 3CX Audio Providers (the numbered ones — 01, 02, …) — supply hold music and prompts; restart-safe.
  • 3CX Event Notification Manager — the alerting/eventing layer; restarting it has no call impact.
  • 3CX System Services (the numbered ones) — background system workers.
  • 3CX AI — transcription and summary; purely background, completely safe to restart.

Anything not on that list is left alone. If it's down, you still get the alert — Sikurd just won't try to fix it itself. The list is short on purpose, and widening it is a deliberate decision rather than something that creeps in by default.
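
A whitelist like this is naturally expressed as allow-by-exception matching. The sketch below is illustrative only: the exact service name strings that 3CX reports are not given here, so the patterns (including the two-digit suffix on the numbered services) are assumptions, and is_restart_eligible is a hypothetical helper.

```python
import re

# Assumed display names. The list of services comes from the article, but the
# exact strings (and the two-digit numbering format) are illustrative guesses.
RESTART_WHITELIST = [
    re.compile(r"3CX Media Server"),
    re.compile(r"3CX Audio Provider \d{2}"),
    re.compile(r"3CX Event Notification Manager"),
    re.compile(r"3CX System Service \d{2}"),
    re.compile(r"3CX AI"),
]


def is_restart_eligible(service_name: str) -> bool:
    """Return True only if the whole name matches a whitelisted pattern."""
    # fullmatch (not search) means a partial or lookalike name is refused.
    return any(p.fullmatch(service_name) for p in RESTART_WHITELIST)
```

The design choice that matters is the default: a name that matches nothing is refused, so a new or renamed service is never restarted by accident.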

3. Cooldowns and an hourly cap

Even for whitelisted services, Self-Healing won't hammer a flapping process:

  • 10-minute cooldown per service
    The same service on the same instance can't be restarted more than once every ten minutes. A service that keeps flapping doesn't get bounced on every poll.
  • Maximum 3 attempts per hour
    After three restart attempts in an hour, Sikurd gives up — on purpose. A service that won't survive three restarts has a real problem, and a human should look at it.
  • Then it stops and escalates
    Past the cap, the restart loop ends and the original alert stays open, routing through your normal escalation and PSA paths. Automation tried; now it's a person's turn.
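
Here is one way the cooldown and the hourly cap could fit together. The ten-minute and three-per-hour figures come from the rules above; everything else is an assumption for illustration, including the RestartLimiter name, the three return values, and the choice of a rolling one-hour window (the rules don't say whether the hour is rolling or fixed).

```python
from collections import defaultdict, deque

COOLDOWN_SECONDS = 10 * 60     # at most one restart per service every ten minutes
MAX_ATTEMPTS_PER_HOUR = 3      # give up after three attempts in an hour
WINDOW_SECONDS = 60 * 60       # assumption: the hour is a rolling window


class RestartLimiter:
    """Keeps recent attempt timestamps per (instance, service)."""

    def __init__(self) -> None:
        self._attempts = defaultdict(deque)

    def decide(self, instance: str, service: str, now: float) -> str:
        """Return 'restart', 'cooldown', or 'escalate' for a confirmed-stuck service."""
        attempts = self._attempts[(instance, service)]
        # Drop attempts that have aged out of the one-hour window.
        while attempts and now - attempts[0] >= WINDOW_SECONDS:
            attempts.popleft()
        if len(attempts) >= MAX_ATTEMPTS_PER_HOUR:
            return "escalate"          # cap reached: leave the alert to a human
        if attempts and now - attempts[-1] < COOLDOWN_SECONDS:
            return "cooldown"          # too soon after the previous attempt
        attempts.append(now)           # an attempt counts whether or not it succeeds
        return "restart"
```

Because the state is keyed by instance and service together, a flapping service on one PBX never uses up the budget of the same service on another.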

4. Opt-in, role-gated, and fully audited

Self-Healing is off by default. You turn it on per tenant from the dashboard settings, and only an Owner, Admin, or Super Admin can flip it — issuing service restarts is a privileged action, so Members and read-only roles can't enable it. Every restart attempt, whether it succeeds or 3CX rejects it, is written to the audit trail as a SERVICE_AUTO_RESTART event tied to the instance, right alongside the underlying alert. You can always see which service Sikurd restarted, on which PBX, when, whether it worked, and any error returned. Nothing happens behind your back.
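
To make the gate and the audit record concrete, here is a rough sketch. The role names and the SERVICE_AUTO_RESTART event type come from the description above; the function names, field names, and record shape are hypothetical and are not Sikurd's real schema.

```python
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional

# Roles the article says may enable Self-Healing.
ROLES_ALLOWED_TO_TOGGLE = {"Owner", "Admin", "Super Admin"}


def can_toggle_self_healing(role: str) -> bool:
    """Members and read-only roles fall through to False."""
    return role in ROLES_ALLOWED_TO_TOGGLE


@dataclass(frozen=True)
class AuditEvent:
    # Field names are illustrative; only the event type string is from the article.
    event_type: str
    instance: str
    service: str
    succeeded: bool
    error: Optional[str]
    timestamp: str


def build_restart_event(instance: str, service: str, succeeded: bool,
                        error: Optional[str] = None) -> dict:
    """Build one audit record for a restart attempt, successful or not."""
    return asdict(AuditEvent(
        event_type="SERVICE_AUTO_RESTART",
        instance=instance,
        service=service,
        succeeded=succeeded,
        error=error,
        timestamp=datetime.now(timezone.utc).isoformat(),
    ))
```

The point of recording failed attempts with the same event type is that whoever picks up the alert sees the whole history in one place.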

Why it won't drop calls

The single most important thing to understand about Self-Healing is what it refuses to do. The services that carry and route live calls — the phone system, the call manager, the queue manager, the configuration server — are deliberately excluded from the whitelist. They are never auto-restarted, full stop, even when Self-Healing is enabled and even when one of them is the service that's down.

That's not an oversight; it's the design. Restarting any of those mid-call drops calls or breaks the PBX, and no amount of "it would have fixed itself faster" justifies hanging up on a customer's caller. So those services are handled the old-fashioned, safe way: Sikurd alerts loudly, escalates through your on-call and PSA workflows, and waits for a human who can judge the right moment to act. The whitelist is the line between "safe to automate" and "needs a person," and Self-Healing never crosses it.

The result is auto-remediation you can actually leave switched on. The services that are safe to bounce get bounced automatically the moment they're confirmed stuck; the services that aren't get a human. You catch the 2am media-service blip before the client notices, and you never trade that convenience for a dropped call.

Where Self-Healing fits

Self-Healing is the automatic-action layer that sits on top of monitoring. The detection comes from the same poll loop that powers Sikurd's broader alerting; Self-Healing is what turns a subset of those alerts into a fix instead of a notification.
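
Put as code, the layering looks roughly like this: the alert is always raised, and a restart is added only when every gate passes. The names below are hypothetical and the gates are reduced to plain booleans, so read it as a sketch of the ordering rather than a real implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoppedServiceAlert:
    instance: str
    service: str
    confirmed_stuck: bool   # seen stopped on two consecutive polls
    whitelisted: bool       # on the restart whitelist
    within_limits: bool     # cooldown elapsed and hourly cap not reached


def plan_actions(alert: StoppedServiceAlert, self_healing_enabled: bool) -> list[str]:
    """Monitoring always alerts; Self-Healing only ever adds a restart on top."""
    actions = ["alert"]
    if (self_healing_enabled
            and alert.whitelisted
            and alert.confirmed_stuck
            and alert.within_limits):
        actions.append("restart")
    return actions
```

A stopped call manager, for example, fails the whitelisted check, so the plan is the alert alone no matter how the other flags are set.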

Frequently asked questions

What is Self-Healing in Sikurd?
Self-Healing is an opt-in feature that automatically restarts a stuck 3CX service after Sikurd has seen it stopped on two consecutive polls. It only ever touches a fixed whitelist of low-risk services (the media server, audio providers, event-notification manager, system services, and AI service), and it's wrapped in cooldowns, an hourly attempt cap, and a full audit trail. The goal is to clear a stuck service at 2am before your client notices, without a human having to wake up.
Will an auto-restart ever drop a live call?
No — that's the whole design constraint. The call-affecting services (the phone system, call manager, queue manager, configuration server) are deliberately excluded from the whitelist and are never restarted automatically, even with Self-Healing turned on. The services that are eligible — chiefly the media server, numbered audio providers, the event-notification manager, and the AI service — can be restarted without dropping active calls. The media server, at worst, briefly blips audio on calls in progress; it does not hang up on anyone.
Which 3CX services does Self-Healing restart?
A fixed, conservative whitelist: the 3CX Media Server, numbered 3CX Audio Providers, the 3CX Event Notification Manager, numbered 3CX System Services, and the 3CX AI service. Anything not on that list, most importantly the phone system and anything in the call path, is left for a human even when it's down. The whitelist is intentionally narrow; widening it is a deliberate decision, not the default.
How do I turn Self-Healing on, and who can?
It's off by default. An Owner, Admin, or Super Admin enables it for the tenant from the dashboard settings; the toggle is blocked for Members and read-only roles because issuing service restarts is a privileged action. The change is recorded so you always know who flipped it and when.
What stops it from restart-looping a service that won't stay up?
Two guardrails. A 10-minute cooldown means the same service on the same instance can't be restarted more than once every ten minutes. And an hourly cap of three attempts means that after three tries in an hour, Sikurd stops, leaves the alert open, and hands it to a human. A service that won't survive three restarts has a real problem that automation shouldn't paper over.
Is every restart logged?
Yes. Every attempt, successful or rejected, is written to the audit trail as a SERVICE_AUTO_RESTART event tied to the instance, alongside the underlying "service stopped" alert. You see exactly which service Sikurd restarted, when, on which PBX, whether it succeeded, and any error 3CX returned. Nothing happens silently.
What happens after three failed attempts?
Sikurd stops trying and leaves the incident open for a person. The original alert stays active (and routes through your normal escalation, PSA ticketing, and on-call paths), and the failed attempts are visible in the audit trail so whoever picks it up knows automation already tried. Self-Healing is designed to catch the easy, transient failures and get out of the way for the hard ones.

Let the safe restarts happen without you.

Sikurd watches every 3CX service on every instance, clears the transient stalls automatically inside strict guardrails, and escalates the ones that need a human. Opt-in per tenant, off by default, every action audited. Free for your first three instances.