A client requested a presentation on key improvements to Always On Availability Group fast failover in SQL Server 2025, and I decided a summary would make a good blog post. Here I discuss Enhanced Telemetry, Persistent Health, and Intelligent Fast Failover.
What’s Fast Failover with Persistent Health?
High availability has always been at the core of SQL Server, but historically, failover decisions were blunt instruments, often triggered by missed heartbeats rather than a true understanding of engine health. SQL Server 2025 fundamentally changes this model.
SQL Server 2025 improves Always On Availability Groups with advanced telemetry, a persistent health model, and smart failover logic, aligning more closely with how DBAs assess system health: through patterns and context, not just brief incidents.
The Big Picture: From “Is It Alive?” to “Is It Healthy?”
In previous SQL Server versions, AG health detection focused heavily on aliveness:
- Is SQL Server responding?
- Is the replica reachable?
- Did a heartbeat timeout occur?
While effective, this approach had limitations:
- Transient CPU spikes could trigger unnecessary failovers
- Network hiccups looked identical to engine failures
- Diagnostic context was lost once failover completed
SQL Server 2025 introduces a more mature model:
- Health is continuously evaluated, not just at failure time
- Diagnostic data persists across role changes
- Failover decisions are based on engine condition trends, not just missed pings
This evolution is powered by enhanced real‑time telemetry and a persistent health model tightly integrated with fast failover logic.
Enhanced Telemetry: More Useful Data, Less Distraction
SQL Server 2025 significantly enhances both the quality and frequency of engine health telemetry:
- Health checks leverage sp_server_diagnostics more frequently
- CPU pressure, memory pressure, and I/O latency are evaluated together
- When a health threshold is crossed, SQL Server captures performance counter snapshots
- These diagnostics are written into WSFC logs, not just SQL logs
The key change: diagnostics are captured before failover, not reconstructed afterward. This provides clear insight into why failover occurred, not just that it occurred.
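As a rough illustration of the telemetry source named above, the following sketch runs sp_server_diagnostics once and keeps only the components that are not reporting a clean state. The temp table name is an arbitrary choice for this example, and the procedure behaves the same way in earlier versions; nothing in the query is specific to SQL Server 2025.

```sql
-- Capture one sp_server_diagnostics sample into a temp table.
-- #ServerDiagnostics is a hypothetical name chosen for this sketch.
CREATE TABLE #ServerDiagnostics
(
    create_time    datetime,
    component_type sysname,
    component_name sysname,
    [state]        int,
    state_desc     sysname,
    [data]         nvarchar(max)
);

INSERT INTO #ServerDiagnostics
EXEC sys.sp_server_diagnostics;   -- no argument: return a single sample

-- state: 0 = unknown, 1 = clean, 2 = warning, 3 = error
-- Some components may report "unknown" even on a healthy instance.
SELECT create_time, component_name, state_desc, [data]
FROM #ServerDiagnostics
WHERE [state] <> 1
ORDER BY component_name;

DROP TABLE #ServerDiagnostics;
```

The components returned (system, resource, query_processing, io_subsystem, events) are the same ones the failure condition levels discussed later are built on.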
Persistent Availability Group Health: Memory That Survives Failover
Historically, AG health information was short-lived. Once failover occurred, the new primary had no memory of instability leading up to the event.
SQL Server 2025 changes this with a persistent health model:
- Health evaluations and failure conditions are tracked over time
- Diagnostic context survives role transitions
- The new primary is aware of recent instability patterns
This persistence reduces:
- Failover “ping‑pong” scenarios
- Repeated failovers caused by transient issues
- Post‑failover performance surprises
In short, SQL Server now remembers being sick, even after it recovers.
Fast Failover, Done Smarter
Fast failover in SQL Server has always balanced two competing goals:
- Fail fast when the primary is truly unhealthy
- Avoid false positives caused by short‑lived conditions
SQL Server 2025 improves both sides of that equation.
Why Failover Is Faster and Safer
- Health logic evaluates patterns rather than single data points
- Persistent telemetry allows SQL Server to distinguish spikes from sustained failure
- Secondary replicas retain readiness context, reducing recovery time
- Query Store and engine metadata persistence reduce post‑failover regression
The result: faster failover when it matters, fewer failovers when it doesn’t.
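To see the current health signals for each replica, a query along these lines can help. It is only a sketch using catalog views and DMVs that predate SQL Server 2025; it shows point-in-time state and does not expose whatever persisted health history the engine keeps internally.

```sql
-- Current health of each replica, as seen from the connected instance.
-- On a secondary replica this DMV returns a row only for the local replica.
SELECT ag.name AS availability_group,
       ar.replica_server_name,
       rs.role_desc,
       rs.operational_state_desc,
       rs.recovery_health_desc,
       rs.synchronization_health_desc
FROM sys.dm_hadr_availability_replica_states AS rs
JOIN sys.availability_replicas AS ar
  ON ar.replica_id = rs.replica_id
JOIN sys.availability_groups AS ag
  ON ag.group_id = rs.group_id
ORDER BY ag.name, ar.replica_server_name;
```

Run it on the primary for a view of every replica in the group.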
Configuring Fast Failover for Persistent Health
Fast Failover for Persistent AG Health allows SQL Server 2025 to fail over immediately when a replica is persistently unhealthy, instead of repeatedly restarting the AG resource on the same node.
This is enabled by the Windows Server Failover Cluster (WSFC) value, RestartThreshold.
Get-ClusterResource "AGName" | Set-ClusterParameter RestartThreshold 0
Setting | Behavior
RestartThreshold = 0 (new in 2025) | Immediate failover on persistent health failure
RestartThreshold = 1 (default pre‑2025) | Try to restart AG once on the same node
RestartThreshold = 3 | Try multiple restarts before failover
Best Practices: Configuring AG Health, the Right Way
Enhanced telemetry does not eliminate the need for good configuration. In fact, correct alignment is more important than ever.
1. Tune Failure Condition Level Thoughtfully
FAILURE_CONDITION_LEVEL controls how sensitive SQL Server is to health issues.
ALTER AVAILABILITY GROUP AGName SET (FAILURE_CONDITION_LEVEL = 3);
Best practice:
- Level 3 for most production systems
- Level 4–5 only in stable, well‑understood environments
- Avoid overly aggressive settings that negate telemetry benefits
Failover should reflect sustained engine distress, not momentary pressure.
2. Align Health Check Timeout with WSFC Lease Timeout
HEALTH_CHECK_TIMEOUT defines how long SQL Server waits before declaring itself unhealthy.
ALTER AVAILABILITY GROUP AGName SET (HEALTH_CHECK_TIMEOUT = 15000);
Best practice:
- 30 seconds (30,000 ms) is common
- Try to configure WSFC LeaseTimeout longer than AG health timeout
- Prevent the cluster from evicting SQL Server prematurely
This alignment allows SQL Server’s enhanced telemetry to drive decisions first.
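Before changing either setting, it can be useful to check what each availability group is currently using. A minimal sketch, reading the values from the sys.availability_groups catalog view:

```sql
-- Current failover sensitivity and health check timeout per availability group.
SELECT name,
       failure_condition_level,   -- 1 (least sensitive) to 5 (most sensitive)
       health_check_timeout       -- milliseconds
FROM sys.availability_groups
ORDER BY name;
```

The WSFC lease timeout is a cluster resource property and is not shown by this query.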
3. Configure Replica Session Timeout Carefully
Replica SESSION_TIMEOUT controls how quickly replicas detect communication loss.
ALTER AVAILABILITY GROUP AGName
MODIFY REPLICA ON 'INSTANCE09' WITH (SESSION_TIMEOUT = 15);
Best practice:
- 10–20 seconds depending on network reliability
- Avoid ultra‑low values unless latency is extremely predictable
This setting complements telemetry by helping confirm whether health issues are local or connectivity‑related.
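One way to judge whether a session timeout is too tight is to look at it next to each replica's recent connection history. The query below is a sketch along those lines, using columns available in existing versions of SQL Server.

```sql
-- Session timeout and most recent connection error for each replica.
SELECT ar.replica_server_name,
       ar.session_timeout,                -- seconds
       rs.connected_state_desc,
       rs.last_connect_error_number,
       rs.last_connect_error_description,
       rs.last_connect_error_timestamp
FROM sys.availability_replicas AS ar
LEFT JOIN sys.dm_hadr_availability_replica_states AS rs
  ON rs.replica_id = ar.replica_id
ORDER BY ar.replica_server_name;
```

Recurring connection errors on an otherwise healthy replica suggest a network problem rather than an engine problem, which argues against lowering the timeout further.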
4. Use Availability Group Commit Time (New in 2025)
SQL Server 2025 introduces a new server‑level setting called Availability Group Commit Time (ms). This controls log batching behavior for AG synchronization.
Best practice:
- 1–5 ms for latency‑sensitive OLTP systems
- >10 ms for mixed workloads
- Higher values for throughput‑oriented batch systems
Lower redo lag improves secondary readiness and shortens failover recovery time.
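A hedged sketch of how this might be set follows. It assumes the setting is exposed through sp_configure under the name 'availability group commit time (ms)' and that it is an advanced option; both are assumptions to confirm on your build, and the first query shows the permitted range if the option is present.

```sql
-- Assumption: the option name below matches what your build exposes.
-- This query returns no rows if the option does not exist.
SELECT name, value_in_use, minimum, maximum
FROM sys.configurations
WHERE name LIKE N'availability group commit time%';

-- Assumption: the option is an advanced option.
EXEC sys.sp_configure N'show advanced options', 1;
RECONFIGURE;

-- 5 is an illustrative value taken from the OLTP range suggested above.
EXEC sys.sp_configure N'availability group commit time (ms)', 5;
RECONFIGURE;
```

Check the minimum and maximum returned by the first query before choosing a value, since an out-of-range value is rejected.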
5. Enable Database Level Health Detection
Database‑level issues (corruption, suspect state, log failures) should trigger failover.
Best practice:
- Enable database level health detection
- Let persistent health tracking retain context across failovers
This ensures real database failures don’t hide behind instance‑level health.
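As a minimal sketch, database level health detection is turned on per availability group with the DB_FAILOVER option. AGName is the same placeholder used earlier; run the statement on the primary replica.

```sql
-- Enable database level health detection for the availability group.
ALTER AVAILABILITY GROUP AGName SET (DB_FAILOVER = ON);

-- Confirm the setting: db_failover = 1 means detection is enabled.
SELECT name, db_failover
FROM sys.availability_groups
WHERE name = N'AGName';
```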
Final Thoughts
SQL Server 2025 represents a maturity leap for Always On Availability Groups.
- Enhanced telemetry gives SQL Server better vision.
- Persistent health gives it memory.
- Smarter fast failover gives it judgment.
When configured correctly, these features deliver the following benefits:
- Faster, more accurate automatic failovers
- Fewer false positives
- Better information after the incident
- More predictable post‑failover performance
High availability is no longer just about staying online; it’s about failing smarter.
