SQL Availability Group Performance Troubleshooting tools - I feel the need, the need for speed!

Most of us have been there, we've set up our Availability Group, added and seeded databases, performed our failover tests and everything has worked fantastically well. Everyone is happy, praise is falling freely from all quarters...because you're great. And you are.

But, what happens if it suddenly it wasn't running smoothly, the secondary keeps falling behind, and maybe the primary workload impacted along with it? What do we do, where would we look? More importantly, what tools can you use? Most of what I mention below are standard MS fare, with the best (in my humble opinion) one not widely used as far as I know. How terribly odd.

(If you don't want to read this post, no worries, but please still scroll down to the Simon's AGLatency Report Tool section and check that out. Its worth it.)

I'm not talking about failures/errors/unexpected failovers etc in this post, more that its working, but slower than my brother in law going to the bar when it's his round.

So what weapons of mass solution do we have at our disposal for diagnosing this.

(Pub, more beer! Just kidding...)

Waitstats

The good patient SQL, telling us where it hurts. I love this, always have a look in here.

Are HADR_SYNC_COMMIT waits prevalent? Often the sign of treacherous performance.

Performance Monitor Counters

I'm listing out the raw counters here, they may be surfaced in different tools, in slightly different ways, with slightly different names. They also give you the ability to view the data historically if they are being captured over a period of time, so are most useful when benchmarked regularly at periods of normal operation.

These counters are also surfaced through the sys.dm_os_performance_counters DMV, so you can capture and log from there should you so desire.

These break down to

SQLServer:Availability Replica

These counters give you the details on what levels of data are flowing.

Flow Control counters. If these are not 0, then you have a problem. SQL is deciding to throttle its AG data movement for whatever reason. (Which in my experience is often lack of resources).

Also

Resent messages / sec - Exactly what it says on the tin...if things have to be resent, then performance is never going to be good, have a look at network counters as a starting point.

SQLServer:Database Replica (View on the secondary)

I was going to highlight the important ones here, but its pretty much all of them.

So monitoring things like:

Log Send Queue - Log data that hasn't yet been sent to the secondary. - High is bad.

Recovery Queue - How much in the Secondary Log files that still need to be applied to the data files. - High is bad

Redo Bytes Remaining - Records to apply to the secondary data file(s) from the log files. - High is bad

Redone Bytes/sec - The speed at which changes are being applied to the secondary data file(s). High is good! Hurrah!

Transaction Delay - How much this is potentially impacting your application. - High is bad.

Availability Group Dashboard (and Extended Events)

The good ship Microsoft have supplied us with a one-stop-shop for Availability group monitoring.

The default view is a bit thin, just basically if things are Green for great or Red for RUN AWAY! By adding columns including some of those metrics mentioned above.

You can start to get an idea of how much work is being done and (boss default metric) remains to be done. Don't worry, some of the counters are only for one side of the AG, and the mode its running in.

From here there is even the Always On Availability Group Latency Reports. https://techcommunity.microsoft.com/t5/sql-server-blog/new-in-ssms-always-on-availability-group-latency-reports/ba-p/385774

.. although its not exactly new in SSMS anymore.

This essentially spins up extended event sessions on each replica, then provides a basic set of reports for the data. This is a great starting point, and can often show up any issues.

In the Availability Group Dashboard - Click 'Collect Latency Data'

What this does is spin up an SQL Agent Job on each replica - 'AlwaysOn_Latency_Data_Collector', which will run for a default time of 2 minutes, then stop.

To analyse this data, then open the reports on the Availability Group, so Right Click the AG name , Reports > Standard Reports > then the report you want.

This then allows you to view metrics for the Primary or Secondary Commit/Remote harden time etc.

And the data broken down. El Primerio.

And El Secondario.

In this example, near 21ms commit, with 11.9 of that waiting for the remote harden, but the secondary has no real times recorded, so it looks like the majority of the delay was traversing the network string. Now don't automatically go and harass your friendly network techs if you see this, you'll get punched.

Ask yourself a few questions...

Is the app running ok with these values?
Am I running in the correct AG Mode for my requirements?
Have I deployed a server with the processing power of a Commodore 64* at the other end?

ie... have I installed one Replica in London, and the other in Auckland, and then configured Synchronous mode? (If you have, then you may be punched once more.)

*If you're too young to know what this is, go ask your parents.

Rather than me just regurgitate content for all the metric meanings, refer to https://techcommunity.microsoft.com/t5/sql-server-blog/new-in-ssms-always-on-availability-group-latency-reports/ba-p/385774 .

However, not everything is tickety-boo.

Problem number 1: Primary Replica reports have to be run on the primary, Secondary Reports on the secondary, which makes overlaying / correlation of data trickier.

Problem number 2: Its a two minute capture by default. If you have to capture a specific event, thats quite a small window to catch it in.

Problem number 3: The data doesn't cover everything as its a complex chain of events under the cover for AG synchronisation (try looking at the extended event data...wow, just wow.) , so you may be able to get an idea of what's going on from this, but more often than not it's not enough.

Please sir, can I have some more?

This is where I turn to a tool that was introduced to me in my time in Microsoft CSS, and its something I haven't seen anyone else using in my time since so are going to recommend it. It performs a similar role to the Latency reports above, but just better.

Lets turn this amp up to eleven in true Spinal Tap fashion... the section most have probably jumped straight to.

Simon's AGLatency Report Tool

Or just AG Latency Tool - written by Simon Su. (About Simon Su - Microsoft Community Hub)

AGLatency report tool V2.2 introduction - Microsoft Community Hub

The direct download is here:

https://github.com/suyouquan/AGLatency/releases/download/v2.2/AGLatencyV2.2.zip

(I can't find any correct social media links to credit this work properly, if anyone knows of one, please let me know and i'll add it in here)

This expands on the Latency Reports, giving greater flexibility and further information around the various stages through the medium of dance. Ok, ok, maybe art as it gives us pretty pictures.

Check out the web link above, its so straight forward to use. There's full instructions on the site, and even an end to end video demoing its use.

The below are taken from the web link above. Again, this is Simon Su's magnificent work, not mine, so every scrap of credit should go to him.

The basic steps are:

Download and unpack the toolset to your client machine
Run AGLatency.exe
File > TSQL Script

This will then opens the TSQL Script you need, with all steps and telling you what to do where.

Create a working folder structure on your local client.

ie.. \AGName

\Primary

\Secondary

Run Step1 script (below)on your primary , save the output as primary.xml in your primary folder
Run Step1 script (still below) on your secondary, save the output as secondary.xml in your secondary folder (Merge together if multiple secondaries.)

Step1

select
AGNode.group_name
,AGNode.replica_server_name
,AGNode.node_name,ReplicaState.role,ReplicaState.role_desc
,ReplicaState.is_local
,DatabaseState.database_id
,db_name(DatabaseState.database_id) as database_name
,DatabaseState.group_database_id
,DatabaseState.is_commit_participant
,DatabaseState.is_primary_replica
,DatabaseState.synchronization_state_desc
,DatabaseState.synchronization_health_desc
,ClusterState.group_id
,ReplicaState.replica_id
from sys.dm_hadr_availability_replica_cluster_nodes AGNode
join sys.dm_hadr_availability_replica_cluster_states ClusterState
on AGNode.replica_server_name = ClusterState.replica_server_name
join sys.dm_hadr_availability_replica_states ReplicaState
on ReplicaState.replica_id = ClusterState.replica_id
join sys.dm_hadr_database_replica_states DatabaseState
on ReplicaState.replica_id=DatabaseState.replica_id
for XML RAW, ROOT('AGInfoRoot')

Step 2: Run the below in each A.G. replica, changing the target file location, and number of files/size as required

CREATE EVENT SESSION [AlwaysOn_Data_Movement_Tracing] ON SERVER
ADD EVENT sqlserver.hadr_apply_log_block,
ADD EVENT sqlserver.hadr_capture_filestream_wait,
ADD EVENT sqlserver.hadr_capture_log_block,
ADD EVENT sqlserver.hadr_capture_vlfheader,
ADD EVENT sqlserver.hadr_db_commit_mgr_harden,
ADD EVENT sqlserver.hadr_log_block_compression,
ADD EVENT sqlserver.hadr_log_block_decompression,
ADD EVENT sqlserver.hadr_log_block_group_commit ,
ADD EVENT sqlserver.hadr_log_block_send_complete,
ADD EVENT sqlserver.hadr_lsn_send_complete,
ADD EVENT sqlserver.hadr_receive_harden_lsn_message,
ADD EVENT sqlserver.hadr_send_harden_lsn_message,
ADD EVENT sqlserver.hadr_database_flow_control_action,
ADD EVENT sqlserver.hadr_transport_flow_control_action,  
ADD EVENT ucs.ucs_connection_flow_control,
ADD EVENT sqlserver.hadr_transport_receive_log_block_message,
ADD EVENT sqlserver.log_block_pushed_to_logpool,
ADD EVENT sqlserver.log_flush_complete ,
ADD EVENT sqlserver.recovery_unit_harden_log_timestamps
ADD TARGET package0.event_file(SET filename=N'c:\temp\AlwaysOn_Data_Movement_Tracing.xel',max_file_size=(500),max_rollover_files=(4))
WITH (MAX_MEMORY=4096 KB,EVENT_RETENTION_MODE=ALLOW_SINGLE_EVENT_LOSS,MAX_DISPATCH_LATENCY=30 SECONDS,MAX_EVENT_SIZE=0 KB,
MEMORY_PARTITION_MODE=NONE,TRACK_CAUSALITY=OFF,STARTUP_STATE=ON)
GO

Start all of these XEvent sessions on Primary and Secondaries. I generally leave them for 10-15minutes and add a waitfor and XEvent stop in to the same script, but be mindful, these files can grow quickly.
Copy the xel files from each replica to the relevent Primary and Secondary folders you created earlier.

We're all set, let the Tool "do its thang!"

Browse to where you've saved in the Primary and Secondary folders
Click Start.

Go Go Go Go!

Processing will depend on the size of the xel files, but should only be a few seconds/ minutes, even with large files.

You are then presented with the results via a Webpage. The AG Replica summary at the top, but then the performance data you're interested in.

The above graphs are interactive as well, you can show/hide various metrics as needed to see what's what.

You also get this bad-boy.

Oh my word... this shows the pattern of what happens in synchronising an Availability group (Synchronous).

And the order...

And the Extended Events it relates to...

And which of these collectively relate to each 'section' of A.G. sync'ing...

You can even look at the base data to get more detail. Simply click on the relevent link on the left hand side. This takes you to a page giving a description of the metic, and aggregations of the data down to the second. Ooooo, we likey.

The entire report gets saved in the \Report\{DateTime} subfolder of where you extracted the A.G. Latency tool to.

Summary

This tool allows you to pinpoint exactly where the bottlenecks are in the A.G. commit processing. How good is that! You may be able to see why I wax lyrical over it.

...and no one seems to use this tool. This makes me very sad.

Hope you've enjoyed reading, and now have another howitzer of a tool in your arsenal.

Rod Wile E. Edwards.

(No Roadrunners were harmed in the creation of this post).