This is the third and final post in the series "SQL Observation for the Nation".
Part one can be found here:
Part two can be found here:
Part two covered Errors, Throughput and Saturation, three of the SRE golden signals that can be used to standardise monitoring across the multiple different elements that make up the application and infrastructure environment. Again, consider these Golden Signals our monitoring Towers, for want of nothing more than to justify a tenuous title link.
Just to be aware, Part 3 is the DVD Extended Edition, with slabs of code to show the method I've used previously to capture this. All scripts are available and attached as text files.
So, let's finish with Availability, along with Latency, A.L.
You can call meeee AL...
How do we know that our SQL database is up and responding OK to the most minimal of queries? As mentioned in Part 1, for a base level of SQL monitoring we can't cater for every single query, as they all have different workload and performance profiles, and there are (usually) thousands of different types of queries, so monitoring every single one normally isn't feasible.
So we just check that the SQL Engine service is started, and everything must be OK, right? Ahhh, I remember those days... But alas, no; all that means is that SQL and its databases may be in a position to accept connections.
What if the DB is offline/suspect/recovering?
What if the DB is in Availability Group and that group is resolving? (or for some reason, you're pointing at a Secondary?)
What if the Threadpool is exhausted?
What if Gimli and Legolas are running their CPU destroying 'Enemies killed' report and all the resources are depleted?
etc.
All of these can impact the availability of application connections to the database, and therefore the availability of the system as a whole.
Is someone in the house just because the light is on? You won't know for sure until you go and knock on the door and someone answers. Sound familiar? We've been using our friend, the good old Packet InterNet Groper, for years - known as ping to the rest of the fellowship.
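In SQL terms, that knock can be as cheap as it gets. As a rough one-line sketch (the instance name is illustrative, and I'm leaning on dbatools' Invoke-DbaQuery):
# The most basic of heartbeats: connect, ask for a 1, and see if anyone answers
Invoke-DbaQuery -SqlInstance 'SERVER1' -Query 'SELECT 1' -QueryTimeout 5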
Now, monitoring tools do this (and more) by running instance and database level commands and getting responses to give you lovely data, but we don't get an easy number at the end of the day/week/"management shouty cycle" for what that means in the grand scheme of SLAs.
So the new quest: ditch the ring, knock on all the SQL Server front doors in the Shire, then step in and out of every room in every hobbit hole... oh, and time how long each took.
Sounds complex? It really isn't; a little bit of PowerShell, with help from the awesome contributors of dbatools (dbatools – the community's sql powershell module), and this is soon sorted out.
Let's get started.
Codesnip1
Import-Module dbatools
[reflection.assembly]::LoadWithPartialName("Microsoft.SqlServer.Smo") | out-null
#Main
$TargetServer='MonitoringServer'
$TargetDB='MonitoringDataRepository'
$TargetTable='tbl_SQLAvailability'
$InventoryServer='InventoryServer'
$InventoryDB='InventoryDB'
$dbresult=@()
try
{
<#
Run a query here to populate your list of servers you want to sweep. In this case it's from an inventory table, but this could be direct from CMS,
a text file - anything that you want to pull your server list from.
#>
$SQLRepoConnection = New-Object System.Data.SQLClient.SQLConnection
$SQLRepoConnection.ConnectionString ="server=$InventoryServer;database=$InventoryDB;Integrated Security=True;"
$SQLRepoConnection.Open()
$SQLCommand = New-Object System.Data.SqlClient.SqlCommand
$SQLCommand.CommandText = "SELECT Server_Name FROM MyServerList"
$SQLCommand.Connection = $SQLRepoConnection
$SQLAdapter = New-Object System.Data.SqlClient.SqlDataAdapter
$SqlAdapter.SelectCommand = $SQLCommand
$SQLDataset = New-Object System.Data.DataSet
$SqlAdapter.fill($SQLDataset) | out-null
All we are doing in the above is:
Define the TargetServer, database and table where we want to log our data.
Define the InventoryServer and table from which we pull the list of servers we want to perform this heartbeat check on. In this case, it's just a simple inventory table called MyServerList, but this could be anything: a text file (don't do this), a spreadsheet (or this), or your up-to-date Central Management Server (yay! do this - a quick sketch of the CMS route is just below).
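If you do go the CMS route, the swap might look something like this rough sketch (the CMS instance name is illustrative, and I'm assuming dbatools' Get-DbaRegServer for reading the registered servers):
# Hypothetical alternative: pull the sweep list from a Central Management Server
$serverList = Get-DbaRegServer -SqlInstance 'MyCMSServer' |
    Select-Object -ExpandProperty ServerName -Unique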
Now we have our dataset with the list of servers we want to poke, let's have at them!
Codesnip2
# Now loop around all of these servers, running the SQLConnect function defined below.
foreach ($data in $SQLDataset.Tables[0])
{
$servername = $data[0]
SQLConnect $servername
}
}
catch
{
Write-Host "Failed to connect to SQL Server: $servername"
$ErrorMessage = $_.Exception.Message
Write-Host $ErrorMessage -ForegroundColor Red
continue
}
Loop around each server and run the SQLConnect function.
The what function? Oi! What aren't you showing us?
All the logic happens in the SQLConnect function, which, as I'm not pay-per-view, is explained below in Codesnips 3-5.
Codesnip3
function SQLConnect
{
param (
[Parameter(Mandatory = $true)][String]$serverName
)
#Open a connection to master, and prepare the query listing the databases we will sweep
$connectn = New-Object System.Data.SqlClient.SqlConnection("Data Source=$serverName;Integrated Security=True;Connect Timeout=5;Initial Catalog=master;MultipleActiveResultSets=True");
$cmd = New-Object System.Data.SqlClient.SqlCommand("select '['+name+']' from sys.databases where state_desc <> 'OFFLINE' AND state_desc <> 'RESTORING' and database_id > 4", $connectn);
$cmd.CommandTimeout = 5;
try
{
#Start timer before connection to server - this is the start of the latency check
$svrtmrconn = [System.Diagnostics.Stopwatch]::StartNew()
$connectn.Open();
$Connectduration=$svrtmrconn.Elapsed #for time to open the connection
$servertimer = [System.Diagnostics.Stopwatch]::StartNew()
# Get timings at this point
$SQLDBAdapter = New-Object System.Data.SqlClient.SqlDataAdapter
$SqlDBAdapter.SelectCommand = $cmd
$SQLDBDataset = New-Object System.Data.DataSet
$SqlDBAdapter.fill($SQLDBDataset) | out-null
#Get time to get DB info
$sysdbduration=$servertimer.Elapsed
Write-host "Connected to Server $servername." -ForegroundColor Green
Again, all this does is:
Start a timer
Open a connection to the server
Record the state of the timer
Query the server for the list of database names that are valid to be poked (note, I deliberately exclude restoring, offline and system DBs in the above; adjust to taste)
Record the state of the timer again.
Essentially, we now have the time it took to open the connection, and then the time to query the master DB.
OK, onward! Time to do some knocking on those user database doors, and make sure we are answered.
Codesnip4
#Loop through the list of DBs we want to check
foreach ($dbdata in $SQLDBDataset.tables[0])
{
#Run a minor cmd against DB to confirm accessible, start a timer before, and record afterwards to get time it took
try {
$output=$null
$databaseName= $dbdata[0]
$dbcmd = new-object system.data.sqlclient.sqlcommand("select top 1 name from $databasename.sys.objects", $connectn);
$dbcmd.CommandTimeout=5;
$dbtimer = [System.Diagnostics.Stopwatch]::StartNew()
$output = $dbcmd.ExecuteReader()
$dbduration = $dbtimer.Elapsed
$output.Close()
#Build up dbresult set of duration and whether it was a success or not.
#Note: TotalMilliseconds rather than Milliseconds, as the latter only returns the sub-second component of the TimeSpan.
$databaseName = $databaseName -replace '[][]',''
$global:dbresult += ([pscustomobject]@{Servername=$servername;DBName=$databaseName;Duration=[int]$dbduration.TotalMilliseconds;Network=$null;Success=1})
}
catch {
#Database failed connection. Populate the dbresult set with such a travesty.
$databaseName = $databaseName -replace '[][]',''
Write-Host " $databasename Database Not accessible" -ForegroundColor Red
$global:dbresult += ([pscustomobject]@{Servername=$servername;DBName=$databaseName;Duration=$null;Network=$null;Success=0})
continue
}
}
Start another timer
Execute the most basic of basic queries (here I just used select top 1 name from sys.objects), but you could have a canary table and write a row, etc. - anything you want (a quick sketch of the canary idea follows this list).
Create a result set with the database name, the time it took (from our timer), and the fact that it was a success *Hurrah*!
If the connection fails, we end up in our wicked naughty catch block, where we write the database name and that it wasn't a success *boo*! Note the loop continues, as we don't want one failed DB connection to abort the entire server run.
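If you do fancy the canary-table route instead of a simple read, a minimal sketch might look like this (table and column names are entirely hypothetical, and you'd pre-create the table in each database first):
-- Hypothetical canary: prove the database accepts writes, not just reads
INSERT INTO dbo.tbl_MonitoringCanary (PingTime) VALUES (SYSDATETIME());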
And finally, Codesnip5 records the success or failure of the server level connection from earlier.
Codesnip5
#Now record that the Server level connection was a success, and the time it took to connect (again using TotalMilliseconds)
$global:dbresult += ([pscustomobject]@{Servername=$servername;DBName=$null;Duration=[int]$sysdbduration.TotalMilliseconds;Network=[int]$Connectduration.TotalMilliseconds;Success=1})
}
catch
{
#Or the server level failure to connect... woe is us.
Write-Host "ERROR opening connection to $servername. $($_.Exception.Message)";
$global:dbresult+= ([pscustomobject]@{Servername=$servername;DBName=$null;Duration=$null;Network=$null;Success=0})
continue;
};
$connectn.Close();
};
And the last part of the Main script:
Write our PowerShell object, containing our Latency (the timer values) and Availability data (Success 1 or 0 above), to the monitoring table defined in Codesnip1.
Codesnip6
try
{
$SQLRepoConnection.close()
$final=$global:dbresult |ConvertTo-DbaDataTable
#Now write our results for the Availability and Latency Sweep to our Target DB and table
Write-Host "Saving Results to $Targetserver.$TargetDB.$TargetTable" -ForegroundColor Yellow
IF ($final) {Write-DbaDataTable -InputObject $final -SqlInstance $TargetServer -Database $TargetDB -Table $TargetTable -Autocreatetable}
}
Catch
{
Write-Verbose "Error Loading results to $Targetserver.$TargetDB.$TargetTable"
$ErrorMessage = $_.Exception.Message
Write-Host $ErrorMessage -ForegroundColor Red
}
Note: I'm using -AutoCreateTable above for ease of testing and demo (actually, I've been drinking again, so just lazy), but it's recommended to pre-stage your target table with valid datatypes first.
Oh, here's one....
CREATE TABLE [dbo].[tbl_SQLAvailability](
[Servername] [nvarchar](128) NULL,
[DBName] [nvarchar](128) NULL,
[Duration] [int] NULL,
[Network] [int] NULL,
[Success] [bit] NOT NULL,
[CollectionDateTime] [datetime2] DEFAULT getdate()
) ON [PRIMARY]
GO
Add indexes to taste, dependent on how you plan to query the data. I recommend a clustered index on CollectionDateTime, and a PK, as a minimum.
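As a minimal sketch of that indexing (the index name is my own choosing):
-- Cluster on the collection timestamp to suit the date-range aggregations below
CREATE CLUSTERED INDEX CIX_tbl_SQLAvailability_CollectionDateTime
    ON dbo.tbl_SQLAvailability (CollectionDateTime);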
So, having run it, what do we end up with in our tbl_SQLAvailability? "Oooo, lovely data rows, they look tasty...."
A line for each SQL instance you connect to, the DBs, whether each was a success or not, and the time taken for each connection.
Server level connections have a DBName of NULL, with the Network column recording how long it took to open that original connection, and the Duration column recording how long the query against the DB took to execute.
From the above, showing a single poll, we see SERVER1 responded in 6ms for the connection, and durations between 1 and 8ms for the database 'door knocking'.
Now, please don't go to your nearest Network tech and exclaim "The network took 'x' long to do 'y', this said so"... as you may be attacked with something sharp and pointy that glows blue when Incident Managers are nearby.
All of this data is comparative; as with most things, gain a baseline, and then you have your trends for how something is behaving.
Also, the laptop I'm running this from may actually be from Middle Earth, ergo not speedy, so ignore the absolute numbers above.
Great, data, top of the class....Now what?
So imagine that script running every minute via a SQL Agent job on one or more of your DBA servers, polling your SQL Servers to see what is going on and logging to the table. This would give a potential 1440 polls per server and DB per day.
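For reference, a bare-bones sketch of that scheduling (the job name, step name and script path are all illustrative):
-- Hypothetical Agent job running the PowerShell collector every minute
USE msdb;
EXEC dbo.sp_add_job @job_name = N'SQL Availability Sweep';
EXEC dbo.sp_add_jobstep @job_name = N'SQL Availability Sweep',
    @step_name = N'Run heartbeat script', @subsystem = N'PowerShell',
    @command = N'& "C:\Scripts\SQLAvailabilitySweep.ps1"';
EXEC dbo.sp_add_jobschedule @job_name = N'SQL Availability Sweep',
    @name = N'Every minute', @freq_type = 4, @freq_interval = 1,
    @freq_subday_type = 4, @freq_subday_interval = 1;
EXEC dbo.sp_add_jobserver @job_name = N'SQL Availability Sweep';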
You can now start to build a picture of Availability for all of your servers and databases, and of Latency for the time it takes to open connections and run queries.
So if we're running this every minute....
...and we know how many minutes are in a day (1440)
...and we know how many heartbeats were success 1 or 0?
Oooo Samwise, we can now calculate a % figure of our availability to prove our SLA to Lord Elrond, can't we? All this and Liv Tyler was nowhere in sight (...unfortunately).
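As a quick worked example: a server that answers 1420 of its 1440 polls in a day scores 1420 / 1440 * 100 = 98.611% availability for that day - a figure you'll see again for SERVER2 shortly.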
Server Level Stats
Time for some Maths
Or math, maffs, or at the risk of jumping fantasy genres, "Teef countin'".
So, we've had the above running for some time; let's have a look at our data. Starting with the server metrics, i.e. is the Engine responding?
--Get Server Metrics
;WITH ServerStats AS
(
SELECT Servername, AVG(Network) AS AvgNetwork, MAX(Network) AS MaxNetwork,
CASE Success WHEN 1 THEN 'True'
WHEN 0 THEN 'False'
END AS Success,
COUNT(Success) As PingCount,
SUM(COUNT(Success)) OVER (PARTITION BY Servername,CONVERT(Date,CollectionDateTime)) as TotalCount,
CONVERT(Date, CollectionDatetime) AS CollectionDate
FROM [MonitoringDataRepository].[dbo].[tbl_SQLAvailability]
WHERE DBName IS NULL
GROUP BY Servername, Success,CONVERT(Date, CollectionDatetime)
)
SELECT Servername, AvgNetwork, MaxNetwork, Success, PingCount, TotalCount, CONVERT(DECIMAL(8,3),PingCount)/TotalCount *100 AS ServerAvailability, CollectionDate FROM ServerStats
ORDER BY CollectionDate,Servername, Success
From this, we get our aggregated metrics, including Availability for the day. Marvellous, ding-dong the witch is dead and all that jazz.
Want to know when the server didn't respond? Simples...
--Server Missed Pings
SELECT * FROM [MonitoringDataRepository].[dbo].[tbl_SQLAvailability] WHERE Success=0 AND DBname IS NULL
Database Level Metrics
And what about our Database Level stats? We do exactly the same for that data as well.
--Get Database Level Metrics
;WITH Totalcount AS
(SELECT Servername, COUNT(Success) AS TotalCount,CONVERT(Date, CollectionDatetime) As CollectionDate
FROM [MonitoringDataRepository].[dbo].[tbl_SQLAvailability]
WHERE DBName IS NULL
GROUP BY Servername, CONVERT(Date, CollectionDatetime))
, DBStats AS
(
SELECT Servername, DBName, AVG(Duration) AS AvgDuration, MAX(Duration) AS MaxDuration,
CASE Success WHEN 1 THEN 'True' WHEN 0 THEN 'False' END AS Success,
COUNT(Success) As PingCount,
CONVERT(Date, CollectionDatetime) AS CollectionDate
FROM [MonitoringDataRepository].[dbo].[tbl_SQLAvailability]
WHERE DBName IS NOT NULL
OR (DBName IS NULL AND Success=0)
GROUP BY Servername, DBName, Success,CONVERT(Date, CollectionDatetime) )
SELECT d.Servername, DBName, AvgDuration, MaxDuration, Success, d.PingCount, t.TotalCount, CONVERT(DECIMAL(8,3),d.PingCount)/t.TotalCount *100 AS DBAvailability, d.CollectionDate FROM DBStats d
INNER JOIN Totalcount t ON d.Servername=t.Servername and d.CollectionDate=t.CollectionDate
ORDER BY CollectionDate, Servername, DBname, Success
Woo!
What's this telling us?
Server1
DB1, DB2 and DB3 were accessible throughout 02/07/23 - 100% Availability.
DB1's database level connection lost 5 polls (mins) on 03/07/23, giving 99.653% Availability for that DB alone (DB2 and DB3 not shown above).
Server2
The server lost 20 polls (mins) on 02/07/23, giving an Availability of 98.611%; this obviously impacted DB1's availability by the same amount.
For both, the downtime % is also listed where the Success=False.
Fantastic, we now have our metrics of Availability, and some Latency aggregations for both server and database levels.
Estate Level Calculations for our Masters.
So what about the Estate level? For when you are inevitably asked, "How's my SQL Service that you are providing?" It may be from your boss, your boss's boss or Gandalf the Grey himself, and they want to inspect the IT Estate in its entirety. Wouldn't it be nice to say "Ah, SQL as a Service has availability of 99.993% over the last month"? <engage smugmode>
You now have your Availability data, so you can do this... but be cautious about the calculations used. If all your servers have equal numbers of databases, then great, it's a lot easier. But this is often not the case, so an outage of a server with 100 databases will have a greater impact on the Estate availability 'score' than one with a single database. That assumes the servers/DBs are of equal importance, of course, which is what we are doing here.
When all servers have the same number of polls, which is 'normally' the case, then averages of percentages work OK... however, what about that server Legolas deployed, and that DB Gandalf added as well? These happen midway through the day, so they don't have the full number of polls for the day that the others do.
Consider if you added a server with 10 databases at 11:59pm, so it only had 1 poll recorded against it. Using simple averages of the percentages would essentially assume a sample size the same as the others, and has the potential to skew our Availability score. All of a sudden it's reported beneath SLA, and you hear your bosses saying "...what about their legs? They don't need those..."
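To put numbers on that: say ten single-database servers each answer all 1440 of their polls that day, and the new server's only poll (at 11:59pm) fails. A simple average of the per-server percentages gives (10 x 100 + 0) / 11 = 90.9%, while pooling all the polls together gives 14400 / 14401 * 100 = 99.993%. Same day, wildly different story.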
One way to cater for this is to treat the Estate metrics as one dataset, so the differing sample sizes are accounted for.
Say you have 10 Servers.
Server 1-5 have 10 databases each
Server 6-10 have 5 databases each
So a total of 75 databases - 50 (5 servers x 10 DBs) + 25 (5 servers x 5 DBs)
If we're polling every minute, that gives 1440 polls per day FOR EACH database, giving a total possible successful poll count of 108,000 (1440 x 75).
So to get an ESTATE score of 100% Availability, we need 108,000 successful polls.
We know how many successful polls we have from the above collections already, so thats covered.
Using our data above, where I only have 2 servers (SERVER1 and SERVER2) - one with 3 databases and one with 1 database, 4 databases in total - 4 x 1440 polls gives us a maximum poll count of 5760 for a given day.
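So with SERVER2's 20 lost polls against its single database, the estate score for that day should work out as (5760 - 20) / 5760 * 100 = 99.653%. Let's prove it: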
--Get Estate metrics
;WITH Totalcount AS
(
SELECT COUNT(Success) AS TotalCount,CONVERT(Date, CollectionDatetime) As CollectionDate
FROM [MonitoringDataRepository].[dbo].[tbl_SQLAvailability]
WHERE DBName IS NOT NULL
OR (DBName IS NULL AND Success=0)
GROUP BY CONVERT(Date, CollectionDatetime)
)
, DBCounts AS
(
SELECT
COUNT(Success) As SuccessPingCount,
CONVERT(Date, CollectionDatetime) AS CollectionDate
FROM [MonitoringDataRepository].[dbo].[tbl_SQLAvailability]
WHERE DBName IS NOT NULL
GROUP BY CONVERT(Date, CollectionDatetime) )
SELECT t.TotalCount, d.SuccessPingCount, d.CollectionDate, CONVERT(DECIMAL (8,3),d.successpingcount)/t.TotalCount*100 AS PctAvailability
FROM Totalcount t INNER JOIN DBCounts d ON t.CollectionDate=d.CollectionDate
As one of our servers, SERVER2, lost 20 mins that day, it gives us a different availability % to that above. In isolation it had an Availability score of 98.611% for 02/07, but taking the other servers into account, the impact on the entire estate is less (the 3 DBs on the other server were at 100%).
"You only love SQL Server, you do; that's why you've written this purely around SQL Server."
Apart from it being a SQL blog... no, not true. All of the above stems from a simple PowerShell call to SQL Server. Switch that to the relevant call for another database engine (e.g. Postgres), and you can do exactly the same there too, if you so desired.
<Backhand volley, fifteen-all>
Latency
"A DBA is never late Frodo Baggins, Nor are they early. They arrive precisely when they mean to."
At the same time as collecting the Availability, we've been grabbing the Latency as well: on opening the connection to SQL, and then when a basic check is performed to make sure the DB can service the requests sent to it.
So what good is the Latency?
You can see if a server is taking longer to respond, whether that is down to things like....
Network slowness
CPU Pressure - Corresponding to our other metrics, saturation etc.
IO pressure - Saturation again.
IO Latency
Just expanding into a subsystem again, like in Part 2... and again, aggregated to the DB level.
SELECT
DB_NAME(fs.database_id) AS DatabaseName
, mf.type_desc
, SUM(fs.io_stall_read_ms) AS ReadStallTotalms
, SUM(fs.io_stall_write_ms) AS WriteStallTotalms
, SUM(fs.io_stall_read_ms + fs.io_stall_write_ms) AS IOStallsTotalms
,getdate() as CollectionDate
FROM sys.dm_io_virtual_file_stats(null,null) AS fs
INNER JOIN sys.master_files AS mf WITH (NOLOCK)
ON fs.database_id = mf.database_id
AND fs.[file_id] = mf.[file_id]
GROUP BY DB_NAME(fs.database_id), mf.type_desc
ORDER BY DatabaseName, type_desc desc
This gives us our collective stall totals; delta this up and we have our Database Level IO metrics.
As before, use LAG/LEAD for delta workings:
SELECT [DatabaseName]
,[type_desc]
,[ReadStallTotalms] - LAG([ReadStallTotalms]) OVER (PARTITION BY DatabaseName, type_desc ORDER BY CollectionDate) as ReadStallTotalDeltaMs
,[WriteStallTotalms] - LAG([WriteStallTotalms]) OVER (PARTITION BY DatabaseName, type_desc ORDER BY CollectionDate) as WriteStallTotalDeltaMs
,[IOStallsTotalms] - LAG([IOStallsTotalms]) OVER (PARTITION BY DatabaseName, type_desc ORDER BY CollectionDate) as IOStallTotalDeltaMs
,DATEADD(HOUR, 3,[CollectionDate]) as CollectionDate
FROM [BZYDBA].[dbo].[tbl_IODBLatency]
where DatabaseName='tempdb'
ORDER BY type_desc desc, CollectionDate
As always, graph it, trend it, use it.
Remember that Observability paragraph, sitting way back in the comfort of Bag End in part one? You do? Good.
Combine this with infrastructure latency on the disks, and voila, we're getting a chain of performance data between systems and some resemblance of Observability. Ye gods! Did we do that?
As with the Traffic/Throughput in Part 2, if it's a database responding slower when all others on the server are OK, then you can eliminate a lot of paths instantly and focus purely on that one DB. Troubleshooting is as much about identifying what something is NOT as identifying what something IS.
Again, as mentioned in Part 1, the combination of these metrics allows us to troubleshoot much more accurately. Now take this information into our troublesome application situation...
High Latency with no other Saturation metrics showing issues could indeed suggest network slowness (purely an example), and if the Network folks are monitoring in a similar way, we have parts of that end-to-end picture of Observability.
User (not ok) >
Web Server (not ok) >
App Server (not ok) >
DB Server (ok - Availability) >
DB Server (ok - Saturation) >
DB Server (ok - Errors) >
DB Server (not ok - Latency) >
Disk Infrastructure
Disk IO Metrics (ok - Latency) >
Network Infrastructure > (Network performance investigation)
Network metrics reviewed.
Availability >
Saturation etc >
We have a tree that allows a faster progression to the pain points.
Collaboratively, the downstream errors can be explained and RCA reached quicker, with less effort.
Armed with all this data, logged and aggregated... throw a chart against it via your favourite visualisation tool, and management and app teams can self-serve the metrics they're interested in and escalate when necessary.
And that's it, you'll all be glad to know: 4 SRE metric principles, along with one addition providing arguably the most important information of all - whether something is online or not.
It is the Ring's will.
But why do it, why do any of this?
Monitoring has been evolving and becoming more structured, and Observability, when everyone is on board, can be a very powerful process. By structuring SQL monitoring and data collection in a way that matches other teams, we all start to pull in the same direction, and troubleshooting user issues end-to-end starts to get easier with less wasted effort... and we all want less wasted effort, right?
I'm not going to lie, that Sauron chap is a sneaky sneakster. It's far from easy getting all teams aligned. But with an SRE element involved, and various SMEs of each area embedded within an SRE team, it's very achievable.
In short, we've now microwaved the ring on full power, given Sauron a serious kicking, chucked him through a few windows and the steeplejack Fred Dibnah is now eyeing up the base of Barad-dûr with an evil twinkle in his eye. Beautiful.
Player 1 wins. Flawless Victory.
If you're still here at this point, my utmost thanks for following my rambling drivel. Hopefully it has given you some ideas on how to improve your own monitoring, or how to combine it better with what other teams are doing. Some/all of this can be REALLY useful and open up insights you already had but didn't realise, just needing things structured in a useful way.
If you're inspired, awesome - try some of these DRE principles... but remember:
"...It's a Dangerous Business, Frodo, Going Out Your Door..." :-) You might just like it.
Rodo Baggins.
Scripts:
The PowerShell script to collect:
And the T-SQL code from above.
Availability mashups