This is the first follow up post from: Azure SQL TDE and Customer Keys (BYOK). Microsoft?...your name isn't down, so you aren't coming in. (sqlrod.com) , which explained how to use Customer Keys with Azure SQL DB (and Managed instance), and some of the dangerous pitfalls that you can face. We need to know when there may be trouble on the horizon, so key (pun fully intended) to this is monitoring.
Yes, i'm rambling on again about monitoring...but I like monitoring.
Monitoring .
We really need to monitor so that we know what is going on, and more importantly, what is going on before our customers and bosses do. From this, we can then take actions, (preferably proactively) effectively swooping in and saving the day, albeit silently to our customers if we get it right.
Monitoring for Inaccessible Databases.
If a Server/DB loses access to the vault that contains its key, we have a kill point that we need to act upon really quickly. As soon as a DB switches to Inaccessible, that marks the death point for your users being able to work, clock starts ticking and Simon Pheonix is now thawing.
So we head over to the Activity log. Annoyingly, what you find is that there is an Update SQL database operation that is Started, and In progress, but nothing easily seen about what its doing at the high level.
So open the event and hop over to the JSON tab, to see the EventName . You'll see it buried away in the properties, the eventname of "MakeDatabaseInaccessible" with a status of "Started". Oh great, when using an Alert to an Action group from the standard ActivityLog Monitoring, we can't narrow down an alert on that very well.
Yes, we may be able to cover it with other statuses, but risk an Alert storm, which can be nearly as dangerous as no alerts, as nobody sees/acknowledges and actions them.
Take Note
Even more worrying, if the DB on a Serverless tier with AutoPause enabled, and the DB is Paused, it doesn't transition to inaccessible either, so you don't receive an alert as it remains in a Paused state.
Connections to the DB will fail with "Database '{DatabaseName}' on server '{servername}' is not currently available" as it cannot unpause.
If the resource is amended, ie... disabling autopause, you will recieve a Failure on Update SQL Database with the relevent information in the StatusMessage.
This one could be valid for a standard Alert from the Activity Log, as ideally, we would like to know if there's any failed update on our precious resources.
So we need a better monitoring warning system.
Log Analytics and Log Alerts.
There are probably many methods, but many shops already use Log Analytics, and you can filter from the output in a MUCH more granular way, so we'll use that.
Create a Log Analytics workspace if you haven't got one already.
In the Activity Log of the resource that you want to export the Activity Log data to, select Export Activity Logs
+ Add Diagnostic setting.
I've selected, Administrative (the example we are using is Administrative), ServiceHealth, Alert, and ResourceHealth here and tick 'Send to Log Analytics' workspace.
Great, so now we are pushing our logs to a target with more options.
Open up your Log Analytics workspace, Monitoring Blade - Alerts ..and +Create - Alert Rule
Custom Log Search.
In your search query, we're going to add a bit of KQL
AzureActivity
| where parse_json(Properties).EventName == "MakeDatabaseInaccessible"
Just select a count of table rows, greater than 0, aggregating every 1 minute as we want to know as fast as possible.
It should now look something like the below.
Click on Actions, and make sure you have an Action group set up to notify you in your preferred manner, i.e., email, SMS, push message etc.
The beauty of this here, is you can also have an action to run remidiation logic in the form of a Runbook, Logic App, Azure Function if you so wish? Did someone say self healing?
Click on Details and set the Severity - 0 - Critical here, as this isn't one to mess around with.
Click Review and Create, then Create.
Now, should the worse ever happen, you will get notified at least.
Due to the way the Events manifest themselves, the MakeDatabaseInaccesible event only transitions to Succeeded when the DB is online again (Really Mr. Microsoft?!).
If you wanted a 'Fix' alert, then the KQL would be:
AzureActivity
| where parse_json(Properties).EventName == "MakeDatabaseInaccessible"
| where parse_json(Properties).Status == "Succeeded"
And you classify this as an Informational alert if you so wanted.
Monitoring the Vaults themselves.
"No vault, our fault..." Roddy Burns '23.
Easier than above as Microsoft publish great information on how to monitor the actual vaults themselves. This is detailed at Monitoring Azure Key Vault | Microsoft Learn, the basic recommended Activity Log / Logs alerts are below.
Key Vault Availability drops below 100% (Static Threshold)
Key Vault Latency is greater than 1000ms (Static Threshold)
Overall Vault Saturation is greater than 75% (Static Threshold)
Overall Vault Saturation exceeds average (Dynamic Threshold)
Total Error Codes higher than average (Dynamic Threshold)
These are Platform metrics, available under the Monitoring Blade of your Key Vault, Alerts and Create Alert Rule. Conditions are available here along with Custom Log Search alert.
For a Custom Log Search against a Log Analytics workspace, to capture Failed requests, your KQL query could be:
// Count of failed KeyVault requests by status code.
AzureDiagnostics
| where ResourceProvider =="MICROSOFT.KEYVAULT"
| where httpStatusCode_d >= 300 and not(OperationName == "Authentication" and httpStatusCode_d == 401)
| summarize count() by requestUri_s, ResultSignature, _ResourceId
// ResultSignature contains HTTP status, e.g. "OK" or "Forbidden"
// httpStatusCode_d contains HTTP status code returned
Key Monitoring
Key Expiration
We also need to know some of the activities that are happening to our precious keys. With key expiration, this is easier, as the portal caters for this along with our favourite scripting options of course.
If Expiry Time is set, then a notification time will be enabled on the key. Be aware though, setting expiry time on the key won't enable the Notification setting until a new key version is generated.
Key Rotation
We also may want to know when a new version of a key is generated, either manually, or programmatically. We can do this by creating an Event Subscription through the Events blade on the key vault itself.
Name
Give the Subscription a name, such as keyrotationsubscription
Event Schema
Set this to Cloud Event Schema v1.0 (this will be set by our chosen endpoint type below anyway)
Event Types
As we're interested in Keys, select the following, note, i've included Vault Access Policy changed, as mentioned previously, this is REALLY important to stable operation. Ideally, you may want to split 'Vault Access Policy Changed' into a different event to assign a higher severity to that alert. The choice is yours.
Endpoint Details
Type
Here, you have the world of options, from Azure functions, to Webhooks, i've just chosen Azure Monitor Alert. This will change the Event schema to Cloud Event Schema v1.0.
Endpoint
Click 'Configure an Endpoint' and then choose the severity (Sev 3 (informational) seems apt)
Tick the 'Select action groups' box if you want to send this alert direct to an action group, and add a description to appear as part of the alert.
Done. You now have an alert when a key version changes.
Summary
If we have Key Vaults (or any Resource for that matter), we have to monitor it.
In this particular use case, we had a original requirement to implement Customer Managed keys, which gives us more options to mess things up so we need a monitor a little bit tighter. The above should give us a starting point on how to do that, and keep ourselves in the 'System Available' happy place.
As always, thanks for reading my ramblings.
Chubb Rod
Comments