Checking availability group status using a CMS
Today I want to show you another powerful way to use your Central Management Server. The CMS can also be used to execute OPENROWSET queries over your complete SQL Server Farm. Now I can hear you all say: “Why would you want to do that?” and “OPENROWSET, you know that you shouldn’t implement that for queries that have to run frequently!”. And to be honest, in most of the cases you’d be right. But it can come in extremely useful in some situations.
Scenario
Imagine a hosted environment with very strict and rather rigid release rules. Because of the importance and stability requirements of this environment there every component of the environment has to follow the complete DTAP (Development/Test/Acceptance/Production) workflow. Off course this includes the Production SQL Servers and their instances.
Enter the CMS, because a CMS is rarely considered crucial for operations we have introduced the CMS as a DBA Tool, granted a VERY powerful one, but still “just” a tool. By doing this we made the CMS the responsibility of the DBA team so that we can make changes to the CMS without having to go via QA/Testing/The whole change process and version control. (We actually never told our QA officer we actually CAN do with a CMS though).
Use Case
As we all know, for licensing compliance it can be very important to know on what server our active nodes of the deployed availability groups are located. This becomes even more important if you have several large servers, each serving multiple instances.
For license (read cost) reasons this environment runs in a complete active-passive setup. So it is very important that at any given time all active listeners are running on the preferred primary. This setting is easily configured in the Windows Cluster manager, but conform the SQL Server best practices you don’t want automatic failback. So if an instance is discovered to run on the “wrong” server we want to know this. But how can we be check and report this at any given point in time without having to check multiple (100+) availability groups?
In this environment SQL Instance patching happens during the nightly maintenance window. A predefined list of instances was patched by the hosting using their own procedure for rolling updates. But unfortunately any detailed report mentioning where they started and how they implemented this is missing. So we’d love to know the listeners situation as soon as possible.
Luckily (Cough Cough) our CMS or Central Management Server comes to the rescue, so we can use a centralized query to get this information from our servers in an efficient manner and even keep historical information.
OK let’s go through the steps needed to build this pull report getting this crucial information only using our CMS.
Preparing the CMS
Because in this case our production instances (and the QA officer) are rather allergic to undocumented and untested changes we don’t want to implement anything (let alone a linked server) on the Instance side. This means that we have to make sure that our CMS can execute Ad Hoc distributed queries. (This will be the only one instance change needed for the whole SQL Server Farm)
sp_configure ‘show advanced options’, 1;
RECONFIGURE;
sp_configure ‘Ad Hoc Distributed Queries’, 1;
RECONFIGURE;
GO
Get our servers from the CMS
Once connected to our trusty CMS, it’s easy to get the complete server list using the following query:
select s.server_name as name from msdb.dbo.sysmanagement_shared_registered_servers s
You could also include the groups, and add an extra filter if you want to, but for this blogpost I’ll keep this part rather straightforward.
Get meaningful info about our availability groups
Then we write the query that will fetch all the needed information (and some other interesting titbits) from our hadr groups and their servers.
select
hadrc.cluster_name,
SERVERPROPERTY(‘MachineName’) AS [ServerName],
SERVERPROPERTY(‘InstanceName’) AS [Instance],
SERVERPROPERTY(‘Edition’) AS [Edition],
SERVERPROPERTY(‘ProductVersion’) AS [ProductVersion],
Left(@@Version, Charindex(‘-‘, @@version) – 2) As VersionName,
lip.state_desc,
ip_address,
AGC.name as AG_Name,
isnull(l.dns_name, @@SERVERNAME )+‘:’+(CAST(port as varchar(8))) as Listener,
case WHEN is_conformant is null then 0 ELSE is_conformant END as is_conformant,
CASE
WHEN role=1 THEN replica_server_name
ELSE (select replica_server_name from sys.dm_hadr_availability_replica_cluster_states RCS
INNER JOIN sys.dm_hadr_availability_replica_states AS ARS ON ARS.replica_id = RCS.replica_id where role=1)
END as active_node,
CASE
WHEN role=2 THEN replica_server_name
ELSE (select top 1 replica_server_name from sys.dm_hadr_availability_replica_cluster_states RCS
INNER JOIN sys.dm_hadr_availability_replica_states AS ARS ON ARS.replica_id = RCS.replica_id where role=2)
END as passive_node,
hadrc.quorum_state_desc,
hadrc.quorum_type_desc,
CAST(CASE AGC.failure_condition_level
WHEN 1 THEN ‘SQL Server service is down/SQL Server AlwaysOn Lease
Timeout’
WHEN 2 THEN ‘Instance of SQL Server does not connect to cluster/availability replica is in failed state’
WHEN 3 THEN ‘Critical SQL Server internal errors / Default’
WHEN 4 THEN ‘Moderate SQL Server internal errors’
WHEN 5 THEN ‘Any qualified failure conditions/Exhaustion of SQL Engine worker-threads/Detection of an unsolvable’
END as varchar(110)) as AutomaticFailoverCondition,
RCS.join_state_desc,
ARS.role_desc,
ARS.connected_state_desc,
ARS.operational_state_desc,
ARS.synchronization_health_desc,
CAST(Getdate() as date) as [Timestamp],
CHECKSUM(CAST(SERVERPROPERTY(‘MachineName’) as varchar(128))+ AGC.name +CONVERT(varchar(10),getdate(),102))as LookupKey
FROM sys.availability_groups_cluster AS AGC
INNER JOIN sys.dm_hadr_availability_replica_cluster_states AS RCS ON RCS.group_id = AGC.group_id
INNER JOIN sys.dm_hadr_availability_replica_states AS ARS ON ARS.replica_id = RCS.replica_id
LEFT OUTER JOIN sys.availability_group_listeners l ON l.group_id = ARS.group_id
LEFT OUTER join sys.availability_group_listener_ip_addresses lip on lip.listener_id=l.listener_id
outer apply (select * from sys.dm_hadr_cluster) as hadrc
WHERE replica_server_name = SERVERPROPERTY(‘ServerName’)
Creating the Report Store
Ok now we have a good query, so let’s create the table that will store this information, so that we have a solid base to create our reporting/reports.
DECLARE @CMD nvarchar(4000)
DECLARE @CMD2 NVARCHAR(4000)
DECLARE @ServerName sysname
SET NOCOUNT ON;
SET @CMD=”’select
hadrc.cluster_name,
SERVERPROPERTY(””MachineName””) AS [ServerName],
SERVERPROPERTY(””InstanceName””) AS [Instance],
SERVERPROPERTY(””Edition””) AS [Edition],
SERVERPROPERTY(””ProductVersion””) AS [ProductVersion],
Left(@@Version, Charindex(””-””, @@version) – 2) As VersionName,
lip.state_desc,
ip_address,
AGC.name as AG_Name,
isnull(l.dns_name, @@SERVERNAME )+””:””+(CAST(port as varchar(8))) as Listener,
case WHEN is_conformant is null then 0 ELSE is_conformant END as is_conformant,
CASE
WHEN role=1 THEN replica_server_name
ELSE (select replica_server_name from sys.dm_hadr_availability_replica_cluster_states RCS
INNER JOIN sys.dm_hadr_availability_replica_states AS ARS ON ARS.replica_id = RCS.replica_id where role=1)
END as active_node,
CASE
WHEN role=2 THEN replica_server_name
ELSE (select top 1 replica_server_name from sys.dm_hadr_availability_replica_cluster_states RCS
INNER JOIN sys.dm_hadr_availability_replica_states AS ARS ON ARS.replica_id = RCS.replica_id where role=2)
END as passive_node,
hadrc.quorum_state_desc,
hadrc.quorum_type_desc,
CAST(CASE AGC.failure_condition_level
WHEN 1 THEN ””SQL Server service is down/SQL Server AlwaysOn Lease Timeout””
WHEN 2 THEN ””Instance of SQL Server does not connect to cluster/availability replica is in failed state””
WHEN 3 THEN ””Critical SQL Server internal errors / Default””
WHEN 4 THEN ””Moderate SQL Server internal errors””
WHEN 5 THEN ””Any qualified failure conditions/Exhaustion of SQL Engine worker-threads/Detection of an unsolvable””
END as varchar(110)) as AutomaticFailoverCondition,
RCS.join_state_desc,
ARS.role_desc,
ARS.connected_state_desc,
ARS.operational_state_desc,
ARS.synchronization_health_desc,
CAST(Getdate() as date) as [Timestamp],
CHECKSUM(CAST(SERVERPROPERTY(””ServerName””) as varchar(128))+AGC.name+CONVERT(varchar(10),getdate(),102))as LookupKey
FROM sys.availability_groups_cluster AS AGC
INNER JOIN sys.dm_hadr_availability_replica_cluster_states AS RCS ON RCS.group_id = AGC.group_id
INNER JOIN sys.dm_hadr_availability_replica_states AS ARS ON ARS.replica_id = RCS.replica_id
LEFT OUTER JOIN sys.availability_group_listeners l ON l.group_id = ARS.group_id
LEFT OUTER join sys.availability_group_listener_ip_addresses lip on lip.listener_id=l.listener_id
outer apply (select * from sys.dm_hadr_cluster) as hadrc
WHERE replica_server_name = SERVERPROPERTY(””ServerName””)”’
DECLARE ServerList Cursor FOR
select s.server_name as name from msdb.dbo.sysmanagement_shared_registered_servers s
OPEN ServerList
FETCH NEXT FROM ServerList into @ServerName
WHILE @@FETCH_STATUS = 0
BEGIN
SET @CMD2 = ‘Delete from MNGDB.Reporting.AG_ClusterInfo where ServerName=”’+@ServerName+”’ and Timestamp = CAST(Getdate() as date)’
exec sp_executesql @CMD2
SET @CMD2=‘INSERT INTO MNGDB.Reporting.AG_ClusterInfo SELECT a.* FROM OPENROWSET(”SQLNCLI”, ”Server=’+@ServerName+‘;Trusted_Connection=yes;”,’+@CMD+‘) AS a;’
exec sp_executesql @CMD2
FETCH NEXT FROM ServerList into @ServerName
END
CLOSE ServerList
DEALLOCATE ServerLis
Ok hit F5
Reporting the Status
After the previous run, you’ll have a table filled with all the information we need to be able to create multiple interesting reports using the same dataset. Please note that you don’t want to run this query multiple times (OPENDATASET queries like this shouldn’t be used too frequently) so by storing it in a table on the CMS, you can get all the info you need without having to hammer all your instances.
So let’s query the data:
;WITH ClusterResources as (Select DISTINCT
cluster_name,
[Listener],
Timestamp
FROM MNGDB.Reporting.AG_ClusterInfo
WHERE Timestamp = CAST(Getdate() as date))
Select
C.cluster_name,
C.[Listener],
CIA.ServerName as Primary_Server,
CIA.[Instance] as Primary_Instance,
CIA.[ProductVersion] as Primary_Version,
CIP.ServerName as Secondary_Server,
CIP.[Instance] as Secondary_Instance,
CIP.[ProductVersion] as Secondary_Version,
(Select count(*) FROM ClusterResources d where d.cluster_name=C.cluster_name) as Aantal_AGroups_InCluster,
(Select count(*) FROM MNGDB.Reporting.AG_ClusterInfo e where e.[role_desc]=‘PRIMARY’ AND CIA.[connected_state_desc]=‘CONNECTED’ and e.cluster_name=C.cluster_name
AND CAST(e.ServerName as varchar(128))=CAST(CIA.ServerName as varchar(128))) as Aantal_Active_AG_On_Active_Node,
(Select count(*) FROM MNGDB.Reporting.AG_ClusterInfo e where e.[role_desc]!=‘PRIMARY’ AND CIA.[connected_state_desc]=‘CONNECTED’ and e.cluster_name=C.cluster_name
AND CAST(e.ServerName as varchar(128))=CAST(CIP.ServerName as varchar(128))) as Aantal_Passive_AG_On_Passive_Node,
CASE