Checking availability group status using a Central Management Server

Checking availability group status using a Central Management Server

Door: Karel Coenye

Checking availability group status using a CMS

Today I want to show you another powerful way to use your Central Management Server. The CMS can also be used to execute OPENROWSET queries over your complete SQL Server Farm. Now I can hear you all say: “Why would you want to do that?” and “OPENROWSET, you know that you shouldn’t implement that for queries that have to run frequently!”. And to be honest, in most of the cases you’d be right. But it can come in extremely useful in some situations.

Scenario

Imagine a hosted environment with very strict and rather rigid release rules. Because of the importance and stability requirements of this environment there every component of the environment has to follow the complete DTAP (Development/Test/Acceptance/Production) workflow. Off course this includes the Production SQL Servers and their instances.

Enter the CMS, because a CMS is rarely considered crucial for operations we have introduced the CMS as a DBA Tool, granted a VERY powerful one, but still “just” a tool. By doing this we made the CMS the responsibility of the DBA team so that we can make changes to the CMS without having to go via QA/Testing/The whole change process and version control. (We actually never told our QA officer we actually CAN do with a CMS though).

Use Case

As we all know, for licensing compliance it can be very important to know on what server our active nodes of the deployed availability groups are located. This becomes even more important if you have several large servers, each serving multiple instances.

For license (read cost) reasons this environment runs in a complete active-passive setup. So it is very important that at any given time all active listeners are running on the preferred primary. This setting is easily configured in the Windows Cluster manager, but conform the SQL Server best practices you don’t want automatic failback. So if an instance is discovered to run on the “wrong” server we want to know this. But how can we be check and report this at any given point in time without having to check multiple (100+) availability groups?

In this environment SQL Instance patching happens during the nightly maintenance window. A predefined list of instances was patched by the hosting using their own procedure for rolling updates. But unfortunately any detailed report mentioning where they started and how they implemented this is missing. So we’d love to know the listeners situation as soon as possible.

Luckily (Cough Cough) our CMS or Central Management Server comes to the rescue, so we can use a centralized query to get this information from our servers in an efficient manner and even keep historical information.

OK let’s go through the steps needed to build this pull report getting this crucial information only using our CMS.

Preparing the CMS

Because in this case our production instances (and the QA officer) are rather allergic to undocumented and untested changes we don’t want to implement anything (let alone a linked server) on the Instance side. This means that we have to make sure that our CMS can execute Ad Hoc distributed queries. (This will be the only one instance change needed for the whole SQL Server Farm)

sp_configure ‘show advanced options’, 1;

RECONFIGURE;

sp_configure ‘Ad Hoc Distributed Queries’, 1;

RECONFIGURE;

GO

Get our servers from the CMS

Once connected to our trusty CMS, it’s easy to get the complete server list using the following query:

select s.server_name as name from msdb.dbo.sysmanagement_shared_registered_servers s

You could also include the groups, and add an extra filter if you want to, but for this blogpost I’ll keep this part rather straightforward.

Get meaningful info about our availability groups

Then we write the query that will fetch all the needed information (and some other interesting titbits) from our hadr groups and their servers.

select

hadrc.cluster_name,

SERVERPROPERTY(‘MachineName’) AS [ServerName],

SERVERPROPERTY(‘InstanceName’) AS [Instance],

SERVERPROPERTY(‘Edition’) AS [Edition],

SERVERPROPERTY(‘ProductVersion’) AS [ProductVersion],

Left(@@Version, Charindex(‘-‘, @@version) 2) As VersionName,

lip.state_desc,

ip_address,

AGC.name as AG_Name,

isnull(l.dns_name, @@SERVERNAME )+‘:’+(CAST(port as varchar(8))) as Listener,

case WHEN is_conformant is null then 0 ELSE is_conformant END as is_conformant,

CASE

WHEN role=1 THEN replica_server_name

ELSE (select replica_server_name from sys.dm_hadr_availability_replica_cluster_states RCS

INNER JOIN sys.dm_hadr_availability_replica_states AS ARS ON ARS.replica_id = RCS.replica_id where role=1)

END as active_node,

CASE

WHEN role=2 THEN replica_server_name

ELSE (select top 1 replica_server_name from sys.dm_hadr_availability_replica_cluster_states RCS

INNER JOIN sys.dm_hadr_availability_replica_states AS ARS ON ARS.replica_id = RCS.replica_id where role=2)

END as passive_node,

hadrc.quorum_state_desc,

hadrc.quorum_type_desc,

CAST(CASE AGC.failure_condition_level

WHEN 1 THEN ‘SQL Server service is down/SQL Server AlwaysOn Lease

Timeout’

WHEN 2 THEN ‘Instance of SQL Server does not connect to cluster/availability replica is in failed state’

WHEN 3 THEN ‘Critical SQL Server internal errors / Default’

WHEN 4 THEN ‘Moderate SQL Server internal errors’

WHEN 5 THEN ‘Any qualified failure conditions/Exhaustion of SQL Engine worker-threads/Detection of an unsolvable’

END as varchar(110)) as AutomaticFailoverCondition,

RCS.join_state_desc,

ARS.role_desc,

ARS.connected_state_desc,

ARS.operational_state_desc,

ARS.synchronization_health_desc,

CAST(Getdate() as date) as [Timestamp],

CHECKSUM(CAST(SERVERPROPERTY(‘MachineName’) as varchar(128))+ AGC.name +CONVERT(varchar(10),getdate(),102))as LookupKey

FROM sys.availability_groups_cluster AS AGC

INNER JOIN sys.dm_hadr_availability_replica_cluster_states AS RCS ON RCS.group_id = AGC.group_id

INNER JOIN sys.dm_hadr_availability_replica_states AS ARS ON ARS.replica_id = RCS.replica_id

LEFT OUTER JOIN sys.availability_group_listeners l ON l.group_id = ARS.group_id

LEFT OUTER join sys.availability_group_listener_ip_addresses lip on lip.listener_id=l.listener_id

outer apply (select * from sys.dm_hadr_cluster) as hadrc

WHERE replica_server_name = SERVERPROPERTY(‘ServerName’)

Creating the Report Store

Ok now we have a good query, so let’s create the table that will store this information, so that we have a solid base to create our reporting/reports.

DECLARE @CMD nvarchar(4000)

DECLARE @CMD2 NVARCHAR(4000)

DECLARE @ServerName sysname

SET NOCOUNT ON;

SET @CMD=”’select

hadrc.cluster_name,

SERVERPROPERTY(””MachineName””) AS [ServerName],

SERVERPROPERTY(””InstanceName””) AS [Instance],

SERVERPROPERTY(””Edition””) AS [Edition],

SERVERPROPERTY(””ProductVersion””) AS [ProductVersion],

Left(@@Version, Charindex(””-””, @@version) – 2) As VersionName,

lip.state_desc,

ip_address,

AGC.name as AG_Name,

isnull(l.dns_name, @@SERVERNAME )+””:””+(CAST(port as varchar(8))) as Listener,

case WHEN is_conformant is null then 0 ELSE is_conformant END as is_conformant,

CASE

WHEN role=1 THEN replica_server_name

ELSE (select replica_server_name from sys.dm_hadr_availability_replica_cluster_states RCS

INNER JOIN sys.dm_hadr_availability_replica_states AS ARS ON ARS.replica_id = RCS.replica_id where role=1)

END as active_node,

CASE

WHEN role=2 THEN replica_server_name

ELSE (select top 1 replica_server_name from sys.dm_hadr_availability_replica_cluster_states RCS

INNER JOIN sys.dm_hadr_availability_replica_states AS ARS ON ARS.replica_id = RCS.replica_id where role=2)

END as passive_node,

hadrc.quorum_state_desc,

hadrc.quorum_type_desc,

CAST(CASE AGC.failure_condition_level

WHEN 1 THEN ””SQL Server service is down/SQL Server AlwaysOn Lease Timeout””

WHEN 2 THEN ””Instance of SQL Server does not connect to cluster/availability replica is in failed state””

WHEN 3 THEN ””Critical SQL Server internal errors / Default””

WHEN 4 THEN ””Moderate SQL Server internal errors””

WHEN 5 THEN ””Any qualified failure conditions/Exhaustion of SQL Engine worker-threads/Detection of an unsolvable””

END as varchar(110)) as AutomaticFailoverCondition,

RCS.join_state_desc,

ARS.role_desc,

ARS.connected_state_desc,

ARS.operational_state_desc,

ARS.synchronization_health_desc,

CAST(Getdate() as date) as [Timestamp],

CHECKSUM(CAST(SERVERPROPERTY(””ServerName””) as varchar(128))+AGC.name+CONVERT(varchar(10),getdate(),102))as LookupKey

FROM sys.availability_groups_cluster AS AGC

INNER JOIN sys.dm_hadr_availability_replica_cluster_states AS RCS ON RCS.group_id = AGC.group_id

INNER JOIN sys.dm_hadr_availability_replica_states AS ARS ON ARS.replica_id = RCS.replica_id

LEFT OUTER JOIN sys.availability_group_listeners l ON l.group_id = ARS.group_id

LEFT OUTER join sys.availability_group_listener_ip_addresses lip on lip.listener_id=l.listener_id

outer apply (select * from sys.dm_hadr_cluster) as hadrc

WHERE replica_server_name = SERVERPROPERTY(””ServerName””)”’

DECLARE ServerList Cursor FOR

select s.server_name as name from msdb.dbo.sysmanagement_shared_registered_servers s

OPEN ServerList

FETCH NEXT FROM ServerList into @ServerName

WHILE @@FETCH_STATUS = 0

BEGIN

SET @CMD2 = ‘Delete from MNGDB.Reporting.AG_ClusterInfo where ServerName=”’+@ServerName+”’ and Timestamp = CAST(Getdate() as date)’

exec sp_executesql @CMD2

SET @CMD2=‘INSERT INTO MNGDB.Reporting.AG_ClusterInfo SELECT a.* FROM OPENROWSET(”SQLNCLI”, ”Server=’+@ServerName+‘;Trusted_Connection=yes;”,’+@CMD+‘) AS a;’

exec sp_executesql @CMD2

FETCH NEXT FROM ServerList into @ServerName

END

CLOSE ServerList

DEALLOCATE ServerLis

Ok hit F5

Reporting the Status

After the previous run, you’ll have a table filled with all the information we need to be able to create multiple interesting reports using the same dataset. Please note that you don’t want to run this query multiple times (OPENDATASET queries like this shouldn’t be used too frequently) so by storing it in a table on the CMS, you can get all the info you need without having to hammer all your instances.

So let’s query the data:

;WITH ClusterResources as (Select DISTINCT

cluster_name,

[Listener],

Timestamp

FROM MNGDB.Reporting.AG_ClusterInfo

WHERE Timestamp = CAST(Getdate() as date))

Select

C.cluster_name,

C.[Listener],

CIA.ServerName as Primary_Server,

CIA.[Instance] as Primary_Instance,

CIA.[ProductVersion] as Primary_Version,

CIP.ServerName as Secondary_Server,

CIP.[Instance] as Secondary_Instance,

CIP.[ProductVersion] as Secondary_Version,

(Select count(*) FROM ClusterResources d where d.cluster_name=C.cluster_name) as Aantal_AGroups_InCluster,

(Select count(*) FROM MNGDB.Reporting.AG_ClusterInfo e where e.[role_desc]=‘PRIMARY’ AND CIA.[connected_state_desc]=‘CONNECTED’ and e.cluster_name=C.cluster_name

AND CAST(e.ServerName as varchar(128))=CAST(CIA.ServerName as varchar(128))) as Aantal_Active_AG_On_Active_Node,

(Select count(*) FROM MNGDB.Reporting.AG_ClusterInfo e where e.[role_desc]!=‘PRIMARY’ AND CIA.[connected_state_desc]=‘CONNECTED’ and e.cluster_name=C.cluster_name

AND CAST(e.ServerName as varchar(128))=CAST(CIP.ServerName as varchar(128))) as Aantal_Passive_AG_On_Passive_Node,

CASE