A regular check of system health ensures your SQL Servers keep running smoothly.
It is a good idea to regularly perform a SQL Server health check on all servers in your estate. Any issues fixed after the last check may have reappeared. If you are lucky enough to have one of the good monitoring tools you will catch some of the issues described here, but not all. A regular health check at different times of the day each time will still be necessary.
Source Of This List
This list is taken from the System Health report generated by SQL CoPilot. If you have any suggestions for additional checks please let us know and we'll add them here and to SQL CoPilot itself.
Some of the issues described here can be avoided by ensuring that SQL Server Best Practices are adhered to, while others are less easily avoided and have to be resolved when they are discovered.
What To Check For
If you run the health check using SQL CoPilot, it checks for all issues listed here. If there is something in the generated report you don't understand, refer back to this page for an explanation.
Blocking is normal. SQL Server takes locks on all types of objects to ensure that data/schema/code does not change while it is being accessed.
Long term blocking should not be normal, especially in an OLTP system. Blocking by a long running request may be an indication that a report is being run that is affecting other users, or that an application has been poorly written and is holding transactions open for too long.
RESOURCE_SEMAPHORE and RESOURCE_SEMAPHORE_QUERY_COMPILE waits are both signs of severe memory shortage and the cause needs to be looked into urgently.
A RESOURCE_SEMAPHORE wait occurs when a query tries to obtain a memory grant bigger than the memory available. This is normally due to the memory grant being very large. It could be that the query genuinely needs a grant this big, but often the optimizer has vastly overestimated the amount needed, due perhaps to an overcomplicated query. If the root cause cannot be identified, recent versions of SQL Server have a MAX_GRANT_PERCENT query hint.
RESOURCE_SEMAPHORE_QUERY_COMPILE waits are also a sign of memory shortage and of complicated queries. There is a series of "gateways", small, medium and large, that are used to throttle compilations requiring large amounts of memory. This is designed to allow smaller plans to compile while limiting the number needing large amounts of memory.
Try to limit the complexity of these queries, and to reduce the number of compilations required through the use of stored procedures and parameterization.
After the memory allocated to the SQL Server buffer pool (limited by "max server memory"), the rest of the SQL Server address space, memory used by the operating system, and memory used by other applications on the server, there should be something left.
Ideally this needs to be enough to allow for fluctuations in OS and application usage.
The SQL Server health check highlights servers with less than 100MB or 2% of server memory, whichever is greater.
Try to keep some disk space free in case it is needed unexpectedly. Even with proper capacity planning and ensuring data files have free space internally, you can still be taken by surprise by a data load, an index rebuild, an update that causes loads of page splits, etc. Having sufficient space to allow files to grow (and enabling the growth) is a sensible precaution.
The SQL Server health check warns if any disk has less than 5% free space.
The thread pool is the pool of threads available for workers. It is allocated when SQL Server starts and the exact number of available threads depends on the number of CPU cores and whether the server is 32-bit or 64-bit.
If a worker thread is not available when needed (so all threads are being used) you will see the THREADPOOL wait type.
You can increase the size of the thread pool with the "max worker threads" option, but please don't make this your automatic response to the THREADPOOL wait. Often it is an application problem, keeping too many uncommitted transactions open, or erratically creating an unnecessary number of active connections. Only change the configuration option if absolutely necessary.
If a database is not online, it is not restoring (being restored or the target in a mirroring setup) and it is not intentional, then something is clearly wrong and has to be investigated.
A table in msdb called suspect_pages records every time a page has a bad checksum or torn page, or there is an 823 or 824 error.
The health check reports all rows with an event_type less than 4, meaning no action has yet been taken.
Restoring the database, restoring selected pages, or running a CHECKDB repair will automatically update the event_type.
Any issue with AlwaysOn, including problems with the underlying cluster, are reflected in the AlwaysOn health. Any state other than HEALTHY is highlighted by the health check.
If an issue occurs with database mirroring, such as becoming disconnected, this should be resolved as soon as possible.
The SQL Agent service is nearly always needed and unless you are using an edition of SQL Server that doesn't have it (such as Express edition or SQL Azure) you should ensure the service is set to start automatically.
Errors with a severity of 21 or higher should be investigated immediately as there may be an issue that could affect stability of the instance or availability of the databases.
A non-yielding scheduler can have many causes. Ensure you are running the latest service pack and cumulative update before investigating further.
As with a non-yielding scheduler, this can have many causes. Ensure you are running the latest service pack and cumulative update and then contact Microsoft if the stack dumps persist.
This has been put close in this list to non-yielding schedulers and stack dump for a good reason. All 3 are serious issues that need fixing, but often have already been fixed in a cumulative update or service pack.
SQL Server loves memory, and setting "max server memory" as high as possible is best practice, but don't set it so high that the OS is starved of memory and starts paging out some of the working set. That is not the only possible cause. Others are a serious lack of memory on the server in the first place, other applications on the server that need lots of memory.
A good first response to this is to reduce "max server memory" a bit, but ensuring the server is dedicated to SQL Server and getting as much RAM as possible should also be considered.
Sometimes deadlocks are a fact of life, a sign of a poorly written application, or perhaps a very complicated application that just can't avoid them. It could also be something totally avoidable. Either way, any change in the number of deadlocks, or a SQL Server suddenly getting them where there were none before needs to be investigated.
If a session is idle but has an open transaction, it may be a sign of a bug in an application, or it could be that an application has crashed, leaving the session without any further activity. Either way this can eventually cause blocking and impact other sessions and users.
The short term solution to idle sessions holding transactions is usually to check that it is safe to terminate the session (by speaking to users/developers/application experts) and then to issue the KILL command.
The long term solution is fix the application to prevent it from reoccurring.
There are a number of possible reasons why a request may be rolling back:
- You, or another privileged user, may have issued a KILL command to terminate a session;
- An update issued by an application may have timed out;
- An update may have caused data or log file to fill, or have filled tempdb; This may have implications to other sessions also needing the space, and they may also rollback;
- An application may have been written in such a way that a rollback is common - in error handling code for example;
Sometimes it can take a long time for a rollback to complete, normally if there were a lot of updates, inserts and deletes up to the point of failure. There may be significant blocking during the rollback bringing all users of the affected database(s) to a halt.
It may be tempting to try to kill the session. That won't work - the KILL command will be ignored.
It may be tempting to restart the SQL service - DON'T DO THAT! SQL Server will put the database into recovery after the restart and the rollback will take just as long. The changes have already been made to the database pages and some will have been written to disk by a checkpoint. The log file contains a record of all of the changes, and the only way SQL Server can ensure all data changes are undone is by stepping through the log records and undoing the changes one by one.
Page life expectancy is a good indication of memory contention. It records the average number of seconds you would expect a page to remain in memory before it is flushed to make way for a new one.
In the early days of SQL Server, Microsoft recommended keeping page life expectancy above 300 seconds, but that advice is way out of date. With the large amounts of memory available to modern 64-bit servers, allowing it to get as low as 300 seconds is leaving it far too late.
I tend to use a value of [GB available to the buffer pool] * 100. So an instance with "max server memory" set to 8GB would be considered low on memory when page life expectancy drops below 800 seconds.
Some systems run naturally lower than this, so the best advice is to take a baseline and use monitoring software to alert when it drops below this baseline.
SQL Server keeps track of memory and responds to requests from the OS to trim its usage when necessary. The DMV sys.dm_os_memory_info has a column that indicates whether the current memory state is low, steady or high. Ideally it should be high most of the time.
This section of the health check looks at average milliseconds per data read and per log write. High values for one or both of these can adversely affect performance and can have a multitude of causes. For a full discussion of reasons behind slow I/O, see SQL Server I/O Performance.
This can be due to queries or updates overloading the disk subsystem, external processes, contention from multiple servers on a shared disk array or SAN, problems with the storage, active virus scanning...
There are a huge number of possible reasons for slow I/O. See SQL Server I/O Performance for in-depth coverage of the topic.
Occasional peaks in total server CPU usage are expected and evidence that the CPU resource is being used optimally, but extended periods of time during which it remains high may be evidence of a problem. For example there may be a query performing a long-running CPU-intensive query that impacts other users. Or another application on the server could be hogging the CPU.
Too many compilations is often a sign of too much adhoc SQL.
Too many recompilations may be caused by DDL inside a stored procedure, or a large number of inserts into a temporary table causing statistics to be updated.
(Re)compilations are CPU intensive and should be looked into if CPU is a bottleneck on the server. One way to limit this is to use stored procedures or parameterized queries and to limit use of the RECOMPILE hint.
When a worker has to wait for a resource it is placed into a queue (and the relevant wait type is recorded). When the wait has completed the worker is placed into the runnable queue for the same scheduler it was on previously. If the scheduler is not busy and the runnable queue is empty it will continue being processed. If the scheduler is busy, the worker will wait in the runnable queue until it can be processed. When this happens, the wait time is registered as a signal wait.
A high percentage of signal waits, compared with total wait time in sys.dm_os_wait_stats is a sign of CPU contention.
CMEMTHREAD waits occur during allocation or deallocation of memory for thread-safe memory objects. If you have a lot these occurring it may mean contention on a particular memory object. See this post on SQLskills for more information.
I/O is performed asynchronously. A task issues an I/O request and then is placed in a queue while waiting for the I/O to complete. This allows the scheduler to be used in the meantime by a different task.
If this shows in the SQL Server health check, it means that at the time of the check there were a high number of pending I/Os in relation to the number of schedulers - it doesn't necessarily indicate a problem. Use the Real Time Activity dashboard to check if I/O is consistently a bottleneck at the moment and the SQLOS -> Wait Types screen to check if I/O waits are amongst the highest overall.
When a log file is created or extended, the allocation is not performed in one single chunk, but several. These chunks are called virtual log files. Every time a log file is extended, more virtual log files are added. There is no guarantee these will be placed sequentially on disk, yet a log file is written to sequentially and randomly placed VLFs affect performance.
See Transaction Log VLFs – too many or too few? for a great explanation of this and how you can optimize the size and number of VLFs.
SQL CoPilot highlights databases in which the number of virtual log files exceeds the "Maximum Virtual Log File Count" threshold (default 500 - go to "Configuration -> Manage Thresholds" in SQL CoPilot to change this). This may be too low for some very large log files, so use your judgement to decide if this can be ignored.
Note: checking the number of VLFs requires membership of the sysadmin server role. This part of the health check will not be performed if you are not a sysadmin member.
Indexes may have been deliberately disabled, in which case you should schedule them to be dropped if not needed.
But sometimes indexes are disabled temporarily by a regular process such as an SSIS package or other ETL process and re-enabled afterwards. If the process fails midway through it could leave indexes that are beneficial to daily queries in a disabled state.
Unlike other databases, which truncate the transaction log on checkpoint (when SIMPLE) or after a log backup, tempdb behaves differently.
The transaction log is not truncated until it reaches 70% full. For this reason if you see that the log is more than 70% full it means that something is preventing the truncation - perhaps an open transaction.
Investigate the cause to prevent the log from filling completely, as that can cause application errors and may even make the instance unresponsive.
Errors with a severity of 17 - 20 should be investigated but are not normally urgent. Some DBAs consider severity 20 to be high rather than medium, and you may want to look into these more quickly if you think it is necessary.
A failing job may not be a problem, but equally it could be part of a critical application process, or one of the DBA maintenance jobs for backup, CHECKDB or index rebuild.
If a job is failing reguarly it needs to be fixed or disabled.
If you run the SQL Server health check out of hours this is perfectly normal and expected, as that is when DBCC CHECKDB should be run.
If you run the health check during the working day, you do not want to see a CHECKDB running. It is resource intensive, using memory, I/O and CPU. The DBCC CHECKDB can be safely stopped and rescheduled for a quieter time.
How long is too long? Only you know that. It is not uncommon for an application to run reports lasting hours in some cases and you may have no choice but to allow them to run. But if you see a long running request where it is not expected you need to investigate.
It may be caused by an inefficient plan, or a rogue user running some dreadful SQL performing millions of logical reads.
Get used to the applications your databases support, and you'll know if this is an issue or not.
A failed login is usually harmless. It may be a developer accidentally trying to connect to a production server, or someone with a SQL login mistyping their password.
But, it could be an AD issue such a missing SPN or worse, an unauthorized individual attempting to gain access.
When a job starts to take longer then normal to run it is worth looking into the cause. It could be a symptom of an underlying infrastructure issue, such as contention on a shared storage array.
The SQL Server health check in SQL CoPilot takes the average of the last 10 successful runs for each job and highlights it if the most recent run (or the current duration if still running) is 20% or more above that level.
Earlier in this list I mentioned the suspect_pages table in msdb, where event_type < 4, meaning that the page has not been restored, repaired or deallocated. But what happens to rows when a page has been fixed and event_type ≥ 4? The answer is nothing - it sits there clogging up the table. There is a limit of 1000 rows in this table so this could result in no more suspect pages being recorded.
It is a DBA's responsibility to periodically clear out the old rows. This is best done manually - you wouldn't want an automated job to delete them before you know they exist, as some page repairs can be performed automatically by SQL Server.
Your best option is to run the health check in SQL CoPilot regularly, and delete the rows where event_type ≥ 4 yourself.
The Index Tuning Wizard and Database Tuning Advisor both create hypothetical indexes while trying out different plans and doing their thing. These are deleted at the end of the process, but if either application crashes they can leave these indexes behind. Hypothetical indexes don't actually cause a problem as they can't be used and they are not updated, but they can make it difficult to see the wood for the trees! There is no harm in deleting them. Use the Hypothetical Indexes screen in SQL CoPilot to identify them and script the drop statements.
Have we missed anything?
If you think there is anything that could be added to this SQL Server health check, please let us know.
How To Generate a SQL Server Health Check Report
In the "All Instances" tab click "System Health Report" then click "Generate Report"
A list of all instances is displayed. You can choose to scan all of these, or by using the filters at the top you can narrow down the instances to check by name, version, edition or tag. You can even restrict the report to just one instance if you wish by typing its name into the box above the instance name column.
Now click the Generate button and sit back while the instances are checked and the report is generated. You can stop it at any time by clicking the Cancel button.
The finished report will be available every time you visit this screen, even if you restart the application, until it is generated again.
Click on a row in the report to be taken to the relevant screen for the affected instance.
If you don't have SQL CoPilot, download the free trial and then return here to generate your report.