There are lots of reasons to take a step back and really consider your on-premises SharePoint farm: frustrated users, poor performance, random errors and outages, out-of-control content, and so on. Often, though, we find ourselves overwhelmed by the sheer scope and size of SharePoint, and the idea of doing a health check just seems daunting. There are so many moving parts and so many different ways to achieve the same thing, and for every way to configure SharePoint there are probably a dozen opinions on whether this way is better than that way, so it’s difficult to sort through all the “guidance.” Most people have only their own farm(s) as a point of reference, so they don’t have the benefit of personal experience working with a variety of farms serving different purposes (and often exposing very different issues).
In this post, I’m going to share just a few things that you should be looking at when performing a health check on your farm. This is by no means a comprehensive list, and these aren’t necessarily the “top” items you should consider, as that may differ from one farm to another. It’s just a sampling (in no particular order) of some of the more common things we at Aptillon have found when we perform health checks for our clients:
- Incorrect or Insufficient Topology Configurations
- Using the Farm Account Interactively
- Inappropriate Service Account Usage
- Insufficient Drive Space or Non-Optimal Drive Configurations
- Unchecked Errors in ULS and Windows Event Logs
- Large Lists Throttling Issues
For more information about Aptillon’s SharePoint Health Checks, visit our SharePoint Health Check page.
#1 – Incorrect or Insufficient Topology Configurations
We often hear our clients tell us that their SharePoint farm is mission critical, and that if it’s down their company is losing money or is otherwise in some sort of “trouble,” yet more often than not their environment does not reflect this concern. We’ve seen many “mission critical” farms on a single server, or farms with multiple servers for failover purposes all running on a single virtual machine (VM) host, so when that host goes down the whole farm goes down. We’ve also seen farms with so much hardware that they could handle 100 times the expected user load and others that can only handle a fraction of it. When looking at your servers you need to consider the following:
- How are you going to do patching (SharePoint, SQL, OS)? Is a complete outage while applying the patch acceptable, or do you need to remain at least mostly functional? This includes not just your SharePoint servers but also your SQL Server servers and, if virtualized, your VM host.
- How many users are accessing the system, and what are their usage patterns? Do you find that your system runs particularly slow first thing in the morning when everyone is just signing on? Do you notice random slowdowns throughout the day?
- How much content are you storing within the system and at what rate is it growing? Do you have sufficient dedicated search servers to properly crawl the content without impairing the system?
- How do you handle outages within your data center? Do you have your servers split across multiple VM hosts? Are those hosts connected to the same power supply? The same switch? Where are your single points of failure and what are your risk thresholds should a failure occur?
These are just some of the things that you need to consider when analyzing your server configuration. Another big factor is how you plan to allocate your SharePoint resources, namely your service applications and web applications. For instance, if you deploy Workflow Manager (WFM) and you want failover for it, you need to have three servers: not two, not four, three. We’ve seen many environments that had two servers set up, but that doesn’t cut it, as you must have a quorum to have failover. And then there’s the distributed cache service (DCS). A lot of the information out there states that three servers are necessary for failover, but in reality only two are required: the AppFabric service that controls the DCS is managed by SharePoint, so a quorum isn’t necessary. (Microsoft themselves got confused by this with the initial release of SharePoint 2016, which enforced a three-server setup for DCS when using MinRole until an update was released so that only two servers were necessary.)
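The reason two servers don’t give you failover when a quorum is required comes down to simple majority arithmetic. The following is a minimal sketch of generic majority-quorum math, not anything taken from Workflow Manager itself; the function name `tolerated_failures` is made up for this illustration.

```python
def tolerated_failures(nodes: int) -> int:
    """Number of nodes a majority-quorum cluster can lose and still keep quorum."""
    if nodes < 1:
        raise ValueError("a cluster needs at least one node")
    quorum = nodes // 2 + 1      # strict majority of the cluster
    return nodes - quorum


for n in (1, 2, 3, 4):
    print(f"{n} node(s): quorum={n // 2 + 1}, can lose {tolerated_failures(n)}")
```

Under this arithmetic, two nodes tolerate no failures at all, three tolerate one, and a fourth node adds cost without letting you lose a second server. A service that doesn’t depend on a quorum, as described above for the DCS, isn’t bound by this math.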
Each service within SharePoint has its own topology considerations: some are better suited to back-end (or application) servers, some to front-end servers, and some to dedicated servers, and it can be very confusing trying to dissect all the various rules. Microsoft has made it a tad bit easier with SharePoint 2016 and the new MinRole feature, but even with that there’s still some ambiguity. We will typically point our clients to a TechNet diagram (https://www.microsoft.com/en-us/download/details.aspx?id=37000) which contains some high-level guidance around the various server roles, but this is just general guidance and is meant as a rough starting place (and often leaves folks with more questions than answers).
When looking at the search service, which has its own topology considerations, we often see farms with topologies that are completely invalid, or at least less than optimal, and that might not actually meet the needs of the business. There are numerous components within the search service and each one has its own configuration considerations. In another TechNet diagram (https://www.microsoft.com/en-us/download/details.aspx?id=30383), Microsoft provides some guidance around the search topology, but again, the examples presented are very generalized and may well not be appropriate for your farm.
You should take the guidance that is out there as just that, general guidance, and come up with a topology that meets your specific needs and budget (you may have to sacrifice business requirements to stay within your budget). Having someone who has built and worked with lots of farms and has some experience around the various characteristics of the different services can be extremely helpful when evaluating the topology of your own farm.
#2 – Using the Farm Account Interactively
The SharePoint Farm account (often with a login of spfarm or something similar) is the primary account used by numerous services within SharePoint, including Central Admin and the SharePoint Timer Service among others. This account is meant to be a service account, and best practice dictates that you should never log into a server interactively with it. However, in probably 90% of the health checks that we’ve done, we’ve seen clients regularly using this account to log into the server to perform routine maintenance or even just random everyday activities (most use this account for all interactions with the server). This is bad for several reasons:
- If a user accidentally locks the account because they mistyped the password while logging in, they could temporarily render the farm either completely or at least partially unusable until the account is unlocked.
- Changing the SharePoint Farm account password is not a trivial task; it should be practiced in a staging environment and planned for off-hours. If you find yourself in a situation where you need to change the password quickly because a user with knowledge of the password has left the company, then you might be in for more of a headache than you bargained for. This account should be considered a highly privileged account, and very few people should have any knowledge of what the password is; by allowing administrators to log in with it, you will almost certainly be required to change it more frequently than a typical service account.
- Logging in interactively with an account creates registry keys associated with the user profile that is created within Windows. When you log out, those registry keys are removed. This can cause issues with the security token service, and can result in failures of certain features (we’ve seen this most frequently with errors associated with Office Web Apps). The workaround is simple enough (configure the SecurityTokenServiceApplicationPool application pool to load the user profile, and recycle it nightly), but the real fix is to not log in interactively in the first place.
- By allowing administrators to perform tasks using a “shared” account like this, you lose the audit trail associated with any actions that the user may have taken. There are new features with SharePoint 2016 that make tracking administrative changes made through PowerShell better, but if you have multiple users using a shared account, then you are effectively defeating the benefit of features like this.
We always recommend creating named administrative accounts for administrators to use when managing SharePoint. This way, when they are accessing SharePoint in a standard user capacity, they are doing so with less privileged accounts, and in order to perform a task that requires higher rights, they would then switch to their admin account. This creates a better audit trail for their activities and allows you to apply different group policies to those administrative accounts. Keeping your SharePoint environment secure and maintaining proper audit details can be critical for most farms – if you’re not sure how to do this properly, make sure you work with a SharePoint professional who can help.
#3 – Inappropriate Service Account Usage
There are two types of service accounts within SharePoint: Managed Service Accounts and “regular” Service Accounts. A Managed Service Account is “known” by SharePoint in that SharePoint stores the password for the account and, if configured to do so, is able to actually change the password automatically. Many services within SharePoint are able to work directly with the Managed Service Accounts in order to get the credentials they need to perform their various tasks. However, not all services know how to work with Managed Service Accounts, and those services use what I call “regular” service accounts (basically just an account that isn’t a Managed Service Account). The problem we often see is that accounts that should be “regular” Service Accounts are registered as Managed Service Accounts, which can cause some problems:
- If an administrator incorrectly chooses to configure automatic password management for this account, then it will likely break the services that are using the account and are unfamiliar with the Managed Service Account feature. This can often result in the account becoming locked, which will make any service that uses it unavailable. The most common one we see this done with is the search crawl account (of course another issue here is that people often use the search service account for the crawl account, which has its own issues).
- Even if you don’t use the automatic password change feature, this results in just one more place that you may have to change the password (and can be confusing to novice administrators who may think it only needs to be changed in this one location).
- Managed accounts can be easily hacked with just a little bit of PowerShell. In a least-privilege environment where you are granting users shell admin rights, they could potentially discover the password for the service account. This may not seem like a huge deal, as those users will effectively have the credentials for the farm account anyway, but there are scenarios where it could be problematic because an account that shouldn’t be a Managed Service Account may actually have access to more sensitive information, such as a back-end data warehouse with sensitive financial information.
There are numerous security issues that can arise with SharePoint, and what accounts you configure and how you use them has a significant impact on the overall stability, reliability, and security of the environment. Make sure you fully understand all the implications of the various accounts that are used by SharePoint and, if you’re not sure, then make sure you contact a SharePoint professional who can help.
#4 – Insufficient Drive Space or Non-Optimal Drive Configurations
When planning your SharePoint drive capacities, it’s important to take into account all the various components of SharePoint that can create data. Some of the data you need to account for are ULS logs, cache files, IIS files, and search related files such as the index and analytics processing.
The growth of ULS logs is relatively easy to control, because SharePoint provides options to set the maximum number of days to store the logs and the maximum size that they can occupy, as well as the amount of data that is logged. Despite the ease of configuration, we almost always find that administrators have not set these options and allow the logs to grow unchecked (often on the C drive, which inevitably results in a server failure when the drive runs out of space). We always recommend that you set a reasonable number of days and a maximum size for the logs, and put them on a secondary drive so you don’t compromise the C drive by allowing it to fill up. You should also change the default logging levels so that you’re capturing fewer events, and only bump up the logging level when you’re actively troubleshooting.
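A quick way to catch a drive that is filling up before it takes a server down is to check free space on the volumes that hold your logs and caches. The sketch below uses only the Python standard library; the function name `low_space_paths` and the 25% floor are assumptions chosen for illustration, not a documented SharePoint threshold.

```python
import shutil


def low_space_paths(paths, min_free_fraction=0.25):
    """Return (path, free_fraction) pairs for paths whose volume is below the floor."""
    flagged = []
    for path in paths:
        usage = shutil.disk_usage(path)          # named tuple: total, used, free (bytes)
        free_fraction = usage.free / usage.total
        if free_fraction < min_free_fraction:
            flagged.append((path, round(free_fraction, 3)))
    return flagged


if __name__ == "__main__":
    # "." is a placeholder; substitute the drives that hold your logs and caches.
    for path, free in low_space_paths(["."]):
        print(f"{path}: only {free:.1%} free")
```

Run on a schedule against each server, a check like this gives you a warning while there is still time to adjust log retention or move files.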
SharePoint stores several kinds of cache data depending on how the system is configured. Perhaps the most commonly known one is the blob cache which, when enabled, defaults to C:\BlobCache\14 – we recommend that you move this to a secondary drive and make sure that the maximum size is configured correctly. There are also some services, such as Excel Services (SharePoint 2013), that store cache information, and should be configured to store the files on a secondary drive as well.
When looking at IIS files, what we most often see forgotten are the IIS log files. These can grow very quickly, and if you don’t have a process for removing them, they can quickly fill up the drive they are located on. We recommend moving these files off of the C drive so that, if whatever process you hopefully put in place to clear them out (ideally one that archives them and doesn’t just delete them) fails, you will at least still have a functional system. We also recommend, when provisioning a new web application, that you specify a path on a secondary drive for it as well. If you didn’t do this initially, you can do it after the fact by unextending and re-extending the web application to IIS (you’ll have to redeploy solutions and reapply any manual IIS changes, but this is the supported way of making this change).
As for search, most people are aware of the index files and the need to put them on a secondary drive (though we see a lot who don’t), but what a lot of people don’t realize is that the Search Analytics Component requires 300 GB of additional space to be available (we’ve never seen it need that much, but that’s the documented requirement from Microsoft). The trick with this setting is that the path can’t be changed after installation: you must set it when installing SharePoint, and if you don’t, you’ll have to reinstall SharePoint on the server. The following article explains this further: https://blogs.msdn.microsoft.com/chandru/2013/08/14/analytics-component-disk-location-in-sp2013/.
Proper planning when initially provisioning your SharePoint farm is crucial to having a healthy farm in the long run. Most administrators working with SharePoint for the first time (and some so-called experts) are likely to have no clue about issues such as the Search Analytics Component location not being changeable after installation, so working with a SharePoint professional in the early stages of a SharePoint rollout can make a huge difference.
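A process that archives old IIS logs rather than just deleting them can be fairly small. The following is a rough sketch under a few assumptions: the logs are `*.log` files somewhere under one folder, a 30-day retention window is acceptable, and the function name `archive_old_logs` and both folder arguments are placeholders you would replace with your own.

```python
import time
import zipfile
from pathlib import Path


def archive_old_logs(log_dir, archive_dir, max_age_days=30, now=None):
    """Zip *.log files older than max_age_days into archive_dir, then remove them."""
    log_dir, archive_dir = Path(log_dir), Path(archive_dir)
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    archived = []
    # sorted() materializes the listing first, so deleting while looping is safe.
    for log_file in sorted(log_dir.rglob("*.log")):
        if log_file.stat().st_mtime >= cutoff:
            continue                              # still inside the retention window
        rel = log_file.relative_to(log_dir)       # keep per-site subfolders apart
        target = archive_dir / rel.parent / (rel.name + ".zip")
        target.parent.mkdir(parents=True, exist_ok=True)
        with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as zf:
            zf.write(log_file, arcname=rel.name)
        log_file.unlink()                         # delete only after the zip is written
        archived.append(target)
    return archived
```

The original is removed only once its zip has been written, so a failure partway through leaves the remaining logs in place rather than losing them.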
#5 – Unchecked Errors in ULS and Windows Event Logs
I’ve lost track of the number of environments that I’ve performed health checks on (or was just called in to for one task or another) and found that nobody was currently (or in some cases ever) monitoring the ULS logs or even the Windows Event Logs. Often both are filled with so many errors (many of them innocuous) that it is nearly impossible to discern the “real” errors that are occurring. SharePoint is great at logging lots of information, and a fair number of the events logged as errors can be safely ignored, but all these false positives make the job of finding the true errors very difficult, and they can also make it seem as though SharePoint is in a much less healthy state than it actually is. We always recommend looking at and addressing all errors that are logged, even if they are not actually causing user issues. In some cases they may be completely innocuous, but in others they could be a symptom of a larger problem, so ignoring them is never a good idea.
Administering SharePoint isn’t something you do just during the initial rollout. It’s an enterprise application that requires constant and diligent monitoring, just as any other enterprise-class application would. Many companies don’t have the in-house expertise to properly monitor and troubleshoot their environment over the long term. We recommend working with a SharePoint professional to help cover long-term monitoring of your farm. We’ve often had contracts with clients where we simply jump into their environment once a month to do a spot check on the ULS and Windows Event Logs, thereby providing a slightly more proactive approach to addressing errors, rather than waiting until a user complains.
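One way to make a noisy log manageable is to group the errors so that the handful of repeat offenders stands out from the one-off entries. The sketch below assumes ULS trace files are tab-delimited with the column order shown in `COLUMNS`, and that the level names in the default set are the ones you care about; both are assumptions to verify against your own logs, and `summarize_errors` is a name invented for this example.

```python
from collections import Counter

# Assumed column order for a tab-delimited ULS trace line; check your own files.
COLUMNS = ["timestamp", "process", "tid", "area", "category",
           "event_id", "level", "message", "correlation"]


def summarize_errors(lines, levels=frozenset({"Critical", "Error", "Unexpected"})):
    """Count entries at the given levels, grouped by (area, category, event_id)."""
    counts = Counter()
    for line in lines:
        fields = [field.strip() for field in line.rstrip("\n").split("\t")]
        if len(fields) < len(COLUMNS):
            continue                  # blank lines or fragments that don't parse
        entry = dict(zip(COLUMNS, fields))
        if entry["level"] in levels:
            counts[(entry["area"], entry["category"], entry["event_id"])] += 1
    return counts.most_common()       # most frequent groups first
```

Because the function only iterates over lines, you can pass it an open file handle as readily as a list. The most frequent groups are usually the best place to start: fixing or consciously dismissing them removes most of the noise that hides the real problems.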
#6 – Large Lists Throttling Issues
In SharePoint 2013, the large list threshold defaults to 5000 items, but we quite consistently see clients increase this value to a much higher one (sometimes extremely high, like over a million). This limit is here for a very good reason, and increasing it can have a significant performance impact on your farm, so it’s important to evaluate your farm to see what lists are approaching (or over) this limit and adjust those lists accordingly. Adjustments can sometimes be as simple as adding appropriate indexes to list columns or deleting/archiving data that is no longer needed. Additionally, as usage within the environment grows, it will be important to routinely monitor the environment to detect lists that are approaching the 5000-item limit.
Note that the large list threshold of 5000 items has nothing to do with the number of items returned and everything to do with the number of items that have to be evaluated in order to complete the query. You may have a view that returns just 1 item (or zero items), but if it is necessary to analyze more than 5000 items to render that view, then a table lock will be triggered and, if using the default recommended throttling value, the view will not render. Also, a single item in SQL may actually comprise multiple table rows depending on the number of list columns associated with the list, so you could hit the threshold even if you have fewer than 5000 items in the list.
When scanning your farm for large lists, it’s important to analyze individual views to see which ones evaluate more than the recommended 5000-item limit and are thus throttled. It’s not just the number of items analyzed, however: if a view returns more than 3000 items, contains more than 12 lookup columns, or is an aggregate view (a view that has result fields such as Totals), it can cause performance issues, and some such views will result in a throttled warning message to the viewer.
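The checks described above can be expressed as a small rule set. This is only a sketch: the thresholds are the ones quoted in this article, while the `ViewStats` structure and `view_warnings` function are hypothetical, and gathering the per-view numbers from your farm is a separate exercise not shown here.

```python
from dataclasses import dataclass
from typing import List

ITEM_SCAN_LIMIT = 5000        # items a query may evaluate before throttling
ITEMS_RETURNED_LIMIT = 3000   # items returned by a single view
LOOKUP_COLUMN_LIMIT = 12      # lookup columns in a single view


@dataclass
class ViewStats:
    """Hypothetical per-view numbers collected from your own inventory."""
    name: str
    items_scanned: int
    items_returned: int
    lookup_columns: int
    has_totals: bool = False


def view_warnings(view: ViewStats) -> List[str]:
    """Return a list of reasons this view may be throttled or slow."""
    warnings = []
    if view.items_scanned > ITEM_SCAN_LIMIT:
        warnings.append(f"evaluates {view.items_scanned} items (limit {ITEM_SCAN_LIMIT})")
    if view.items_returned > ITEMS_RETURNED_LIMIT:
        warnings.append(f"returns {view.items_returned} items (limit {ITEMS_RETURNED_LIMIT})")
    if view.lookup_columns > LOOKUP_COLUMN_LIMIT:
        warnings.append(f"uses {view.lookup_columns} lookup columns (limit {LOOKUP_COLUMN_LIMIT})")
    if view.has_totals:
        warnings.append("aggregate view (uses Totals)")
    return warnings
```

Note that a view returning a single item still gets flagged if it has to evaluate more than 5000 items to find it, which is exactly the case that surprises most people.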
Why you need a SharePoint Health Check
In this article, I’ve listed just six of the more common issues that we find when performing health checks for our clients, but there are so many more that I could be writing this article for days (most of our health check reports are generally between 30 and 50 pages of issues). When we perform health checks we leverage the collective experiences of those within Aptillon and try to apply the lessons learned through the hundreds of farms that we’ve worked on. The end result is that, if the client chooses to apply the recommended fixes, they will be in a much better position to see a successful SharePoint deployment. Of course, none of the kinds of things we look for with a system health check will solve other issues, such as lack of (or poor) governance and information architectures – this would be a whole separate project (and a different article).
If you find that you’re having SharePoint performance issues, random and annoying errors and system outages, or generally frustrated users, then there’s a good chance that you might need a SharePoint health check. Contact Aptillon if you’d like to see if and how we can help you get your environment back to a healthy state.