A client faced a file server issue that affected almost all of their employees and kept staff busy, and not only in the IT department, for almost two months.
For weeks we were in the dark, with no idea what could be causing the problem, and none of the actions we took were successful.
Starting in November 2016, the main file server (Windows Server 2008 R2), which hosted the user home drives and program data, suddenly stopped responding. Documents could not be saved, Microsoft Word froze, users were unable to log on, and so on.
After restarting the server everything worked again.
System and application logs on the server did not show anything suspicious.
The second time this happened was three days later. Again, nothing unusual was logged, neither on the server itself nor on the VMware host, the network components or the SAN storage. All other systems worked well.
From then on the issue occurred at least once a day. Because we could not link it to any indicator, such as heavy RAM or processor usage or the number of logged-on users, we had no way to reproduce the problem.
The file server was a virtual machine on a VMware ESX host. The data was stored on a SAN, and most users worked on thin clients with a Citrix desktop.
Troubleshooting steps involved all of the system components and therefore many different teams.
More than once we copied the data to new machines with a newer operating system, but the error recurred. Windows Server 2012 R2 at least has an SMB server event log, and there we noticed warning event 1020, which indicates that communication with the underlying storage took longer than expected (“File system operation has taken longer than expected”). In the meantime we had opened a ticket with Microsoft Premier Support, who advised several tuning steps, none of which led to a solution.
We had all components checked: network devices and environment, storage, ESX hosts and so on. Everything worked normally, and no other system showed problems at the times the file server hung.
Finally we split the data across different file servers using a standalone DFS, to isolate the possible source of the error. It turned out that one folder hosted Word templates containing VB script, and every time a member of the development team with write access to the files opened a document based on one of these templates, event 1020 was logged.
Two steps helped solve the problem: optimizing the VB code and removing all write access to the template folders for normal user accounts. Developers now have to use dedicated administrative accounts when they need to replace a template with a newer version.
If you notice the described issue on your file server, especially event 1020 in the SMB server log, check whether Word templates are stored on that server and make sure that every user has read-only access to that folder.
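To see whether event 1020 clusters around particular times of day, it can help to count the warnings per hour. The following is only a rough sketch, not part of our original troubleshooting: it assumes the SMB server log has been exported to CSV and that the export has columns named "Event ID" and "Date and Time" with a US-style timestamp. The column names, the time format, the function name and the sample rows are all assumptions made for illustration, so adjust them to whatever your export actually contains.

```python
import csv
import io
from collections import Counter
from datetime import datetime

# Event ID of the "File system operation has taken longer than expected" warning.
SLOW_IO_EVENT_ID = "1020"

# Made-up sample rows for illustration only; not taken from the real incident.
SAMPLE_EXPORT = """Level,Date and Time,Source,Event ID
Warning,12/05/2016 9:14:02 AM,SMBServer,1020
Warning,12/05/2016 9:47:30 AM,SMBServer,1020
Information,12/05/2016 10:01:11 AM,SMBServer,9999
Warning,12/05/2016 2:20:45 PM,SMBServer,1020
"""


def count_slow_io_events_per_hour(
    csv_text,
    id_column="Event ID",            # assumed column name
    time_column="Date and Time",     # assumed column name
    time_format="%m/%d/%Y %I:%M:%S %p",  # assumed timestamp format
):
    """Return {"YYYY-MM-DD HH:00": count} for event 1020 rows in a CSV export."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Skip everything that is not the slow file system operation warning.
        if (row.get(id_column) or "").strip() != SLOW_IO_EVENT_ID:
            continue
        timestamp = datetime.strptime(row[time_column].strip(), time_format)
        counts[timestamp.strftime("%Y-%m-%d %H:00")] += 1
    # Sort by hour so the output reads chronologically.
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    for hour, count in count_slow_io_events_per_hour(SAMPLE_EXPORT).items():
        print(hour, count)
```

If the hours with many warnings line up with the times a particular group of users opens certain documents, that folder is a good candidate to move to a separate server for testing, as we did with the DFS split.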