Troubleshooting a SCVMM 2012 transport error

Last week, someone reached out to me asking for help with a new SCVMM 2012 server. It was a new installation, and for the most part it had worked well for them over the past month. Recently, however, they started seeing the error below, which prevented them from accessing the Virtual Machine Manager:

At first they ignored this error, since rebooting the SQL server seemed to fix the issue. As time went on, the error became more common and they found themselves rebooting more frequently. Sadly, they did not tell me how often they had to reboot. Based on the information they did share, I recommended that they investigate the following areas:

  1. Check for any auto services that are not started
  2. Check to ensure SQL, VMM and WS-Management services (etc.) are started
  3. Check Application and System logs for any service complaints
  4. Check Application and System logs for anything wrong with SCVMM and/or SQL Server
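
The first two checks can be scripted. Below is a minimal Python sketch, not something we ran during this engagement. It assumes PowerShell is available on the server and uses the standard Win32_Service class; the function names are my own. Keep in mind that some automatic services (delayed or trigger-start ones) are legitimately stopped, so treat the result as a list of things to look at, not a list of failures.

```python
import json
import subprocess

# PowerShell pipeline that dumps every service's name, start mode, and state as JSON.
PS_QUERY = (
    "Get-CimInstance -ClassName Win32_Service | "
    "Select-Object Name, StartMode, State | ConvertTo-Json"
)


def query_services():
    """Run the PowerShell query on the local Windows host and return the raw JSON text."""
    result = subprocess.run(
        ["powershell", "-NoProfile", "-Command", PS_QUERY],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def stopped_auto_services(services_json):
    """Return the sorted names of services set to 'Auto' that are not 'Running'."""
    records = json.loads(services_json)
    if isinstance(records, dict):
        # ConvertTo-Json emits a bare object instead of a list for a single result.
        records = [records]
    return sorted(
        r["Name"]
        for r in records
        if r.get("StartMode") == "Auto" and r.get("State") != "Running"
    )
```

On the server itself, `print(stopped_auto_services(query_services()))` would list the candidates. Seeing names such as MSSQLSERVER, SCVMMService, or WinRM in that list would be worth following up.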

For the most part, all services were started and running without issues, and the logs showed no credential errors for any service. At first they suspected that the SCVMM application itself was at fault, based on the error logs below:

 

Application fault again:

 

At this point, I asked whether they had experience with SCVMM and were aware of its relationship with SQL Server. They said they had some, but were not sure of the SQL details. I explained that everything they see in the SCVMM console is stored in a database on the SQL Server and that the application depends on it. After that, they understood why I kept asking about SQL Server.

Then they shared with me this gem of information that I was looking for to help them:


I also asked them to log into SQL Server and check its error logs, to confirm what the error message above from the SCVMM server was telling us.

They first looked at the installation logs and saw that an underlying error in their initial setup logs was coming from SQL Server:

InnerException.Type: System.Data.SqlClient.SqlException, InnerException.Message: A transport-level error has occurred when receiving results from the server. (provider: TCP Provider, error: 0 – The semaphore timeout period has expired.)

When I was finally able to persuade them to look at the SQL server, we found no other databases on it. However, we did see this error message all over the place:

A transport-level error has occurred when receiving results from the server. (provider: TCP Provider, error: 0 – The semaphore timeout period has expired.)

At this point, I let them know that “The semaphore timeout period has expired” is a Winsock error that usually indicates the TCP peer did not ACK some data. This typically means that some part of the network layer croaked (e.g. a NIC or switch is malfunctioning). It could also happen if the server machine was unplugged or hard powered off, but neither SQL Server itself nor the client process would cause this type of error. Even if you forcibly terminated the SQL Server process on the server, you would not see this behavior, because the operating system would at least send a TCP reset (RST).
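
The difference between "the peer answered with a reset" and "nothing came back at all" is easy to see from a client. Here is a small Python sketch of the idea, using only the standard socket module. The function name and the outcome labels are my own, and it only classifies a new connection attempt; it does not reproduce the mid-session failure SQL Server was reporting.

```python
import socket


def probe_tcp(host, port, timeout=3.0):
    """Classify one TCP connection attempt as 'open', 'refused', 'timeout', or 'error'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            # The three-way handshake completed, so something is listening.
            return "open"
    except ConnectionRefusedError:
        # The peer replied with RST: the host is reachable but the port is closed.
        return "refused"
    except socket.timeout:
        # No reply within the timeout: packets are probably being lost on the path.
        return "timeout"
    except OSError:
        # Anything else, such as a name resolution failure or no route to the host.
        return "error"
```

Against a hypothetical SQL server, `probe_tcp("sql01.example.local", 1433)` returning "refused" would point at the SQL service or its port configuration, while "timeout" would point at the NIC, the switch, or something else on the network path.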

I asked them if they knew how TCP works, and then explained the three-way handshake with these diagrams:

Basic overview:


Then I went further and explained some more details with this one:


I recommended that they use either Netmon or Wireshark to perform a network capture and see where the failure is actually occurring. However, they saw the error message below and just asked the data center people to look at the SQL server NIC and switches. (It appears it’s either a NIC issue or a networking issue on the SQL server side. I found these two logs after digging around for a while and they seem to be consistent throughout.)


Personally, I would have gone a step further and pinpointed the actual device. I would have done a quick TCP-based trace route with a tool like tracetcp. (It requires you to install WinPcap.) A lot of people are not aware that ICMP traffic, such as ping and tracert, is often dropped or heavily deprioritized on busy networks, which can lead to false alarms. If your network traffic is leaving a building, I would recommend testing with a TCP connection to the actual service port, since that traffic is treated the same way as the application's own. After that, I would have confirmed my findings in the packet capture by checking whether the TCP three-way handshake completes.
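
If installing a trace tool is not an option, a rough "TCP ping" already tells you whether connections to the service port are being lost. The following Python sketch is my own illustration, not a replacement for a real TCP trace route: it shows end-to-end loss and latency only, not which hop is dropping packets. The function names and default values are assumptions.

```python
import socket
import time


def tcp_ping(host, port, count=5, timeout=2.0, pause=0.0):
    """Try `count` TCP connections; return each connect time in ms, or None on failure."""
    results = []
    for _ in range(count):
        start = time.perf_counter()
        try:
            with socket.create_connection((host, port), timeout=timeout):
                results.append((time.perf_counter() - start) * 1000.0)
        except OSError:
            # Covers timeouts, resets, and unreachable hosts alike.
            results.append(None)
        time.sleep(pause)
    return results


def summarize(results):
    """Reduce a list of tcp_ping results to attempt, failure, and average latency figures."""
    ok = [r for r in results if r is not None]
    return {
        "sent": len(results),
        "failed": len(results) - len(ok),
        "avg_ms": sum(ok) / len(ok) if ok else None,
    }
```

For example, `summarize(tcp_ping("sql01.example.local", 1433, count=20, pause=1.0))` (the host name is hypothetical) run from the SCVMM server would show whether connection attempts to the SQL port fail intermittently, which is what I would expect from a flaky NIC or switch.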

Moving forward, I hope people can see that the key here is to look at basic networking and at the Event Viewer (plus other application) logs to get a true sense of what is going on. Sometimes your perception of a problem can lead you down the wrong path if you treat it as reality. In this case, most of the early troubleshooting was focused on SCVMM, and not much attention was paid to the SQL server.
