These days I’ve been spending my time working with NSX and integrating it with vCloud Director. During some of this testing I ran into an issue with network connectivity on internal networks in vCloud Director.
To expound on the issue: I created an internal virtual datacenter network in vCloud Director and enabled DHCP services on the NSX Edge virtual machine that gets deployed for this internal network. I then deployed two Linux virtual machines connected to this internal network on two different ESX hosts. These virtual machines should have received an IP address from the DHCP scope configured on the Edge, but for some reason they were not getting an IP address and were unable to ping the gateway (the interface on the Edge device).
To determine whether the issue was specific to the Linux guest, I moved all three virtual machines (the two Linux machines and the NSX Edge) to the same ESX host and restarted the networking service. The machine was then assigned an IP address from the configured DHCP scope and was able to ping its gateway. So there is nothing wrong with the TCP/IP stack in the guest: network traffic between virtual machines on the same ESX host never traverses the external network and is handled in memory, which points to the path between the hosts instead.
Digging a little deeper: The arcane world of log analysis
To dig further (and bring out the geek in me), I started trawling through the vmkernel logs on the ESX host to see what happens when the virtual machine powers up, i.e. does it connect to the virtual port…
The log snippet above tracks the power-on task for the virtual machine (org2-rhel2), and it's quite evident from the last line that the control plane connection is down. These internal networks use VXLAN as their underlying transport, and since VXLAN relies on a controller in unicast mode, the next thing to check is whether the ESX hosts can communicate with the NSX controller.
On the ESX host, we can use the esxcli command to query the VDS for its VXLAN configuration.
I immediately noticed that the connection to the controller was down.
ESX hosts establish a connection to the NSX controller using a user world daemon. The netcpa.log shows communication with the controller and also the updates that are pushed from the controller down to the ESX hosts. Looking at these logs, it's clear that the connection is down.
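As a quick cross-check alongside the logs, a plain TCP connect to the controller can be tried from any machine on the management network. This is only a minimal sketch: the port 1234 is my assumption for the controller's control-plane port (adjust it if your deployment differs), and the function name is made up for illustration.

```python
import socket

# Assumption: the controller accepts control-plane connections on TCP 1234.
CONTROLLER_PORT = 1234


def can_connect(host, port=CONTROLLER_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False
```

Calling `can_connect("192.0.2.10")` (a placeholder address) returns `True` or `False`. Keep in mind that a successful connect only proves that something answers on that address and port, not that it is the controller.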
To isolate further, I compared the MAC addresses seen for the controller IPs and found that a controller's IP had been assigned to another machine on the network. After shutting down that machine and restarting the netcpa agent, the ESX hosts were able to re-establish a connection with the controller.
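The MAC comparison can be sketched in a few lines of Python. This is an illustration rather than the exact procedure I followed: it assumes you have already collected (source, IP, MAC) observations by hand, for example from the ARP tables of a few machines, and all addresses and names below are made up.

```python
from collections import defaultdict


def find_ip_conflicts(observations):
    """Return {ip: sorted MACs} for every IP seen with more than one MAC.

    observations is an iterable of (source, ip, mac) tuples, where source
    is just a label for where the entry was collected.
    """
    macs_by_ip = defaultdict(set)
    for _source, ip, mac in observations:
        # Normalise case and separator so the same MAC compares equal.
        macs_by_ip[ip].add(mac.lower().replace("-", ":"))
    return {ip: sorted(macs) for ip, macs in macs_by_ip.items() if len(macs) > 1}


# Hypothetical observations: two hosts disagree about who owns 192.0.2.10.
observations = [
    ("esx-a", "192.0.2.10", "00:50:56:AA:BB:01"),
    ("esx-b", "192.0.2.10", "00-50-56-cc-dd-02"),
    ("esx-a", "192.0.2.11", "00:50:56:aa:bb:03"),
    ("esx-b", "192.0.2.11", "00:50:56:AA:BB:03"),
]
conflicts = find_ip_conflicts(observations)
print(conflicts)
```

Any IP that shows up in the result is worth a closer look, since two different MAC addresses are claiming it.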
If the controller IP address gets changed and cannot be reverted to the original IP, the /etc/vmware/netcpa/config-by-vsm.xml file on the ESX host can be edited to add the new controller IP address.
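If you would rather script that edit than make it by hand, something like the following could work. Treat it strictly as a sketch: the sample XML layout (a `connection` element with `port` and `server` children) is an assumption for illustration, so check it against the real file on your host, and take a backup before changing anything. Note that it swaps an old address for a new one rather than adding a new entry.

```python
import xml.etree.ElementTree as ET

# Assumed, simplified layout for illustration only; the real file may differ.
SAMPLE = """<config>
  <connectionList>
    <connection id="0000">
      <port>1234</port>
      <server>192.0.2.10</server>
    </connection>
  </connectionList>
</config>"""


def replace_controller_ip(xml_text, old_ip, new_ip):
    """Return (updated_xml, count) with old_ip replaced in <server> elements."""
    root = ET.fromstring(xml_text)
    changed = 0
    for server in root.iter("server"):
        if server.text and server.text.strip() == old_ip:
            server.text = new_ip
            changed += 1
    return ET.tostring(root, encoding="unicode"), changed


updated, count = replace_controller_ip(SAMPLE, "192.0.2.10", "192.0.2.20")
print(count)
print(updated)
```

As with the manual fix above, I would expect the netcpa agent to need a restart before it picks up the new address.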
While this issue may be quite simple, and the kind of thing that can easily happen in a large network, I hope you found the approach to the problem useful. Feedback welcome!!