Troubleshooting vCloud Director Internal Networks

Lately I’ve been spending my time working with NSX and integrating it with vCloud Director. During some of these tests I ran into a connectivity issue on internal networks in vCloud Director.
To expound on the issue: I created an internal virtual datacenter network in vCloud Director and enabled DHCP services on the NSX Edge virtual machine that gets deployed for this network. I then deployed two Linux virtual machines connected to this internal network on two different ESX hosts. These virtual machines should have received an IP address from the DHCP scope configured on the Edge, but neither of them got an address, and neither could ping the gateway (the interface on the Edge device).

To check whether the issue was specific to the Linux guests, I moved all three virtual machines (the two Linux machines and the NSX Edge) to the same ESX host and restarted the networking service. Each machine was then assigned an IP address from the configured DHCP scope and was able to ping its gateway. So there is nothing wrong with the TCP/IP stack in the guests: network traffic between machines on the same ESX host never traverses the external network and is handled in memory. The problem therefore had to sit somewhere in the path between the hosts.

Digging a little deeper: The arcane world of log analysis

Bringing out the geek in me, I started trawling through the vmkernel logs on the ESX host to see what happens when the virtual machine powers up, i.e. does it connect to its virtual port?

2014-05-19T09:34:02.904Z cpu15:29956759)World: vm 29956760: 1462: Starting world vmm0:org1-rhel2_(bc4599c3-ff8e-432b-863e-1cdcef544661) of type 8

2014-05-19T09:34:02.904Z cpu15:29956759)Sched: vm 29956760: 6410: Adding world ‘vmm0:org1-rhel2_(bc4599c3-ff8e-432b-863e-1cdcef544661)’, group ‘host/user/pool3’, cpu: shares=-3 min=200 minLimit=-1 max=1000, mem: shares=-3 min=3072 minLimit=-1 max=16384

2014-05-19T09:34:02.904Z cpu15:29956759)Sched: vm 29956760: 6425: renamed group 57859289 to vm.29956759

2014-05-19T09:34:02.904Z cpu15:29956759)Sched: vm 29956760: 6442: group 57859289 is located under group 54783989

2014-05-19T09:34:02.907Z cpu15:29956759)MemSched: vm 29956759: 8263: extended swap to 28290 pgs

2014-05-19T09:34:03.089Z cpu15:29956759)VSCSI: 3750: handle 8370(vscsi0:0):Using sync mode due to sparse disks

2014-05-19T09:34:03.089Z cpu15:29956759)VSCSI: 3792: handle 8370(vscsi0:0):Creating Virtual Device for world 29956760 (FSS handle 1150128849) numBlocks=4194304 (bs=512)

2014-05-19T09:34:03.244Z cpu4:29956760)Net: 2292: connected org1-rhel2 (bc4599c3-ff8e-432b-863e-1cdcef544661).eth0 eth0 to vDS, portID 0x30001e7

2014-05-19T09:34:03.244Z cpu4:29956760)Net: 3055: associated dvPort 1683 with portID 0x30001e7

2014-05-19T09:34:03.247Z cpu4:29956760)NetPort: 2862: resuming traffic on DV port 1683

2014-05-19T09:34:03.247Z cpu4:29956760)vxlan: VDL2_CPSetCPEnabled:2840: Control plane enabled on VXLAN network[5001]

.

.

2014-05-19T09:39:24.824Z cpu11:27610460)WARNING: vxlan: VDL2CPCheckConnUpCB:311: Control plane connection of VXLAN network[5001] is down

The above log snippet tracks the power-on task for the virtual machine (org1-rhel2), and it’s quite evident from the last line that the control plane connection is down. These internal networks use VXLAN as their underlying transport, and since VXLAN in unicast mode relies on a controller, the next thing to check is whether the ESX hosts can communicate with the NSX controller.
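If you would rather not eyeball the log, the same check can be scripted. This is a minimal Python sketch, not a supported tool: the regular expression is inferred only from the two vxlan lines in the excerpt above, so treat the message format as an assumption.

```python
import re

# Pattern inferred from the two vxlan messages in the excerpt above;
# it is an assumption, not a documented vmkernel log format.
CP_EVENT = re.compile(
    r"^(?P<ts>\S+) .*vxlan: \w+:\d+: Control plane "
    r"(?P<what>enabled on|connection of) VXLAN network\[(?P<vni>\d+)\]"
    r"(?P<tail>.*)$"
)

def control_plane_events(lines):
    """Yield (timestamp, vni, state) for each VXLAN control plane message."""
    for line in lines:
        m = CP_EVENT.match(line.strip())
        if not m:
            continue
        if m.group("what") == "enabled on":
            state = "enabled"
        elif "is down" in m.group("tail"):
            state = "down"
        else:
            state = "other"
        yield m.group("ts"), int(m.group("vni")), state

sample = [
    "2014-05-19T09:34:03.247Z cpu4:29956760)vxlan: VDL2_CPSetCPEnabled:2840: "
    "Control plane enabled on VXLAN network[5001]",
    "2014-05-19T09:39:24.824Z cpu11:27610460)WARNING: vxlan: VDL2CPCheckConnUpCB:311: "
    "Control plane connection of VXLAN network[5001] is down",
]

events = list(control_plane_events(sample))
for ts, vni, state in events:
    print(ts, vni, state)
```

A "down" event that follows an "enabled" event for the same VXLAN network is the pattern that pointed me at the controller.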
On the ESX host, the esxcli command can be used to query the VDS for its VXLAN configuration.

~ # esxcli network vswitch dvs vmware vxlan list --vds-name Nebula-Networks
VDS ID                                           VDS Name         MTU   Segment ID   Gateway IP     Gateway MAC        Network Count  Vmknic Count
-----------------------------------------------  ---------------  ----  -----------  -------------  -----------------  -------------  ------------
d7 e6 3d 50 19 d7 02 36-f4 23 96 fe 64 46 1c 33  Nebula-Networks  1600  192.168.1.0  192.168.1.254  00:21:55:08:ec:40              2             1

Listing the VXLAN networks on that VDS, I immediately noticed that the connection to the controller was down:

~ # esxcli network vswitch dvs vmware vxlan network list --vds-name Nebula-Networks
VXLAN ID  Multicast IP               Control Plane  Controller Connection  Port Count  MAC Entry Count  ARP Entry Count
--------  -------------------------  -------------  ---------------------  ----------  ---------------  ---------------
    5000  N/A (headend replication)  Enabled ()     192.168.1.50  (down)            2                0                0
    5001  N/A (headend replication)  Enabled ()     192.168.1.50  (down)            1                0                0
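With more than a couple of VXLAN networks, scanning this output by eye gets tedious. The sketch below is one way to flag the affected networks in Python. The row layout is inferred from the sample output above, and I am assuming a healthy connection is reported as "(up)"; neither is taken from official documentation.

```python
import re

# Row shape inferred from the sample output: a numeric VXLAN ID at the start
# of the line, and later "<controller IP>  (<state>)". This is an assumption.
ROW = re.compile(
    r"^\s*(?P<vni>\d+)\s.*?(?P<ip>\d+\.\d+\.\d+\.\d+)\s+\((?P<state>\w+)\)"
)

def controller_states(output):
    """Map each VXLAN ID to (controller IP, connection state)."""
    states = {}
    for line in output.splitlines():
        m = ROW.match(line)
        if m:
            states[int(m.group("vni"))] = (m.group("ip"), m.group("state"))
    return states

sample = """\
VXLAN ID  Multicast IP               Control Plane  Controller Connection  Port Count  MAC Entry Count  ARP Entry Count
--------  -------------------------  -------------  ---------------------  ----------  ---------------  ---------------
    5000  N/A (headend replication)  Enabled ()     192.168.1.50  (down)            2                0                0
    5001  N/A (headend replication)  Enabled ()     192.168.1.50  (down)            1                0                0
"""

states = controller_states(sample)
# Assumption: anything other than "up" counts as a problem.
not_up = sorted(vni for vni, (_, state) in states.items() if state != "up")
print(not_up)
```

Header and separator lines are skipped because they do not start with a numeric VXLAN ID.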

An ESX host establishes its connection to the NSX controller using a user world daemon, netcpa. The netcpa.log file shows the communication with the controller as well as the updates that the controller pushes down to the ESX hosts. Looking at these logs, it’s clear that the connection is down.

~ # tail -f /var/log/netcpa.log

2014-05-19T09:34:09.615Z [37281B70 info ‘Default’] Core: Sharding connection 192.168.1.50:0 is timeout

2014-05-19T09:34:09.615Z [37281B70 info ‘Default’] App CORE : 0 unregister connection to 192.168.1.50:0

2014-05-19T09:34:09.615Z [37281B70 info ‘Default’] User of connection 192.168.1.50:0

2014-05-19T09:34:09.615Z [37281B70 info ‘Default’] App CORE : 0 register connection to existing controller to 192.168.1.50 port 1234
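To see how often this happens, and against which controller, the timeouts can be counted. Here is a minimal Python sketch; the pattern is taken from the "is timeout" line above and is an assumption about the log format, not something I have seen documented.

```python
import re
from collections import Counter

# Pattern taken from the sample netcpa.log line above (an assumption about
# the format, not a documented one).
TIMEOUT = re.compile(
    r"Sharding connection (?P<peer>\d+\.\d+\.\d+\.\d+):\d+ is timeout"
)

def timeouts_per_controller(lines):
    """Count sharding connection timeouts per controller IP."""
    counts = Counter()
    for line in lines:
        m = TIMEOUT.search(line)
        if m:
            counts[m.group("peer")] += 1
    return counts

sample = [
    "2014-05-19T09:34:09.615Z [37281B70 info 'Default'] Core: Sharding connection 192.168.1.50:0 is timeout",
    "2014-05-19T09:34:09.615Z [37281B70 info 'Default'] App CORE : 0 unregister connection to 192.168.1.50:0",
    "2014-05-19T09:34:09.615Z [37281B70 info 'Default'] App CORE : 0 register connection to existing controller to 192.168.1.50 port 1234",
]

counts = timeouts_per_controller(sample)
print(dict(counts))
```

A steadily growing count for a single controller IP would suggest the host keeps retrying and never gets through.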

To isolate further, I compared the MAC addresses seen for the controller’s IP and found that the controller’s IP address had also been assigned to another machine on the network. After shutting down that machine and restarting the netcpa agent, the ESX hosts were able to re-establish a connection with the controller.
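The MAC comparison boils down to checking whether one IP address answers with more than one MAC. The Python sketch below shows that logic on its own. The observations are hypothetical (IP, MAC) pairs that you would collect yourself, for example from ARP tables on different machines, and the MAC addresses are made up for illustration.

```python
from collections import defaultdict

def find_ip_conflicts(observations):
    """Return {ip: sorted MACs} for every IP seen with more than one MAC."""
    macs_by_ip = defaultdict(set)
    for ip, mac in observations:
        macs_by_ip[ip].add(mac.lower())
    return {ip: sorted(macs) for ip, macs in macs_by_ip.items() if len(macs) > 1}

# Hypothetical observations: the MAC addresses for 192.168.1.50 are invented.
observations = [
    ("192.168.1.50", "00:50:56:AA:BB:01"),   # what one host sees for the controller IP
    ("192.168.1.50", "00:0c:29:cc:dd:02"),   # what another host sees for the same IP
    ("192.168.1.254", "00:21:55:08:ec:40"),  # gateway, seen consistently
    ("192.168.1.254", "00:21:55:08:EC:40"),  # same MAC, different letter case
]

conflicts = find_ip_conflicts(observations)
print(conflicts)
```

MACs are lowercased before comparison so that the same address reported in different letter case is not mistaken for a conflict.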

~ # tail -f /var/log/netcpa.log

2014-05-19T10:37:10.471Z [5DC5DB70 info ‘Default’] Core: ShardingSlice length of peer 192.168.1.50: 4194304

2014-05-19T10:37:10.471Z [5DC5DB70 info ‘Default’] Vxlan: core app ready on 192.168.1.50:0

2014-05-19T10:37:10.472Z [5DC5DB70 info ‘Default’] Vxlan: send VNI Membership Update(Join) to the controller: VNI 5000 controller 192.168.1.50

2014-05-19T10:37:10.472Z [5DC5DB70 info ‘Default’] Vxlan: send VNI Membership Update(Join) to the controller: VNI 5001 controller 192.168.1.50

2014-05-19T10:37:10.472Z [5DC5DB70 info ‘Default’] Core: Controller is ready: 192.168.1.50:0

2014-05-19T10:37:10.472Z [FFE59100 info ‘Default’] Core: Sharding Segment Update message: server 192.168.1.50 startSliceId 0 numSlices 1024

2014-05-19T10:37:10.473Z [FFE59100 info ‘Default’] Vxlan: receive VNI Membership Update(Join) from the controller: VNI 5000 controller 192.168.1.50 len 23

2014-05-19T10:37:10.473Z [FFE59100 info ‘Default’] Vxlan: set VNI 5000 (mcast proxy: Enabled, arp proxy: Enabled)

2014-05-19T10:37:10.474Z [FFE59100 info ‘Default’] Vxlan: receive VNI Membership Update(Join) from the controller: VNI 5001 controller 192.168.1.50 len 23

2014-05-19T10:37:10.474Z [FFE59100 info ‘Default’] Vxlan: set VNI 5001 (mcast proxy: Enabled, arp proxy: Enabled)
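A quick way to confirm the recovery is to check that every VNI the host sent a Join for also shows a Join coming back from the controller. The following Python sketch does that. Treating the matching "receive" line as the controller's acknowledgement is my reading of the log above, not documented behaviour, and the pattern is inferred from these sample lines.

```python
import re

# Pattern inferred from the netcpa.log lines above (an assumption).
JOIN = re.compile(
    r"Vxlan: (?P<dir>send|receive) VNI Membership Update\(Join\) "
    r"(?:to|from) the controller: VNI (?P<vni>\d+)"
)

def unconfirmed_joins(lines):
    """Return the VNIs with a Join sent but no Join received back."""
    sent, received = set(), set()
    for line in lines:
        m = JOIN.search(line)
        if not m:
            continue
        target = sent if m.group("dir") == "send" else received
        target.add(int(m.group("vni")))
    return sent - received

sample = [
    "Vxlan: send VNI Membership Update(Join) to the controller: VNI 5000 controller 192.168.1.50",
    "Vxlan: send VNI Membership Update(Join) to the controller: VNI 5001 controller 192.168.1.50",
    "Vxlan: receive VNI Membership Update(Join) from the controller: VNI 5000 controller 192.168.1.50 len 23",
    "Vxlan: receive VNI Membership Update(Join) from the controller: VNI 5001 controller 192.168.1.50 len 23",
]

missing = unconfirmed_joins(sample)
print(missing)
```

An empty result means every Join that was sent has a matching line from the controller, which is what the log above shows for VNI 5000 and 5001.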

If the controller IP address gets changed and cannot be reverted to the original IP, the /etc/vmware/netcpa/config-by-vsm.xml file on the ESX host can be edited to add the new controller IP address.
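For what it's worth, here is a hedged Python sketch of what such an edit could look like if scripted. The XML layout in SAMPLE (a connection entry with port and server elements) is an assumption on my part, so check it against the real file first. The new address 192.168.1.60 is hypothetical, and the sketch swaps the old address for the new one rather than adding a second entry.

```python
import xml.etree.ElementTree as ET

# Hypothetical file content: the element names are an assumption and must be
# checked against the real config-by-vsm.xml before adapting this.
SAMPLE = """<config>
  <connectionList>
    <connection id="0000">
      <port>1234</port>
      <server>192.168.1.50</server>
    </connection>
  </connectionList>
</config>"""

def replace_controller_ip(xml_text, old_ip, new_ip):
    """Return xml_text with every <server> equal to old_ip set to new_ip."""
    root = ET.fromstring(xml_text)
    for server in root.iter("server"):
        if (server.text or "").strip() == old_ip:
            server.text = new_ip
    return ET.tostring(root, encoding="unicode")

# 192.168.1.60 is a made-up replacement address for illustration.
updated = replace_controller_ip(SAMPLE, "192.168.1.50", "192.168.1.60")
print(updated)
```

Back up the original file before changing it. As with the duplicate IP fix above, I would expect the netcpa agent to need a restart before it picks up the change.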

While this issue is quite simple and the kind of thing that can easily happen in a large network, I hope you found the approach to the problem useful. Feedback welcome!

2 thoughts on “Troubleshooting vCloud Director Internal Networks”

Andrew
June 7, 2014 at 5:54 pm

This is a great write-up. You really worked a lot to get this. I follow you man, expecting more articles on NSX.
- donovandurand (Post author)
  June 9, 2014 at 4:38 pm
  
  Thanks Andrew. Hope to blog more on this subject.

#Virtual Chronicles
