Here we are with yet another post in this series on uncovering virtual networking. In the previous post we discussed traffic shaping. In this post, I will dive into the network failure detection (NFD) policy options available in vSphere (specifically ESXi 🙂). Starting with this post, we are moving on to the Teaming and Failover policies. Let’s have a quick look at the Teaming and Failover policy options before I discuss NFD.

As you can see in the image, we have multiple options here, such as NFD, Notify switches, Failback, and the load balancing policies. I’ll talk about the other options in another post. That being said, let us discuss NFD now.
The NFD policy offers two methods to detect link failure: Link Status Only and Beacon probing.

Link Status Only
This method relies only on the link status that the physical network adapter reports. It detects failures only at the next hop, such as removed cables, loose connections, and physical switch power failures. Basically, it checks whether the link is Up or Down.
As in the image example below, let’s assume that physical switch 1 has a power failure. In that event, the ESXi host reroutes all traffic from Link 1 on vmnic0 to Link 2 on vmnic1.
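To make the Up/Down check concrete, here is a minimal Python sketch of the idea. It is a toy model, not ESXi code: the `Uplink` class and the `usable_uplinks` function are names I made up for illustration, and the only input is the link state reported by each physical NIC.

```python
from dataclasses import dataclass


@dataclass
class Uplink:
    name: str
    link_up: bool  # link state as reported by the physical NIC (assumed input)


def usable_uplinks(team):
    """Return the names of the uplinks whose link state is Up."""
    return [u.name for u in team if u.link_up]


team = [Uplink("vmnic0", True), Uplink("vmnic1", True)]

# Physical switch 1 loses power, so the link on vmnic0 goes Down.
team[0].link_up = False

print(usable_uplinks(team))
```

With vmnic0 Down, only vmnic1 is left in the list, which matches the reroute from Link 1 to Link 2 described above.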

The Link Status Only method has limitations:
- It can detect a failure only at the immediate next hop, not on upstream devices beyond it.
- It cannot detect misconfigurations on the physical switches, such as an incorrect VLAN ID or an MTU mismatch on upstream networks.
Because of these limitations, network health check was introduced in the vSphere Distributed Switch (DVS). The health check helps you identify and troubleshoot configuration errors in a vSphere Distributed Switch. It can help identify MTU mismatches, teaming policy mismatches, and VLAN configuration issues. I certainly recommend turning it on, as it can save hours of troubleshooting effort. Read more about Health check.
Beacon probing
Beacon probing is a simple process of network failure detection. It requires an odd number of network adapters (aka uplinks) connected to a switch. A packet called a beacon probe is sent out on each uplink every second as an Ethernet broadcast frame and is expected to be received on the remaining uplinks. The image below is a graphical representation of this process. I have shown the process from one uplink, uplink0 (vmnic0), but it applies to every uplink that is connected to the switch as part of NIC teaming.

Each uplink (in this case vmnic0) sends out its beacon as an Ethernet broadcast frame. Beacon probing uses this information along with the link status to determine link failure. When the remaining uplinks (vmnic1 and vmnic2 here) receive the beacon sent by an uplink, each of them knows that it is not isolated from the others and that its link state is fine (Up). The advantage of beacon probing is that, unlike Link Status Only, it can detect upstream failures as well as some misconfigurations.
To use beacon probing, the basic requirement is at least three network cards. It is also recommended that the uplinks not all be connected to the same physical switch, as that can cause beacon probing to miss issues beyond the switch. As in the image above, connect your uplinks to different physical switches. The physical NICs or uplinks must also be in an active-active or active-standby configuration, because uplinks in the unused state do not participate in beacon probing.
If beacon probing is used with only two NICs, the switch cannot determine which NIC needs to be taken out of service, because neither of them receives beacons if one loses connectivity. Using at least three NICs in a team allows for n-2 failures, where n is the number of NICs in the team, before reaching an ambiguous situation.
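The two-NIC ambiguity is easy to see in a small simulation. The sketch below is a simplified model and not how ESXi implements beaconing: I assume a beacon from one uplink reaches another only if both are still connected to the shared broadcast domain, and the function names `beacons_received` and `suspect_uplinks` are hypothetical.

```python
def beacons_received(uplinks, connected):
    """Count, for each uplink, how many peer beacons it hears.

    Assumption: a beacon from sender s reaches receiver r only if both
    s and r are still connected to the shared broadcast domain.
    """
    return {
        r: sum(1 for s in uplinks if s != r and s in connected and r in connected)
        for r in uplinks
    }


def suspect_uplinks(uplinks, connected):
    """Return the uplinks that hear no beacons at all."""
    counts = beacons_received(uplinks, connected)
    return [u for u, heard in counts.items() if heard == 0]


# Two NICs, vmnic1 loses connectivity: neither uplink hears a beacon,
# so the failed one cannot be told apart from the healthy one.
print(suspect_uplinks(["vmnic0", "vmnic1"], connected={"vmnic0"}))

# Three NICs, vmnic1 loses connectivity: vmnic0 and vmnic2 still hear
# each other, so only vmnic1 stands out.
print(suspect_uplinks(["vmnic0", "vmnic1", "vmnic2"], connected={"vmnic0", "vmnic2"}))
```

In the two-NIC case both uplinks end up in the suspect list, while in the three-NIC case only vmnic1 does. The same model also shows the n-2 limit: with three NICs and two of them disconnected, the one healthy uplink hears nothing either, and the situation is ambiguous again.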
What if an uplink has a connectivity issue?

As in the image above, let’s assume that physical switch 2 has some misconfiguration. As a result, the beacon probe is not received on the second uplink, vmnic1. Upon detecting this failure, the switch reroutes your virtual machine traffic to the other uplinks where probing was successful. The second uplink will not be used for network transmission until the next successful beacon delivery.
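The "out until the next successful beacon" behaviour can be pictured as a per-second decision. This is only an illustrative sketch under my own assumptions: `beacon_log` is a made-up record with one entry per second saying whether each uplink's beacon check succeeded, and `usable_per_second` is a hypothetical helper, not an ESXi function.

```python
def usable_per_second(beacon_log):
    """For each one-second interval, list the uplinks whose beacon check succeeded."""
    return [[name for name, ok in tick.items() if ok] for tick in beacon_log]


# Hypothetical log: vmnic1 misses beacons in seconds 1 and 2, then recovers.
beacon_log = [
    {"vmnic0": True, "vmnic1": True, "vmnic2": True},
    {"vmnic0": True, "vmnic1": False, "vmnic2": True},
    {"vmnic0": True, "vmnic1": False, "vmnic2": True},
    {"vmnic0": True, "vmnic1": True, "vmnic2": True},
]

for second, uplinks in enumerate(usable_per_second(beacon_log)):
    print(second, uplinks)
```

In this toy run, vmnic1 drops out of the usable list while its beacons are missing and rejoins in the interval where a beacon is delivered again.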
Conclusion:
Beacon probing is a more reliable method of network failure detection than Link Status Only and is highly recommended. However, you need to satisfy the minimum requirement of three uplinks, and use an odd number of uplinks if you have more than three.
Also check whether you can use beacon probing in your environment; some organizations may prefer not to, because beacon probing generates traffic every second.
Note: Do not use beacon probing with the IP hash load balancing policy.
Check the next post on load balancing algorithms.