Implementing High Availability with Bidirectional Forwarding Detection (BFD)

How can a router detect that a BGP neighbor, or any IGP neighbor, has gone down? If the two devices are connected directly, back to back, it's usually not a problem: the line protocol goes down at the same moment and the router tears down the neighborship. But what if there is at least one Layer 2 device (a switch, for example) between the two routers? In that case we can only rely on BGP Keepalives or OSPF Hellos, because the link toward the switch stays up/up even when the router on the other end is gone.

In this lab the topology is very simple: we'll examine the BGP neighborship between CSR1 and CSR2, which are both connected to the same switch. They are in the same VLAN and the same subnet, so they can become neighbors.

[Figure: BFD topology]

By default, if CSR1 went down, CSR2 would need up to about 180 seconds to detect it. CSR2 relies on the BGP Keepalives, which are sent every 60 seconds by default. If CSR2 doesn't receive three consecutive Keepalives, it declares CSR1 down, tears down the BGP neighborship and removes the routes learned from CSR1. We can lower the Keepalive and Holdtime values like this:

CSR1(config-router)#neighbor 100.1.2.2 timers ?
  <0-65535>  Keepalive interval

CSR1(config-router)#neighbor 100.1.2.2 timers 5 ?
  <0-65535>  Holdtime

CSR1(config-router)#neighbor 100.1.2.2 timers 5 15 ?
  <0-65535>  Minimum hold time from neighbor
  <cr>       <cr>

CSR1(config-router)#neighbor 100.1.2.2 timers 5 15 10
% Warning: A hold time of less than 20 seconds increases
  the chances of peer flapping

Now CSR1 and CSR2 send Keepalives every 5 seconds. The Holdtime, which I set to 15 seconds, is advertised in the BGP Open message.

[Figure: BGP Keepalives sent every 5 seconds]
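
The way these timers interact can be sketched in a few lines of Python. This is an illustration, not router code: the function names are made up, the `min` rule for the Holdtime follows the BGP specification (both peers use the smaller of the two advertised values), and the Keepalive rule (the configured value, capped at one third of the negotiated Holdtime) is my assumption about how IOS behaves.

```python
def negotiated_hold_time(local_hold, remote_hold, min_accepted=0):
    """Hold time (seconds) that both peers end up using."""
    # Assumption: a neighbor advertising less than our configured
    # "minimum hold time from neighbor" is refused.
    if remote_hold < min_accepted:
        raise ValueError("neighbor hold time below configured minimum")
    return min(local_hold, remote_hold)


def keepalive_interval(configured_keepalive, hold_time):
    """Keepalive interval (seconds) actually used on the session."""
    # Assumption: the keepalive never exceeds one third of the hold time.
    return min(configured_keepalive, hold_time // 3)


# CSR1 is configured with "timers 5 15 10"; CSR2 is assumed to run the
# defaults (keepalive 60, hold time 180).
hold = negotiated_hold_time(15, 180, min_accepted=10)
print(hold)                          # 15
print(keepalive_interval(5, hold))   # 5, as seen from CSR1
print(keepalive_interval(60, hold))  # 5, as seen from CSR2
```

Under these assumptions both routers end up sending a Keepalive every 5 seconds even though only CSR1 was reconfigured.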

What's the problem with this solution? First, we cannot configure sub-second timers: the minimum value for the Keepalive interval is one second. Second, if we decrease the interval between the BGP Keepalives, we increase the load on the router's CPU. Remember that these are control plane messages handed directly to the router's CPU; moreover, because BGP uses TCP as its transport, the neighbor also has to send back a TCP ACK. This doesn't scale well with many BGP sessions: if we lower the timers toward every neighbor, the router's CPU will suffer.

If we want sub-second failure detection we have to use BFD, which is a standalone protocol configured on the routers; the routing protocols then have to register with BFD. BFD is configured directly on the interfaces:

CSR1(config)#int gigabitEthernet 1
CSR1(config-if)#bfd interval ?
  <50-9999>  Milliseconds

CSR1(config-if)#bfd interval 500 ?
  min_rx  Minimum receive interval capability

CSR1(config-if)#bfd interval 500 min_rx ?
  <50-9999>  Milliseconds

CSR1(config-if)#bfd interval 500 min_rx 500 ?
  multiplier  Multiplier value used to compute holddown

CSR1(config-if)#bfd interval 500 min_rx 500 multiplier ?
  <3-50>  value used to multiply the interval

CSR1(config-if)#bfd interval 500 min_rx 500 multiplier 3

Here we define the BFD Echo intervals: the first value is the min_tx and the second is the min_rx. If our neighbor sets its min_rx higher than our min_tx, we have to use the neighbor's min_rx; we cannot send more often than the neighbor's min_rx allows. The min_rx value is transmitted in the BFD Control packets. Here I set the Echo TX interval to 500 ms. Notice that we can go much lower than that: on IOS-XE (I use the CSR1000v image here) the minimum value we can configure is 50 ms. On IOS-XR we can go lower still; some platforms support 5 ms Echo TX intervals. The multiplier is straightforward: if CSR2 misses 3 BFD Echoes, it declares CSR1 down and notifies every registered protocol. After issuing this command, CSR1 starts sending BFD Control packets out of its G1 interface:

[Figure: CSR1 starts sending BFD Control packets]
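
The interval rule described above (we may not send more often than the neighbor's min_rx allows) boils down to taking the larger of two numbers. A minimal Python sketch; the function names are mine, and the neighbor with an 800 ms min_rx is a hypothetical value, not something configured in this lab.

```python
def effective_tx_interval_ms(local_min_tx, remote_min_rx):
    """Interval (ms) at which the local router may actually transmit."""
    return max(local_min_tx, remote_min_rx)


def to_wire_units(interval_ms):
    """BFD Control packets carry intervals in microseconds."""
    return interval_ms * 1000


# Both sides at 500 ms, as in this lab.
print(effective_tx_interval_ms(500, 500))   # 500
# Hypothetical neighbor that only accepts a packet every 800 ms.
print(effective_tx_interval_ms(500, 800))   # 800
# The value the 500 ms setting takes inside a Control packet.
print(to_wire_units(500))                   # 500000
```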

CSR2 responds with an ICMP Port Unreachable, because BFD has not been configured on CSR2 yet. As can be seen above, BFD uses UDP as its transport, with destination port 3784 for the Control packets; the BFD Echo packets use UDP 3785. Also notice that the Echo min_rx value (which we configured above) is carried in the BFD Control packets. The Desired Min TX and Required Min RX values apply to the BFD Control packets themselves. Later we'll see the difference between the BFD Echo and Control packets. For now, CSR1 only sends BFD Control messages periodically, trying to form a BFD session with CSR2. So let's configure BFD on CSR2 as well:

CSR2(config)#int g1
CSR2(config-if)#bfd interval 500 min_rx 500 multiplier 3

But this command by itself doesn't do anything. Routing protocols have to register with BFD to be able to use the sub-second failure detection. The command is different for every routing protocol; for BGP we register with this command:

CSR1(config)#router bgp 65001
CSR1(config-router)#neighbor 100.1.2.2 fall-over bfd
%BFD-6-BFD_SESS_CREATED: BFD-SYSLOG: bfd_session_created, neigh 100.1.2.2 proc:BGP, idb:GigabitEthernet1 handle:1 act

CSR2(config)#router bgp 65002
CSR2(config-router)#neighbor 100.1.2.1 fall-over bfd
%BFD-6-BFD_SESS_CREATED: BFD-SYSLOG: bfd_session_created, neigh 100.1.2.1 proc:BGP, idb:GigabitEthernet1 handle:1 act
%BFDFSM-6-BFD_SESS_UP: BFD-SYSLOG: BFD session ld:4097 handle:1 is going UP

As we can see, the BFD session has been established between CSR1 and CSR2. We can verify it with the following command:

CSR2#show bfd neighbors details 

IPv4 Sessions
NeighAddr                              LD/RD         RH/RS     State     Int
100.1.2.1                            4097/4097       Up        Up        Gi1
Session state is UP and using echo function with 500 ms interval.
Session Host: Software
OurAddr: 100.1.2.2      
Handle: 1
Local Diag: 0, Demand mode: 0, Poll bit: 0
MinTxInt: 1000000, MinRxInt: 1000000, Multiplier: 3
Received MinRxInt: 1000000, Received Multiplier: 3
Holddown (hits): 0(0), Hello (hits): 1000(137)
Rx Count: 143, Rx Interval (ms) min/max/avg: 1/998/843 last: 452 ms ago
Tx Count: 141, Tx Interval (ms) min/max/avg: 2/998/858 last: 1 ms ago
Echo Rx Count: 117, Echo Rx Interval (ms) min/max/avg: 383/1990/1019 last: 326 ms ago
Echo Tx Count: 117, Echo Tx Interval (ms) min/max/avg: 383/1989/1019 last: 328 ms ago
Elapsed time watermarks: 0 0 (last: 0)
Registered protocols: BGP CEF 
Uptime: 00:02:00
Last packet: Version: 1                  - Diagnostic: 0
             State bit: Up               - Demand bit: 0
             Poll bit: 0                 - Final bit: 0
             C bit: 0                                   
             Multiplier: 3               - Length: 24
             My Discr.: 4097             - Your Discr.: 4097
             Min tx interval: 1000000    - Min rx interval: 1000000
             Min Echo interval: 500000  

Take a look at the "Registered protocols" line above: the BGP process has registered with BFD. Multiple protocols can register at the same time, and many are supported: BGP, OSPF, EIGRP, HSRP, etc. CEF is registered by default; moreover, CEF is actually required and must be enabled if we want to use BFD.
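
The "Last packet" fields in the output above map directly onto the 24-byte mandatory section of a BFD Control packet as defined in RFC 5880. The following Python sketch decodes that section; the function and dictionary key names are my own, and the sample bytes are rebuilt from the values shown in the output rather than taken from a real capture.

```python
import struct

STATES = {0: "AdminDown", 1: "Down", 2: "Init", 3: "Up"}


def parse_bfd_control(data):
    """Decode the 24-byte mandatory section of a BFD Control packet."""
    (vers_diag, state_flags, mult, length, my_disc, your_disc,
     min_tx, min_rx, min_echo_rx) = struct.unpack("!BBBBIIIII", data[:24])
    return {
        "version": vers_diag >> 5,          # top 3 bits
        "diag": vers_diag & 0x1F,           # low 5 bits
        "state": STATES[state_flags >> 6],  # top 2 bits
        "poll": bool(state_flags & 0x20),
        "final": bool(state_flags & 0x10),
        "multiplier": mult,
        "length": length,
        "my_discr": my_disc,
        "your_discr": your_disc,
        "desired_min_tx_us": min_tx,        # intervals are in microseconds
        "required_min_rx_us": min_rx,
        "required_min_echo_rx_us": min_echo_rx,
    }


# Rebuilt from the "Last packet" output: version 1, state Up, multiplier 3.
sample = struct.pack("!BBBBIIIII", 1 << 5, 3 << 6, 3, 24,
                     4097, 4097, 1_000_000, 1_000_000, 500_000)
for key, value in parse_bfd_control(sample).items():
    print(key, value)
```

Note how the 500 ms Echo interval shows up as 500000 in the last field, while the Control packet intervals are 1000000 (one second).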

[Figure: BFD session between CSR1 and CSR2: Control and Echo packets exchange]

We can see that BFD Echo packets are sent about every 500 ms. Notice that these packets are "self-addressed": the source and destination IP addresses are the same. When CSR1 sends a BFD Echo, it uses its own interface address (100.1.2.1) as both the source and the destination, while the destination MAC is of course the MAC address of CSR2. So when CSR2 receives this packet, it simply "loops it back" to CSR1, which is why every packet appears twice in the pcap above. Echo packets only test the forwarding path between the two routers. They do not test the host stack on the remote system, because they are not even destined to the remote system. These are data plane packets forwarded by CEF and not processed by the routers' CPU. That's the main benefit of BFD: sub-second timers with no additional control plane work.

Besides the Echoes, the routers also send BFD Control packets periodically, about every second by default. These are control plane messages destined to the CPU of the remote router. We can change the interval between the Control packets with the following command:

CSR1(config)#bfd slow-timers 5000
CSR2(config)#bfd slow-timers 5000
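
Why a self-addressed Echo comes back without CSR2's BFD process ever looking at it can be shown with a toy forwarding model in Python. Everything here is a simplification for illustration: the `Router` and `Packet` classes and the MAC addresses are hypothetical, and a real router does a full CEF lookup instead of a dictionary access.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    dst_mac: str
    src_ip: str
    dst_ip: str
    udp_dport: int


@dataclass
class Router:
    name: str
    mac: str
    ip: str
    adjacency: dict  # next-hop IP -> MAC, stand-in for the adjacency table

    def receive(self, pkt):
        if pkt.dst_mac != self.mac:
            return ("dropped", None)
        if pkt.dst_ip == self.ip:
            # Addressed to this router: delivered to the local BFD process.
            return ("delivered locally", None)
        # Not ours: forwarded in the data plane toward the destination IP.
        out = Packet(self.adjacency[pkt.dst_ip], pkt.src_ip,
                     pkt.dst_ip, pkt.udp_dport)
        return ("forwarded", out)


# Hypothetical MAC addresses; IP addresses as in the lab.
csr1 = Router("CSR1", "00:00:00:00:00:01", "100.1.2.1",
              {"100.1.2.2": "00:00:00:00:00:02"})
csr2 = Router("CSR2", "00:00:00:00:00:02", "100.1.2.2",
              {"100.1.2.1": "00:00:00:00:00:01"})

# Echo: source and destination IP are both CSR1, destination MAC is CSR2.
echo = Packet(csr2.mac, csr1.ip, csr1.ip, 3785)
# Control: a normal packet addressed to CSR2.
control = Packet(csr2.mac, csr1.ip, csr2.ip, 3784)

echo_result, returned = csr2.receive(echo)
print(echo_result, returned.dst_mac)   # forwarded 00:00:00:00:00:01
print(csr2.receive(control)[0])        # delivered locally
```

In this model the Echo only exercises CSR2's forwarding path, while the Control packet has to be handled by CSR2 itself.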

So this is the "heartbeat" mechanism of BFD in a nutshell: self-addressed packets forwarded by CEF in the data plane (in hardware, on platforms with forwarding ASICs) rather than process-switched or handled by the CPU. What if the connection between CSR1 and the switch breaks? I shut down the link on the switch:

SW1(config)#int g0/0 
SW1(config-if)#shut
*Aug 13 09:44:30.812: %LINK-5-CHANGED: Interface GigabitEthernet0/0, changed state to administratively down
*Aug 13 09:44:31.812: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/0, changed state to down
(1.5 seconds later)
CSR2#
*Aug 13 09:44:33.553: %BFDFSM-6-BFD_SESS_DOWN: BFD-SYSLOG: BFD session ld:4097 handle:1,is going Down Reason: ECHO FAILURE
*Aug 13 09:44:33.554: %BGP-5-NBR_RESET: Neighbor 100.1.2.1 reset (BFD adjacency down)
*Aug 13 09:44:33.554: %BGP-5-ADJCHANGE: neighbor 100.1.2.1 Down BFD adjacency down

And we can see that a little more than 1.5 seconds later, after BFD has missed three consecutive Echoes, it declares the session down and notifies BGP to bring down the adjacency.
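
The numbers can be checked with a short Python calculation: the expected detection time is the Echo interval times the multiplier, and the measured delay is the difference between the log timestamps. Note the assumption that the clocks of SW1 and CSR2 are reasonably in sync, since the two timestamps come from different devices; the helper names are mine.

```python
from datetime import datetime


def detection_time_ms(interval_ms, multiplier):
    """Expected BFD detection time in milliseconds."""
    return interval_ms * multiplier


def log_time(stamp):
    """Parse only the time-of-day part of an IOS log timestamp."""
    return datetime.strptime(stamp.split()[-1], "%H:%M:%S.%f")


line_down = log_time("Aug 13 09:44:31.812")  # SW1: line protocol down
bfd_down = log_time("Aug 13 09:44:33.553")   # CSR2: BFD session down

print(detection_time_ms(500, 3))                # 1500
print((bfd_down - line_down).total_seconds())   # 1.741
```

The measured 1.741 seconds is slightly above the theoretical 1500 ms, which matches the "a little more than 1.5 seconds" seen in the logs.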

Now let's actually test the Echo RX/TX timers. I change the timers on CSR1:

CSR1(config-if)#bfd interval 1000 min_rx 1000 multiplier 3

Now CSR1 sends its Echoes every second, and CSR2 also has to send its Echoes every second, because I set CSR1's min_rx to one second. The min_rx is carried in the BFD Control messages, so CSR2 cannot send faster than that. This way we can limit how often the neighbor sends its BFD Echoes: if a router cannot handle a high rate of BFD packets, we can configure a larger min_rx interval.

[Figure: BFD Echo packets after changing the timers]

We can actually turn off the BFD Echoes with the following command:

CSR1(config-if)#no bfd echo 
CSR2(config-if)#no bfd echo 

Now only BFD Control packets are sent from both routers. Control packets cannot be sent as frequently as the Echoes, and they are control plane messages destined to the router's CPU, so turning the Echo packets off doesn't make much sense.

[Figure: Turning off BFD Echoes, now only BFD Control messages are sent]

Also notice that BGP still sends its own Keepalives even after we register the process with BFD; that didn't change, as can be seen above. The only change is that BFD can notify the BGP process when it detects a failure in the forwarding path, and in response BGP tears down the neighbor relationship with its peer immediately. Also notice that BFD only helps to detect the failure; it doesn't necessarily speed up reconvergence. For that, in the case of IGPs, we can use LFA (Loop-Free Alternate, a fast-reroute feature), which precomputes a repair path and installs it in the routing table.