Motivation (connecting the ‘old’ BGP/EVPN network to the ‘new’ L3-BGP-only one)
The old L2 network has outgrown itself: ARP tables on the servers have become very large and scalability has reached its limits.
Migrating the network architecture in place is impossible without redeploying all servers.
An L3-BGP-only design is the most obvious choice for future growth, but the new network needs a connection to the old one. We also need to keep the CLOS topology (spine-level throughput is in the Tbps range) to meet the latency and throughput demands.
Implementation
For each old L2 VLAN, we deploy a pair of route reflectors. These route reflectors relay NLRI information between the L2 servers and the border SONiC L3 switches.
BGP route reflectors
The BGP protocol is de facto ubiquitous today. It was devised for ISP/IX/WAN environments, but lately it has also been adopted on large LANs. There are 2 modes of BGP:
eBGP and iBGP. eBGP is used between autonomous systems; iBGP is used inside an autonomous system.
We chose to implement iBGP within each VLAN (in L2) / VRF (in L3) and eBGP for routing between those separate internal networks. We can do that because traffic between the internal networks is not large and we filter/firewall it. A predefined AS number is used for all servers and switches included in a given network (a VLAN in L2 or a VRF in L3). This way the network becomes a full Autonomous System.
Usually iBGP requires a full-mesh peering setup. At the scale of hundreds of servers in a single VLAN, a full-mesh topology would be impossible to achieve. The full-mesh connection formula is:
total number of connections = (n*(n-1))/2
where n is the number of devices. For example, 500 servers would already require (500*499)/2 = 124,750 iBGP sessions.
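A back-of-the-envelope shell calculation (numbers illustrative) shows why route reflectors win at this scale:
# Sessions needed: full mesh vs. a pair of route reflectors
n=500
echo "full mesh: $(( n * (n - 1) / 2 )) iBGP sessions"   # 124750
echo "2 x RR:    $(( 2 * n )) iBGP sessions"             # 1000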
An RR (Route Reflector) can propagate iBGP routes to its peers, hence a full mesh of iBGP peers is not necessary. As the network scales up, adding a new peer (a SONiC switch or an L2 server) requires peering only with the 2 route reflectors. Both RRs act as active ones (an active-active solution), providing redundancy along with multipath routes. Please be aware: route reflectors are not routers per se! They only “reflect” NLRIs (prefixes + attributes) to the connected clients - IP traffic is not forwarded or passed through them directly. We use FRR daemons to implement the RRs.
Peering route reflectors between each other
# cat /etc/frr/frr.conf
!
bgp cluster-id 172.16.63.0
neighbor BGP-KFFR peer-group
neighbor BGP-KFFR remote-as internal
neighbor BGP-KFFR description KFRR internal
neighbor 172.16.63.2 peer-group BGP-KFFR
neighbor 172.16.63.2 description kfrr2
!
address-family ipv4 unicast
neighbor BGP-KFFR soft-reconfiguration inbound
neighbor BGP-KFFR route-map KFRR-IMPORT in
neighbor BGP-KFFR route-map KFRR-EXPORT out
exit-address-family
!
ip prefix-list KFRR-ANY seq 5 permit any
!
route-map KFRR-IMPORT permit 1
description KFRR IMPORT
match ip address prefix-list KFRR-ANY
!
route-map KFRR-EXPORT permit 1
description KFRR EXPORT
match ip address prefix-list KFRR-ANY
!
line vty
!
A direct intra-connection between the two route reflectors is needed to avoid a split-brain scenario.
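A quick way to verify that the RR-to-RR session is up (assuming shell access to the RR and FRR's vtysh):
# Check the iBGP session towards the other reflector (kfrr2)
vtysh -c "show bgp ipv4 unicast summary"
vtysh -c "show bgp neighbors 172.16.63.2"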
Peering RRs with SONiC LEAF gateways:
# cat /etc/frr/frr.conf
!
router bgp 65108
bgp router-id 172.16.63.1
neighbor BGP-L3 peer-group
neighbor BGP-L3 remote-as internal
neighbor BGP-L3 timers 1 3
neighbor BGP-L3 timers connect 1
bgp listen limit 2048
bgp listen range 172.17.0.0/23 peer-group BGP-L3
!
address-family ipv4 unicast
neighbor BGP-L3 addpath-tx-all-paths
neighbor BGP-L3 route-reflector-client
neighbor BGP-L3 soft-reconfiguration inbound
neighbor BGP-L3 prefix-list from-l3 in
neighbor BGP-L3 prefix-list to-l3 out
exit-address-family
!
ip prefix-list from-l3 seq 10 permit 172.16.0.0/12 ge 32
ip prefix-list from-l3 seq 99 deny any
ip prefix-list to-l3 seq 5 permit 0.0.0.0/0
ip prefix-list to-l3 seq 10 permit 172.18.0.0/15 le 32
ip prefix-list to-l3 seq 20 permit 172.16.64.0/24 le 32
ip prefix-list to-l3 seq 30 permit 172.16.65.0/24 le 32
ip prefix-list to-l3 seq 40 permit 172.16.0.0/12 ge 32
ip prefix-list to-l3 seq 99 deny any
!
Each L3-BGP-enabled SONiC switch needs to be connected to the RRs; we use the BGP-L3 peer group specifically for this purpose.
Peering old L2 hosts with RR:
# cat /etc/frr/frr.conf
!
neighbor BGP-VLAN2 peer-group
neighbor BGP-VLAN2 remote-as internal
bgp listen limit 2048
bgp listen range 172.16.0.0/16 peer-group BGP-VLAN2
!
address-family ipv4 unicast
neighbor BGP-VLAN2 addpath-tx-all-paths
neighbor BGP-VLAN2 route-reflector-client
neighbor BGP-VLAN2 soft-reconfiguration inbound
neighbor BGP-VLAN2 prefix-list from-vlan2 in
neighbor BGP-VLAN2 prefix-list to-vlan2 out
exit-address-family
!
ip prefix-list from-vlan2 seq 10 permit 172.16.65.0/24 le 32
ip prefix-list from-vlan2 seq 20 permit 172.16.0.0/12 ge 32
ip prefix-list from-vlan2 seq 99 deny any
ip prefix-list to-vlan2 seq 10 permit 172.18.0.0/15 le 32
ip prefix-list to-vlan2 seq 20 permit 172.16.64.0/24 le 32
ip prefix-list to-vlan2 seq 30 permit 172.16.65.0/24 le 32
ip prefix-list to-vlan2 seq 40 permit 172.16.0.0/12 ge 32
ip prefix-list to-vlan2 seq 99 deny any
!
Each old L2-based host needs to peer with the RRs.
The whole RR configuration:
# cat /etc/frr/frr.conf
!
frr version 8.0.1
frr defaults traditional
hostname frr1.ams.creativecdn.net
log file /var/log/frr/frr_bgp.log
log syslog informational
no ipv6 forwarding
service integrated-vtysh-config
!
router bgp 65108
bgp router-id 172.16.63.1
bgp log-neighbor-changes
bgp cluster-id 172.16.63.0
neighbor BGP-KFFR peer-group
neighbor BGP-KFFR remote-as internal
neighbor BGP-KFFR description KFRR internal
neighbor BGP-KUBERNATES peer-group
neighbor BGP-KUBERNATES remote-as internal
neighbor BGP-L3 peer-group
neighbor BGP-L3 remote-as internal
neighbor BGP-L3 timers 1 3
neighbor BGP-L3 timers connect 1
neighbor BGP-NAT peer-group
neighbor BGP-NAT remote-as internal
neighbor BGP-VLAN2 peer-group
neighbor BGP-VLAN2 remote-as internal
neighbor 172.16.63.2 peer-group BGP-KFFR
neighbor 172.16.63.2 description kfrr2
neighbor 172.16.63.101 peer-group BGP-KUBERNATES
neighbor 172.16.63.101 description krr101
neighbor 172.16.63.101 port 180
neighbor 172.16.63.102 peer-group BGP-KUBERNATES
neighbor 172.16.63.102 description krr102
neighbor 172.16.63.102 port 180
neighbor 172.16.5.251 peer-group BGP-NAT
neighbor 172.16.5.251 description nat1a
neighbor 172.16.5.252 peer-group BGP-NAT
neighbor 172.16.5.252 description nat1b
bgp listen limit 2048
bgp listen range 172.17.0.0/23 peer-group BGP-L3
bgp listen range 172.16.0.0/16 peer-group BGP-VLAN2
!
address-family ipv4 unicast
neighbor BGP-KFFR soft-reconfiguration inbound
neighbor BGP-KFFR route-map KFRR-IMPORT in
neighbor BGP-KFFR route-map KFRR-EXPORT out
neighbor BGP-KUBERNATES route-reflector-client
neighbor BGP-KUBERNATES soft-reconfiguration inbound
neighbor BGP-KUBERNATES prefix-list from-kubernates in
neighbor BGP-KUBERNATES prefix-list to-kubernates out
neighbor BGP-L3 addpath-tx-all-paths
neighbor BGP-L3 route-reflector-client
neighbor BGP-L3 soft-reconfiguration inbound
neighbor BGP-L3 prefix-list from-l3 in
neighbor BGP-L3 prefix-list to-l3 out
neighbor BGP-NAT addpath-tx-all-paths
neighbor BGP-NAT route-reflector-client
neighbor BGP-NAT soft-reconfiguration inbound
neighbor BGP-NAT prefix-list from-nat in
neighbor BGP-NAT prefix-list to-vlan2 out
neighbor BGP-VLAN2 addpath-tx-all-paths
neighbor BGP-VLAN2 route-reflector-client
neighbor BGP-VLAN2 soft-reconfiguration inbound
neighbor BGP-VLAN2 prefix-list from-vlan2 in
neighbor BGP-VLAN2 prefix-list to-vlan2 out
exit-address-family
!
ip prefix-list KFRR-ANY seq 5 permit any
ip prefix-list from-kubernates seq 10 permit 172.18.0.0/15 le 32
ip prefix-list from-kubernates seq 20 permit 172.16.64.0/24 le 32
ip prefix-list from-kubernates seq 99 deny any
ip prefix-list from-l3 seq 10 permit 172.16.0.0/12 ge 32
ip prefix-list from-l3 seq 99 deny any
ip prefix-list from-nat seq 10 permit 0.0.0.0/0
ip prefix-list from-nat seq 99 deny any
ip prefix-list from-vlan2 seq 10 permit 172.16.65.0/24 le 32
ip prefix-list from-vlan2 seq 20 permit 172.16.0.0/12 ge 32
ip prefix-list from-vlan2 seq 99 deny any
ip prefix-list to-kubernates seq 99 deny any
ip prefix-list to-l3 seq 5 permit 0.0.0.0/0
ip prefix-list to-l3 seq 10 permit 172.18.0.0/15 le 32
ip prefix-list to-l3 seq 20 permit 172.16.64.0/24 le 32
ip prefix-list to-l3 seq 30 permit 172.16.65.0/24 le 32
ip prefix-list to-l3 seq 40 permit 172.16.0.0/12 ge 32
ip prefix-list to-l3 seq 99 deny any
ip prefix-list to-vlan2 seq 10 permit 172.18.0.0/15 le 32
ip prefix-list to-vlan2 seq 20 permit 172.16.64.0/24 le 32
ip prefix-list to-vlan2 seq 30 permit 172.16.65.0/24 le 32
ip prefix-list to-vlan2 seq 40 permit 172.16.0.0/12 ge 32
ip prefix-list to-vlan2 seq 99 deny any
!
route-map KFRR-IMPORT permit 1
description KFRR IMPORT
match ip address prefix-list KFRR-ANY
!
route-map KFRR-EXPORT permit 1
description KFRR EXPORT
match ip address prefix-list KFRR-ANY
!
line vty
!
Please note that the whole FRR config includes optional Kubernetes bits.
Connecting ‘legacy-old’ servers to the RRs
Each server connected to the old big flat /12 network needs to peer with the RRs first, in order to be able to connect to the new L3-BGP-only servers. When an iBGP peering is established, the route reflector advertises NLRIs to its client peers without modifying attributes - the aim is to avoid routing loops. We have had a smooth experience with the BIRD routing daemon - its configuration file is human-readable and can easily be automated using tools such as Puppet or Ansible.
Let’s consider 2 different servers:
host b101 (172.16.2.101) is connected to the ‘old’ network with BIRD
host b160 (172.16.2.160) is connected to the ‘new’ L3-BGP network via SONiC-NOS-based LEAF#1 and LEAF#2 running FRR.
Let’s see how b101 sees b160.
# cat /etc/bird/bird.conf
router id 172.16.2.101;
log syslog all;
protocol kernel k4 {
scan time 60;
merge paths 32;
ipv4 {
import none;
export filter {
if source = RTS_BGP then accept;
reject;
};
};
}
protocol device {
scan time 60;
}
template bgp FRR {
local as 65108;
# Timer defaults similar to Cumulus Linux
hold time 9;
connect retry time 10;
error wait time 2, 16;
error forget time 16;
enable route refresh on;
long lived stale time 10;
direct;
ipv4 {
add paths on;
import keep filtered on;
import filter {
if net ~ [ 172.18.0.0/15+ , 172.16.64.0/24+ , 172.16.65.0/24+ , 172.16.0.0/12{32,32} ] then {
bgp_community.delete([(*,*)]);
accept;
}
reject;
};
export none;
};
}
# frr1.ams.creativecdn.net
protocol bgp frr1 from FRR { neighbor 172.16.63.1 as 65108; }
# frr2.ams.creativecdn.net
protocol bgp frr2 from FRR { neighbor 172.16.63.2 as 65108; }
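After editing bird.conf, the configuration can be validated and hot-reloaded without dropping established sessions (standard birdc commands):
# Parse-only dry run, then apply the new configuration in place
birdc configure check
birdc configure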
(ams)root@b101:~# birdc show protocol
BIRD 2.0.7 ready.
Name Proto Table State Since Info
k4 Kernel master4 up 2021-09-12
device1 Device --- up 2021-09-12
frr1 BGP --- up 2021-09-12 Established
frr2 BGP --- up 2021-09-12 Established
b101# birdc show route count
BIRD 2.0.7 ready.
722 of 722 routes for 165 networks in table master4
0 of 0 routes for 0 networks in table master6
Total: 722 of 722 routes for 165 networks in 2 tables
As you can see, each ‘old’ server has 2 iBGP sessions and receives 722 NLRIs (prefixes + attributes).
These 722 prefixes are predominantly /32 IPv4 routes - each ‘new’ L3-BGP-based server announces exactly one IPv4 address towards the RRs.
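The per-server announcement config is not shown in this article; a minimal sketch of how a ‘new’ host like b160 could announce its single /32 with FRR might look like this (hypothetical commands, assuming the address lives on a loopback):
# Hypothetical sketch - put the /32 on a loopback so BGP can announce it
ip addr add 172.16.2.160/32 dev lo
vtysh -c 'configure terminal' \
      -c 'router bgp 65108' \
      -c 'address-family ipv4 unicast' \
      -c 'network 172.16.2.160/32'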
The BGP table (also known as the BGP topology table or BGP RIB) on the ‘old’ servers looks like this:
b101:~# birdc show route all for 172.16.2.160
The FIB (Forwarding Information Base) on the ‘old’ servers looks like this:
b101:~# ip route show 172.16.2.160
172.16.2.160 proto bird metric 32
nexthop via 172.17.0.28 dev bond0 weight 1
nexthop via 172.17.1.28 dev bond0 weight 1
Next-hop 172.17.0.28 is de facto a Vlan2 interface configured on SONiC-based LEAF#1, while
next-hop 172.17.1.28 is de facto a Vlan2 interface configured on SONiC-based LEAF#2.
The network path from b101 (server) to b160 (server), and the other way round, goes via the SONiC LEAFs:
b101:~# mtr -rwnc 10 172.16.2.160
SONiC-enabled LEAFs (LESWs) are configured as follows.
Each leaf switch is connected directly, via an LACP port channel over multiple 100G interfaces, to the spine switch with the same number, e.g. LESW#1 to SPSW#1 and LESW#2 to SPSW#2. A VLAN interface is assigned to the PortChannel0001 interface:
LESW1# show vlan brief
+-----------+-----------------+-----------------+----------------+-----------------------+
| VLAN ID | IP Address | Ports | Port Tagging | DHCP Helper Address |
+===========+=================+=================+================+=======================+
| 2 | 172.17.0.28/12 | PortChannel0001 | tagged | |
+-----------+-----------------+-----------------+----------------+-----------------------+
LESW# SONiC config
###
## Uplink towards SPINE
###
config portchannel add PortChannel0001
# every port towards the SPINE joins the LACP bundle
for E in "${SPINE1x100[@]}"
do
config portchannel member add PortChannel0001 Ethernet${E} # fec done in base config
done
###
## Vrf2 - prod_vlan_2
###
config vrf add Vrf2
config vlan add 2
config interface vrf bind Vlan2 Vrf2
config interface ip add Vlan2 ${IP[Vlan2]}
# RTB House tiny hack for arp refreshing is needed on production vlan
sysctl -w net.ipv4.conf.Vlan2.arp_accept=1
#
config vlan member add 2 PortChannel0001
On the other side of the PortChannel (SPSW) we map VXLAN to VLAN (Cumulus Linux NOS config).
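The Cumulus snippet itself is not reproduced here; as an illustrative sketch only (VNI and interface name hypothetical), an NCLU mapping of a VXLAN VNI into VLAN 2 would look roughly like this:
# Hypothetical NCLU sketch - bridge a VXLAN VNI into VLAN 2 on the SPSW
net add vxlan vni10002 vxlan id 10002
net add vxlan vni10002 bridge access 2
net pending
net commit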
As you can see, this is the demarcation point between the old flat L2 EVPN-based network and the new L3-BGP-only hosts.
The amount of traffic is constantly growing and our setup allows scaling up. The interconnect between each spine-leaf pair consists of 16 x 100 Gbit/s ports, giving 3.2 Tbit/s of total (full-duplex) bandwidth - and please be aware that we can scale up further as we grow!
Currently our most bandwidth-hungry DB cluster uses around 600 Gbit/s at its peak and is still growing! Each server in this cluster flawlessly serves over 70 Gbit/s of TCP traffic!
show int portchannel
Flags: A - active, I - inactive, Up - up, Dw - Down, N/A - not available,
S - selected, D - deselected, * - not synced
No. Team Dev Protocol Ports
----- --------------- ----------- -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
0001 PortChannel0001 LACP(A)(Up) Ethernet172(S) Ethernet180(S) Ethernet188(S) Ethernet200(S) Ethernet152(S) Ethernet148(S) Ethernet164(S) Ethernet144(S) Ethernet184(S) Ethernet160(S) Ethernet176(S) Ethernet156(S) Ethernet192(S) Ethernet196(S) Ethernet168(S) Ethernet204(S)
We connect spine and leaf with multiple 100G ports - currently each leaf is connected via 16 x 100G ports, but the sky is the limit, or de facto the number of switch ports!
Interconnection can be made at any level - e.g. leaf to leaf, leaf to spine, or leaf to super-spine - so the L2/L3 boundary can sit anywhere.
There is no bottleneck thanks to the CLOS spine-leaf architecture. A CLOS/spine-leaf or “fat tree” architecture features multiple connections between interconnection switches (spine switches) and access switches (leaf switches) to support high-performance compute clustering. In addition to flattening and scaling out Layer 2 networks at the edge, it creates a non-blocking, low-latency fabric.
ECMP
Equal-cost multi-path routing (ECMP) is a routing feature where, instead of the traditional single next-hop per prefix, multiple next-hops are used at the same time.
Packets towards a single destination are forwarded over multiple “best paths” in parallel.
ECMP relies on the routing protocol's shortest-path computation (a modified Dijkstra in link-state protocols; BGP best-path selection here) to find the equal-cost paths, and uses modulo-n hashing to select the delivery path among them.
To prevent out-of-order packets, ECMP hashing is done per flow: all packets with the same source and destination IP addresses and the same source and destination ports always hash to the same next-hop, so traffic is spread uniformly across the next-hops (a load-balancing effect).
The hash is computed over the 5-tuple:
source IP address
destination IP address
protocol
source port
destination port
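As a toy illustration of modulo-n per-flow hashing (not the real ASIC hash function), the following shell sketch always maps the same 5-tuple to the same next-hop:
#!/bin/bash
# Toy modulo-n hashing: same 5-tuple -> same next-hop, flows spread across links
NEXTHOPS=(172.17.0.28 172.17.1.28)
FLOW="172.16.2.101 172.16.2.160 tcp 43211 443"   # src dst proto sport dport
H=$(( 0x$(printf '%s' "$FLOW" | md5sum | cut -c1-8) ))
echo "flow -> next-hop ${NEXTHOPS[H % ${#NEXTHOPS[@]}]}"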
You can read more about hashing in an excellent Broadcom document.
Please note that BGP offers an extension called ADD-PATH, described in RFC 7911 - don't confuse the two: ECMP and ADD-PATH are different things!
RTB House combines ADD-PATH and ECMP at the very same time - these two features help us overcome inter-DC network bottlenecks!
Below is an example output for one IP (/32 prefix) that is fully reachable over six next-hops:
172.16.64.31 proto bgp metric 20
nexthop via 172.16.2.15 dev Vlan2 weight 1
nexthop via 172.16.48.151 dev Vlan2 weight 1
nexthop via 172.16.48.152 dev Vlan2 weight 1
nexthop via 172.16.48.153 dev Vlan2 weight 1
nexthop via 172.16.63.101 dev Vlan2 weight 1
nexthop via 172.16.63.102 dev Vlan2 weight 1
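Note that by default the Linux kernel hashes IPv4 multipath routes on L3 fields only; including the L4 ports in the hash requires a sysctl (kernel 4.12+). The article does not state whether this is set in the described setup, so treat it as an assumption:
# Enable 5-tuple (L3+L4) hashing for IPv4 multipath routes
sysctl -w net.ipv4.fib_multipath_hash_policy=1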
BGP session status with the route reflectors:
# show ip bgp vrf Vrf2 summary
IPv4 Unicast Summary:
BGP router identifier 172.17.1.6, local AS number 65108 vrf-id 280
BGP table version 6432
RIB entries 491, using 88 KiB of memory
Peers 54, using 38 MiB of memory
Peer groups 2, using 128 bytes of memory
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd PfxSnt Desc
frr1.ams.creativecdn.net(172.16.63.1) 4 65108 2619247 2613784 0 0 0 03w1d01h 462 52 frr1.ams.creativecdn
frr2.ams.creativecdn.net(172.16.63.2) 4 65108 877102 872909 0 0 0 04w2d05h 462 52 frr2.ams.creativecdn
FRR configuration snippet related to the route-reflector connection through the spine PortChannel:
LESW1# show run bgp
router bgp 65108 vrf Vrf2
bgp router-id 172.17.1.6
neighbor PG-VRF2-SPINE peer-group
neighbor PG-VRF2-SPINE remote-as internal
neighbor PG-VRF2-SPINE description Spine uplink
neighbor PG-VRF2-SRV peer-group
neighbor PG-VRF2-SRV remote-as internal
neighbor PG-VRF2-SRV description Servers downlink
neighbor PG-VRF2-SRV bfd
neighbor PG-VRF2-SRV capability extended-nexthop
neighbor 172.16.63.1 peer-group PG-VRF2-SPINE
neighbor 172.16.63.1 description frr1.ams.creativecdn.net
neighbor 172.16.63.2 peer-group PG-VRF2-SPINE
neighbor 172.16.63.2 description frr2.ams.creativecdn.net
bgp listen range fc00:0:302::/48 peer-group PG-VRF2-SRV
!
address-family ipv4 unicast
network 172.16.0.0/12
neighbor PG-VRF2-SPINE soft-reconfiguration inbound
neighbor PG-VRF2-SPINE route-map RMP-VRF2-SPINE-IMPORT in
neighbor PG-VRF2-SPINE route-map RMP-VRF2-SPINE-EXPORT out
neighbor PG-VRF2-SRV route-reflector-client
neighbor PG-VRF2-SRV soft-reconfiguration inbound
neighbor PG-VRF2-SRV route-map RMP-VRF2-SRV-IMPORT in
neighbor PG-VRF2-SRV route-map RMP-VRF2-SRV-EXPORT out
exit-address-family
!
ip prefix-list default-only seq 10 permit 0.0.0.0/0
ip prefix-list default-only seq 1000 deny any
ip prefix-list PXL-VRF2-FROM-SRV seq 10 permit 172.16.0.0/12 ge 32
ip prefix-list PXL-VRF2-FROM-SRV seq 1000 deny any
ip prefix-list PXL-VRF2-TO-SRV seq 1000 deny any
!
route-map RMP-VRF2-SRV-EXPORT permit 10
match ip address prefix-list default-only
!
route-map RMP-VRF2-SRV-EXPORT permit 20
match ip address prefix-list PXL-VRF2-TO-SRV
!
route-map RMP-VRF2-SRV-EXPORT deny 1000
!
route-map RMP-VRF2-SRV-IMPORT permit 10
match ip address prefix-list PXL-VRF2-FROM-SRV
set tag 2
set weight 10
!
route-map RMP-VRF2-SRV-IMPORT deny 1000
!
route-map RMP-VRF2-SPINE-EXPORT permit 10
match tag 2
!
route-map RMP-VRF2-SPINE-EXPORT deny 1000
!
route-map RMP-VRF2-SPINE-IMPORT permit 10
match ip address prefix-list default-only
!
route-map RMP-VRF2-SPINE-IMPORT permit 20
match ip address prefix-list PXL-VRF2-FROM-SRV
!
route-map RMP-VRF2-SPINE-IMPORT deny 1000
!
ip nht resolve-via-default
!
line vty
!
end
BGP neighbor output for a route reflector, from the SONiC point of view:
lesw# show ip bgp vrf Vrf2 neighbors 172.16.63.1
BGP neighbor is 172.16.63.1, remote AS 65108, local AS 65108, internal link
Description: frr1.ams.creativecdn.net
Hostname: frr1.ams.creativecdn.net
Member of peer-group PG-VRF2-SPINE for session parameters
BGP version 4, remote router ID 172.16.63.1, local router ID 172.17.1.6
BGP state = Established, up for 03w1d01h
Last read 00:00:00, Last write 00:00:00
Hold time is 3, keepalive interval is 1 seconds
Neighbor capabilities:
4 Byte AS: advertised and received
Extended Message: advertised and received
AddPath:
IPv4 Unicast: TX received
IPv4 Unicast: RX advertised IPv4 Unicast and received
Route refresh: advertised and received(old & new)
Enhanced Route Refresh: advertised and received
Address Family IPv4 Unicast: advertised and received
Hostname Capability: advertised (name: lesw,domain name: n/a) received (name: frr1.ams.creativecdn.net,domain name: n/a)
Graceful Restart Capability: advertised and received
Remote Restart timer is 120 seconds
Address families by peer:
none
Graceful restart information:
End-of-RIB send: IPv4 Unicast
End-of-RIB received: IPv4 Unicast
Local GR Mode: Helper*
Remote GR Mode: Helper
R bit: True
Timers:
Configured Restart Time(sec): 120
Received Restart Time(sec): 120
IPv4 Unicast:
F bit: False
End-of-RIB sent: Yes
End-of-RIB sent after update: No
End-of-RIB received: Yes
Timers:
Configured Stale Path Time(sec): 360
Message statistics:
Inq depth is 0
Outq depth is 0
Sent Rcvd
Opens: 3 3
Notifications: 2 2
Updates: 2546 7152
Keepalives: 2612464 2613325
Route Refresh: 0 0
Capability: 0 0
Total: 2615015 2620482
Minimum time between advertisement runs is 0 seconds
For address family: IPv4 Unicast
PG-VRF2-SPINE peer-group member
Update group 3, subgroup 4
Packet Queue length 0
Inbound soft reconfiguration allowed
Community attribute sent to this neighbor(all)
Inbound path policy configured
Outbound path policy configured
Route map for incoming advertisements is *RMP-VRF2-SPINE-IMPORT
Route map for outgoing advertisements is *RMP-VRF2-SPINE-EXPORT
462 accepted prefixes
Connections established 3; dropped 2
Last reset 03w1d01h, Notification received (Cease/Peer Unconfigured)
Local host: 172.17.1.6, Local port: 51770
Foreign host: 172.16.63.1, Foreign port: 179
Nexthop: 172.17.1.6
Nexthop global: fe80::923c:b3ff:fec5:8d86
Nexthop local: fe80::923c:b3ff:fec5:8d86
BGP connection: shared network
BGP Connect Retry Timer in Seconds: 10
Estimated round trip time: 1 ms
Read thread: on Write thread: on FD used: 30
ARP refresher
Due to an unremedied SONiC internal bug, we implemented a workaround bash script called arp-refresher.sh, which is started every 2 minutes in the background via crontab.
A stale ARP entry is not refreshed by SONiC - the MAC address takes a hike from the bridge (PortChannel) - and then servers in the “old” network have issues connecting to the L3-BGP-only servers:
root@lesw1:~# show arp | grep 172.16.21.116
172.16.21.116 0c:c4:7a:ea:7e:6e - 2
root@lesw1:~# show mac | grep -i 0c:c4:7a:ea:7e:6e
(no output - the MAC address is already gone from the bridge table)
As you can see from the syslog, the ARP refresher is doing its job properly:
lesw1:~# tail -f /var/log/syslog | grep arp
#!/bin/bash
# for every aging ('old') ARP entry, refresh it with arping
export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/bin/X11
exec &> >(while read line; do logger -t "$0" -i -- "$line"; done)
for VRF in $(ip -br link show type vrf | cut -f 1 -d ' ')
do
ip -s -4 n show vrf "$VRF" | grep -e STALE -e REACHABLE | while read LINE
do
# if several ARP refreshes fail, the entry should be removed;
# otherwise the IP is blocked and cannot be moved to the routing table
if echo "$LINE" | grep -q -P 'STALE$'
then
IP=$(echo "$LINE" | cut -f 1 -d ' ')
DEV=$(echo "$LINE" | cut -f 3 -d ' ')
echo "STALE($IP): ip -4 neigh del $IP dev $DEV"
ip -4 neigh del "$IP" dev "$DEV"
continue
fi
# https://git.kernel.org/pub/scm/network/iproute2/iproute2.git/tree/ip/ipneigh.c#n213
# https://unix.stackexchange.com/questions/473919/what-is-the-fifth-coloum-in-the-output-of-ip-stat-neighbour-show-stand-for/474006
AGE_MIN=$( echo "$LINE" | perl -ne ' /used (\d+)\/(\d+)\/(\d+) probes/ ;
my $min=$1;
if ($2 < $min ) { $min = $2 }
# if ($3 < $min ) { $min = $3 } # this number keeps STALE ( entry was last updated )
print $min;' )
# we set base_reachable_time_ms to 400000 (default: 1800000; Cumulus Linux: 1080000)
if [[ $AGE_MIN -ge $(( 400000 / 1000 )) ]]
then
IP=$(echo "$LINE" | cut -f 1 -d ' ')
echo "AGE($AGE_MIN): ip vrf exec $VRF arping -c 1 -w 0.1 $IP"
ip vrf exec "$VRF" arping -q -c 1 -w 0.1 "$IP"
fi
done
done
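The crontab entry itself is not shown in the article; one matching the described 2-minute schedule could look like this (script path hypothetical):
# Hypothetical /etc/cron.d/arp-refresher entry - run the workaround every 2 minutes
*/2 * * * * root /usr/local/sbin/arp-refresher.sh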
This interconnection setup has been working fine for over 2 months. We have critical services deployed on these servers that are sensitive to any disruption or latency spikes in the network. After implementing the ARP refresher, SONiC works well in the setup described above on Broadcom Tomahawk, Tomahawk II and Trident III ASICs.