FRnOG 40 - Pim van Pelt : VPP: A 100Gbps/100Mpps+ BGP/OSPF router with a single IPv4 address
Transcription
00:00Je m'appelle Pim, je suis néerlandais [My name is Pim, I'm Dutch], and I don't speak French.
00:12So that was the introduction.
00:13I have 20 minutes, I need to go through this really quickly.
00:15I sound very American, but honestly I'm not.
00:18I'm Dutch.
00:19I started off in the Netherlands in the 90s.
00:21RIPE 34 was the first one that I went to.
00:24I actually stopped going to RIPE for a while
00:26and then came back a couple of years ago
00:28and they're still rolling out IPv6 "next year".
00:30It's kind of funny.
00:31Okay, so I incorporated ipng.ch in 2021
00:34in the pandemic because I was very bored.
00:37We are a tiny developer of software routers
00:40based on DPDK and an open source thing called VPP.
00:44The town that we work in is called Brüttisellen.
00:47Nobody knows this town, even the Swiss don't,
00:50so don't worry about it.
00:51But the other thing that we run is a ring throughout Europe
00:55from Zurich up north to Frankfurt and Amsterdam
00:58and then the north of France, which we would call Rijsel,
01:01but it's better for me to call it Lille
01:03because that allows me to peer at the FLAP.
01:05For me, the L is then Lille.
01:07With 2,200 or so adjacencies by now.
01:09And I have a vanity four-digit AS number
01:11from back in the days of running SixXS,
01:14an IPv6 tunnel broker.
01:16Quick intro on VPP.
01:18It is an open source data plane.
01:19It's like super, super quick.
01:21It runs in user space.
01:22It can provide networking, layer two and layer three,
01:25all types of services based off of DPDK, RDMA,
01:28VirtIO, VMXNET3, and other drivers.
01:30Easily exceeding 100 million packets per second
01:33on a commodity PC.
01:34Easily doing 100 gigs.
01:36In fact, a terabit has been done with this thing
01:38and it just runs on off-the-shelf hardware.
01:40You can run it on a Dell or an HP or what have you.
01:43I started working on this thing in 2021.
01:46In 2022, I presented on my contributions
01:49to the Linux Control Plane plugin,
01:50which allows us to run things like BGP, OSPF, VRRP,
01:53well, that type of stuff.
01:55This talk is about some changes I made to VPP,
01:58as well as to BIRD, an open source routing daemon,
02:01that allow me to run routers in the DFZ
02:04with one IPv4 address.
02:08First off, quick intro on VPP.
02:10In the Linux control plane,
02:12there is a tool called vppctl, or "VPP cuddle",
02:15and it allows us to take a dataplane interface,
02:17in this case HundredGigabitEthernet4/0/0,
02:20and turn it into a Linux interface called ice0.
02:23That thing shows up and you can manipulate it
02:25just like you would any other interface.
02:27Give it an MTU and some IP addresses.
02:29Maybe create a sub-interface called IPNG
02:31and it's tagged with tag 101.
02:34Give it some addresses and some defaults
02:36and then you're off to the races to ping frnog.org.
02:40By the way, if you run this, it still doesn't do IPv6
02:43so I needed to use DNS64 and NAT64 on this one.
02:47But it does ping.
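As a rough sketch of what that sequence looks like on the command line (the interface names, addresses and VLAN tag here are illustrative, not the exact ones from the slide):

    # Expose a VPP dataplane interface into Linux via the linux-cp plugin
    vppctl lcp create HundredGigabitEthernet4/0/0 host-if ice0

    # From here on it is managed like any other Linux interface
    ip link set ice0 mtu 9000 up
    ip addr add 192.0.2.1/30 dev ice0
    ip addr add 2001:db8::1/64 dev ice0

    # A tagged sub-interface, assuming linux-cp is set up to mirror
    # Linux netlink changes back into the dataplane
    ip link add link ice0 name ipng type vlan id 101
    ip link set ipng up
    ip addr add 192.0.2.5/30 dev ipng
    ip route add default via 192.0.2.6

    ping frnog.org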
02:49So the talk is in three pieces.
02:51Act one is to get OSPFv3 running with VPP for IPv4.
02:59Most people will use this with IPv6.
03:02There is an RFC called 5838.
03:05It's an absolutely terrible RFC.
03:09It does multi-address family routing with OSPF v3
03:14and it says, and I quote,
03:16although IPv6 link local addresses
03:18could be used as next hops for IPv4,
03:21it then completely demolishes all probability
03:24of you ever being able to use them
03:26because what happens here is they take the IPv4 address
03:29and stick it in the IPv6 field in the bottom 32 bits
03:34and zero out the rest of the bits
03:36and that is the way it is supposed to work.
03:38So this fundamentally breaks any opportunity
03:41for us to use IPv6 next hops.
03:43So they should never have said could be used.
03:45They should have said cannot ever be used.
03:47Thank you, IETF.
03:49But a clever solution from Ondřej of the BIRD team
03:53in the commit that I linked there
03:55adds a function called update loopback address
03:58which scans all IPv4 interfaces
04:00looking for one that has an address
04:02that we might be able to use.
04:04Starting with host interfaces slash 32
04:06and otherwise OSPF stub interfaces
04:08and otherwise any old IPv4 address
04:10and it uses that to put it in the link LSA
04:13to announce sort of the next hop
04:15with that IPv6 field carrying an IPv4 address.
04:18Then all routes learned will be on link
04:21and I'll get to that in a second
04:23from any neighbor that we might find on this link.
04:26So my first attempt, and there's a checklist here
04:29of my successes and failures along the way,
04:31is, in VPP, to create a loopback interface,
04:34loop0, with v4 and v6 addresses,
04:36and then not create any addresses
04:39on the interfaces GigabitEthernet10/0/0 and 10/0/1.
04:44So they show up with only an IPv6 link local.
04:47I'll turn on BFD because I wish to share
04:50the BFD session between v4 and v6.
04:52That makes sense to me.
04:54The OSPF configuration is actually quite simple.
04:57The top one here in light blue,
04:59which is where I want to draw your attention on these slides,
05:02is an OSPFv3 instance called ospf4.
05:04This thing has a channel called ipv4,
05:07which makes it invoke this RFC 5838 behaviour.
05:10Obviously ospf6 is all as it was before.
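In BIRD 2 syntax, the configuration described here looks roughly like this (interface names, costs and filters are illustrative):

    protocol ospf v3 ospf4 {
        ipv4 { import all; export none; };   # the ipv4 channel is what triggers RFC 5838 multi-AF mode
        area 0 {
            interface "loop0" { stub yes; };
            interface "e0", "e1" { type ptp; cost 15; bfd on; };
        };
    }

    protocol ospf v3 ospf6 {
        ipv6 { import all; export none; };
        area 0 {
            interface "loop0" { stub yes; };
            interface "e0", "e1" { type ptp; cost 15; bfd on; };
        };
    }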
05:14So this thing works and it turns up adjacencies.
05:17We can see two BFD sessions here,
05:19one eastbound on e0 and one westbound on e1
05:22and the cool thing is now we have an OSPF 4
05:25which has router IDs in the IPv4
05:29realm that were learned on link local nexthops.
05:32That's kind of nice, that's the blue stuff there.
05:35So adjacencies are formed and routes are learned,
05:38and so we learn a route, for example to 10.3
05:43via 10.2 on the interface e1, marked onlink,
05:48and that's that blue thing here.
05:50All these routes just look normal,
05:52except their nexthops are v4 /32s, onlink.
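The kind of thing to check at that point, with the standard birdc commands (output omitted):

    birdc show bfd sessions
    birdc show ospf neighbors ospf4    # adjacencies learned over IPv6 link locals
    birdc show ospf neighbors ospf6
    birdc show route protocol ospf4    # IPv4 routes with /32 onlink nexthops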
05:56But it didn't work.
05:58So when I ping, VPP would not forward this traffic
06:01because VPP claims that these interfaces
06:04have not enabled IPv4, because they only have
06:07an IPv6 link local, and so they drop all the packets.
06:10So my first attempt was just to force-enable IPv4
06:13on all these interfaces, and actually that works,
06:16except none of the intermediary routers
06:19will respond with ICMPv4: a broken traceroute is just aesthetic,
06:23but broken Path MTU discovery is a little bit more insidious.
06:25It just wouldn't work at all, because they didn't have
06:28an IPv4 address to answer from.
06:30So my second attempt was to use a feature in VPP
06:33it's also common in hardware routers
06:36to unnumber an interface and borrow from another interface
06:39and that works.
06:41Now e0 and e1 share the 10.1 IP address from loop 0
06:46but forwarding doesn't work anymore
06:49because I'm now trying to ARP from 10.1 to 10.2,
06:52my neighbour, say, and up until now
06:55VPP would not respond to ARP
06:58from onlink network peers.
07:01So if you're in a /29 it would respond for every IP address there,
07:04but if you're on a /32 it would just drop all the ARP packets
07:07and never respond.
07:09So my final attempt, which was also committed
07:12and merged upstream, is to inhibit the sync of these
07:15unnumbered IP addresses into Linux,
07:18so in the light blue here, e0 still only has link local addresses,
07:21by disabling LCP sync of unnumbered interfaces,
07:24and to fix this ARP issue
07:27by forcing VPP to just respond to these
07:30onlink ARP requests.
07:33It's quite normal; all the other vendors do this as well.
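Put together, the per-router end state looks roughly like this in vppctl (names and addresses are illustrative; the exact spelling of the sync-unnumbered knob may differ in your linux-cp build):

    vppctl create loopback interface instance 0
    vppctl set interface state loop0 up
    vppctl set interface ip address loop0 192.0.2.1/32
    vppctl set interface ip address loop0 2001:db8::1/128

    # Borrow the loopback addresses on the point-to-point links...
    vppctl set interface unnumbered GigabitEthernet10/0/0 use loop0
    vppctl set interface unnumbered GigabitEthernet10/0/1 use loop0

    # ...but do not copy the borrowed /32 and /128 into the Linux interfaces
    vppctl lcp lcp-sync-unnumbered off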
07:36So with that, my traceroute at MTU 9000
07:39now responds in IPv4 using the loopback addresses,
07:42so you can see this traceroute here
07:45for 10.0.0.1, .2 and .3 works just fine.
07:48Alright then I decided to roll it out
07:51AS8298 has like 14 or so routers
07:54and something like 27 or so point to point networks
07:57so about a fifth or so of a /24
08:00is tied up in rather useless
08:03transit networks.
08:06So the start situation is: every router has a loopback,
08:09a /32 and a /128, and then a bunch of links to its peer routers,
08:12/31s for IPv4 and /112s
08:15for IPv6. So I have to upgrade
08:18BIRD first, to a version that has Ondřej's change in it,
08:21also upgrade my VPP data plane, which is obviously intrusive,
08:24and I'll make use of this moment to rename the
08:27ospf4 protocol, which was OSPFv2,
08:30to a thing called ospf4_old, and I'll create an empty
08:33ospf4 that is now OSPFv3.
08:36I'm going to say v3, v4 and v6 a lot, by the way.
08:39Then I'll move interfaces one by one from the old v2
08:42to the new v3 ospf4,
08:45and then finally ospf4_old will be empty and I can delete it,
08:48and in the end every router will have exactly one
08:51IP address for IPv4 and one for IPv6, on loopback,
08:54which it will share on all interfaces.
08:57So the upgrade here, I actually grabbed this from my bash history:
09:00first I'll raise the OSPF cost by prepending
09:0310 to it, typically making it go from 15 to, like,
09:061015, which drains all the links into this router;
09:09then I will rename the protocol ospf4
09:12by appending _old; I'll download the packages
09:15for VPP as well as for the BIRD that I built;
09:18and then I enter the dataplane namespace and,
09:21that pkill I think is really cool, kill VPP,
09:24stop VPP and BIRD, dpkg-upgrade
09:27the VPP packages, upgrade the BIRD packages as well,
09:30and restart the services.
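A rough reconstruction of that bash history (package file names and the namespace name are illustrative; the cost change and the ospf4 to ospf4_old rename happen in bird.conf first):

    # drain the router: raise OSPF costs (e.g. 15 -> 1015), rename ospf4 to ospf4_old
    birdc configure

    # enter the dataplane network namespace, then:
    ip netns exec dataplane bash
    pkill vpp                        # make sure the old dataplane is really gone
    systemctl stop vpp bird
    dpkg -i vpp_*.deb vpp-plugin-*.deb bird2_*.deb   # the freshly built packages
    systemctl start vpp bird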
09:33I don't know about all of you, but typically these things on an ASR 9K
09:36or a Juniper take, like, forever to upgrade,
09:39whereas here it upgraded in 92 seconds,
09:42so a minute and a half later this machine was back up, fully converged
09:45in the DFZ and forwarding traffic,
09:48and ospf4_old
09:51was still carrying these two adjacencies,
09:54with xe1-1 and xe1-0.304,
09:57still there, as per normal.
10:01So what I do next, in step 2, is I remove the addressing
10:04from the interface and I make it unnumbered.
10:07I also take the chance to rename this thing, because now that
10:10I don't see IP addresses on it anymore I don't really know what goes where,
10:13and so I'll use a Linux feature to just rename the interface,
10:16from xe1-1 to ddln1
10:19in this case, and I'll make it borrow its addresses from loop0.
10:22So I can plan this,
10:25which shows the API calls that will be made on the dataplane,
10:28and then apply it, after which the interface
10:31no longer has IPv4 or IPv6 addresses.
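In raw commands, that step looks something like the following (he drives it from a declarative config with a plan/apply step, so the real invocations differ; names and the /31 are illustrative):

    # remove the /31 from the dataplane interface and borrow from loop0 instead
    vppctl set interface ip address del GigabitEthernet10/0/0 192.0.2.0/31
    vppctl set interface unnumbered GigabitEthernet10/0/0 use loop0

    # rename the Linux side, since the address no longer tells you where it goes
    ip link set xe1-1 down
    ip link set xe1-1 name ddln1
    ip link set ddln1 up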
10:34What's left for me to do is move the interface
10:37from the old OSPF, where it was called xe1-1,
10:40into the new OSPF, which is OSPFv3,
10:43where it's called ddln1, and I'll use BFD.
10:46Quelle surprise, this converges.
10:49I have to do the other side as well, but ospf4
10:52now has a router ID
10:55ending in .163.6, learned
10:58on interface ddln1 from a link local,
11:01and I can see the route for that peer
11:04being its own IP address,
11:07.6, on the interface ddln1, onlink,
11:10so I can ARP for it and the other side will respond.
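On the BIRD side, moving a link from the old protocol to the new one is just shuffling the interface block and running birdc configure (a sketch; names and costs are illustrative):

    protocol ospf v2 ospf4_old {
        ipv4 { import all; export none; };
        area 0 {
            # interface "xe1-1" { type ptp; cost 15; bfd on; };   # removed from here...
        };
    }

    protocol ospf v3 ospf4 {
        ipv4 { import all; export none; };
        area 0 {
            interface "loop0" { stub yes; };
            interface "ddln1" { type ptp; cost 15; bfd on; };     # ...and added here
        };
    }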
11:13so from here on it's just rinse and repeat
11:16like do this for every interface with coffee in hand, start small
11:19go up on the ring, go to Amsterdam
11:22and then back down over Lille, Paris and Genève
11:25and then end up back in Zurich again
11:28after that the machine looks like this
11:31this is the example in Paris, the loopback there is in blue,
11:34and then I have a bunch of IP addresses that come from
11:37France-IX and from some other smaller things that I have there,
11:40and then two main interfaces called frggh0,
11:43which goes to Lille, and chplo0,
11:46which goes to Plan-les-Ouates in Geneva.
11:50trace routes look like this, quite nice
11:53first hop is of course my local VLAN entry point
11:56on .66 but from there on the backbone network
11:59all the routers use that exact one IP address for V4
12:02and one IP address for V6
12:05and this allows me to return 27 /31s and a whole bunch of
12:08/112s, and I don't know if everyone knows this,
12:11I hope you do, but you never needed globally routable IPv6
12:14addresses on these links anyway; OSPF is perfectly happy with link locals.
12:18so I have one more thing that I wanted to talk about
12:21people ask me all the time like why do you use this
12:24you can just use Linux or FreeBSD or OpenBSD
12:27which is true but VPP as I said in the beginning is really really fast
12:30so I took a 2016 Dell R730
12:33which I bought for 600 euros second hand
12:36and I racked it in this configuration
12:39I have three other machines
12:42the previous generation, Dell R720s,
12:46three of them, each having three dual-port NICs,
12:49so in total 18 ten-gigabit ports
12:52that go through a Mellanox switch
12:55by the way super cool, you can run Debian on them
12:58without any firmware issues and then down below
13:01I have this Dell R730, which has 24
13:04network ports, these quad-port
13:07X710s if you've seen them, the Intel cards.
13:10Three of these cards are on CPU NUMA 0
13:13and three of these are on CPU NUMA 1.
13:16All of this runs Debian, none of this has a binary blob,
13:19it's all fully open source
13:22so I'll take a tool called Cisco T-Rex which is an open source load tester
13:25and I have two methods here, method one is to use only one
13:28worker thread in VPP, it's a multi-threaded app
13:31if you add more threads you get more throughput
13:34but I'll limit it to one only and then I'll slam that with as much traffic
13:37as it's willing to forward and I'll measure how much that is
13:40up to 1,000 packets per second, maybe 10,000, 1,000,000,
13:4310,000,000, 100,000,000 packets per second, until that one CPU thread is saturated.
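Pinning VPP to a single worker is a startup.conf setting, something like this (core numbers are illustrative):

    cpu {
        main-core 0
        corelist-workers 1    # exactly one worker thread, for method one
    }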
13:46And then method number two is just RFC 2544:
13:49a linear ramp-up of traffic from 0 to 100% of line rate,
13:52and then see when the machine starts dropping
13:55more than one tenth of one percent.
13:58so number one is actually very easy to do
14:01T-Rex has a textual user interface
14:04for all of us non-GUI people and here I have an overview
14:07of what it looks like, at the top there, number one
14:10shows the interface types that I have, I have 4 times 10 gigs
14:13in this case, number two is how many packets I'm sending
14:16out of the load tester, number three would be how much
14:19I'm receiving back from the device under test
14:22in this case the VPP machine and these should be the same number
14:25otherwise it's dropping traffic and to make absolutely sure
14:28there's also packet and byte counters for all interfaces
14:31and what you see here is a load test doing
14:344 times 10 gig, 64-byte packets,
14:37the smallest we're allowed to send before they become runts,
14:40and that's 59.4 million packets per second in both directions,
14:43which is exactly 40 gigabits per second,
14:46and this shows me that L2 cross-connects,
14:49just Ethernet in to Ethernet out,
14:52which is a cheap thing for VPP to do,
14:55must be doing at least 14.88 million packets per second per core.
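For reference, 14.88 Mpps is simply 10GbE line rate at the minimum frame size: each 64-byte frame also occupies 8 bytes of preamble and 12 bytes of inter-frame gap on the wire, so

$$\frac{10\times10^{9}\ \text{bit/s}}{(64+8+12)\ \text{B}\times 8\ \text{bit/B}} \approx 14.88\ \text{Mpps per port},\qquad 4\times14.88 \approx 59.5\ \text{Mpps.}$$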
14:58method one results here
15:02if you look at the top left you'll see the L2 cross connect
15:05meaning an Ethernet packet in on one interface and out on another:
15:08at 1,000 packets per second
15:11it on average takes 991 CPU cycles
15:14to switch that packet through the data plane.
15:17this is by the way a really cool network software engineering
15:20question in an interview, how expensive is it really
15:23if I ramp up from 1000 to 1 million packets per second
15:26it's only 199 cycles, which is almost an order of magnitude better,
15:30and the reason is that we can now use the CPU instruction cache,
15:33the data cache, DDIO, all sorts of
15:36smart stuff in the hardware,
15:39to push as many packets as we can
15:42through the CPU.
15:45and in total when I ramp it up until the core is saturated
15:48it does 15.3 million packets per second
15:51on one CPU thread
15:54and reminder this machine has 44 CPU threads
15:57so this thing does roughly 600 million packets per second
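That headline number is just the single-thread result scaled out, assuming most of the 44 hardware threads are given to VPP as workers:

$$15.3\ \text{Mpps/thread}\times\sim\!40\ \text{worker threads}\approx 600\ \text{Mpps.}$$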
16:00MPLS is a little bit more expensive,
16:03it does 10.3 million; IPv4 does 11.1
16:06and IPv6 9.72 million packets per second per thread,
16:09and by the way this is with a full-table FIB loaded.
16:12so I've made claims in the past
16:15that this thing can easily do 100 gigabits
16:18so someone on Twitter called me out and said prove it
16:21and this is the proof of that in the condensed form
16:25but I'll do a load test with 18 interfaces
16:28all three of the load testers at the top
16:31fully sending as much traffic as they can down into the VPP machine
16:34and I'll start by proving the bandwidth
16:37so in this case I'm using large packets
16:40and if you see vendors say we do up to 20 gigabits
16:43it's typically because they do that with very large packets
16:46and I'll achieve 100 gig
16:49not a problem
16:53you don't even see the CPU time go from 0 to 0.1%
16:56so this is 14.7 million packets per second
16:59and 180 gigabits of throughput achieved
17:02Obviously this is really easy for VPP,
17:05because its cost is per packet, not per byte;
17:08also, if you have enough PCIe lanes
17:11and PCIe bandwidth, this thing just scales almost infinitely
17:14with CPU cores and PCIe.
17:17I have 24 unused CPU threads in the machine at this point
17:20and 6 unused NIC ports, merely because I didn't have more network cards
17:23to generate more load with,
17:26and this is proof that VPP scales linearly and easily forwards 100 gigs.
17:29but the harder one is can it also do small packets
17:32and I talked before about doing a 64 byte load test
17:35and here I chose 128 and I'll get back to that in a second
17:38but I'll ramp up 128 byte load test
17:41again to line rate
17:44achieving 100 million packets per second somewhere in the middle
17:47and then ending up at 165 million packets per second
17:50on an 11 or 12 year old Dell
17:53that cost me 600 bucks and draws maybe 110 watts,
17:56so most of our hardware does not do that these days
17:59and 165 million packets per second
18:02then turns into 150 or so gigs of traffic
18:05at 128 bytes each
18:08and again 24 CPU threads are doing nothing
18:11more than half the machine is completely left unused in this case
18:15One quick topic so that you don't ask questions about this
18:18these are 4 port network cards from Intel
18:21and the Intel chip that's behind it can only do
18:2435 to 36 million packets per second
18:27so if you have 4 ports you would be generating 60 million packets per second
18:30which is too much for the silicon to handle
18:33and what we see here is that we have to use an artificially larger packet size,
18:36otherwise we would saturate the silicon
18:39in the network card
18:42before VPP ever got to see the packets,
18:45but 128 bytes is line rate for this thing.
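The arithmetic behind that: at 64 bytes a quad-port card would be asked for roughly 4 × 14.88 ≈ 59.5 Mpps, well beyond the chip's ~36 Mpps, while at 128 bytes line rate per port drops enough that four ports stay within budget:

$$\frac{10\times10^{9}}{(128+8+12)\times 8} \approx 8.45\ \text{Mpps per port},\qquad 4\times 8.45 \approx 33.8\ \text{Mpps} < 36\ \text{Mpps.}$$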
18:48And by the way, in case you wanted to laugh at us software router people,
18:51I hear the Trio 6 only does line rate from 192 bytes as well.
18:54so with that I think I have depleted my time
18:57maybe there's a question or two
19:00Are there any questions?
19:03Hi
19:06it's super cool to see
19:09unnumbered interfaces
19:12and replying with loopbacks
19:15for the trace routes
19:18but that also obfuscates a few things
19:21like, I like to be able
19:24to see the data,
19:27and to be able to see
19:30which interfaces I am using
19:33when I do a trace route
19:36does that bother you sometimes
19:39when you are troubleshooting stuff
19:42and you can't see which interface you are entering
19:45and only have the loopback interfaces
19:48no not at all
19:51but there's a more sarcastic answer
19:54which is: that's all great, but as your network grows,
19:57when the network gets sufficiently large,
20:00you are just hemorrhaging IP addresses
20:03for no other reason than aesthetics.
20:06But you could use non-routable, private IP address space
20:09that you would only see in your reverse DNS,
20:12so you can identify the links,
20:15so people from the outside would see stars,
20:18but on the inside you would be able to see them.
20:21no you are absolutely right
20:24that whole thing is in a VPC
20:27that is not connected to the internet
20:30using 198.19
20:33and so there I get exactly what you wanted
20:36and these all pair up with the MPLS underlay
20:39and also most of these are in a ring
20:42so there's really only two choices
20:45and the external interfaces do have an IP address
20:48like at DE-CIX and France-IX.
20:51Thank you