FRnOG 40 - Pim van Pelt : VPP: A 100Gbps/100Mpps+ BGP/OSPF router with a single IPv4 address
Transcription
00:00Je m'appelle Pim, je suis néerlandais [my name is Pim, I'm Dutch], and I don't speak French.
00:12So that was the introduction.
00:13I have 20 minutes, I need to go through this really quickly.
00:15I sound very American, but honestly I'm not.
00:18I'm Dutch.
00:19I started off in the Netherlands in the 90s.
00:21RIPE 34 was the first one that I went to.
00:24I actually stopped going to RIPE for a while
00:26and then came back a couple of years ago
00:28and they're still rolling out IPv6 next year.
00:30It's kind of funny.
00:31Okay, so I incorporated ipng.ch in 2021
00:34in the pandemic because I was very bored.
00:37We are a tiny developer of software routers
00:40based on DPDK and an open source thing called VPP.
00:44The town that we work in is called Brüttisellen.
00:47Nobody knows this town, even the Swiss don't,
00:50so don't worry about it.
00:51But the other thing that we run is a ring throughout Europe
00:55from Zurich up north to Frankfurt and Amsterdam
00:58and then the north of France, which we would call Rijsel,
01:01but it's better for me to call it Lille
01:03because that allows me to peer on the FLAP
01:05(Frankfurt, London, Amsterdam, Paris); for me, the L is then Lille.
01:07With 2,200 or so adjacencies by now
01:09and I have a vanity four-digit AS number
01:11from back in the days of running SixXS,
01:14an IPv6 tunnel broker.
01:16Quick intro on VPP.
01:18It is an open source data plane.
01:19It's like super, super quick.
01:21It runs in user space.
01:22It can provide networking, layer two and layer three,
01:25all types of services based off of DPDK, RDMA,
01:28VirtIO, VMXNet, other things.
01:30Easily exceeding 100 million packets per second
01:33on a commodity PC.
01:34Easily doing 100 gigs.
01:36In fact, a terabit has been done with this thing
01:38and it just runs on open hardware.
01:40You can run it on a Dell or an HP or what have you.
01:43I started working on this thing in 2021.
01:46In 2022, I presented on my contributions
01:49to the Linux control plane
01:50that allows us to run things like BGP, OSPF, VRRP,
01:53well, that type of stuff.
01:55This talk talks about some changes I made to VPP
01:58and as well to BIRD, an open source routing platform
02:01that allows me to run routers in the DFZ
02:04with one IPv4 address.
02:08First off, quick intro on VPP.
02:10In the Linux control plane,
02:12there is a tool called vppctl (pronounced "VPP cuddle")
02:15and it allows us to take a data plane interface,
02:17in this case HundredGigabitEthernet4/0/0,
02:20and turn it into a Linux interface called ice0.
02:23That thing shows up and you can manipulate it
02:25just like you would any other interface.
02:27Give it an MTU and some IP addresses.
02:29Maybe create a sub-interface called IPNG
02:31and it's tagged with tag 101.
02:34Give it some addresses and some defaults
02:36and then you're off to the races to ping frnog.org.
02:40By the way, if you run this, it still doesn't do IPv6
02:43so I needed to use DNS64 and NAT64 on this one.
02:47But it does ping.
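To make that workflow concrete, here is a minimal sketch, assuming the linux-cp plugin is loaded, the port shows up as HundredGigabitEthernet4/0/0, and lcp-sync mirrors Linux changes back into VPP; the addresses and names below are illustrative, not taken from the slide.

```
# Expose the data-plane interface as a Linux interface called ice0
vppctl lcp create HundredGigabitEthernet4/0/0 host-if ice0

# From here on it behaves like any other Linux interface
ip link set ice0 mtu 9000 up
ip addr add 192.0.2.1/31 dev ice0
ip addr add 2001:db8::1/64 dev ice0

# A sub-interface called ipng, tagged with VLAN 101, mirrored into VPP
ip link add link ice0 name ipng type vlan id 101
ip link set ipng up
ip addr add 198.51.100.1/31 dev ipng

# A default route, and off to the races
ip route add default via 192.0.2.0
ping frnog.org
```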
02:49So talk is in three pieces.
02:51Act one is to get OSPFv3 running with VPP for IPv4.
02:59Most people will use this with IPv6.
03:02There is an RFC called 5838.
03:05It's an absolutely terrible RFC.
03:09It does multi-address family routing with OSPF v3
03:14and it says, and I quote,
03:16although IPv6 link local addresses
03:18could be used as next hops for IPv4,
03:21it then completely demolishes any possibility
03:24of you ever being able to use them,
03:26because what happens here is they take the IPv4 address
03:29and stick it in the IPv6 field in the bottom 32 bits
03:34and zero out the rest of the bits
03:36and that is the way it is supposed to work.
03:38So this fundamentally breaks any opportunity
03:41for us to use IPv6 next hops.
03:43So they should never have said could be used.
03:45They should have said cannot ever be used.
03:47Thank you, IETF.
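A quick worked example of that encoding, with an illustrative next hop of 10.0.0.2: the IPv4 address lands in the low 32 bits of the 128-bit field and everything above it is zero, so the result is not a usable fe80:: link-local.

```
# 10.0.0.2 packed into the bottom 32 bits of an all-zero 128-bit field:
printf '10.0.0.2 -> ::%02x%02x:%02x%02x\n' 10 0 0 2
# prints "10.0.0.2 -> ::0a00:0002"
```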
03:49But a clever solution from Ondrej of the BIRD team
03:53in the commit that I linked there
03:55adds a function called update loopback address
03:58which scans all IPv4 interfaces
04:00looking for one that has an address
04:02that we might be able to use.
04:04Starting with host interfaces (/32s),
04:06and otherwise OSPF stub interfaces
04:08and otherwise any old IPv4 address
04:10and it uses that to put it in the link LSA
04:13to announce sort of the next hop
04:15with that IPv6 field carrying an IPv4 address.
04:18Then all routes learned will be on link
04:21and I'll get to that in a second
04:23from any neighbor that we might find on this link.
04:26So my first attempt and there's a checklist here
04:29on my successes and failures along the way
04:31is in VPP to create a loopback interface,
04:34loop0, with a v4 and a v6 address,
04:36and then not create any addresses
04:39on the interfaces GigabitEthernet10/0/0 and 10/0/1.
04:44So they show up with only a link local.
04:47I'll turn on BFD because I wish to share
04:50the BFD session between v4 and v6.
04:52That makes sense to me.
04:54The OSPF configuration is actually quite simple.
04:57The top one here in light blue,
04:59that's where I want to draw your attention on these slides,
05:02is an OSPFv3 instance called ospf4.
05:04So this thing has a channel called ipv4,
05:07which makes it invoke this RFC 5838.
05:10Obviously ospf6, that's all as it was before.
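A hedged sketch of what such a BIRD 2 configuration can look like: two OSPFv3 instances, the ospf4 one carrying IPv4 in an ipv4 channel per RFC 5838, plus shared BFD. Interface names, costs and intervals are illustrative, not copied from the slide.

```
cat >> /etc/bird/bird.conf <<'EOF'
protocol bfd {
  interface "e*" { min rx interval 100 ms; min tx interval 100 ms; };
}

# OSPFv3 carrying IPv4 (the light-blue "ospf4")
protocol ospf v3 ospf4 {
  ipv4 { import all; export all; };
  area 0 {
    interface "e0", "e1" { type pointopoint; cost 15; bfd on; };
    interface "loop0"    { stub yes; };
  };
}

# OSPFv3 carrying IPv6, as before
protocol ospf v3 ospf6 {
  ipv6 { import all; export all; };
  area 0 {
    interface "e0", "e1" { type pointopoint; cost 15; bfd on; };
    interface "loop0"    { stub yes; };
  };
}
EOF
birdc configure
```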
05:14So this thing works and it turns up adjacencies.
05:17We can see two BFD sessions here,
05:19one eastbound on e0 and one westbound on e1
05:22and the cool thing is now we have an OSPF 4
05:25which has router IDs in the IPv4
05:29realm that were learned on link local nexthops.
05:32That's kind of nice, that's the blue stuff there.
05:35So adjacencies are formed and also routes are learned
05:38and so we learn a route, for example for .3
05:43via .2 on the interface e1, flagged onlink,
05:48and that's that blue thing here.
05:50All these routes just look normal
05:52except their next hops are v4 /32s, onlink.
05:56But it didn't work.
05:58So when I ping, VPP would not forward this traffic
06:01because VPP claims that these interfaces
06:04have not enabled IPv4 because they only have
06:07an IPv6 link local and so they drop all the packets.
06:10So my first attempt was just to force-enable IPv4
06:13on all these interfaces, and actually that works,
06:16except none of the intermediary routers
06:19will respond with ICMPv4: a traceroute not working is merely cosmetic,
06:23but a broken Path MTU discovery is a bit more serious.
06:25It just wouldn't work at all because they didn't have
06:28an IPv4 address to answer from.
06:30So my second attempt was to use a feature in VPP
06:33it's also common in hardware routers
06:36to unnumber an interface and borrow from another interface
06:39and that works.
06:41Now e0 and e1 share the 10.1 IP address from loop 0
06:46but forwarding doesn't work anymore
06:49because I'm now trying to ARP from .1 to .2,
06:52my neighbour, say, and up until now
06:55VPP would not respond to ARP requests
06:58from peers that are only reachable via onlink routes.
07:01So if you're in a slash 29 it would respond to every IP address there
07:04but if you're in a slash 32 it just would drop all the ARP packets
07:07and never respond.
07:09So my final attempt, which was also committed
07:12and merged upstream, is to inhibit the sync of these
07:15unnumbered IP addresses into Linux,
07:18so in the light blue you see e0 still only having link-local addresses,
07:21by disabling lcp-sync-unnumbered,
07:24and by fixing this ARP issue,
07:27forcing VPP to really just respond to these
07:30onlink ARP requests.
07:33It's quite normal, all the other vendors will do this as well.
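Putting the final attempt together in vppctl terms, a rough sketch assuming the same loop0, GigabitEthernet10/0/0 and 10/0/1 as above (addresses illustrative); the ARP behaviour change itself was merged into the data plane, so there is no separate knob for it here.

```
# The loopback carries the only real addresses
vppctl create loopback interface instance 0
vppctl lcp create loop0 host-if loop0
vppctl set interface ip address loop0 10.0.0.1/32
vppctl set interface ip address loop0 2001:db8::1/128

# The point-to-point interfaces borrow ("unnumbered") from loop0
vppctl set interface unnumbered GigabitEthernet10/0/0 use loop0
vppctl set interface unnumbered GigabitEthernet10/0/1 use loop0

# ...but don't copy the borrowed /32 and /128 onto the Linux interfaces,
# so BIRD only ever sees link-local addresses on e0 and e1
vppctl lcp lcp-sync-unnumbered off
```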
07:36So with that my trace route at MTU 9000
07:39now responds to IPv4 using the loopback addresses
07:42so you can see this trace route here
07:45to .1, .2 and .3 works just fine.
07:48Alright then I decided to roll it out
07:51AS8298 has like 14 or so routers
07:54and something like 27 or so point to point networks
07:57so about a fifth or so of a slash 24
08:00is tied up in rather useless
08:03transit networks.
08:06So the start situation is every router has a loopback
08:09slash 32 and slash 128 and then a bunch of links to its peer routers
08:12slash 31s for IPv4 and slash 112s
08:15for IPv6. So I have to upgrade
08:18BIRD first that has Ondrej's change in it
08:21also upgrade my VPP data plane which is obviously intrusive
08:24and I'll make use of this moment to rename the
08:27ospf4, which was OSPFv2,
08:30to a thing called ospf4_old, and I'll create an empty
08:33ospf4 that's now OSPFv3.
08:36I'm going to say v3, v4 and v6 a lot by the way.
08:39Then I'll move interfaces one by one from the old v2
08:42to the new v3 OSPF 4
08:45and then finally OSPF 4 old will be empty and I can delete it
08:48and in the end every router will have exactly one
08:51IP address for IPv4 and IPv6 on loopback
08:54which it will share on all interfaces.
08:57So the upgrade here, I actually grabbed this from my bash history
09:00first I'll raise the OSPF cost by prepending
09:0310 to it, typically making it go from 15 to like
09:061015 which drains all the links into this router
09:09then I will rename the protocol OSPF 4
09:12by appending the underscore old, I'll download the packages
09:15for VPP as well as for the bird that I built
09:18and then I enter the data plane namespace and
09:21that pkill I think is really cool: kill VPP,
09:24stop VPP and BIRD, dpkg-upgrade
09:27the VPP packages, upgrade the BIRD packages as well,
09:30and restart the services. I don't know about all of you
09:33but typically these things on an ASR 9K
09:36or a Juniper take like forever to upgrade
09:39whereas here they upgraded in 92 seconds
09:42so a minute and a half later this machine was back up, fully converged
09:45in the DFZ and forwarding traffic
09:48and ospf4_old
09:51was still carrying these two adjacencies,
09:54with xe1-1 and xe1-0.304;
09:57they're still working as normal.
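Roughly what that bash history amounts to; a hedged reconstruction in which the config paths, package file names, namespace and service names are illustrative rather than the speaker's literal commands.

```
# 1. Drain the router: prepend "10" to the OSPF cost (15 -> 1015)
sed -i 's/cost 15;/cost 1015;/g' /etc/bird/bird.conf && birdc configure

# 2. Free up the name "ospf4" by renaming the old OSPFv2 instance
sed -i 's/protocol ospf v2 ospf4 /protocol ospf v2 ospf4_old /' /etc/bird/bird.conf
birdc configure

# 3. Fetch the freshly built VPP and BIRD packages (locations not given in the talk),
#    then kill, upgrade and restart everything from inside the dataplane namespace
ip netns exec dataplane bash -c '
  pkill -9 vpp
  systemctl stop vpp bird
  dpkg -i vpp_*.deb vpp-plugin-*.deb bird2_*.deb
  systemctl start vpp bird
'
```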
10:01so what I do next in step 2 is I remove the addressing
10:04from the interface and I create unnumbered
10:07I also take the chance to rename this thing because now that
10:10I don't see IP addresses on it anymore I don't really know what goes to what
10:13and so I'll use a Linux feature to just rename the interface
10:16from xe1-1 to ddln1
10:19in this case, and I'll make it borrow its addresses from loop0,
10:22so I can plan this,
10:25which shows the API calls that would be made on the data plane,
10:28and then apply it, after which the interface
10:31no longer has IPv4 or IPv6 addresses.
10:34what's left for me to do is move the interface
10:37from the old OSPF, where it was called xe1-1,
10:40into the new OSPF, which is OSPFv3,
10:43where it's called ddln1, and I'll use BFD,
10:46quelle surprise, this converges
10:49I have to do the other side as well, but ospf4
10:52now has router ID
10:55194.1.163.6, learned
10:58on interface ddln1 from a link-local next hop,
11:01and I can see the route for that peer
11:04being its own IP address,
11:07.6, on the interface ddln1, onlink,
11:10so I can ARP for it and the other guy will respond
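A sketch of that per-interface step with illustrative names and addresses; the speaker drives the data-plane side through vppcfg's plan/apply workflow, shown here as roughly equivalent manual commands.

```
# Data plane: drop the /31, rename the Linux-side interface (the link has to be
# down briefly for the rename), and borrow addresses from loop0 instead
vppctl set interface ip address del GigabitEthernet10/0/1 192.0.2.0/31
ip netns exec dataplane ip link set xe1-1 down
ip netns exec dataplane ip link set xe1-1 name ddln1
ip netns exec dataplane ip link set ddln1 up
vppctl set interface unnumbered GigabitEthernet10/0/1 use loop0

# Control plane: in bird.conf, move the link out of ospf4_old (where it was
# "xe1-1") into the new OSPFv3 ospf4 instance under its new name, with BFD:
#
#   protocol ospf v3 ospf4 {
#     ipv4 { import all; export all; };
#     area 0 { interface "ddln1" { type pointopoint; cost 15; bfd on; }; };
#   }
#
birdc configure
```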
11:13so from here on it's just rinse and repeat
11:16like do this for every interface with coffee in hand, start small
11:19go up on the ring, go to Amsterdam
11:22and then back down over Lille, Paris and Geneva,
11:25and then end up back in Zurich again
11:28after that the machine looks like this
11:31this is the example in Paris, the loopback there is in blue,
11:34and then I have a bunch of IP addresses that come from
11:37France IX and from some other smaller things that I have there
11:40and then two main interfaces called frggh0,
11:43which goes to Lille, and chplo0,
11:46which goes to Plan-les-Ouates in Geneva.
11:50trace routes look like this, quite nice
11:53first hop is of course my local VLAN entry point
11:56on .66 but from there on the backbone network
11:59all the routers use that exact one IP address for V4
12:02and one IP address for V6
12:05and this allows me to return 27 slash 31's and a whole bunch of
12:08slash 112's and I don't know if everyone knows this
12:11I hope you do, but you never needed globally routable
12:14IPv6 addresses on these links anyway, you can just use OSPFv3 over link-locals.
12:18so I have one more thing that I wanted to talk about
12:21people ask me all the time like why do you use this
12:24you can just use Linux or FreeBSD or OpenBSD
12:27which is true but VPP as I said in the beginning is really really fast
12:30so I took a 2016 Dell R730
12:33which I bought for 600 euros second hand
12:36and I racked it in this configuration
12:39I have three other machines
12:42the previous generation Dell R720,
12:46three of them, each having three dual-port NICs,
12:49so in total 18 ten-gigabit ports
12:52that go through a Mellanox switch
12:55by the way super cool, you can run Debian on them
12:58without any firmware issues and then down below
13:01I have this Dell R730, which has 24
13:04network ports: six quad-port
13:07X710s, if you've seen them, the Intel cards;
13:10three of these cards are on CPU NUMA 0
13:13and three of them are on CPU NUMA 1,
13:16all of this runs Debian, none of this has a binary blob,
13:19it's all fully open source
13:22so I'll take a tool called Cisco T-Rex which is an open source load tester
13:25and I have two methods here, method one is to use only one
13:28worker thread in VPP, it's a multi-threaded app
13:31if you add more threads you get more throughput
13:34but I'll limit it to one only and then I'll slam that with as much traffic
13:37as it's willing to forward and I'll measure how much that is
13:40up to 1000 packets per second, maybe 10,000, 1,000,000
13:4310,000,000, 100,000,000 packets until that one CPU thread is saturated
13:46and then method number two is just RFC 2544,
13:49a linear ramp up of traffic from 0 to 100% of line rate
13:52and then see when the machine starts dropping
13:55more than 1 tenth of 1%
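A hedged sketch of method one: pin VPP to a single worker thread in startup.conf, then drive it from the T-Rex stateless console at increasing rates. The core numbers and the traffic profile name are illustrative.

```
# VPP startup.conf fragment: one main core and exactly one worker thread
#   cpu {
#     main-core 1
#     corelist-workers 2
#   }

# On the load tester: start the T-Rex server in stateless mode, then the console
./t-rex-64 -i
./trex-console
#   trex> start -f stl/udp_1pkt_simple.py -m 1kpps -p 0 1
#   trex> update -m 1mpps        # keep ramping until the worker thread saturates
#   trex> tui                    # the textual UI shown on the next slide
```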
13:58so number one is actually very easy to do
14:01T-Rex has a textual user interface
14:04for all of us non-GUI people and here I have an overview
14:07of what it looks like, at the top there, number one
14:10shows the interface types that I have, I have 4 times 10 gigs
14:13in this case, number two is how many packets I'm sending
14:16out of the load tester, number three would be how much
14:19I'm receiving back from the device under test
14:22in this case the VPP machine and these should be the same number
14:25otherwise it's dropping traffic and to make absolutely sure
14:28there's also packet and byte counters for all interfaces
14:31and what you see here is a load test doing
14:344 times 10 gig, 64 byte packets
14:37the smallest we're allowed to send before they become runts,
14:40and that's 59.4 million packets per second in both directions,
14:43which is exactly 40 gigabits per second,
14:46and this shows me that L2 cross connects
14:49just Ethernet in to Ethernet out
14:52which is a cheap thing for VPP to do
14:55must do at least 14.88 million packets per core
14:58method one results here
15:02if you look at the top left you'll see the L2 cross connect
15:05semantics, an Ethernet packet in on one interface and out the other;
15:08at 1000 packets per second
15:11it on average takes 991 CPU cycles
15:14to switch that packet through the data plane.
15:17this is by the way a really cool network software engineering
15:20question in an interview, how expensive is it really
15:23if I ramp up from 1000 to 1 million packets per second
15:26it's only 199 cycles, which is almost an order of magnitude better,
15:30and the reason is that we can now use the CPU instruction cache
15:33the data cache, DDIO, all sorts of
15:36smart stuff in the hardware,
15:39to push as many packets as we can
15:42through the CPU
15:45and in total when I ramp it up until the core is saturated
15:48it does 15.3 million packets per second
15:51on one CPU thread
15:54and reminder this machine has 44 CPU threads
15:57so this thing does roughly 600 million packets per second
16:00for MPLS a little bit more expensive
16:03does 10.3 million, IPv4 11.1
16:06and IPv6 9.72
16:09and by the way this is with a full table FIB loaded
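A quick back-of-the-envelope check on those numbers; the ~2.2 GHz clock is an assumption, it is not stated in the talk.

```
# 199 cycles per packet at ~2.2 GHz is on the order of 11 Mpps per thread:
echo $(( 2200000000 / 199 ))   # ~11,055,276 packets/sec
# 15.3 Mpps per thread across 44 threads gives the "roughly 600 Mpps" figure:
echo $(( 15300000 * 44 ))      # 673,200,000 packets/sec
```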
16:12so I've made claims in the past
16:15that this thing can easily do 100 gigabits
16:18so someone on Twitter called me out and said prove it
16:21and this is the proof of that in the condensed form
16:25but I'll do a load test with 18 interfaces
16:28all three of the load testers at the top
16:31fully sending as much traffic as they can down into the VPP machine
16:34and I'll start by proving the bandwidth
16:37so in this case I'm using large packets
16:40and if you see vendors say we do up to 20 gigabits
16:43it's typically because they do that with very large packets
16:46and I'll achieve 100 gig
16:49not a problem
16:53you don't even see the CPU time go from 0 to 0.1%
16:56so this is 14.7 million packets per second
16:59and 180 gigabits of throughput achieved
17:02obviously this is really easy for VPP
17:05because its cost is per packet, not per byte;
17:08also if you have enough PCI lanes
17:11and PCI bandwidth this thing just scales almost infinitely
17:14with CPU cores and PCI
17:17I have 24 unused CPU threads in the machine at this point
17:20and 6 unused NICs, merely because I didn't have more network cards
17:23to generate more load with,
17:26and this is proof that VPP scales linearly and easily forwards 100 gigs
17:29but the harder one is can it also do small packets
17:32and I talked before about doing a 64 byte load test
17:35and here I chose 128 and I'll get back to that in a second
17:38but I'll ramp up 128 byte load test
17:41again to line rate
17:44achieving 100 million packets per second somewhere in the middle
17:47and then ending up at 165 million packets per second
17:50on an 11 or 12 year old Dell
17:53that cost me 600 bucks and takes maybe 110 watts
17:56so most of our hardware does not do that these days
17:59and 165 million packets per second
18:02then turns into 150 or so gigs of traffic
18:05at 128 bytes each
18:08and again 24 CPU threads are doing nothing
18:11more than half the machine is completely left unused in this case
18:15One quick topic so that you don't ask questions about this
18:18these are 4 port network cards from Intel
18:21and the Intel chip that's behind it can only do
18:2435 to 36 million packets per second
18:27so if you have 4 ports you would be generating 60 million packets per second
18:30which is too much for the silicon to handle
18:33and what we see here is that we have to have a synthetically higher packet size
18:36otherwise we would saturate the silicon
18:39in the network card
18:42before VPP ever got to see it,
18:45but 128 bytes is line rate for this thing
18:48and by the way in case you wanted to laugh at us software router people
18:51I hear the Trio 6 only does line rate at 192 bytes as well.
18:54so with that I think I have depleted my time
18:57maybe there's a question or two
19:00Are there any questions?
19:03Hi
19:06it's super cool to see
19:09unnumbered interfaces
19:12and replying with loopbacks
19:15for the trace routes
19:18but that also obfuscates a few things
19:21like, I like to be able to see the data,
19:27and to be able to see
19:30which interfaces I am using
19:33when I do a trace route
19:36does that bother you sometimes
19:39when you are troubleshooting stuff
19:42and you can't see which interface you are entering
19:45and only have the loopback interfaces
19:48no not at all
19:51but there's a more sarcastic answer
19:54which is like it's great if your network grows
19:57but when the network gets sufficiently large
20:00you are just hemorrhaging IP addresses
20:03for no other reason than aesthetics
20:06but you could use non routable IP address private space
20:09that you would only see your reverse DNS
20:12so you can identify
20:15so people from the outside would see stars
20:18but on the inside you would be able to see
20:21no you are absolutely right
20:24that whole thing is in a VPC
20:27that is not connected to the internet
20:30using 198.19
20:33and so there I get exactly what you wanted
20:36and these all pair up with the MPLS underlay
20:39and also most of these are in a ring
20:42so there's really only two choices
20:45and the external interfaces do have an IP address
20:48like at DE-CIX and France-IX.
20:51Thank you
