CloudNets

S4 E2: AI Network Fabric

DriveNets Season 4, Episode 2

We have an issue with the AI fabric: an AI networking problem in the large clusters of GPUs typically used for training. This episode looks at that problem and explores how an Ethernet-based solution can resolve it – by building a chassis which is distributed (a disaggregated, distributed chassis). This approach yields a lossless, fully scheduled fabric with no packet loss, but without the scale limitation of a chassis.

Hi, and welcome back to CloudNets-AI, where networks meet cloud.

And today we're going to talk about

AI and specifically about AI network fabric and the different ways to implement it.

 We have our AI network fabric expert, Yossi.

 Hi, Yossi.

 Hey, everyone.

 Thank you for joining us.

 Thank you for having me.

 So we have an issue with AI

network fabric, right.

 We have some requirements.

 We have some specific things we need to know about.

 Let's understand first, what is the problem?

 What problem are we trying to resolve?

 Great question.

 Essentially, AI networks rely on two fundamentals.

 First one, they use RDMA, Remote Direct Memory Access.

 Now, the reason AI networks use RDMA is that we want to reduce the latency of read/write operations as much as we can.

 Now, the second thing we have, or the second characteristic we have in RDMA, or AI, networks, is elephant flows.

 The GPUs that participate in a cluster usually send very long flows of data.

 Now, these two characteristics that I mentioned, the RDMA nature of it and the elephant flows, cause several problems.

 You want to talk about it?

 Oh, yeah, please.

 Okay.

 So essentially, RDMA is not tolerant of loss.

 RDMA works with a retransmission algorithm called Go-Back-N (GBN), so a single lost packet forces the sender to go back and retransmit that packet and everything sent after it.

 So what happens is we lose a lot of time, and time is expensive when you're talking about AI networks.

 Because job completion time determines the utilization of the GPUs.

 And this is very expensive.

 Exactly.

 So first things first: you're not allowed to lose any packet when you're talking about AI networking, or RDMA networking specifically.
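As an illustration of the Go-Back-N cost described above, here is a minimal Python sketch. The window size, flow length, and loss position are hypothetical numbers, not anything from the episode; real RoCE NICs implement this logic in hardware.

```python
# Sketch: why one lost packet is expensive under Go-Back-N (GBN).
# All parameters are illustrative.

def goback_n_transmissions(total, window, lost):
    """Count transmissions when each loss forces the sender to rewind
    to the lost packet and resend everything sent after it."""
    sent, seq = 0, 0
    while seq < total:
        end = min(seq + window, total)   # send up to a full window
        sent += end - seq
        dropped = sorted(p for p in lost if seq <= p < end)
        if dropped:
            seq = dropped[0]             # rewind to the first loss
            lost.discard(dropped[0])     # assume the resend succeeds
        else:
            seq = end
    return sent

print(goback_n_transmissions(10_000, 64, set()))    # 10000: lossless
print(goback_n_transmissions(10_000, 64, {5_000}))  # 10056: one drop
```

A single drop wastes most of a window of transmissions, on top of the stall while the sender detects the loss; at the loss rates an unscheduled fabric can produce, that idle GPU time adds up quickly.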

 Second thing I was mentioning is elephant

flows.

 Now, the problem with elephant flows is that they naturally have low entropy, right?

 Which means you cannot efficiently load balance, which means you will have packet loss, which contradicts the first requirement.

 Exactly.

 So essentially, what happens with the classic or standard ECMP or hashing mechanism that you have today is that you bombard some specific links in your network, while other links, or other resources in the network, are essentially idle.
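Here is a minimal sketch of that entropy problem, with made-up addresses and an assumed eight-link ECMP group: per-flow hashing pins each elephant flow to one link, so a handful of flows can never spread across the whole fabric.

```python
# Sketch: classic per-flow ECMP hashing with only a few elephant flows.
# Addresses, ports, and link count are illustrative.

import hashlib
from collections import Counter

LINKS = 8  # uplinks in the ECMP group

def ecmp_link(five_tuple):
    """Per-flow hashing: every packet of a flow lands on the same link."""
    digest = hashlib.sha256(repr(five_tuple).encode()).digest()
    return digest[0] % LINKS

# Four GPU-to-GPU elephant flows (RoCEv2 runs over UDP port 4791)
flows = [
    ("10.0.0.1", "10.0.1.1", 50001, 4791, "udp"),
    ("10.0.0.2", "10.0.1.2", 50002, 4791, "udp"),
    ("10.0.0.3", "10.0.1.3", 50003, 4791, "udp"),
    ("10.0.0.4", "10.0.1.4", 50004, 4791, "udp"),
]

load = Counter(ecmp_link(f) for f in flows)
for link in range(LINKS):
    print(f"link {link}: {load[link]} elephant flow(s)")
# Even with zero hash collisions, at most 4 of the 8 links carry these
# huge flows while the rest sit idle; a collision overloads one link.
```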

 Okay, so we understand the problem, we

understand what we need to do.

 And basically there are two main

philosophies about how to resolve this

 problem.

 One is based on the endpoints and

things we do there in order to

 mitigate congestion and to ensure all the

things you mentioned.

 The other is based on a fabric,

the network itself.

 So let's talk about both of them.

 Let's start from the endpoints.

 What do we do there?

 Yeah, you put it perfectly: you have two types of approaches.

 The first type, which is endpoint congestion control or endpoint scheduling mechanisms, is about how to solve the problem once it occurs.

 Okay, and then you have a type of solution that is about how to proactively prevent the problem from happening.

 Let's deep dive into the NIC-based, or endpoint-based, solution.

 So if you look at the industry

today, you'll see all types of vendors,

 you'll see NVIDIA offering their Spectrum-X, you'll see all sorts

 of collaborations between switch vendors

and NIC vendors trying to

 somehow integrate the NIC into the switch

in order to solve it.

 But there are a few fundamental problems

with it, right?

 First one, it's very costly, right?

 It's costly because the SuperNIC or the

DPU costs a lot of

 money, right?

 That's the first reason.

 Second, it's costly because the DPU usually consumes a lot of power and requires a lot of cooling.

 And that's basically the worst solution you can choose.

 And I'm being specific here: I'm talking about TCO, okay?

 Now, it does solve the problem of elephant flows, because then you can activate some kind of smarter load-balancing mechanism on your network, like packet spraying, and then reorder the packets on the NIC side, right?
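For intuition, here is a toy sketch of that spraying-plus-reordering idea. It is not any vendor's implementation: just sequence-numbered packets sprayed round-robin across links and put back in order by a receive-side buffer.

```python
# Sketch: per-packet spraying on the sender, reordering on the NIC side.
# Link count and packet count are illustrative.

import itertools
import random

LINKS = 8

def spray(seqs):
    """Sender: spread consecutive packets round-robin over all uplinks."""
    rr = itertools.cycle(range(LINKS))
    return [(next(rr), seq) for seq in seqs]

def nic_reorder(arrivals):
    """Receiver: buffer out-of-order arrivals, deliver in sequence."""
    buffered, expected, delivered = set(), 0, []
    for _link, seq in arrivals:
        buffered.add(seq)
        while expected in buffered:      # drain any contiguous run
            buffered.discard(expected)
            delivered.append(expected)
            expected += 1
    return delivered

wire = spray(range(20))
random.shuffle(wire)                 # unequal link delays reorder packets
print(nic_reorder(wire) == list(range(20)))  # True: order restored
```

The spraying itself is cheap; the cost sits in the NIC-side state and buffering, which is exactly why this approach leans on expensive SuperNICs or DPUs.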

 So it does solve the problem, but

then it introduces another problem that we

 haven't mentioned.

 And this is an operational problem.

 Now, the operational problem that I'm referring to is specifically fine-tuning the network.

 If you want to have a good congestion control mechanism or a good reordering mechanism in your endpoints, you need a skillset; you need people to maintain that, right?

 You need people to go ahead and

prepare your infrastructure every time you

 want to run a model.

 So it solves the problem on the technical side, but it's costly and it requires some decent expertise and a decent skillset.

 Okay, so this is a valid solution,

but it has its flaws.

 What about resolving it or avoiding the

problem altogether in the network, in the

 fabric itself?

 So if you ask me, when I'm thinking about AI networks, I think the optimal solution would be a chassis, right?

 Yeah.

 Imagine just taking a bunch of GPUs and connecting them into one chassis, right?

 And then the chassis does all the

magic.

 Yeah, it's a single Ethernet hop; the backplane is...

 Exactly, it's a single hop from NIF to NIF, or from port to port.

 Right.

 From Ethernet port to Ethernet port.

 It has no congestion in it.

 Right.

 The connection from the NIF to the fabric and then to the NIF is scheduled end to end, and you lose no packets.

 And in terms of operations, in terms of how to maintain that?

 It's...

 Plug and play.

 It's a given; anyone with a CCNA can do it.

 But if I have thousands of GPUs, there is no such chassis.

 Exactly.

 Now that's the major problem.

 And we have a solution for that.

 You want to hear about it?

 Oh, yeah.

 Never heard of it.

 So essentially, what we did at DriveNets is we took a chassis, we distributed it, we disaggregated it (but that's a whole different topic).

 So we disaggregated it, we distributed it, and essentially we made it scalable to an extent the industry has never seen before.

 Right?

 So essentially our solution is based on

two building blocks.

 We have the NCP and the NCF, where NCPs are equivalent to the old-fashioned line cards, and the NCF is equivalent to the fabric board that we're used to from the backplane of the chassis.

 And essentially, this distribution of the chassis gives us the benefits of a chassis: an end-to-end VoQ (virtual output queue) system, fully scheduled, lossless by nature, and also scalable.
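To make the VoQ idea concrete, here is a minimal sketch; it is not DriveNets code, and the class names are illustrative. The ingress side keeps one virtual output queue per egress port and transmits cells only against granted credit, so nothing is ever dropped inside the fabric.

```python
# Sketch: credit-granted virtual output queues (VoQ), the mechanism
# that makes a scheduled fabric lossless. Names are illustrative.

from collections import deque

class IngressLineCard:
    def __init__(self, egress_ports):
        # One VoQ per egress port: no head-of-line blocking across ports.
        self.voqs = {port: deque() for port in egress_ports}

    def enqueue(self, port, cell):
        self.voqs[port].append(cell)     # buffered at ingress, not dropped

    def transmit(self, port, credits):
        """Send at most `credits` cells toward this egress port."""
        sent = []
        while credits > 0 and self.voqs[port]:
            sent.append(self.voqs[port].popleft())
            credits -= 1
        return sent                      # never exceeds the grant

ncp = IngressLineCard(egress_ports=["e0"])
for cell in ("c0", "c1", "c2"):
    ncp.enqueue("e0", cell)
print(ncp.transmit("e0", credits=2))     # ['c0', 'c1']: 'c2' waits
```

Excess traffic waits in the VoQ at the ingress side instead of overflowing a buffer deeper in the fabric, which is what "lossless by nature" means here.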

 In fact, you can have a chassis-like solution, which is optimal in terms of operations and in terms of technical abilities in AI networks, and you can have it scale up to thousands of GPUs in a single cluster.

 Okay, this is very cool.

 So thank you, Yossi.

 This was mind blowing.

 The three things we need to remember about resolving the AI fabric, or AI networking, problem with large clusters of GPUs usually used for training are these.

 One, the problem itself derives from the fact that we use RDMA, which means that we need a lossless, scheduled, high-performance fabric, and from the elephant-flow nature of the information distribution within the cluster, which means classic load balancing like ECMP will not work.

 So we have two solutions; they are the second and third points.

 The first solution is endpoint scheduling, which relies on the endpoints, which need to be very smart and very compute- and power-hungry, like DPUs.

 This is a congestion control, or congestion mitigation, solution, which brings you only so far in terms of performance, costs you a lot, and is also very complicated to manage.

 This is coming from vendors like NVIDIA with their Spectrum-X, and also from other vendors that are cooperating in the Ultra Ethernet Consortium, for instance.

 And the third point is the network-based solution for resolving this issue.

 The network-based solution is practically building a chassis which is distributed, hence a disaggregated, distributed chassis, which gives you a lossless, fully scheduled fabric with no packet loss, but without the scale limitation of a chassis.

 And this is coming, of course, from DriveNets with our DDC, but also from other vendors.

 We will talk about Arista's DES in the next episode.

 So this is what you need to

remember.

 Thank you very much Yossi.

 Thank you for having me and thank

you for watching.

 See you next time on CloudNets.

 
