I've never heard of routers using the ttl field for hashing, why would that ever be useful?
If you're doing multi-tier ECMP with the same hashing algorithm at each tier, using TTL can ensure you don't get polarization issues. Though there's plenty of other ways to avoid polarization, this works out of the box.
I have heard fastly competitors tut-tut their use of anycast as potentially unreliable in the face of issues like this.
I dunno, in my experience anything that "works right out of the box" usually misses a bunch of edge cases (possibly necessarily/by definition). The description seems apt to me.
Consider a case where you have a router balancing over four links. It chooses which link based on a hash of some information from the packet.
Now imagine you have four more routers on each of those links, hashing out over four more links. So you have a tree with 16 outputs.
If the second-tier router uses the same hash algorithm as the first one, all the packets it receives will hash to the same link, because it's doing the same calculation as the router before it.
Thus the 2nd tier of four routers will only use 4 of their outputs, instead of all 16.
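A toy simulation makes the polarization concrete (hypothetical hash function and topology, not any vendor's implementation): when both tiers hash the same way, only 4 of the 16 leaf links ever carry traffic.

```python
import hashlib

def route(flow: tuple, n_links: int) -> int:
    """Pick an output link by hashing the flow 5-tuple (same algorithm everywhere)."""
    digest = hashlib.sha256(repr(flow).encode()).digest()
    return int.from_bytes(digest[:4], "big") % n_links

# 10,000 distinct flows through a 2-tier tree: 1 router -> 4 routers -> 16 leaf links
used_leaf_links = set()
for i in range(10_000):
    flow = ("10.0.%d.%d" % (i // 256, i % 256), 40_000 + i, "198.51.100.7", 443, "tcp")
    first = route(flow, 4)   # tier-1 choice: which of 4 second-tier routers
    second = route(flow, 4)  # tier-2 choice: same hash, so same answer
    used_leaf_links.add((first, second))

# Every flow that reached second-tier router `first` re-hashes to that same value,
# so only 4 of the 16 (first, second) combinations are ever used.
print(len(used_leaf_links))  # -> 4
```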
I think you are describing a CLOS (spine and leaf) data center network topology here.
>"Now imagine you have four more routers on each of those links, hashing out over four more links. So you have a tree with 16 outputs."
I'm not really understanding this, as a link exists between exactly two routers. Do you mean "path" instead?
>"If the second-tier router uses the same hash algorithm as the first one, all the packets it receives will hash to the same link, because it's doing the same calculation as the router before it."
And this is kind of the tax of flow-based preservation, which is fine compared to the price of TCP reordering, no? Efficient hash-based ECMP utilization is going to be a function of the distribution of source IPs and ports in the 5-tuple used in hashing. You can see this outside of ECMP, for example when running LVS with the hashing algo and you have customers that are all behind the same NAT box. But also there's nothing stopping you from using different hashing on your spine tier than you do on your leaf tier. You could assign a 4-tuple on one and a 5-tuple on the other.
At any rate, a common CLOS ECMP design with BGP is to put each ToR switch in its own ASN and then load balance across ASNs. So using your example and a 3-tier CLOS network: if tier 2 had the 16 outputs, then a router should ECMP over the 16 different ToR ASNs to the destination.
> And this is kind of the tax of flow-based preservation, which is fine compared to the price of TCP reordering, no? Efficient hash-based ECMP utilization is going to be a function of the distribution of source IPs and ports in the 5-tuple used in hashing.
mcpherrinm's example is worse than what I think you're suggesting. Because a given second-tier router will only receive packets that hash to a specific set of values, if that router has the same number of downstream links that the first tier router had, it will send all packets it receives to the same link, ignoring other links. Which is a terrible trade-off for avoiding OOO packets.
The more reasonable trade-off is giving up on utilizing all routers for a single flow, to avoid OOO packets.
>"The more reasonable trade-off is giving up on utilizing all routers for a single flow, to avoid OOO packets."
Huh? No you would never want to use "all" routers for a single flow anyway. Each router just needs to make a deterministic selection for each packet. The alternative to a hash based scheme would be per packet load balancing which is practically never used b/c it gives you TCP packet reordering.
>"if that router has the same number of downstream links that the first tier router had, it will send all packets it receives to the same link, ignoring other links. Which is a terrible trade-off for avoiding OOO packets"
No, it would not be a terrible trade-off. Optimizing for maximum link utilization only matters if you have congestion in your network, and even then ECMP is congestion-agnostic. In reality your leaf network has fewer downstream links than it has upstream. A common topology is 4x4x2, where each leaf node has two downstream links to two ToR switches.
Pedantic: Clos isn’t an acronym and only the first letter should be capitalized. It’s a non-blocking switched network named after its inventor, Charles Clos.
Thank you for this explanation, I understand now. Originally I thought you would want all routers in the path to hash the same flow the same way, but didn't think about how it interacts with layers of routers.
On first guess, I would think a per-device salt is another way to address polarization. What are other ways?
Per-device salt is the usual way. See https://docs.cumulusnetworks.com/plugins/servlet/mobile#cont... for some example documentation (ctrl-F "hash seed"). Or you can change the inputs to the hash function (e.g., one tier hashes on source IP and the other doesn't), but that's more troublesome.
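To sketch the seed approach (hypothetical hash function, not what any vendor ships): mixing a per-device value into the hash input makes each tier's decision independent, so all 16 leaf links in a 4x4 tree end up carrying traffic.

```python
import hashlib

def route(flow: tuple, n_links: int, seed: int) -> int:
    """Hash the flow together with a per-device seed to pick an output link."""
    digest = hashlib.sha256((repr(flow) + str(seed)).encode()).digest()
    return int.from_bytes(digest[:4], "big") % n_links

used = set()
for i in range(10_000):
    flow = ("10.0.%d.%d" % (i // 256, i % 256), 40_000 + i, "198.51.100.7", 443)
    first = route(flow, 4, seed=1)             # tier-1 router's own seed
    second = route(flow, 4, seed=100 + first)  # each tier-2 router has a distinct seed
    used.add((first, second))

# Any single flow still takes one deterministic path, but across many flows
# all 16 (first, second) combinations get used.
print(len(used))  # -> 16 with these 10,000 flows
```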
The property you want is that all packets for a given flow take the same path through the network (absent topology changes), to minimize out-of-order packets and ease troubleshooting.
A1 +- B1 +- C1
   |     `- C2
   |
   `- B2 +- C3
         `- C4
In this example, if all routers compute the same hash value for a given packet, links B1 -> C2 and B2 -> C4 never get used, for any flows.
So all you really need is for each router to make a consistent decision about packets in the same flow. They don't have to use the exact same hashing function. In addition to the cumulus networks link provided in the GP, it looks like Cisco gear also has a per-device salt: https://www.cisco.com/c/en/us/support/docs/ip/express-forwar...
Note that this has nothing to do with endpoint selection, just the intermediate hops.
Edit: I'm also leaving out how the final modulo can produce different selections between layers. So in practice, if you have layers that are different sizes you'd get better utilization. That seems like a fragile thing to depend upon, though.
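The modulo effect is easy to see in a toy example (hypothetical hash, not a real router's): with identical hashes but differently sized link groups per tier, the final modulo alone de-correlates the choices.

```python
import hashlib

def hash32(flow: tuple) -> int:
    """One shared 32-bit hash of the flow tuple, used by every tier."""
    return int.from_bytes(hashlib.sha256(repr(flow).encode()).digest()[:4], "big")

used = set()
for i in range(10_000):
    h = hash32(("10.0.%d.%d" % (i // 256, i % 256), 40_000 + i, "198.51.100.7", 443))
    # Tier 1 has 4 links, tier 2 has 3: same hash value, different modulus.
    used.add((h % 4, h % 3))

# The pair (h % 4, h % 3) is determined by h % 12, so all 12 combinations
# appear -- utilization improves, but only because the layer sizes differ.
print(len(used))  # -> 12
```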
>"So all you really need is for each router to make a consistent decision about packets in the same flow."
Indeed, this is what I was trying to articulate, but maybe I didn't do a good job of that. I mentioned somewhere else that you can also just change the hash at each tier in your network - add IP protocol in one, don't use it in another. This should achieve the same as adding a "seed" to a router's hash.
The problem is if all of the equipment hashes the same way, many of your links will be underutilized.
If A has a link to B1 and B2, and each of B1 and B2 have two links to C, and everything hashes the same, there will only be two possible paths to C, instead of the four you should have.
Ex: if packet hashes to 0, it goes to B1, and then over the 0 link to C. If packet hashes to 1, it goes to B2 and over the 1 link to C. If the Bs hash on different values than A (including if they have a different salt), you'll have better distribution on the B to C links, and actually be able to hit all 4 paths.
The term for unwanted oscillation between different paths is, IIRC, "flapping".
However, when working with ECMP (equal-cost multi path), you actually do want to use all your paths (hence "equal cost") simultaneously (for load-balancing purposes, say). "Polarisation" refers to the unwanted condition when the hash algorithm that decides which path a flow/packet takes is not properly distributing flows/packets across links, leading to underutilisation and/or overutilisation of links/routes.
Yeah, I think you could seed each router's hashing with something unique like a MAC address or hardware ID, but you could also just use different hashing at each tier in your network, like using source IP + dest IP on one tier and source IP + dest IP + IP protocol on another tier. This should achieve the same result.
Yeah but then you need to keep coming up with new ways of hashing things as you add tiers. It's easier to generate a new salt than it is to figure out a new but still-valid way of hashing the packets.
Practically speaking this isn't an issue, though. You don't see many spine-and-leaf networks that are deeper than 3 tiers (edge, distribution, ToR), which means one link upstream and one down. It's not like people build networks arbitrarily deep.
Incapsula's response was different, however: at some point we urgently needed to implement DDoS protection using GRE tunnels. An Incapsula engineer spoke with us as if he were challenging us to a programming competition puzzle: "I can tell you that you are big enough to implement GRE only if you can answer 'yes' to these four networking questions". We went with Verisign as a result; they made things look simpler.
A consistent route for every packet in a flow is considered desirable, but it's not an entitlement or a requirement of the IP protocol. A DDoS prevention tool that insists that the TTL be the same for every packet is broken. Here's one that checks for inconsistent TTL, but it has some tolerance for variation.[1] The one mentioned in the original post didn't like a difference of 1 in TTL.
There's no DDoS prevention tool in play here (aside from that being Fastly's business).
Arista's default of having the TTL as part of the hashing function used to calculate the packet path means that any TCP connection that does not have a consistent TTL when it reaches this (source) ISP's routers can potentially be bucketed onto multiple different egress paths (transit or peering connections) onto different upstream SPs.
Because BGP works on the concept of AS numbers rather than IPs (and because Fastly uses anycast, which announces the same set of IPs from multiple POPs), it's possible that packets that leave the ISP network over different transit connections end up at different Fastly POPs, as each transit provider or peer will have a different route to Fastly's AS.
Fastly does not share TCP session information between POPs, as I imagine at that scale it's massively prohibitive to do so.
The TCP RST happens because there is no established TCP session at the secondary POP which receives the TLS handshake packet, and because TTL is part of the hash function it will always end up following a different path (and therefore hitting a different datacenter, if routing stays consistent) if the TTL is reduced by 1.
It's generally deemed 'acceptable' in Anycast terms to cause a TCP RST when traffic switches from one POP to another, as this usually happens rarely, in response to changes in the path between two AS numbers.
For things like websites, browsers will usually just retry the connection, and as the path has changed and stays consistent, it starts to work and all you notice is a slight delay.
It doesn't work in this situation because the second packet with the reduced TTL is consistently routed to an 'incorrect' datacenter and breaks the TCP session every time, effectively dropping service to 0.
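A sketch of that failure mode (hypothetical hash function and illustrative addresses; the real hardware hash is different): once the TTL is part of the ECMP hash input, two packets of the same flow arriving with TTLs that differ by 1 can be steered onto different egress paths, and with anycast that means different POPs.

```python
import hashlib

def ecmp_pick(src: str, sport: int, dst: str, dport: int, ttl: int, n_paths: int = 4) -> int:
    """Pick an egress path from the flow tuple *plus* the packet's TTL."""
    key = f"{src}:{sport}->{dst}:{dport}/ttl={ttl}".encode()
    h = int.from_bytes(hashlib.sha256(key).digest()[:4], "big")
    return h % n_paths

# Same TCP flow, but the customer's CPE passed the SYN through with TTL 64
# and decremented the later handshake packets to TTL 63:
syn_path = ecmp_pick("203.0.113.5", 51234, "198.51.100.7", 443, ttl=64)
rest_path = ecmp_pick("203.0.113.5", 51234, "198.51.100.7", 443, ttl=63)

# The two picks are deterministic but unrelated: whenever they differ, the SYN
# and the ClientHello consistently take different egresses (and different POPs).
print(syn_path, rest_path)
```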
>It's generally deemed 'acceptable' in Anycast terms to cause a TCP RST when traffic switches from one POP to another...
>For things like websites, browsers will usually just retry the connection
This is not the case. Every single TCP RST you send on an active HTTP connection will lead to an HTTP request failing in a way which causes application brokenness. No browser auto-retries in that case; the user always has to hit refresh to fix it.
At least with the TCP RST it'll fail fast. What's annoying is that some providers decide that you aren't worthy of the RST; then your browser has to wait for like 3 minutes until the timeout is reached — really annoying.
I'm not a networker, but sounds like from a correctness standpoint, a problem on Fastly's end -- they're reusing frontend IPs for distinct sets of machines, and traffic directed to the 'wrong' PoP is dropped hard rather than attempting any kind of internal routing
Of course that kind of routing would create a potential bottleneck for an attacker to exploit ("simply" force traffic to the wrong IPs to the wrong PoP, assuming $attacker had this level of access to the backbone), but that's the problem Fastly are supposedly paid to deal with
Their scheme is fine and dandy with a protocol like DNS where UDP retries are transparent and TCP is a tiny fraction of weird traffic, but for business applications handling credit cards, surely the occasional RST is already too many
Or another way to look at it, basically they're saying their IP addresses are special snowflakes and actually the full address includes the route, and source networks are wrong for assuming things work the way they're supposed to everywhere else on the Internet
> Of course that kind of routing would create a potential bottleneck for an attacker to exploit ("simply" force traffic to the wrong IPs to the wrong PoP, assuming $attacker had this level of access to the backbone)
It's actually simpler than that — the state of the TCP connection is controlled by the hosts to the TCP connection, so, all it takes is for a "client" host to pretend that the connection has already been established (sending a single packet alleging as such) — no need for any special access to any backbone.
Not really. Anycast is fairly standard and usable for stateful connections - the issue is again middleboxes fucking with stuff and a weird default of incorporating TTL in the ECMP hashing algo.
I tend to agree: anycasting stateful services is always a gamble. However, I do think including TTL in the ECMP hashing algorithm is prone to failure, and is probably a poor default.
We run Arista 7280SRs on our edge. I went looking, and as far as I can tell the TTL is NOT part of the default ECMP algorithm on this platform (Jericho chipset)? We certainly haven't tuned this setting one way or another. I would love for someone more experienced with Arista kit to weigh in on this, since it seems like it may be platform-dependent. We do have a support contract, so I can reach out to them directly, too.
These commands are the best I could find browsing the Arista docs/blog/forums.
edit: fixed code formatting
edit2: We're on the EOS 4.20.x version train, fwiw
    #show port-channel load-balance jericho fields | grep TTL
    IP TTL field hashing is OFF

    #show load-balance profile   (output snipped for brevity)
    ---------- default (global) ----------
    IPv4 hash fields:
      Source IPv4 Address is ON
      Protocol is ON
      Time-To-Live is OFF
      Destination IPv4 Address is ON
I don't think MPTCP is going to interact very well with modern high-load sites. Even using unicast, it's going to be hard (impossible?) to ensure that all the individual flows make it to the same NIC queue, which is what you'd really want for performance. I suspect this is part of the reason why it hasn't caught on very much. (Also, Google doesn't seem to want to invest in a good TCP stack on Android; instead they put an additional layer on top of TCP (HTTP/2) and then built a TCP-like transport on top of UDP (QUIC).)
Sad. Apple seemed interested, at one point anyway.
But damn, it does work amazingly well site-to-site. I've managed ~50 Mbps throughput using bbcp (four streams) between Tor .onion services via OnionCat. And ~190 Mbps total from one source transferring simultaneously to five target servers. Each peer had six .onion services.
With six OnionCat interfaces per peer, in MPTCP full-mesh mode, there are up to ~36 subflows per TCP connection. So using bbcp with four streams, there are as many as ~150 tcp6 connections via Tor per bbcp transfer. And with five simultaneous transfers, the MPTCP kernel in the source VPS was managing up to ~750 tcp6 connections. That's impressive!
Apple is still pushing it, which is great. I'm just not sure how well it's going to scale -- MPTCP adds an extra layer of indirection, and extra locking even in the easy case where it's just one server handling an IP. In the load balancing case, people are going to have to teach their load balancers a lot of new tricks to get the subflows aligned. If using anycast, the client is likely to be using multiple networks, so using the same server address seems likely to get to a different PoP; exposing a PoP-specific extra server IP seems like something people don't want to do, since exposing that may make it easier to DoS a single PoP.
I've been debugging an issue where, incidentally, I'm hitting 1 Gbps on a single TCP connection (server to server, with TLS), so I'm not sure why MPTCP is required? ;) But I guess if we had it, I would probably hit 2 Gbps instead of being capped by the one NIC.
Where it gets useful at consumer level is when your phone can hit WiFi and 4G simultaneously, so you can aggregate. And it's even more useful when both WiFi and 4G are iffy, so you seamlessly use one, the other, or both.
And yes, if both of your servers have two gigabit NICs, you can get 2 Gbps. But only if those uplinks aren't bottlenecked at 1 Gbps at the rack or data center level.
That's what caused the issue to surface, but the root cause is offering a stateful service over anycast. There is no hard requirement that you ECMP flows consistently. Spraying may be sub-optimal, but it must be accepted.
I do agree that including TTL by default is weird, though.
Am I the only one who sees a different issue here? The problem is neither stateful service over anycast nor TTL based hashing.
Being a DDoS service, one can imagine the need for a stateful POP, since each edge needs to track TCP state in order to provide DoS protection. At the same time, it is understandable that the state can't be replicated at scale.
As for including TTL in the hashing algorithm, it is aimed at solving link under-utilization, so it is also a valid implementation.
The real bug here is Arista CPE mangling the TTL bits for the Client Hello packets. I always hate it when networking gear meddles with the protocol stack. Sure it gives some flexibility but time and time again, it ends up breaking something somewhere in the path since much of internet networking is a pile of assumptions. Tampering with protocol fields unilaterally is going to break someone's assumptions somewhere down the path.
Again, there is no hard requirement that all packets in a flow take the same route. Keeping the TCP state machine POP-local when running anycast TCP is exposing a buggy TCP implementation to the Internet. I understand that it is desirable to do so, especially under calm routing conditions where it tends to work, but it is not correct.
I agree that Arista shouldn't be mangling TTL here. However, TTL mangling shouldn't break TCP because TTLs can already change under normal operation, e.g. when routes change, which happens all the time on the Internet. In a non-anycast situation, TCP connections would stay open under these conditions.
To clarify, the Arista routers are the ISP's border routers. It was the residential/business customers' SOHO routers, of various makes and models, that were not decrementing the TTL on the initial TCP SYN.
Thanks for pointing that out. Sorry I overlooked that part. Perhaps I got primed by the opening statements, which implied that the problem happened only after placing the new Arista routers, and hastily assumed that it was Arista's routers that mangled the bits.
You are correct, the problem started only after placing new Arista border routers. The previous border routers were not doing ECMP.
The issue was a combination of the use of anycast, diverse Internet transit egress, this model of Arista defaulting to using the packet's TTL in its ECMP hash calculations, and the end-customer router CPE egressing packets that are part of the same TCP connection with variable TTL values. Change any one of those items and the issue would not have shown up.
Also it seems an incorrect use of anycast to terminate the same flow at different machines. At Google, anycast traffic goes through clusters of maglevs that directs individual flows to the same endpoint consistently.
https://ai.google/research/pubs/pub44824
Also I'd like to echo that Fastly's NOC is great. Super responsive and smart.
Disclaimer: I work at Google