BGP and IPv6 routing courses

Several times a year I teach two training courses, one about BGP and one about IPv6. The BGP course is half theory and half hands-on practice, and so is the new IPv6 routing course. Previously, we did an IPv6 course without a hands-on part.
The courses consist of a theory part in the morning and a practical part in the afternoon, where the participants implement several assignments on a Cisco router (in groups of two participants per router).
The next dates are February 2 for the BGP course in Dutch and February 3 for the IPv6 routing course in Dutch. (There will be dates for the courses in English later in 2015.) Go to the NL-ix website to find more information and sign up. The location will be The Hague, Netherlands.
Interdomain Routing & IPv6 News
Yesterday, I wrote:
❝In almost a week, I received zero IPv4 "too big" messages.
However, João Taveira Araújo told me that he sees a good number of IPv4 ICMP "too big" messages. Those seem to result from SSL traffic. At first, that seemed strange, as SSL is just payload for TCP so TCP MSS clamping should work on SSL sessions the same as on non-SSL sessions.
But then I realized that this traffic could be SSL VPNs. If a VPN gateway takes an IPv4 packet and encapsulates that in SSL in a single TCP segment, then the size of those TCP segments isn't influenced by MSS clamping, so too big messages will be generated if the path MTU is smaller than the MTUs of the endpoints of the SSL connection. I wonder if those SSL VPN implementations handle path MTU discovery properly, though.
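A quick back-of-the-envelope sketch of the point above. The overhead numbers are assumptions for illustration (TLS record overhead varies with cipher and padding), but they show why, if the VPN gateway puts a whole inner packet into a single TCP segment, the outer segment can exceed a reduced path MTU no matter what MSS was negotiated:

```python
# Illustrative arithmetic, not measured values: an SSL VPN wraps a full
# inner IP packet in a TLS record, so the outer TCP segment size is
# driven by the inner packet, not by any clamped MSS.

INNER_PACKET = 1500   # inner IPv4 packet carried through the VPN
TLS_OVERHEAD = 29     # assumed TLS record header + MAC/padding
TCP_IP_HEADERS = 40   # outer 20-byte IPv4 header + 20-byte TCP header

outer_packet = INNER_PACKET + TLS_OVERHEAD + TCP_IP_HEADERS

path_mtu = 1492       # e.g. a PPPoE hop somewhere along the path
print(outer_packet, outer_packet > path_mtu)  # 1569 True: a "too big" results
```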
Please read yesterday's post Maximum packet sizes on the internet first. There, I looked at the maximum supported packet sizes that are included in the TCP MSS option in HTTP requests to my server. Today I'll look at the values in ICMP(v6) "too big" messages.
(If you don't need the history lesson, skip ahead to "IPv4 measurements".)
The computer making that HTTP request doesn't necessarily know about the MTU sizes supported along the entire path that the data flows over. If the hops in between support larger packets, there is no issue, although it would be more efficient if we could make use of that ability.
The problems start if there's one or more hops along the way that can't support the packet size that the two computers that are communicating want to use. These days, pretty much everything is Ethernet, so a 1500-byte MTU would be expected. (Ironically, most of the technologies that are now replaced by Ethernet, such as IP over ATM and IP over FDDI, support packet sizes much larger than 1500 bytes.)
However, various types of tunnels (including VPNs) as well as PPP over Ethernet may reduce the MTU for a given hop to less than 1500. The original way that IP(v4) handled this was through fragmentation. So if a host (computer) sends a 1500-byte IPv4 packet, but the next hop is a PPPoE link that can only handle 1492 bytes, the router breaks the original packet into two fragments. Almost always, the first fragment will be a 1492-byte one, and the second holds the remaining 8 bytes of payload. Each fragment has its own copy of the IP header, so the second fragment's total size is 28 bytes. (It would be much better to make two 760-byte packets...)
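The fragment sizes above follow from simple arithmetic; this sketch reproduces it (a simplification of real IPv4 fragmentation, which also copies options and sets offset/more-fragments fields):

```python
def fragment_sizes(packet_size, mtu, ip_header=20):
    """Split an IPv4 packet into fragment sizes for a smaller-MTU hop.

    Fragment offsets are counted in 8-byte units, so every fragment's
    payload (except the last one's) must be a multiple of 8 bytes.
    """
    payload = packet_size - ip_header
    max_frag_payload = (mtu - ip_header) // 8 * 8  # round down to 8-byte units
    sizes = []
    while payload > 0:
        chunk = min(payload, max_frag_payload)
        sizes.append(chunk + ip_header)  # each fragment gets its own IP header
        payload -= chunk
    return sizes

print(fragment_sizes(1500, 1492))  # [1492, 28]: 1472 + 8 bytes of payload
```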
Fragmentation creates a whole bunch of problems. First of all, it takes a lot of extra time and processing for the router to fragment packets. Then the receiving host has to spend extra CPU cycles to reassemble the packet. If the communication rate is high enough, a lost fragment can easily lead to the situation where fragments of two different packets are reassembled. Usually, the TCP/UDP checksum will catch this, but that checksum isn't very strong, so one in every 65000 or so instances, it won't catch the problem and the communication will be corrupted. Last but not least, because the TCP/UDP port numbers are only present in the first fragment, NATs and firewalls have a hard time dealing with fragments.
The solution is path MTU discovery (PMTUD), which was specified in 1990 in RFC 1191. In order to avoid fragmentation in routers, hosts try to figure out the largest packets they can send to another host without fragmentation. They do this by setting the DF (don't fragment) bit in the IPv4 header. With that bit set, routers aren't allowed to fragment the packet. Instead, they send back a "fragmentation needed but don't fragment bit set" ICMP message. The PMTUD specification updated this message's format to include the supported packet size. This makes PMTUD very easy: simply send packets as large as you like with the DF bit set, and if you get "too big" messages back, lower your packet size to the value in the too big message.
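The loop described above can be sketched in a few lines. This is a simulation, not a real network probe: `send_probe` is a hypothetical stand-in for transmitting a DF-marked packet and catching the resulting ICMP message.

```python
def discover_path_mtu(send_probe, initial_mtu=1500, minimum=68):
    """Simplified PMTUD loop: send DF-marked probes, shrink on "too big".

    send_probe(size) stands in for sending a packet with DF set; it
    returns None on success, or the next-hop MTU reported in the ICMP
    "too big" message (RFC 1191 routers include this value).
    """
    mtu = initial_mtu
    while mtu >= minimum:
        reported = send_probe(mtu)
        if reported is None:
            return mtu       # probe got through: this size works
        mtu = reported       # drop straight to the reported size
    return minimum

# Simulated path with a 1480-byte bottleneck (e.g. an IPv6-in-IPv4 tunnel):
path_mtu = 1480
probe = lambda size: None if size <= path_mtu else path_mtu
print(discover_path_mtu(probe))  # 1480
```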
However, this introduced a new problem: if the too big message doesn't make it back, either because the router didn't generate it, it got lost along the way or because it was filtered by a firewall, the host keeps resending large packets that never make it to their destination. To add insult to injury, the initial TCP handshake uses small packets, so that part works, but then the packets that actually contain data never make it to the other side. This is a PMTUD black hole.
PMTUD is optional with IPv4, although it's universally used, even by people who filter the ICMP "too big" messages. With IPv6, if you want to send packets larger than 1280 bytes, PMTUD is mandatory, as routers aren't allowed to fragment IPv6 packets. And unlike with IPv4, PMTUD also works for non-TCP protocols (such as UDP) with IPv6, as the source host's networking stack will fragment non-TCP packets before transmission. Enough background, on to the...
IPv4 measurements

Over the course of more than five days, my server received 52758 IPv4 ICMP messages. Those were:
In almost a week, I received zero IPv4 "too big" messages.
So it seems that in the IPv4 world, path MTU discovery is dead. It turns out that so many people filter ICMP messages that if you rely on PMTUD for IPv4, there's just too much breakage. So what (home) routers that sit in front of a reduced-MTU link do instead is "MSS clamping": they rewrite the value in the TCP MSS option to what's supported on the interface they're about to transmit the packet over.
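A minimal sketch of what such a router does to a passing SYN. Real clamping happens in the forwarding path (and also fixes up the TCP checksum); this only shows the option rewrite itself:

```python
import struct

def clamp_mss(tcp_options, mtu, ip_header=20, tcp_header=20):
    """Rewrite the MSS option (kind 2) in a SYN's TCP option bytes so it
    never exceeds mtu minus the IP and TCP headers."""
    limit = mtu - ip_header - tcp_header
    opts = bytearray(tcp_options)
    i = 0
    while i < len(opts):
        kind = opts[i]
        if kind == 0:                      # end of option list
            break
        if kind == 1:                      # NOP: single byte, no length
            i += 1
            continue
        length = opts[i + 1]
        if kind == 2:                      # MSS: kind, len=4, 16-bit value
            mss = struct.unpack_from("!H", opts, i + 2)[0]
            struct.pack_into("!H", opts, i + 2, min(mss, limit))
        i += length
    return bytes(opts)

# A SYN advertising MSS 1460 crosses a router in front of a PPPoE link:
syn_opts = struct.pack("!BBH", 2, 4, 1460)
print(struct.unpack("!BBH", clamp_mss(syn_opts, 1492)))  # (2, 4, 1452)
```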
(Please don't read this as "it's OK to filter ICMP 'too big' messages". It could easily be that some users still depend on them.)
IPv6 measurements

Over the same five days, my server received 57244 ICMPv6 messages:
The 84 ICMPv6 too big messages came from 9 unique sources, although one of those is a tunnel gateway that returned two different sizes for (presumably) two different tunnels. So that's 10 values, with the following distribution:
One of those 1480 results is my own connection at home, which uses an IPv6-in-IPv4 tunnel terminated on my home router. So my computers at home don't know the path MTU is 1480 and depend on PMTUD, which seems to work without obvious problems. Maybe two or three times a week I encounter a page that won't load, which may or may not be an IPv6-related issue, which in turn may or may not be a PMTUD issue.
Hopefully, IPv6 won't lose PMTUD to ICMP filtering like IPv4 did. MSS clamping is effective for TCP, but it doesn't work for non-TCP protocols or IPsec-protected communication. It's also a burden on routers.
Stéphane Bortzmeyer replied to yesterday's post with a link to this 20-year-old (to the day!) message, which has results for very similar measurements. The results are different in interesting ways, with the real stunner being that in 1994, 94% of all systems could handle 1500 bytes, but in 2014, this is down to 65%.
After some heated discussions about packet sizes on the mailing list of the IETF v6ops working group, I decided to do some measurements to find out what maximum packet sizes are supported on today's internet. I did this by capturing two types of packets: the ICMP "too big" messages that routers send to tell a computer to send smaller packets, and the first packet of a TCP session, which contains the MSS option. The maximum segment size (MSS) option is used in TCP sessions to tell the other side the maximum packet size we can receive. This depends on the maximum transfer unit (MTU) of the hardware, which may be further reduced by system administrators.
The Ethernet standard uses an MTU of 1500 bytes, although a lot of Ethernet hardware can support more, such as 9000-byte "jumboframes". Wi-Fi also uses 1500 bytes to be compatible with Ethernet. However, sometimes one protocol needs to be tunneled over another protocol, such as IPv6 over IPv4 (over Ethernet) or PPP over Ethernet, which reduces the supported packet size to 1480 or 1492, respectively. The IPv6 specifications require that a minimum MTU of 1280 bytes is supported. IPv4 has no minimum MTU. Note that all of this is about the maximum packet size, it is of course perfectly fine to send smaller packets.
TCP (and UDP) use segments that are put inside IP packets, which are then transmitted inside Ethernet frames. A 1500-byte IPv4 packet supports 1460-byte TCP segments (1500 bytes minus the 20-byte IPv4 header and the 20-byte TCP header). This 1500-byte IP packet is transmitted as a 1518-byte Ethernet frame, although some people only count 14 bytes for the Ethernet header, ignoring the 4-byte checksum at the end of the Ethernet frame. Because the IPv6 header is 40 bytes, a 1500-byte IPv6 packet can only hold a 1440-byte TCP segment. I'll be talking about IP MTU sizes rather than segment/MSS sizes to make it easier to compare IPv4 and IPv6 results.
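The header arithmetic above is mechanical enough to put in a few lines, which also shows where the common tunnel values come from:

```python
# MSS = MTU minus the IP and TCP headers; tunnels first shave their own
# overhead off the MTU.
IPV4_HEADER, IPV6_HEADER, TCP_HEADER = 20, 40, 20

def mss_for(mtu, ip_header):
    return mtu - ip_header - TCP_HEADER

print(mss_for(1500, IPV4_HEADER))                # 1460: IPv4 on plain Ethernet
print(mss_for(1500, IPV6_HEADER))                # 1440: IPv6 on plain Ethernet
print(mss_for(1500 - IPV4_HEADER, IPV6_HEADER))  # 1420: IPv6-in-IPv4 tunnel
print(mss_for(1500 - 8, IPV4_HEADER))            # 1452: IPv4 over PPPoE
```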
Over the better part of a week, my server received 41753 incoming TCP SYN packets with an MSS option on port 80. Another 140 packets didn't have the MSS option and looked like they were mostly TCP-based traceroute packets. 24246 packets were IPv4 packets, coming from 4164 unique IP addresses. 17507 were IPv6 packets, which came from 227 unique IP addresses. Turns out, most of the IPv6 traffic on my server is from bots that check if I've added any new content to the site. Some of them use the same address each time, while others keep using different addresses but, strangely, the same source port numbers (12000 - 12006). I removed these addresses to keep them from drowning out the real data.
The data showed no fewer than 72 different MTU sizes for IPv4, ranging from 576 to 9198 bytes. However, both of these extremes only showed up once, and other values below 1280 and above 1500 are also quite rare:
I found the 9001 value quite curious; computers really like to work with nice round multiples of 2, 4 or 8 bytes. 9001, on the other hand, is a prime number. Turns out that 9001 bytes is used in Amazon's datacenters, where some of the bots that index my website reside. These are the more common MTU sizes advertised in the TCP MSS option:
1300 and 1400 look like someone set them manually; 1300 is also a common VPN MTU. 1440 bytes seems to be hardcoded in some home routers. 1460 could indicate IPv4-over-IPv6 tunneling. 1470 seems to be used by a number of broadband ISPs and 1492 results from PPP over Ethernet (PPPoE). Last but not least, just under two thirds of IPv4 visitors support the Ethernet MTU of 1500.
These are the results for IPv6 with the < 1% values removed (there were no values below 1280 and above 1500):
1280 and 1480 are probably IPv6-in-IPv4 tunnels and 1428 AYIYA tunnels. 1472 could be IPv6-in-UDP-in-IPv4 tunnels or IPv6-in-IPv4-over-PPPoE. The image below shows the cumulative frequency of MTU sizes for both IPv4 (red) and IPv6 (blue), where the line shows how many systems support a given MTU value, starting at 99.98% for 1200 and ending at 65.56% for 1500 (for IPv4).
The 90th percentile MTU size is 1428 for IPv6 and 1440 for IPv4. Obviously 100% of IPv6 systems support 1280, but 99.7% of IPv4 systems also support this size.
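For readers who want to reproduce this kind of cumulative curve from their own captures, here is a sketch of the computation. The counts below are made up for the example, not the post's actual data:

```python
from collections import Counter

# Hypothetical MSS-derived MTU counts (illustrative only) to show how the
# cumulative "supports at least this MTU" percentages are computed.
mtu_counts = Counter({1280: 50, 1400: 300, 1440: 600,
                      1480: 400, 1492: 500, 1500: 2314})
total = sum(mtu_counts.values())

cumulative = {}
for mtu in sorted(mtu_counts):
    # every host advertising this MTU or a larger one supports this size
    supporting = sum(n for m, n in mtu_counts.items() if m >= mtu)
    cumulative[mtu] = round(100 * supporting / total, 2)

print(cumulative)
```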
The MSS reflects the maximum size that the systems at both ends of a connection think they can use. However, there may be a bottleneck somewhere along the path. In that case, routers send back an ICMP Packet Too Big message. Tomorrow, I'll look at those.
I'm in Honolulu for the IETF meeting this week. As always, on Sunday morning before the meeting proper starts, there's the IEPG, where there's always interesting stuff being presented, usually from the operational side of networking.
Today, there were talks about IPv6 packets with extension headers being dropped, routing table and packet size issues by Geoff Huston, and a discussion on Shim6 and Multipath TCP (MPTCP) failure recovery by Brian Carpenter. All good stuff. However, at the end of Brian's presentation, Lorenzo Colitti thanked Brian for the interesting presentation about the performance of undeployable protocol A vs undeployable protocol B.
I kind of get why Shim6 and MPTCP are considered undeployable: you need addresses from two different ISPs, and you need to make sure that packets with addresses from ISP 1 go to ISP 1 and those with addresses from ISP 2 go to ISP 2. If not, BCP 38 ingress filtering will block the packets. The trouble is that the RIRs started giving out provider-independent IPv6 addresses shortly before Shim6 was finished, so larger networks simply use those. Shim6 never got any traction, so if you want to use it now you'll find that nobody else uses it, and you need it at both ends for it to work. It's still somewhat early days for MPTCP, but it doesn't seem to be setting the world on fire, either.
But Lorenzo was talking about the fact that MPTCP uses TCP options that are filtered out by firewalls. Brian already mentioned that the Shim6 extension header is also often filtered, and suggested that probe packets should look like normal data packets.
However, when both of these were designed, those issues were considered. Obviously it would have been great if we could have implemented these two protocols without the need for additional options or headers, but I don't see how that would have been possible. So the next best thing was to make sure that if the options, or the packets containing options, are filtered, communication still works without the benefits of Shim6 or MPTCP. This means the protocols were never undeployable: you can turn them on by default without any issues. If the headers/options are filtered, you simply don't get any benefit, but everything still works. For paths where the options/headers are left alone, Shim6 and MPTCP get to do their thing and you benefit from being able to use additional paths. Over time, hopefully firewall operators realize these protocols don't cause any harm and stop filtering them.
Unfortunately, there are protocols that really do turn out to be undeployable, because firewalls or bad implementations break any communication that uses those protocols.
My Books: "BGP" and "Running IPv6"

On this page you can find more information about my book "BGP". Or you can jump immediately to chapter 6, "Traffic Engineering" (approx. 150kB), which O'Reilly has put online as a sample chapter. Information about the Japanese translation can be found here.
"no synchronization"When you run BGP on two or more routers, you need to configure internal BGP (iBGP) between all of them. If those routers are Cisco routers, they won't work very well unless you configure them with no synchronization.
The no synchronization configuration command tells the routers that you don't want them to "synchronize" iBGP and the internal routing protocol such as OSPF. The idea behind synchronizing is that when you have two iBGP speaking routers with another router in between that doesn't speak BGP, the non-BGP router in the middle needs to have the same routing information as the BGP routers, or there could be routing loops. The way to make sure that the non-BGP router is aware of the routing information in BGP, is to redistribute the BGP routing information into the internal routing protocol.
By default, Cisco routers expect you to do this, and wait for the BGP routing information to show up in an internal routing protocol before they'll use any routes learned through iBGP. However, these days redistributing full BGP routing into another protocol isn't really done any more, because it's easier to simply run BGP on any routers in the middle.
But if you don't redistribute BGP into internal routing, the router will still wait for the BGP routes to show up in an internal routing protocol, which will never happen, so the iBGP routes are never used.
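To illustrate, a minimal iBGP configuration with synchronization disabled might look like this (the AS number and neighbor address are made up for the example):

```
router bgp 65000
 no synchronization
 neighbor 10.0.0.2 remote-as 65000
 neighbor 10.0.0.2 update-source Loopback0
```

With no synchronization set, the router uses routes learned over the iBGP session to 10.0.0.2 without waiting for them to appear in OSPF or another interior protocol.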
BGP Security

BGP has some security holes. This sounds very bad, and of course it isn't good, but don't be overly alarmed. There are basically two problems: sessions can be hijacked, and incorrect information can be injected into the BGP tables, either by someone who manages to hijack a session or by someone who has a legitimate BGP session.
Session hijacking is hard to do for someone who can't see the TCP sequence number for the TCP session the BGP protocol runs over, and if there are good anti-spoofing filters it is even impossible. And of course using the TCP MD5 password option (RFC 2385) makes all of this nearly impossible even for someone who can sniff the BGP traffic.
Nearly all ISPs filter BGP information from customers, so in most cases it isn't possible to successfully inject false information. However, filtering on peering sessions between ISPs isn't as widespread, although some networks do this. A rogue ISP could do some real damage here.
There are now two efforts underway to better secure BGP:
The IETF RPSEC (routing protocol security) working group is active in this area.
IPv6

BGPexpert is available over IPv6 as well as IPv4. www.bgpexpert.com has both an IPv4 and an IPv6 address. You can see which one you're connected to at the bottom of the page. Alternatively, you can click on www.ipv6.bgpexpert.com to see if you can connect over IPv6. This URL only has an IPv6 address.
What is BGPexpert.com?

BGPexpert.com is a website dedicated to Internet routing issues. What we want is for packets to find their way from one end of the globe to the other, and to make the jobs of the people who make this happen a little easier.
Ok, but what is BGP?

Have a look at the "what is BGP" page. There is also a list of BGP and interdomain routing terms on this page.
BGP and Multihoming

If you are not an ISP, your main reason to be interested in BGP will probably be to multihome. By connecting to two or more ISPs at the same time, you are "multihomed" and you no longer have to depend on a single ISP for your network connectivity.
This sounds simple enough, but as always, there is a catch. For regular customers, it's the Internet Service Provider who makes sure the rest of the Internet knows where packets have to be sent to reach their customer. If you are multihomed, you can't let your ISP do this, because then you would have to depend on a single ISP again. This is where the BGP protocol comes in: this is the protocol used to carry this information from ISP to ISP. By announcing reachability information for your network to two ISPs, you can make sure everybody still knows how to reach you if one of those ISPs has an outage.
For those of you interested in multihoming in IPv6 (which is pretty much impossible at the moment), have a look at the "IPv6 multihoming solutions" page.
Are you a BGP expert? Take the test to find out!
These questions are somewhat Cisco-centric. We now also have another set of questions and answers for self-study purposes.