From 7ff4f86e11217269d01dc580758317522a75daf5 Mon Sep 17 00:00:00 2001 From: magalii Date: Tue, 29 Jan 2019 00:42:20 +0100 Subject: [PATCH 1/3] Corrections of misprints, language and ponctuation in part Principles --- book-2nd/principles/dv.rst | 48 ++--- book-2nd/principles/naming.rst | 35 ++-- book-2nd/principles/network.rst | 155 ++++++++-------- book-2nd/principles/reliability.rst | 4 +- book-2nd/principles/security.rst | 85 +++++---- book-2nd/principles/sharing.rst | 269 ++++++++++++++-------------- book-2nd/principles/transport.rst | 144 ++++++++------- 7 files changed, 367 insertions(+), 373 deletions(-) diff --git a/book-2nd/principles/dv.rst b/book-2nd/principles/dv.rst index 1d011b9..81f081d 100644 --- a/book-2nd/principles/dv.rst +++ b/book-2nd/principles/dv.rst @@ -6,7 +6,7 @@ Distance vector routing ----------------------- -Distance vector routing is a simple distributed routing protocol. Distance vector routing allows routers to automatically discover the destinations reachable inside the network as well as the shortest path to reach each of these destinations. The shortest path is computed based on `metrics` or `costs` that are associated to each link. We use `l.cost` to represent the metric that has been configured for link `l` on a router. +Distance vector routing is a simple distributed routing protocol. Distance vector routing allows routers to automatically discover the destinations reachable inside the network as well as the shortest path to reach each of these destinations. The shortest path is computed based on `metrics` or `costs` that are associated to each link. We use `l.cost` to represent the metric that has been configured for link `l` on a router. Each router maintains a routing table. The routing table `R` can be modelled as a data structure that stores, for each known destination address `d`, the following attributes : @@ -18,14 +18,14 @@ A router that uses distance vector routing regularly sends its distance vector o .. code-block:: python - Every N seconds: + Every N seconds: v=Vector() for d in R[]: # add destination d to vector v.add(Pair(d,R[d].cost)) for i in interfaces # send vector v on this interface - send(v,i) + send(v,i) When a router boots, it does not know any destination in the network and its routing table only contains itself. It thus sends to all its neighbours a distance vector that contains only its address at a distance of `0`. When a router receives a distance vector on link `l`, it processes it as follows. @@ -35,24 +35,24 @@ When a router boots, it does not know any destination in the network and its rou # V : received Vector # l : link over which vector is received def received(V,l): - # received vector from link l + # received vector from link l for d in V[] if not (d in R[]) : - # new route + # new route R[d].cost=V[d].cost+l.cost R[d].link=l R[d].time=now else : # existing route, is the new better ? if ( ((V[d].cost+l.cost) < R[d].cost) or ( R[d].link == l) ) : - # Better route or change to current route + # Better route or change to current route R[d].cost=V[d].cost+l.cost R[d].link=l R[d].time=now -The router iterates over all addresses included in the distance vector. If the distance vector contains an address that the router does not know, it inserts the destination inside its routing table via link `l` and at a distance which is the sum between the distance indicated in the distance vector and the cost associated to link `l`. 
If the destination was already known by the router, it only updates the corresponding entry in its routing table if either : - +The router iterates over all addresses included in the distance vector. If the distance vector contains an address that the router does not know, it inserts the destination inside its routing table via link `l` and at a distance which is the sum between the distance indicated in the distance vector and the cost associated to link `l`. If the destination was already known by the router, it only updates the corresponding entry in its routing table if either : + - the cost of the new route is smaller than the cost of the already known route `( (V[d].cost+l.cost) < R[d].cost)` - the new route was learned over the same link as the current best route towards this destination `( R[d].link == l)` @@ -63,43 +63,43 @@ To understand the operation of a distance vector protocol, let us consider the n .. figure:: ../../book/network/svg/dv-1.png :align: center - :scale: 100 + :scale: 100 Operation of distance vector routing in a simple network Assume that `A` is the first to send its distance vector `[A=0]`. - - `B` and `D` process the received distance vector and update their routing table with a route towards `A`. + - `B` and `D` process the received distance vector and update their routing table with a route towards `A`. - `D` sends its distance vector `[D=0,A=1]` to `A` and `E`. `E` can now reach `A` and `D`. - `C` sends its distance vector `[C=0]` to `B` and `E` - `E` sends its distance vector `[E=0,D=1,A=2,C=1]` to `D`, `B` and `C`. `B` can now reach `A`, `C`, `D` and `E` - `B` sends its distance vector `[B=0,A=1,C=1,D=2,E=1]` to `A`, `C` and `E`. `A`, `B`, `C` and `E` can now reach all destinations. - - `A` sends its distance vector `[A=0,B=1,C=2,D=1,E=2]` to `B` and `D`. + - `A` sends its distance vector `[A=0,B=1,C=2,D=1,E=2]` to `B` and `D`. At this point, all routers can reach all other routers in the network thanks to the routing tables shown in the figure below. .. figure:: ../../book/network/svg/dv-full.png :align: center - :scale: 100 + :scale: 100 Routing tables computed by distance vector in a simple network -To deal with link and router failures, routers use the timestamp stored in their routing table. As all routers send their distance vector every `N` seconds, the timestamp of each route should be regularly refreshed. Thus no route should have a timestamp older than `N` seconds, unless the route is not reachable anymore. In practice, to cope with the possible loss of a distance vector due to transmission errors, routers check the timestamp of the routes stored in their routing table every `N` seconds and remove the routes that are older than :math:`3 \times N` seconds. When a router notices that a route towards a destination has expired, it must first associate an :math:`\infty` cost to this route and send its distance vector to its neighbours to inform them. The route can then be removed from the routing table after some time (e.g. :math:`3 \times N` seconds), to ensure that the neighbouring routers have received the bad news, even if some distance vectors do not reach them due to transmission errors. +To deal with link and router failures, routers use the timestamp stored in their routing table. As all routers send their distance vector every `N` seconds, the timestamp of each route should be regularly refreshed. Thus no route should have a timestamp older than `N` seconds, unless the route is not reachable anymore. 
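These two rules, updating the table when a distance vector arrives and expiring the routes that are no longer refreshed, can be summarised by a small executable sketch. The fragment below is only an illustration, not a router implementation : the routing table is a plain Python dictionary, `Entry` is a hypothetical record holding the `cost`, `link` and `time` attributes, and expired routes are simply deleted instead of first being advertised with an :math:`\infty` cost, as explained next.

.. code-block:: python

    # Illustrative sketch of the distance vector update and expiration
    # rules. The Entry record and the dictionary used as a routing table
    # are simplifications, not the data structures of a real router.
    import time

    class Entry:
        def __init__(self, cost, link):
            self.cost = cost
            self.link = link
            self.time = time.time()

    R = {}  # routing table : destination address -> Entry

    def received(V, l, l_cost):
        # V maps each destination advertised by the neighbour to its cost
        for d, d_cost in V.items():
            if (d not in R) or (d_cost + l_cost < R[d].cost) or (R[d].link == l):
                # new destination, better route, or update received over
                # the link used by the current route
                R[d] = Entry(d_cost + l_cost, l)

    def expire(N):
        # purge the routes that have not been refreshed for 3*N seconds
        for d in list(R):
            if time.time() - R[d].time > 3 * N:
                del R[d]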
In practice, to cope with the possible loss of a distance vector due to transmission errors, routers check the timestamp of the routes stored in their routing table every `N` seconds and remove the routes that are older than :math:`3 \times N` seconds. When a router notices that a route towards a destination has expired, it must first associate an :math:`\infty` cost to this route and send its distance vector to its neighbours to inform them. The route can then be removed from the routing table after some time (e.g. :math:`3 \times N` seconds), to ensure that the neighbouring routers have received the bad news, even if some distance vectors do not reach them due to transmission errors. Consider the example above and assume that the link between routers `A` and `B` fails. Before the failure, `A` used `B` to reach destinations `B`, `C` and `E` while `B` only used the `A-B` link to reach `A`. The affected entries timeout on routers `A` and `B` and they both send their distance vector. - `A` sends its distance vector :math:`[A=0,B=\infty,C=\infty,D=1,E=\infty]`. `D` knows that it cannot reach `B` anymore via `A` - `D` sends its distance vector :math:`[D=0,B=\infty,A=1,C=2,E=1]` to `A` and `E`. `A` recovers routes towards `C` and `E` via `D`. - `B` sends its distance vector :math:`[B=0,A=\infty,C=1,D=2,E=1]` to `E` and `C`. `C` learns that there is no route anymore to reach `A` via `B`. - - `E` sends its distance vector :math:`[E=0,A=2,C=1,D=1,B=1]` to `D`, `B` and `C`. `D` learns a route towards `B`. `C` and `B` learn a route towards `A`. - + - `E` sends its distance vector :math:`[E=0,A=2,C=1,D=1,B=1]` to `D`, `B` and `C`. `D` learns a route towards `B`. `C` and `B` learn a route towards `A`. + At this point, all routers have a routing table allowing them to reach all another routers, except router `A`, which cannot yet reach router `B`. `A` recovers the route towards `B` once router `D` sends its updated distance vector :math:`[A=1,B=2,C=2,D=1,E=1]`. This last step is illustrated in figure :ref:`fig-afterfailure`, which shows the routing tables on all routers. .. _fig-afterfailure: .. figure:: ../../book/network/svg/dv-failure-2.png :align: center - :scale: 100 + :scale: 100 Routing tables computed by distance vector after a failure @@ -118,40 +118,40 @@ This count to infinity problem occurs because router `A` advertises to router `D .. code-block:: python - Every N seconds: + Every N seconds: # one vector for each interface for l in interfaces: v=Vector() for d in R[]: if (R[d].link != l) : - v=v+Pair(d,R[d.cost]) + v=v+Pair(d,R[d].cost) send(v) # end for d in R[] - #end for l in interfaces + #end for l in interfaces This technique is called `split-horizon`. With this technique, the count to infinity problem would not have happened in the above scenario, as router `A` would have advertised :math:`[A=0]`, since it learned all its other routes via router `D`. Another variant called `split-horizon with poison reverse` is also possible. Routers using this variant advertise a cost of :math:`\infty` for the destinations that they reach via the router to which they send the distance vector. This can be implemented by using the pseudo-code below. .. 
code-block:: python - Every N seconds: + Every N seconds: for l in interfaces: # one vector for each interface v=Vector() for d in R[]: if (R[d].link != l) : - v=v+Pair(d,R[d.cost]) + v=v+Pair(d,R[d].cost) else: v=v+Pair(d,infinity); send(v) # end for d in R[] - #end for l in interfaces + #end for l in interfaces -Unfortunately, split-horizon, is not sufficient to avoid all count to infinity problems with distance vector routing. Consider the failure of link `A-B` in the network of four routers below. +Unfortunately, split-horizon is not sufficient to avoid all count to infinity problems with distance vector routing. Consider the failure of link `A-B` in the network of four routers below. .. figure:: ../../book/network/svg/dv-infinity.png :align: center - :scale: 100 + :scale: 100 Count to infinity problem diff --git a/book-2nd/principles/naming.rst b/book-2nd/principles/naming.rst index 7343ca8..dff66fe 100644 --- a/book-2nd/principles/naming.rst +++ b/book-2nd/principles/naming.rst @@ -8,18 +8,18 @@ Naming and addressing The network and the transport layers rely on addresses that are encoded as fixed size bit strings. A network layer address uniquely identifies a host. Several transport layer entities can use the service of the same network layer. For example, a reliable transport protocol and a connectionless transport protocol can coexist on the same host. In this case, the network layer multiplexes the segments produced by the two protocols. This multiplexing is usually achieved by placing in the network packet header a field that indicates which transport protocol produced and should process the segment. Given that there are few different transport protocols, this field does not need to be long. The port numbers play a similar role in the transport layer since they enable it to multiplex data from several application processes. -While addresses are natural for the network and transport layer entities, human users prefer to use names when interacting with servers. Names can be encoded as a character string and a mapping services allows applications to map a name into the corresponding address. Using names is friendlier for the human users than addresses, but it also provides a level of indirection which is very useful in various situations. Before looking at these benefits of names, it is useful to discuss how names are used on the Internet. +While addresses are natural for the network and transport layer entities, human users prefer to use names when interacting with servers. Names can be encoded as a character string and a mapping service allows applications to map a name into the corresponding address. Using names is friendlier for human users than addresses, but it also provides a level of indirection which is very useful in various situations. Before looking at these benefits of names, it is useful to discuss how names are used on the Internet. -In the early days of the Internet, there were only a few number of hosts (mainly minicomputers) connected to the network. The most popular applications were remote login and file transfer. By 1983, there were already five hundred hosts attached to the Internet. Each of these hosts were identified by a unique IPv4 address. Forcing human users to remember the IPv4 addresses of the remote hosts that they want to use was not user-friendly. Human users prefer to remember names, and use them when needed. Using names as aliases for addresses is a common technique in Computer Science.
It simplifies the development of applications and allows the developer to ignore the low level details. For example, by using a programming language instead of writing machine code, a developer can write software without knowing whether the variables that it uses are stored in memory or inside registers. +In the early days of the Internet, there were only a small number of hosts (mainly minicomputers) connected to the network. The most popular applications were remote login and file transfer. By 1983, there were already five hundred hosts attached to the Internet. Each of these hosts was identified by a unique IPv4 address. Forcing human users to remember the IPv4 addresses of the remote hosts that they want to use was not user-friendly. Human users prefer to remember names, and use them when needed. Using names as aliases for addresses is a common technique in Computer Science. It simplifies the development of applications and allows the developer to ignore the low level details. For example, by using a programming language instead of writing machine code, a developer can write software without knowing whether the variables that it uses are stored in memory or inside registers. -Because names are at a higher level than addresses, they allow (both in the example of programming above, and on the Internet) to treat addresses as mere technical identifiers, which can change at will. Only the names are stable. +Because names are at a higher level than addresses, they make it possible (both in the example of programming above, and on the Internet) to treat addresses as mere technical identifiers, which can change at will. Only the names are stable. .. index:: Network Information Center, hosts.txt -The first solution that allowed applications to use names was the :term:`hosts.txt` file. This file is similar to the symbol table found in compiled code. It contains the mapping between the name of each Internet host and its associated IP address [#fhosts]_. It was maintained by SRI International that coordinated the Network Information Center (NIC). When a new host was connected to the network, the system administrator had to register its name and IP address at the NIC. The NIC updated the :term:`hosts.txt` file on its server. All Internet hosts regularly retrieved the updated :term:`hosts.txt` file from the server maintained by SRI_. This file was stored at a well-known location on each Internet host (see :rfc:`952`) and networked applications could use it to find the IP address corresponding to a name. +The first solution that allowed applications to use names was the :term:`hosts.txt` file. This file is similar to the symbol table found in compiled code. It contains the mapping between the name of each Internet host and its associated IP address [#fhosts]_. It was maintained by SRI International, which coordinated the Network Information Center (NIC). When a new host was connected to the network, the system administrator had to register its name and IP address at the NIC. The NIC updated the :term:`hosts.txt` file on its server. All Internet hosts regularly retrieved the updated :term:`hosts.txt` file from the server maintained by SRI_. This file was stored at a well-known location on each Internet host (see :rfc:`952`) and networked applications could use it to find the IP address corresponding to a name. -A :term:`hosts.txt` file can be used when there are up to a few hundred hosts on the network. However, it is clearly not suitable for a network containing thousands or millions of hosts.
A key issue in a large network is to define a suitable naming scheme. The ARPANet initially used a flat naming space, i.e. each host was assigned a unique name. To limit collisions between names, these names usually contained the name of the institution and a suffix to identify the host inside the institution (a kind of poor man's hierarchical naming scheme). On the ARPANet few institutions had several hosts connected to the network. +A :term:`hosts.txt` file can be used when there are up to a few hundred hosts on the network. However, it is clearly not suitable for a network containing thousands or millions of hosts. A key issue in a large network is to define a suitable naming scheme. The ARPANet initially used a flat naming space, i.e. each host was assigned a unique name. To limit collisions between names, these names usually contained the name of the institution and a suffix to identify the host inside the institution (a kind of poor man's hierarchical naming scheme). On the ARPANet few institutions had several hosts connected to the network. However, the limitations of a flat naming scheme became clear before the end of the ARPANet and :rfc:`819` proposed a hierarchical naming scheme. While :rfc:`819` discussed the possibility of organising the names as a directed graph, the Internet opted eventually for a tree structure capable of containing all names. In this tree, the top-level domains are those that are directly attached to the root. The first top-level domain was `.arpa` [#fdnstimeline]_. This top-level name was initially added as a suffix to the names of the hosts attached to the ARPANet and listed in the `hosts.txt` file. In 1984, the `.gov`, `.edu`, `.com`, `.mil` and `.org` generic top-level domain names were added and :rfc:`1032` proposed the utilisation of the two letter :term:`ISO-3166` country codes as top-level domain names. Since :term:`ISO-3166` defines a two letter code for each country recognised by the United Nations, this allowed all countries to automatically have a top-level domain. These domains include `.be` for Belgium, `.fr` for France, `.us` for the USA, `.ie` for Ireland or `.tv` for Tuvalu, a group of small islands in the Pacific and `.tm` for Turkmenistan. Today, the set of top-level domain-names is managed by the Internet Corporation for Assigned Names and Numbers (:term:`ICANN`). Recently, :term:`ICANN` added a dozen of generic top-level domains that are not related to a country and the `.cat` top-level domain has been registered for the Catalan language. There are ongoing discussions within :term:`ICANN` to increase the number of top-level domains. @@ -27,7 +27,7 @@ Each top-level domain is managed by an organisation that decides how sub-domain .. figure:: ../../book/application/png/app-fig-007-c.png :align: center - :scale: 50 + :scale: 50 The tree of domain names @@ -46,47 +46,47 @@ This hierarchical naming scheme is a key component of the Domain Name System (DN .. figure:: ../../book/application/png/app-fig-006-c.png :align: center - :scale: 50 + :scale: 50 A simple tree of domain names A :term:`nameserver` that is responsible for domain `dom` can directly answer the following queries : - + - the IP address of any host residing directly inside domain `dom` (e.g. `h2.dom` in the figure above) - the nameserver(s) that are responsible for any direct sub-domain of domain `dom` (i.e. 
`sdom1.dom` and `sdom2.dom` in the figure above, but not `z.sdom1.dom`) To retrieve the mapping for host `h2.dom`, a client sends its query to the name server that is responsible for domain `.dom`. The name server directly answers the query. To retrieve a mapping for `h3.a.sdom1.dom` a DNS client first sends a query to the name server that is responsible for the `.dom` domain. This nameserver returns the nameserver that is responsible for the `sdom1.dom` domain. This nameserver can now be contacted to obtain the nameserver that is responsible for the `a.sdom1.dom` domain. This nameserver can be contacted to retrieve the mapping for the `h3.a.sdom1.dom` name. Thanks to this organisation of the nameservers, it is possible for a DNS client to obtain the mapping of any host inside the `.dom` domain or any of its subdomains. To ensure that any DNS client will be able to resolve any fully qualified domain name, there are special nameservers that are responsible for the root of the domain name hierarchy. These nameservers are called :term:`root nameserver`. There are currently about a dozen root nameservers. -.. [#fdozen]_. +.. [#fdozen]_. -Each root nameserver maintains the list [#froot]_ of all the nameservers that are responsible for each of the top-level domain names and their IP addresses [#frootv6]_. All root nameservers are synchronised and provide the same answers. By querying any of the root nameservers, a DNS client can obtain the nameserver that is responsible for any top-level-domain name. From this nameserver, it is possible to resolve any domain name. +Each root nameserver maintains the list [#froot]_ of all the nameservers that are responsible for each of the top-level domain names and their IP addresses [#frootv6]_. All root nameservers are synchronised and provide the same answers. By querying any of the root nameservers, a DNS client can obtain the nameserver that is responsible for any top-level-domain name. From this nameserver, it is possible to resolve any domain name. -To be able to contact the root nameservers, each DNS client must know their IP addresses. This implies, that DNS clients must maintain an up-to-date list of the IP addresses of the root nameservers. Without this list, it is impossible to contact the root nameservers. Forcing all Internet hosts to maintain the most recent version of this list would be difficult from an operational point of view. To solve this problem, the designers of the DNS introduced a special type of DNS server : the DNS resolvers. A :term:`resolver` is a server that provides the name resolution service for a set of clients. A network usually contains a few resolvers. Each host in these networks is configured to send all its DNS queries via one of its local resolvers. These queries are called `recursive queries` as the :term:`resolver` must recurse through the hierarchy of nameservers to obtain the `answer`. +To be able to contact the root nameservers, each DNS client must know their IP addresses. This implies that DNS clients must maintain an up-to-date list of the IP addresses of the root nameservers. Without this list, it is impossible to contact the root nameservers. Forcing all Internet hosts to maintain the most recent version of this list would be difficult from an operational point of view. To solve this problem, the designers of the DNS introduced a special type of DNS server : the DNS resolvers. A :term:`resolver` is a server that provides the name resolution service for a set of clients.
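To illustrate how a client or resolver walks down this hierarchy of nameservers, the sketch below resolves names against a hard-coded set of zones. The `ZONES` dictionary, the server names and the addresses are all invented for the example ; a real implementation would of course send DNS queries over the network instead.

.. code-block:: python

    # Toy delegation data : each nameserver knows either the address of a
    # name or the nameserver responsible for one of its sub-domains.
    # All names and addresses below are invented for the illustration.
    ZONES = {
        "ns.root": {"dom": ("referral", "ns.dom")},
        "ns.dom": {"h2.dom": ("address", "192.0.2.2"),
                   "sdom1.dom": ("referral", "ns.sdom1.dom")},
        "ns.sdom1.dom": {"h3.a.sdom1.dom": ("address", "192.0.2.3")},
    }

    def query(server, name):
        # return the most specific information `server` has about `name`
        for suffix, answer in ZONES[server].items():
            if name == suffix or name.endswith("." + suffix):
                return answer
        raise KeyError(name)

    def resolve(name):
        server = "ns.root"  # every client must know the root nameservers
        while True:
            kind, value = query(server, name)
            if kind == "address":
                return value  # the final name to address mapping
            server = value    # referral towards a more specific nameserver

    print(resolve("h3.a.sdom1.dom"))  # via ns.root, ns.dom and ns.sdom1.dom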
A network usually contains a few resolvers. Each host in these networks is configured to send all its DNS queries via one of its local resolvers. These queries are called `recursive queries` as the :term:`resolver` must recurse through the hierarchy of nameservers to obtain the `answer`. -DNS resolvers have several advantages over letting each Internet host query directly nameservers. Firstly, regular Internet hosts do not need to maintain the up-to-date list of the IP addresses of the root servers. Secondly, regular Internet hosts do not need to send queries to nameservers all over the Internet. Furthermore, as a DNS resolver serves a large number of hosts, it can cache the received answers. This allows the resolver to quickly return answers for popular DNS queries and reduces the load on all DNS servers [JSBM2002]_. +DNS resolvers have several advantages over letting each Internet host query nameservers directly. Firstly, regular Internet hosts do not need to maintain the up-to-date list of the IP addresses of the root servers. Secondly, regular Internet hosts do not need to send queries to nameservers all over the Internet. Furthermore, as a DNS resolver serves a large number of hosts, it can cache the received answers. This allows the resolver to quickly return answers for popular DNS queries and reduces the load on all DNS servers [JSBM2002]_. Benefits of names ^^^^^^^^^^^^^^^^^ Using names instead of addresses inside applications has several important benefits in addition to being more human friendly. To understand these benefits, let us consider a popular application that provides information stored on servers. This application involves clients and servers. The server processes provide information upon requests from client processes running on remote hosts. A first deployment of this application would be to rely only on addresses. In this case, the server process would be installed on one host and the clients would connect to this server to retrieve information. Such a deployment has several drawbacks : - + - if the server process moves to another physical server, all clients must be informed about the new server address - - if there are many concurrent clients, the load of the server will increase without any possibility of adding another server without changing the server addresses user by the clients + - if there are many concurrent clients, the load of the server will increase without any possibility of adding another server without changing the server addresses used by the clients -Using names solves these problems and provide additional benefits. If clients are configured with the name of the server, they will query the name service before connecting to the server. The name service will resolve the name into the corresponding address. If a server process needs to move from one physical server to another, it suffices to update the name to address mapping of the name service to allow all clients to connect to the new server. The name service also enables the servers to better sustain be load. Assume a very popular server which is accessed by millions of user. This service cannot be provided by a single physical server due to performance limitations. Thanks to the utilisation of names, it is possible to scale this service by mapping a given name to a set of addresses. When a client queries the name service for the server's name, the name service returns one of the addresses in the set. Various strategies can be used to select one particular address inside the set of addresses.
A first strategy is to select a random address in the set. A second strategy is to maintain information about the load on the servers and return the address of the less loaded server. Note that the list of server addresses does not need to remain fixed. It is possible to add and remove addresses from the list to cope with load fluctuations. Another strategy is to infer the location of the client from the name request and return the address of the closest server. +Using names solves these problems and provides additional benefits. If clients are configured with the name of the server, they will query the name service before connecting to the server. The name service will resolve the name into the corresponding address. If a server process needs to move from one physical server to another, it suffices to update the name to address mapping of the name service to allow all clients to connect to the new server. The name service also enables the servers to better sustain a heavy load. Assume a very popular server which is accessed by millions of users. This service cannot be provided by a single physical server due to performance limitations. Thanks to the utilisation of names, it is possible to scale this service by mapping a given name to a set of addresses. When a client queries the name service for the server's name, the name service returns one of the addresses in the set. Various strategies can be used to select one particular address inside the set of addresses. A first strategy is to select a random address in the set. A second strategy is to maintain information about the load on the servers and return the address of the least loaded server. Note that the list of server addresses does not need to remain fixed. It is possible to add and remove addresses from the list to cope with load fluctuations. Another strategy is to infer the location of the client from the name request and return the address of the closest server. Mapping a single name onto a set of addresses allows popular servers to scale dynamically. There are also benefits in mapping multiple names, possibly a large number of them, onto a single address. Consider the case of information servers run by individuals or SMEs. Some of these servers attract only a few clients per day. Using a single physical server for each of these services would be a waste of resources. A better approach is to use a single server for a set of services that are all identified by different names. This enables service providers to support a large number of servers, identified by different names, on a single physical server. If one of these servers becomes very popular, it will be possible to map its name onto a set of addresses to be able to sustain the load. There are some deployments where this mapping is done dynamically as a function of the load. -Names provide a lot of flexibility compared to addresses. For the network, they play a similar role as variables in programming languages. No programmer using a high-level programming language would consider using addresses instead of variables.
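The selection strategies described above fit in a few lines of code. The sketch below is only an illustration : the name to addresses mapping and the load information are invented for the example.

.. code-block:: python

    # Sketch of two strategies to select one address inside the set of
    # addresses associated to a name. The mapping and the load values
    # are invented for the illustration.
    import random

    addresses = {"server.example": ["192.0.2.11", "192.0.2.12", "192.0.2.13"]}
    load = {"192.0.2.11": 0.7, "192.0.2.12": 0.2, "192.0.2.13": 0.5}

    def pick_random(name):
        # first strategy : return a random address of the set
        return random.choice(addresses[name])

    def pick_least_loaded(name):
        # second strategy : return the address of the least loaded server
        return min(addresses[name], key=lambda a: load[a])

Like a variable in a program, the name remains stable while the addresses behind it can change at any time.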
For the same reasons, all networked applications should depend on names and avoid dealing with addresses as much as possible. .. rubric:: Footnotes .. [#fhosts] The :term:`hosts.txt` file is not maintained anymore. A historical snapshot retrieved on April 15th, 1984 is available from http://ftp.univie.ac.at/netinfo/netinfo/hosts.txt -.. [#fdnstimeline] See http://www.donelan.com/dnstimeline.html for a time line of DNS related developments. +.. [#fdnstimeline] See http://www.donelan.com/dnstimeline.html for a time line of DNS related developments. .. [#fidn] This specification evolved later to support domain names written by using other character sets than us-ASCII :rfc:`5890`. This extension is important to support languages other than English, but a detailed discussion is outside the scope of this document. @@ -99,4 +99,3 @@ Names provide a lot of flexibility compared to addresses. For the network, they .. include:: /links.rst - diff --git a/book-2nd/principles/network.rst b/book-2nd/principles/network.rst index ff0b72b..4793342 100644 --- a/book-2nd/principles/network.rst +++ b/book-2nd/principles/network.rst @@ -5,7 +5,7 @@ Building a network ****************** -.. warning:: +.. warning:: This is an unpolished draft of the second edition of this ebook. If you find any error or have suggestions to improve the text, please create an issue via https://github.com/obonaventure/cnp3/issues?milestone=2 @@ -22,7 +22,7 @@ The main objective of the network layer is to allow endsystems, connected to dif \node[router] (R3) {R3}; \node[router, below left=of R3] (R1) {R1}; \node[router, below right=of R3] (R2) {R2}; - + \node[host, left=of R1] (A) {A}; \node[host, right=of R3] (B) {B}; @@ -37,18 +37,18 @@ The main objective of the network layer is to allow endsystems, connected to dif -Before explaining the network layer in detail, it is useful to remember the characteristics of the service provided by the `datalink` layer. There are many variants of the datalink layer. Some provide a reliable service while others do not provide any guarantee of delivery. The reliable datalink layer services are popular in environments such as wireless networks were transmission errors are frequent. On the other hand, unreliable services are usually used when the physical layer provides an almost reliable service (i.e. only a negligible fraction of the frames are affected by transmission errors). Such `almost reliable` services are frequently used in wired and optical networks. In this chapter, we will assume that the datalink layer service provides an `almost reliable` service since this is both the most general one and also the most widely deployed one. +Before explaining the network layer in detail, it is useful to remember the characteristics of the service provided by the `datalink` layer. There are many variants of the datalink layer. Some provide a reliable service while others do not provide any guarantee of delivery. The reliable datalink layer services are popular in environments such as wireless networks were transmission errors are frequent. On the other hand, unreliable services are usually used when the physical layer provides an almost reliable service (i.e. only a negligible fraction of the frames are affected by transmission errors). Such `almost reliable` services are frequently used in wired and optical networks. In this chapter, we will assume that the datalink layer service provides an `almost reliable` service since this is both the most general one and also the most widely deployed one. .. 
TODO add footnote ? Using a connection-oriented datalink layer causes some problems that are beyond the scope of this chapter. See :rfc:`3819` for a discussion on this topic. .. figure:: ../../book/network/svg/osi-datalink.png :align: center - :scale: 70 + :scale: 70 The point-to-point datalink layer -There are two main types of datalink layers. The simplest datalink layer is when there are only two communicating systems that are directly connected through the physical layer. Such a datalink layer is used when there is a point-to-point link between the two communicating systems. The two systems can be endsystems or routers. :abbr:`PPP (Point-to-Point Protocol)`, defined in :rfc:`1661`, is an example of such a point-to-point datalink layer. Datalink layers exchange `frames` and a datalink :term:`frame` sent by a datalink layer entity on the left is transmitted through the physical layer, so that it can reach the datalink layer entity on the right. Point-to-point datalink layers can either provide an unreliable service (frames can be corrupted or lost) or a reliable service (in this case, the datalink layer includes retransmission mechanisms). +There are two main types of datalink layers. The simplest datalink layer is when there are only two communicating systems that are directly connected through the physical layer. Such a datalink layer is used when there is a point-to-point link between the two communicating systems. The two systems can be endsystems or routers. :abbr:`PPP (Point-to-Point Protocol)`, defined in :rfc:`1661`, is an example of such a point-to-point datalink layer. Datalink layers exchange `frames` and a datalink :term:`frame` sent by a datalink layer entity on the left is transmitted through the physical layer, so that it can reach the datalink layer entity on the right. Point-to-point datalink layers can either provide an unreliable service (frames can be corrupted or lost) or a reliable service (in this case, the datalink layer includes retransmission mechanisms). .. The unreliable service is frequently used above physical layers (e.g. optical fiber, twisted pairs) having a low bit error ratio while reliability mechanisms are often used in wireless networks to recover locally from transmission errors. @@ -94,14 +94,14 @@ To understand the key principles behind the operation of a network, let us analy .. index:: address -The network layer enables the transmission of information between hosts that are not directly connected through intermediate routers. This transmission is carried out by putting the information to be transmitted inside a data structure which is called a `packet`. Like a frame that contains useful data and control information, a packet also contains useful data and control information. An important issue in the network layer is the ability to identify a node (host or router) inside the network. This identification is performed by associating an address to each node. An `address` is usually represented as a sequence of bits. Most networks use fixed-length addresses. At this stage, let us simply assume that each of the nodes in the above network has an address which corresponds to the binary representation on its name on the figure. +The network layer enables the transmission of information between hosts that are not directly connected through intermediate routers. This transmission is carried out by putting the information to be transmitted inside a data structure which is called a `packet`. 
Like a frame that contains useful data and control information, a packet also contains useful data and control information. An important issue in the network layer is the ability to identify a node (host or router) inside the network. This identification is performed by associating an address to each node. An `address` is usually represented as a sequence of bits. Most networks use fixed-length addresses. At this stage, let us simply assume that each of the nodes in the above network has an address which corresponds to the binary representation of its name on the figure. To send one byte of information to host `B`, host `A` needs to place this information inside a `packet`. In addition to the data being transmitted, the packet must also contain either the addresses of the source and the destination nodes or information that indicates the path that needs to be followed to reach the destination. There are two possible organisations for the network layer : - `datagram` - - `virtual circuits` + - `virtual circuits` The datagram organisation @@ -113,38 +113,38 @@ The first and most popular organisation of the network layer is the datagram org - its own network layer address - the information to be sent -.. The network layer limits the maximum packet size. Thus, the information must have been divided in packets by the transport layer before being passed to the network layer. +.. The network layer limits the maximum packet size. Thus, the information must have been divided in packets by the transport layer before being passed to the network layer. To understand the datagram organisation, let us consider the figure below. A network layer address, represented by a letter, has been assigned to each host and router. To send some information to host `J`, host `A` creates a packet containing its own address, the destination address and the information to be exchanged. .. figure:: ../../book/network/svg/simple-internetwork.png :align: center - :scale: 80 + :scale: 80 - A simple internetwork + A simple internetwork .. index:: hop-by-hop forwarding, forwarding table -With the datagram organisation, routers use `hop-by-hop forwarding`. This means that when a router receives a packet that is not destined to itself, it looks up the destination address of the packet in its `forwarding table`. A `forwarding table` is a data structure that maps each destination address (or set of destination addresses) to the outgoing interface over which a packet destined to this address must be forwarded to reach its final destination. The router consults its forwarding table for each packet that it handles. +With the datagram organisation, routers use `hop-by-hop forwarding`. This means that when a router receives a packet that is not destined to itself, it looks up the destination address of the packet in its `forwarding table`. A `forwarding table` is a data structure that maps each destination address (or set of destination addresses) to the outgoing interface over which a packet destined to this address must be forwarded to reach its final destination. The router consults its forwarding table for each packet that it handles. The figure illustrates some possible forwarding tables in this network. By inspecting the forwarding tables of the different routers, one can find the path followed by packets sent from a source to a particular destination. In the example above, host `A` sends its packet to router `R1`. `R1` consults its routing table and forwards the packet towards `R2`.
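This hop-by-hop lookup, which continues at each intermediate router below, can be sketched as follows. The forwarding tables in the sketch are invented, but they reproduce the `A -> R1 -> R2 -> R5 -> J` path of the example ; the two ways in which forwarding can go wrong are discussed in the next paragraphs.

.. code-block:: python

    # Sketch of hop-by-hop forwarding. The tables below are invented but
    # reproduce the path A -> R1 -> R2 -> R5 -> J of the example.
    tables = {
        "R1": {"J": "R2"},
        "R2": {"J": "R5"},
        "R5": {"J": "J"},  # J is directly attached to R5
    }

    def forward(dest, router, max_hops=16):
        path = [router]
        while router != dest:
            if dest not in tables.get(router, {}):
                return path, "discarded"  # no entry for this destination
            router = tables[router][dest]
            path.append(router)
            if len(path) > max_hops:
                return path, "looping"    # safety net against cycles
        return path, "delivered"

    print(forward("J", "R1"))  # (['R1', 'R2', 'R5', 'J'], 'delivered')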
Based on its own routing table, `R2` decides to forward the packet to `R5` that can deliver it to its destination. Thus, the path from `A` to `J` is `A -> R1 -> R2 -> R5 -> J`. -The computation of the forwarding tables of all the routers inside a network is a key element for the correct operation of the network. This computation can be carried out in different ways and it is possible to use both distributed and centralized algorithms. These algorithms provide different performance, may lead to different types of paths, but their composition must lead to valid path. +The computation of the forwarding tables of all the routers inside a network is a key element for the correct operation of the network. This computation can be carried out in different ways and it is possible to use both distributed and centralized algorithms. These algorithms provide different performance, may lead to different types of paths, but their composition must lead to valid paths. -In a network, a path can be defined as the list of all intermediate routers for a given source destination pair. For a given source/destination pair, the path can be derived by first consulting the forwarding table of the router attached to the source to determine the next router on the path towards the chosen destination. Then, the forwarding table of this router is queried for the same destination... The queries continue until the destination is reached. In a network that has valid forwarding tables, all the paths between all source/destination pairs contain a finite number of intermediate routers. However, if forwarding tables have not been correctly computed, two types of invalid path can occur. +In a network, a path can be defined as the list of all intermediate routers for a given source/destination pair. For a given source/destination pair, the path can be derived by first consulting the forwarding table of the router attached to the source to determine the next router on the path towards the chosen destination. Then, the forwarding table of this router is queried for the same destination... The queries continue until the destination is reached. In a network that has valid forwarding tables, all the paths between all source/destination pairs contain a finite number of intermediate routers. However, if forwarding tables have not been correctly computed, two types of invalid paths can occur. .. index:: black hole -A path may lead to a black hole. In a network, a black hole is a router that receives packets for at least one given source/destination pair but does not have any entry inside its forwarding table for this destination. Since it does not know how to reach the destination, the router cannot forward the received packets and must discard them. Any centralized or distributed algorithm that computes forwarding tables must ensure that there are not black holes inside the network. +A path may lead to a black hole. In a network, a black hole is a router that receives packets for at least one given source/destination pair but does not have any entry inside its forwarding table for this destination. Since it does not know how to reach the destination, the router cannot forward the received packets and must discard them. Any centralized or distributed algorithm that computes forwarding tables must ensure that there are no black holes inside the network. .. index:: forwarding loop A second type of problem may exist in networks using the datagram organisation. Consider a path that contains a cycle.
For example, router `R1` sends all packets towards destination `D` via router `R2`, router `R2` forwards these packets to router `R3` and finally router `R3`'s forwarding table uses router `R1` as its nexthop to reach destination `D`. In this case, if a packet destined to `D` is received by router `R1`, it will loop on the `R1 -> R2 -> R3 -> R1` cycle and will never reach its final destination. As in the black hole case, the destination is not reachable from all sources in the network. However, in practice the loop problem is worse than the black hole problem because when a packet is caught in a forwarding loop, it unnecessarily consumes bandwidth. In the black hole case, the problematic packet is quickly discarded. We will see later that network layer protocols include techniques to minimize the impact of such forwarding loops. -Any solution which is used to compute the forwarding tables of a network must ensure that all destinations are reachable from any source. This implies that it must guarantee the absence of black holes and forwarding loops. +Any solution which is used to compute the forwarding tables of a network must ensure that all destinations are reachable from any source. This implies that it must guarantee the absence of black holes and forwarding loops. .. index:: data plane @@ -154,7 +154,7 @@ The `forwarding tables` and the precise format of the packets that are exchanged .. index:: control plane -Besides the `data plane`, a network is also characterized by its `control plane`. The control plane includes all the protocols and algorithms (often distributed) that are used to compute the forwarding tables that are installed on all routers inside the network. While there is only one possible `data plane` for a given networking technology, different networks using the same technology may use different control planes. The simplest `control plane` for a network is always to compute manually the forwarding tables of all routers inside the network. This simple control plane is sufficient when the network is (very) small, usually up to a few routers. +Besides the `data plane`, a network is also characterized by its `control plane`. The control plane includes all the protocols and algorithms (often distributed) that are used to compute the forwarding tables that are installed on all routers inside the network. While there is only one possible `data plane` for a given networking technology, different networks using the same technology may use different control planes. The simplest `control plane` for a network is always to compute manually the forwarding tables of all routers inside the network. This simple control plane is sufficient when the network is (very) small, usually up to a few routers. In most networks, manual forwarding tables are not a solution for two reasons. First, most networks are too large to enable a manual computation of the forwarding tables. Second, with manually computed forwarding tables, it is very difficult to deal with link and router failures. Networks need to operate 24h a day, 365 days per year. During the lifetime of a network, many events can affect the routers and links that it contains. Link failures are regular events in deployed networks. Links can fail for various reasons, including electromagnetic interference, fiber cuts, hardware or software problems on the terminating routers, ... Some links also need to be added to the network or removed because their utilisation is too low or their cost is too high. Similarly, routers also fail. 
There are two types of failures that affect routers. A router may stop forwarding packets due to hardware or software problem (e.g. due to a crash of its operating system). A router may also need to be halted from time to time (e.g. to upgrade its operating system to fix some bugs). These planned and unplanned events affect the set of links and routers that can be used to forward packets in the network. Still, most network users expect that their network will continue to correctly forward packets despite all these events. With manually computed forwarding tables, it is usually impossible to precompute the forwarding tables while taking into account all possible failure scenarios. @@ -165,7 +165,7 @@ An alternative to manually computed forwarding tables is to use a network manage .. Openflow is an example of this kind of solution. -Another interesting point that is worth being discussed is when the forwarding tables are computed. A widely used solution is to compute the entries of the forwarding tables for all destinations on all routers. This ensures that each router has a valid route towards each destination. These entries can be updated when an event occurs and the network topology changes. A drawback of this approach is that the forwarding tables can become large in large networks since each router must maintain one entry for each destination at all times inside its forwarding table. +Another interesting point that is worth being discussed is the way the forwarding tables are computed. A widely used solution is to compute the entries of the forwarding tables for all destinations on all routers. This ensures that each router has a valid route towards each destination. These entries can be updated when an event occurs and the network topology changes. A drawback of this approach is that the forwarding tables can become large in large networks since each router must maintain one entry for each destination at all times inside its forwarding table. Some networks use the arrival of packets as the trigger to compute the corresponding entries in the forwarding tables. Several technologies have been built upon this principle. When a packet arrives, the router consults its forwarding table to find a path towards the destination. If the destination is present in the forwarding table, the packet is forwarded. Otherwise, the router needs to find a way to forward the packet and update its forwarding table. @@ -186,7 +186,7 @@ To understand the operation of the port-address table, let us consider the examp .. tikz:: - :libs: positioning, matrix, arrows + :libs: positioning, matrix, arrows \tikzstyle{arrow} = [thick,->,>=stealth] \tikzset{router/.style = {rectangle, draw, text centered, minimum height=2em}, } @@ -201,13 +201,13 @@ To understand the operation of the port-address table, let us consider the examp \node[router, below=of R4] (R5) {R5}; \node[host, right=of R4] (B) {B}; \path[draw,thick] - (A) edge (R1) - (R1) edge (R2) - (R1) edge (R3) - (R2) edge (C) + (A) edge (R1) + (R1) edge (R2) + (R1) edge (R3) + (R2) edge (C) (R3) edge (R4) (R5) edge (B) - (R3) edge (R5); + (R3) edge (R5); Host `A` sends a packet towards `B`. When receiving this packet, `R1` learns that `A` is reachable via its `North` interface. Since it does not have an entry for destination `B` in its port-address table, it forwards the packet to both `R2` and `R3`. When `R2` receives the packet, it updates its own forwarding table and forward the packet to `C`. 
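The learning step that `R1` and `R2` have just performed, and that the other nodes apply in the remainder of this example, can be sketched as follows. The port names are invented and the sketch simply returns the set of ports on which the received packet must be retransmitted :

.. code-block:: python

    # Sketch of a node that builds its port-address table from the source
    # addresses of the packets it receives. Port names are invented.
    class Node:
        def __init__(self, ports):
            self.ports = ports  # e.g. ["North", "South", "East"]
            self.table = {}     # port-address table : address -> port

        def receive(self, src, dst, in_port):
            self.table[src] = in_port  # learn where src is reachable
            if dst in self.table:
                return [self.table[dst]]  # forward on a single port
            # unknown destination : broadcast on all other ports
            return [p for p in self.ports if p != in_port]

    r1 = Node(["North", "South", "East"])
    print(r1.receive("A", "B", "North"))  # broadcast : ['South', 'East']
    print(r1.receive("B", "A", "South"))  # A is known : ['North']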
Since `C` is not the intended recipient, it simply discards the received packet. Node `R3` also received the packet. It learns that `A` is reachable via its `North` interface and broadcasts the packet to `R4` and `R5`. `R5` also updates its forwarding table and finally forwards it to destination `B`.`Let us now consider what happens when `B` sends a reply to `A`. `R5` first learns that `B` is attached to its `South` port. It then consults its port-address table and finds that `A` is reachable via its `North` interface. The packet is then forwarded hop-by-hop to `A` without any broadcasting. If `C` sends a packet to `B`, this packet will reach `R1` that contains a valid forwarding entry in its forwarding table. @@ -217,7 +217,7 @@ By inspecting the source and destination addresses of packets, network nodes can .. tikz:: - :libs: positioning, matrix, arrows + :libs: positioning, matrix, arrows \tikzstyle{arrow} = [thick,->,>=stealth] \tikzset{router/.style = {rectangle, draw, text centered, minimum height=2em}, } @@ -229,15 +229,15 @@ By inspecting the source and destination addresses of packets, network nodes can \node[router, below=of R1] (R2) {R2}; \node[host, right=of R3] (B) {B}; \path[draw,thick] - (A) edge (R1) - (R1) edge (R2) - (R1) edge (R3) - (R3) edge (R2) - (R3) edge (B); + (A) edge (R1) + (R1) edge (R2) + (R1) edge (R3) + (R3) edge (R2) + (R3) edge (B); Assume that the network has started and all port-station and forwarding tables are empty. Host `A` sends a packet towards `B`. Upon reception of this packet, `R1` updates its port-address table. Since `B` is not present in the port-address table, the packet is broadcasted. Both `R2` and `R3` receive a copy of the packet sent by `A`. They both update their port-address table. Unfortunately, they also both broadcast the received packet. `B` receives a first copy of the packet, but `R3` and `R2` receive it again. `R3` will then broadcast this copy of the packet to `B` and `R1` while `R2` will broadcast its copy to `R1`. Although `B` has already received two copies of the packet, it is still inside the network and will continue to loop. Due to the presence of the cycle, a single packet towards an unknown destination generates copies of this packet that loop and will saturate the network bandwidth. Network operators who are using port-address tables to automatically compute the forwarding tables also use distributed algorithms to ensure that the network topology is always a tree. -.. +.. // imagepath="../svg/icons/:../../svg/icons/"; // r1 [label="R1" labelloc=bottom shapefile="router.png" ]; // r2 [label="R2" labelloc=bottom shape=box imagescale=true image="router.png" ]; @@ -254,10 +254,10 @@ Another technique can be used to automatically compute forwarding tables. It has - the `data packets` - the `control packets` -`Data packets` are used to exchange data while `control packets` are used to discover the paths between endhosts. With `Source routing`, network nodes can be kept as simple as possible and all the complexity is placed on the endhosts. This is in contrast with the previous technique where the nodes had to maintain a port-address and a forwarding table while the hosts simply sent and received packets. Each node is configured with one unique address and there is one identifier per outgoing link. For simplicity and to avoid cluttering the figures with those identifiers, we will assume that each node uses as link identifiers north, west, south, ... In practice, a node would associate one integer to each outgoing link. 
+`Data packets` are used to exchange data while `control packets` are used to discover the paths between endhosts. With `Source routing`, network nodes can be kept as simple as possible and all the complexity is placed on the endhosts. This is in contrast with the previous technique where the nodes had to maintain a port-address and a forwarding table while the hosts simply sent and received packets. Each node is configured with one unique address and there is one identifier per outgoing link. For simplicity and to avoid cluttering the figures with those identifiers, we will assume that each node uses as link identifiers north, west, south, ... In practice, a node would associate one integer to each outgoing link. .. tikz:: - :libs: positioning, matrix, arrows + :libs: positioning, matrix, arrows \tikzstyle{arrow} = [thick,->,>=stealth] \tikzset{router/.style = {rectangle, draw, text centered, minimum height=2em}, } @@ -270,15 +270,15 @@ Another technique can be used to automatically compute forwarding tables. It has \node[router, right=of R3] (R4) {R4}; \node[host, right=of R4] (B) {B}; \path[draw,thick] - (A) edge (R1) - (R1) edge (R2) - (R1) edge (R3) - (R3) edge (R2) - (R3) edge (R4) - (R4) edge (B); + (A) edge (R1) + (R1) edge (R2) + (R1) edge (R3) + (R3) edge (R2) + (R3) edge (R4) + (R4) edge (B); -In the network above, node `R2` is attached to two outgoing links. `R2` is connected to both `R1` and `R3`. `R2` can easily determine that it is connected to these two nodes by exchanging packets with them or observing the packets that it receives over each interface. Assume for example that when a host or node starts, it sends a special control packet over each of its interfaces to advertise its own address to its neighbors. When a host or node receives such a packet, it automatically replies with its own address. This exchange can also be used to verify whether a neighbor, either node or host, is still alive. With `source routing`, the data plane packets include a list of identifiers. This list is called a `source route` and indicates the path to be followed by the packet as a sequence of link identifiers. When a node receives such a `data plane` packet, it first checks whether the packet's destination is direct neighbor. In this case, the packet is forwarded to the destination. Otherwise, the node extracts the next address from the list and forwards it to the neighbor. This allows the source to specify the explicit path to be followed for each packet. For example, in the figure above there are two possible paths between `A` and `B`. To use the path via `R2`, `A` would send a packet that contains `R1,R2,R3` as source route. To avoid going via `R2`, `A` would place `R1,R3` as the source route in its transmitted packet. If `A` knows the complete network topology and all link identifiers, it can easily compute the source route towards each destination. If needed, it could even use different paths, e.g. for redundancy, to reach a given destination. However, in a real network hosts do not usually have a map of the entire network topology. +In the network above, node `R2` is attached to two outgoing links. `R2` is connected to both `R1` and `R3`. `R2` can easily determine that it is connected to these two nodes by exchanging packets with them or observing the packets that it receives over each interface. Assume for example that when a host or node starts, it sends a special control packet over each of its interfaces to advertise its own address to its neighbors. 
When a host or node receives such a packet, it automatically replies with its own address. This exchange can also be used to verify whether a neighbor, either node or host, is still alive. With `source routing`, the data plane packets include a list of identifiers. This list is called a `source route` and indicates the path to be followed by the packet as a sequence of link identifiers. When a node receives such a `data plane` packet, it first checks whether the packet's destination is a direct neighbor. In this case, the packet is forwarded to the destination. Otherwise, the node extracts the next address from the list and forwards it to the neighbor. This allows the source to specify the explicit path to be followed for each packet. For example, in the figure above there are two possible paths between `A` and `B`. To use the path via `R2`, `A` would send a packet that contains `R1,R2,R3` as source route. To avoid going via `R2`, `A` would place `R1,R3` as the source route in its transmitted packet. If `A` knows the complete network topology and all link identifiers, it can easily compute the source route towards each destination. If needed, it could even use different paths, e.g. for redundancy, to reach a given destination. However, in a real network hosts do not usually have a map of the entire network topology. .. index:: record route @@ -288,7 +288,7 @@ In networks that rely on source routing, hosts use control packets to automatica For example, consider again the network topology above. `A` sends a control packet towards `B`. The initial `record route` is empty. When `R1` receives the packet, it adds its own address to the `record route` and forwards a copy to `R2` and another to `R3`. `R2` receives the packet, adds itself to the `record route` and forwards it to `R3`. `R3` receives two copies of the packet. The first contains the `[R1,R2]` `record route` and the second `[R1]`. In the end, `B` will receive two control packets containing `[R1,R2,R3,R4]` and `[R1,R3,R4]` as `record routes`. `B` can keep these two paths or select the best one and discard the second. A popular heuristic is to select the `record route` of the first received packet as being the best one since this likely corresponds to the shortest delay path. -With the received `record route`, `B` can send a `data packet` to `A`. For this, it simply reverses the chosen `record route`. However, we still need to communicate the chosen path to `A`. This can be done by putting the `record route` inside a control packet which is sent back to `A` over the reverse path. An alternative is to simply send a `data packet` back to `A`. This packet will travel back to `A`. To allow `A` to inspect the entire path followed by the `data packet`, its `source route` must contain all intermediate routers when it is received by `A`. This can be achieved by encoding the `source route` using a data structure that contains an index and the ordered list of node addresses. The index always points to the next address in the `source route`. It is initialized at `0` when a packet is created and incremented by each intermediate node. +With the received `record route`, `B` can send a `data packet` to `A`. For this, it simply reverses the chosen `record route`. However, we still need to communicate the chosen path to `A`. This can be done by putting the `record route` inside a control packet which is sent back to `A` over the reverse path. An alternative is to simply send a `data packet` back to `A`. This packet will travel back to `A`. 
To allow `A` to inspect the entire path followed by the `data packet`, its `source route` must contain all intermediate routers when it is received by `A`. This can be achieved by encoding the `source route` using a data structure that contains an index and the ordered list of node addresses. The index always points to the next address in the `source route`. It is initialized at `0` when a packet is created and incremented by each intermediate node.


Flat or hierarchical addresses


A drawback of the `flat addressing scheme` is that the forwarding tables grow linearly

A widely used alternative to the `flat addressing scheme` is the `hierarchical addressing scheme`. This addressing scheme builds upon the fact that networks usually contain many more hosts than network nodes. In this case, a first solution to reduce the size of the forwarding tables is to create a hierarchy of addresses. This is the solution chosen by the post office where addresses contain a country, sometimes a state or province, a city, a street and finally a street number. When an envelope is forwarded by a post office in a remote country, it only looks at the destination country, while a post office in the same province will look at the city information. Only the post office responsible for a given city will look at the street name and only the postman will use the street number. `Hierarchical addresses` provide a similar solution for network addresses. For example, the address of an Internet host attached to a campus network could contain in the high-order bits an identification of the Internet Service Provider (ISP) that serves the campus network. Then, a subsequent block of bits identifies the campus network which is one of the customers of the ISP. Finally, the low order bits of the address identify the host in the campus network.

-This hierarchical allocation of addresses can be applied in any type of network. In practice, the allocation of the addresses must follow the network topology. Usually, this is achieved by dividing the addressing space in consecutive blocks and then allocating these blocks to different parts of the network. In a small network, the simplest solution is to allocate one block of addresses to each network node and assign the host addresses from the attached node.
+This hierarchical allocation of addresses can be applied in any type of network. In practice, the allocation of the addresses must follow the network topology. Usually, this is achieved by dividing the addressing space in consecutive blocks and then allocating these blocks to different parts of the network. In a small network, the simplest solution is to allocate one block of addresses to each network node and assign the host addresses from the attached node.

 .. tikz::
- :libs: positioning, matrix, arrows
+ :libs: positioning, matrix, arrows

 \tikzstyle{arrow} = [thick,->,>=stealth]
 \tikzset{router/.style = {rectangle, draw, text centered, minimum height=2em}, }
@@ -323,20 +323,20 @@ This hierarchical allocation of addresses can be applied in any type of network.
 \node[router, right=of R3] (R4) {R4};
 \node[host, right=of R4] (B) {B};
 \path[draw,thick]
- (A) edge (R1)
- (R1) edge (R2)
- (R1) edge (R3)
- (R3) edge (R2)
- (R3) edge (R4)
- (R4) edge (B);
+ (A) edge (R1)
+ (R1) edge (R2)
+ (R1) edge (R3)
+ (R3) edge (R2)
+ (R3) edge (R4)
+ (R4) edge (B);

-In the above figure, assume that the network uses 16 bits addresses and that the prefix `01001010` has been assigned to the entire network.
Since the network contains four routers, the network operator could assign one block of sixty-four addresses to each router. `R1` would use address `0100101000000000` while `A` could use address `0100101000000001`. `R2` could be assigned all adresses from `0100101001000000` to `0100101001111111`. `R4` could then use `0100101011000000` and assign `0100101011000001` to `B`. Other allocation schemes are possible. For example, `R3` could be allocated a larger block of addresses than `R2` and `R4` could use a sub-block from `R3` 's address block.
+In the above figure, assume that the network uses 16-bit addresses and that the prefix `01001010` has been assigned to the entire network. Since the network contains four routers, the network operator could assign one block of sixty-four addresses to each router. `R1` would use address `0100101000000000` while `A` could use address `0100101000000001`. `R2` could be assigned all addresses from `0100101001000000` to `0100101001111111`. `R4` could then use `0100101011000000` and assign `0100101011000001` to `B`. Other allocation schemes are possible. For example, `R3` could be allocated a larger block of addresses than `R2` and `R4` could use a sub-block from `R3`'s address block.

The main advantage of hierarchical addresses is that it is possible to significantly reduce the size of the forwarding tables. In many networks, the number of nodes can be several orders of magnitude smaller than the number of hosts. A campus network may contain a few dozen network nodes for thousands of hosts. The largest Internet Service Providers typically contain no more than a few tens of thousands of network nodes but still serve tens or hundreds of millions of hosts.

-Despite their popularity, `hierarchical addresses` have some drawbacks. Their first drawback is that a lookup in the forwarding table is more complex than when using `flat addresses`. For example, on the Internet, network nodes have to perform a longest-match to forward each packet. This is partially compensated by the reduction in the size of the forwarding tables, but the additional complexity of the lookup operation has been a difficulty to implement hardware support for packet forwarding. A second drawback of the utilisation of hierarchical addresses is that when a host connects for the first time to a network, it must contact one network node to determine its own address. This requires some packet exchanges between the host and some network nodes. Furthermore, if a host moves and is attached to another network node, its network address will change. This can be an issue with some mobile hosts.
+Despite their popularity, `hierarchical addresses` have some drawbacks. Their first drawback is that a lookup in the forwarding table is more complex than when using `flat addresses`. For example, on the Internet, network nodes have to perform a longest-match to forward each packet. This is partially compensated by the reduction in the size of the forwarding tables, but the additional complexity of the lookup operation has made it difficult to implement hardware support for packet forwarding. A second drawback of the utilisation of hierarchical addresses is that when a host connects to a network for the first time, it must contact one network node to determine its own address. This requires some packet exchanges between the host and some network nodes. Furthermore, if a host moves and is attached to another network node, its network address will change. This can be an issue with some mobile hosts.
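To make the longest-match operation concrete, the fragment below sketches a forwarding table holding the 10-bit prefixes allocated in the example above. It is only an illustration: the prefix-to-interface bindings (including the block assumed for `R3`) are assumptions made for this sketch, not part of the original text.

.. code-block:: python

    # Sketch of a longest-prefix match on the 16-bit addresses used above.
    # The bindings below are assumed for the example (R3's block is a guess).
    table = {
        "0100101000": "->R1",  # block containing A's address
        "0100101001": "->R2",
        "0100101010": "->R3",
        "0100101011": "->R4",  # block containing B's address
    }

    def lookup(address):
        # Try the longest prefix first and stop at the first match.
        for length in range(len(address), 0, -1):
            entry = table.get(address[:length])
            if entry is not None:
                return entry
        return None  # unknown destination

    print(lookup("0100101011000001"))  # B's address is forwarded towards R4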
Dealing with heterogeneous datalink layers
------------------------------------------

@@ -379,7 +379,7 @@ Considering in the network above that host `A` wants to send a 900 bytes packet

 #. Each of the packet fragments is a valid packet that contains a header with the source (host `A`) and destination (host `B`) addresses. When router `R2` receives a packet fragment, it treats this packet as a regular packet and forwards it to its final destination (host `B`). Host `B` reassembles the received fragments.

-These three solutions have advantages and drawbacks. With the first solution, routers remain simple and do not need to perform any fragmentation operation. This is important when routers are implemented mainly in hardware. However, hosts are more complex since they need to store the packets that they produce if they need to pass through a link that does not support large packets. This increases the buffering required on the end hosts. Furthermore, a single large packet may potentially need to be retransmitted several times. Consider for example a network similar to the one shown above but with four routers. Assume that the link `R1->R2` supports 1000 bytes packets, link `R2->R3` 800 bytes packets and link `R3->R4` 600 bytes packets. A host attached to `R1` that sends large packet will have to first try 1000 bytes, then 800 bytes and finally 600 bytes. Fortunately, this scenario does not occur very often in practice and this is the reason why this solution is used in real networks.
+These three solutions have advantages and drawbacks. With the first solution, routers remain simple and do not need to perform any fragmentation operation. This is important when routers are implemented mainly in hardware. However, hosts are more complex since they need to store the packets that they produce if they need to pass through a link that does not support large packets. This increases the buffering required on the end hosts. Furthermore, a single large packet may potentially need to be retransmitted several times. Consider for example a network similar to the one shown above but with four routers. Assume that the link `R1->R2` supports 1000-byte packets, link `R2->R3` 800-byte packets and link `R3->R4` 600-byte packets. A host attached to `R1` that sends large packets will have to first try 1000 bytes, then 800 bytes and finally 600 bytes. Fortunately, this scenario does not occur very often in practice and this is the reason why this solution is used in real networks.

Fragmenting packets on a per-link basis, as presented for the second solution, can minimize the transmission overhead since a packet is only fragmented on the links where fragmentation is required. Large packets can continue to be used downstream of a link that only accepts small packets. However, this reduction of the overhead comes with two drawbacks. First, fragmenting packets, potentially on all links, increases the processing time and the buffer requirements on the routers. Second, this solution leads to a longer end-to-end delay since the downstream router has to reassemble all the packet fragments before forwarding the packet.

@@ -408,7 +408,7 @@ In a network using virtual circuits, all hosts are also identified with a network

 - the outgoing interface for the packet
 - the label for the outgoing packet

-For example, consider the `label forwarding table` of a network node below.
+For example, consider the `label forwarding table` of a network node below.
+--------+--------------------+----------+

@@ -428,7 +428,7 @@ If this node receives a packet with `label=2`, it forwards the packet on its `We

`Label switching` enables full control over the path followed by packets inside the network. Consider the network below and assume that we want to use two virtual circuits : `R1->R3->R4->R2->R5` and `R2->R1->R3->R4->R5`.

 .. tikz::
- :libs: positioning, matrix, arrows
+ :libs: positioning, matrix, arrows

 \tikzstyle{arrow} = [thick,->,>=stealth]
 \tikzset{router/.style = {rectangle, draw, text centered, minimum height=2em}, }
@@ -440,17 +440,17 @@ If this node receives a packet with `label=2`, it forwards the packet on its `We
 \node[router, below right=of R3] (R4) {R4};
 \node[router, below=of R4] (R5) {R5};
 \path[draw,thick]
- (R1) edge (R2)
- (R1) edge (R3)
- (R4) edge (R3)
- (R4) edge (R2)
- (R2) edge (R5)
- (R4) edge (R5);
+ (R1) edge (R2)
+ (R1) edge (R3)
+ (R4) edge (R3)
+ (R4) edge (R2)
+ (R2) edge (R5)
+ (R4) edge (R5);

-To create these virtual circuits, we need to configure the
-label forwarding tables` of all network nodes. For simplicity, assume that a label forwarding table only contains two entries. Assume that `R5` wants to receive the packets from the virtual circuit created by `R1` (resp. `R2`) with `label=1` (`label=0`). `R4` could use the following `label forwarding table`:
+To create these virtual circuits, we need to configure the
+label forwarding tables of all network nodes. For simplicity, assume that a label forwarding table only contains two entries. Assume that `R5` wants to receive the packets from the virtual circuit created by `R1` (resp. `R2`) with `label=1` (`label=0`). `R4` could use the following `label forwarding table`:

+--------+--------------------+----------+
| index  | outgoing interface | label    |
+--------+--------------------+----------+
| 0      | ->R5               | 1        |
+--------+--------------------+----------+
| 1      | ->R5               | 0        |
+--------+--------------------+----------+

@@ -493,7 +493,7 @@ With the above `label forwarding table`, `R1` needs to originate the packets tha

The figure below shows the path followed by the packets on the `R1->R3->R4->R2->R5` path in red, with the label used in the packets shown on each arrow.

 .. tikz::
- :libs: positioning, matrix, arrows
+ :libs: positioning, matrix, arrows

 \tikzstyle{arrow} = [thick,->,>=stealth]
 \tikzset{router/.style = {rectangle, draw, text centered, minimum height=2em}, }
@@ -505,13 +505,13 @@ The figure below shows the path followed by the packets on the `R1->R3->R4->R2->
 \node[router, below right=of R3] (R4) {R4};
 \node[router, below=of R4] (R5) {R5};
 \path[draw,thick]
- (R1) edge (R2)
- (R4) edge (R5);
- \draw[arrow, dashed, red] (R1) -- (R3) node [midway, fill=white] {1};
- \draw[arrow, dashed, red] (R3) -- (R4) node [midway, fill=white] {0};
- \draw[arrow, dashed, red] (R4) -- (R2) node [midway, fill=white] {0};
+ (R1) edge (R2)
+ (R4) edge (R5);
+ \draw[arrow, dashed, red] (R1) -- (R3) node [midway, fill=white] {1};
+ \draw[arrow, dashed, red] (R3) -- (R4) node [midway, fill=white] {0};
+ \draw[arrow, dashed, red] (R4) -- (R2) node [midway, fill=white] {0};
 \draw[arrow, dashed, red] (R2) -- (R5) node [midway, fill=white] {1};
-
+

@@ -521,12 +521,12 @@ Nowadays, most deployed networks rely on distributed algorithms, called routing

.. The datagram organisation has been very popular in computer networks. Datagram based network layers include IPv4 and IPv6 in the global Internet, CLNP defined by the ISO, IPX defined by Novell or XNS defined by Xerox [Perlman2000]_.

-..
+..
 .. figure:: svg/simple-lan.png
  :align: center
- :scale: 80
-
- A local area network
+ :scale: 80
+
+ A local area network

..
An important difference between the point-to-point datalink layers and the datalink layers used in LANs is that in a LAN, each communicating device is identified by a unique `datalink layer address`. This address is usually embedded in the hardware of the device and different types of LANs use different types of datalink layer addresses. A communicating device attached to a LAN can send a datalink frame to any other communicating device that is attached to the same LAN. Most LANs also support special broadcast and multicast datalink layer addresses. A frame sent to the broadcast address of the LAN is delivered to all communicating devices that are attached to the LAN. The multicast addresses are used to identify groups of communicating devices. When a frame is sent towards a multicast datalink layer address, it is delivered by the LAN to all communicating devices that belong to the corresponding group. @@ -535,28 +535,28 @@ Nowadays, most deployed networks rely on distributed algorithms, called routing .. The third type of datalink layers are used in Non-Broadcast Multi-Access (NBMA) networks. These networks are used to interconnect devices like a LAN. All devices attached to an NBMA network are identified by a unique datalink layer address. However, and this is the main difference between an NBMA network and a traditional LAN, the NBMA service only supports unicast. The datalink layer service provided by an NBMA network supports neither broadcast nor multicast. -.. The network layer itself relies on the following principles : +.. The network layer itself relies on the following principles : .. #. Each network layer entity is identified by a `network layer address`. This address is independent of the datalink layer addresses that it may use. .. #. The service provided by the network layer does not depend on the service or the internal organisation of the underlying datalink layers. -.. #. The network layer is conceptually divided into two planes : the `data plane` and the `control plane`. The `data plane` contains the protocols and mechanisms that allow hosts and routers to exchange packets carrying user data. The `control plane` contains the protocols and mechanisms that enable routers to efficiently learn how to forward packets towards their final destination. +.. #. The network layer is conceptually divided into two planes : the `data plane` and the `control plane`. The `data plane` contains the protocols and mechanisms that allow hosts and routers to exchange packets carrying user data. The `control plane` contains the protocols and mechanisms that enable routers to efficiently learn how to forward packets towards their final destination. .. The independence of the network layer from the underlying datalink layer is a key principle of the network layer. It ensures that the network layer can be used to allow hosts attached to different types of datalink layers to exchange packets through intermediate routers. Furthermore, this allows the datalink layers and the network layer to evolve independently from each other. This enables the network layer to be easily adapted to a new datalink layer every time a new datalink layer is invented. -.. rubric:: Footnotes +.. rubric:: Footnotes -.. [#flabels] We will see later a more detailed description of Multiprotocol Label Switching, a networking technology that is capable of using one or more labels. +.. 
[#flabels] We will see later a more detailed description of Multiprotocol Label Switching, a networking technology that is capable of using one or more labels.


The control plane
=================

-One of the objectives of the `control plane` in the network layer is to maintain the routing tables that are used on all routers. As indicated earlier, a routing table is a data structure that contains, for each destination address (or block of addresses) known by the router, the outgoing interface over which the router must forward a packet destined to this address. The routing table may also contain additional information such as the address of the next router on the path towards the destination or an estimation of the cost of this path.
+One of the objectives of the `control plane` in the network layer is to maintain the routing tables that are used on all routers. As indicated earlier, a routing table is a data structure that contains, for each destination address (or block of addresses) known by the router, the outgoing interface over which the router must forward a packet destined to this address. The routing table may also contain additional information such as the address of the next router on the path towards the destination or an estimation of the cost of this path.

In this section, we discuss the main techniques that can be used to maintain the forwarding tables in a network.

.. include:: /links.rst
-
diff --git a/book-2nd/principles/reliability.rst b/book-2nd/principles/reliability.rst
index e8f0a0a..853698f 100644
--- a/book-2nd/principles/reliability.rst
+++ b/book-2nd/principles/reliability.rst
@@ -327,7 +327,7 @@ When running on top of a perfect framing sublayer, a datalink entity can simply

     The simplest reliable protocol

-Unfortunately, this is not always sufficient to ensure a reliable delivery of the SDUs. Consider the case where a client sends tens of SDUs to a server. If the server is faster that the client, it will be able to receive and process all the segments sent by the client and deliver their content to its user. However, if the server is slower than the client, problems may arise. The datalink entity contains buffers to store SDUs that have been received as a `DATA.request` but have not yet been sent. If the application is faster than the physical link, the buffer may become full. At this point, the operating system suspends the application to let the datalink entity empty its transmission queue. The datalink entity also uses a buffer to store the received frames that have not yet been processed by the application. If the application is slow to process the data, this buffer may overflow and the datalink entity will not able to accept any additional frames. The buffers of the datalink entity have a limited size and if they overflow, the arriving frames will be discarded, even if they are correct.
+Unfortunately, this is not always sufficient to ensure a reliable delivery of the SDUs. Consider the case where a client sends tens of SDUs to a server. If the server is faster than the client, it will be able to receive and process all the segments sent by the client and deliver their content to its user. However, if the server is slower than the client, problems may arise. The datalink entity contains buffers to store SDUs that have been received as a `DATA.request` but have not yet been sent. If the application is faster than the physical link, the buffer may become full.
At this point, the operating system suspends the application to let the datalink entity empty its transmission queue. The datalink entity also uses a buffer to store the received frames that have not yet been processed by the application. If the application is slow to process the data, this buffer may overflow and the datalink entity will not be able to accept any additional frames. The buffers of the datalink entity have a limited size and if they overflow, the arriving frames will be discarded, even if they are correct. To solve this problem, a reliable protocol must include a feedback mechanism that allows the receiver to inform the sender that it has processed a frame and that another one can be sent. This feedback is required even though there are no transmission errors. To include such a feedback, our reliable protocol must process two types of frames : @@ -478,7 +478,7 @@ The only solution to protect against transmission errors is to add redundancy to .. This simple coding scheme forces the sender to transmit three bits for each source bit. However, it allows the receiver to correct single bit errors. More advanced coding systems that allow to recover from errors are used in several types of physical layers. -Besides framing, datalink layers also include mechanisms to detect and sometimes even recover from transmission errors. To allow a receiver to detect transmission errors, a sender must add some redundant information as an `error detection` code to the frame sent. This `error detection` code is computed by the sender on the frame that it transmits. When the receiver receives a frame with an error detection code, it recomputes it and verifies whether the received `error detection code` matches the computed `error detection code`. If they match, the frame is considered to be valid. Many error detection schemes exist and entire books have been written on the subject. A detailed discussion of these techniques is outside the scope of this book, and we will only discuss some examples to illustrate the key principles. +Besides framing, datalink layers also include mechanisms to detect and sometimes even recover from transmission errors. To allow a receiver to detect transmission errors, a sender must add some redundant information as an `error detection code` to the frame sent. This `error detection code` is computed by the sender on the frame that it transmits. When the receiver receives a frame with an error detection code, it recomputes it and verifies whether the received `error detection code` matches the computed `error detection code`. If they match, the frame is considered to be valid. Many error detection schemes exist and entire books have been written on the subject. A detailed discussion of these techniques is outside the scope of this book, and we will only discuss some examples to illustrate the key principles. To understand `error detection codes`, let us consider two devices that exchange bit strings containing `N` bits. To allow the receiver to detect a transmission error, the sender converts each string of `N` bits into a string of `N+r` bits. Usually, the `r` redundant bits are added at the beginning or the end of the transmitted bit string, but some techniques interleave redundant bits with the original bits. An `error detection code` can be defined as a function that computes the `r` redundant bits corresponding to each string of `N` bits. The simplest error detection code is the parity bit. There are two types of parity schemes : even and odd parity. 
With the `even` (resp. `odd`) parity scheme, the redundant bit is chosen so that an even (resp. odd) number of bits are set to `1` in the transmitted bit string of `N+r` bits. The receiver can easily recompute the parity of each received bit string and discard the strings with an invalid parity. The parity scheme is often used when 7-bit characters are exchanged. In this case, the eighth bit is often a parity bit. The table below shows the parity bits that are computed for bit strings containing three bits.

diff --git a/book-2nd/principles/security.rst b/book-2nd/principles/security.rst
index 1646e03..fce70a8 100644
--- a/book-2nd/principles/security.rst
+++ b/book-2nd/principles/security.rst
@@ -18,10 +18,10 @@ mechanism is the password. A `username` is assigned to each user and when this
 user wants to access the computer, he or she needs to provide his/her
 `username` and his/her `password`. Most passwords are composed of a sequence
 of characters.
-The strength of the password is function of the difficulty of guessing the
-characters chosen by each user. Various guidelines have been defined on how
-to select a good password [#fpasswords]_. Some systems require regular
-modifications of the passwords chosen by their users.
+The strength of the password is a function of the difficulty of guessing the
+characters chosen by each user. Various guidelines have been defined on how
+to select a good password [#fpasswords]_. Some systems require regular
+modifications of the passwords chosen by their users.

.. introduce the need for passwords and exchanging them through the network

@@ -30,7 +30,7 @@ developed to enable them to access to remote computers through the network.
 To authenticate the remote users, these applications have also relied
 on usernames and passwords. When a user connects to a distant computer,
 she sends her username through the network and then provides her password
-to confirm her `identity`. This authentication scheme can be represented
+to confirm her `identity`. This authentication scheme can be represented
 by the time sequence diagram shown below.

.. msc::

@@ -66,7 +66,7 @@ by the time sequence diagram shown below.
   and Hellman [DH1976]_. Since then, Alice and Bob are the most frequently
   used names to represent the users who interact with a network. Other
   characters such as Eve or Mallory have been added over the years.
-  We will explain their respective roles later.
+  We will explain their respective roles later.

.. The usernames and passwords can be sent in different types of packets and segments

@@ -87,7 +87,7 @@ important threats that a network architect must take into account.

.. index:: passive attacker

-The first type of attacker is called the `passive attacker`.
+The first type of attacker is called the `passive attacker`.
 A `passive attacker` is someone who is able
 to observe and usually store the information (e.g. the packets) exchanged
 in a given network or subset of it (e.g. a specific link). This
@@ -97,7 +97,7 @@ are vulnerable to this type of attack. In the above example, a passive
 attacker could easily capture the password sent by Alice and reuse it later to
 be authenticated as Alice on the remote computer. This is illustrated in the
 figure below, where we no longer show the ``DATA.req`` and
-``DATA.ind`` primitives but only show the messages that are exchanged.
+``DATA.ind`` primitives but only show the messages that are exchanged.
 Throughout this chapter, we will always use `Eve` as a user who is able to
 eavesdrop on the data passing in front of her.
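The weakness just described is easy to model. The toy fragment below is not a real protocol: the account dictionary, the `authenticate` function and the `captured` list are all invented for the illustration. It only shows that whatever Eve records on the link can be replayed verbatim.

.. code-block:: python

    # Toy model of a passive attacker facing cleartext passwords.
    accounts = {"alice": "s3cret"}      # hypothetical username and password

    captured = []                       # everything Eve sees on the link

    def authenticate(username, password):
        captured.append((username, password))   # Eve records the exchange
        return accounts.get(username) == password

    authenticate("alice", "s3cret")     # Alice logs in normally
    print(authenticate(*captured[0]))   # True : Eve replays the capture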
@@ -140,7 +140,7 @@ easy. In other networks, this is a bit more complex depending on the network
 technology used, but various software packages exist to automate this
 process. As will be described later, the best approach to prevent this type of
 attack is to rely on cryptographic techniques to ensure that passwords are never
-sent in clear.
+sent in clear.

.. index:: pervasive monitoring, Edward Snowden

@@ -150,7 +150,7 @@ sent in clear.
   a particular user. This is not the only attack of this type. In 2013, based
   on documents collected by Edward Snowden, the press revealed that several
   governmental agencies were collecting lots of data on various links that
-  compose the global Internet [Greenwald2014]_. Thanks to this massive amount
+  compose the global Internet [Greenwald2014]_. Thanks to this massive amount
   of data, these governmental agencies have been able
   to extract lots of information about the behaviour of Internet users. Like
   Eve, they are in a position to extract

@@ -219,10 +219,10 @@ given target. Since the attacking traffic comes from a wide range of sources,
 it is difficult for the victim to locate the culprit and also to counter the
 attack. Saturating a link is the simplest example of `Distributed Denial of Service (DDoS)`
-attacks.
+attacks.

-In practice, there is a possibility of denial of service attacks as soon as
-there is a limited resource somewhere in the network.
+In practice, there is a possibility of denial of service attacks as soon as
+there is a limited resource somewhere in the network.
 This resource can be the bandwidth of a link, but it
 could also be the computational power of a server, its memory or even the size
 of tables used by a given protocol implementation. Defending
@@ -231,7 +231,7 @@ controls a large number of sources that are used to launch the attacks.
 In terms of bandwidth, DoS attacks composed of a few Gbps to a few tens of
 Gbps of traffic are frequent on the Internet. In 2015,
 `github.com `_ suffered from a distributed DoS that
-reached a top bandwidth of 400 Gbps according to some
+reached a top bandwidth of 400 Gbps according to some
 `reports `_.

.. index:: reflection attack, amplification

@@ -247,9 +247,9 @@ response. Often the response is larger or much larger than the request sent
 by the client. Consider that such a simple protocol is used over a datagram
 network. When Alice sends a datagram to Bob containing her request, Bob
 extracts both the request and Alice's address from the packet. He then sends
-his response in a single packet destined to Alice. Mallory would like to create a
+his response in a single packet destined to Alice. Mallory would like to create a
 DoS attack against Alice without being identified. Since he has studied
-the specification of this protocol, he can
+the specification of this protocol, he can
 send a request to Bob inside a packet having Alice's address as its source
 address. Bob will process the request and send his (large) response to
 Alice. If the response has the same size as the request, Mallory
@@ -264,7 +264,7 @@ requests, his victim receives :math:`k` Gbps of attack traffic. Such amplification
 attacks are a very important problem and protocol designers should ensure
 that they never send a large response before having received the proof that
 the request that they have received originated from the source indicated in
-the request.
+the request.


Cryptographic primitives

@@ -321,7 +321,7 @@ the key, then the scheme becomes less secure since the same key is used to
 decrypt different parts of the message.
In practice, `XOR` is often one of the basic operations used by encryption schemes. To be usable, the deployed encryption schemes use keys that are composed of a small number of bits, typically
-56, 64, 128, 256, ...
+56, 64, 128, 256, ...

A secret key encryption scheme is a perfectly reversible function, i.e. given an encryption function `E`, there is an associated
decryption function `D` such that :math:`\forall K, \forall M : D(K, E(K,M))=M`.

.. index:: DES

-Various secret key cryptographic functions have been proposed, implemented and
+Various secret key cryptographic functions have been proposed, implemented and
deployed. The most popular ones are :

 - DES, the Data Encryption Standard that became a standard in 1977 and has

   making the brute force attacks more difficult.
 - RC4 is an encryption scheme defined in the late 1980s by Ron Rivest for RSA Security. Given the speed of its software implementation, it has been included in
-  various protocols and implementations. However, cryptographers have
+  various protocols and implementations. However, cryptographers have
   identified several weaknesses in this algorithm. It is now deprecated
-  and should not be used anymore :rfc:`7465`.
-  - AES or the Advanced Encryption Standard is an encryption scheme that was
-  designed by the Belgian cryptographers Joan Daemen and Vincent Rijmen
-  in 2001 [DR2002]_. This algorithm has been standardised by the U.S.
+  and should not be used anymore :rfc:`7465`.
+  - AES or the Advanced Encryption Standard is an encryption scheme that was
+  designed by the Belgian cryptographers Joan Daemen and Vincent Rijmen
+  in 2001 [DR2002]_. This algorithm has been standardised by the U.S.
   National Institute of Standards and Technology (NIST). It is now used by a wide range of applications and various hardware and software implementations exist. Many

   the smallest message that can be encrypted and forces the sender to divide each message in blocks of the supported size. If the message is larger than an integer number of blocks, then the message must be padded before being
-  encrypted and this padding must be removed after decryption. The key size
-  indicates the resistance of the encryption scheme against brute force
+  encrypted and this padding must be removed after decryption. The key size
+  indicates the resistance of the encryption scheme against brute force
   attacks, i.e. attacks where the attacker tries all possible keys to find the correct one.

AES is widely used as of this writing, but other secret key encryption schemes continue to appear. ChaCha20, proposed by D. Bernstein, is now used by several internet protocols :rfc:`7539`. A detailed discussion of encryption
-schemes is outside the scope of this book. We will consider encryption schemes
+schemes is outside the scope of this book. We will consider encryption schemes
as black boxes whose operation depends on a single key. A detailed overview of several of these schemes may be found in [MVV2011]_.

cryptography, each user has two different keys :

These two keys are generated together and they are linked by a complex mathematical relationship such that it is computationally difficult
-to compute :math:`K_{priv}` from :math:`K_{pub}`.
+to compute :math:`K_{priv}` from :math:`K_{pub}`.
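The relationship between the two keys can be illustrated with a toy RSA key pair. The primes below are far too small to offer any security and the helper function is invented for this sketch; it only shows that applying the private key undoes the public key.

.. code-block:: python

    # Toy RSA key pair (insecure, illustration only; requires Python 3.8+).
    p, q = 61, 53                      # secret primes
    n = p * q                          # public modulus
    e = 17                             # public exponent : Kpub = (e, n)
    d = pow(e, -1, (p - 1) * (q - 1))  # private exponent : Kpriv = (d, n)

    def apply_key(key, value):
        # Encryption and decryption are the same modular exponentiation.
        exponent, modulus = key
        return pow(value, exponent, modulus)

    message = 42
    ciphertext = apply_key((e, n), message)  # encrypt with the public key
    print(apply_key((d, n), ciphertext))     # 42 : decrypted with the private key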
A public key cryptographic scheme is a combination of two functions : @@ -421,7 +421,7 @@ would then compute :math:`C=Checksum(M)` and :math:`SC=E_p(A_{priv},C)`. She would then send both `M` and `SC` to the recipient of the message who can easily compute `C` from `SC` and verify the authenticity of the message. Unfortunately, this solution does not protect Alice and the message's recipient against -a man-in-the-middle attack. If Mallory can intercept the message sent by Alice, +a man-in-the-middle attack. If Mallory can intercept the message sent by Alice, he can easily modify Alice's message and tweak it so that it has the same checksum as the original one. The CRCs, although more complex to compute, suffer from the same problem. @@ -431,17 +431,17 @@ suffer from the same problem. .. wikipedia illustration is nice https://en.wikipedia.org/wiki/MD5 To efficiently sign messages, Alice needs to be able to compute a summary -of her message in a way that makes prohibits an attacker from generating a +of her message in a way that prohibits an attacker from generating a different message that has the same summary. `Cryptographic hash functions` were designed to solve this problem. The ideal hash function is a function that returns a different number for every possible input. In practice, it is impossible to find such a function. Cryptographic hash functions are an approximation of this perfect summarisation function. They compute a summary of a given message in 128, 160, 256 bits or more. They also -exhibit the `avalanche effect`. This effect indicates that a small change in -the message causes a large change in the hash value. Finally hash functions -are very difficult to invert. Knowing a hash value, it is computationally very -difficult to find the corresponding input message. Several hash functions have +exhibit the `avalanche effect`. This effect indicates that a small change in +the message causes a large change in the hash value. Finally hash functions +are very difficult to invert. Knowing a hash value, it is computationally very +difficult to find the corresponding input message. Several hash functions have been proposed by cryptographers. The most popular ones are : - MD5, originally proposed in :rfc:`1321`. It has been used in a wide range of @@ -520,7 +520,7 @@ Mallory is to be authenticated as Alice. If Mallory can capture `Hash(passwd)`, he can simply replay this data, without being able to invert the hash function. This is called a `replay attack`. -To counter this replay attack, we need to ensure that Alice never sends the +To counter this replay attack, we need to ensure that Alice never sends the same information twice to Bob. A possible mode of operation is shown below. .. msc:: @@ -770,7 +770,7 @@ Alice and for Alice to authenticate Bob. A faster authentification could be the c=>d [ label = "" ]; -Alice sends her random nonce, :math:`R2`. Bob signs :math:`R2` and sends his nonce : +Alice sends her random nonce, :math:`R2`. Bob signs :math:`R2` and sends his nonce : :math:`R1`. Alice signs :math:`R1` and both are authenticated. @@ -836,7 +836,7 @@ protocol is not vulnerable anymore. .. index:: certificates, trusted third party -To cope with some of the above mentioned problems, +To cope with some of the above mentioned problems, public-key cryptography is usually combined with certificates. A `certificate` is a data structure that includes a signature from a trusted third party. 
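As a sketch of this idea, the fragment below lets a trusted third party (Ted) sign the binding between a name and a public key, reusing the toy RSA numbers from the previous sketch. The truncated hash and all the names are assumptions made for the illustration, not a real certificate format.

.. code-block:: python

    import hashlib

    # Toy certificate : Ted signs the hash of (name, public key).
    n, e, d = 3233, 17, 2753           # Ted's toy RSA key (insecure)

    def digest(name, pubkey):
        # Hash truncated to one byte so that it fits in the tiny modulus.
        return hashlib.sha256(f"{name}:{pubkey}".encode()).digest()[0]

    def certify(name, pubkey):
        return pow(digest(name, pubkey), d, n)   # signed with Ted's private key

    def verify(name, pubkey, signature):
        return pow(signature, e, n) == digest(name, pubkey)

    cert = certify("alice", 1234)
    print(verify("alice", 1234, cert))  # True : the binding is certified
    print(verify("bob", 1234, cert))    # False (barring a hash collision)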
A simple explanation of the utilisation of certificates @@ -855,7 +855,7 @@ for each certified user : Then, knowing Ted's public key, anyone can verify the validity of a certificate. When a user sends his/her public key, he/she must also attach the certificate to prove the link between his/her identity and the public key. In practice, -certificates are more complex than this. +certificates are more complex than this. Certificates will often be used to authenticate the server and sometimes to authenticate the client. @@ -895,7 +895,7 @@ and Bob need to both : Let us first explore how this could be realised by using public-key cryptography. We assume that Alice and Bob have both a public-private key pair and the corresponding certificates signed by a trusted -third party : Ted. +third party : Ted. A possible protocol would be the following. Alice sends :math:`Cert(Alice_{pub},Ted)`. This certificate provides Alice's @@ -920,8 +920,8 @@ used by an encryption algorithm even in the presence of an eavesdropper. The most widely used algorithm that allows two users to safely exchange an integer in the presence of an eavesdropper is the one proposed by Diffie and Hellman [DH1976]_. -It operates with (large) integers. Two of them are public, the modulus, p, -which is prime and the base, g, which must be a primitive root of p. +It operates with (large) integers. Two of them are public, the modulus, p, +which is prime and the base, g, which must be a primitive root of p. The communicating users select a random integer, :math:`a` for Alice and :math:`b` for Bob. The exchange starts as : @@ -1013,7 +1013,7 @@ mechanism should never be used without authentification. .. rubric:: Footnotes -.. [#fpasswords] The wikipedia page on passwords provides many of these references : +.. [#fpasswords] The wikipedia page on passwords provides many of these references : https://en.wikipedia.org/wiki/Password_strength .. [#frsa] A detailed explanation of the operation of the RSA algorithm is @@ -1028,4 +1028,3 @@ mechanism should never be used without authentification. CPU. .. include:: /links.rst - diff --git a/book-2nd/principles/sharing.rst b/book-2nd/principles/sharing.rst index 9ffe3b7..9ce2a3a 100644 --- a/book-2nd/principles/sharing.rst +++ b/book-2nd/principles/sharing.rst @@ -9,7 +9,7 @@ Sharing resources A network is designed to support a potentially large number of users that exchange information with each other. These users produce and consume information which is exchanged through the network. To support its users, a network uses several types of resources. It is important to keep in mind the different resources that are shared inside the network. -The first and more important resource inside a network is the link bandwidth. There are two situations where link bandwidth needs to be shared between different users. The first situation is when several hosts are attached to the same physical link. This situation mainly occurs in Local Area Networks (LAN). A LAN is a network that efficiently interconnects several hosts (usually a few dozens to a few hundreds) in the same room, building or campus. Consider for a example a network with five hosts. Any of these hosts needs to be able to exchange information with any of the other five hosts. A first organisation for this LAN is the full-mesh. +The first and most important resource inside a network is the link bandwidth. There are two situations where link bandwidth needs to be shared between different users. 
The first situation is when several hosts are attached to the same physical link. This situation mainly occurs in Local Area Networks (LAN). A LAN is a network that efficiently interconnects several hosts (usually a few dozens to a few hundreds) in the same room, building or campus. Consider for example a network with four hosts. Any of these hosts needs to be able to exchange information with any of the other three hosts. A first organization for this LAN is the full-mesh.

 .. figure:: ../../book/intro/svg/fullmesh.png
    :align: center

    A Full mesh network

-The full-mesh is the most reliable and highest performing network to interconnect these five hosts. However, this network organisation has two important drawbacks. First, if a network contains `n` hosts, then :math:`\frac{n\times(n-1)}{2}` links are required. If the network contains more than a few hosts, it becomes impossible to lay down the required physical links. Second, if the network contains `n` hosts, then each host must have :math:`n-1` interfaces to terminante :math:`n-1` links. This is beyond the capabilities of most hosts. Furthermore, if a new host is added to the network, new links have to be laid down and one interface has to be added to each participating host. However, full-mesh has the advantage of providing the lowest delay between the hosts and the best resiliency against link failures. In practice, full-mesh networks are rarely used expected when there are few network nodes and resiliency is key.
+The full-mesh is the most reliable and highest performing network to interconnect these four hosts. However, this network organization has two important drawbacks. First, if a network contains `n` hosts, then :math:`\frac{n\times(n-1)}{2}` links are required. If the network contains more than a few hosts, it becomes impossible to lay down the required physical links. Second, if the network contains `n` hosts, then each host must have :math:`n-1` interfaces to terminate :math:`n-1` links. This is beyond the capabilities of most hosts. Furthermore, if a new host is added to the network, new links have to be laid down and one interface has to be added to each participating host. However, full-mesh has the advantage of providing the lowest delay between the hosts and the best resiliency against link failures. In practice, full-mesh networks are rarely used except when there are few network nodes and resiliency is key.

-The second possible physical organisation, which is also used inside computers to connect different extension cards, is the bus. In a bus network, all hosts are attached to a shared medium, usually a cable through a single interface. When one host sends an electrical signal on the bus, the signal is received by all hosts attached to the bus. A drawback of bus-based networks is that if the bus is physically cut, then the network is split into two isolated networks. For this reason, bus-based networks are sometimes considered to be difficult to operate and maintain, especially when the cable is long and there are many places where it can break. Such a bus-based topology was used in early Ethernet networks.
+The second possible physical organization, which is also used inside computers to connect different extension cards, is the bus. In a bus network, all hosts are attached to a shared medium, usually a cable, through a single interface.
When one host sends an electrical signal on the bus, the signal is received by all hosts attached to the bus. A drawback of bus-based networks is that if the bus is physically cut, then the network is split into two isolated networks. For this reason, bus-based networks are sometimes considered to be difficult to operate and maintain, especially when the cable is long and there are many places where it can break. Such a bus-based topology was used in early Ethernet networks. .. figure:: ../../book/intro/svg/bus.png :align: center - :scale: 50 + :scale: 50 A network organized as a Bus -A third organisation of a computer network is a star topology. In such topologies, hosts have a single physical interface and there is one physical link between each host and the center of the star. The node at the center of the star can be either a piece of equipment that amplifies an electrical signal, or an active device, such as a piece of equipment that understands the format of the messages exchanged through the network. Of course, the failure of the central node implies the failure of the network. However, if one physical link fails (e.g. because the cable has been cut), then only one node is disconnected from the network. In practice, star-shaped networks are easier to operate and maintain than bus-shaped networks. Many network administrators also appreciate the fact that they can control the network from a central point. Administered from a Web interface, or through a console-like connection, the center of the star is a useful point of control (enabling or disabling devices) and an excellent observation point (usage statistics). +A third organization of a computer network is a star topology. In such topologies, hosts have a single physical interface and there is one physical link between each host and the center of the star. The node at the center of the star can be either a piece of equipment that amplifies an electrical signal, or an active device, such as a piece of equipment that understands the format of the messages exchanged through the network. Of course, the failure of the central node implies the failure of the network. However, if one physical link fails (e.g. because the cable has been cut), then only one node is disconnected from the network. In practice, star-shaped networks are easier to operate and maintain than bus-shaped networks. Many network administrators also appreciate the fact that they can control the network from a central point. Administered from a Web interface, or through a console-like connection, the center of the star is a useful point of control (enabling or disabling devices) and an excellent observation point (usage statistics). .. figure:: ../../book/intro/svg/star.png :align: center - :scale: 50 + :scale: 50 - A network organised as a Star + A network organized as a Star -A fourth physical organisation of a network is the ring topology. Like the bus organisation, each host has a single physical interface connecting it to the ring. Any signal sent by a host on the ring will be received by all hosts attached to the ring. From a redundancy point of view, a single ring is not the best solution, as the signal only travels in one direction on the ring; thus if one of the links composing the ring is cut, the entire network fails. In practice, such rings have been used in local area networks, but are now often replaced by star-shaped networks. In metropolitan networks, rings are often used to interconnect multiple locations. 
In this case, two parallel links, composed of different cables, are often used for redundancy. With such a dual ring, when one ring fails all the traffic can be quickly switched to the other ring. +A fourth physical organization of a network is the ring topology. Like the bus organization, each host has a single physical interface connecting it to the ring. Any signal sent by a host on the ring will be received by all hosts attached to the ring. From a redundancy point of view, a single ring is not the best solution, as the signal only travels in one direction on the ring; thus if one of the links composing the ring is cut, the entire network fails. In practice, such rings have been used in local area networks, but are now often replaced by star-shaped networks. In metropolitan networks, rings are often used to interconnect multiple locations. In this case, two parallel links, composed of different cables, are often used for redundancy. With such a dual ring, when one ring fails all the traffic can be quickly switched to the other ring. .. figure:: ../../book/intro/svg/ring.png :align: center - :scale: 50 + :scale: 50 - A network organised as a ring + A network organized as a ring -A fifth physical organisation of a network is the tree. Such networks are typically used when a large number of customers must be connected in a very cost-effective manner. Cable TV networks are often organised as trees. +A fifth physical organization of a network is the tree. Such networks are typically used when a large number of customers must be connected in a very cost-effective manner. Cable TV networks are often organized as trees. .. figure:: ../../book/intro/svg/tree.png :align: center - :scale: 50 + :scale: 50 A network organized as a Tree Sharing bandwidth ================= - + In all these networks, except the full-mesh, the link bandwidth is shared among all connected hosts. Various algorithms have been proposed and are used to efficiently share the access to this resource. We explain several of them in the Medium Access Control section below. .. note:: Fairness in computer networks - Sharing resources is important to ensure that the network efficiently serves its user. In practice, there are many ways to share resources. Some resource sharing schemes consider that some users are more important than others and should obtain more resources. For example, on the highways, police cars and ambulances have priority to use the highways. In some cities, traffic lanes are reserved for buses to promote public services, ... In computer networks, the same problem arise. Given that resources are limited, the network needs to enable users to efficiently share them. Before designing an efficient resource sharing scheme, one needs to first formalize its objectives. In computer networks, the most popular objective for resource sharing schemes is that they must be `fair`. In a simple situation, for example two hosts using a shared 2 Mbps link, the sharing scheme should allocate the same bandwidth to each user, in this case 1 Mbps. However, in a large networks, simply dividing the available resources by the number of users is not sufficient. Consider the network shown in the figure below where `A1` sends data to `A2`, `B1` to `B2`, ... In this network, how should we divide the bandwidth among the different flows ? A first approach would be to allocate the same bandwidth to each flow. In this case, each flow would obtain 5 Mbps and the link between `R2` and `R3` would not be fully loaded. 
Another approach would be to allocate 10 Mbps to `A1-A2`, 20 Mbps to `C1-C2` and nothing to `B1-B2`. This is clearly unfair.
+  Sharing resources is important to ensure that the network efficiently serves its users. In practice, there are many ways to share resources. Some resource sharing schemes consider that some users are more important than others and should obtain more resources. For example, on the highways, police cars and ambulances have priority to use the highways. In some cities, traffic lanes are reserved for buses to promote public services, ... In computer networks, the same problem arises. Given that resources are limited, the network needs to enable users to efficiently share them. Before designing an efficient resource sharing scheme, one needs to first formalize its objectives. In computer networks, the most popular objective for resource sharing schemes is that they must be `fair`. In a simple situation, for example two hosts using a shared 2 Mbps link, the sharing scheme should allocate the same bandwidth to each user, in this case 1 Mbps. However, in large networks, simply dividing the available resources by the number of users is not sufficient. Consider the network shown in the figure below where `A1` sends data to `A2`, `B1` to `B2`, ... In this network, how should we divide the bandwidth among the different flows ? A first approach would be to allocate the same bandwidth to each flow. In this case, each flow would obtain 5 Mbps and the link between `R2` and `R3` would not be fully loaded. Another approach would be to allocate 10 Mbps to `A1-A2`, 20 Mbps to `C1-C2` and nothing to `B1-B2`. This is clearly unfair.

 .. graphviz::

@@ -109,12 +109,12 @@
   In large networks, fairness is always a compromise. The most widely used definition of fairness is the `max-min fairness`. A bandwidth allocation in a network is said to be `max-min fair` if it is such that it is impossible to allocate more bandwidth to one of the flows without reducing the bandwidth of a flow that already has a smaller allocation than the flow that we want to increase. If the network is completely known, it is possible to derive a `max-min fair` allocation as follows. Initially, all flows have a null bandwidth and they are placed in the candidate set. The bandwidth allocation of all flows in the candidate set is increased until one link becomes congested. At this point, the flows that use the congested link have reached their maximum allocation. They are removed from the candidate set and the process continues until the candidate set becomes empty.

-  In the above network, the allocation of all flows would grow until `A1-A2` and `B1-B2` reach 5 Mbps. At this point, link `R1-R2` becomes congested and these two flows have reached their maximum. The allocation for flow `C1-C2` can increase until reaching 15 Mbps. At this point, link `R2-R3` is congested.
To increase the bandwidth allocated to `C1-C2`, one would need to reduce the allocation to flow `B1-B2`. Similarly, the only way to increase the allocation to flow `B1-B2` would require a decrease of the allocation to `A1-A2`.
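   The progressive filling described above is easy to turn into a small program. The sketch below is only an illustration of the procedure, not part of any protocol : the `links` and `flows` dictionaries, whose capacities and paths are inferred from the example discussed above, are assumptions of this sketch.

   .. code-block:: python

      # Illustrative max-min fair allocation by progressive filling.
      def max_min_fair(links, flows):
          alloc = {f: 0.0 for f in flows}   # current allocation of each flow
          remaining = dict(links)           # remaining capacity of each link
          candidates = set(flows)           # flows that can still grow
          while candidates:
              # largest equal increment that does not overload any link
              incr = min(remaining[l] / len([f for f in candidates if l in flows[f]])
                         for l in remaining
                         if any(l in flows[f] for f in candidates))
              for f in candidates:
                  alloc[f] += incr
                  for l in flows[f]:
                      remaining[l] -= incr
              # flows crossing a saturated link have reached their maximum
              saturated = [l for l in remaining if remaining[l] <= 1e-9]
              candidates = {f for f in candidates
                            if not any(l in saturated for l in flows[f])}
          return alloc

      links = {'R1-R2': 10, 'R2-R3': 20}    # capacities, in Mbps
      flows = {'A1-A2': ['R1-R2'],
               'B1-B2': ['R1-R2', 'R2-R3'],
               'C1-C2': ['R2-R3']}
      print(max_min_fair(links, flows))
      # {'A1-A2': 5.0, 'B1-B2': 5.0, 'C1-C2': 15.0}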
"In October 1986, the data throughput from LBL to UC Berkeley ... dropped from 32 Kbps to 40 bps. We were fascinated by this sudden factor-of-thousand drop in bandwidth and embarked on an investigation of why things had gotten so bad." This work lead to the development of various congestion control techniques that have allowed the Internet to continue to grow without experiencing widespread congestion collapse events. + Congestion collapse is unfortunately not only an academic experience. Van Jacobson reports in [Jacobson1988]_ one of these events that affected him while he was working at the Lawrence Berkeley Laboratory (LBL). LBL was two network nodes away from the University of California in Berkeley. At that time, the link between the two sites had a bandwidth of 32 Kbps, but some hosts were already attached to 10 Mbps LANs. "In October 1986, the data throughput from LBL to UC Berkeley ... dropped from 32 Kbps to 40 bps. We were fascinated by this sudden factor-of-thousand drop in bandwidth and embarked on an investigation of why things had gotten so bad." This work lead to the development of various congestion control techniques that have allowed the Internet to continue to grow without experiencing widespread congestion collapse events. -Besides bandwidth and memory, a third resource that needs to be shared inside a network is the (packet) processing capacity. To forward a packet, a network node needs bandwidth on the outgoing link, but it also needs to analyze the packet header to perform a lookup inside its forwarding table. Performing these lookup operations require resources such as CPU cycles or memory accesses. Network nodes are usually designed to be able to sustain a given packet processing rate, measured in packets per second. +Besides bandwidth and memory, a third resource that needs to be shared inside a network is the (packet) processing capacity. To forward a packet, a network node needs bandwidth on the outgoing link, but it also needs to analyze the packet header to perform a lookup inside its forwarding table. Performing these lookup operations requires resources such as CPU cycles or memory accesses. Network nodes are usually designed to be able to sustain a given packet processing rate, measured in packets per second. .. note:: Packets per second versus bits per second The performance of network nodes can be characterized by two key metrics : - + - the node's capacity measured in bits per second - the node's lookup performance measured in packets per second - The node's capacity in bits per second mainly depends on the physical interfaces that it uses and also on the capacity of the internal interconnection (bus, crossbar switch, ...) between the different interfaces inside the node. Many vendors, in particular for low-end devices will use the sum of the bandwidth of the nodes' interfaces as the node capacity in bits per second. Measurements do not always match this maximum theoretical capacity. A well designed network node will usually have a capacity in bits per second larger than the sum of its link capacities. Such nodes will usually reach this maximum capacity when forwarding large packets. + The node's capacity in bits per second mainly depends on the physical interfaces that it uses and also on the capacity of the internal interconnection (bus, crossbar switch, ...) between the different interfaces inside the node. 
 .. index:: network congestion

@@ -156,19 +156,19 @@ If `R1` has enough buffers, it will be able to absorb the load without having to

 .. note:: Congestion collapse on the Internet

-   Congestion collapse is unfortunately not only an academic experience. Van Jacobson reports in [Jacobson1988]_ one of these events that affected him while he was working at the Lawrence Berkeley Laboratory (LBL). LBL was two network nodes away from the University of California in Berkeley. At that time, the link between the two sites had a bandwidth of 32 Kbps, but some hosts were already attached to 10 Mbps LANs. "In October 1986, the data throughput from LBL to UC Berkeley ... dropped from 32 Kbps to 40 bps. We were fascinated by this sudden factor-of-thousand drop in bandwidth and embarked on an investigation of why things had gotten so bad." This work lead to the development of various congestion control techniques that have allowed the Internet to continue to grow without experiencing widespread congestion collapse events.
+   Congestion collapse is unfortunately not only an academic experience. Van Jacobson reports in [Jacobson1988]_ one of these events that affected him while he was working at the Lawrence Berkeley Laboratory (LBL). LBL was two network nodes away from the University of California in Berkeley. At that time, the link between the two sites had a bandwidth of 32 Kbps, but some hosts were already attached to 10 Mbps LANs. "In October 1986, the data throughput from LBL to UC Berkeley ... dropped from 32 Kbps to 40 bps. We were fascinated by this sudden factor-of-thousand drop in bandwidth and embarked on an investigation of why things had gotten so bad." This work led to the development of various congestion control techniques that have allowed the Internet to continue to grow without experiencing widespread congestion collapse events.

-Besides bandwidth and memory, a third resource that needs to be shared inside a network is the (packet) processing capacity. To forward a packet, a network node needs bandwidth on the outgoing link, but it also needs to analyze the packet header to perform a lookup inside its forwarding table. Performing these lookup operations require resources such as CPU cycles or memory accesses. Network nodes are usually designed to be able to sustain a given packet processing rate, measured in packets per second.
+Besides bandwidth and memory, a third resource that needs to be shared inside a network is the (packet) processing capacity. To forward a packet, a network node needs bandwidth on the outgoing link, but it also needs to analyze the packet header to perform a lookup inside its forwarding table. Performing these lookup operations requires resources such as CPU cycles or memory accesses. Network nodes are usually designed to be able to sustain a given packet processing rate, measured in packets per second.

 .. note:: Packets per second versus bits per second

    The performance of network nodes can be characterized by two key metrics :
-
+
    - the node's capacity measured in bits per second
    - the node's lookup performance measured in packets per second

-   The node's capacity in bits per second mainly depends on the physical interfaces that it uses and also on the capacity of the internal interconnection (bus, crossbar switch, ...) between the different interfaces inside the node. Many vendors, in particular for low-end devices will use the sum of the bandwidth of the nodes' interfaces as the node capacity in bits per second. Measurements do not always match this maximum theoretical capacity. A well designed network node will usually have a capacity in bits per second larger than the sum of its link capacities. Such nodes will usually reach this maximum capacity when forwarding large packets.
+   The node's capacity in bits per second mainly depends on the physical interfaces that it uses and also on the capacity of the internal interconnection (bus, crossbar switch, ...) between the different interfaces inside the node. Many vendors, in particular for low-end devices, will use the sum of the bandwidth of the nodes' interfaces as the node capacity in bits per second. Measurements do not always match this maximum theoretical capacity. A well-designed network node will usually have a capacity in bits per second larger than the sum of its link capacities. Such nodes will usually reach this maximum capacity when forwarding large packets.

    When a network node forwards small packets, its performance is usually limited by the number of lookup operations that it can perform every second. This lookup performance is measured in packets per second. The performance may depend on the length of the forwarded packets. The key performance factor is the number of minimal size packets that are forwarded by the node every second. This rate can lead to a capacity in bits per second which is much lower than the sum of the bandwidth of the node's links.
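   A back-of-the-envelope computation shows why small packets stress the lookup engine much more than large ones. The figures below are only an illustration and ignore framing overhead.

   .. code-block:: python

      # Packets per second needed to fill a link, ignoring framing overhead.
      def packets_per_second(link_bps, packet_size_bytes):
          return link_bps / (packet_size_bytes * 8)

      link = 10 * 10**9                      # a 10 Gbps interface
      print(packets_per_second(link, 1500))  # ~0.83 million packets/second
      print(packets_per_second(link, 40))    # ~31 million packets/second

   A lookup engine that sustains one million lookups per second would thus fill this link with 1500-byte packets, but would use only a few percent of its capacity with 40-byte packets.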
@@ -180,9 +180,9 @@ Let us now try to present a broad overview of the congestion problem in networks

 Let us first explore which mechanisms can be used inside a network to control congestion and how these mechanisms can influence the behavior of the end hosts.

-As explained earlier, one of the first manifestation of congestion on network nodes is the saturation of the network links that leads to a growth in the occupancy of the buffers of the node. This growth of the buffer occupancy implies that some packets will spend more time in the buffer and thus in the network. If hosts measure the network delays (e.g. by measuring the round-trip-time between the transmission of a packet and the return of the corresponding acknowledgement) they could start to sense congestion. On low bandwidth links, a growth in the buffer occupancy can lead to an increase of the delays which can be easily measured by the end hosts. On high bandwidth links, a few packets inside the buffer will cause a small variation in the delay which may not necessarily be larger that the natural fluctuations of the delay measurements.
+As explained earlier, one of the first manifestations of congestion on network nodes is the saturation of the network links, which leads to a growth in the occupancy of the buffers of the node. This growth of the buffer occupancy implies that some packets will spend more time in the buffer and thus in the network. If hosts measure the network delays (e.g. by measuring the round-trip-time between the transmission of a packet and the return of the corresponding acknowledgement) they could start to sense congestion. On low bandwidth links, a growth in the buffer occupancy can lead to an increase of the delays which can be easily measured by the end hosts. On high bandwidth links, a few packets inside the buffer will cause a small variation in the delay which may not necessarily be larger than the natural fluctuations of the delay measurements.
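A host can turn this observation into a crude congestion detector. The sketch below is one possible illustration, not a standardized algorithm : it treats the smallest round-trip-time seen so far as the propagation delay and interprets any excess as queueing delay caused by buffer occupancy.

.. code-block:: python

    # Illustrative delay-based congestion detector.
    class DelaySensor:
        def __init__(self, threshold):
            self.base_rtt = None        # smallest RTT observed so far
            self.threshold = threshold  # queueing delay treated as congestion

        def congested(self, rtt):
            if self.base_rtt is None or rtt < self.base_rtt:
                self.base_rtt = rtt
            queueing_delay = rtt - self.base_rtt
            return queueing_delay > self.threshold

    sensor = DelaySensor(threshold=0.050)       # 50 milliseconds
    for rtt in [0.100, 0.102, 0.180, 0.101]:    # RTT samples in seconds
        print(rtt, sensor.congested(rtt))       # only 0.180 is flagged

As the paragraph above explains, choosing the threshold is the difficult part : on high bandwidth links the queueing delay may remain within the noise of the measurements.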
-If the buffer's occupancy continues to grow, it will overflow and packets will need to be discarded. Discarding packets during congestion is the second possible reaction of a network node to congestion. Before looking at how a node can discard packets, it is interesting to discuss qualitatively the impact of the buffer occupancy on the reliable delivery of data through a network. This is illustrated by the figure below, adapted from [Jain1990]_.
+If the buffer's occupancy continues to grow, it will overflow and packets will need to be discarded. Discarding packets during congestion is the second possible reaction of a network node to congestion. Before looking at how a node can discard packets, it is interesting to discuss qualitatively the impact of the buffer occupancy on the reliable delivery of data through a network. This is illustrated by the figure below, adapted from [Jain1990]_.

 .. figure:: figures/png/jain.png
    :align: center

    Network congestion

-When the network load is low, buffer occupancy and link utilizations are low. The buffers on the network nodes are mainly used to absorb very short bursts of packets, but on average the traffic demand is lower than the network capacity. If the demand increases, the average buffer occupancy will increase as well. Measurements have shown that the total throughput increases as well. If the buffer occupancy is zero or very low, transmission opportunities on network links can be missed. This is not the case when the buffer occupancy is small but non zero. However, if the buffer occupancy continues to increase, the buffer becomes overloaded and the throughput does not increase anymore. When the buffer occupancy is close to the maximum, the throughput may decrease. This drop in throughput can be caused by excessive retransmissions of reliable protocols that incorrectly assume that previously sent packets have been lost while they are still waiting in the buffer. The network delay on the other hand increases with the buffer occupancy. In practice, a good operating point for a network buffer is a low occupancy to achieve high link utilization and also low delay for interactive applications.
+When the network load is low, buffer occupancy and link utilizations are low. The buffers on the network nodes are mainly used to absorb very short bursts of packets, but on average the traffic demand is lower than the network capacity. If the demand increases, the average buffer occupancy will increase as well. Measurements have shown that the total throughput increases as well. If the buffer occupancy is zero or very low, transmission opportunities on network links can be missed. This is not the case when the buffer occupancy is small but non-zero. However, if the buffer occupancy continues to increase, the buffer becomes overloaded and the throughput does not increase anymore. When the buffer occupancy is close to the maximum, the throughput may decrease. This drop in throughput can be caused by excessive retransmissions of reliable protocols that incorrectly assume that previously sent packets have been lost while they are still waiting in the buffer. The network delay, on the other hand, increases with the buffer occupancy. In practice, a good operating point for a network buffer is a low occupancy to achieve high link utilization and also low delay for interactive applications.

 .. index:: packet discard mechanism

@@ -202,9 +202,9 @@ Discarding packets is one of the signals that the network nodes can use to infor

   By combining different answers to these questions, network researchers have developed different packet discard mechanisms.

-   - `tail drop` is the simplest packet discard technique. When a buffer is full, the arriving packet is discarded. `Tail drop` can be easily implemented. This is, by far, the most widely used packet discard mechanism. However, it suffers from two important drawbacks. First, since `tail drop` discards packets only when the buffer is full, buffers tend to be congested and realtime applications may suffer from the increased delays. Second, `tail drop` is blind when it discards a packet. It may discard a packet from a low bandwidth interactive flow while most of the buffer is used by large file transfers.
+   - `tail drop` is the simplest packet discard technique. When a buffer is full, the arriving packet is discarded. `Tail drop` can be easily implemented. This is, by far, the most widely used packet discard mechanism. However, it suffers from two important drawbacks. First, since `tail drop` discards packets only when the buffer is full, buffers tend to be congested and realtime applications may suffer from the increased delays. Second, `tail drop` is blind when it discards a packet. It may discard a packet from a low bandwidth interactive flow while most of the buffer is used by large file transfers.
   - `drop from front` is an alternative packet discard technique. Instead of removing the arriving packet, it removes the packet that was at the head of the queue. Discarding this packet instead of the arriving one can have two advantages. First, it has already spent a long time in the buffer. Second, hosts should be able to detect the loss (and thus the congestion) earlier.
-   - `probabilistic drop`. Various random drop techniques have been proposed. Compared to the previous techniques. A frequently cited technique is `Random Early Discard` (RED) [FJ1993]_. RED measures the average buffer occupancy and probabilistically discards packets when this average occupancy is too high. Compared to `tail drop` and `drop from front`, an advantage of `RED` is that thanks to the probabilistic drops, packets should be discarded from different flows in proportion of their bandwidth.
+   - `probabilistic drop`. Various random drop techniques have been proposed as refinements of the previous techniques. A frequently cited technique is `Random Early Discard` (RED) [FJ1993]_. RED measures the average buffer occupancy and probabilistically discards packets when this average occupancy is too high. Compared to `tail drop` and `drop from front`, an advantage of `RED` is that, thanks to the probabilistic drops, packets should be discarded from different flows in proportion to their bandwidth, as the sketch after this list illustrates.
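A minimal flavour of such a probabilistic discard policy is sketched below. The linear drop-probability curve and the parameter values are illustrative simplifications, not the exact algorithm of [FJ1993]_ ; a real implementation would, in particular, compute the average occupancy with an exponentially weighted moving average.

.. code-block:: python

    import random

    # Simplified RED-like drop decision (illustrative only).
    def red_drop(avg_occupancy, min_th, max_th, max_p):
        """Return True if the arriving packet should be discarded."""
        if avg_occupancy < min_th:
            return False      # short queue : never drop
        if avg_occupancy >= max_th:
            return True       # long queue : always drop
        # in between, drop with a probability that grows with occupancy
        p = max_p * (avg_occupancy - min_th) / (max_th - min_th)
        return random.random() < p

    # with a 100-packet buffer : start dropping at 20, drop all above 80
    for occupancy in [10, 30, 60, 90]:
        print(occupancy, red_drop(occupancy, min_th=20, max_th=80, max_p=0.1))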
 Discarding packets is a frequent reaction to network congestion. Unfortunately, discarding packets is not optimal since a packet which is discarded on a network node has already consumed resources on the upstream nodes. There are other ways for the network to inform the end hosts of the current congestion level. A first solution is to mark the packets when a node is congested. Several networking technologies have relied on this kind of packet marking.

@@ -221,14 +221,14 @@ If the packet header does not contain any bit in the header to represent the cur

 Dropping and marking packets is not the only possible reaction of a router that becomes congested. A router could also selectively delay packets belonging to some flows. There are different algorithms that can be used by a router to delay packets. If the objective of the router is to fairly distribute the bandwidth of an output link among competing flows, one possibility is to organize the buffers of the router as a set of queues. For simplicity, let us assume that the router is capable of supporting a fixed number of concurrent flows, say `N`. One of the queues of the router is associated to each flow and when a packet arrives, it is placed at the tail of the corresponding queue. All the queues are controlled by a `scheduler`. A `scheduler` is an algorithm that is run each time there is an opportunity to transmit a packet on the outgoing link. Various schedulers have been proposed in the scientific literature and some are used in real routers.

 .. figure:: figures/png/scheduler.png
-
-   A round-robin scheduler
+
+   A round-robin scheduler

 A very simple scheduler is the `round-robin scheduler`. This scheduler serves all the queues in a round-robin fashion. If all flows send packets of the same size, then the round-robin scheduler allocates the bandwidth fairly among the different flows. Otherwise, it favors flows that are using larger packets. Extensions to the `round-robin scheduler` have been proposed to provide a fair distribution of the bandwidth with variable-length packets [SV1995]_, but these are outside the scope of this chapter.

 .. code-block:: python
-
+
     # N queues
     # state variable : next_queue
     next_queue=0
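The hunk above only keeps the first lines of the book's round-robin pseudo-code. As a self-contained illustration of the same idea, a round-robin scheduler over `N` queues could be sketched as follows ; the `queues` list and the `transmit` callback are assumptions of this sketch, not names used elsewhere in the book.

.. code-block:: python

    from collections import deque

    # Illustrative round-robin scheduler over N per-flow queues.
    N = 4
    queues = [deque() for _ in range(N)]   # one FIFO queue per flow
    next_queue = 0                         # state variable : queue served next

    def enqueue(flow, packet):
        queues[flow].append(packet)

    def schedule(transmit):
        """Called whenever the outgoing link can accept one packet."""
        global next_queue
        for _ in range(N):                 # inspect each queue at most once
            queue = queues[next_queue]
            next_queue = (next_queue + 1) % N
            if queue:
                transmit(queue.popleft())
                return True
        return False                       # all the queues were empty

    enqueue(0, 'p1'); enqueue(0, 'p2'); enqueue(2, 'p3')
    while schedule(print):                 # prints p1, p3, p2
        pass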
@@ -252,7 +252,7 @@ Distributing the load across the network

 Delays, packet discards, packet markings and control packets are the main types of information that the network can exchange with the end hosts. Discarding packets is the main action that a network node can perform if the congestion is too severe. Besides tackling congestion at each node, it is also possible to divert some traffic flows from heavily loaded links to reduce congestion. Early routing algorithms [MRR1980]_ have used delay measurements to detect congestion between network nodes and update the link weights dynamically. By reflecting the delay perceived by applications in the link weights used for the shortest paths computation, these routing algorithms managed to dynamically change the forwarding paths in reaction to congestion. However, deployment experience showed that these dynamic routing algorithms could cause oscillations and did not necessarily lower congestion. Deployed datagram networks rarely use dynamic routing algorithms, except in some wireless networks. In datagram networks, the state of the art reaction to long term congestion, i.e. congestion lasting hours, days or more, is to measure the traffic demand and then select the link weights [FRT2002]_ that minimize the maximum link loads. If the congestion lasts longer, changing the weights is not sufficient anymore and the network needs to be upgraded with new or faster links. However, in Wide Area Networks, adding new links can take months.

-In virtual circuit networks, another way to manage or prevent congestion is to limit the number of circuits that use the network at any time. This technique is usually called `connection admission control`. When a host requests the creation of a new circuit in the network, it specifies the destination and in some networking technologies the required bandwidth. With this information, the network can check whether there are enough resources available to reach this particular destination. If yes, the circuit is established. If not, the request is denied and the host will have to defer the creation of its virtual circuit. `Connection admission control` schemes are widely used in the telephone networks. In these networks, a busy tone corresponds to an unavailable destination or a congested network.
+In virtual circuit networks, another way to manage or prevent congestion is to limit the number of circuits that use the network at any time. This technique is usually called `connection admission control`. When a host requests the creation of a new circuit in the network, it specifies the destination and, in some networking technologies, the required bandwidth. With this information, the network can check whether there are enough resources available to reach this particular destination. If yes, the circuit is established. If not, the request is denied and the host will have to defer the creation of its virtual circuit. `Connection admission control` schemes are widely used in telephone networks. In these networks, a busy tone corresponds to an unavailable destination or a congested network.

 In datagram networks, this technique cannot be easily used since the basic assumption of such a network is that a host can send any packet towards any destination at any time. A host does not need to request the authorization of the network to send packets towards a particular destination.

@@ -260,8 +260,8 @@ Based on the feedback received from the network, the hosts can adjust their tran

 Another way to share the network resources is to distribute the load across multiple links. Many techniques have been designed to spread the load over the network. As an illustration, let us briefly consider how load can be shared when accessing some content. Consider a large and popular file such as the image of a Linux distribution or the upgrade of a commercial operating system that will be downloaded by many users. There are many ways to distribute this large file. A naive solution is to place one copy of the file on a server and allow all users to download this file from the server. If the file is popular and millions of users want to download it, the server will quickly become overloaded. There are two classes of solutions that can be used to serve a large number of users. A first approach is to store the file on servers whose name is known by the clients. Before retrieving the file, each client will query the name service to obtain the address of the server. If the file is available from many servers, the name service can provide different addresses to different clients. This will automatically spread the load since different clients will download the file from different servers. Most large content providers use such a solution to distribute large files or videos.

-There is another solution that allows to spread the load among many sources without relying on the name service. The popular bittorent service
-is an example of this approach. With this solution, each file is divided in blocks of a fixed size. To retrieve a file, a client needs to retrieve all the blocks that compose the file. However, nothing forces the client to retrieve all the blocks in sequence and from the same server. Each file is associated with metadata that indicates for each block a list of addresses of hosts that store this block. To retrieve a complete file, a client first downloads the metadata. Then, it tries to retrieve each block from one of the hosts that store the block. In practice, implementations often try to download several blocks in parallel. Once one block has been successfully downloaded, the next block can be requested. If a host is slow to provide one block or becomes unavailable, the client can contact another host listed in the metadata. Most deployments of bittorrent allow the clients to participate to the distribution of blocks. Once a client has downloaded one block, it contacts the server which stores the metadata to indicate that it can also provide this block. With this scheme, when a file is popular, its blocks are downloaded by many hosts that automatically participate in the distribution of the blocks. Thus, the number of `servers` that are capable of providing blocks from a popular file automatically increases with the file's popularity.
+There is another solution that allows the load to be spread among many sources without relying on the name service. The popular bittorrent service
+is an example of this approach. With this solution, each file is divided into blocks of a fixed size. To retrieve a file, a client needs to retrieve all the blocks that compose the file. However, nothing forces the client to retrieve all the blocks in sequence and from the same server. Each file is associated with metadata that indicates, for each block, a list of addresses of hosts that store this block. To retrieve a complete file, a client first downloads the metadata. Then, it tries to retrieve each block from one of the hosts that store the block. In practice, implementations often try to download several blocks in parallel. Once one block has been successfully downloaded, the next block can be requested. If a host is slow to provide one block or becomes unavailable, the client can contact another host listed in the metadata. Most deployments of bittorrent allow the clients to participate in the distribution of blocks. Once a client has downloaded one block, it contacts the server which stores the metadata to indicate that it can also provide this block. With this scheme, when a file is popular, its blocks are downloaded by many hosts that automatically participate in the distribution of the blocks. Thus, the number of `servers` that are capable of providing blocks from a popular file automatically increases with the file's popularity.
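The block-by-block retrieval described above fits in a few lines of code. The sketch below is a schematic illustration of the idea, not the real bittorrent protocol : the `metadata` structure and the `download` function are assumptions of the sketch.

.. code-block:: python

    import random

    # metadata : for each block of the file, the hosts that store a copy
    metadata = {0: ['host-a', 'host-b'],
                1: ['host-b', 'host-c'],
                2: ['host-a', 'host-c']}

    def download(block, host):
        """Placeholder for the actual transfer ; returns None on failure."""
        return 'data-of-block-%d' % block

    def retrieve_file(metadata):
        blocks = {}
        for block, hosts in metadata.items():
            # try the hosts in random order until one delivers the block
            for host in random.sample(hosts, len(hosts)):
                data = download(block, host)
                if data is not None:
                    blocks[block] = data
                    break
        # reassemble the file from its blocks, in order
        return [blocks[b] for b in sorted(blocks)]

    print(retrieve_file(metadata))

A real client would request several blocks in parallel and would also advertise the blocks that it already holds, as explained above.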
 Now that we have provided a broad overview of the techniques that can be used to spread the load and allocate resources in the network, let us analyze two techniques in more detail : Medium Access Control and Congestion control.

@@ -289,7 +289,7 @@ We first discuss a simple deterministic MAC algorithm and then we describe sever

 Static allocation methods
 -------------------------

-A first solution to share the available resources among all the devices attached to one Local Area Network is to define, `a priori`, the distribution of the transmission resources among the different devices. If `N` devices need to share the transmission capacities of a LAN operating at `b` Mbps, each device could be allocated a bandwidth of :math:`\frac{b}{N}` Mbps.
+A first solution to share the available resources among all the devices attached to one Local Area Network is to define, `a priori`, the distribution of the transmission resources among the different devices. If `N` devices need to share the transmission capacities of a LAN operating at `b` Mbps, each device could be allocated a bandwidth of :math:`\frac{b}{N}` Mbps.

 .. index:: Frequency Division Multiplexing, FDM

@@ -309,11 +309,11 @@ Limited resources need to be shared in other environments than Local Area Networ

 .. figure:: ../../book/lan/png/lan-fig-012-c.png
    :align: center
    :scale: 70
-
-   Time-division multiplexing
+
+   Time-division multiplexing

-
-TDM as shown above can be completely static, i.e. the same conversations always share the link, or dynamic. In the latter case, the two endpoints of the link must exchange messages specifying which conversation uses which byte inside each slot. Thanks to these signalling messages, it is possible to dynamically add and remove voice conversations from a given link.
+
+TDM as shown above can be completely static, i.e. the same conversations always share the link, or dynamic. In the latter case, the two endpoints of the link must exchange messages specifying which conversation uses which byte inside each slot. Thanks to these signalling messages, it is possible to dynamically add and remove voice conversations from a given link.

 TDM and FDM are widely used in telephone networks to support fixed bandwidth conversations. Using them in Local Area Networks that support computers would probably be inefficient. Computers usually do not send information at a fixed rate. Instead, they often have an on-off behaviour. During the on period, the computer tries to send at the highest possible rate, e.g. to transfer a file. During the off period, which is often much longer than the on period, the computer does not transmit any packet. Using a static allocation scheme for computers attached to a LAN would lead to huge inefficiencies, as they would only be able to transmit at :math:`\frac{1}{N}` of the total bandwidth during their on period, despite the fact that the other computers are in their off period and thus do not need to transmit any information. The dynamic MAC algorithms discussed in the remainder of this chapter aim to solve this problem.
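The inefficiency of a static allocation for bursty hosts is easy to quantify. The values below are purely illustrative.

.. code-block:: python

    # Static allocation : each of the N hosts always gets b/N of the capacity.
    b = 100 * 10**6      # a 100 Mbps LAN
    N = 10               # ten attached hosts

    static_rate = b / N
    print(static_rate)   # 10 Mbps during a host's on period, even when
                         # the nine other hosts are silent

    # Transferring a 100 MB file thus takes ten times longer than if the
    # single active host could use the whole capacity.
    file_size_bits = 100 * 8 * 10**6
    print(file_size_bits / static_rate, file_size_bits / b)   # 80.0 vs 8.0 s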
@@ -323,13 +323,13 @@
 ALOHA
 -----

 .. index:: packet radio

-In the 1960s, computers were mainly mainframes with a few dozen terminals attached to them. These terminals were usually in the same building as the mainframe and were directly connected to it. In some cases, the terminals were installed in remote locations and connected through a :term:`modem` attached to a :term:`dial-up line`. The university of Hawaii chose a different organisation. Instead of using telephone lines to connect the distant terminals, they developed the first `packet radio` technology [Abramson1970]_. Until then, computer networks were built on top of either the telephone network or physical cables. ALOHANet showed that it was possible to use radio signals to interconnect computers.
+In the 1960s, computers were mainly mainframes with a few dozen terminals attached to them. These terminals were usually in the same building as the mainframe and were directly connected to it. In some cases, the terminals were installed in remote locations and connected through a :term:`modem` attached to a :term:`dial-up line`. The University of Hawaii chose a different organization. Instead of using telephone lines to connect the distant terminals, they developed the first `packet radio` technology [Abramson1970]_. Until then, computer networks were built on top of either the telephone network or physical cables. ALOHANet showed that it was possible to use radio signals to interconnect computers.

 .. index:: ALOHA

-The first version of ALOHANet, described in [Abramson1970]_, operated as follows: First, the terminals and the mainframe exchanged fixed-length frames composed of 704 bits. Each frame contained 80 8-bit characters, some control bits and parity information to detect transmission errors. Two channels in the 400 MHz range were reserved for the operation of ALOHANet. The first channel was used by the mainframe to send frames to all terminals. The second channel was shared among all terminals to send frames to the mainframe. As all terminals share the same transmission channel, there is a risk of collision. To deal with this problem as well as transmission errors, the mainframe verified the parity bits of the received frame and sent an acknowledgement on its channel for each correctly received frame. The terminals on the other hand had to retransmit the unacknowledged frames. As for TCP, retransmitting these frames immediately upon expiration of a fixed timeout is not a good approach as several terminals may retransmit their frames at the same time leading to a network collapse. A better approach, but still far from perfect, is for each terminal to wait a random amount of time after the expiration of its retransmission timeout. This avoids synchronisation among multiple retransmitting terminals.
+The first version of ALOHANet, described in [Abramson1970]_, operated as follows: First, the terminals and the mainframe exchanged fixed-length frames composed of 704 bits. Each frame contained 80 8-bit characters, some control bits and parity information to detect transmission errors. Two channels in the 400 MHz range were reserved for the operation of ALOHANet. The first channel was used by the mainframe to send frames to all terminals. The second channel was shared among all terminals to send frames to the mainframe. As all terminals share the same transmission channel, there is a risk of collision. To deal with this problem as well as transmission errors, the mainframe verified the parity bits of the received frame and sent an acknowledgement on its channel for each correctly received frame. The terminals, on the other hand, had to retransmit the unacknowledged frames. As for TCP, retransmitting these frames immediately upon expiration of a fixed timeout is not a good approach, as several terminals may retransmit their frames at the same time, leading to a network collapse. A better approach, but still far from perfect, is for each terminal to wait a random amount of time after the expiration of its retransmission timeout. This avoids synchronisation among multiple retransmitting terminals.

-The pseudo-code below shows the operation of an ALOHANet terminal. We use this python syntax for all Medium Access Control algorithms described in this chapter. The algorithm is applied to each new frame that needs to be transmitted. It attempts to transmit a frame at most `max` times (`while loop`). Each transmission attempt is performed as follows: First, the frame is sent. Each frame is protected by a timeout. Then, the terminal waits for either a valid acknowledgement frame or the expiration of its timeout. If the terminal receives an acknowledgement, the frame has been delivered correctly and the algorithm terminates. Otherwise, the terminal waits for a random time and attempts to retransmit the frame.
+The pseudo-code below shows the operation of an ALOHANet terminal. We use this python syntax for all Medium Access Control algorithms described in this chapter. The algorithm is applied to each new frame that needs to be transmitted. It attempts to transmit a frame at most `max` times (`while loop`). Each transmission attempt is performed as follows: First, the frame is sent. Each frame is protected by a timeout. Then, the terminal waits for either a valid acknowledgement frame or the expiration of its timeout. If the terminal receives an acknowledgement, the frame has been delivered correctly and the algorithm terminates. Otherwise, the terminal waits for a random time and attempts to retransmit the frame.

 .. code-block:: python

       if (ack_on_return_channel):
          break # transmission was successful
       else:
-         # timeout
+         # timeout
          wait(random_time)
          N=N+1
-   else:
+   else:
      # Too many transmission attempts
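The hunk above only shows the end of the book's pseudo-code. For readers who want to experiment, the same retransmission logic can be written as a self-contained function ; `send_frame`, `wait_ack` and the timing values are assumptions of this sketch, not part of ALOHANet.

.. code-block:: python

    import random
    import time

    def aloha_send(frame, send_frame, wait_ack, max_attempts=5, timeout=1.0):
        """Illustrative ALOHA sender : transmit, wait, back off randomly."""
        for attempt in range(max_attempts):
            send_frame(frame)
            if wait_ack(timeout):   # acknowledgement on the return channel
                return True         # transmission was successful
            # timeout : wait a random time to avoid synchronised
            # retransmissions by several terminals
            time.sleep(random.uniform(0, timeout))
        return False                # too many transmission attempts

    # toy channel that loses half of the frames
    ok = aloha_send('frame-1',
                    send_frame=lambda frame: None,
                    wait_ack=lambda timeout: random.random() < 0.5)
    print(ok)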
@@ -353,12 +353,12 @@ The pseudo-code below shows the operation of an ALOHANet terminal. We use this p

 .. note:: Amateur packet radio

-   Packet radio technologies have evolved in various directions since the first experiments performed at the University of Hawaii. The Amateur packet radio service developed by amateur radio operators is one of the descendants ALOHANet. Many amateur radio operators are very interested in new technologies and they often spend countless hours developing new antennas or transceivers. When the first personal computers appeared, several amateur radio operators designed radio modems and their own datalink layer protocols [KPD1985]_ [BNT1997]_. This network grew and it was possible to connect to servers in several European countries by only using packet radio relays. Some amateur radio operators also developed TCP/IP protocol stacks that were used over the packet radio service. Some parts of the `amateur packet radio network `_ are connected to the global Internet and use the `44.0.0.0/8` prefix.
+   Packet radio technologies have evolved in various directions since the first experiments performed at the University of Hawaii. The Amateur packet radio service developed by amateur radio operators is one of the descendants of ALOHANet. Many amateur radio operators are very interested in new technologies and they often spend countless hours developing new antennas or transceivers. When the first personal computers appeared, several amateur radio operators designed radio modems and their own datalink layer protocols [KPD1985]_ [BNT1997]_. This network grew and it was possible to connect to servers in several European countries using only packet radio relays. Some amateur radio operators also developed TCP/IP protocol stacks that were used over the packet radio service. Some parts of the `amateur packet radio network `_ are connected to the global Internet and use the `44.0.0.0/8` prefix.

 .. index:: slotted ALOHA

-Many improvements to ALOHANet have been proposed since the publication of [Abramson1970]_, and this technique, or some of its variants, are still found in wireless networks today. The slotted technique proposed in [Roberts1975]_ is important because it shows that a simple modification can significantly improve channel utilization. Instead of allowing all terminals to transmit at any time, [Roberts1975]_ proposed to divide time into slots and allow terminals to transmit only at the beginning of each slot. Each slot corresponds to the time required to transmit one fixed size frame. In practice, these slots can be imposed by a single clock that is received by all terminals. In ALOHANet, it could have been located on the central mainframe. The analysis in [Roberts1975]_ reveals that this simple modification improves the channel utilization by a factor of two.
+Many improvements to ALOHANet have been proposed since the publication of [Abramson1970]_, and this technique, or some of its variants, are still found in wireless networks today. The slotted technique proposed in [Roberts1975]_ is important because it shows that a simple modification can significantly improve channel utilization. Instead of allowing all terminals to transmit at any time, [Roberts1975]_ proposed to divide time into slots and allow terminals to transmit only at the beginning of each slot. Each slot corresponds to the time required to transmit one fixed-size frame. In practice, these slots can be imposed by a single clock that is received by all terminals. In ALOHANet, it could have been located on the central mainframe. The analysis in [Roberts1975]_ reveals that this simple modification improves the channel utilization by a factor of two.
+

 .. index:: CSMA, Carrier Sense Multiple Access


@@ -366,7 +366,7 @@
 Carrier Sense Multiple Access
 -----------------------------

-ALOHA and slotted ALOHA can easily be implemented, but unfortunately, they can only be used in networks that are very lightly loaded. Designing a network for a very low utilisation is possible, but it clearly increases the cost of the network. To overcome the problems of ALOHA, many Medium Access Control mechanisms have been proposed which improve channel utilization. Carrier Sense Multiple Access (CSMA) is a significant improvement compared to ALOHA. CSMA requires all nodes to listen to the transmission channel to verify that it is free before transmitting a frame [KT1975]_. When a node senses the channel to be busy, it defers its transmission until the channel becomes free again. The pseudo-code below provides a more detailed description of the operation of CSMA.
+ALOHA and slotted ALOHA can easily be implemented, but unfortunately, they can only be used in networks that are very lightly loaded. Designing a network for a very low utilisation is possible, but it clearly increases the cost of the network. To overcome the problems of ALOHA, many Medium Access Control mechanisms have been proposed which improve channel utilization. Carrier Sense Multiple Access (CSMA) is a significant improvement compared to ALOHA. CSMA requires all nodes to listen to the transmission channel to verify that it is free before transmitting a frame [KT1975]_. When a node senses the channel to be busy, it defers its transmission until the channel becomes free again. The pseudo-code below provides a more detailed description of the operation of CSMA.

 .. index:: persistent CSMA, CSMA (persistent)

@@ -381,13 +381,13 @@
       if ack :
          break # transmission was successful
       else :
-         # timeout
+         # timeout
          N=N+1
-   # end of while loop
+   # end of while loop
    # Too many transmission attempts

-The above pseudo-code is often called `persistent CSMA` [KT1975]_ as the terminal will continuously listen to the channel and transmit its frame as soon as the channel becomes free. Another important variant of CSMA is the `non-persistent CSMA` [KT1975]_. The main difference between persistent and non-persistent CSMA described in the pseudo-code below is that a non-persistent CSMA node does not continuously listen to the channel to determine when it becomes free. When a non-persistent CSMA terminal senses the transmission channel to be busy, it waits for a random time before sensing the channel again. This improves channel utilization compared to persistent CSMA. With persistent CSMA, when two terminals sense the channel to be busy, they will both transmit (and thus cause a collision) as soon as the channel becomes free. With non-persistent CSMA, this synchronisation does not occur, as the terminals wait a random time after having sensed the transmission channel. However, the higher channel utilization achieved by non-persistent CSMA comes at the expense of a slightly higher waiting time in the terminals when the network is lightly loaded.
+The above pseudo-code is often called `persistent CSMA` [KT1975]_ as the terminal continuously listens to the channel and transmits its frame as soon as the channel becomes free. Another important variant of CSMA is `non-persistent CSMA` [KT1975]_. The main difference between persistent and non-persistent CSMA, described in the pseudo-code below, is that a non-persistent CSMA node does not continuously listen to the channel to determine when it becomes free. When a non-persistent CSMA terminal senses the transmission channel to be busy, it waits for a random time before sensing the channel again. This improves channel utilization compared to persistent CSMA. With persistent CSMA, when two terminals sense the channel to be busy, they will both transmit (and thus cause a collision) as soon as the channel becomes free. With non-persistent CSMA, this synchronisation does not occur, as the terminals wait a random time after having sensed the transmission channel. However, the higher channel utilization achieved by non-persistent CSMA comes at the expense of a slightly higher waiting time in the terminals when the network is lightly loaded.

 .. index:: non-persistent CSMA, CSMA (non-persistent)

@@ -399,20 +399,20 @@
     while N<= max :
       listen(channel)
       if free(channel):
-         send(frame)
+         send(frame)
         wait(ack or timeout)
         if received(ack) :
            break # transmission was successful
         else :
-            # timeout
+            # timeout
           N=N+1
       else:
         wait(random_time)
-   # end of while loop
+   # end of while loop
    # Too many transmission attempts

-[KT1975]_ analyzes in detail the performance of several CSMA variants. Under some assumptions about the transmission channel and the traffic, the analysis compares ALOHA, slotted ALOHA, persistent and non-persistent CSMA. Under these assumptions, ALOHA achieves a channel utilization of only 18.4% of the channel capacity. Slotted ALOHA is able to use 36.6% of this capacity. Persistent CSMA improves the utilization by reaching 52.9% of the capacity while non-persistent CSMA achieves 81.5% of the channel capacity.
+[KT1975]_ analyzes in detail the performance of several CSMA variants. Under some assumptions about the transmission channel and the traffic, the analysis compares ALOHA, slotted ALOHA, persistent and non-persistent CSMA. Under these assumptions, ALOHA achieves a channel utilization of only 18.4% of the channel capacity. Slotted ALOHA is able to use 36.6% of this capacity. Persistent CSMA improves the utilization by reaching 52.9% of the capacity, while non-persistent CSMA achieves 81.5% of the channel capacity.
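The classical ALOHA figures quoted above can also be recovered analytically. Under the usual Poisson traffic model, pure ALOHA has a throughput of :math:`G e^{-2G}` and slotted ALOHA of :math:`G e^{-G}` for an offered load `G` ; the sketch below evaluates their maxima, which are close to the values quoted from [KT1975]_.

.. code-block:: python

    import math

    # Throughput of pure and slotted ALOHA under the Poisson traffic model.
    pure = lambda G: G * math.exp(-2 * G)      # maximum reached at G = 0.5
    slotted = lambda G: G * math.exp(-G)       # maximum reached at G = 1.0

    print(round(pure(0.5), 3))      # 0.184 : the 18.4% quoted above
    print(round(slotted(1.0), 3))   # 0.368 : close to the 36.6% quoted above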
 .. index:: Carrier Sense Multiple Access with Collision Detection, CSMA/CD

@@ -427,8 +427,8 @@ CSMA improves channel utilization compared to ALOHA. However, the performance ca

 .. figure:: ../../book/lan/png/lan-fig-024-c.png
    :align: center
    :scale: 70
-
-   Frame transmission on a shared bus
+
+   Frame transmission on a shared bus

 Now that we have looked at how a frame is actually transmitted as an electrical signal on a shared bus, it is interesting to look in more detail at what happens when two hosts transmit a frame at almost the same time. This is illustrated in the figure below, where hosts A and B start their transmission at the same time (point `(1)`). At this time, if host C senses the channel, it will consider it to be free. This will not last a long time and at point `(2)` the electrical signals from both host A and host B reach host C. The combined electrical signal (shown graphically as the superposition of the two curves in the figure) cannot be decoded by host C. Host C detects a collision, as it receives a signal that it cannot decode. Since host C cannot decode the frames, it cannot determine which hosts are sending the colliding frames. Note that host A (and host B) will detect the collision after host C (point `(3)` in the figure below).

 .. figure:: ../../book/lan/png/lan-fig-025-c.png
    :align: center
    :scale: 70
-
-   Frame collision on a shared bus
+
+   Frame collision on a shared bus



@@ -446,13 +446,13 @@

 As shown above, hosts detect collisions when they receive an electrical signal that they cannot decode. In a wired network, a host is able to detect such a collision both while it is listening (e.g. like host C in the figure above) and also while it is sending its own frame. When a host transmits a frame, it can compare the electrical signal that it transmits with the electrical signal that it senses on the wire. At points `(1)` and `(2)` in the figure above, host A senses only its own signal. At point `(3)`, it senses an electrical signal that differs from its own signal and can thus detect the collision. At this point, its frame is corrupted and it can stop its transmission. The ability to detect collisions while transmitting is the starting point for the `Carrier Sense Multiple Access with Collision Detection (CSMA/CD)` Medium Access Control algorithm, which is used in Ethernet networks [Metcalfe1976]_ [IEEE802.3]_ . When an Ethernet host detects a collision while it is transmitting, it immediately stops its transmission. Compared with pure CSMA, CSMA/CD is an important improvement since when collisions occur, they only last until colliding hosts have detected it and stopped their transmission. In practice, when a host detects a collision, it sends a special jamming signal on the cable to ensure that all hosts have detected the collision.

-To better understand these collisions, it is useful to analyse what would be the worst collision on a shared bus network. Let us consider a wire with two hosts attached at both ends, as shown in the figure below. Host A starts to transmit its frame and its electrical signal is propagated on the cable. Its propagation time depends on the physical length of the cable and the speed of the electrical signal. Let us use :math:`\tau` to represent this propagation delay in seconds. Slightly less than :math:`\tau` seconds after the beginning of the transmission of A's frame, B decides to start transmitting its own frame. After :math:`\epsilon` seconds, B senses A's frame, detects the collision and stops transmitting. The beginning of B's frame travels on the cable until it reaches host A. Host A can thus detect the collision at time :math:`\tau-\epsilon+\tau \approx 2\times\tau`. An important point to note is that a collision can only occur during the first :math:`2\times\tau` seconds of its transmission. If a collision did not occur during this period, it cannot occur afterwards since the transmission channel is busy after :math:`\tau` seconds and CSMA/CD hosts sense the transmission channel before transmitting their frame.
+To better understand these collisions, it is useful to analyse what would be the worst collision on a shared bus network. Let us consider a wire with two hosts attached at both ends, as shown in the figure below. Host A starts to transmit its frame and its electrical signal is propagated on the cable. Its propagation time depends on the physical length of the cable and the speed of the electrical signal. Let us use :math:`\tau` to represent this propagation delay in seconds. Slightly less than :math:`\tau` seconds after the beginning of the transmission of A's frame, B decides to start transmitting its own frame. After :math:`\epsilon` seconds, B senses A's frame, detects the collision and stops transmitting. The beginning of B's frame travels on the cable until it reaches host A. Host A can thus detect the collision at time :math:`\tau-\epsilon+\tau \approx 2\times\tau`. An important point to note is that a collision can only occur during the first :math:`2\times\tau` seconds of its transmission. If a collision did not occur during this period, it cannot occur afterwards, since the transmission channel is busy after :math:`\tau` seconds and CSMA/CD hosts sense the transmission channel before transmitting their frame.

 .. figure:: ../../book/lan/png/lan-fig-027-c.png
    :align: center
    :scale: 70
-
+
    The worst collision on a shared bus

@@ -465,14 +465,14 @@ Removing acknowledgements is an interesting optimisation as it reduces the numbe

 .. figure:: ../../book/lan/png/lan-fig-026-c.png
    :align: center
    :scale: 70
-
+
    The short-frame collision problem

 .. index:: slot time (Ethernet)

-To solve this problem, networks using CSMA/CD require hosts to transmit for at least :math:`2\times\tau` seconds. Since the network transmission speed is fixed for a given network technology, this implies that a technology that uses CSMA/CD enforces a minimum frame size. In the most popular CSMA/CD technology, Ethernet, :math:`2\times\tau` is called the `slot time` [#fslottime]_.
+To solve this problem, networks using CSMA/CD require hosts to transmit for at least :math:`2\times\tau` seconds. Since the network transmission speed is fixed for a given network technology, this implies that a technology that uses CSMA/CD enforces a minimum frame size. In the most popular CSMA/CD technology, Ethernet, :math:`2\times\tau` is called the `slot time` [#fslottime]_.
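The relation between the slot time and the minimum frame size is a simple product. The sketch below uses the classical 10 Mbps Ethernet values as an illustration.

.. code-block:: python

    # A CSMA/CD host must still be transmitting when the collision signal
    # comes back, i.e. during at least 2*tau seconds (the slot time).
    bit_rate = 10 * 10**6   # 10 Mbps Ethernet
    slot_time = 51.2e-6     # slot time in seconds, i.e. 2*tau for the
                            # maximal cable length

    min_frame_bits = bit_rate * slot_time
    print(min_frame_bits)   # 512.0 bits : the 64-byte minimum Ethernet frame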
@@ -487,34 +487,34 @@ To understand `binary exponential back-off`, let us consider a collision caused

 3. The second host retransmits immediately and the first defers its retransmission
 4. Both hosts defer their retransmission and a new collision occurs

-In the second and third cases, both hosts have flipped different coins. The delay chosen by the host that defers its retransmission should be long enough to ensure that its retransmission will not collide with the immediate retransmission of the other host. However the delay should not be longer than the time necessary to avoid the collision, because if both hosts decide to defer their transmission, the network will be idle during this delay. The `slot time` is the optimal delay since it is the shortest delay that ensures that the first host will be able to retransmit its frame completely without any collision.
+In the second and third cases, both hosts have flipped different coins. The delay chosen by the host that defers its retransmission should be long enough to ensure that its retransmission will not collide with the immediate retransmission of the other host. However, the delay should not be longer than the time necessary to avoid the collision, because if both hosts decide to defer their transmission, the network will be idle during this delay. The `slot time` is the optimal delay since it is the shortest delay that ensures that the first host will be able to retransmit its frame completely without any collision.

-If two hosts are competing, the algorithm above will avoid a second collision 50% of the time. However, if the network is heavily loaded, several hosts may be competing at the same time. In this case, the hosts should be able to automatically adapt their retransmission delay. The `binary exponential back-off` performs this adaptation based on the number of collisions that have affected a frame. After the first collision, the host flips a coin and waits 0 or 1 `slot time`. After the second collision, it generates a random number and waits 0, 1, 2 or 3 `slot times`, etc. The duration of the waiting time is doubled after each collision. The complete pseudo-code for the CSMA/CD algorithm is shown in the figure below.
+If two hosts are competing, the algorithm above will avoid a second collision 50% of the time. However, if the network is heavily loaded, several hosts may be competing at the same time. In this case, the hosts should be able to automatically adapt their retransmission delay. The `binary exponential back-off` performs this adaptation based on the number of collisions that have affected a frame. After the first collision, the host flips a coin and waits 0 or 1 `slot time`. After the second collision, it generates a random number and waits 0, 1, 2 or 3 `slot times`, etc. The duration of the waiting time is doubled after each collision. The complete pseudo-code for the CSMA/CD algorithm is shown in the figure below.

-.. code-block:: python
+.. code-block:: python

    # CSMA/CD pseudo-code
    N=1
    while N<= max :
       wait(channel_becomes_free)
-      send(frame)
-      wait_until (end_of_frame) or (collision)
+      send(frame)
+      wait_until (end_of_frame) or (collision)
       if collision detected:
          stop transmitting
          send(jamming)
          k = min (10, N)
-         r = random(0, 2**k - 1)
+         r = random(0, 2**k - 1)
          wait(r*slotTime)
          N=N+1
-      else :
+      else :
         wait(inter-frame_delay)
         break
-   # end of while loop
+   # end of while loop
    # Too many transmission attempts

-
-The inter-frame delay used in this pseudo-code is a short delay corresponding to the time required by a network adapter to switch from transmit to receive mode. It is also used to prevent a host from sending a continuous stream of frames without leaving any transmission opportunities for other hosts on the network. This contributes to the fairness of CSMA/CD. Despite this delay, there are still conditions where CSMA/CD is not completely fair [RY1994]_. Consider for example a network with two hosts : a server sending long frames and a client sending acknowledgments. Measurements reported in [RY1994]_ have shown that there are situations where the client could suffer from repeated collisions that lead it to wait for long periods of time due to the exponential back-off algorithm.
+
+The inter-frame delay used in this pseudo-code is a short delay corresponding to the time required by a network adapter to switch from transmit to receive mode. It is also used to prevent a host from sending a continuous stream of frames without leaving any transmission opportunities for other hosts on the network. This contributes to the fairness of CSMA/CD. Despite this delay, there are still conditions where CSMA/CD is not completely fair [RY1994]_. Consider for example a network with two hosts : a server sending long frames and a client sending acknowledgments. Measurements reported in [RY1994]_ have shown that there are situations where the client could suffer from repeated collisions that lead it to wait for long periods of time due to the exponential back-off algorithm.
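The claim made above, that two competing hosts avoid a second collision half of the time after their first coin flip, can be checked with a short simulation. The sketch below is purely illustrative.

.. code-block:: python

    import random

    # After a first collision, each of the two hosts waits 0 or 1 slot time.
    # A second collision occurs only when both hosts pick the same value.
    def second_collision():
        return random.randint(0, 1) == random.randint(0, 1)

    trials = 100000
    collisions = sum(second_collision() for _ in range(trials))
    print(collisions / trials)   # close to 0.5, as stated above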
For example, two devices that are on different sides of a hill may not be able to receive each other's signal while they are both able to receive the signal sent by a station at the top of the hill. Furthermore, the radio propagation conditions may change with time. For example, a truck may temporarily block the communication between two nearby devices. .. figure:: ../../book/lan/svg/datalink-fig-009-c.png :align: center :scale: 70 - - The hidden station problem + + The hidden station problem @@ -614,16 +614,16 @@ To avoid collisions in these situations, CSMA/CA allows devices to reserve the t .. figure:: ../../book/lan/svg/datalink-fig-010-c.png :align: center :scale: 70 - + Reservations with CSMA/CA The utilization of the reservations with CSMA/CA is an optimisation that is useful when collisions are frequent. If there are few collisions, the time required to transmit the RTS and CTS frames can become significant and in particular when short frames are exchanged. Some devices only turn on RTS/CTS after transmission errors. - + Deterministic Medium Access Control algorithms ---------------------------------------------- -During the 1970s and 1980s, there were huge debates in the networking community about the best suited Medium Access Control algorithms for Local Area Networks. The optimistic algorithms that we have described until now were relatively easy to implement when they were designed. From a performance perspective, mathematical models and simulations showed the ability of these optimistic techniques to sustain load. However, none of the optimistic techniques are able to guarantee that a frame will be delivered within a given delay bound and some applications require predictable transmission delays. The deterministic MAC algorithms were considered by a fraction of the networking community as the best solution to fulfill the needs of Local Area Networks. +During the 1970s and 1980s, there were huge debates in the networking community about the best suited Medium Access Control algorithms for Local Area Networks. The optimistic algorithms that we have described until now were relatively easy to implement when they were designed. From a performance perspective, mathematical models and simulations showed the ability of these optimistic techniques to sustain load. However, none of the optimistic techniques are able to guarantee that a frame will be delivered within a given delay bound and some applications require predictable transmission delays. The deterministic MAC algorithms were considered by a fraction of the networking community as the best solution to fulfill the needs of Local Area Networks. Both the proponents of the deterministic and the opportunistic techniques lobbied to develop standards for Local Area networks that would incorporate their solution. Instead of trying to find an impossible compromise between these diverging views, the IEEE 802 committee that was chartered to develop Local Area Network standards chose to work in parallel on three different LAN technologies and created three working groups. The `IEEE 802.3 working group `_ became responsible for CSMA/CD. The proponents of deterministic MAC algorithms agreed on the basic principle of exchanging special frames called tokens between devices to regulate the access to the transmission medium. However, they did not agree on the most suitable physical layout for the network. IBM argued in favor of Ring-shaped networks while the manufacturing industry, led by General Motors, argued in favor of a bus-shaped network. 
This led to the creation of the `IEEE 802.4 working group` to standardise Token Bus networks and the `IEEE 802.5 working group `_ to standardise Token Ring networks. Although these techniques are not widely used anymore today, the principles behind a token-based protocol are still important. @@ -637,7 +637,7 @@ A Token Ring network is composed of a set of stations that are attached to a uni .. figure:: ../../book/lan/svg/datalink-fig-011-c.png :align: center :scale: 70 - + A Token Ring network @@ -679,9 +679,9 @@ Now that we have explained how the token can be forwarded on the ring, let us an 802.5 data frame format -To capture a token, a station must operate in `Listen` mode. In this mode, the station receives bits from its upstream neighbour. If the bits correspond to a data frame, they must be forwarded to the downstream neighbour. If they correspond to a token, the station can capture it and transmit its data frame. Both the data frame and the token are encoded as a bit string beginning with the `Starting Delimiter` followed by the `Access Control` field. When the station receives the first bit of a `Starting Delimiter`, it cannot know whether this is a data frame or a token and must forward the entire delimiter to its downstream neighbour. It is only when it receives the fourth bit of the `Access Control` field (i.e. the `Token` bit) that the station knows whether the frame is a data frame or a token. If the `Token` bit is reset, it indicates a data frame and the remaining bits of the data frame must be forwarded to the downstream station. Otherwise (`Token` bit is set), this is a token and the station can capture it by resetting the bit that is currently in its buffer. Thanks to this modification, the beginning of the token is now the beginning of a data frame and the station can switch to `Transmit` mode and send its data frame starting at the fifth bit of the `Access Control` field. Thus, the one-bit delay introduced by each Token Ring station plays a key role in enabling the stations to efficiently capture the token. +To capture a token, a station must operate in `Listen` mode. In this mode, the station receives bits from its upstream neighbour. If the bits correspond to a data frame, they must be forwarded to the downstream neighbour. If they correspond to a token, the station can capture it and transmit its data frame. Both the data frame and the token are encoded as a bit string beginning with the `Starting Delimiter` followed by the `Access Control` field. When the station receives the first bit of a `Starting Delimiter`, it cannot know whether this is a data frame or a token and must forward the entire delimiter to its downstream neighbour. It is only when it receives the fourth bit of the `Access Control` field (i.e. the `Token` bit) that the station knows whether the frame is a data frame or a token. If the `Token` bit is reset, it indicates a data frame and the remaining bits of the data frame must be forwarded to the downstream station. Otherwise (`Token` bit is set), this is a token and the station can capture it by resetting the bit that is currently in its buffer. Thanks to this modification, the beginning of the token is now the beginning of a data frame and the station can switch to `Transmit` mode and send its data frame starting at the fifth bit of the `Access Control` field. Thus, the one-bit delay introduced by each Token Ring station plays a key role in enabling the stations to efficiently capture the token. 
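The `Listen` mode logic described above can be sketched as follows. This is only an illustration of the one-bit delay : the bit-level interface (`receive_bit`, `forward_bit`) is hypothetical, and a real IEEE 802.5 station also handles priorities, the monitor bit and frame boundaries, which are all ignored here.

.. code-block:: python

    # Sketch of token capture in Listen mode. Bits 0-7 carry the Starting
    # Delimiter and bits 8-11 the first four bits of the Access Control
    # field, whose fourth bit (position 11) is the Token bit.
    def listen_and_capture(receive_bit, forward_bit, queued_frame=None):
        position = 0                    # bit position inside the current frame
        while True:
            bit = receive_bit()         # bit arriving from the upstream neighbour
            if position == 11 and bit == 1 and queued_frame is not None:
                forward_bit(0)          # reset the Token bit while it is buffered :
                                        # the token becomes the start of a data frame
                for b in queued_frame:  # switch to Transmit mode and send our frame
                    forward_bit(b)      # starting at the fifth bit of Access Control
                return
            forward_bit(bit)            # the one-bit delay : forward the stored bit
            position += 1

The station decides what to do with each bit while it sits in its one-bit buffer; this is why it can turn a passing token into the beginning of its own data frame.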
-After having transmitted its data frame, the station must remain in `Transmit` mode until it has received the last bit of its own data frame. This ensures that the bits sent by a station do not remain in the network forever. A data frame sent by a station in a Token Ring network passes in front of all stations attached to the network. Each station can detect the data frame and analyse the destination address to possibly capture the frame. +After having transmitted its data frame, the station must remain in `Transmit` mode until it has received the last bit of its own data frame. This ensures that the bits sent by a station do not remain in the network forever. A data frame sent by a station in a Token Ring network passes in front of all stations attached to the network. Each station can detect the data frame and analyse the destination address to possibly capture the frame. .. The `Frame Status` field that appears after the `Ending Delimiter` is used to provide acknowledgements without requiring special frames. The `Frame Status` contains two flags : `A` and `C`. Both flags are reset when a station sends a data frame. These flags can be modified by their recipients. When a station senses its address as the destination address of a frame, it can capture the frame, check its CRC and place it in its own buffers. The destination of a frame must set the `A` bit (resp. `C` bit) of the `Frame Status` field once it has seen (resp. copied) a data frame. By inspecting the `Frame Status` of the returning frame, the sender can verify whether its frame has been received correctly by its destination. @@ -703,18 +703,18 @@ Most networks contain links having different bandwidth. Some hosts can use low b To understand this problem better, let us consider the scenario shown in the figure below, where a server (`A`) attached to a `10 Mbps` link needs to reliably transfer segments to another computer (`C`) through a path that contains a `2 Mbps` link. -.. figure:: ../../book/transport/svg/tcp-2mbps.png +.. figure:: ../../book/transport/svg/tcp-2mbps.png :align: center - :scale: 70 + :scale: 70 - Reliable transport with heterogeneous links + Reliable transport with heterogeneous links In this network, the segments sent by the server reach router `R1`. `R1` forwards the segments towards router `R2`. Router `R1` can potentially receive segments at `10 Mbps`, but it can only forward them at `2 Mbps` to router `R2` and then to host `C`. Router `R1` includes buffers that allow it to store the packets that cannot immediately be forwarded to their destination. To understand the operation of a reliable transport protocol in this environment, let us consider a simplified model of this network where host `A` is attached to a `10 Mbps` link to a queue that represents the buffers of router `R1`. This queue is emptied at a rate of `2 Mbps`. .. figure:: ../../book/transport/svg/tcp-self-clocking.png :align: center - :scale: 70 + :scale: 70 Self clocking @@ -725,20 +725,20 @@ However, transport protocols are not only used in this environment. In the globa .. index:: congestion collapse -.. figure:: ../../book/transport/png/transport-fig-083-c.png +.. figure:: ../../book/transport/png/transport-fig-083-c.png :align: center - :scale: 70 + :scale: 70 The congestion collapse problem -If many senders are attached to the left part of the network above, they all send a window full of segments. These segments are stored in the buffers of the router before being transmitted towards their destination. 
If there are many senders on the left part of the network, the occupancy of the buffers quickly grows. A consequence of the buffer occupancy is that the round-trip-time, measured by the transport protocol, between the sender and the receiver increases. Consider a network where 10,000 bits segments are sent. When the buffer is empty, such a segment requires 1 millisecond to be transmitted on the `10 Mbps` link and 5 milliseconds to be the transmitted on the `2 Mbps` link. Thus, the measured round-trip-time measured is roughly 6 milliseconds if we ignore the propagation delay on the links. If the buffer contains 100 segments, the round-trip-time becomes :math:`1+100 \times 5+ 5` milliseconds as new segments are only transmitted on the `2 Mbps` link once all previous segments have been transmitted. Unfortunately, if the reliable transport protocol uses a retransmission timer and performs `go-back-n` to recover from transmission errors it will retransmit a full window of segments. This increases the occupancy of the buffer and the delay through the buffer... Furthermore, the buffer may store and send on the low bandwidth links several retransmissions of the same segment. This problem is called `congestion collapse`. It occurred several times during the late 1980s on the Internet [Jacobson1988]_.
+If many senders are attached to the left part of the network above, they all send a window full of segments. These segments are stored in the buffers of the router before being transmitted towards their destination. If there are many senders on the left part of the network, the occupancy of the buffers quickly grows. A consequence of the buffer occupancy is that the round-trip-time, measured by the transport protocol, between the sender and the receiver increases. Consider a network where 10,000-bit segments are sent. When the buffer is empty, such a segment requires 1 millisecond to be transmitted on the `10 Mbps` link and 5 milliseconds to be transmitted on the `2 Mbps` link. Thus, the measured round-trip-time is roughly 6 milliseconds if we ignore the propagation delay on the links. If the buffer contains 100 segments, the round-trip-time becomes :math:`1+100 \times 5+ 5` milliseconds as new segments are only transmitted on the `2 Mbps` link once all previous segments have been transmitted. Unfortunately, if the reliable transport protocol uses a retransmission timer and performs `go-back-n` to recover from transmission errors, it will retransmit a full window of segments. This increases the occupancy of the buffer and the delay through the buffer... Furthermore, the buffer may store and send on the low bandwidth links several retransmissions of the same segment. This problem is called `congestion collapse`. It occurred several times during the late 1980s on the Internet [Jacobson1988]_.

The `congestion collapse` is a problem that all heterogeneous networks face. Different mechanisms have been proposed in the scientific literature to avoid or control network congestion. Some of them have been implemented and deployed in real networks. To understand this problem in more detail, let us first consider a simple network with two hosts attached to a high bandwidth link that are sending segments to destination `C` attached to a low bandwidth link as depicted below.

-..
figure:: ../../book/transport/svg/congestion-problem.png :align: center - :scale: 70 + :scale: 70 The congestion problem @@ -749,40 +749,40 @@ To avoid `congestion collapse`, the hosts must regulate their transmission rate Let us first consider the simple problem of a set of :math:`i` hosts that share a single bottleneck link as shown in the example above. In this network, the congestion control scheme must achieve the following objectives [CJ1989]_ : - #. The congestion control scheme must `avoid congestion`. In practice, this means that the bottleneck link cannot be overloaded. If :math:`r_i(t)` is the transmission rate allocated to host :math:`i` at time :math:`t` and :math:`R` the bandwidth of the bottleneck link, then the congestion control scheme should ensure that, on average, :math:`\forall{t} \sum{r_i(t)} \le R`. + #. The congestion control scheme must `avoid congestion`. In practice, this means that the bottleneck link cannot be overloaded. If :math:`r_i(t)` is the transmission rate allocated to host :math:`i` at time :math:`t` and :math:`R` the bandwidth of the bottleneck link, then the congestion control scheme should ensure that, on average, :math:`\forall{t} \sum{r_i(t)} \le R`. #. The congestion control scheme must be `efficient`. The bottleneck link is usually both a shared and an expensive resource. Usually, bottleneck links are wide area links that are much more expensive to upgrade than the local area networks. The congestion control scheme should ensure that such links are efficiently used. Mathematically, the control scheme should ensure that :math:`\forall{t} \sum{r_i(t)} \approx R`. #. The congestion control scheme should be `fair`. Most congestion schemes aim at achieving `max-min fairness`. An allocation of transmission rates to sources is said to be `max-min fair` if : - - no link in the network is congested - - the rate allocated to source :math:`j` cannot be increased without decreasing the rate allocated to a source :math:`i` whose allocation is smaller than the rate allocated to source :math:`j` [Leboudec2008]_ . + - no link in the network is congested + - the rate allocated to source :math:`j` cannot be increased without decreasing the rate allocated to a source :math:`i` whose allocation is smaller than the rate allocated to source :math:`j` [Leboudec2008]_ . Depending on the network, a `max-min fair allocation` may not always exist. In practice, `max-min fairness` is an ideal objective that cannot necessarily be achieved. When there is a single bottleneck link as in the example above, `max-min fairness` implies that each source should be allocated the same transmission rate. -To visualise the different rate allocations, it is useful to consider the graph shown below. In this graph, we plot on the `x-axis` (resp. `y-axis`) the rate allocated to host `B` (resp. `A`). A point in the graph :math:`(r_B,r_A)` corresponds to a possible allocation of the transmission rates. Since there is a `2 Mbps` bottleneck link in this network, the graph can be divided into two regions. The lower left part of the graph contains all allocations :math:`(r_B,r_A)` such that the bottleneck link is not congested (:math:`r_A+r_B<2`). The right border of this region is the `efficiency line`, i.e. the set of allocations that completely utilise the bottleneck link (:math:`r_A+r_B=2`). Finally, the `fairness line` is the set of fair allocations. +To visualise the different rate allocations, it is useful to consider the graph shown below. In this graph, we plot on the `x-axis` (resp. 
`y-axis`) the rate allocated to host `B` (resp. `A`). A point in the graph :math:`(r_B,r_A)` corresponds to a possible allocation of the transmission rates. Since there is a `2 Mbps` bottleneck link in this network, the graph can be divided into two regions. The lower left part of the graph contains all allocations :math:`(r_B,r_A)` such that the bottleneck link is not congested (:math:`r_A+r_B<2`). The right border of this region is the `efficiency line`, i.e. the set of allocations that completely utilise the bottleneck link (:math:`r_A+r_B=2`). Finally, the `fairness line` is the set of fair allocations. -.. figure:: ../../book/transport/png/transport-fig-092-c.png +.. figure:: ../../book/transport/png/transport-fig-092-c.png :align: center - :scale: 70 + :scale: 70 Possible allocated transmission rates As shown in the graph above, a rate allocation may be fair but not efficient (e.g. :math:`r_A=0.7,r_B=0.7`), fair and efficient ( e.g. :math:`r_A=1,r_B=1`) or efficient but not fair (e.g. :math:`r_A=1.5,r_B=0.5`). Ideally, the allocation should be both fair and efficient. Unfortunately, maintaining such an allocation with fluctuations in the number of flows that use the network is a challenging problem. Furthermore, there might be several thousands flows that pass through the same link [#fflowslink]_. -To deal with these fluctuations in demand, which result in fluctuations in the available bandwidth, computer networks use a congestion control scheme. This congestion control scheme should achieve the three objectives listed above. Some congestion control schemes rely on a close cooperation between the endhosts and the routers, while others are mainly implemented on the endhosts with limited support from the routers. +To deal with these fluctuations in demand, which result in fluctuations in the available bandwidth, computer networks use a congestion control scheme. This congestion control scheme should achieve the three objectives listed above. Some congestion control schemes rely on a close cooperation between the endhosts and the routers, while others are mainly implemented on the endhosts with limited support from the routers. -A congestion control scheme can be modelled as an algorithm that adapts the transmission rate (:math:`r_i(t)`) of host :math:`i` based on the feedback received from the network. Different types of feedbacks are possible. The simplest scheme is a binary feedback [CJ1989]_ [Jacobson1988]_ where the hosts simply learn whether the network is congested or not. Some congestion control schemes allow the network to regularly send an allocated transmission rate in Mbps to each host [BF1995]_. +A congestion control scheme can be modelled as an algorithm that adapts the transmission rate (:math:`r_i(t)`) of host :math:`i` based on the feedback received from the network. Different types of feedbacks are possible. The simplest scheme is a binary feedback [CJ1989]_ [Jacobson1988]_ where the hosts simply learn whether the network is congested or not. Some congestion control schemes allow the network to regularly send an allocated transmission rate in Mbps to each host [BF1995]_. .. index:: Additive Increase Multiplicative Decrease (AIMD) Let us focus on the binary feedback scheme which is the most widely used today. Intuitively, the congestion control scheme should decrease the transmission rate of a host when congestion has been detected in the network, in order to avoid congestion collapse. 
Furthermore, the hosts should increase their transmission rate when the network is not congested. Otherwise, the hosts would not be able to efficiently utilise the network. The rate allocated to each host fluctuates with time, depending on the feedback received from the network. The figure below illustrates the evolution of the transmission rates allocated to two hosts in our simple network. Initially, two hosts have a low allocation, but this is not efficient. The allocations increase until the network becomes congested. At this point, the hosts decrease their transmission rate to avoid congestion collapse. If the congestion control scheme works well, after some time the allocations should become both fair and efficient. -.. figure:: ../../book/transport/png/transport-fig-093-c.png +.. figure:: ../../book/transport/png/transport-fig-093-c.png :align: center - :scale: 70 + :scale: 70 - Evolution of the transmission rates + Evolution of the transmission rates Various types of rate adaption algorithms are possible. `Dah Ming Chiu`_ and `Raj Jain`_ have analysed, in [CJ1989]_, different types of algorithms that can be used by a source to adapt its transmission rate to the feedback received from the network. Intuitively, such a rate adaptation algorithm increases the transmission rate when the network is not congested (ensure that the network is efficiently used) and decrease the transmission rate when the network is congested (to avoid congestion collapse). @@ -792,13 +792,13 @@ The simplest form of feedback that the network can send to a source is a binary - :math:`rate(t+1)=\alpha_C + \beta_C rate(t)` when the network is congested - :math:`rate(t+1)=\alpha_N + \beta_N rate(t)` when the network is *not* congested -With a linear adaption algorithm, :math:`\alpha_C,\alpha_N, \beta_C` and :math:`\beta_N` are constants. -The analysis of [CJ1989]_ shows that to be fair and efficient, such a binary rate adaption mechanism must rely on `Additive Increase and Multiplicative Decrease`. When the network is not congested, the hosts should slowly increase their transmission rate (:math:`\beta_N=1~and~\alpha_N>0`). When the network is congested, the hosts must multiplicatively decrease their transmission rate (:math:`\beta_C < 1~and~\alpha_C = 0`). Such an AIMD rate adaptation algorithm can be implemented by the pseudo-code below. +With a linear adaption algorithm, :math:`\alpha_C,\alpha_N, \beta_C` and :math:`\beta_N` are constants. +The analysis of [CJ1989]_ shows that to be fair and efficient, such a binary rate adaption mechanism must rely on `Additive Increase and Multiplicative Decrease`. When the network is not congested, the hosts should slowly increase their transmission rate (:math:`\beta_N=1~and~\alpha_N>0`). When the network is congested, the hosts must multiplicatively decrease their transmission rate (:math:`\beta_C < 1~and~\alpha_C = 0`). Such an AIMD rate adaptation algorithm can be implemented by the pseudo-code below. .. code-block:: python - # Additive Increase Multiplicative Decrease + # Additive Increase Multiplicative Decrease if congestion : rate=rate*betaC # multiplicative decrease, betaC<1 else @@ -808,7 +808,7 @@ The analysis of [CJ1989]_ shows that to be fair and efficient, such a binary rat .. note:: Which binary feedback ? - Two types of binary feedback are possible in computer networks. A first solution is to rely on implicit feedback. This is the solution chosen for TCP. TCP's congestion control scheme [Jacobson1988]_ does not require any cooperation from the router. 
It only assumes that they use buffers and that they discard packets when there is congestion. TCP uses the segment losses as an indication of congestion. When there are no losses, the network is assumed to be not congested. This implies that congestion is the main cause of packet losses. This is true in wired networks, but unfortunately not always true in wireless networks. + Two types of binary feedback are possible in computer networks. A first solution is to rely on implicit feedback. This is the solution chosen for TCP. TCP's congestion control scheme [Jacobson1988]_ does not require any cooperation from the router. It only assumes that they use buffers and that they discard packets when there is congestion. TCP uses the segment losses as an indication of congestion. When there are no losses, the network is assumed to be not congested. This implies that congestion is the main cause of packet losses. This is true in wired networks, but unfortunately not always true in wireless networks. Another solution is to rely on explicit feedback. This is the solution proposed in the DECBit congestion control scheme [RJ1995]_ and used in Frame Relay and ATM networks. This explicit feedback can be implemented in two ways. A first solution would be to define a special message that could be sent by routers to hosts when they are congested. Unfortunately, generating such messages may increase the amount of congestion in the network. Such a congestion indication packet is thus discouraged :rfc:`1812`. A better approach is to allow the intermediate routers to indicate, in the packets that they forward, their current congestion status. Binary feedback can be encoded by using one bit in the packet header. With such a scheme, congested routers set a special bit in the packets that they forward while non-congested routers leave this bit unmodified. The destination host returns the congestion status of the network in the acknowledgements that it sends. Details about such a solution in IP networks may be found in :rfc:`3168`. Unfortunately, as of this writing, this solution is still not deployed despite its potential benefits. @@ -838,7 +838,7 @@ AIMD controls congestion by adjusting the transmission rate of the sources in re The links between the hosts and the routers have a bandwidth of 1 Mbps while the link between the two routers has a bandwidth of 500 Kbps. There is no significant propagation delay in this network. For simplicity, assume that hosts `A` and `B` send 1000 bits packets. The transmission of such a packet on a `host-router` (resp. `router-router` ) link requires 1 msec (resp. 2 msec). If there is no traffic in the network, round-trip-time measured by host `A` is slightly larger than 4 msec. Let us observe the flow of packets with different window sizes to understand the relationship between sending window and transmission rate. -Consider first a window of one segment. This segment takes 4 msec to reach host `D`. The destination replies with an acknowledgement and the next segment can be transmitted. With such a sending window, the transmission rate is roughly 250 segments per second of 250 Kbps. +Consider first a window of one segment. This segment takes 4 msec to reach host `D`. The destination replies with an acknowledgement and the next segment can be transmitted. With such a sending window, the transmission rate is roughly 250 segments per second of 250 Kbps. .. code-block:: console @@ -863,7 +863,7 @@ Consider first a window of one segment. 
This segment takes 4 msec to reach host +-----+----------+ +----------+ |t0+8 | data(2) | | +-----+----------+---------------------- - + Consider now a window of two segments. Host `A` can send two segments within 2 msec on its 1 Mbps link. If the first segment is sent at time :math:`t_{0}`, it reaches host `D` at :math:`t_{0}+4`. Host `D` replies with an acknowledgement that opens the sending window on host `A` and enables it to transmit a new segment. In the meantime, the second segment was buffered by router `R1`. It reaches host `D` at :math:`t_{0}+6` and an acknowledgement is returned. With a window of two segments, host `A` transmits at roughly 500 Kbps, i.e. the transmission rate of the bottleneck link. @@ -889,8 +889,8 @@ Consider now a window of two segments. Host `A` can send two segments within 2 m +-----+----------+----------+----------+ - -Our last example is a window of four segments. These segments are sent at :math:`t_{0}`, :math:`t_{0}+1`, :math:`t_{0}+2` and :math:`t_{0}+3`. The first segment reaches host `D` at :math:`t_{0}+4`. Host `D` replies to this segment by sending an acknowledgement that enables host `A` to transmit its fifth segment. This segment reaches router `R1` at :math:`t_{0}+5`. At that time, router `R1` is transmitting the third segment to router `R2` and the fourth segment is still in its buffers. At time :math:`t_{0}+6`, host `D` receives the second segment and returns the corresponding acknowledgement. This acknowledgement enables host `A` to send its sixth segment. This segment reaches router `R1` at roughly :math:`t_{0}+7`. At that time, the router starts to transmit the fourth segment to router `R2`. Since link `R1-R2` can only sustain 500 Kbps, packets will accumulate in the buffers of `R1`. On average, there will be two packets waiting in the buffers of `R1`. The presence of these two packets will induce an increase of the round-trip-time as measured by the transport protocol. While the first segment was acknowledged within 4 msec, the fifth segment (`data(4)`) that was transmitted at time :math:`t_{0}+4` is only acknowledged at time :math:`t_{0}+11`. On average, the sender transmits at 500 Kbps, but the utilisation of a large window induces a longer delay through the network. + +Our last example is a window of four segments. These segments are sent at :math:`t_{0}`, :math:`t_{0}+1`, :math:`t_{0}+2` and :math:`t_{0}+3`. The first segment reaches host `D` at :math:`t_{0}+4`. Host `D` replies to this segment by sending an acknowledgement that enables host `A` to transmit its fifth segment. This segment reaches router `R1` at :math:`t_{0}+5`. At that time, router `R1` is transmitting the third segment to router `R2` and the fourth segment is still in its buffers. At time :math:`t_{0}+6`, host `D` receives the second segment and returns the corresponding acknowledgement. This acknowledgement enables host `A` to send its sixth segment. This segment reaches router `R1` at roughly :math:`t_{0}+7`. At that time, the router starts to transmit the fourth segment to router `R2`. Since link `R1-R2` can only sustain 500 Kbps, packets will accumulate in the buffers of `R1`. On average, there will be two packets waiting in the buffers of `R1`. The presence of these two packets will induce an increase of the round-trip-time as measured by the transport protocol. While the first segment was acknowledged within 4 msec, the fifth segment (`data(4)`) that was transmitted at time :math:`t_{0}+4` is only acknowledged at time :math:`t_{0}+11`. 
On average, the sender transmits at 500 Kbps, but the utilisation of a large window induces a longer delay through the network. .. code-block:: console @@ -918,40 +918,40 @@ Our last example is a window of four segments. These segments are sent at :math: |t0+9 | | | data(3) | +-----+----------+ data(4) +----------+ |t0+10| data(7) | | | - +-----+----------+----------+----------+ + +-----+----------+----------+----------+ |t0+11| | | data(4) | +-----+----------+ data(5) +----------+ |t0+12| data(8) | | | - +-----+----------+----------+----------+ + +-----+----------+----------+----------+ .. index:: congestion window -From the above example, we can adjust the transmission rate by adjusting the sending window of a reliable transport protocol. A reliable transport protocol cannot send data faster than :math:`\frac{window}{rtt}` where :math:`window` is current sending window. To control the transmission rate, we introduce a `congestion window`. This congestion window limits the sending window. A any time, the sending window is restricted to :math:`\min(swin,cwin)`, where `swin` is the sending window and `cwin` the current `congestion window`. Of course, the window is further constrained by the receive window advertised by the remote peer. With the utilization of a congestion window, a simple reliable transport protocol that uses fixed size segments could implement `AIMD` as follows. +From the above example, we can adjust the transmission rate by adjusting the sending window of a reliable transport protocol. A reliable transport protocol cannot send data faster than :math:`\frac{window}{rtt}` where :math:`window` is current sending window. To control the transmission rate, we introduce a `congestion window`. This congestion window limits the sending window. At any time, the sending window is restricted to :math:`\min(swin,cwin)`, where `swin` is the sending window and `cwin` the current `congestion window`. Of course, the window is further constrained by the receive window advertised by the remote peer. With the utilization of a congestion window, a simple reliable transport protocol that uses fixed size segments could implement `AIMD` as follows. -For the `Additive Increase` part our simple protocol would simply increase its `congestion window` by one segment every round-trip-time. The -`Multiplicative Decrease` part of `AIMD` could be implemented by halving the congestion window when congestion is detected. For simplicity, we assume that congestion is detected thanks to a binary feedback and that no segments are lost. We will discuss in more details how losses affect a real transport protocol like TCP. +For the `Additive Increase` part our simple protocol would simply increase its `congestion window` by one segment every round-trip-time. The +`Multiplicative Decrease` part of `AIMD` could be implemented by halving the congestion window when congestion is detected. For simplicity, we assume that congestion is detected thanks to a binary feedback and that no segments are lost. We will discuss in more details how losses affect a real transport protocol like TCP. A congestion control scheme for our simple transport protocol could be implemented as follows. .. 
code-block:: python - # Initialisation + # Initialisation cwin = 1 # congestion window measured in segments - - # Ack arrival + + # Ack arrival if newack : # new ack, no congestion # increase cwin by one every rtt cwin = cwin+ (1/cwin) - else: + else: # no increase - - Congestion detected: - cwnd=cwin/2 # only once per rtt + Congestion detected: + cwin=cwin/2 # only once per rtt -In the above pseudocode, `cwin` contains the congestion window stored as a real in segments. This congestion window is updated upon the arrival of each acknowledgment and when congestion is detected. For simplicity, we assume that `cwin` is stored as a floating point number but only full segments can be transmitted. + +In the above pseudocode, `cwin` contains the congestion window stored as a real in segments. This congestion window is updated upon the arrival of each acknowledgment and when congestion is detected. For simplicity, we assume that `cwin` is stored as a floating point number but only full segments can be transmitted. As an illustration, let us consider the network scenario above and assume that the router implements the DECBit binary feedback scheme [RJ1995]_. This scheme uses a form of Forward Explicit Congestion Notification and a router marks the congestion bit in arriving packets when its buffer contains one or more packets. In the figure below, we use a `*` to indicate a marked packet. @@ -1015,16 +1015,15 @@ When the connection starts, its congestion window is set to one segment. Segment .. rubric:: Footnotes -.. [#fbufferbloat] There are still some vendors that try to put as many buffers as possible on their network nodes. A recent example is the buffer bloat problem that plagues some low-end Internet routers [GN2011]_. +.. [#fbufferbloat] There are still some vendors that try to put as many buffers as possible on their network nodes. A recent example is the buffer bloat problem that plagues some low-end Internet routers [GN2011]_. .. [#fpps] Some examples of the performance of various types of commercial networks nodes may be found in http://www.cisco.com/web/partners/downloads/765/tools/quickreference/routerperformance.pdf and http://www.cisco.com/web/partners/downloads/765/tools/quickreference/switchperformance.pdf -.. [#fadjust] Some networking technologies allow to adjust dynamically the bandwidth of links. For example, some devices can reduce their bandwidth to preserve energy. We ignore these technologies in this basic course and assume that all links used inside the network have a fixed bandwidth. +.. [#fadjust] Some networking technologies allow to adjust dynamically the bandwidth of links. For example, some devices can reduce their bandwidth to preserve energy. We ignore these technologies in this basic course and assume that all links used inside the network have a fixed bandwidth. -.. [#fcredit] In this section, we focus on congestion control mechanisms that regulate the transmission rate of the hosts. Other types of mechanisms have been proposed in the literature. For example, `credit-based` flow-control has been proposed to avoid congestion in ATM networks [KR1995]_. With a credit-based mechanism, hosts can only send packets once they have received credits from the routers and the credits depend on the occupancy of the router's buffers. +.. [#fcredit] In this section, we focus on congestion control mechanisms that regulate the transmission rate of the hosts. Other types of mechanisms have been proposed in the literature. 
For example, `credit-based` flow-control has been proposed to avoid congestion in ATM networks [KR1995]_. With a credit-based mechanism, hosts can only send packets once they have received credits from the routers and the credits depend on the occupancy of the router's buffers.

.. [#fflowslink] For example, the measurements performed in the Sprint network in 2004 reported more than 10k active TCP connections on a link, see https://research.sprintlabs.com/packstat/packetoverview.php. More recent information about backbone links may be obtained from caida_ 's realtime measurements, see e.g. http://www.caida.org/data/realtime/passive/

.. include:: /links.rst
-
diff --git a/book-2nd/principles/transport.rst b/book-2nd/principles/transport.rst
index e25970e..7c9c2b8 100644
--- a/book-2nd/principles/transport.rst
+++ b/book-2nd/principles/transport.rst
@@ -2,25 +2,25 @@
 .. Some portions of this text come from the first edition of this ebook
 .. This file is licensed under a `creative commons licence `_

-.. warning::
+.. warning::

    This is an unpolished draft of the second edition of this ebook. If you find any error or have suggestions to improve the text, please create an issue via https://github.com/obonaventure/cnp3/issues?milestone=3

 Applications
 ============

-The are two important models used to organise a networked application. The first and oldest model is the client-server model. In this model, a server provides services to clients that exchange information with it. This model is highly asymmetrical : clients send requests and servers perform actions and return responses. It is illustrated in the figure below.
+There are two important models used to organize a networked application. The first and oldest model is the client-server model. In this model, a server provides services to clients that exchange information with it. This model is highly asymmetrical : clients send requests and servers perform actions and return responses. It is illustrated in the figure below.

 .. figure:: ../../book/application/png/app-fig-001-c.png
    :align: center
-   :scale: 50
+   :scale: 50

    The client-server model

 The client-server model was the first model to be used to develop networked applications. This model comes naturally from the mainframes and minicomputers that were the only networked computers used until the 1980s. A minicomputer_ is a multi-user system that is used by tens or more users at the same time. Each user interacts with the minicomputer by using a terminal. Those terminals were mainly a screen, a keyboard and a cable directly connected to the minicomputer.

-There are various types of servers as well as various types of clients. A web server provides information in response to the query sent by its clients. A print server prints documents sent as queries by the client. An email server will forward towards their recipient the email messages sent as queries while a music server will deliver the music requested by the client.
From the viewpoint of the application developer, the client and the server applications directly exchange messages (the horizontal arrows labelled `Queries` and `Responses` in the above figure), but in practice these messages are exchanged thanks to the underlying layers (the vertical arrows in the above figure). In this chapter, we focus on these horizontal exchanges of messages. +There are various types of servers as well as various types of clients. A web server provides information in response to the query sent by its clients. A print server prints documents sent as queries by the client. An email server will forward towards their recipient the email messages sent as queries while a music server will deliver the music requested by the client. From the viewpoint of the application developer, the client and the server applications directly exchange messages (the horizontal arrows labelled `Queries` and `Responses` in the above figure), but in practice these messages are exchanged thanks to the underlying layers (the vertical arrows in the above figure). In this chapter, we focus on these horizontal exchanges of messages. Networked applications do not exchange random messages. In order to ensure that the server is able to understand the queries sent by a client, and also that the client is able to understand the responses sent by the server, they must both agree on a set of syntactical and semantic rules. These rules define the format of the messages exchanged as well as their ordering. This set of rules is called an application-level `protocol`. @@ -31,30 +31,30 @@ An `application-level protocol` is similar to a structured conversation between - Alice : `What time is it ?` - Bob : `11:55` - Alice : `Thank you` - - Bob : `You're welcome` + - Bob : `You're welcome` Such a conversation succeeds if both Alice and Bob speak the same language. If Alice meets Tchang who only speaks Chinese, she won't be able to ask him the current time. A conversation between humans can be more complex. For example, assume that Bob is a security guard whose duty is to only allow trusted secret agents to enter a meeting room. If all agents know a secret password, the conversation between Bob and Trudy could be as follows : - Bob : `What is the secret password ?` - Trudy : `1234` - Bob : `This is the correct password, you're welcome` - + If Alice wants to enter the meeting room but does not know the password, her conversation could be as follows : - Bob : `What is the secret password ?` - Alice : `3.1415` - Bob : `This is not the correct password.` -Human conversations can be very formal, e.g. when soldiers communicate with their hierarchy, or informal such as when friends discuss. Computers that communicate are more akin to soldiers and require well-defined rules to ensure an successful exchange of information. There are two types of rules that define how information can be exchanged between computers : +Human conversations can be very formal, e.g. when soldiers communicate with their hierarchy, or informal such as when friends discuss. Computers that communicate are more akin to soldiers and require well-defined rules to ensure a successful exchange of information. There are two types of rules that define how information can be exchanged between computers : - - syntactical rules that precisely define the format of the messages that are exchanged. 
As computers only process bits, the syntactical rules specify how information is encoded as bit strings + - syntactical rules that precisely define the format of the messages that are exchanged. As computers only process bits, the syntactical rules specify how information is encoded as bit strings - organisation of the information flow. For many applications, the flow of information must be structured and there are precedence relationships between the different types of information. In the time example above, Alice must greet Bob before asking for the current time. Alice would not ask for the current time first and greet Bob afterwards. Such precedence relationships exist in networked applications as well. For example, a server must receive a username and a valid password before accepting more complex commands from its clients. Let us first discuss the syntactical rules. We will later explain how the information flow can be organised by analysing real networked applications. Application-layer protocols exchange two types of messages. Some protocols such as those used to support electronic mail exchange messages expressed as strings or lines of characters. As the transport layer allows hosts to exchange bytes, they need to agree on a common representation of the characters. The first and simplest method to encode characters is to use the :term:`ASCII` table. :rfc:`20` provides the ASCII table that is used by many protocols on the Internet. For example, the table defines the following binary representations : - - `A` : `1000011b` + - `A` : `1000011b` - `0` : `0110000b` - `z` : `1111010b` - `@` : `1000000b` @@ -72,7 +72,7 @@ Most applications exchange strings that are composed of fixed or variable number .. figure:: ../../book/application/pkt/bnf.png :align: center - :scale: 100 + :scale: 100 A simple BNF specification @@ -83,13 +83,13 @@ Besides character strings, some applications also need to exchange 16 bits and 3 - send the most significant byte followed by the least significant byte - send the least significant byte followed by the most significant byte -The first possibility was named `big-endian` in a note written by Cohen [Cohen1980]_ while the second was named `little-endian`. Vendors of CPUs that used `big-endian` in memory insisted on using `big-endian` encoding in networked applications while vendors of CPUs that used `little-endian` recommended the opposite. Several studies were written on the relative merits of each type of encoding, but the discussion became almost a religious issue [Cohen1980]_. Eventually, the Internet chose the `big-endian` encoding, i.e. multi-byte fields are always transmitted by sending the most significant byte first, :rfc:`791` refers to this encoding as the :term:`network-byte order`. Most libraries [#fhtonl]_ used to write networked applications contain functions to convert multi-byte fields from memory to the network byte order and vice versa. +The first possibility was named `big-endian` in a note written by Cohen [Cohen1980]_ while the second was named `little-endian`. Vendors of CPUs that used `big-endian` in memory insisted on using `big-endian` encoding in networked applications while vendors of CPUs that used `little-endian` recommended the opposite. Several studies were written on the relative merits of each type of encoding, but the discussion became almost a religious issue [Cohen1980]_. Eventually, the Internet chose the `big-endian` encoding, i.e. 
multi-byte fields are always transmitted by sending the most significant byte first, :rfc:`791` refers to this encoding as the :term:`network-byte order`. Most libraries [#fhtonl]_ used to write networked applications contain functions to convert multi-byte fields from memory to the network byte order and vice versa. -Besides 16 and 32 bit words, some applications need to exchange data structures containing bit fields of various lengths. For example, a message may be composed of a 16 bits field followed by eight, one bit flags, a 24 bits field and two 8 bits bytes. Internet protocol specifications will define such a message by using a representation such as the one below. In this representation, each line corresponds to 32 bits and the vertical lines are used to delineate fields. The numbers above the lines indicate the bit positions in the 32-bits word, with the high order bit at position `0`. +Besides 16 and 32 bit words, some applications need to exchange data structures containing bit fields of various lengths. For example, a message may be composed of a 16 bits field followed by eight, one bit flags, a 24 bits field and two 8 bits bytes. Internet protocol specifications will define such a message by using a representation such as the one below. In this representation, each line corresponds to 32 bits and the vertical lines are used to delineate fields. The numbers above the lines indicate the bit positions in the 32-bits word, with the high order bit at position `0`. .. figure:: ../../book/application/pkt/message.png :align: center - :scale: 100 + :scale: 100 Message format @@ -107,12 +107,12 @@ The peer-to-peer model emerged during the last ten years as another possible arc .. unstructured p2P like gnutella or freenet .. structured like chord as example -.. Surveys : +.. Surveys : .. Chord : [SMKKB2001]_ -.. The peer-to-peer model +.. The peer-to-peer model @@ -125,23 +125,23 @@ A network is always designed and built to enable applications running on hosts t .. index:: connectionless service -The network layer ensures the delivery of packets on a hop-by-hop basis through intermediate nodes. As such, it provides a service to the upper layer. In practice, this layer is usually the `transport layer` that improves the service provided by the `network layer` to make it useable by applications. +The network layer ensures the delivery of packets on a hop-by-hop basis through intermediate nodes. As such, it provides a service to the upper layer. In practice, this layer is usually the `transport layer` that improves the service provided by the `network layer` to make it useable by applications. .. figure:: ../../book/intro/svg/intro-figures-030-c.png :align: center - :scale: 80 + :scale: 80 The transport layer -Most networks use a datagram organisation and provide a simple service which is called the `connectionless service`. +Most networks use a datagram organisation and provide a simple service which is called the `connectionless service`. The figure below provides a representation of the connectionless service as a `time-sequence diagram`. The user on the left, having address `S`, issues a `Data.request` primitive containing Service Data Unit (SDU) `M` that must be delivered by the service provider to destination `D`. The dashed line between the two primitives indicates that the `Data.indication` primitive that is delivered to the user on the right corresponds to the `Data.request` primitive sent by the user on the left. .. 
figure:: ../../book/intro/svg/intro-figures-017-c.png :align: center - :scale: 80 + :scale: 80 A simple connectionless service @@ -153,32 +153,32 @@ An `unreliable connectionless` service may suffer from various types of problems .. figure:: ../../book/intro/svg/intro-figures-034-c.png :align: center - :scale: 80 + :scale: 80 An unreliable connectionless service may loose SDUs -In practice, an `unreliable connectionless service` will usually deliver a large fraction of the SDUs. However, since the delivery of SDUs is not guaranteed, the user must be able to recover from the loss of any SDU. +In practice, an `unreliable connectionless service` will usually deliver a large fraction of the SDUs. However, since the delivery of SDUs is not guaranteed, the user must be able to recover from the loss of any SDU. A second imperfection that may affect an `unreliable connectionless service` is that it may duplicate SDUs. Some packets may be duplicated in a network and be delivered twice to their destination. This is illustrated by the time-sequence diagram below. .. figure:: ../../book/intro/svg/intro-figures-033-c.png :align: center - :scale: 80 + :scale: 80 An unreliable connectionless service may duplicate SDUs -Finally, some unreliable connectionless service providers may deliver to a destination a different SDU than the one that was supplied in the `Data.request`. This is illustrated in the figure below. +Finally, some unreliable connectionless service providers may deliver to a destination a different SDU than the one that was supplied in the `Data.request`. This is illustrated in the figure below. .. figure:: ../../book/intro/svg/intro-figures-035-c.png :align: center - :scale: 80 + :scale: 80 An unreliable connectionless service may deliver erroneous SDUs As the transport layer is built on top of the network layer, it is important to know the key features of the network layer service. In this book, we only consider the `connectionless network layer service` which is the most widespread. Its main characteristics are : - - the `connectionless network layer service` can only transfer SDUs of *limited size* + - the `connectionless network layer service` can only transfer SDUs of *limited size* - the `connectionless network layer service` may discard SDUs - the `connectionless network layer service` may corrupt SDUs - the `connectionless network layer service` may delay, reorder or even duplicate SDUs @@ -186,9 +186,9 @@ As the transport layer is built on top of the network layer, it is important to .. figure:: ../../book/transport/png/transport-fig-001-c.png :align: center - :scale: 80 + :scale: 80 - The transport layer + The transport layer These imperfections of the `connectionless network layer service` are caused by the operations of the `network layer`. This `layer` is able to deliver packets to their intended destination, but it cannot guarantee this delivery. The main cause of packet losses and errors are the buffers used on the network nodes. If the buffers of one of these nodes becomes full, all arriving packets must be discarded. This situation happens frequently in practice. Transmission errors can also affect packet transmissions on links where reliable transmission techniques are not enabled or because of errors in the buffers of the network nodes. 
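These imperfections can be made concrete with a small simulation. The sketch below is a toy model of the connectionless network layer service, not an actual protocol : the probabilities are arbitrary, and delays and reordering are not modelled.

.. code-block:: python

    import random

    def unreliable_send(sdu, deliver, p_loss=0.05, p_dup=0.01, p_corrupt=0.01):
        # model of an unreliable connectionless service : the SDU may be
        # discarded, corrupted or delivered twice
        if random.random() < p_loss:
            return                                  # the SDU is lost
        if random.random() < p_corrupt:
            sdu = bytes(b ^ 0x01 for b in sdu)      # flip one bit in every byte
        deliver(sdu)
        if random.random() < p_dup:
            deliver(sdu)                            # the SDU is duplicated

    received = []
    for i in range(100):
        unreliable_send(bytes([i]), received.append)
    print(len(received), "SDUs delivered out of 100 sent")

Any user of such a service, and in particular the transport layer, must be designed to recover from these events.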
@@ -206,7 +206,7 @@ When two applications need to communicate, they need to structure their exchange The connectionless service ^^^^^^^^^^^^^^^^^^^^^^^^^^ -The `connectionless service` that we have described earlier is frequently used by users who need to exchange small SDUs. It can be easily built on top of the connectionless network layer service that we have described earlier. Users needing to either send or receive several different and potentially large SDUs, or who need structured exchanges often prefer the `connection-oriented service`. +The `connectionless service` that we have described earlier is frequently used by users who need to exchange small SDUs. It can be easily built on top of the connectionless network layer service that we have described earlier. Users needing to either send or receive several different and potentially large SDUs, or who need structured exchanges often prefer the `connection-oriented service`. The connection-oriented service ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ @@ -220,7 +220,7 @@ The establishment of a connection can be modelled by using four primitives : `Co .. figure:: ../../book/intro/svg/intro-figures-019-c.png :align: center - :scale: 80 + :scale: 80 Connection establishment @@ -229,7 +229,7 @@ The example above shows a successful connection establishment. However, in pract .. figure:: ../../book/intro/svg/intro-figures-020-c.png :align: center - :scale: 80 + :scale: 80 Two types of rejection for a connection establishment attempt @@ -240,7 +240,7 @@ Once the connection has been established, the service provider supplies two data .. figure:: ../../book/intro/svg/intro-figures-021-c.png :align: center - :scale: 80 + :scale: 80 Message-mode transfer in a connection oriented service @@ -251,7 +251,7 @@ Unfortunately, the `message-mode` transfer is not widely used on the Internet. O .. figure:: ../../book/intro/svg/intro-figures-022-c.png :align: center - :scale: 80 + :scale: 80 Stream-mode transfer in a connection oriented service @@ -263,7 +263,7 @@ The third phase of a connection is when it needs to be released. As a connection .. figure:: ../../book/intro/svg/intro-figures-038-c.png :align: center - :scale: 80 + :scale: 80 Abrupt connection release initiated by the service provider @@ -273,7 +273,7 @@ An abrupt connection release can also be triggered by one of the users. If a use .. figure:: ../../book/intro/svg/intro-figures-023-c.png :align: center - :scale: 80 + :scale: 80 Abrupt connection release initiated by a user @@ -284,14 +284,14 @@ To ensure a reliable delivery of the SDUs sent by each user over a connection, w .. figure:: ../../book/intro/svg/intro-figures-024-c.png :align: center - :scale: 80 + :scale: 80 Graceful connection release .. note:: Reliability of the connection-oriented service - An important point to note about the connection-oriented service is its reliability. A `connection-oriented` service can only guarantee the correct delivery of all SDUs provided that the connection has been released gracefully. This implies that while the connection is active, there is no guarantee for the actual delivery of the SDUs exchanged as the connection may need to be released abruptly at any time. + An important point to note about the connection-oriented service is its reliability. A `connection-oriented` service can only guarantee the correct delivery of all SDUs provided that the connection has been released gracefully. 
 .. note:: Reliability of the connection-oriented service
 
-   An important point to note about the connection-oriented service is its reliability. A `connection-oriented` service can only guarantee the correct delivery of all SDUs provided that the connection has been released gracefully. This implies that while the connection is active, there is no guarantee for the actual delivery of the SDUs exchanged as the connection may need to be released abruptly at any time.
+   An important point to note about the connection-oriented service is its reliability. A `connection-oriented` service can only guarantee the correct delivery of all SDUs provided that the connection has been released gracefully. This implies that while the connection is active, there is no guarantee for the actual delivery of the SDUs exchanged as the connection may need to be released abruptly at any time.
 
 .. index:: request-response service
 
@@ -300,7 +300,7 @@ The request-response service
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 .. index:: Remote Procedure Call, RPC
 
-The `request-response service` is a compromise between the `connectionless service` and the `connection-oriented service`. Many applications need to send a small amount of data and receive a small amount of information back. This is similar to procedure calls in programming languages. A call to a procedure takes a few arguments and returns a simple answer. In a network, it is sometimes useful to execute a procedure on a different host and receive the result of the computation. Executing a procedure on another host is often called Remote Procedure Call. It is possible to use the `connectionless service` for this application. However, since this service is usually unreliable, this would force the application to deal with any type of error that could occur. Using the `connection oriented service` is another alternative. This service ensures the reliable delivery of the data, but a connection must be created before the beginning of the data transfert. This overhead can be important for applications that only exchange a small amount of data. 
+The `request-response service` is a compromise between the `connectionless service` and the `connection-oriented service`. Many applications need to send a small amount of data and receive a small amount of information back. This is similar to procedure calls in programming languages. A call to a procedure takes a few arguments and returns a simple answer. In a network, it is sometimes useful to execute a procedure on a different host and receive the result of the computation. Executing a procedure on another host is often called Remote Procedure Call. It is possible to use the `connectionless service` for this application. However, since this service is usually unreliable, this would force the application to deal with any type of error that could occur. Using the `connection oriented service` is another alternative. This service ensures the reliable delivery of the data, but a connection must be created before the beginning of the data transfer. This overhead can be important for applications that only exchange a small amount of data.
 
 The `request-response service` allows to efficiently exchange small amounts of information in a request and associate it with the corresponding response. This service can be depicted by using the time-sequence diagram below.
 
@@ -330,7 +330,7 @@ The `request-response service` allows to efficiently exchange small amounts of i
 The transport layer
 -------------------
 
-The transport layer entity interacts with both a user in the application layer and the network layer. It improves the network layer service to make it useable by applications. From the application's viewpoint, the main limitations of the network layer service that its service is unreliable:
+The transport layer entity interacts with both a user in the application layer and the network layer. It improves the network layer service to make it useable by applications. From the application's viewpoint, the main limitations of the network layer service are that its service is unreliable:
 
 - the network layer may corrupt data
-- the network layer may loose data
+- the network layer may lose data
@@ -338,11 +338,11 @@ The transport layer entity interacts with both a user in the application layer a
 - the network layer has an upper bound on maximum length of the data
 - the network layer may duplicate data
 
-To deal with these issues, the transport layer includes several mechanisms that depend on the service that it provides. It interacts with both the applications and the underlying network layer. 
+To deal with these issues, the transport layer includes several mechanisms that depend on the service that it provides. It interacts with both the applications and the underlying network layer.
 
 .. figure:: ../../book/transport/svg/transport-fig-007-c.png
    :align: center
-   :scale: 80 
+   :scale: 80
 
    Interactions between the transport layer, its user, and its network layer provider
 
@@ -353,7 +353,7 @@ Connectionless transport
 ^^^^^^^^^^^^^^^^^^^^^^^^
 
 The simplest service that can be provided in the transport layer is the connectionless transport service. Compared to the connectionless network layer service, this transport service includes two additional features : 
- 
+
   - an `error detection` mechanism that allows to detect corrupted data
   - a `multiplexing technique` that enables several applications running on one host to exchange information with another host
 
-To exchange data, the transport protocol encapsulates the SDU produced by its user inside a `segment`. The `segment` is the unit of transfert of information in the transport layer. Transport layer entities always exchange segments. When a transport layer entity creates a segment, this segment is encapsulated by the network layer into a packet which contains the segment as its payload and a network header. The packet is then encapsulated in a frame to be transmitted in the datalink layer.
+To exchange data, the transport protocol encapsulates the SDU produced by its user inside a `segment`. The `segment` is the unit of transfer of information in the transport layer. Transport layer entities always exchange segments. When a transport layer entity creates a segment, this segment is encapsulated by the network layer into a packet which contains the segment as its payload and a network header. The packet is then encapsulated in a frame to be transmitted in the datalink layer.
 
-A `segment` also contains control information, usually stored inside a `header` and the payload that comes from the application. To detect transmission errors, transport protocols rely on checksums or CRCs like the datalink layer protocols. 
+A `segment` also contains control information, usually stored inside a `header`, and the payload that comes from the application. To detect transmission errors, transport protocols rely on checksums or CRCs like the datalink layer protocols.
 
-Compared to the connectionless network layer service, the transport layer service allows several applications running on a host to exchange SDUs with several other applications running on remote hosts. Let us consider two hosts, e.g. a client and a server. The network layer service allows the client to send information to the server, but if an application running on the client wants to contact a particular application running on the server, then an additional addressing mechanism is required other than the network layer address that identifies a host, in order to differentiate the application running on a host. This additional addressing can be provided by using `port numbers`. When a server application is launched on a host, it registers a `port number`. This `port number` will be used by the clients to contact the server process.
+Compared to the connectionless network layer service, the transport layer service allows several applications running on a host to exchange SDUs with several other applications running on remote hosts. Let us consider two hosts, e.g. a client and a server. The network layer service allows the client to send information to the server, but if an application running on the client wants to contact a particular application running on the server, then an additional addressing mechanism is required other than the network layer address that identifies a host, in order to differentiate the application running on a host. This additional addressing can be provided by using `port numbers`. When a server application is launched on a host, it registers a `port number`. This `port number` will be used by the clients to contact the server process.
 
 The figure below shows a typical usage of port numbers. The client process uses port number `1234` while the server process uses port number `5678`. When the client sends a request, it is identified as originating from port number `1234` on the client host and destined to port number `5678` on the server host. When the server process replies to this request, the server's transport layer returns the reply as originating from port `5678` on the server host and destined to port `1234` on the client host.
 
 .. figure:: ../../book/transport/svg/udp-ports.png
    :align: center
-   :scale: 70 
+   :scale: 70
 
-   Utilisation of port numbers 
+   Utilisation of port numbers
 
 To support the connection-oriented service, the transport layer needs to include several mechanisms to enrich the connectionless network-layer service. We discuss these mechanisms in the following sections.
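+Before looking at these mechanisms, the sketch below recaps how a connectionless transport entity could combine the two features introduced above. It is a deliberately simplified illustration (the segment layout and the toy checksum are ours, real protocols use the Internet checksum or a CRC) : the receiving entity first verifies the checksum and then uses the destination port to locate the application.
+
+.. code-block:: python
+
+   def checksum(payload):
+       # toy error detection : sum of the bytes modulo 2^16
+       return sum(payload) % 65536
+
+   def make_segment(src_port, dst_port, payload):
+       return (src_port, dst_port, checksum(payload), payload)
+
+   # one callback per registered port, as a server would register port 5678
+   applications = {5678: lambda sdu: print('server received', sdu)}
+
+   def demultiplex(segment):
+       src_port, dst_port, chk, payload = segment
+       if chk != checksum(payload):
+           return                     # corrupted segments are discarded
+       app = applications.get(dst_port)
+       if app is not None:            # otherwise no application on this port
+           app(payload)
+
+   demultiplex(make_segment(1234, 5678, b'request'))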
@@ -379,25 +379,25 @@
 Connection establishment
 ^^^^^^^^^^^^^^^^^^^^^^^^
 
-Like the connectionless service, the connection-oriented service allows several applications running on a given host to exchange data with other hosts. The port numbers described above for the connectionless service are also used by the connection-oriented service to multiplex several applications. Similarly, connection-oriented protocols used checksums/CRCs to detect transmission errors and discard segments containing an invalid checksum/CRC. 
+Like the connectionless service, the connection-oriented service allows several applications running on a given host to exchange data with other hosts. The port numbers described above for the connectionless service are also used by the connection-oriented service to multiplex several applications. Similarly, connection-oriented protocols use checksums/CRCs to detect transmission errors and discard segments containing an invalid checksum/CRC.
 
-An important difference between the connectionless service and the connection-oriented one is that the transport entities in the latter maintain some state during lifetime of the connection. This state is created when a connection is established and is removed when it is released. 
+An important difference between the connectionless service and the connection-oriented one is that the transport entities in the latter maintain some state during the lifetime of the connection. This state is created when a connection is established and is removed when it is released.
 
 The simplest approach to establish a transport connection would be to define two special control segments : `CR` and `CA`. The `CR` segment is sent by the transport entity that wishes to initiate a connection. If the remote entity wishes to accept the connection, it replies by sending a `CA` segment. The `CR` and `CA` segments contain `port numbers` that allow to identify the communicating applications. The transport connection is considered to be established once the `CA` segment has been received. At that point, data segments can be sent in both directions.
- 
+
 .. figure:: ../../book/transport/png/transport-fig-045-c.png
    :align: center
-   :scale: 70 
+   :scale: 70
 
-   Naive transport connection establishment 
+   Naive transport connection establishment
 
-Unfortunately, this scheme is not sufficient given the unreliable network layer. Since the network layer is imperfect, the `CR` or `CA` segments can be lost, delayed, or suffer from transmission errors. To deal with these problems, the control segments must be protected by using a CRC or checksum to detect transmission errors. Furthermore, since the `CA` segment acknowledges the reception of the `CR` segment, the `CR` segment can be protected by using a retransmission timer. 
+Unfortunately, this scheme is not sufficient given the unreliable network layer. Since the network layer is imperfect, the `CR` or `CA` segments can be lost, delayed, or suffer from transmission errors. To deal with these problems, the control segments must be protected by using a CRC or checksum to detect transmission errors. Furthermore, since the `CA` segment acknowledges the reception of the `CR` segment, the `CR` segment can be protected by using a retransmission timer.
 
-Unfortunately, this scheme is not sufficient to ensure the reliability of the transport service. Consider for example a short-lived transport connection where a single, but important transfer (e.g. money transfer from a bank account) is sent. Such a short-lived connection starts with a `CR` segment acknowledged by a `CA` segment, then the data segment is sent, acknowledged and the connection terminates. Unfortunately, as the network layer service is unreliable, delays combined to retransmissions may lead to the situation depicted in the figure below, where a delayed `CR` and data segments from a former connection are accepted by the receiving entity as valid segments, and the corresponding data is delivered to the user. Duplicating SDUs is not acceptable, and the transport protocol must solve this problem. 
+Unfortunately, this scheme is not sufficient to ensure the reliability of the transport service. Consider for example a short-lived transport connection where a single, but important transfer (e.g. a money transfer from a bank account) is made. Such a short-lived connection starts with a `CR` segment acknowledged by a `CA` segment, then the data segment is sent, acknowledged and the connection terminates. Unfortunately, as the network layer service is unreliable, delays combined with retransmissions may lead to the situation depicted in the figure below, where a delayed `CR` and data segments from a former connection are accepted by the receiving entity as valid segments, and the corresponding data is delivered to the user. Duplicating SDUs is not acceptable, and the transport protocol must solve this problem.
 
 .. figure:: ../../book/transport/png/transport-fig-047-c.png
    :align: center
-   :scale: 70 
+   :scale: 70
 
    Duplicate transport connections ?
 
@@ -405,7 +405,7 @@
 
 ..
index:: Maximum Segment Lifetime (MSL), transport clock -To avoid these duplicates, transport protocols require the network layer to bound the `Maximum Segment Lifetime (MSL)`. The organisation of the network must guarantee that no segment remains in the network for longer than `MSL` seconds. For example, on today's Internet, `MSL` is expected to be 2 minutes. To avoid duplicate transport connections, transport protocol entities must be able to safely distinguish between a duplicate `CR` segment and a new `CR` segment, without forcing each transport entity to remember all the transport connections that it has established in the past. +To avoid these duplicates, transport protocols require the network layer to bound the `Maximum Segment Lifetime (MSL)`. The organisation of the network must guarantee that no segment remains in the network for longer than `MSL` seconds. For example, on today's Internet, `MSL` is expected to be 2 minutes. To avoid duplicate transport connections, transport protocol entities must be able to safely distinguish between a duplicate `CR` segment and a new `CR` segment, without forcing each transport entity to remember all the transport connections that it has established in the past. A classical solution to avoid remembering the previous transport connections to detect duplicates is to use a clock inside each transport entity. This `transport clock` has the following characteristics : @@ -414,7 +414,7 @@ A classical solution to avoid remembering the previous transport connections to .. figure:: ../../book/transport/png/transport-fig-048-c.png :align: center - :scale: 70 + :scale: 70 Transport clock @@ -425,7 +425,7 @@ This `transport clock` can now be combined with an exchange of three segments, c #. The initiating transport entity sends a `CR` segment. This segment requests the establishment of a transport connection. It contains a port number (not shown in the figure) and a sequence number (`seq=x` in the figure below) whose value is extracted from the `transport clock`. The transmission of the `CR` segment is protected by a retransmission timer. - #. The remote transport entity processes the `CR` segment and creates state for the connection attempt. At this stage, the remote entity does not yet know whether this is a new connection attempt or a duplicate segment. It returns a `CA` segment that contains an acknowledgement number to confirm the reception of the `CR` segment (`ack=x` in the figure below) and a sequence number (`seq=y` in the figure below) whose value is extracted from its transport clock. At this stage, the connection is not yet established. + #. The remote transport entity processes the `CR` segment and creates a state for the connection attempt. At this stage, the remote entity does not yet know whether this is a new connection attempt or a duplicate segment. It returns a `CA` segment that contains an acknowledgement number to confirm the reception of the `CR` segment (`ack=x` in the figure below) and a sequence number (`seq=y` in the figure below) whose value is extracted from its transport clock. At this stage, the connection is not yet established. #. The initiating entity receives the `CA` segment. The acknowledgement number of this segment confirms that the remote entity has correctly received the `CR` segment. The transport connection is considered to be established by the initiating entity and the numbering of the data segments starts at sequence number `x`. 
Before sending data segments, the initiating entity must acknowledge the received `CA` segments by sending another `CA` segment.
 
@@ -435,17 +435,17 @@ This `transport clock` can now be combined with an exchange of three segments, c
 
 .. figure:: ../../book/transport/png/transport-fig-049-c.png
    :align: center
-   :scale: 70 
+   :scale: 70
 
    Three-way handshake
 
 Thanks to the three way handshake, transport entities avoid duplicate transport connections. This is illustrated by considering the three scenarios below.
 
-The first scenario is when the remote entity receives an old `CR` segment. It considers this `CR` segment as a connection establishment attempt and replies by sending a `CA` segment. However, the initiating host cannot match the received `CA` segment with a previous connection attempt. It sends a control segment (`REJECT` in the figure below) to cancel the spurious connection attempt. The remote entity cancels the connection attempt upon reception of this control segment. 
+The first scenario is when the remote entity receives an old `CR` segment. It considers this `CR` segment as a connection establishment attempt and replies by sending a `CA` segment. However, the initiating host cannot match the received `CA` segment with a previous connection attempt. It sends a control segment (`REJECT` in the figure below) to cancel the spurious connection attempt. The remote entity cancels the connection attempt upon reception of this control segment.
 
 .. figure:: ../../book/transport/png/transport-fig-050-c.png
    :align: center
-   :scale: 70 
+   :scale: 70
 
    Three-way handshake : recovery from a duplicate `CR`
 
@@ -454,15 +454,15 @@ A second scenario is when the initiating entity sends a `CR` segment that does n
 
 .. figure:: ../../book/transport/png/transport-fig-051-c.png
    :align: center
-   :scale: 70 
+   :scale: 70
 
    Three-way handshake : recovery from a duplicate `CA`
 
-The last scenario is less likely, but it it important to consider it as well. The remote entity receives an old `CR` segment. It notes the connection attempt and acknowledges it by sending a `CA` segment. The initiating entity does not have a matching connection attempt and replies by sending a `REJECT`. Unfortunately, this segment never reaches the remote entity. Instead, the remote entity receives a retransmission of an older `CA` segment that contains the same sequence number as the first `CR` segment. This `CA` segment cannot be accepted by the remote entity as a confirmation of the transport connection as its acknowledgement number cannot have the same value as the sequence number of the first `CA` segment. 
+The last scenario is less likely, but it is important to consider it as well. The remote entity receives an old `CR` segment. It notes the connection attempt and acknowledges it by sending a `CA` segment. The initiating entity does not have a matching connection attempt and replies by sending a `REJECT`. Unfortunately, this segment never reaches the remote entity. Instead, the remote entity receives a retransmission of an older `CA` segment that contains the same sequence number as the first `CR` segment. This `CA` segment cannot be accepted by the remote entity as a confirmation of the transport connection as its acknowledgement number cannot have the same value as the sequence number of the first `CA` segment.
 
 .. figure:: ../../book/transport/png/transport-fig-052-c.png
    :align: center
-   :scale: 70 
+   :scale: 70
 
    Three-way handshake : recovery from duplicates `CR` and `CA`
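+The essence of these scenarios can be condensed in a few lines of code. The sketch below is a toy model (the class and the segment tuples are invented for this example, this is not the state machine of a real protocol) : sequence numbers are drawn from a monotonically increasing transport clock, and a `CA` that does not acknowledge a pending `CR` reveals a duplicate.
+
+.. code-block:: python
+
+   import itertools
+
+   clock = itertools.count(1000)          # transport clock : never goes backwards
+
+   class Initiator:
+       def __init__(self):
+           self.pending = set()           # sequence numbers of our CR segments
+
+       def send_cr(self):
+           x = next(clock)                # sequence number taken from the clock
+           self.pending.add(x)
+           return ('CR', x)
+
+       def recv_ca(self, seq_y, ack_x):
+           if ack_x not in self.pending:  # CA for an attempt we never made :
+               return ('REJECT', seq_y)   # the remote entity saw an old CR
+           self.pending.remove(ack_x)
+           return ('CA', seq_y)           # third segment, connection established
+
+   init = Initiator()
+   _, x = init.send_cr()                  # first segment : CR(seq=x)
+   y = next(clock)                        # the remote entity picks seq=y
+   print(init.recv_ca(y, x))              # ('CA', 1001) : handshake completes
+   print(init.recv_ca(y, x - 5))          # ('REJECT', 1001) : duplicate detected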
@@ -514,15 +514,15 @@ Using sequence numbers to count bytes has also one advantage when the transport
 
 Compared to reliable protocols in the datalink layer, reliable transport protocols encode their sequence numbers in more bits. 32 bits and 64 bits sequence numbers are frequent in the transport layer while some datalink layer protocols encode their sequence numbers in an 8 bits field. This large sequence number space is motivated by two reasons. First, since the sequence number is incremented for each transmitted byte, a single segment may consume one or several thousands of sequence numbers. Second, a reliable transport protocol must be able to detect delayed segments. This can only be done if the number of bytes transmitted during the MSL period is smaller than the sequence number space. Otherwise, there is a risk of accepting duplicate segments.
 
-`Go-back-n` and `selective repeat` can be used in the transport layer as in the datalink layer. Since the network layer does not guarantee an in-order delivery of the packets, a transport entity should always store the segments that it receives out-of-sequence. For this reason, most transport protocols will opt for some form of selective repeat mechanism. 
+`Go-back-n` and `selective repeat` can be used in the transport layer as in the datalink layer. Since the network layer does not guarantee an in-order delivery of the packets, a transport entity should always store the segments that it receives out-of-sequence. For this reason, most transport protocols will opt for some form of selective repeat mechanism.
 
-In the datalink layer, the sliding window has usually a fixed size which depends on the amount of buffers allocated to the datalink layer entity. Such a datalink layer entity usually serves one or a few network layer entities. In the transport layer, the situation is different. A single transport layer entity serves a large and varying number of application processes. Each transport layer entity manages a pool of buffers that needs to be shared between all these processes. Transport entity are usually implemented inside the operating system kernel and shares memory with other parts of the system. Furthermore, a transport layer entity must support several (possibly hundreds or thousands) of transport connections at the same time. This implies that the memory which can be used to support the sending or the receiving buffer of a transport connection may change during the lifetime of the connection [#fautotune]_ . Thus, a transport protocol must allow the sender and the receiver to adjust their window sizes.
+In the datalink layer, the sliding window has usually a fixed size which depends on the amount of buffers allocated to the datalink layer entity. Such a datalink layer entity usually serves one or a few network layer entities. In the transport layer, the situation is different. A single transport layer entity serves a large and varying number of application processes. Each transport layer entity manages a pool of buffers that needs to be shared between all these processes. Transport entities are usually implemented inside the operating system kernel and share memory with other parts of the system. Furthermore, a transport layer entity must support several (possibly hundreds or thousands) of transport connections at the same time. This implies that the memory which can be used to support the sending or the receiving buffer of a transport connection may change during the lifetime of the connection [#fautotune]_ . Thus, a transport protocol must allow the sender and the receiver to adjust their window sizes.
 
 To deal with this issue, transport protocols allow the receiver to advertise the current size of its receiving window in all the acknowledgements that it sends. The receiving window advertised by the receiver bounds the size of the sending buffer used by the sender. In practice, the sender maintains two state variables : `swin`, the size of its sending window (that may be adjusted by the system) and `rwin`, the size of the receiving window advertised by the receiver. At any time, the number of unacknowledged segments cannot be larger than :math:`\min(swin,rwin)` [#facklost]_ . The utilisation of dynamic windows is illustrated in the figure below.
 
 .. figure:: ../../book/transport/svg/transport-fig-039.png
    :align: center
-   :scale: 90 
+   :scale: 90
 
    Dynamic receiving window
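+The :math:`\min(swin,rwin)` rule is easy to express in code. The sketch below is our own minimal model of a sender (segment transmission is reduced to a counter, this is not the code of an actual transport entity) : the usable window is recomputed each time an acknowledgement carries a new `rwin` value.
+
+.. code-block:: python
+
+   class Sender:
+       def __init__(self, swin):
+           self.swin = swin        # sending buffer, adjusted by the system
+           self.rwin = 1           # last window advertised by the receiver
+           self.unacked = 0        # segments sent but not yet acknowledged
+
+       def window(self):
+           return min(self.swin, self.rwin)
+
+       def can_send(self):
+           return self.unacked < self.window()
+
+       def ack_received(self, acked, rwin):
+           self.unacked -= acked
+           self.rwin = rwin        # may be zero : the receiver closes the window
+
+   s = Sender(swin=4)
+   while s.can_send():
+       s.unacked += 1              # send one segment
+   print(s.unacked)                # 1 : rwin, not swin, limits the sender
+   s.ack_received(1, rwin=3)
+   print(s.window())               # 3 : the window grows with the new rwin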
 The receiver may adjust its advertised receive window based on its current memory consumption, but also to limit the bandwidth used by the sender.
 
 .. figure:: ../../book/transport/png/transport-fig-040-c.png
    :align: center
-   :scale: 70 
+   :scale: 70
 
    Risk of deadlock with dynamic windows
 
@@ -544,7 +544,7 @@ To conclude our description of the basic mechanisms found in transport protocols
 
 .. figure:: ../../book/transport/png/transport-fig-041-c.png
    :align: center
-   :scale: 70 
+   :scale: 70
 
    Ambiguities caused by excessive delays
 
@@ -558,18 +558,18 @@
 Connection release
 ------------------
 
 .. index:: abrupt connection release
 
-When we discussed the connection-oriented service, we mentioned that there are two types of connection releases : `abrupt release` and `graceful release`. 
+When we discussed the connection-oriented service, we mentioned that there are two types of connection releases : `abrupt release` and `graceful release`.
 
 The first solution to release a transport connection is to define a new control segment (e.g. the `DR` segment) and consider the connection to be released once this segment has been sent or received. This is illustrated in the figure below.
 
 .. figure:: ../../book/transport/png/transport-fig-053-c.png
    :align: center
-   :scale: 70 
+   :scale: 70
 
    Abrupt connection release
 
-As the entity that sends the `DR` segment cannot know whether the other entity has already sent all its data on the connection, SDUs can be lost during such an `abrupt connection release`. 
+As the entity that sends the `DR` segment cannot know whether the other entity has already sent all its data on the connection, SDUs can be lost during such an `abrupt connection release`.
 
 .. index:: graceful connection release
 
@@ -577,7 +577,7 @@ The second method to release a transport connection is to release independently
 
 .. figure:: ../../book/transport/png/transport-fig-054-c.png
    :align: center
-   :scale: 70 
+   :scale: 70
 
    Graceful connection release
 
@@ -599,5 +599,3 @@
 
 ..
include:: /links.rst - - From 6d43b51d7b55b2d9306e40050f05481250e27210 Mon Sep 17 00:00:00 2001 From: magalii Date: Tue, 29 Jan 2019 12:57:30 +0100 Subject: [PATCH 2/3] Corrections of misprints, language and ponctuation in Glossary + part Protocols --- book-2nd/glossary.rst | 3 - book-2nd/protocols/bgp.rst | 164 +++++++++++------------ book-2nd/protocols/congestion.rst | 60 ++++----- book-2nd/protocols/dns.rst | 20 +-- book-2nd/protocols/dnssec.rst | 112 ++++++++-------- book-2nd/protocols/ethernet.rst | 85 ++++++------ book-2nd/protocols/http.rst | 90 ++++++------- book-2nd/protocols/ipv6b.rst | 34 ++--- book-2nd/protocols/ppp.rst | 10 +- book-2nd/protocols/routing.rst | 40 +++--- book-2nd/protocols/rpc.rst | 22 +-- book-2nd/protocols/sctp.rst | 28 ++-- book-2nd/protocols/ssh.rst | 92 ++++++------- book-2nd/protocols/tcp.rst | 158 +++++++++++----------- book-2nd/protocols/transport-service.rst | 25 ++-- 15 files changed, 470 insertions(+), 473 deletions(-) diff --git a/book-2nd/glossary.rst b/book-2nd/glossary.rst index b6dd107..7cb48eb 100644 --- a/book-2nd/glossary.rst +++ b/book-2nd/glossary.rst @@ -192,9 +192,6 @@ Glossary POP The Post Office Protocol is defined in :rfc:`1939` - IMAP - The Internet Message Access Protocol is defined in :rfc:`3501` - FTP The File Transfer Protocol is defined in :rfc:`959` diff --git a/book-2nd/protocols/bgp.rst b/book-2nd/protocols/bgp.rst index 77023ab..de570d8 100644 --- a/book-2nd/protocols/bgp.rst +++ b/book-2nd/protocols/bgp.rst @@ -1,14 +1,14 @@ .. Copyright |copy| 2010 by Olivier Bonaventure .. This file is licensed under a `creative commons licence `_ -.. warning:: +.. warning:: This is an unpolished draft of the second edition of this ebook. If you find any error or have suggestions to improve the text, please create an issue via https://github.com/obonaventure/cnp3/issues/new Interdomain routing =================== -As explained earlier, the Internet is composed of more than 45,000 different networks [#fasnum]_ called `domains`. Each domain is composed of a group of routers and hosts that are managed by the same organisation. Example domains include belnet_, sprint_, level3_, geant_, abilene_, cisco_ or google_ ... +As explained earlier, the Internet is composed of more than 45,000 different networks [#fasnum]_ called `domains`. Each domain is composed of a group of routers and hosts that are managed by the same organisation. Example domains include belnet_, sprint_, level3_, geant_, abilene_, cisco_ or google_ ... .. index:: stub domain, transit domain @@ -17,79 +17,79 @@ Each domain contains a set of routers. From a routing point of view, these domai .. figure:: /../book/network/png/network-fig-089-c.png :align: center :scale: 70 - - Transit and stub domains -The stub domains can be further classified by considering whether they mainly send or receive packets. An `access-rich` stub domain is a domain that contains hosts that mainly receive packets. Typical examples include small ADSL- or cable modem-based Internet Service Providers or enterprise networks. On the other hand, a `content-rich` stub domain is a domain that mainly produces packets. Examples of `content-rich` stub domains include google_, yahoo_, microsoft_, facebook_ or content distribution networks such as akamai_ or limelight_ For the last few years, we have seen a rapid growth of these `content-rich` stub domains. 
Recent measurements [ATLAS2009]_ indicate that a growing fraction of all the packets exchanged on the Internet are produced in the data centers managed by these content providers. + Transit and stub domains -Domains need to be interconnected to allow a host inside a domain to exchange IP packets with hosts located in other domains. From a physical perspective, domains can be interconnected in two different ways. The first solution is to directly connect a router belonging to the first domain with a router inside the second domain. Such links between domains are called private interdomain links or `private peering links`. In practice, for redundancy or performance reasons, distinct physical links are usually established between different routers in the two domains that are interconnected. +The stub domains can be further classified by considering whether they mainly send or receive packets. An `access-rich` stub domain is a domain that contains hosts that mainly receive packets. Typical examples include small ADSL- or cable modem-based Internet Service Providers or enterprise networks. On the other hand, a `content-rich` stub domain is a domain that mainly produces packets. Examples of `content-rich` stub domains include google_, yahoo_, microsoft_, facebook_ or content distribution networks such as akamai_ or limelight_ . For the last few years, we have seen a rapid growth of these `content-rich` stub domains. Recent measurements [ATLAS2009]_ indicate that a growing fraction of all the packets exchanged on the Internet are produced in the data centers managed by these content providers. + +Domains need to be interconnected to allow a host inside a domain to exchange IP packets with hosts located in other domains. From a physical perspective, domains can be interconnected in two different ways. The first solution is to directly connect a router belonging to the first domain with a router inside the second domain. Such links between domains are called private interdomain links or `private peering links`. In practice, for redundancy or performance reasons, distinct physical links are usually established between different routers in the two domains that are interconnected. .. figure:: /../book/network/png/network-fig-104-c.png :align: center :scale: 70 - - Interconnection of two domains via a private peering link -Such `private peering links` are useful when, for example, an enterprise or university network needs to be connected to its Internet Service Provider. However, some domains are connected to hundreds of other domains [#fasrank]_ . For some of these domains, using only private peering links would be too costly. A better solution to allow many domains to interconnect cheaply are the `Internet eXchange Points` (:term:`IXP`). An :term:`IXP` is usually some space in a data center that hosts routers belonging to different domains. A domain willing to exchange packets with other domains present at the :term:`IXP` installs one of its routers on the :term:`IXP` and connects it to other routers inside its own network. The IXP contains a Local Area Network to which all the participating routers are connected. When two domains that are present at the IXP wish [#fwish]_ to exchange packets, they simply use the Local Area Network. IXPs are very popular in Europe and many Internet Service Providers and Content providers are present in these IXPs. 
+ Interconnection of two domains via a private peering link + +Such `private peering links` are useful when, for example, an enterprise or university network needs to be connected to its Internet Service Provider. However, some domains are connected to hundreds of other domains [#fasrank]_ . For some of these domains, using only private peering links would be too costly. A better solution to allow many domains to interconnect cheaply are the `Internet eXchange Points` (:term:`IXP`). An :term:`IXP` is usually some space in a data center that hosts routers belonging to different domains. A domain willing to exchange packets with other domains present at the :term:`IXP` installs one of its routers on the :term:`IXP` and connects it to other routers inside its own network. The IXP contains a Local Area Network to which all the participating routers are connected. When two domains that are present at the IXP wish [#fwish]_ to exchange packets, they simply use the Local Area Network. IXPs are very popular in Europe and many Internet Service Providers and Content providers are present in these IXPs. .. figure:: /../book/network/png/network-fig-103-c.png :align: center :scale: 70 - + Interconnection of two domains at an Internet eXchange Point In the early days of the Internet, domains would simply exchange all the routes they know to allow a host inside one domain to reach any host in the global Internet. However, in today's highly commercial Internet, this is no longer true as interdomain routing mainly needs to take into account the economical relationships between the domains. Furthermore, while intradomain routing usually prefers some routes over others based on their technical merits (e.g. prefer route with the minimum number of hops, prefer route with the minimum delay, prefer high bandwidth routes over low bandwidth ones, etc) interdomain routing mainly deals with economical issues. For interdomain routing, the cost of using a route is often more important than the quality of the route measured by its delay or bandwidth. -There are different types of economical relationships that can exist between domains. Interdomain routing converts these relationships into peering relationships between domains that are connected via peering links. +There are different types of economical relationships that can exist between domains. Interdomain routing converts these relationships into peering relationships between domains that are connected via peering links. .. index:: customer-provider peering relationship -The first category of peering relationship is the `customer->provider` relationship. Such a relationship is used when a customer domain pays an Internet Service Provider to be able to exchange packets with the global Internet over an interdomain link. A similar relationship is used when a small Internet Service Provider pays a larger Internet Service Provider to exchange packets with the global Internet. +The first category of peering relationship is the `customer->provider` relationship. Such a relationship is used when a customer domain pays an Internet Service Provider to be able to exchange packets with the global Internet over an interdomain link. A similar relationship is used when a small Internet Service Provider pays a larger Internet Service Provider to exchange packets with the global Internet. .. 
figure:: /../book/network/png/network-fig-106-c.png :align: center :scale: 70 - + A simple Internet with peering relationships -To understand the `customer->provider` relationship, let us consider the simple internetwork shown in the figure above. In this internetwork, `AS7` is a stub domain that is connected to one provider : `AS4`. The contract between `AS4` and `AS7` allows a host inside `AS7` to exchange packets with any host in the internetwork. To enable this exchange of packets, `AS7` must know a route towards any domain and all the domains of the internetwork must know a route via `AS4` that allows them to reach hosts inside `AS7`. From a routing perspective, the commercial contract between `AS7` and `AS4` leads to the following routes being exchanged : +To understand the `customer->provider` relationship, let us consider the simple internetwork shown in the figure above. In this internetwork, `AS7` is a stub domain that is connected to one provider : `AS4`. The contract between `AS4` and `AS7` allows a host inside `AS7` to exchange packets with any host in the internetwork. To enable this exchange of packets, `AS7` must know a route towards any domain and all the domains of the internetwork must know a route via `AS4` that allows them to reach hosts inside `AS7`. From a routing perspective, the commercial contract between `AS7` and `AS4` leads to the following routes being exchanged : - - over a `customer->provider` relationship, the `customer` domain advertises to its `provider` all its routes and all the routes that it has learned from its own customers. - - over a `provider->customer` relationship, the `provider` advertises all the routes that it knows to its `customer`. + - over a `customer->provider` relationship, the `customer` domain advertises to its `provider` all its routes and all the routes that it has learned from its own customers. + - over a `provider->customer` relationship, the `provider` advertises all the routes that it knows to its `customer`. The second rule ensures that the customer domain receives a route towards all destinations that are reachable via its provider. The first rule allows the routes of the customer domain to be distributed throughout the Internet. -Coming back to the figure above, `AS4` advertises to its two providers `AS1` and `AS2` its own routes and the routes learned from its customer, `AS7`. On the other hand, `AS4` advertises to `AS7` all the routes that it knows. +Coming back to the figure above, `AS4` advertises to its two providers `AS1` and `AS2` its own routes and the routes learned from its customer, `AS7`. On the other hand, `AS4` advertises to `AS7` all the routes that it knows. .. index:: shared-cost peering relationship -The second type of peering relationship is the `shared-cost` peering relationship. Such a relationship usually does not involve a payment from one domain to the other in contrast with the `customer->provider` relationship. A `shared-cost` peering relationship is usually established between domains having a similar size and geographic coverage. For example, consider the figure above. If `AS3` and `AS4` exchange many packets via `AS1`, they both need to pay `AS1`. A cheaper alternative for `AS3` and `AS4` would be to establish a `shared-cost` peering. Such a peering can be established at IXPs where both `AS3` and `AS4` are present or by using private peering links. This `shared-cost` peering should be used to exchange packets between hosts inside `AS3` and hosts inside `AS4`. 
However, `AS3` does not want to receive on the `AS3-AS4` `shared-cost` peering links packets whose destination belongs to `AS1` as `AS3` would have to pay to send these packets to `AS1`.
+The second type of peering relationship is the `shared-cost` peering relationship. Such a relationship usually does not involve a payment from one domain to the other in contrast with the `customer->provider` relationship. A `shared-cost` peering relationship is usually established between domains having a similar size and geographic coverage. For example, consider the figure above. If `AS3` and `AS4` exchange many packets via `AS1`, they both need to pay `AS1`. A cheaper alternative for `AS3` and `AS4` would be to establish a `shared-cost` peering. Such a peering can be established at IXPs where both `AS3` and `AS4` are present or by using private peering links. This `shared-cost` peering should be used to exchange packets between hosts inside `AS3` and hosts inside `AS4`. However, `AS3` does not want to receive on the `AS3-AS4` `shared-cost` peering links packets whose destination belongs to `AS1` as `AS3` would have to pay to send these packets to `AS1`.
 
-From a routing perspective, over a `shared-cost` peering relationship a domain only advertises its internal routes and the routes that it has learned from its customers. This restriction ensures that only packets destined to the local domain or one of its customers is received over the `shared-cost` peering relationship. This implies that the routes that have been learned from a provider or from another `shared-cost` peer is not advertised over a `shared-cost` peering relationship. This is motivated by economical reasons. If a domain were to advertise the routes that it learned from a provider over a `shared-cost` peering relationship that does not bring revenue, it would have allowed its `shared-cost` peer to use the link with its provider without any payment. If a domain were to advertise the routes it learned over a `shared cost` peering over another `shared-cost` peering relationship, it would have allowed these `shared-cost` peers to use its own network (which may span one or more continents) freely to exchange packets.
+From a routing perspective, over a `shared-cost` peering relationship a domain only advertises its internal routes and the routes that it has learned from its customers. This restriction ensures that only packets destined to the local domain or one of its customers are received over the `shared-cost` peering relationship. This implies that the routes that have been learned from a provider or from another `shared-cost` peer are not advertised over a `shared-cost` peering relationship. This is motivated by economical reasons. If a domain were to advertise the routes that it learned from a provider over a `shared-cost` peering relationship that does not bring revenue, it would have allowed its `shared-cost` peer to use the link with its provider without any payment. If a domain were to advertise the routes it learned over a `shared cost` peering over another `shared-cost` peering relationship, it would have allowed these `shared-cost` peers to use its own network (which may span one or more continents) freely to exchange packets.
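+These export rules follow directly from the economic relationships and can be summarised in a few lines of pseudo-code. The sketch below is only an illustration of this logic (the route and relationship encodings are invented for this example, they are not part of any routing protocol) :
+
+.. code-block:: python
+
+   def export_allowed(peer_relationship, route_learned_from):
+       # internal routes and routes learned from customers are advertised
+       # to everybody ; routes learned from providers or shared-cost peers
+       # are only advertised to customers
+       if route_learned_from in ('internal', 'customer'):
+           return True
+       return peer_relationship == 'customer'
+
+   print(export_allowed('shared-cost', 'customer'))  # True
+   print(export_allowed('shared-cost', 'provider'))  # False : no free transit
+   print(export_allowed('customer', 'provider'))     # True : customers get all routes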
 .. index:: sibling peering relationship
 
-Finally, the last type of peering relationship is the `sibling`. Such a relationship is used when two domains exchange all their routes in both directions. In practice, such a relationship is only used between domains that belong to the same company. 
+Finally, the last type of peering relationship is the `sibling`. Such a relationship is used when two domains exchange all their routes in both directions. In practice, such a relationship is only used between domains that belong to the same company.
 
 .. index:: interdomain routing policy
 
-These different types of relationships are implemented in the `interdomain routing policies` defined by each domain. The `interdomain routing policy` of a domain is composed of three main parts : 
+These different types of relationships are implemented in the `interdomain routing policies` defined by each domain. The `interdomain routing policy` of a domain is composed of three main parts :
 
-  - the `import filter` that specifies, for each peering relationship, the routes that can be accepted from the neighbouring domain (the non-acceptable routes are ignored and the domain never uses them to forward packets) 
-  - the `export filter` that specifies, for each peering relationship, the routes that can be advertised to the neighbouring domain 
-  - the `ranking` algorithm that is used to select the best route among all the routes that the domain has received towards the same destination prefix 
+  - the `import filter` that specifies, for each peering relationship, the routes that can be accepted from the neighbouring domain (the non-acceptable routes are ignored and the domain never uses them to forward packets)
+  - the `export filter` that specifies, for each peering relationship, the routes that can be advertised to the neighbouring domain
+  - the `ranking` algorithm that is used to select the best route among all the routes that the domain has received towards the same destination prefix
 
 .. index:: import policy, export policy
 
-A domain's import and export filters can be defined by using the Route Policy Specification Language (RPSL) specified in :rfc:`2622` [GAVE1999]_ . Some Internet Service Providers, notably in Europe, use RPSL to document [#fripedb]_ their import and export policies. Several tools help to easily convert a RPSL policy into router commands. 
+A domain's import and export filters can be defined by using the Route Policy Specification Language (RPSL) specified in :rfc:`2622` [GAVE1999]_ . Some Internet Service Providers, notably in Europe, use RPSL to document [#fripedb]_ their import and export policies. Several tools help to easily convert an RPSL policy into router commands.
 
 The figure below provides a simple example of import and export filters for two domains in a simple internetwork. In RPSL, the keyword `ANY` is used to replace any route from any domain. It is typically used by a provider to indicate that it announces all its routes to a customer over a `provider->customer` relationship. This is the case for `AS4`'s export policy.
The example below clearly shows the difference between a `provider->customer` and a `shared-cost` peering relationship. `AS4`'s export filter indicates that it announces only its internal routes (`AS4`) and the routes learned from its clients (`AS7`) over its `shared-cost` peering with `AS3`, while it advertises all the routes that it uses (including the routes learned from `AS3`) to `AS7`. .. figure:: /../book/network/png/network-fig-109-c.png :align: center :scale: 70 - - Import and export policies + + Import and export policies .. index:: BGP, Border Gateway Protocol @@ -103,8 +103,8 @@ The figure below shows a simple example of the BGP routes that are exchanged bet .. figure:: /protocols/figures/bgp-example.* :align: center :scale: 70 - - Simple exchange of BGP routes + + Simple exchange of BGP routes .. index:: BGP peer @@ -113,10 +113,10 @@ BGP routers exchange routes over BGP sessions. A BGP session is established betw .. figure:: /../book/network/svg/bgp-peering.png :align: center :scale: 70 - + A BGP peering session between two directly connected routers -In practice, to establish a BGP session between routers `R1` and `R2` in the figure above, the network administrator of `AS3` must first configure on `R1` the IP address of `R2` on the `R1-R2` link and the AS number of `R2`. Router `R1` then regularly tries to establish the BGP session with `R2`. `R2` only agrees to establish the BGP session with `R1` once it has been configured with the IP address of `R1` and its AS number. For security reasons, a router never establishes a BGP session that has not been manually configured on the router. +In practice, to establish a BGP session between routers `R1` and `R2` in the figure above, the network administrator of `AS3` must first configure on `R1` the IP address of `R2` on the `R1-R2` link and the AS number of `R2`. Router `R1` then regularly tries to establish the BGP session with `R2`. `R2` only agrees to establish the BGP session with `R1` once it has been configured with the IP address of `R1` and its AS number. For security reasons, a router never establishes a BGP session that has not been manually configured on the router. .. index:: BGP OPEN, BGP NOTIFICATION, BGP KEEPALIVE, BGP UPDATE @@ -125,7 +125,7 @@ The BGP protocol :rfc:`4271` defines several types of messages that can be excha - `OPEN` : this message is sent as soon as the TCP connection between the two routers has been established. It initialises the BGP session and allows the negotiation of some options. Details about this message may be found in :rfc:`4271` - `NOTIFICATION` : this message is used to terminate a BGP session, usually because an error has been detected by the BGP peer. A router that sends or receives a `NOTIFICATION` message immediately shutdowns the corresponding BGP session. - `UPDATE`: this message is used to advertise new or modified routes or to withdraw previously advertised routes. - - `KEEPALIVE` : this message is used to ensure a regular exchange of messages on the BGP session, even when no route changes. When a BGP router has not sent an `UPDATE` message during the last 30 seconds, it shall send a `KEEPALIVE` message to confirm to the other peer that it is still up. If a peer does not receive any BGP message during a period of 90 seconds [#fdefaultkeepalive]_, the BGP session is considered to be down and all the routes learned over this session are withdrawn. + - `KEEPALIVE` : this message is used to ensure a regular exchange of messages on the BGP session, even when no route changes. 
When a BGP router has not sent an `UPDATE` message during the last 30 seconds, it shall send a `KEEPALIVE` message to confirm to the other peer that it is still up. If a peer does not receive any BGP message during a period of 90 seconds [#fdefaultkeepalive]_, the BGP session is considered to be down and all the routes learned over this session are withdrawn. As explained earlier, BGP relies on incremental updates. This implies that when a BGP session starts, each router first sends BGP `UPDATE` messages to advertise to the other peer all the exportable routes that it knows. Once all these routes have been advertised, the BGP router only sends BGP `UPDATE` messages about a prefix if the route is new, one of its attributes has changed or the route became unreachable and must be withdrawn. The BGP `UPDATE` message allows BGP routers to efficiently exchange such information while minimising the number of bytes exchanged. Each `UPDATE` message contains : @@ -135,8 +135,8 @@ As explained earlier, BGP relies on incremental updates. This implies that when In the remainder of this chapter, and although all routing information is exchanged using BGP `UPDATE` messages, we assume for simplicity that a BGP message contains only information about one prefix and we use the words : - - `Withdraw message` to indicate a BGP `UPDATE` message containing one route that is withdrawn - - `Update message` to indicate a BGP `UPDATE` containing a new or updated route towards one destination prefix with its attributes + - `Withdraw message` to indicate a BGP `UPDATE` message containing one route that is withdrawn + - `Update message` to indicate a BGP `UPDATE` containing a new or updated route towards one destination prefix with its attributes .. index:: BGP Adj-RIB-In, BGP Adj-RIB-Out, BGP RIB @@ -148,10 +148,10 @@ From a conceptual point of view, a BGP router connected to `N` BGP peers, can be .. figure:: /../book/network/png/network-fig-113-c.png :align: center :scale: 70 - - Organisation of a BGP router -In this figure, the router receives BGP messages on the left part of the figure, processes these messages and possibly sends BGP messages on the right part of the figure. A BGP router contains three important data structures : + Organisation of a BGP router + +In this figure, the router receives BGP messages on the left part of the figure, processes these messages and possibly sends BGP messages on the right part of the figure. A BGP router contains four important data structures : - the `Adj-RIB-In` contains the BGP routes that have been received from each BGP peer. The routes in the `Adj-RIB-In` are filtered by the `import filter` before being placed in the `BGP-Loc-RIB`. There is one `import filter` per BGP peer. - the `Local Routing Information Base` (`Loc-RIB`) contains all the routes that are considered as acceptable by the router. The `Loc-RIB` may contain several routes, learned from different BGP peers, towards the same destination prefix. 
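+These data structures could be declared as in the sketch below. It follows the style of the pseudo-code used in this section and is not the layout of a real BGP implementation (the ranking is reduced to the AS-Path length, real routers apply their full routing policy first) :
+
+.. code-block:: python
+
+   class BGPRouter:
+       def __init__(self, peers):
+           # one Adj-RIB-In and one Adj-RIB-Out per peer, indexed by prefix
+           self.adj_rib_in = {peer: {} for peer in peers}
+           self.adj_rib_out = {peer: {} for peer in peers}
+           self.loc_rib = {}   # best acceptable route per prefix
+
+       def best_route(self, prefix):
+           routes = [rib[prefix] for rib in self.adj_rib_in.values()
+                     if prefix in rib]
+           return min(routes, key=lambda r: len(r['aspath']), default=None)
+
+   router = BGPRouter(peers=['AS3', 'AS4'])
+   router.adj_rib_in['AS3']['2001:db8:1234::/48'] = {'aspath': ['AS3', 'AS1']}
+   router.adj_rib_in['AS4']['2001:db8:1234::/48'] = {'aspath': ['AS4']}
+   print(router.best_route('2001:db8:1234::/48'))    # {'aspath': ['AS4']}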
@@ -165,7 +165,7 @@ When a BGP session starts, the routers first exchange `OPEN` messages to negotia def initialize_BGP_session( RemoteAS, RemoteIP): # Initialize and start BGP session # Send BGP OPEN Message to RemoteIP on port 179 - # Follow BGP state machine + # Follow BGP state machine # advertise local routes and routes learned from peers*/ for d in BGPLocRIB : B=build_BGP_Update(d) @@ -185,10 +185,10 @@ In the above pseudo-code, the `build\_BGP\_UPDATE(d)` procedure extracts from th # check if RemoteAS already received route if RemoteAS is BGPMsg.ASPath : BGPMsg=None - # Many additional export policies can be configured : - # Accept or refuse the BGPMsg - # Modify selected attributes inside BGPMsg - return BGPMsg + # Many additional export policies can be configured : + # Accept or refuse the BGPMsg + # Modify selected attributes inside BGPMsg + return BGPMsg At this point, the remote router has received all the exportable BGP routes. After this initial exchange, the router only sends `BGP UPDATE` messages when there is a change (addition of a route, removal of a route or change in the attributes of a route) in one of these exportable routes. Such a change can happen when the router receives a BGP message. The pseudo-code below summarizes the processing of these BGP messages. @@ -196,33 +196,33 @@ At this point, the remote router has received all the exportable BGP routes. Aft def Recvd_BGPMsg(Msg, RemoteAS) : B=apply_import_filter(Msg,RemoteAS) - if (B== None): # Msg not acceptable + if (B== None): # Msg not acceptable return if IsUPDATE(Msg): - Old_Route=BestRoute(Msg.prefix) + Old_Route=BestRoute(Msg.prefix) Insert_in_RIB(Msg) - Run_Decision_Process(RIB) + Run_Decision_Process(RIB) if (BestRoute(Msg.prefix) != Old_Route) : - # best route changed + # best route changed B=build_BGP_Message(Msg.prefix); S=apply_export_filter(RemoteAS,B); - if (S!=None) : # announce best route - send_UPDATE(S,RemoteAS,RemoteIP); + if (S!=None) : # announce best route + send_UPDATE(S,RemoteAS,RemoteIP); else if (Old_Route != None) : - send_WITHDRAW(Msg.prefix,RemoteAS, RemoteIP) + send_WITHDRAW(Msg.prefix,RemoteAS, RemoteIP) else : # Msg is WITHDRAW - Old_Route=BestRoute(Msg.prefix) + Old_Route=BestRoute(Msg.prefix) Remove_from_RIB(Msg) Run_Decision_Process(RIB) if (Best_Route(Msg.prefix) !=Old_Route): - # best route changed + # best route changed B=build_BGP_Message(Msg.prefix) S=apply_export_filter(RemoteAS,B) if (S != None) : # still one best route towards Msg.prefix send_UPDATE(S,RemoteAS, RemoteIP); - else if(Old_Route != None) : # No best route anymore + else if(Old_Route != None) : # No best route anymore send_WITHDRAW(Msg.prefix,RemoteAS,RemoteIP); - + When a BGP message is received, the router first applies the peer's `import filter` to verify whether the message is acceptable or not. If the message is not acceptable, the processing stops. The pseudo-code below shows a simple `import filter`. This `import filter` accepts all routes, except those that already contain the local AS in their AS-Path. If such a route was used, it would cause a routing loop. Another example of an `import filter` would be a filter used by an Internet Service Provider on a session with a customer to only accept routes towards the IP prefixes assigned to the customer by the provider. On real routers, `import filters` can be much more complex and some `import filters` modify the attributes of the received BGP `UPDATE` [WMS2004]_ . .. 
 .. http://www.team-cymru.org/Services/Bogons/

@@ -250,7 +250,7 @@ Let us now discuss in more detail the operation of BGP in an IPv6 network. For t

 .. figure:: /protocols/figures/bgp-nexthop.png
    :align: center
    :scale: 70
-
+
    Utilisation of the BGP nexthop attribute

 .. todo:: ipv6

@@ -264,7 +264,7 @@ Let us assume that the `R1-R2` BGP session is the first to be established. A `BG

  - the advertised prefix
  - the `BGP nexthop`
- - the attributes including the AS-Path
+ - the attributes including the AS-Path

 We use the notation `U(prefix, nexthop, attributes)` to represent such a `BGP Update` message in this section. Similarly, `W(prefix)` represents a `BGP withdraw` for the specified prefix. Once the `R1-R2` session has been established, `R1` sends `U(2001:db8:1234::/48,2001:db8::5,AS10)` to `R2` and `R2` sends `U(2001:db8:5678::/48,2001:db8::6,AS20)`. At this point, `R1` can reach `2001:db8:5678::/48` via `2001:db8::6` and `R2` can reach `2001:db8:1234::/48` via `2001:db8::5`.

@@ -289,7 +289,7 @@ If the link between `R2` and `R3` fails, `R3` detects the failure as it did not

 The BGP decision process
 ........................

-Besides the import and export filters, a key difference between BGP and the intradomain routing protocols is that each domain can define is own ranking algorithm to determine which route is chosen to forward packets when several routes have been learned towards the same prefix. This ranking depends on several BGP attributes that can be attached to a BGP route.
+Besides the import and export filters, a key difference between BGP and the intradomain routing protocols is that each domain can define its own ranking algorithm to determine which route is chosen to forward packets when several routes have been learned towards the same prefix. This ranking depends on several BGP attributes that can be attached to a BGP route.

 .. index:: BGP local-preference

@@ -298,16 +298,16 @@ The first BGP attribute that is used to rank BGP routes is the `local-preference

 When comparing routes towards the same destination prefix, a BGP router always prefers the routes with the highest `local-pref`. If the BGP router knows several routes with the same `local-pref`, it prefers, among the routes having this `local-pref`, the ones with the shortest AS-Path.
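The sketch below illustrates this two-step comparison in the Python pseudo-code style used earlier. It only ranks routes on `local-pref` and AS-Path length; a real BGP decision process applies several additional tie-breaking rules that are ignored here.

.. code-block:: python

    # A simplified sketch of the BGP ranking step : prefer the highest
    # local-pref, then the shortest AS-Path. The route representation
    # is illustrative.
    def better_route(a, b):
        if a["local_pref"] != b["local_pref"]:
            return a if a["local_pref"] > b["local_pref"] else b
        # same local-pref : prefer the route with the shortest AS-Path
        return a if len(a["as_path"]) < len(b["as_path"]) else b

    r1 = {"local_pref": 100, "as_path": ["AS4", "AS6", "AS7", "AS5"]}
    r2 = {"local_pref": 100, "as_path": ["AS2", "AS5"]}
    print(better_route(r1, r2))  # r2 : same local-pref, shorter AS-Path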
-The `local-pref` attribute is often used to prefer some routes over others.
+The `local-pref` attribute is often used to prefer some routes over others.

-.. This attribute is always present inside `BGP Updates` exchanged over `iBGP sessions`, but never present in the messages exchanged over `eBGP sessions`.
+.. This attribute is always present inside `BGP Updates` exchanged over `iBGP sessions`, but never present in the messages exchanged over `eBGP sessions`.

 A common utilisation of `local-pref` is to support backup links. Consider the situation depicted in the figure below. `AS1` would always like to use the high bandwidth link to send and receive packets via `AS2` and only use the backup link upon failure of the primary one.

 .. figure:: /../book/network/svg/bgp-backup.png
    :align: center
    :scale: 70
-
+
    How to create a backup link with BGP ?

 As BGP routers always prefer the routes with the highest `local-pref` attribute, this policy can be implemented using the following import filter on `R1`

@@ -334,8 +334,8 @@ Sometimes, the `local-pref` attribute is used to prefer a `cheap` link compared

 .. figure:: /../book/network/svg/bgp-prefer.png
    :align: center
    :scale: 70
-
-   How to prefer a cheap link over an more expensive one ?
+
+   How to prefer a cheap link over a more expensive one ?

 `AS1` can install the following import filter on `R1` to ensure that it always sends packets via `R2` when it has learned a route via `AS2` and another via `AS4`.

@@ -359,7 +359,7 @@ With such an import filter, the routers of a domain always prefer to reach desti

 .. figure:: /../book/network/svg/asymetry.png
    :align: center
    :scale: 70
-
+
    Asymmetry of Internet paths

 Consider in this internetwork the routes available inside `AS1` to reach `AS5`. `AS1` learns the `AS4:AS6:AS7:AS5` path from `AS4`, the `AS3:AS8:AS5` path from `AS3` and the `AS2:AS5` path from `AS2`. The first path is chosen since it was learned from a customer. `AS5` on the other hand receives three paths towards `AS1` via its providers. It may select any of these paths to reach `AS1`, depending on how it prefers one provider over the others.

@@ -376,23 +376,23 @@ In the previous sections, we have explained the operation of BGP routers. Compar

 .. figure:: /../book/network/svg/disagree.png
    :align: center
    :scale: 70
-
-   The disagree internetwork
+
+   The disagree internetwork

 In this internetwork, we focus on the route towards `2001:db8:1234::/48` which is advertised by `AS1`. Let us also assume that `AS3` (resp. `AS4`) prefers, e.g. for economic reasons, a route learned from `AS4` (resp. `AS3`) over a route learned from `AS1`. When `AS1` sends `U(2001:db8:1234::/48,AS1)` to `AS3` and `AS4`, three sequences of exchanges of BGP messages are possible :

 #. `AS3` first sends `U(2001:db8:1234::/48,AS3:AS1)` to `AS4`. `AS4` has learned two routes towards `2001:db8:1234::/48`. It runs its BGP decision process, selects the route via `AS3` and does not advertise a route to `AS3`.
 #. `AS4` first sends `U(2001:db8:1234::/48,AS4:AS1)` to `AS3`.
 `AS3` has learned two routes towards `2001:db8:1234::/48`. It runs its BGP decision process, selects the route via `AS4` and does not advertise a route to `AS4`.
-   #. `AS3` sends `U(2001:db8:1234/48,AS3:AS1)` to `AS4` and, at the same time, `AS4` sends `U(2001:db8:1234/48,AS4:AS1)`. `AS3` prefers the route via `AS4` and thus sends `W(2001:db8:1234/48)` to `AS4`. In the mean time, `AS4` prefers the route via `AS3` and thus sends `W(2001:db8:1234/48)` to `AS3`. Upon reception of the `BGP Withdraws`, `AS3` and `AS4` only know the direct route towards `2001:db8:1234/48`. `AS3` (resp. `AS4`) sends `U(2001:db8:1234/48,AS3:AS1)` (resp. `U(2001:db8:1234/48,AS4:AS1)`) to `AS4` (resp. `AS3`). `AS3` and `AS4` could in theory continue to exchange BGP messages for ever. In practice, one of them sends one message faster than the other and BGP converges.
+   #. `AS3` sends `U(2001:db8:1234::/48,AS3:AS1)` to `AS4` and, at the same time, `AS4` sends `U(2001:db8:1234::/48,AS4:AS1)`. `AS3` prefers the route via `AS4` and thus sends `W(2001:db8:1234::/48)` to `AS4`. In the meantime, `AS4` prefers the route via `AS3` and thus sends `W(2001:db8:1234::/48)` to `AS3`. Upon reception of the `BGP Withdraws`, `AS3` and `AS4` only know the direct route towards `2001:db8:1234::/48`. `AS3` (resp. `AS4`) sends `U(2001:db8:1234::/48,AS3:AS1)` (resp. `U(2001:db8:1234::/48,AS4:AS1)`) to `AS4` (resp. `AS3`). `AS3` and `AS4` could in theory continue to exchange BGP messages forever. In practice, one of them sends one message faster than the other and BGP converges.

-The example above has shown that the routes selected by BGP routers may sometimes depend on the ordering of the BGP messages that are exchanged. Other similar scenarios may be found in :rfc:`4264`.
+The example above has shown that the routes selected by BGP routers may sometimes depend on the ordering of the BGP messages that are exchanged. Other similar scenarios may be found in :rfc:`4264`.

 From an operational perspective, the above configuration is annoying since the network operators cannot easily predict which paths are chosen. Unfortunately, there are even more annoying BGP configurations. For example, let us consider the configuration below, which is often named `Bad Gadget` [GW1999]_.

 .. figure:: /../book/network/svg/bad-gadget.png
    :align: center
    :scale: 70
-
+
    The bad gadget internetwork

@@ -403,7 +403,7 @@ In this internetwork, there are four ASes. `AS0` advertises one route towards on

 - `AS4` prefers the path `AS1:AS0` over all other paths

 `AS0` sends `U(p,AS0)` to `AS1`, `AS3` and `AS4`. As this is the only route known by `AS1`, `AS3` and `AS4` towards `p`, they all select the direct path. Let us now consider one possible exchange of BGP messages :
-
+
 #. `AS1` sends `U(p, AS1:AS0)` to `AS3` and `AS4`. `AS4` selects the path via `AS1` since this is its preferred path. `AS3` still uses the direct path.
 #. `AS4` advertises `U(p,AS4:AS1:AS0)` to `AS3`.
 #. `AS3` sends `U(p, AS3:AS0)` to `AS1` and `AS4`. `AS1` selects the path via `AS3` since this is its preferred path. `AS4` still uses the path via `AS1`.

@@ -415,7 +415,7 @@ This example shows that the convergence of BGP is unfortunately not always guara

 Fortunately, there are some operational guidelines [GR2001]_ [GGR2001]_ that can guarantee BGP convergence in the global Internet. To ensure that BGP will converge, these guidelines consider that there are two types of peering relationships : `customer->provider` and `shared-cost`.
 In this case, BGP convergence is guaranteed provided that the following conditions are fulfilled :

- #. The topology composed of all the directed `customer->provider` peering links is an acyclic graph
+ #. The topology composed of all the directed `customer->provider` peering links is an acyclic graph.
 #. An AS always prefers a route received from a `customer` over a route received from a `shared-cost` peer or a `provider`.

@@ -423,7 +423,7 @@ The first guideline implies that the provider of the provider of `ASx` cannot be

 The second guideline also corresponds to economic preferences. Since a provider earns money when sending packets to one of its customers, it makes sense to prefer such customer learned routes over routes learned from providers. [GR2001]_ also shows that BGP convergence is guaranteed even if an AS associates the same preference to routes learned from a `shared-cost` peer and routes learned from a customer.
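The first guideline can be checked with a simple depth-first traversal of the directed `customer->provider` graph, as the sketch below illustrates. The topology used in the example is invented.

.. code-block:: python

    # A small sketch checking the first guideline : the directed
    # customer->provider graph must be acyclic.
    def has_cycle(providers):
        visited, in_progress = set(), set()
        def visit(node):
            if node in in_progress:
                return True          # back edge : a cycle exists
            if node in visited:
                return False
            in_progress.add(node)
            if any(visit(p) for p in providers.get(node, [])):
                return True
            in_progress.remove(node)
            visited.add(node)
            return False
        return any(visit(n) for n in list(providers))

    # AS1 and AS2 are customers of AS3 ; AS3 is a customer of AS4
    providers = {"AS1": ["AS3"], "AS2": ["AS3"], "AS3": ["AS4"]}
    print(has_cycle(providers))  # False : the guideline is respected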
-From a theoretical perspective, these guidelines should be verified automatically to ensure that BGP will always converge in the global Internet. However, such a verification cannot be performed in practice because this would force all domains to disclose their routing policies (and few are willing to do so) and furthermore the problem is known to be NP-hard [GW1999].
+From a theoretical perspective, these guidelines should be verified automatically to ensure that BGP will always converge in the global Internet. However, such a verification cannot be performed in practice because this would force all domains to disclose their routing policies (and few are willing to do so) and furthermore the problem is known to be NP-hard [GW1999]_.

 In practice, researchers and operators expect that these guidelines are verified [#fgranularity]_ in most domains. Thanks to the large amount of BGP data that has been collected by operators and researchers [#fbgpdata]_, several studies have analysed the AS-level topology of the Internet. [SARK2002]_ is one of the first analyses. More recent studies include [COZ2008]_ and [DKF+2007]_.

@@ -432,19 +432,19 @@ Based on these studies and [ATLAS2009]_, the AS-level Internet topology can be s

 .. figure:: /../book/network/svg/bgp-hierarchy.png
    :align: center
    :scale: 70
-
+
    The layered structure of the global Internet

 .. index:: Tier-1 ISP

-The domains on the Internet can be divided in about four categories according to their role and their position in the AS-level topology.
+The domains on the Internet can be divided into four broad categories according to their role and their position in the AS-level topology.

 - the core of the Internet is composed of a dozen to twenty `Tier-1` ISPs. A `Tier-1` is a domain that has no `provider`. Such an ISP has `shared-cost` peering relationships with all other `Tier-1` ISPs and `provider->customer` relationships with smaller ISPs. Examples of `Tier-1` ISPs include sprint_, level3_ or opentransit_.
 - the `Tier-2` ISPs are national or continental ISPs that are customers of `Tier-1` ISPs. These `Tier-2` ISPs have smaller customers and `shared-cost` peering relationships with other `Tier-2` ISPs. Examples of `Tier-2` ISPs include France Telecom, Proximus, British Telecom, ...
-  - the `Tier-3` networks are either stub domains such as enterprise or campus networks networks and smaller ISPs. They are customers of Tier-1 and Tier-2 ISPs and have sometimes `shared-cost` peering relationships
+  - the `Tier-3` networks are either stub domains such as enterprise or campus networks and smaller ISPs. They are customers of Tier-1 and Tier-2 ISPs and sometimes have `shared-cost` peering relationships.
 - the large content providers that manage large datacenters. These content providers produce a growing fraction of the packets exchanged on the global Internet [ATLAS2009]_. Some of these content providers are customers of Tier-1 or Tier-2 ISPs, but they often try to establish `shared-cost` peering relationships, e.g. at IXPs, with many Tier-1 and Tier-2 ISPs.

-Due to this organisation of the Internet and due to the BGP decision process, most AS-level paths on the Internet have a length of 3-5 AS hops.
+Due to this organisation of the Internet and due to the BGP decision process, most AS-level paths on the Internet have a length of 3-5 AS hops.

 .. no note:: BGP security

 .. [#fripedb] See ftp://ftp.ripe.net/ripe/dbase for the RIPE database that contains the import and export policies of many European ISPs

-.. [#fasdomain] In this text, we consider Autonomous System and domain as synonyms. In practice, a domain may be divided into several Autonomous Systems, but we ignore this detail.
+.. [#fasdomain] In this text, we consider Autonomous System and domain as synonyms. In practice, a domain may be divided into several Autonomous Systems, but we ignore this detail.

-.. [#flifetimebgp] The BGP sessions and the underlying TCP connection are typically established by the routers when they boot based on information found in their configuration. The BGP sessions are rarely released, except if the corresponding peering link fails or one of the endpoints crashes or needs to be rebooted.
+.. [#flifetimebgp] The BGP sessions and the underlying TCP connection are typically established by the routers when they boot, based on information found in their configuration. The BGP sessions are rarely released, except if the corresponding peering link fails or one of the endpoints crashes or needs to be rebooted.

 .. [#fdefaultkeepalive] 90 seconds is the default delay recommended by :rfc:`4271`. However, two BGP peers can negotiate a different timer during the establishment of their BGP session. Using too small an interval to detect BGP session failures is not recommended. BFD [KW2009]_ can be used to replace BGP's KEEPALIVE mechanism if fast detection of interdomain link failures is required.

diff --git a/book-2nd/protocols/congestion.rst b/book-2nd/protocols/congestion.rst
index 45c44ae..4841443 100644
--- a/book-2nd/protocols/congestion.rst
+++ b/book-2nd/protocols/congestion.rst
@@ -25,18 +25,18 @@ A key question that must be answered by any congestion control scheme is how con

 The figure below illustrates the evolution of the congestion window when there is severe congestion. At the beginning of the connection, the sender performs `slow-start` until the first segments are lost and the retransmission timer expires. At this time, the `ssthresh` is set to half of the current congestion window and the congestion window is reset to one segment. The lost segments are retransmitted as the sender again performs slow-start until the congestion window reaches the `ssthresh`. It then switches to congestion avoidance and the congestion window increases linearly until segments are lost and the retransmission timer expires ...

-.. figure:: /../book/transport/png/transport-fig-088-c.png
+.. figure:: /../book/transport/png/transport-fig-088-c.png
    :align: center
-   :scale: 70
+   :scale: 70

    Evolution of the TCP congestion window with severe congestion

 The figure below illustrates the evolution of the congestion window when the network is lightly congested and all lost segments can be retransmitted using fast retransmit. The sender begins with a slow-start. A segment is lost but successfully retransmitted by a fast retransmit. The congestion window is divided by 2 and the sender immediately enters congestion avoidance, as this was a mild congestion.

-.. figure:: /../book/transport/png/transport-fig-094-c.png
+.. figure:: /../book/transport/png/transport-fig-094-c.png
    :align: center
-   :scale: 70
+   :scale: 70

    Evolution of the TCP congestion window when the network is lightly congested

@@ -45,11 +45,11 @@ Most TCP implementations update the congestion window when they receive an ackno

 .. code-block:: python

-    # Initialization
+    # Initialization
     cwnd = MSS # congestion window in bytes
     ssthresh = swin # in bytes
-
-    # Ack arrival
+
+    # Ack arrival
     if tcp.ack > snd.una : # new ack, no congestion
         if cwnd < ssthresh : # slow-start : increase quickly cwnd

@@ -58,23 +58,23 @@ Most TCP implementations update the congestion window when they receive an ackno

         else: # congestion avoidance : increase slowly cwnd
             # increase cwnd by one mss every rtt
-            cwnd = cwnd+ mss*(mss/cwnd)
+            cwnd = cwnd + MSS*(MSS/cwnd)
     else: # duplicate or old ack
         if tcp.ack==snd.una: # duplicate acknowledgement
             dupacks = dupacks + 1
             if dupacks==3:
-                retransmitsegment(snd.una)
+                retransmit_segment(snd.una)
                 ssthresh=max(cwnd/2,2*MSS)
-                cwnd=ssthresh
+                cwnd=ssthresh
         else: # ack for old segment, ignored
             dupacks=0
-
+
     Expiration of the retransmission timer:
         send(snd.una) # retransmit first lost segment
         ssthresh=max(cwnd/2,2*MSS)
         cwnd=MSS
-
-
+
+
 Furthermore, when a TCP connection has been idle for more than its current retransmission timer, it should reset its congestion window to the congestion window size that it uses when the connection begins, as it no longer knows the current congestion state of the network.

 .. note:: Initial congestion window

@@ -94,7 +94,7 @@ As explained earlier, Explicit Congestion Notification :rfc:`3168`, improves the

 The first difficulty in adding Explicit Congestion Notification (ECN) to TCP/IP networks was to modify the format of the network packet and transport segment headers to carry the required information. In the network layer, one bit was required to allow the routers to mark the packets they forward during congestion periods. In the IP network layer, this bit is called the `Congestion Experienced` (`CE`) bit and is part of the packet header. However, using a single bit to mark packets is not sufficient. Consider a simple scenario with two sources, one congested router and one destination. Assume that the first sender and the destination support ECN, but not the second sender. If the router is congested, it will mark packets from both senders. The first sender will react to the packet markings by reducing its transmission rate. However, since the second sender does not support ECN, it will not react to the markings. Furthermore, this sender could continue to increase its transmission rate, which would lead to more packets being marked and the first source would decrease its transmission rate again, ... In the end, the sources that implement ECN are penalized compared to the sources that do not implement it. This unfairness issue is a major hurdle to widely deploying ECN on the public Internet [#fprivate]_.
The solution proposed in :rfc:`3168` to deal with this problem is to use a second bit in the network packet header. This bit, called the `ECN-capable transport` (ECT) bit, indicates whether the packet contains a segment produced by a transport protocol that supports ECN or not. Transport protocols that support ECN set the ECT bit in all packets. When a router is congested, it first verifies whether the ECT bit is set. In this case, the CE bit of the packet is set to indicate congestion. Otherwise, the packet is discarded. This improves the deployability of ECN [#fecnnonce]_.

-The second difficulty is how to allow the receiver to inform the sender of the reception of network packets marked with the `CE` bit. In reliable transport protocols like TCP and SCTP, the acknowledgements can be used to provide this feedback. For TCP, two options were possible : change some bits in the TCP segment header or define a new TCP option to carry this information. The designers of ECN opted for reusing spare bits in the TCP header. More precisely, two TCP flags have been added in the TCP header to support ECN. The `ECN-Echo` (ECE) is set in the acknowledgements when the `CE` was set in packets received on the forward path.
+The second difficulty is how to allow the receiver to inform the sender of the reception of network packets marked with the `CE` bit. In reliable transport protocols like TCP and SCTP, the acknowledgements can be used to provide this feedback. For TCP, two options were possible : change some bits in the TCP segment header or define a new TCP option to carry this information. The designers of ECN opted for reusing spare bits in the TCP header. More precisely, two TCP flags have been added in the TCP header to support ECN. The `ECN-Echo` (ECE) flag is set in the acknowledgements when the `CE` bit was set in packets received on the forward path.

 .. figure:: /protocols/pkt/tcp-enc.png
    :align: center

 The third difficulty is to allow an ECN-capable sender to detect whether the remote host also supports ECN. This is a classical negotiation of extensions to a transport protocol. In TCP, this could have been solved by defining a new TCP option used during the three-way handshake. To avoid wasting space in the TCP options, the designers of ECN opted in :rfc:`3168` for using the `ECN-Echo` and `CWR` bits in the TCP header to perform this negotiation. In the end, the result is the same with fewer bits exchanged. SCTP defines in [STD2013]_ the `ECN Support parameter` which can be included in the ``INIT`` and ``INIT-ACK`` chunks to negotiate the utilization of ECN. The solution adopted for SCTP is cleaner than the solution adopted for TCP.

-Thanks to the `ECT`, `CE` and `ECE` bits, routers can mark packets during congestion and receivers can return the congestion information back to the TCP senders. However, these three bits are not sufficient to allow a server to reliably send the `ECE` bit to a TCP sender.
+Thanks to the `ECT`, `CE` and `ECE` bits, routers can mark packets during congestion and receivers can return the congestion information back to the TCP senders. However, these three bits are not sufficient to allow a server to reliably send the `ECE` bit to a TCP sender.
 TCP acknowledgements are not sent reliably. A TCP acknowledgement always contains the next expected sequence number. Since TCP acknowledgements are cumulative, the loss of one acknowledgement is recovered by the correct reception of a subsequent acknowledgement.

 If TCP acknowledgements are overloaded to carry the `ECE` bit, the situation is different. Consider the example shown in the figure below. A client sends packets to a server through a router. In the example below, the first packet is marked. The server returns an acknowledgement with the `ECE` bit set. Unfortunately, this acknowledgement is lost and never reaches the client. Shortly after, the server sends a data segment that also carries a cumulative acknowledgement. This acknowledgement confirms the reception of the data, but the client never receives the congestion information carried by the `ECE` bit.

@@ -116,18 +116,18 @@ If TCP acknowledgements are overloaded to carry the `ECE` bit, the situation is

    client=>router [ label = "data[seq=1,ECT=1,CE=0]", arcskip="1" ];
    router=>server [ label = "data[seq=1,ECT=1,CE=1]", arcskip="1"];
-   |||;
+   |||;
    server=>router [ label = "ack=2,ECE=1", arcskip="1" ];
    router -x client [label="ack=2,ECE=1", arcskip="1" ];
    |||;
    server=>router [ label = "data[seq=x,ack=2,ECE=0,ECT=1,CE=0]", arcskip="1" ];
    router=>client [ label = "data[seq=x,ack=2,ECE=0,ECT=1,CE=0]", arcskip="1"];
-   |||;
+   |||;
    client->server [linecolour=white];

-To solve this problem, :rfc:`3168` uses an additional bit in the TCP header : the `Congestion Window Reduced` (CWR) bit.
+To solve this problem, :rfc:`3168` uses an additional bit in the TCP header : the `Congestion Window Reduced` (CWR) bit.

 .. msc::

    client [label="client", linecolour=black];
    router [label="router", linecolour=black];
    server [label="server", linecolour=black];
    client=>router [ label = "data[seq=1,ECT=1,CE=0]", arcskip="1" ];
    router=>server [ label = "data[seq=1,ECT=1,CE=1]", arcskip="1"];
-   |||;
+   |||;
    server=>router [ label = "ack=2,ECE=1", arcskip="1" ];
    router -x client [label="ack=2,ECE=1", arcskip="1" ];
    |||;

@@ -148,12 +148,12 @@ To solve this problem, :rfc:`3168` uses an additional bit in the TCP header : th

    |||;
    client->server [linecolour=white];
-
+
 The `CWR` bit of the TCP header provides some form of acknowledgement for the `ECE` bit. When a TCP receiver detects a packet marked with the `CE` bit, it sets the `ECE` bit in all segments that it returns to the sender. Upon reception of an acknowledgement with the `ECE` bit set, the sender reduces its congestion window to reflect a mild congestion and sets the `CWR` bit. This bit remains set as long as the received segments contain the `ECE` bit. A sender should only react once per round-trip-time to marked packets.
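The sketch below gives a toy model of this feedback loop. It ignores sequence numbers, losses and retransmissions; the classes and their methods are illustrative, not an implementation of :rfc:`3168`.

.. code-block:: python

    # A toy model of the ECE/CWR exchange : the receiver echoes the
    # congestion mark until it sees CWR ; the sender reacts at most
    # once to a run of marked acknowledgements.
    class Receiver:
        def __init__(self):
            self.echo_ece = False
        def on_packet(self, ce, cwr):
            if ce:
                self.echo_ece = True    # echo congestion in the next acks
            if cwr:
                self.echo_ece = False   # the sender has reacted

    class Sender:
        def __init__(self, cwnd):
            self.cwnd = cwnd
            self.cwr = False
        def on_ack(self, ece):
            if ece and not self.cwr:
                self.cwnd = max(self.cwnd // 2, 1)  # mild congestion
                self.cwr = True         # set CWR in the next data segment

    s, r = Sender(cwnd=8), Receiver()
    r.on_packet(ce=True, cwr=False)     # a marked packet reaches the receiver
    s.on_ack(ece=r.echo_ece)            # the echoed ECE halves cwnd
    print(s.cwnd)                       # 4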
-SCTP uses a different approach to inform the sender once congestion has been detected. Instead of using one bit to carry the congestion notification from the receiver to the sender, SCTP defines an entire ``ECN Echo`` chunk for this. This chunk contains the lowest ``TSN`` that was received in a packet with the `CE` bit set and the number of marked packets received. The SCTP ``CWR`` chunk allows to acknowledge the reception of an ``ECN Echo`` chunk. It echoes the lowest ``TSN`` placed in the ``ECN Echo`` chunk.
+SCTP uses a different approach to inform the sender once congestion has been detected. Instead of using one bit to carry the congestion notification from the receiver to the sender, SCTP defines an entire ``ECN Echo`` chunk for this. This chunk contains the lowest ``TSN`` that was received in a packet with the `CE` bit set and the number of marked packets received. The SCTP ``CWR`` chunk allows to acknowledge the reception of an ``ECN Echo`` chunk. It echoes the lowest ``TSN`` placed in the ``ECN Echo`` chunk.

 The last point that needs to be discussed about Explicit Congestion Notification is the algorithm that is used by routers to detect congestion. On a router, congestion manifests itself by the number of packets that are stored inside the router buffers. As explained earlier, we need to distinguish between two types of routers :

@@ -163,7 +163,7 @@ The last point that needs to be discussed about Explicit Congestion Notification

 Routers that use a single queue measure their buffer occupancy as the number of bytes of packets stored in the queue [#fslot]_. A first method to detect congestion is to measure the instantaneous buffer occupancy and consider the router to be congested as soon as this occupancy is above a threshold. Typical values of the threshold could be 40% of the total buffer. Measuring the instantaneous buffer occupancy is simple since it only requires one counter. However, this value is fragile from a control viewpoint, since it changes frequently. A better solution is to measure the *average* buffer occupancy and consider the router to be congested when this average occupancy is too high. Random Early Detection (RED) [FJ1993]_ is an algorithm that was designed to support Explicit Congestion Notification. In addition to measuring the average buffer occupancy, it also uses probabilistic marking. When the router is congested, the arriving packets are marked with a probability that increases with the average buffer occupancy. The main advantage of using probabilistic marking instead of marking all arriving packets is that flows will be marked in proportion to the number of packets that they transmit. If the router marks 10% of the arriving packets when congested, then a large flow that sends a hundred packets per second will be marked ten times, while a flow that only sends one packet per second will not be marked. This probabilistic marking allows to mark packets in proportion to their usage of the network resources.

-If the router uses several queues served by a scheduler, the situation is different. If a large and a small flow are competing for bandwidth, the scheduler will already favor the small flow that is not using its fair share of the bandwidth. The queue for the small flow will be almost empty while the queue for the large flow will build up. On routers using such schedulers, a good way of marking the packets is to set a threshold on the occupancy of each queue and mark the packets that arrive in a particular queue as soon as its occupancy is above the configured threshold.
+If the router uses several queues served by a scheduler, the situation is different. If a large and a small flow are competing for bandwidth, the scheduler will already favor the small flow that is not using its fair share of the bandwidth. The queue for the small flow will be almost empty while the queue for the large flow will build up. On routers using such schedulers, a good way of marking the packets is to set a threshold on the occupancy of each queue and mark the packets that arrive in a particular queue as soon as its occupancy is above the configured threshold.
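The RED approach described above can be sketched in a few lines of Python. The weight and the thresholds below are illustrative values, not the ones recommended in [FJ1993]_.

.. code-block:: python

    # A sketch of RED-style probabilistic marking : an exponentially
    # averaged queue occupancy and a marking probability growing with it.
    import random

    W = 0.1                       # weight of the moving average
    MIN_TH, MAX_TH = 0.4, 0.8     # thresholds, as fractions of the buffer
    avg = 0.0

    def mark_on_arrival(occupancy):
        global avg
        avg = (1 - W) * avg + W * occupancy
        if avg < MIN_TH:
            return False          # no congestion : never mark
        if avg >= MAX_TH:
            return True           # heavy congestion : always mark
        p = (avg - MIN_TH) / (MAX_TH - MIN_TH)
        return random.random() < p  # probabilistic marking

    for occ in [0.2, 0.5, 0.9, 0.9, 0.9]:
        print(mark_on_arrival(occ))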
Modeling TCP congestion control
...............................

Thanks to its congestion control scheme, TCP adapts its transmission rate to the losses that occur in the network. Intuitively, the TCP transmission rate decreases when the percentage of losses increases. Researchers have proposed detailed models that allow to predict the throughput of a TCP connection when losses occur [MSMO1997]_. To have some intuition about the factors that influence the performance of TCP, let us consider a very simple model. This model considers a hypothetical TCP connection that suffers from equally spaced segment losses. If :math:`p` is the segment loss ratio, then the TCP connection successfully transfers :math:`\frac{1}{p}-1` segments and the next segment is lost. If we ignore the slow-start at the beginning of the connection, TCP in this environment is always in congestion avoidance as there are only isolated losses that can be recovered by using fast retransmit. The evolution of the congestion window is thus as shown in the figure below. Note that the `x-axis` of this figure represents time, measured in units of one round-trip-time, which is supposed to be constant in the model, and the `y-axis` represents the size of the congestion window, measured in MSS-sized segments.

-.. figure:: /../book/transport/png/transport-fig-089-c.png
+.. figure:: /../book/transport/png/transport-fig-089-c.png
    :align: center
-   :scale: 70
+   :scale: 70

    Evolution of the congestion window with regular losses

@@ -185,7 +185,7 @@ As the losses are equally spaced, the congestion window always starts at some va

 :math:`area=(\frac{W}{2})^2 + \frac{1}{2} \times (\frac{W}{2})^2 = \frac{3 \times W^2}{8}`

 However, given the regular losses that we consider, the number of segments that are sent between two losses (i.e. during a cycle) is by definition equal to :math:`\frac{1}{p}`. Thus, :math:`W=\sqrt{\frac{8}{3 \times p}}=\frac{k}{\sqrt{p}}`. The throughput (in bytes per second) of the TCP connection is equal to the number of segments transmitted divided by the duration of the cycle :
-
+
 :math:`Throughput=\frac{area \times MSS}{time} = \frac{\frac{3 \times W^2}{8} \times MSS}{\frac{W}{2} \times rtt}` or, after having eliminated `W`, :math:`Throughput=\sqrt{\frac{3}{2}} \times \frac{MSS}{rtt \times \sqrt{p}}`

 More detailed models and the analysis of simulations have shown that a first order model of the TCP throughput when losses occur is :math:`Throughput \approx \frac{k \times MSS}{rtt \times \sqrt{p}}`. This is an important result which shows that :

 - TCP connections with a small round-trip-time can achieve a higher throughput than TCP connections having a longer round-trip-time when losses occur. This implies that the TCP congestion control scheme is not completely fair since it favors the connections that have the shorter round-trip-time
- - TCP connections that use a large MSS can achieve a higher throughput that the TCP connections that use a shorter MSS. This creates another source of unfairness between TCP connections. However, it should be noted that today most hosts are using almost the same MSS, roughly 1460 bytes.
+ - TCP connections that use a large MSS can achieve a higher throughput than the TCP connections that use a shorter MSS. This creates another source of unfairness between TCP connections. However, it should be noted that today most hosts are using almost the same MSS, roughly 1460 bytes.

 In general, the maximum throughput that can be achieved by a TCP connection depends on its maximum window size and the round-trip-time if there are no losses. If there are losses, it depends on the MSS, the round-trip-time and the loss ratio.
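A quick numerical application of this first order model, with illustrative values (an MSS of 1460 bytes, a round-trip-time of 100 milliseconds and a 1% segment loss ratio), is shown below.

.. code-block:: python

    # Numerical application of Throughput = sqrt(3/2) * MSS / (rtt * sqrt(p)).
    # The values of MSS, rtt and p are illustrative.
    from math import sqrt

    def tcp_throughput(mss, rtt, p):
        return sqrt(3.0 / 2.0) * mss / (rtt * sqrt(p))

    print(tcp_throughput(mss=1460, rtt=0.1, p=0.01))  # ~178,800 bytes per second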
@@ -204,12 +204,12 @@ In general, the maximum throughput that can be achieved by a TCP connection depe

 The first TCP congestion control scheme was proposed by `Van Jacobson`_ in [Jacobson1988]_. In addition to writing the scientific paper, `Van Jacobson`_ also implemented the slow-start and congestion avoidance schemes in release 4.3 `Tahoe` of the BSD Unix distributed by the University of Berkeley. Later, he improved the congestion control by adding the fast retransmit and the fast recovery mechanisms in the `Reno` release of 4.3 BSD Unix. Since then, many researchers have proposed, simulated and implemented modifications to the TCP congestion control scheme. Some of these modifications are still used today, e.g. :

- - `NewReno` (:rfc:`3782`), which was proposed as an improvement of the fast recovery mechanism in the `Reno` implementation
+ - `NewReno` (:rfc:`3782`), which was proposed as an improvement of the fast recovery mechanism in the `Reno` implementation
 - `TCP Vegas`, which uses changes in the round-trip-time to estimate congestion in order to avoid it [BOP1994]_
 - `CUBIC`, which was designed for high bandwidth links and is the default congestion control scheme in the Linux 2.6.19 kernel [HRX2008]_
- - `Compound TCP`, which was designed for high bandwidth links is the default congestion control scheme in several Microsoft operating systems [STBT2009]_
+ - `Compound TCP`, which was designed for high bandwidth links, is the default congestion control scheme in several Microsoft operating systems [STBT2009]_

- A search of the scientific literature (:rfc:`6077`) will probably reveal more than 100 different variants of the TCP congestion control scheme. Most of them have only been evaluated by simulations. However, the TCP implementation in the recent Linux kernels supports several congestion control schemes and new ones can be easily added. We can expect that new TCP congestion control schemes will always continue to appear.
+ A search of the scientific literature (:rfc:`6077`) will probably reveal more than 100 different variants of the TCP congestion control scheme. Most of them have only been evaluated by simulations. However, the TCP implementation in recent Linux kernels supports several congestion control schemes and new ones can easily be added. We can expect that new TCP congestion control schemes will continue to appear.

 .. rubric:: Footnotes

@@ -220,6 +220,6 @@ In general, the maximum throughput that can be achieved by a TCP connection depe

 .. [#fecnnonce] With the ECT bit, the deployment issue with ECN is solved provided that all sources cooperate. If some sources do not support ECN but still set the ECT bit in the packets that they send, they will have an unfair advantage over the sources that correctly react to packet markings. Several solutions have been proposed to deal with this problem :rfc:`3540`, but they are outside the scope of this book.

-.. [#fslot] The buffers of a router can be implemented as variable or fixed-length slots. If the router uses variable length slots to store the queued packets, then the occupancy is usually measured in bytes. Some routers have use fixed-length slots with each slot large enough to store a maximum-length packet. In this case, the buffer occupancy is measured in packets.
+.. [#fslot] The buffers of a router can be implemented as variable or fixed-length slots. If the router uses variable length slots to store the queued packets, then the occupancy is usually measured in bytes. Some routers use fixed-length slots with each slot large enough to store a maximum-length packet. In this case, the buffer occupancy is measured in packets.
 .. include:: /links.rst

diff --git a/book-2nd/protocols/dns.rst b/book-2nd/protocols/dns.rst
index d3fe44d..ec2f953 100644
--- a/book-2nd/protocols/dns.rst
+++ b/book-2nd/protocols/dns.rst
@@ -21,7 +21,7 @@ The header of DNS messages is composed of 12 bytes and its structure is shown in

    DNS header

-The `ID` (identifier) is a 16-bits random value chosen by the client. When a client sends a question to a DNS server, it remembers the question and its identifier. When a server returns an answer, it returns in the `ID` field the identifier chosen by the client. Thanks to this identifier, the client can match the received answer with the question that it sent.
+The `ID` (identifier) is a 16-bit random value chosen by the client. When a client sends a question to a DNS server, it remembers the question and its identifier. When a server returns an answer, it returns in the `ID` field the identifier chosen by the client. Thanks to this identifier, the client can match the received answer with the question that it sent.
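This bookkeeping can be sketched as follows; the function names and the dictionary are illustrative, not part of any DNS library.

.. code-block:: python

    # A sketch of how a client matches DNS answers to pending questions
    # with the 16-bit ID field.
    import secrets

    pending = {}  # ID -> question name

    def send_query(name):
        qid = secrets.randbelow(1 << 16)  # random 16-bit identifier
        pending[qid] = name
        return qid                        # the ID placed in the query

    def on_answer(qid, data):
        name = pending.pop(qid, None)
        if name is None:
            return None                   # unknown ID : ignore the answer
        return (name, data)

    qid = send_query("www.ietf.org")
    print(on_answer(qid, "2001:1890:123a::1:1e"))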
 .. dns attacks http://www.cs.columbia.edu/~smb/papers/dnshack.ps
 .. http://unixwiz.net/techtips/iguide-kaminsky-dns-vuln.html

@@ -30,12 +30,12 @@ The `ID` (identifier) is a 16-bits random value chosen by the cli

 The `QR` flag is set to `0` in DNS queries and `1` in DNS answers. The `Opcode` is used to specify the type of query. For instance, a :term:`standard query` is when a client sends a `name` and the server returns the corresponding `data` and an update request is when the client sends a `name` and new `data` and the server then updates its database.

-The `AA` bit is set when the server that sent the response has `authority` for the domain name found in the question section. In the original DNS deployments, two types of servers were considered : `authoritative` servers and `non-authoritative` servers. The `authoritative` servers are managed by the system administrators responsible for a given domain. They always store the most recent information about a domain. `Non-authoritative` servers are servers or resolvers that store DNS information about external domains without being managed by the owners of a domain. They may thus provide answers that are out of date. From a security point of view, the `authoritative` bit is not an absolute indication about the validity of an answer. Securing the Domain Name System is a complex problem that was only addressed satisfactorily recently by the utilisation of cryptographic signatures in the DNSSEC extensions to DNS described in :rfc:`4033`. However, these extensions are outside the scope of this chapter.
+The `AA` bit is set when the server that sent the response has `authority` for the domain name found in the question section. In the original DNS deployments, two types of servers were considered : `authoritative` servers and `non-authoritative` servers. The `authoritative` servers are managed by the system administrators responsible for a given domain. They always store the most recent information about a domain. `Non-authoritative` servers are servers or resolvers that store DNS information about external domains without being managed by the owners of a domain. They may thus provide answers that are out of date. From a security point of view, the `authoritative` bit is not an absolute indication about the validity of an answer. Securing the Domain Name System is a complex problem that was only satisfactorily addressed recently, by the utilisation of cryptographic signatures in the DNSSEC extensions to DNS described in :rfc:`4033`. However, these extensions are outside the scope of this chapter.

 The `RD` (recursion desired) bit is set by a client when it sends a query to a resolver. Such a query is said to be `recursive` because the resolver will recurse through the DNS hierarchy to retrieve the answer on behalf of the client. In the past, all resolvers were configured to perform recursive queries on behalf of any Internet host. However, this exposes the resolvers to several security risks. The simplest one is that the resolver could become overloaded by having too many recursive queries to process. As of this writing, most resolvers [#f8888]_ only allow recursive queries from clients belonging to their company or network and discard all other recursive queries. The `RA` bit indicates whether the server supports recursion. The `RCODE` is used to distinguish between different types of errors. See :rfc:`1035` for additional details. The last four fields indicate the size of the `Question`, `Answer`, `Authority` and `Additional` sections of the DNS message.

-The last four sections of the DNS message contain `Resource Records` (RR). All RRs have the same top level format shown in the figure below.
+The last four sections of the DNS message contain `Resource Records` (RR). All RRs have the same top-level format, shown in the figure below.

 .. figure:: /../book/application/pkt/dnsrr.png
    :align: center

    DNS Resource Records

-In a `Resource Record` (`RR`), the `Name` indicates the name of the node to which this resource record pertains. The two bytes `Type` field indicate the type of resource record. The `Class` field was used to support the utilisation of the DNS in other environments than the Internet.
+In a `Resource Record` (`RR`), the `Name` indicates the name of the node to which this resource record pertains. The two-byte `Type` field indicates the type of resource record. The `Class` field was used to support the utilisation of the DNS in other environments than the Internet.

 The `TTL` field indicates the lifetime of the `Resource Record` in seconds. This field is set by the server that returns an answer and indicates for how long a client or a resolver can store the `Resource Record` inside its cache. A long `TTL` indicates a stable `RR`. Some companies use short `TTL` values for mobile hosts and also for popular servers. For example, a web hosting company that wants to spread the load over a pool of a hundred servers can configure its nameservers to return different answers to different clients. If each answer has a small `TTL`, the clients will be forced to send DNS queries regularly. The nameserver will reply to these queries by supplying the address of the least loaded server.

@@ -54,18 +54,18 @@ Several types of DNS RR are used in practice. The `A` type is used to encode the

 .. figure:: /protocols/pkt/dns6-www-ietf-org.png
    :align: center

-   Query for the `AAAA` record of `www.ietf.org`
+   Query for the `AAAA` record of `www.ietf.org`

-This answer contains several pieces of information. First, the name `www.ietf.org` is associated to IP address `2001:1890:123a::1:1e`. Second, the `ietf.org` domain is managed by six different nameservers. Five of these nameservers are reachable via IPv4 and IPv6.
+This answer contains several pieces of information. First, the name `www.ietf.org` is associated with IP address `2001:1890:123a::1:1e`. Second, the `ietf.org` domain is managed by six different nameservers. Five of these nameservers are reachable via IPv4 and IPv6.

-`CNAME` (or canonical names) are used to define aliases. For example `www.example.com` could be a `CNAME` for `pc12.example.com` that is the actual name of the server on which the web server for `www.example.com` runs.
+`CNAME` (or canonical names) are used to define aliases. For example, `www.example.com` could be a `CNAME` for `pc12.example.com`, which is the actual name of the server on which the web server for `www.example.com` runs.

-.. note:: Reverse DNS
+.. note:: Reverse DNS

   The DNS is mainly used to find the address that corresponds to a given name. However, it is sometimes useful to obtain the name that corresponds to an IP address. This is done by using the `PTR` (`pointer`) `RR`. The `RData` part of a `PTR` `RR` contains the name while the `Name` part of the `RR` contains the IP address encoded in the `in-addr.arpa` domain. IPv4 addresses are encoded in the `in-addr.arpa` domain by reversing the four numbers that compose the dotted decimal representation of the address. For example, consider IPv4 address `192.0.2.11`. The hostname associated to this address can be found by requesting the `PTR` `RR` that corresponds to `11.2.0.192.in-addr.arpa`. A similar solution is used to support IPv6 addresses :rfc:`3596`, but it is slightly more complex given the length of the IPv6 addresses. For example, consider IPv6 address `2001:1890:123a::1:1e`. To obtain the name that corresponds to this address, we first need to convert it into a reverse dotted notation : `e.1.0.0.1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.a.3.2.1.0.9.8.1.1.0.0.2`. In this notation, each character between dots corresponds to one nibble, i.e. four bits. The low-order nibble (`e`) appears first and the high-order one (`2`) last. To obtain the name that corresponds to this address, one needs to append the `ip6.arpa` domain name and query for `e.1.0.0.1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.a.3.2.1.0.9.8.1.1.0.0.2.ip6.arpa`. In practice, tools and libraries do the conversion automatically and the user does not need to worry about it.
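The conversion itself is mechanical, as the sketch below shows; the standard `ipaddress` module is only used to expand the address to its full form. Python also exposes this conversion directly through the `reverse_pointer` attribute of address objects.

.. code-block:: python

    # A sketch of the ip6.arpa conversion described in the note above.
    import ipaddress

    def reverse_name(address):
        # exploded form : fixed length, e.g. 2001:1890:123a:0000:...
        full = ipaddress.IPv6Address(address).exploded.replace(":", "")
        # one label per nibble, low-order nibble first
        return ".".join(reversed(full)) + ".ip6.arpa"

    print(reverse_name("2001:1890:123a::1:1e"))
    # e.1.0.0.1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.a.3.2.1.0.9.8.1.1.0.0.2.ip6.arpa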
-An important point to note regarding the Domain Name System is its extensibility. Thanks to the `Type` and `RDLength` fields, the format of the Resource Records can easily be extended. Furthermore, a DNS implementation that receives a new Resource Record that it does not understand can ignore the record while still being able to process the other parts of the message. This allows, for example, a DNS server that only supports IPv6 can
+An important point to note regarding the Domain Name System is its extensibility. Thanks to the `Type` and `RDLength` fields, the format of the Resource Records can easily be extended. Furthermore, a DNS implementation that receives a new Resource Record that it does not understand can ignore the record while still being able to process the other parts of the message. This allows, for example, a DNS server that only supports IPv6 to safely ignore the IPv4 addresses listed in the DNS reply for `www.ietf.org` while still being able to correctly parse the Resource Records that it understands. This extensibility allowed the Domain Name System to evolve over the years while still preserving the backward compatibility with already deployed DNS implementations.

@@ -74,7 +74,7 @@

 .. rubric:: Footnotes

-.. [#f8888] Some DNS resolvers allow any host to send queries. Google operates a `public DNS resolver `_ at addresses `2001:4860:4860::8888` and `2001:4860:4860::8844`
+.. [#f8888] Some DNS resolvers allow any host to send queries. Google operates a `public DNS resolver `_ at addresses `2001:4860:4860::8888` and `2001:4860:4860::8844`.

 .. include:: /links.rst

diff --git a/book-2nd/protocols/dnssec.rst b/book-2nd/protocols/dnssec.rst
index efe66b4..8de18ca 100644
--- a/book-2nd/protocols/dnssec.rst
+++ b/book-2nd/protocols/dnssec.rst
@@ -6,8 +6,8 @@

 Securing the Domain Name System
 ===============================

-The Domain Name System provides a critical service in the Internet
-infrastructure since it maps the domain names that are used by endusers
+The Domain Name System provides a critical service in the Internet
+infrastructure since it maps the domain names that are used by endusers
 onto IP addresses. Since endusers rely on names to identify the servers
 that they connect to, any incorrect information distributed by the DNS
 would direct endusers' connections to invalid destinations. Unfortunately,

@@ -20,7 +20,7 @@

 packets sent to a DNS resolver or a DNS server can gain valuable
 information about the DNS names that are used by a given enduser. If the
 attacker can capture all the packets sent to a DNS resolver, he/she can
 collect a lot of metadata about the domain names used by the enduser. Preventing this type
-of attack has not been an objective of the initial design of the DNS.
+of attack has not been an objective of the initial design of the DNS.
 There are currently discussions within the IETF to carry DNS messages over
 TLS sessions to protect against such attacks. However, these solutions are
 not yet widely deployed.

@@ -35,13 +35,13 @@ to observe (and possibly modify) all the packets sent and received by Alice. In

 practice, executing this attack is not simple since DNS resolvers are
 usually installed in protected datacenters. However, if Mallory controls
 the WiFi access point that Alice uses to access the Internet, he could easily
-modify the packets on this access point and some software packages
-automate this type of attacks.
+modify the packets on this access point and some software packages
+automate this type of attack.

 If Mallory cannot control a router on the path
-between Alice and her resolver, she could still launch a different attack.
+between Alice and her resolver, he could still launch a different attack.

 To understand this attack, it is important to correctly understand how
-the DNS protocol operates and the roles of the different fields of
+the DNS protocol operates and the roles of the different fields of
 the DNS header which is reproduced in the figure below.

 .. figure:: /../book/application/pkt/dnsheader.png

@@ -64,14 +64,14 @@ can predict a popular domain for which Alice will regularly send DNS

 requests, then he can prepare a set of DNS responses that map the name
 requested by Alice to an IP address controlled by Mallory instead of the
 legitimate DNS response. Each DNS response has a different `Identification`. Since there
-are only 65,536 values for the `Identification` field, it is possible
+are only 65,536 values for the `Identification` field, it is possible
 for Mallory to send them to Alice hoping that one of them will be received while Alice
 is waiting for a DNS response with the same identifier. In the past,
 it was difficult to send 65,536 DNS responses quickly enough. However, with
 the high-speed links that are available today, this is not an issue anymore.

-A second concern for Mallory is that he must be able to send
-the DNS responses as if they were coming directly from the DNS resolver.
+A second concern for Mallory is that he must be able to send
+the DNS responses as if they were coming directly from the DNS resolver.
 This implies that Mallory must be able to send IP packets that appear
 to originate from a different address. Although networks should be configured
 to prevent this type of attack, this is not always the case and there

 resolver's cache for a long period of time.

 Fortunately, DNS implementors have found solutions to mitigate this
 type of attack. The easiest approach would have been to update the format of the
-DNS requests and responses to include a larger `Identifier` field.
+DNS requests and responses to include a larger `Identifier` field.
 Unfortunately, this elegant solution was not possible with the DNS because
 the DNS messages do not include any version number that would have enabled
 such a change. Since the DNS messages are exchanged inside UDP segments,
-the DNS implementors found an alternate solution to counter this attack.
+the DNS implementors found an alternate solution to counter this attack.
 There are two ways for the DNS library used by Alice to send her DNS
 requests. A first solution is to bind one UDP source port and always send
 the DNS requests from this source port (the destination port is always
 port ``53``). The advantage of this solution is that Alice's DNS library can
-easily receive the DNS responses by listening to her chosen port.
+easily receive the DNS responses by listening to her chosen port.
 Unfortunately, once the attacker has found the source port used by Alice,
 he only needs to send 65,536 DNS responses to inject an invalid response.
 Fortunately, Alice can send her DNS requests in a different way. Instead

@@ -107,7 +107,7 @@ her implementation. From a security viewpoint there is a clear benefit since

 the attacker needs to guess both the 16-bit `Identifier` and the 16-bit
 `UDP source port` to inject a fake DNS response. To generate all possible
 DNS responses, the attacker would need to generate almost
-:math:`2^32` different messages, which is excessive in today's networks.
+:math:`2^{32}` different messages, which is excessive in today's networks.
 Most DNS implementations use this second approach to prevent these cache
 poisoning attacks.
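The figure of :math:`2^{32}` quoted above is easy to check :

.. code-block:: python

    # Guessing both the 16-bit Identification field and the 16-bit UDP
    # source port requires about 2^32 attempts.
    ids = 2 ** 16       # possible values of the Identification field
    ports = 2 ** 16     # possible UDP source ports
    print(ids * ports)  # 4294967296, i.e. 2^32 fake responses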
@@ -122,12 +122,12 @@ the DNS server that serves the queried domain, and the IP addresses of this

 server. Some DNS servers return several `NS` records and the associated
 IP addresses. The `cache poisoning` attack exploits this DNS optimisation.

-Let us illustrate it on an example.
-Assume that Alice frequently uses the `example.net` domain and in
-particular the
+Let us illustrate it with an example.
+Assume that Alice frequently uses the `example.net` domain and in
+particular the
 web server whose name is `www.example.net`. Mallory would like to redirect
 the TCP connections established by Alice towards `www.example.net` to one
-IP address that he controls. Assume that Mallory controls the
+IP address that he controls. Assume that Mallory controls the
 `mallory.net` domain. Mallory can tune the DNS server of his domain and
 add special DNS records to the responses that it sends.

 An attack could go roughly as follows. Mallory forces Alice to visit the
 `www.mallory.net` web site, for example by buying
 advertisements on a web site visited by Alice and redirecting one of these
 advertisements to `www.mallory.net`. When visiting the advertisement,
 Alice's DNS resolver will send a DNS request for `www.mallory.net`. Since
 Mallory controls the DNS server, he can easily add in the response an `AAAA`
-record that associates `www.example.net` to the IP address controlled by
-Mallory. If Alice's DNS library does not check the returned response,
+record that associates `www.example.net` to the IP address controlled by
+Mallory. If Alice's DNS library does not check the returned response,
 the cache entry for `www.example.net` will be replaced by the `AAAA` record
-sent by Mallory.
+sent by Mallory.

 To cope with these security threats and improve the security of the DNS,
 the IETF has defined several extensions that are known as DNSSEC. DNSSEC
 exploits public-key cryptography to authenticate the content of the DNS
 records that are sent by DNS servers and resolvers. DNSSEC is
-defined in three main documents :rfc:`4033`, :rfc:`4034`, :rfc:`4035`.
+defined in three main documents :rfc:`4033`, :rfc:`4034`, :rfc:`4035`.
 With DNSSEC, each DNS zone uses one public-private key pair. This
 key pair is only used to sign and authenticate DNS records. The DNS records
 are not encrypted and DNSSEC does not provide any confidentiality. Other DNS
-extensions are being developed to ensure the confidentiality of the
-information exchanged between a client and its resolvers :rfc:`7626`.
+extensions are being developed to ensure the confidentiality of the
+information exchanged between a client and its resolvers :rfc:`7626`.
 Some of these extensions exchange DNS records over a TLS session which
-provides the required confidentiality, but they are not yet deployed
-and outside the scope of this chapter.
+provides the required confidentiality, but they are not yet widely deployed
+and are outside the scope of this chapter.

 DNSSEC defines four new types of DNS records that are used together to
-authenticate the information distributed by the DNS.
+authenticate the information distributed by the DNS.

 - the `DNSKEY` record allows to store the public key associated with
   a zone. This record is encoded as a TLV and includes a `Base64`
   representation of the key and the identification of the public key
-  algorithm. This allows the `DNSKEY` record to support different public
+  algorithm. This allows the `DNSKEY` record to support different public
   key algorithms.
 - the `RRSIG` record is used to encode the signature of a DNS
   record. This record contains several subfields. The most important ones are the
-  algorithm used to generate the signature, the identifier of the public
-  key used to sign the record, the original TTL of the signed record and
-  the validity period for the signature.
+  algorithm used to generate the signature, the identifier of the public
+  key used to sign the record, the original TTL of the signed record and
+  the validity period for the signature.
 - the `DS` record contains a hash of a public key. It is used by a
   parent zone to certify the public key used by one of its child zones.
 - the `NSEC` record is used when non-existent domain names are queried.
Its usage will be explained later.

The simplest way to understand the operation of DNSSEC is to rely on a simple example. Let us consider the `example.org` domain and assume that Alice
-wants to retrieve the `AAAA` record for `www.example.org` using DNSSEC.
+wants to retrieve the `AAAA` record for `www.example.org` using DNSSEC.

.. index:: anchored key

The security of DNSSEC relies on `anchored keys`. An `anchored key` is a public key that is considered as trusted by a resolver. In our example,
-we assume that Alice's resolver has obtained the public key of the servers
+we assume that Alice's resolver has obtained the public key of the servers
 that manage the root zone in a secure way. This key has been distributed outside of the DNS, e.g. it has been published in a
-newspaper or has been received in a sealed letter.
+newspaper or has been received in a sealed letter.

To obtain an authenticated record for `www.example.org`, Alice's resolver first needs to retrieve the `NS` record which is responsible for the `.org` Top-Level Domain (TLD). This record is served by the DNS root server and Alice's resolver can retrieve the signature (`RRSIG` record) for this `NS` record. Since Alice knows the `DNSKEY` of the root, she can verify
-the validity of this signature.
+the validity of this signature.

The next step is to contact `ns.org`, the `NS` responsible for the `.org` TLD to retrieve the `NS` record for the `example.org` domain.
@@ -215,36 +215,36 @@ signature.

Thanks to the `DS` record, a resolver can validate the public keys of client zones as long as there is a chain of `DS` -> `DNSKEY` records from an anchored key. If the resolver trusts the public key of the root zone, it
-can validate all DNS replies for which this chain exists.
+can validate all DNS replies for which this chain exists.
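This `DS` -> `DNSKEY` chain can be illustrated with a small sketch. The zone names and keys below are hypothetical, and a simple hash stands in for both the `DS` digest and the public-key signature verification performed by a real resolver.

.. code-block:: python

    import hashlib

    def ds_digest(dnskey):
        # Stand-in for the hash of a child zone's public key stored in a DS record
        return hashlib.sha256(dnskey.encode()).hexdigest()

    # Hypothetical DNSKEY records on the path from the root to example.org
    dnskeys = {".": "root-key", "org": "org-key", "example.org": "example-key"}

    # DS records published by each parent zone for its child zone
    ds = {"org": ds_digest("org-key"), "example.org": ds_digest("example-key")}

    def validate(path, anchored_key):
        # Walk the DS -> DNSKEY chain downwards from the anchored root key
        if dnskeys["."] != anchored_key:
            return False      # the root key differs from the trust anchor
        for zone in path:
            if ds_digest(dnskeys[zone]) != ds[zone]:
                return False  # the parent's DS does not match the child's DNSKEY
        return True

    print(validate(["org", "example.org"], "root-key"))   # True
    print(validate(["org", "example.org"], "wrong-key"))  # False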
There are several details of the operation of DNSSEC that are worth
-being discussed. First, a server that supports DNSSEC must have a
+being discussed. First, a server that supports DNSSEC must have a
 public-private key pair. The public key is distributed with the `DNSKEY` record. The private key is never distributed and it does not even need to be stored on the server that uses the public key. DNSSEC does not require the DNSSEC servers to perform any operation that requires
-a private key in real time. All the `RRSIG` records can be computed
+a private key in real time. All the `RRSIG` records can be computed
 offline, possibly on a different server than the server that returns the DNSSEC replies. The initial motivation for this design choice was the CPU complexity of computing the `RRSIG` signatures for zones that
-contain millions of records. In the early days of DNSSEC, this was an
+contain millions of records. In the early days of DNSSEC, this was an
 operational constraint. Today, this is less of an issue, but avoiding
-costly signature operations in real time has two important benefits.
+costly signature operations in real time has two important benefits.
 First, this reduces the risk of denial of service attacks since an attacker cannot force a DNSSEC server to perform computationally intensive signing operations. Second, the private key can be stored offline, which means that even if an attacker gains access to the DNSSEC server, he cannot retrieve the private key. Using offline signatures for the `RRSIG` records has some practical implications that are reflected in the content of this
-record. First, each `RRSIG` record contains the original TTL of the
+record. First, each `RRSIG` record contains the original TTL of the
 signed record. When DNS resolvers cache records, they change the value of the TTL of
-these cached records and then return the modified records to their clients.
+these cached records and then return the modified records to their clients.
 When a resolver receives a signed DNS record, it must replace the received TTL of the record with the original TTL (and check that the received TTL is smaller than the original one) before checking the
-signature. Second, the `RRSIG` records contain a validity period, i.e.
-a starting time and an ending time for the validity of the signature. This
+signature. Second, the `RRSIG` records contain a validity period, i.e.
+a starting time and an ending time for the validity of the signature. This
 period is specified as two timestamps. This period is only the validity of the signature. It does not affect the TTL of the signed record and is independent of the TTL. In practice, the validity period is
@@ -252,7 +252,7 @@ important to allow DNS server operators to update their public/private
 keys. When such a key is changed, e.g. because the private key could have been compromised, there is some period of time during which records signed with the two keys coexist in the network. The validity period helps
-ensure that old signatures do not remain in DNS caches for ever.
+ensure that old signatures do not remain in DNS caches forever.

.. index:: NSEC

@@ -266,13 +266,13 @@ However, operational experience showed that queries for invalid domain
 names are more frequent than initially expected and a large fraction of the load on some servers is caused by repeated queries for invalid names. Typical examples include queries for invalid TLDs to the root
-DNS servers or queries caused by configuration errors [WF2003]_.
+DNS servers or queries caused by configuration errors [WF2003]_.
 Current DNS deployments allow resolvers to cache those negative answers
-to reduce the load on the entire DNS :rfc:`2308`.
+to reduce the load on the entire DNS :rfc:`2308`.

The simplest way to allow a DNSSEC server to return signed negative responses would be for the server to return a signed response that contains the
-received query and some information indicating the error.
+received query and some information indicating the error.
 The client could then easily check the validity of the negative response. Unfortunately, this would force the DNSSEC server to generate signatures in real time. This implies that the private key must be stored in the
@@ -284,7 +284,7 @@ to a DNSSEC server.

Given the above security risks, DNSSEC opted for a different approach that allows the negative replies to be authenticated by using offline signatures. The `NSEC` record exploits the lexicographical ordering of all the domain
-names. To understand its usage, consider a simple domain that contains
+names. To understand its usage, consider a simple domain that contains
 three names (the associated `AAAA` and other records that are not shown) :
@@ -315,17 +315,17 @@ zone. If this name were present, it would have been placed, in
 lexicographical order, between the `beta.example.org` and the `gamma.example.org` names. To confirm that the `delta.example.org` name does not exist, the server returns the `NSEC` record for `beta.example.org` that indicates that the
-next valid name after `beta.example.org` is `gamma.example.org`. If
If the server receives a query for `pi.example.org`, this is the `NSEC` record for `gamma.example.org` that will be returned. Since this record -contains a name that is before `pi.example.org` in lexicographical -order, this indicates that `pi.example.org` does not exist. +contains a name that is before `pi.example.org` in lexicographical +order, this indicates that `pi.example.org` does not exist. .. rubric:: Footnotes -.. [#fspoof] See http://spoofer.caida.org/summary.php for an ongoing +.. [#fspoof] See http://spoofer.caida.org/summary.php for an ongoing measurement study that analyses the networks where an attacker could send packets with any source IP address. diff --git a/book-2nd/protocols/ethernet.rst b/book-2nd/protocols/ethernet.rst index 8025b75..14cbd63 100644 --- a/book-2nd/protocols/ethernet.rst +++ b/book-2nd/protocols/ethernet.rst @@ -4,7 +4,7 @@ Ethernet ======== -Ethernet was designed in the 1970s at the Palo Alto Research Center [Metcalfe1976]_. The first prototype [#fethernethistory]_ used a coaxial cable as the shared medium and 3 Mbps of bandwidth. Ethernet was improved during the late 1970s and in the 1980s, Digital Equipment, Intel and Xerox published the first official Ethernet specification [DIX]_. This specification defines several important parameters for Ethernet networks. The first decision was to standardise the commercial Ethernet at 10 Mbps. The second decision was the duration of the `slot time`. In Ethernet, a long `slot time` enables networks to span a long distance but forces the host to use a larger minimum frame size. The compromise was a `slot time` of 51.2 microseconds, which corresponds to a minimum frame size of 64 bytes. +Ethernet was designed in the 1970s at the Palo Alto Research Center [Metcalfe1976]_. The first prototype [#fethernethistory]_ used a coaxial cable as the shared medium and 3 Mbps of bandwidth. Ethernet was improved during the late 1970s and in the 1980s, Digital Equipment, Intel and Xerox published the first official Ethernet specification [DIX]_. This specification defines several important parameters for Ethernet networks. The first decision was to standardise the commercial Ethernet at 10 Mbps. The second decision was the duration of the `slot time`. In Ethernet, a long `slot time` enables networks to span a long distance but forces the host to use a larger minimum frame size. The compromise was a `slot time` of 51.2 microseconds, which corresponds to a minimum frame size of 64 bytes. The third decision was the frame format. The experimental 3 Mbps Ethernet network built at Xerox used short frames containing 8 bit source and destination addresses fields, a 16 bit type indication, up to 554 bytes of payload and a 16 bit CRC. Using 8 bit addresses was suitable for an experimental network, but it was clearly too small for commercial deployments. Although the initial Ethernet specification [DIX]_ only allowed up to 1024 hosts on an Ethernet network, it also recommended three important changes compared to the networking technologies that were available at that time. The first change was to require each host attached to an Ethernet network to have a globally unique datalink layer address. Until then, datalink layer addresses were manually configured on each host. [DP1981]_ went against that state of the art and noted "`Suitable installation-specific administrative procedures are also needed for assigning numbers to hosts on a network. 
If a host is moved from one network to another it may be necessary to change its host number if its former number is in use on the new network. This is easier said than done, as each network must have an administrator who must record the continuously changing state of the system (often on a piece of paper tacked to the wall !). It is anticipated that in future office environments, hosts locations will change as often as telephones are changed in present-day offices.`" The second change introduced by Ethernet was to encode each address as a 48 bits field [DP1981]_. 48 bit addresses were huge compared to the networking technologies available in the 1980s, but the huge address space had several advantages [DP1981]_ including the ability to allocate large blocks of addresses to manufacturers. Eventually, other LAN technologies opted for 48 bits addresses as well [IEEE802]_ . The third change introduced by Ethernet was the definition of `broadcast` and `multicast` addresses. The need for `multicast` Ethernet was foreseen in [DP1981]_ and thanks to the size of the addressing space it was possible to reserve a large block of multicast addresses for each manufacturer. @@ -17,13 +17,13 @@ The datalink layer addresses used in Ethernet networks are often called MAC addr .. figure:: /../book/lan/png/lan-fig-039-c.png :align: center :scale: 70 - + 48 bits Ethernet address format .. index:: EtherType, Ethernet Type field -The original 10 Mbps Ethernet specification [DIX]_ defined a simple frame format where each frame is composed of five fields. The Ethernet frame starts with a preamble (not shown in the figure below) that is used by the physical layer of the receiver to synchronise its clock with the sender's clock. The first field of the frame is the destination address. As this address is placed at the beginning of the frame, an Ethernet interface can quickly verify whether it is the frame recipient and if not, cancel the processing of the arriving frame. The second field is the source address. While the destination address can be either a unicast or a multicast/broadcast address, the source address must always be a unicast address. The third field is a 16 bits integer that indicates which type of network layer packet is carried inside the frame. This field is often called the `EtherType`. Frequently used `EtherType` values [#fethertype]_ include `0x0800` for IPv4, `0x86DD` for IPv6 [#fipv6ether]_ and `0x806` for the Address Resolution Protocol (ARP). +The original 10 Mbps Ethernet specification [DIX]_ defined a simple frame format where each frame is composed of five fields. The Ethernet frame starts with a preamble (not shown in the figure below) that is used by the physical layer of the receiver to synchronise its clock with the sender's clock. The first field of the frame is the destination address. As this address is placed at the beginning of the frame, an Ethernet interface can quickly verify whether it is the frame recipient and if not, cancel the processing of the arriving frame. The second field is the source address. While the destination address can be either a unicast or a multicast/broadcast address, the source address must always be a unicast address. The third field is a 16 bits integer that indicates which type of network layer packet is carried inside the frame. This field is often called the `EtherType`. Frequently used `EtherType` values [#fethertype]_ include `0x0800` for IPv4, `0x86DD` for IPv6 [#fipv6ether]_ and `0x806` for the Address Resolution Protocol (ARP). 
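The layout of these three leading fields is easy to decode in software. The sketch below is a simple illustration, not the code of an actual driver, and the frame bytes are invented for the example.

.. code-block:: python

    import struct

    def parse_ethernet_header(frame):
        # Destination (6 bytes), source (6 bytes) and EtherType (2 bytes)
        if len(frame) < 14:
            raise ValueError("an Ethernet header contains at least 14 bytes")
        dst, src, ethertype = struct.unpack("!6s6sH", frame[:14])
        mac = lambda a: ":".join("%02x" % b for b in a)
        return mac(dst), mac(src), hex(ethertype)

    # A hypothetical header announcing an IPv6 payload (EtherType 0x86DD)
    header = bytes.fromhex("00112233445566778899aabb86dd")
    print(parse_ethernet_header(header))
    # ('00:11:22:33:44:55', '66:77:88:99:aa:bb', '0x86dd')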
The fourth part of the Ethernet frame is the payload. The minimum length of the payload is 46 bytes to ensure a minimum frame size, including the header, of 512 bits. The Ethernet payload cannot be longer than 1500 bytes. This size was found reasonable when the first Ethernet specification was written. At that time, Xerox had been using its experimental 3 Mbps Ethernet that offered 554 bytes of payload and :rfc:`1122` required a minimum MTU of 576 bytes for IPv4. 1500 bytes was large enough to support these needs without forcing the network adapters to contain overly large memories. Furthermore, simulations and measurement studies performed in Ethernet networks revealed that CSMA/CD was able to achieve a very high utilisation. This is illustrated in the figure below based on [SH1980]_, which shows the channel utilisation achieved in Ethernet networks containing different numbers of hosts that are sending frames of different sizes.

.. figure:: /../book/lan/png/lan-fig-102-c.png
   :align: center
   :scale: 70
-
+
   Impact of the frame length on the maximum channel utilisation [SH1980]_

@@ -50,7 +50,7 @@ The last field of the Ethernet frame is a 32 bit Cyclical Redundancy Check (CRC)

.. note:: Where should the CRC be located in a frame ?

- The transport and datalink layers usually chose different strategies to place their CRCs or checksums. Transport layer protocols usually place their CRCs or checksums in the segment header. Datalink layer protocols sometimes place their CRC in the frame header, but often in a trailer at the end of the frame. This choice reflects implementation assumptions, but also influences performance :rfc:`893`. When the CRC is placed in the trailer, as in Ethernet, the datalink layer can compute it while transmitting the frame and insert it at the end of the transmission. All Ethernet interfaces use this optimisation today. When the checksum is placed in the header, as in a TCP segment, it is impossible for the network interface to compute it while transmitting the segment. Some network interfaces provide hardware assistance to compute the TCP checksum, but this is more complex than if the TCP checksum were placed in the trailer [#ftso]_.
+ The transport and datalink layers usually choose different strategies to place their CRCs or checksums. Transport layer protocols usually place their CRCs or checksums in the segment header. Datalink layer protocols sometimes place their CRC in the frame header, but often in a trailer at the end of the frame. This choice reflects implementation assumptions, but also influences performance :rfc:`893`. When the CRC is placed in the trailer, as in Ethernet, the datalink layer can compute it while transmitting the frame and insert it at the end of the transmission. All Ethernet interfaces use this optimisation today. When the checksum is placed in the header, as in a TCP segment, it is impossible for the network interface to compute it while transmitting the segment. Some network interfaces provide hardware assistance to compute the TCP checksum, but this is more complex than if the TCP checksum were placed in the trailer [#ftso]_.

@@ -72,7 +72,8 @@ The Ethernet frame format shown above is specified in [DIX]_. This is the format

.. note:: What is the Ethernet service ?

- An Ethernet network provides an unreliable connectionless service. It supports three different transmission modes : `unicast`, `multicast` and `broadcast`.
While the Ethernet service is unreliable in theory, a good Ethernet network should, in practice, provide a service that : + An Ethernet network provides an unreliable connectionless service. It supports three different transmission modes : `unicast`, `multicast` and `broadcast`. While the Ethernet service is unreliable in theory, a good Ethernet network should, in practice, provide a service that : + - delivers frames to their destination with a very high probability of successful delivery - does not reorder the transmitted frames @@ -80,11 +81,11 @@ The Ethernet frame format shown above is specified in [DIX]_. This is the format .. index:: 10Base5 -Several physical layers have been defined for Ethernet networks. The first physical layer, usually called 10Base5, provided 10 Mbps over a thick coaxial cable. The characteristics of the cable and the transceivers that were used then enabled the utilisation of 500 meter long segments. A 10Base5 network can also include repeaters between segments. +Several physical layers have been defined for Ethernet networks. The first physical layer, usually called 10Base5, provided 10 Mbps over a thick coaxial cable. The characteristics of the cable and the transceivers that were used then enabled the utilisation of 500 meter long segments. A 10Base5 network can also include repeaters between segments. .. index:: 10Base2 -The second physical layer was 10Base2. This physical layer used a thin coaxial cable that was easier to install than the 10Base5 cable, but could not be longer than 185 meters. A 10BaseF physical layer was also defined to transport Ethernet over point-to-point optical links. The major change to the physical layer was the support of twisted pairs in the 10BaseT specification. Twisted pair cables are traditionally used to support the telephone service in office buildings. Most office buildings today are equipped with structured cabling. Several twisted pair cables are installed between any room and a central telecom closet per building or per floor in large buildings. These telecom closets act as concentration points for the telephone service but also for LANs. +The second physical layer was 10Base2. This physical layer used a thin coaxial cable that was easier to install than the 10Base5 cable, but could not be longer than 185 meters. A 10BaseF physical layer was also defined to transport Ethernet over point-to-point optical links. The major change to the physical layer was the support of twisted pairs in the 10BaseT specification. Twisted pair cables are traditionally used to support the telephone service in office buildings. Most office buildings today are equipped with structured cabling. Several twisted pair cables are installed between any room and a central telecom closet per building or per floor in large buildings. These telecom closets act as concentration points for the telephone service but also for LANs. .. index:: Ethernet hub, 10BaseT @@ -95,7 +96,7 @@ The introduction of the twisted pairs led to two major changes to Ethernet. The .. figure:: /../book/lan/png/lan-fig-060-c.png :align: center :scale: 70 - + Ethernet hubs in the reference model @@ -107,7 +108,7 @@ Computers can directly be attached to Ethernet hubs. Ethernet hubs themselves ca .. figure:: /../book/lan/svg/datalink-fig-012-c.png :align: center :scale: 70 - + A hierarchical Ethernet network composed of hubs @@ -118,7 +119,7 @@ In the late 1980s, 10 Mbps became too slow for some applications and network man The evolution of Ethernet did not stop. 
In 1998, the IEEE published a first standard to provide Gigabit Ethernet over optical fibers. Several other types of physical layers were added afterwards. The `10 Gigabit Ethernet `_ standard appeared in 2002. Work is ongoing to develop `standards `_ for 40 Gigabit and 100 Gigabit Ethernet and some are thinking about `Terabit Ethernet `_. The table below lists the main Ethernet standards. A more detailed list may be found at http://en.wikipedia.org/wiki/Ethernet_physical_layer -.. In the late 1990s, the first Gigabit Ethernet interfaces had difficulties transmitting and receiving at 1000 Mbps given the performance limitations of the hosts on which they were running. One of the issues was the 1500 bytes maximum Ethernet frame size, as it forces hosts to send relatively small packets. This increases the number of interruptions that a host needs to process. To improve the usability of Gigabit Ethernet without requiring CPU and bus upgrades, several vendors proposed to increase the.... Experience with other networking technologies that support large frames showed limits performed with other networking technologies showed that a larger frame +.. In the late 1990s, the first Gigabit Ethernet interfaces had difficulties transmitting and receiving at 1000 Mbps given the performance limitations of the hosts on which they were running. One of the issues was the 1500 bytes maximum Ethernet frame size, as it forces hosts to send relatively small packets. This increases the number of interruptions that a host needs to process. To improve the usability of Gigabit Ethernet without requiring CPU and bus upgrades, several vendors proposed to increase the.... Experience with other networking technologies that support large frames showed limits performed with other networking technologies showed that a larger frame ============ ======================================================== Standard Comments @@ -127,7 +128,7 @@ Standard Comments 10Base2 Thin coaxial cable, 185m 10BaseT Two pairs of category 3+ UTP 10Base-F 10 Mb/s over optical fiber -100Base-Tx Category 5 UTP or STP, 100 m maximum +100Base-Tx Category 5 UTP or STP, 100 m maximum 100Base-FX Two multimode optical fiber, 2 km maximum 1000Base-CX Two pairs shielded twisted pair, 25m maximum 1000Base-SX Two multimode or single mode optical fibers with lasers @@ -160,8 +161,8 @@ Increasing the physical layer bandwidth as in `Fast Ethernet` was only one of th .. figure:: /../book/lan/png/lan-fig-063-c.png :align: center :scale: 70 - - Ethernet switches and the reference model + + Ethernet switches and the reference model @@ -173,8 +174,8 @@ An `Ethernet switch` understands the format of the Ethernet frames and can selec .. figure:: /../book/lan/svg/datalink-fig-013-c.png :align: center :scale: 70 - - Operation of Ethernet switches + + Operation of Ethernet switches @@ -187,13 +188,13 @@ The pseudo-code below details how an Ethernet switch forwards Ethernet frames. I .. code-block:: python # Arrival of frame F on port P - # Table : MAC address table dictionary : addr->port + # Table : MAC address table dictionary : addr->port # Ports : list of all ports on the switch src=F.SourceAddress dst=F.DestinationAddress Table[src]=P #src heard on port P if isUnicast(dst) : - if dst in Table: + if dst in Table: ForwardFrame(F,Table[dst]) else: for o in Ports : @@ -206,7 +207,7 @@ The pseudo-code below details how an Ethernet switch forwards Ethernet frames. I .. 
note:: Security issues with Ethernet hubs and switches

- From a security perspective, Ethernet hubs have the same drawbacks as the older coaxial cable. A host attached to a hub will be able to capture all the frames exchanged between any pair of hosts attached to the same hub.
+ From a security perspective, Ethernet hubs have the same drawbacks as the older coaxial cable. A host attached to a hub will be able to capture all the frames exchanged between any pair of hosts attached to the same hub.
 Ethernet switches are much better from this perspective thanks to their selective forwarding : a host will usually only receive the frames destined to itself as well as the multicast, broadcast and unknown frames. However, this does not imply that switches are completely secure. There are, unfortunately, attacks against Ethernet switches. From a security perspective, the `MAC address table` is one of the fragile elements of an Ethernet switch. This table has a fixed size. Some low-end switches can store a few tens or a few hundreds of addresses while higher-end switches can store tens of thousands of addresses or more. From a security point of view, a limited resource can be the target of Denial of Service attacks. Unfortunately, such attacks are also possible on Ethernet switches. A malicious host could overflow the `MAC address table` of the switch by generating thousands of frames with random source addresses. Once the `MAC address table` is full, the switch needs to broadcast all the frames that it receives. At this point, an attacker will receive unicast frames that are not destined to its address. The ARP attack discussed in the previous chapter could also occur with Ethernet switches [Vyncke2007]_. Recent switches implement several types of defences against these attacks, but they need to be carefully configured by the network administrator. See [Vyncke2007]_ for a detailed discussion on security issues with Ethernet switches.

@@ -216,13 +217,13 @@ The `MAC address learning` algorithm combined with the forwarding algorithm work

.. figure:: /../book/lan/svg/datalink-fig-014-c.png
   :align: center
   :scale: 100
-
+
   Ethernet switches in a loop

-When all switches boot, their `MAC address table` is empty. Assume that host `A` sends a frame towards host `C`. Upon reception of this frame, switch1 updates its `MAC address table` to remember that address `A` is reachable via its West port.
As there is no entry for address `C` in switch1's `MAC address table`, the frame is forwarded to both switch2 and switch3. When switch2 receives the frame, it updates its `MAC address table` for address `A` and forwards the frame to host `C` as well as to switch3. switch3 has thus received two copies of the same frame. As switch3 does not know how to reach the destination address, it forwards the frame received from switch1 to switch2 and the frame received from switch2 to switch1... The single frame sent by host `A` will be continuously duplicated by the switches until their `MAC address table` contains an entry for address `C`. Quickly, all the available link bandwidth will be used to forward all the copies of this frame. As Ethernet does not contain any `TTL` or `HopLimit`, this loop will never stop.

-The `MAC address learning` algorithm allows switches to be plug-and-play. Unfortunately, the loops that arise when the network topology is not a tree are a severe problem. Forcing the switches to only be used in tree-shaped networks as hubs would be a severe limitation. To solve this problem, the inventors of Ethernet switches have developed the `Spanning Tree Protocol`. This protocol allows switches to automatically disable ports on Ethernet switches to ensure that the network does not contain any cycle that could cause frames to loop forever.
+The `MAC address learning` algorithm allows switches to be plug-and-play. Unfortunately, the loops that arise when the network topology is not a tree are a severe problem. Forcing the switches to only be used in tree-shaped networks as hubs would be a severe limitation. To solve this problem, the inventors of Ethernet switches have developed the `Spanning Tree Protocol`. This protocol allows switches to automatically disable ports so that the network does not contain any cycle that could cause frames to loop forever.

.. rubric:: Footnotes

@@ -230,25 +231,25 @@

-The Spanning Tree Protocol (802.1d)
+The Spanning Tree Protocol (802.1d)
------------------------------------

-The `Spanning Tree Protocol` (STP), proposed in [Perlman1985]_, is a distributed protocol that is used by switches to reduce the network topology to a spanning tree, so that there are no cycles in the topology. For example, consider the network shown in the figure below. In this figure, each bold line corresponds to an Ethernet to which two Ethernet switches are attached. This network contains several cycles that must be broken to allow Ethernet switches that are using the MAC address learning algorithm to exchange frames.
+The `Spanning Tree Protocol` (STP), proposed in [Perlman1985]_, is a distributed protocol that is used by switches to reduce the network topology to a spanning tree, so that there are no cycles in the topology. For example, consider the network shown in the figure below. In this figure, each bold line corresponds to an Ethernet to which two Ethernet switches are attached. This network contains several cycles that must be broken to allow Ethernet switches that are using the MAC address learning algorithm to exchange frames.

.. figure:: /../book/lan/svg/datalink-fig-015-c.png
   :align: center
   :scale: 70
-
+
   Spanning tree computed in a switched Ethernet network

-In this network, the STP will compute the following spanning tree. `Switch1` will be the root of the tree. All the interfaces of `Switch1`, `Switch2` and `Switch7` are part of the spanning tree.
Only the interface connected to `LANB` will be active on `Switch9`. `LANH` will only be served by `Switch7` and the port of `Switch44` on `LANG` will be disabled. A frame originating on `LANB` and destined for `LANA` will be forwarded by `Switch7` on `LANC`, then by `Switch1` on `LANE`, then by `Switch44` on `LANF` and eventually by `Switch2` on `LANA`.
+In this network, the STP will compute the following spanning tree. `Switch1` will be the root of the tree. All the interfaces of `Switch1`, `Switch2` and `Switch7` are part of the spanning tree. Only the interface connected to `LANB` will be active on `Switch9`. `LANH` will only be served by `Switch7` and the port of `Switch44` on `LANG` will be disabled. A frame originating on `LANB` and destined for `LANA` will be forwarded by `Switch7` on `LANC`, then by `Switch1` on `LANE`, then by `Switch44` on `LANF` and eventually by `Switch2` on `LANA`.

Switches running the `Spanning Tree Protocol` exchange `BPDUs`. These `BPDUs` are always sent as frames with destination MAC address as the `ALL_BRIDGES` reserved multicast MAC address. Each switch has a unique 64 bit `identifier`. To ensure uniqueness, the lower 48 bits of the identifier are set to the unique MAC address allocated to the switch by its manufacturer. The high order 16 bits of the switch identifier can be configured by the network administrator to influence the topology of the spanning tree. The default value for these high order bits is 32768.

-The switches exchange `BPDUs` to build the spanning tree. Intuitively, the spanning tree is built by first selecting the switch with the smallest `identifier` as the root of the tree. The branches of the spanning tree are then composed of the shortest paths that allow all of the switches that compose the network to be reached.
+The switches exchange `BPDUs` to build the spanning tree. Intuitively, the spanning tree is built by first selecting the switch with the smallest `identifier` as the root of the tree. The branches of the spanning tree are then composed of the shortest paths that allow all of the switches that compose the network to be reached.

The `BPDUs` exchanged by the switches contain the following information :

 - the `identifier` of the root switch (`R`)
@@ -256,10 +257,10 @@ The `BPDUs` exchanged by the switches contain the following information :
 - the `identifier` of the switch that sent the `BPDU` (`T`)
 - the number of the switch port over which the `BPDU` was sent (`p`)

-We will use the notation `<R,c,T,p>` to represent a `BPDU` whose `root identifier` is `R`, `cost` is `c` and that was sent on the port `p` of switch `T`. The construction of the spanning tree depends on an ordering relationship among the `BPDUs`. This ordering relationship could be implemented by the python function below.
+We will use the notation `<R,c,T,p>` to represent a `BPDU` whose `root identifier` is `R`, `cost` is `c` and that was sent on the port `p` of switch `T`. The construction of the spanning tree depends on an ordering relationship among the `BPDUs`. This ordering relationship could be implemented by the python function below.

.. code-block:: python
-
+
   # returns True if bpdu b1 is better than bpdu b2
   def better( b1, b2) :
       return ( (b1.R < b2.R) or
@@ -281,14 +282,14 @@

Bandwidth Cost

The `Spanning Tree Protocol` uses its own terminology that we illustrate in the figure above. A switch port can be in three different states : `Root`, `Designated` and `Blocked`. All the ports of the `root` switch are in the `Designated` state. The state of the ports on the other switches is determined based on the `BPDU` received on each port.
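The code block above is truncated by the patch context. A complete, self-contained variant of this ordering can be written as a tuple comparison, which tests the `root identifier`, then the `cost`, then the sender and finally the port. This is a sketch for illustration, with hypothetical BPDU values.

.. code-block:: python

   from collections import namedtuple

   # A BPDU carries the root identifier R, the cost c, the sender T and the port p
   BPDU = namedtuple("BPDU", "R c T p")

   def better(b1, b2):
       # Lower root wins, then lower cost, then lower sender, then lower port
       return (b1.R, b1.c, b1.T, b1.p) < (b2.R, b2.c, b2.T, b2.p)

   print(better(BPDU(1, 2, 4, 1), BPDU(1, 2, 9, 2)))  # True : same root and cost, sender 4 < 9
   print(better(BPDU(3, 0, 3, 1), BPDU(1, 2, 9, 2)))  # False : root 1 is better than root 3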
-The `Spanning Tree Protocol` uses the ordering relationship to build the spanning tree. Each switch listens to `BPDUs` on its ports. When `BPDU=<R,c,T,p>` is received on port `q`, the switch computes the port's `root priority vector`: `V[q]=<R,c+cost[q],T,p,q>`, where `cost[q]` is the cost associated to the port over which the `BPDU` was received. The switch stores in a table the last `root priority vector` received on each port. The switch then compares its own `identifier` with the smallest `root identifier` stored in this table. If its own `identifier` is smaller, then the switch is the root of the spanning tree and is, by definition, at a distance `0` of the root. The `BPDU` of the switch is then `<R,0,R,p>`, where `R` is the switch `identifier` and `p` will be set to the port number over which the `BPDU` is sent.
+The `Spanning Tree Protocol` uses the ordering relationship to build the spanning tree. Each switch listens to `BPDUs` on its ports. When `BPDU=<R,c,T,p>` is received on port `q`, the switch computes the port's `root priority vector`: `V[q]=<R,c+cost[q],T,p,q>`, where `cost[q]` is the cost associated to the port over which the `BPDU` was received. The switch stores in a table the last `root priority vector` received on each port. The switch then compares its own `identifier` with the smallest `root identifier` stored in this table. If its own `identifier` is smaller, then the switch is the root of the spanning tree and is, by definition, at a distance `0` of the root. The `BPDU` of the switch is then `<R,0,R,p>`, where `R` is the switch `identifier` and `p` will be set to the port number over which the `BPDU` is sent.
+
+Otherwise, the switch chooses the best priority vector from its table, `bv=<R,c',T',p',q'>`. The port `q'`, over which this best root priority vector was learned, is the switch port that is closest to the `root` switch. This port becomes the `Root` port of the switch. There is only one `Root` port per switch (except for the `Root` switches whose ports are all `Designated`). The switch can then compute its own `BPDU` as `BPDU=<R,c',S,p>`, where `R` is the `root identifier`, `c'` the cost of the best root priority vector, `S` the identifier of the switch and `p` will be replaced by the number of the port over which the `BPDU` will be sent.
-Otherwise, the switch chooses the best priority vector from its table, `bv=<R,c',T',p',q'>`. The port `q'`, over which this best root priority vector was learned, is the switch port that is closest to the `root` switch. This port becomes the `Root` port of the switch. There is only one `Root` port per switch (except for the `Root` switches whose ports are all `Designated`). The switch can then compute its own `BPDU` as `BPDU=<R,c',S,p>`, where `R` is the `root identifier`, `c'` the cost of the best root priority vector, `S` the identifier of the switch and `p` will be replaced by the number of the port over which the `BPDU` will be sent.
+To determine the state of its other ports, the switch compares its own `BPDU` with the last `BPDU` received on each port. Note that the comparison is done by using the `BPDUs` and not the `root priority vectors`.
-To determine the state of its other ports, the switch compares its own `BPDU` with the last `BPDU` received on each port. Note that the comparison is done by using the `BPDUs` and not the `root priority vectors`.
If the switch's `BPDU` is better than the last `BPDU` of this port, the port becomes a `Designated` port. Otherwise, the port becomes a `Blocked` port.

-The state of each port is important when considering the transmission of `BPDUs`. The root switch regularly sends its own `BPDU` over all of its (`Designated`) ports. This `BPDU` is received on the `Root` port of all the switches that are directly connected to the `root switch`. Each of these switches computes its own `BPDU` and sends this `BPDU` over all its `Designated` ports. These `BPDUs` are then received on the `Root` port of downstream switches, which then compute their own `BPDU`, etc. When the network topology is stable, switches send their own `BPDU` on all their `Designated` ports, once they receive a `BPDU` on their `Root` port. No `BPDU` is sent on a `Blocked` port. Switches listen for `BPDUs` on their `Blocked` and `Designated` ports, but no `BPDU` should be received over these ports when the topology is stable. The utilisation of the ports for both `BPDUs` and data frames is summarised in the table below.
+The state of each port is important when considering the transmission of `BPDUs`. The root switch regularly sends its own `BPDU` over all of its (`Designated`) ports. This `BPDU` is received on the `Root` port of all the switches that are directly connected to the `root switch`. Each of these switches computes its own `BPDU` and sends this `BPDU` over all its `Designated` ports. These `BPDUs` are then received on the `Root` port of downstream switches, which then compute their own `BPDU`, etc. When the network topology is stable, switches send their own `BPDU` on all their `Designated` ports, once they receive a `BPDU` on their `Root` port. No `BPDU` is sent on a `Blocked` port. Switches listen for `BPDUs` on their `Blocked` and `Designated` ports, but no `BPDU` should be received over these ports when the topology is stable. The utilisation of the ports for both `BPDUs` and data frames is summarised in the table below.
-
========== ============== ========== ===================
Port state Receives BPDUs Sends BPDU Handles data frames
========== ============== ========== ===================
@@ -299,16 +300,16 @@ Designated yes yes yes

.. No `BPDU` should be received on a `Designated` or `Blocked` port when the topology is stable. The reception of a `BPDU` on such a port usually indicates a change in the topology.

-To illustrate the operation of the `Spanning Tree Protocol`, let us consider the simple network topology in the figure below.
+To illustrate the operation of the `Spanning Tree Protocol`, let us consider the simple network topology in the figure below.

.. figure:: /../book/lan/svg/datalink-fig-016-c.png
   :align: center
   :scale: 60
-
+
   A simple Spanning tree computed in a switched Ethernet network

-Assume that `Switch4` is the first to boot. It sends its own `BPDU=<4,0,4,?>` on its two ports. When `Switch1` boots, it sends `BPDU=<1,0,1,1>`. This `BPDU` is received by `Switch4`, which updates its `BPDU` and root priority vector tables and computes a new `BPDU=<1,3,4,?>`. Indeed, there is only one root priority vector and hence, it is the best one. Port 1 of `Switch4` becomes the `Root` port while its second port is still in the `Designated` state.
+Assume that `Switch4` is the first to boot. It sends its own `BPDU=<4,0,4,?>` on its two ports. When `Switch1` boots, it sends `BPDU=<1,0,1,1>`.
This `BPDU` is received by `Switch4`, which updates its `BPDU` and root priority vector tables and computes a new `BPDU=<1,3,4,?>`. Indeed, there is only one root priority vector and hence, it is the best one. Port 1 of `Switch4` becomes the `Root` port while its second port is still in the `Designated` state. Assume now that `Switch9` boots and immediately receives `Switch1` 's `BPDU` on port 1. `Switch9` computes its own `BPDU=<1,1,9,?>` and port 1 becomes the `Root` port of this switch. This `BPDU` is sent on port 2 of `Switch9` and reaches `Switch4`. `Switch4` compares the priority vectors. It notices that the last computed vector (i.e., `V[2]=<1,2,9,2,2>`) is better than `V[1]=<1,3,1,1,1>`. Thus, `Switch4`'s `BPDU` is recomputed and port 2 becomes the `Root` port of `Switch4`. `Switch4` compares its new `BPDU=<1,2,4,?>` with the last `BPDU` received on each port (except for the `Root` port). Port 1 becomes a `Blocked` port on `Switch4` because the `BPDU=<1,0,1,1>` received on this port is better. @@ -323,24 +324,24 @@ Virtual LANs Another important advantage of Ethernet switches is the ability to create Virtual Local Area Networks (VLANs). A virtual LAN can be defined as a `set of ports attached to one or more Ethernet switches`. A switch can support several VLANs and it runs one MAC learning algorithm for each Virtual LAN. When a switch receives a frame with an unknown or a multicast destination, it forwards it over all the ports that belong to the same Virtual LAN but not over the ports that belong to other Virtual LANs. Similarly, when a switch learns a source address on a port, it associates it to the Virtual LAN of this port and uses this information only when forwarding frames on this Virtual LAN. -The figure below illustrates a switched Ethernet network with three Virtual LANs. `VLAN2` and `VLAN3` only require a local configuration of switch `S1`. Host `C` can exchange frames with host `D`, but not with hosts that are outside of its VLAN. `VLAN1` is more complex as there are ports of this VLAN on several switches. To support such VLANs, local configuration is not sufficient anymore. When a switch receives a frame from another switch, it must be able to determine the VLAN in which the frame originated to use the correct MAC table to forward the frame. This is done by assigning an identifier to each Virtual LAN and placing this identifier inside the headers of the frames that are exchanged between switches. +The figure below illustrates a switched Ethernet network with three Virtual LANs. `VLAN2` and `VLAN3` only require a local configuration of switch `S1`. Host `C` can exchange frames with host `D`, but not with hosts that are outside of its VLAN. `VLAN1` is more complex as there are ports of this VLAN on several switches. To support such VLANs, local configuration is not sufficient anymore. When a switch receives a frame from another switch, it must be able to determine the VLAN in which the frame originated to use the correct MAC table to forward the frame. This is done by assigning an identifier to each Virtual LAN and placing this identifier inside the headers of the frames that are exchanged between switches. .. figure:: /../book/lan/svg/datalink-fig-017-c.png :align: center :scale: 70 - - Virtual Local Area Networks in a switched Ethernet network + + Virtual Local Area Networks in a switched Ethernet network IEEE defined in the [IEEE802.1q]_ standard a special header to encode the VLAN identifiers. 
This 32 bit header includes a 12 bit VLAN field that contains the VLAN identifier of each frame. The format of the [IEEE802.1q]_ header is described below.

.. figure:: /../book/lan/pkt/8021q.png
   :align: center
   :scale: 100
-
+
   Format of the 802.1q header

-The [IEEE802.1q]_ header is inserted immediately after the source MAC address in the Ethernet frame (i.e. before the EtherType field). The maximum frame size is increased by 4 bytes. It is encoded in 32 bits and contains four fields. The Tag Protocol Identifier is set to `0x8100` to allow the receiver to detect the presence of this additional header. The `Priority Code Point` (PCP) is a three bit field that is used to support different transmission priorities for the frame. Value `0` is the lowest priority and value `7` the highest. Frames with a higher priority can expect to be forwarded earlier than frames having a lower priority. The `C` bit is used for compatibility between Ethernet and Token Ring networks. The last 12 bits of the 802.1q header contain the VLAN identifier. Value `0` indicates that the frame does not belong to any VLAN while value `0xFFF` is reserved. This implies that 4094 different VLAN identifiers can be used in an Ethernet network.
+The [IEEE802.1q]_ header is inserted immediately after the source MAC address in the Ethernet frame (i.e. before the EtherType field). The maximum frame size is increased by 4 bytes. It is encoded in 32 bits and contains four fields. The Tag Protocol Identifier is set to `0x8100` to allow the receiver to detect the presence of this additional header. The `Priority Code Point` (PCP) is a three bit field that is used to support different transmission priorities for the frame. Value `0` is the lowest priority and value `7` the highest. Frames with a higher priority can expect to be forwarded earlier than frames having a lower priority. The `C` bit is used for compatibility between Ethernet and Token Ring networks. The last 12 bits of the 802.1q header contain the VLAN identifier. Value `0` indicates that the frame does not belong to any VLAN while value `0xFFF` is reserved. This implies that 4094 different VLAN identifiers can be used in an Ethernet network.
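Building the 4 bytes 802.1q tag is straightforward once the three subfields are known. The sketch below only illustrates the layout described above and is not production code; `c` stands for the compatibility bit.

.. code-block:: python

   import struct

   def vlan_tag(pcp, c, vlan_id):
       # TPID 0x8100, then 3 bits of PCP, 1 bit C and 12 bits of VLAN identifier
       if not 1 <= vlan_id <= 4094:   # 0 and 0xFFF are reserved
           raise ValueError("invalid VLAN identifier")
       return struct.pack("!HH", 0x8100, (pcp << 13) | (c << 12) | vlan_id)

   print(vlan_tag(pcp=0, c=0, vlan_id=20).hex())  # 81000014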
.. include:: ../links.rst

diff --git a/book-2nd/protocols/http.rst b/book-2nd/protocols/http.rst
index 2e0e462..35a7c9c 100644
--- a/book-2nd/protocols/http.rst
+++ b/book-2nd/protocols/http.rst
@@ -11,30 +11,30 @@ In the early days of the Internet was mainly used for remote terminal access wit
 Many `FTP` clients offer a user interface similar to a Unix shell and allow the client to browse the file system on the server and to send and retrieve files. `FTP` servers can be configured in two modes :

 - authenticated : in this mode, the ftp server only accepts users with a valid user name and password. Once authenticated, they can access the files and directories according to their permissions
- - anonymous : in this mode, clients supply the `anonymous` userid and their email address as password. These clients are granted access to a special zone of the file system that only contains public files.
+ - anonymous : in this mode, clients supply the `anonymous` userid and their email address as the password. These clients are granted access to a special zone of the file system that only contains public files.

-ftp was very popular in the 1990s and early 2000s, but today it has mostly been superseded by more recent protocols. Authenticated access to files is mainly done by using the Secure Shell (ssh_) protocol defined in :rfc:`4251` and supported by clients such as scp_ or sftp_. Nowadays, anonymous access is mainly provided by web protocols.
+`FTP` was very popular in the 1990s and early 2000s, but today it has mostly been superseded by more recent protocols. Authenticated access to files is mainly done by using the Secure Shell (ssh_) protocol defined in :rfc:`4251` and supported by clients such as scp_ or sftp_. Nowadays, anonymous access is mainly provided by web protocols.

In the late 1980s, high energy physicists working at CERN_ had to efficiently exchange documents about their ongoing and planned experiments. `Tim Berners-Lee`_ evaluated several of the document sharing techniques that were available at that time [B1989]_. As none of the existing solutions met CERN's requirements, they chose to develop a completely new document sharing system. This system was initially called the `mesh`, but was quickly renamed the `world wide web`. The starting point for the `world wide web` is hypertext documents. A hypertext document is a document that contains references (hyperlinks) to other documents that the reader can immediately access. Hypertext was not invented for the world wide web. The idea of hypertext documents was proposed in 1945 [Bush1945]_ and the first experiments were done during the 1960s [Nelson1965]_ [Myers1998]_. Compared to the hypertext documents that were used in the late 1980s, the main innovation introduced by the `world wide web` was to allow hyperlinks to reference documents stored on remote machines.

.. figure:: ../../book/application/svg/www-basics.png
   :align: center
-   :scale: 60
+   :scale: 60

-   World-wide web clients and servers
+   World-wide web clients and servers

A document sharing system such as the `world wide web` is composed of three important parts.

- 1. A standardised addressing scheme that allows unambiguous identification of documents
+ 1. A standardised addressing scheme that allows unambiguous identification of documents
  2. A standard document format : the `HyperText Markup Language `_
  3. A standardised protocol that facilitates efficient retrieval of documents stored on a server

.. note:: Open standards and open implementations

 Open standards have played, and are still playing, a key role in the success of the `world wide web` as we know it today. Without open standards, the world wide web would never have reached its current size.
In addition to open standards, another important factor for the success of the web was the availability of open and efficient implementations of these standards. When CERN started to work on the `web`, their objective was to build a running system that could be used by physicists. They developed open-source implementations of the `first web servers `_ and `web clients `_. These open-source implementations were powerful and could be used as is, by institutions willing to share information on the web. They were also extended by other developers who contributed to new features. For example, NCSA_ added support for images in their `Mosaic browser `_ that was eventually used to create `Netscape Communications `_.

The first component of the `world wide web` is the Uniform Resource Identifier (URI), defined in :rfc:`3986`. A URI is a character string that unambiguously identifies a resource on the world wide web. Here is a subset of the BNF for URIs ::

@@ -63,14 +63,14 @@ The third part of the URI is the path to the document. This path is structured a

.. code-block:: text

   http://tools.ietf.org/html/rfc3986.html
-   mailto:infobot@example.com?subject=current-issue
+   mailto:infobot@example.com?subject=current-issue
   http://docs.python.org/library/basehttpserver.html?highlight=http#BaseHTTPServer.BaseHTTPRequestHandler
   telnet://[2001:db8:3080:3::2]:80/
   ftp://cnn.example.com&story=breaking_news@10.0.0.1/top_story.htm

-.. The first URI corresponds to a document named `rfc3986.html` that is stored on the server named `tools.ietf.org` and can be accessed by using the `http` protocol on its default port. The second URI corresponds to an email message, with subject `current-issue`, that will be sent to user `infobot` in domain `example.com`. The `mailto:` URI scheme is defined in :rfc:`2368`. The third URI references the portion `BaseHTTPServer.BaseHTTPRequestHandler` of the document `basehttpserver.html` that is stored in the `library` directory on server `docs.python.org`. This document can be retrieved by using the `http` protocol. The query `highlight=http` is associated to this URI. The fourth example is a server that operates the telnet_ protocol, uses IPv6 address `2001:db8:3080:3::2` and is reachable on port 80. The last URI is somewhat special. Most users will assume that it corresponds to a document stored on the `cnn.example.com` server. However, to parse this URI, it is important to remember that the `@` character is used to separate the user name from the host name in the authorisation part of a URI. This implies that the URI points to a document named `top_story.htm` on host having IPv4 address `10.0.0.1`. The document will be retrieved by using the `ftp` protocol with the user name set to `cnn.example.com&story=breaking_news`.
+.. The first URI corresponds to a document named `rfc3986.html` that is stored on the server named `tools.ietf.org` and can be accessed by using the `http` protocol on its default port. The second URI corresponds to an email message, with subject `current-issue`, that will be sent to user `infobot` in domain `example.com`. The `mailto:` URI scheme is defined in :rfc:`2368`. The third URI references the portion `BaseHTTPServer.BaseHTTPRequestHandler` of the document `basehttpserver.html` that is stored in the `library` directory on server `docs.python.org`. This document can be retrieved by using the `http` protocol. The query `highlight=http` is associated with this URI.
The fourth example is a server that operates the telnet_ protocol, uses IPv6 address `2001:db8:3080:3::2` and is reachable on port 80. The last URI is somewhat special. Most users will assume that it corresponds to a document stored on the `cnn.example.com` server. However, to parse this URI, it is important to remember that the `@` character is used to separate the user name from the host name in the authorisation part of a URI. This implies that the URI points to a document named `top_story.htm` on host having IPv4 address `10.0.0.1`. The document will be retrieved by using the `ftp` protocol with the user name set to `cnn.example.com&story=breaking_news`.

The first URI corresponds to a document named `rfc3986.html` that is stored on the server named `tools.ietf.org` and can be accessed by using the `http` protocol on its default port. The second URI corresponds to an email message, with subject `current-issue`, that will be sent to user `infobot` in domain `example.com`. The `mailto:` URI scheme is defined in :rfc:`6068`. The third URI references the portion `BaseHTTPServer.BaseHTTPRequestHandler` of the document `basehttpserver.html` that is stored in the `library` directory on server `docs.python.org`. This document can be retrieved by using the `http` protocol. The query `highlight=http` is associated with this URI. The fourth example is a server that operates the telnet_ protocol, uses IPv6 address `2001:db8:3080:3::2` and is reachable on port 80. The last URI is somewhat special. Most users will assume that it corresponds to a document stored on the `cnn.example.com` server. However, to parse this URI, it is important to remember that the `@` character is used to separate the user name from the host name in the authorisation part of a URI. This implies that the URI points to a document named `top_story.htm` on host having IPv4 address `10.0.0.1`. The document will be retrieved by using the `ftp` protocol with the user name set to `cnn.example.com&story=breaking_news`.
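The pitfall of this last URI can be verified with Python's standard library, which applies the same parsing rules.

.. code-block:: python

   from urllib.parse import urlsplit

   uri = "ftp://cnn.example.com&story=breaking_news@10.0.0.1/top_story.htm"
   parts = urlsplit(uri)
   print(parts.scheme)    # ftp
   print(parts.username)  # cnn.example.com&story=breaking_news
   print(parts.hostname)  # 10.0.0.1
   print(parts.path)      # /top_story.htm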
The `first version of HTML `_ was derived from the Standard Generalized Markup Language (SGML) that was standardised in 1986 by :term:`ISO`. SGML_ was designed to allow large project documents in industries such as government, law or aerospace to be shared efficiently in a machine-readable manner. These industries require documents to remain readable and editable for tens of years and insisted on a standardised format supported by multiple vendors. Today, SGML_ is no longer widely used beyond specific applications, but its descendants including :term:`HTML` and :term:`XML` are now widespread. @@ -79,22 +79,22 @@ A markup language is a structured way of adding annotations about the formatting <tag>Some text to be displayed</tag> More complex HTML elements can also include optional attributes in the start tag :: - + <tag attribute="value">some text to be displayed</tag> The HTML document shown below is composed of two parts : a header, delineated by the `<head>` and `</head>` markers, and a body (between the `<body>` and `</body>` markers). In the example below, the header only contains a title, but other types of information can be included in the header. The body contains an image, some text and a list with three hyperlinks. The image is included in the web page by indicating its URI between brackets inside the `<img>` marker. The image can, of course, reside on any server and the client will automatically download it when rendering the web page. The `<h1>` marker is used to specify the first level of headings. The `<ul>` marker indicates an unnumbered list while the `<li>` marker indicates a list item. The `<a href="URI">text</a>` marker indicates a hyperlink. The `text` will be underlined in the rendered web page and the client will fetch the specified URI when the user clicks on the link. .. figure:: ../../book/application/png/app-fig-015-c.png :align: center - :scale: 80 + :scale: 80 - A simple HTML page + A simple HTML page Additional details about the various extensions to HTML may be found in the `official specifications `_ maintained by W3C_. The third component of the `world wide web` is the HyperText Transfer Protocol (HTTP). HTTP is a text-based protocol, in which the client sends a request and the server returns a response. HTTP runs above the bytestream service and HTTP servers listen by default on port `80`. The design of HTTP has largely been inspired by the Internet email protocols. Each HTTP request contains three parts : - a `method`, that indicates the type of request, a URI, and the version of the HTTP protocol used by the client - a `header`, that is used by the client to specify optional parameters for the request. An empty line is used to mark the end of the header - an optional MIME document attached to the request @@ -102,11 +102,11 @@ The response sent by the server also contains three parts : - a `status line`, that indicates whether the request was successful or not - a `header`, that contains additional information about the response. The response header ends with an empty line. - - a MIME document + - a MIME document .. figure:: ../../book/application/svg/http-requests-responses.png :align: center - :scale: 60 + :scale: 60 HTTP requests and responses @@ -119,7 +119,7 @@ Several types of method can be used in HTTP requests. The three most important o GET /MarkUp/ HTTP/1.0 - - the `HEAD` method is a variant of the `GET` method that allows the retrieval of the header lines for a given URI without retrieving the entire document. It can be used by a client to verify if a document exists, for instance. + - the `HEAD` method is a variant of the `GET` method that allows the retrieval of the header lines for a given URI without retrieving the entire document. It can be used by a client to verify if a document exists, for instance. - the `POST` method can be used by a client to send a document to a server. The sent document is attached to the HTTP request as a MIME document. @@ -127,20 +127,20 @@ HTTP clients and servers can include many different HTTP headers in HTTP request - the `Content-Length:` header is the :term:`MIME` header that indicates the length of the MIME document in bytes. - the `Content-Type:` header is the :term:`MIME` header that indicates the type of the attached MIME document. HTML pages use the `text/html` type. - - the `Content-Encoding:` header indicates how the :term:`MIME document` has been encoded. For example, this header would be set to `x-gzip` for a document compressed using the gzip_ software. + - the `Content-Encoding:` header indicates how the :term:`MIME document` has been encoded. For example, this header would be set to `x-gzip` for a document compressed using the gzip_ software. :rfc:`1945` and :rfc:`2616` define headers that are specific to HTTP responses. These server headers include : - the `Server:` header indicates the version of the web server that has generated the HTTP response. Some servers provide information about their software release and optional modules that they use.
For security reasons, some system administrators disable these headers to avoid revealing too much information about their server to potential attackers. - the `Date:` header indicates when the HTTP response has been produced by the server. - - the `Last-Modified:` header indicates the date and time of the last modification of the document attached to the HTTP response. - + - the `Last-Modified:` header indicates the date and time of the last modification of the document attached to the HTTP response. + Similarly, the following header lines can only appear inside HTTP requests sent by a client : - the `User-Agent:` header provides information about the client that has generated the HTTP request. Some servers analyse this header line and return different headers and sometimes different documents for different user agents. - - the `If-Modified-Since:` header is followed by a date. It enables clients to cache in memory or on disk the recent or most frequently used documents. When a client needs to request a URI from a server, it first checks whether the document is already in its cache. If it is, the client sends a HTTP request with the `If-Modified-Since:` header indicating the date of the cached document. The server will only return the document attached to the HTTP response if it is newer than the version stored in the client's cache. - - the `Referrer:` header is followed by a URI. It indicates the URI of the document that the client visited before sending this HTTP request. Thanks to this header, the server can know the URI of the document containing the hyperlink followed by the client, if any. This information is very useful to measure the impact of advertisements containing hyperlinks placed on websites. - - the `Host:` header contains the fully qualified domain name of the URI being requested. + - the `If-Modified-Since:` header is followed by a date. It enables clients to cache in memory or on disk the recent or most frequently used documents. When a client needs to request a URI from a server, it first checks whether the document is already in its cache. If it is, the client sends a HTTP request with the `If-Modified-Since:` header indicating the date of the cached document. The server will only return the document attached to the HTTP response if it is newer than the version stored in the client's cache. + - the `Referer:` header is followed by a URI. It indicates the URI of the document that the client visited before sending this HTTP request. Thanks to this header, the server can know the URI of the document containing the hyperlink followed by the client, if any. This information is very useful to measure the impact of advertisements containing hyperlinks placed on websites. + - the `Host:` header contains the fully qualified domain name of the URI being requested. .. note:: The importance of the `Host:` header line @@ -150,9 +150,9 @@ Similarly, the following header lines can only appear inside HTTP requests sent GET /index.html HTTP/1.0 - By parsing this line, a server cannot determine which `index.html` file is requested. Thanks to the `Host:` header line, the server knows whether the request is for `http://web.example.com/index.html` or `http://www.dummy.net/index.html`. Without the `Host:` header, this is impossible. The `Host:` header line allowed web hosting companies to develop their business by supporting a large number of independent web servers on the same physical server. + By parsing this line, a server cannot determine which `index.html` file is requested. 
Thanks to the `Host:` header line, the server knows whether the request is for `http://web.example.com/index.html` or `http://www.dummy.net/index.html`. Without the `Host:` header, this is impossible. The `Host:` header line allowed web hosting companies to develop their business by supporting a large number of independent web servers on the same physical server. -The status line of the HTTP response begins with the version of HTTP used by the server (usually `HTTP/1.0` defined in :rfc:`1945` or `HTTP/1.1` defined in :rfc:`2616`) followed by a three digit status code and additional information in English. HTTP status codes have a similar structure as the reply codes used by SMTP. +The status line of the HTTP response begins with the version of HTTP used by the server (usually `HTTP/1.0` defined in :rfc:`1945` or `HTTP/1.1` defined in :rfc:`2616`) followed by a three-digit status code and additional information in English. HTTP status codes have a similar structure to the reply codes used by SMTP. - All status codes starting with digit `2` indicate a valid response. `200 Ok` indicates that the HTTP request was successfully processed by the server and that the response is valid. - All status codes starting with digit `3` indicate a redirection : the requested document is available at another location or has not been modified since it was cached. `301 Moved Permanently` indicates that the requested document is no longer available on this server. A `Location:` header containing the new URI of the requested document is inserted in the HTTP response. `304 Not Modified` is used in response to an HTTP request containing the `If-Modified-Since:` header. This status line is used by the server if the document stored on the server is not more recent than the date indicated in the `If-Modified-Since:` header. @@ -161,29 +161,29 @@ The status line of the HTTP response begins with the version of HTTP used by the In both the HTTP request and the HTTP response, the MIME document refers to a representation of the document with the MIME headers indicating the type of document and its size. -As an illustration of HTTP/1.0, the transcript below shows a HTTP request for `http://www.ietf.org `_ and the corresponding HTTP response. The HTTP request was sent using the curl_ command line tool. The `User-Agent:` header line contains more information about this client software. There is no MIME document attached to this HTTP request, and it ends with a blank line. +As an illustration of HTTP/1.0, the transcript below shows a HTTP request for `http://www.ietf.org `_ and the corresponding HTTP response. The HTTP request was sent using the curl_ command line tool. The `User-Agent:` header line contains more information about this client software. There is no MIME document attached to this HTTP request, and it ends with a blank line. .. code-block:: text - + GET / HTTP/1.0 User-Agent: curl/7.19.4 (universal-apple-darwin10.0) libcurl/7.19.4 OpenSSL/0.9.8l zlib/1.2.3 Host: www.ietf.org - + The HTTP response indicates the version of the server software used with the modules included. The `Last-Modified:` header indicates that the requested document was modified about one week before the request. A HTML document (not shown) is attached to the response. Note the blank line between the header of the HTTP response and the attached MIME document. The `Server:` header line has been truncated in this output. ..
code-block:: text - + HTTP/1.1 200 OK Date: Mon, 15 Mar 2010 13:40:38 GMT Server: Apache/2.2.4 (Linux/SUSE) mod_ssl/2.2.4 OpenSSL/0.9.8e (truncated) Last-Modified: Tue, 09 Mar 2010 21:26:53 GMT Content-Length: 17019 Content-Type: text/html - + -HTTP was initially designed to share self-contained text documents. For this reason, and to ease the implementation of clients and servers, the designers of HTTP chose to open a TCP connection for each HTTP request. This implies that a client must open one TCP connection for each URI that it wants to retrieve from a server as illustrated on the figure below. For a web page containing only text documents this was a reasonable design choice as the client usually remains idle while the (human) user is reading the retrieved document. +HTTP was initially designed to share self-contained text documents. For this reason, and to ease the implementation of clients and servers, the designers of HTTP chose to open a TCP connection for each HTTP request. This implies that a client must open one TCP connection for each URI that it wants to retrieve from a server, as illustrated in the figure below. For a web page containing only text documents, this was a reasonable design choice as the client usually remains idle while the (human) user is reading the retrieved document. .. figure:: ../../book/application/png/app-fig-016-c.png :align: center @@ -191,9 +191,9 @@ HTTP was initially designed to share self-contained text documents. For this rea HTTP 1.0 and the underlying TCP connection -However, as the web evolved to support richer documents containing images, opening a TCP connection for each URI became a performance problem [Mogul1995]_. Indeed, besides its HTML part, a web page may include dozens of images or more. Forcing the client to open a TCP connection for each component of a web page has two important drawbacks. First, the client and the server must exchange packets to open and close a TCP connection as we will see later. This increases the network overhead and the total delay of completely retrieving all the components of a web page. Second, a large number of established TCP connections may be a performance bottleneck on servers. +However, as the web evolved to support richer documents containing images, opening a TCP connection for each URI became a performance problem [Mogul1995]_. Indeed, besides its HTML part, a web page may include dozens of images or more. Forcing the client to open a TCP connection for each component of a web page has two important drawbacks. First, the client and the server must exchange packets to open and close a TCP connection, as we will see later. This increases the network overhead and the total delay of completely retrieving all the components of a web page. Second, a large number of established TCP connections may be a performance bottleneck on servers. -This problem was solved by extending HTTP to support persistent TCP connections :rfc:`2616`. A persistent connection is a TCP connection over which a client may send several HTTP requests. This is illustrated in the figure below. +This problem was solved by extending HTTP to support persistent TCP connections :rfc:`2616`. A persistent connection is a TCP connection over which a client may send several HTTP requests. This is illustrated in the figure below. ..
figure:: ../../book/application/svg/http-persistent.png :align: center @@ -206,17 +206,17 @@ To allow the clients and servers to control the utilisation of these persistent - The `Connection:` header is used with the `Keep-Alive` argument by the client to indicate that it expects the underlying TCP connection to be persistent. When this header is used with the `Close` argument, it indicates that the entity that sent it will close the underlying TCP connection at the end of the HTTP response. - The `Keep-Alive:` header is used by the server to inform the client about how it agrees to use the persistent connection. A typical `Keep-Alive:` contains two parameters : the maximum number of requests that the server agrees to serve on the underlying TCP connection and the timeout (in seconds) after which the server will close an idle connection. -The example below shows the operation of HTTP/1.1 over a persistent TCP connection to retrieve three URIs stored on the same server. Once the connection has been established, the client sends its first request with the `Connection: keep-alive` header to request a persistent connection. +The example below shows the operation of HTTP/1.1 over a persistent TCP connection to retrieve three URIs stored on the same server. Once the connection has been established, the client sends its first request with the `Connection: keep-alive` header to request a persistent connection. .. code-block:: text - + GET / HTTP/1.1 Host: www.kame.net - User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_2; en-us) + User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_2; en-us) Connection: keep-alive -The server replies with the `Connection: Keep-Alive` header and indicates that it accepts a maximum of 100 HTTP requests over this connection and that it will close the connection if it remains idle for 15 seconds. +The server replies with the `Connection: Keep-Alive` header and indicates that it accepts a maximum of 100 HTTP requests over this connection and that it will close the connection if it remains idle for 15 seconds. .. code-block:: text @@ -234,15 +234,15 @@ The server replies with the `Connection: Keep-Alive` header and indicates that i The client sends a second request for the style sheet of the retrieved web page. .. code-block:: text - + GET /style.css HTTP/1.1 Host: www.kame.net Referer: http://www.kame.net/ - User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_2; en-us) + User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_2; en-us) Connection: keep-alive -The server replies with the requested style sheet and maintains the persistent connection. Note that the server only accepts 99 remaining HTTP requests over this persistent connection. +The server replies with the requested style sheet and maintains the persistent connection. Note that the server only accepts 99 remaining HTTP requests over this persistent connection. .. code-block:: text @@ -257,14 +257,14 @@ The server replies with the requested style sheet and maintains the persistent c ... -Then the client automatically requests the web server's icon [#ffavicon]_, which could be displayed by the browser. This server does not contain such a URI and thus replies with a `404` HTTP status.
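An exchange of this kind can be reproduced in a few lines of code. The sketch below is an illustration only : it uses Python 3's standard `socket` module to send two `HEAD` requests (whose responses carry no MIME document, which keeps the parsing trivial) over a single persistent connection. The host name is the one used in the transcript; any HTTP/1.1 server could be substituted.

.. code-block:: python

    import socket

    HOST = "www.kame.net"  # server used in the transcript above

    def head(sock, path):
        # Each request must be self-contained: the Host: header is repeated every time.
        request = ("HEAD %s HTTP/1.1\r\n"
                   "Host: %s\r\n"
                   "Connection: keep-alive\r\n"
                   "\r\n" % (path, HOST))
        sock.sendall(request.encode("ascii"))
        # HEAD responses have no body; a real client would loop until the
        # empty line that terminates the header instead of a single recv().
        return sock.recv(4096).decode("ascii", "replace")

    conn = socket.create_connection((HOST, 80))
    print(head(conn, "/").splitlines()[0])             # e.g. HTTP/1.1 200 OK
    print(head(conn, "/favicon.ico").splitlines()[0])  # same TCP connection reused
    conn.close()

Note how the second request repeats the `Host:` header : as explained below, every request sent over the persistent connection must be self-contained.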
Despite this `404` response, the underlying TCP connection is not closed immediately. .. code-block:: text GET /favicon.ico HTTP/1.1 Host: www.kame.net Referer: http://www.kame.net/ - User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_2; en-us) + User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_2; en-us) Connection: keep-alive HTTP/1.1 404 Not Found @@ -280,7 +280,7 @@ Then the client automatically requests the web server's icon [#ffavicon]_ , that As illustrated above, a client can send several HTTP requests over the same persistent TCP connection. However, it is important to note that all of these HTTP requests are considered to be independent by the server. Each HTTP request must be self-contained. This implies that each request must include all the header lines that are required by the server to understand the request. The independence of these requests is one of the important design choices of HTTP. As a consequence of this design choice, when a server processes a HTTP request, it doesn't use any other information than what is contained in the request itself. This explains why the client adds its `User-Agent:` header in all of the HTTP requests it sends over the persistent TCP connection. -However, in practice, some servers want to provide content tuned for each user. For example, some servers can provide information in several languages or other servers want to provide advertisements that are targeted to different types of users. To do this, servers need to maintain some information about the preferences of each user and use this information to produce content matching the user's preferences. HTTP contains several mechanisms that enable to solve this problem. We discuss three of them below. +However, in practice, some servers want to provide content tuned for each user. For example, some servers can provide information in several languages or other servers want to provide advertisements that are targeted to different types of users. To do this, servers need to maintain some information about the preferences of each user and use this information to produce content matching the user's preferences. HTTP contains several mechanisms that make it possible to solve this problem. We discuss three of them below. A first solution is to force the users to be authenticated. This was the solution used by `FTP` to control the files that each user could access. Initially, user names and passwords could be included inside URIs :rfc:`1738`. However, placing passwords in the clear in a potentially publicly visible URI is completely insecure and this usage has now been deprecated :rfc:`3986`. HTTP supports several extension headers :rfc:`2617` that can be used by a server to request the authentication of the client by providing his/her credentials. However, user names and passwords have not been popular on web servers as they force human users to remember one user name and one password per server. Remembering a password is acceptable when a user needs to access protected content, but users will not accept the need for a user name and password only to receive targeted advertisements from the web sites that they visit. @@ -290,14 +290,14 @@ The third, and widely adopted, solution are HTTP cookies. HTTP cookies were init .. figure:: ../../book/application/svg/http-cookies.png :align: center - :scale: 60 + :scale: 60 HTTP cookies .. note:: Privacy issues with HTTP cookies - The HTTP cookies introduced by Netscape_ are key for large e-commerce websites.
However, they have also raised many discussions concerning their `potential misuses `_. Consider `ad.com`, a company that delivers lots of advertisements on web sites. A web site that wishes to include `ad.com`'s advertisements next to its content will add links to `ad.com` inside its HTML pages. If `ad.com` is used by many web sites, `ad.com` could be able to track the interests of all the users that visit its client websites and use this information to provide targeted advertisements. Privacy advocates have even `sued `_ online advertisement companies to force them to comply with the privacy regulations. More recent related technologies also raise `privacy concerns `_ - + The HTTP cookies introduced by Netscape_ are key for large e-commerce websites. However, they have also raised many discussions concerning their `potential misuses `_. Consider `ad.com`, a company that delivers lots of advertisements on web sites. A web site that wishes to include `ad.com`'s advertisements next to its content will add links to `ad.com` inside its HTML pages. If `ad.com` is used by many web sites, `ad.com` could be able to track the interests of all the users that visit its client websites and use this information to provide targeted advertisements. Privacy advocates have even `sued `_ online advertisement companies to force them to comply with the privacy regulations. More recent related technologies also raise `privacy concerns `_. + .. rubric:: Footnotes .. [#furiretrieve] An example of a non-retrievable URI is `urn:isbn:0-380-81593-1` which is a unique identifier for a book, through the urn scheme (see :rfc:`3187`). Of course, any URI can be made retrievable via a dedicated server or a new protocol but this one has no explicit protocol. The same holds for the `tag` scheme (see :rfc:`4151`), often used in Web syndication (see :rfc:`4287` about the Atom syndication format). Even when the scheme is retrievable (for instance with `http`), it is often used only as an identifier, not as a way to get a resource. See http://norman.walsh.name/2006/07/25/namesAndAddresses for a good explanation. diff --git a/book-2nd/protocols/ipv6b.rst b/book-2nd/protocols/ipv6b.rst index 9a10afc..47975f9 100644 --- a/book-2nd/protocols/ipv6b.rst +++ b/book-2nd/protocols/ipv6b.rst @@ -2,14 +2,14 @@ .. This file is licensed under a `creative commons licence `_ -.. warning:: +.. warning:: This is an unpolished draft of the second edition of this ebook. If you find any error or have suggestions to improve the text, please create an issue via https://github.com/obonaventure/cnp3/issues?milestone=9 The IPv6 subnet =============== -Until now, we have focussed our discussion on the utilisation of IPv6 on point-to-point links. Although there are point-to-point links in the Internet, mainly between routers and sometimes for endhosts, most of the endhosts are attached to datalink layer networks such as Ethernet LANs or WiFi networks. +Until now, we have focused our discussion on the utilisation of IPv6 on point-to-point links. Although there are point-to-point links in the Internet, mainly between routers and sometimes for endhosts, most of the endhosts are attached to datalink layer networks such as Ethernet LANs or WiFi networks.
These datalink layer networks play an important role in today's Internet and have heavily influenced the design of the operation of IPv6. To understand IPv6 and ICMPv6 completely, we first need to correctly understand the key principles behind these datalink layer technologies. As explained earlier, devices attached to a Local Area Network can directly exchange frames among themselves. For this, each datalink layer interface on a device (endhost, router, ...) attached to such a network is identified by a MAC address. Each datalink layer interface includes a unique hardwired MAC address. MAC addresses are allocated to manufacturers in blocks and each interface is numbered with a unique address. Thanks to the global unicity of the MAC addresses, the datalink layer service can assume that two hosts attached to a LAN have different addresses. Most LANs provide an unreliable connectionless service and a datalink layer frame has a header containing : @@ -58,7 +58,7 @@ Hosts ``A`` and ``B`` are attached to the same datalink layer network. They can A MAC address - MAC addresses are allocated in blocks of :math:`2^{20}`. When a company registers for a block of MAC addresses, it receives an identifier. company identifier is then used to populated the `c` bits of the MAC addresses. The company can allocate all addresses in starting with this prefix and manages the `m` bits as it wishes. + MAC addresses are allocated in blocks of :math:`2^{20}`. When a company registers for a block of MAC addresses, it receives an identifier. The company identifier is then used to populate the `c` bits of the MAC addresses. The company can allocate all the addresses starting with this prefix and manages the `m` bits as it wishes. .. figure:: pkt/macaddr-eui64.png :align: center @@ -92,9 +92,9 @@ The next step is to connect the LAN to the Internet. For this, a router is attac Assume that the LAN containing the two hosts and the router is assigned prefix ``2001:db8:1234:5678/64``. A first solution to configure the IPv6 addresses in this network is to assign them manually. A possible assignment is : - - ``2001:db8:1234:5678::1`` is assigned to ``router`` - - ``2001:db8:1234:5678::AA`` is assigned to ``hostA`` - - ``2001:db8:1234:5678::BB`` is assigned to ``hostB`` + - ``2001:db8:1234:5678::1`` is assigned to ``router`` + - ``2001:db8:1234:5678::AA`` is assigned to ``hostA`` + - ``2001:db8:1234:5678::BB`` is assigned to ``hostB`` .. index:: Address resolution problem, Neighbor Discovery Protocol, NDP @@ -102,7 +102,7 @@ To be able to exchange IPv6 packets with ``hostB``, ``hostA`` needs to know the .. index:: Neighbor Solicitation message -NDP allows a host to discover the MAC address used by any other host attached to the same LAN. NDP operates in two steps. First, the querier sends a multicast ICMPv6 Neighbor Solicitation message that contains as parameter the queried IPv6 address. This multicast ICMPv6 NS is placed inside a multicast frame [#fndpmulti]_. The queried node receives the frame, parses it and replies with a unicast ICMPv6 Neighbor Advertisement that provides its own IPv6 and MAC addresses. Upon reception of the Neighbor Advertisement message, the querier stores the mapping between the IPv6 and the MAC address inside its NDP table. This table is a data structure that maintains a cache of the recently received Neighbor Advertisement. Thanks to this cache, a host only needs to send a Neighbor Sollicitation message for the first packet that it sends to a given host.
After this initial packet, the NDP table can provide the mapping between the destination IPv6 address and the corresponding MAC address. +NDP allows a host to discover the MAC address used by any other host attached to the same LAN. NDP operates in two steps. First, the querier sends a multicast ICMPv6 Neighbor Solicitation message that contains as parameter the queried IPv6 address. This multicast ICMPv6 NS is placed inside a multicast frame [#fndpmulti]_. The queried node receives the frame, parses it and replies with a unicast ICMPv6 Neighbor Advertisement that provides its own IPv6 and MAC addresses. Upon reception of the Neighbor Advertisement message, the querier stores the mapping between the IPv6 and the MAC address inside its NDP table. This table is a data structure that maintains a cache of the recently received Neighbor Advertisement. Thanks to this cache, a host only needs to send a Neighbor Solicitation message for the first packet that it sends to a given host. After this initial packet, the NDP table can provide the mapping between the destination IPv6 address and the corresponding MAC address. .. msc:: router [label="router", linecolour=black], hostA [label="hostA", linecolour=black], hostB [label="hostB", linecolour=black]; hostA->* [ label = "NS : Who has 2001:db8:1234:5678::BB" ]; hostB->hostA [ label = "NA : 1234:5678:9abc:dede"]; - |||; + |||; The NS message can also be used to verify the reachability of a host in the local subnet. For this usage, NS messages can be sent in unicast since other nodes on the subnet do not need to process the message. @@ -122,11 +122,11 @@ When an entry in the NDP table times out on a host, it may either be deleted or .. index:: Duplicate Address Detection -This is not the only usage of the Neighbor Solicitation and Neighbor Advertisement messages. They are also used to detect the utilization of duplicate addresses. In the network above, consider what happens when a new host is connected to the LAN. If this host is configured by mistake with the same address as ``hostA`` (i.e. ``2001:db8:1234:5678::AA``), problems could occur.
Indeed, if two hosts have the same IPv6 address on the LAN, but different MAC addresses, it will be difficult to correctly reach them. IPv6 anticipated this problem and includes a `Duplicate Address Detection` Algorithm (DAD). When an IPv6 address [#flinklocal]_ is configured on a host, by any means, the host must verify the uniqueness of this address on the LAN. For this, it multicasts an ICMPv6 Neighbor Solicitation that queries the network for its newly configured address. The IPv6 source address of this NS is set to ``::`` (i.e. the reserved unassigned address) if the host does not already have an IPv6 address on this subnet. If the host receives no answer to this NS, the new address is considered to be unique and can safely be used. Otherwise, the new address is refused and an error message should be returned to the system administrator or a new IPv6 address should be generated. The `Duplicate Address Detection` Algorithm can prevent various operational problems that are often difficult to debug. -.. There are several differences between IPv6 and IPv4 when considering their interactions with the datalink layer. In IPv6, the interactions between the network and the datalink layer is performed using ICMPv6. +.. There are several differences between IPv6 and IPv4 when considering their interactions with the datalink layer. In IPv6, the interactions between the network and the datalink layer are performed using ICMPv6. Few users manually configure the IPv6 addresses on their hosts. They prefer to rely on protocols that can automatically configure their IPv6 addresses. IPv6 supports two such protocols : DHCPv6 and the Stateless Address Autoconfiguration (SLAAC). @@ -134,13 +134,13 @@ Few users manually configure the IPv6 addresses on their hosts. They prefer to r .. index:: DHCPv6, SLAC, Stateless Address Autoconfiguration -The Stateless Address Autoconfiguration (SLAAC) mechanism defined in :rfc:`4862` enables hosts to automatically configure their addresses without maintaining any state. When a host boots, it derives its identifier from its datalink layer address [#fprivacy]_ as explained earlier and concatenates this 64 bits identifier to the `FE80::/64` prefix to obtain its link-local IPv6 address. It then multicasts a Neighbour Solicitation with its link-local address as a target to verify whether another host is using the same link-local address on this subnet. If it receives a Neighbour Advertisement indicating that the link-local address is used by another host, it generates another 64 bits identifier and sends a new Neighbour Solicitation.
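As an aside, the derivation of this 64 bits identifier from a 48 bits MAC address follows the modified EUI-64 rule of :rfc:`4291` : the `ff:fe` bytes are inserted in the middle of the MAC address and the `universal/local` bit is flipped. A small illustrative sketch (the MAC address below is arbitrary) :

.. code-block:: python

    def link_local_from_mac(mac):
        # Modified EUI-64: insert ff:fe in the middle of the 48 bits MAC
        # address and flip the universal/local bit of the first byte.
        b = [int(x, 16) for x in mac.split(":")]
        b[0] ^= 0x02
        eui64 = b[:3] + [0xff, 0xfe] + b[3:]
        groups = ["%02x%02x" % (eui64[i], eui64[i + 1]) for i in range(0, 8, 2)]
        # Leading zeroes are kept here; a real stack would compress the address.
        return "fe80::" + ":".join(groups)

    print(link_local_from_mac("00:1b:63:aa:bb:cc"))  # fe80::021b:63ff:feaa:bbcc

Returning to the verification of the link-local address :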
If there is no answer, the host considers its link-local address to be valid. This address will be used as the source address for all NDP messages sent on the subnet. To automatically configure its global IPv6 address, the host must know the globally routable IPv6 prefix that is used on the local subnet. IPv6 routers regularly multicast ICMPv6 Router Advertisement messages that indicate the IPv6 prefix assigned to the subnet. The Router Advertisement message contains several interesting fields. .. figure:: pkt/router-adv.png :align: center - + Format of the ICMPv6 Router Advertisement message This message is sent from the link-local address of the router on the subnet. Its destination is the IPv6 multicast address that targets all IPv6 enabled hosts (i.e. ``ff02::1``). The `Cur Hop Limit` field, if different from zero, makes it possible to specify the default `Hop Limit` that hosts should use when sending IPv6 packets from this subnet. ``64`` is a frequently used value. The `M` and `O` bits are used to indicate that some information can be obtained from DHCPv6. The `Router Lifetime` parameter provides the expected lifetime (in seconds) of the sending router acting as a default router. This lifetime makes it possible to plan the replacement of a router by another one in the same subnet. The `Reachable Time` and the `Retrans Timer` parameters are used to configure the utilisation of the NDP protocol on the hosts attached to the subnet. @@ -192,16 +192,16 @@ The last point that needs to be explained about ICMPv6 is the `Redirect` message router2--lan; } -In this network, ``router1`` is the default router for all hosts. The second router, ``router2``, provides connectivity to a specific IPv6 subnet, e.g. ``2001:db8:abcd::/48``. These two routers attached to the same subnet can be used in different ways. First, it is possible to manually configure the routing tables on all hosts to add a route towards ``2001:db8:abcd::/48`` via ``router2``. Unfortunately, forcing such manual configuration negates all the benefits of using address auto-configuration in IPv6.
The second approach is to automatically configure a default route via ``router1`` on all hosts. With such a route, when a host needs to send a packet to any address within ``2001:db8:abcd::/48``, it will send it to ``router1``. ``router1`` would consult its routing table and find that the packet needs to be sent again on the subnet to reach ``router2``. This is a waste of time. A better approach would be to enable the hosts to automatically learn the new route. This is possible thanks to the ICMPv6 `Redirect` message. When ``router1`` receives a packet that needs to be forwarded back on the same interface, it replies with a `Redirect` message that indicates that the packet should have been sent via ``router2``. Upon reception of a `Redirect` message, the host updates its forwarding table to include a new transient entry for the destination reported in the message. A timeout is usually associated with this transient entry to automatically delete it after some time. + .. index:: DHCPv6 -An alternative is the Dynamic Host Configuration Protocol (DHCP) defined in :rfc:`2131` and :rfc:`3315`. DHCP allows a host to automatically retrieve its assigned IPv6 address, but relies on a server. A DHCP server is associated to each subnet [#fdhcpserver]_. Each DHCP server manages a pool of IPv6 addresses assigned to the subnet. When a host is first attached to the subnet, it sends a DHCP request message in a UDP segment (the DHCPv6 server listens on UDP port 547). As the host knows neither its IPv6 address nor the IPv6 address of the DHCP server, this UDP segment is sent inside a multicast packet targeted at the DHCP servers. The DHCP request may contain various options such as the name of the host, its datalink layer address, etc. The server captures the DHCP request and selects an unassigned address in its address pool. It then sends the assigned IPv6 address in a DHCP reply message which contains the datalink layer address of the host and additional information such as the subnet mask, the address of the default router or the address of the DNS resolver. The DHCP reply also specifies the lifetime of the address allocation. This forces the host to renew its address allocation once it expires.
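The address pool and lease mechanism can be modelled in a few lines. The sketch below is a toy illustration, not a DHCPv6 implementation : addresses are handed out with a lease and silently reclaimed once the lease has expired. All names and values are hypothetical.

.. code-block:: python

    import time

    class AddressPool(object):
        """Toy model of a DHCP server's address pool with leases."""
        def __init__(self, addresses, lease_time):
            self.free = list(addresses)
            self.leases = {}  # address -> (client identifier, expiry time)
            self.lease_time = lease_time

        def allocate(self, client):
            self.reclaim()
            addr = self.free.pop(0)  # raises IndexError if the pool is empty
            self.leases[addr] = (client, time.time() + self.lease_time)
            return addr, self.lease_time

        def reclaim(self):
            # Expired leases return to the pool, e.g. when a host was
            # powered off without releasing its address.
            now = time.time()
            for addr, (_, expiry) in list(self.leases.items()):
                if expiry < now:
                    del self.leases[addr]
                    self.free.append(addr)

    pool = AddressPool(["2001:db8:1234:5678::%x" % i for i in range(1, 10)], 3600)
    addr, lease = pool.allocate("client-00:1b:63:aa:bb:cc")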
Thanks to the limited lease time, IP addresses are automatically returned to the pool of addresses when hosts are powered off. -Both SLAAC and DHCPv6 can be extended to provide additional information beyond the IPv6 prefix/address. For example, :rfc:`6106` defines options for the ICMPv6 ND message that can carry the IPv6 address of the recursive DNS resolver and a list of default domain search suffixes. It is also possible to combine SLAAC with DHCPv6. :rfc:`3736` defines a stateless variant of DHCPv6 that can be used to distribute DNS information while SLAAC is used to distribute the prefixes. +Both SLAAC and DHCPv6 can be extended to provide additional information beyond the IPv6 prefix/address. For example, :rfc:`6106` defines options for the ICMPv6 ND message that can carry the IPv6 address of the recursive DNS resolver and a list of default domain search suffixes. It is also possible to combine SLAAC with DHCPv6. :rfc:`3736` defines a stateless variant of DHCPv6 that can be used to distribute DNS information while SLAAC is used to distribute the prefixes. @@ -216,7 +216,7 @@ Both SLAAC and DHCPv6 can be extended to provide additional information beyond t .. [#flinklocal] The DAD algorithm is also used with `link-local` addresses. -.. [#fprivacy] Using a datalink layer address to derive a 64 bits identifier for each host raises privacy concerns as the host will always use the same identifier. Attackers could use this to track hosts on the Internet. An extension to the Stateless Address Configuration mechanism that does not raise privacy concerns is defined in :rfc:`4941`. These privacy extensions allow a host to generate its 64 bits identifier randomly every time it attaches to a subnet. It then becomes impossible for an attacker to use the 64-bits identifier to track a host. +.. [#fprivacy] Using a datalink layer address to derive a 64 bits identifier for each host raises privacy concerns as the host will always use the same identifier. Attackers could use this to track hosts on the Internet. An extension to the Stateless Address Configuration mechanism that does not raise privacy concerns is defined in :rfc:`4941`. These privacy extensions allow a host to generate its 64 bits identifier randomly every time it attaches to a subnet. It then becomes impossible for an attacker to use the 64-bits identifier to track a host. .. [#fsend] Using a `Hop Limit` of ``255`` prevents one family of attacks against ICMPv6, but other attacks still remain possible. A detailed discussion of the security issues with IPv6 is outside the scope of this book. It is possible to secure NDP by using the `Cryptographically Generated IPv6 Addresses` (CGA) defined in :rfc:`3972`. The Secure Neighbour Discovery Protocol is defined in :rfc:`3971`. A detailed discussion of the security of IPv6 may be found in [HV2008]_. diff --git a/book-2nd/protocols/ppp.rst b/book-2nd/protocols/ppp.rst index 4aeecbe..74fad08 100644 --- a/book-2nd/protocols/ppp.rst +++ b/book-2nd/protocols/ppp.rst @@ -13,13 +13,13 @@ Many point-to-point datalink layers [#flapb]_ have been developed, starting in t The first solution to transport IP packets over a serial line was proposed in :rfc:`1055` and is known as `Serial Line IP` (SLIP). SLIP is a simple character stuffing technique applied to IP packets. SLIP defines two special characters : `END` (decimal 192) and `ESC` (decimal 219). 
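A minimal sketch of this character stuffing is shown below. It follows the simplified convention detailed in the next sentence, where `ESC` is prepended to special characters appearing inside the packet; the actual :rfc:`1055` encoding substitutes two-byte escape sequences instead.

.. code-block:: python

    END, ESC = 0xC0, 0xDB  # decimal 192 and 219

    def slip_frame(packet):
        # Escape the special characters inside the packet, then bracket
        # the result between two END characters.
        body = bytearray()
        for byte in packet:
            if byte in (END, ESC):
                body.append(ESC)
            body.append(byte)
        return bytes([END]) + bytes(body) + bytes([END])

    frame = slip_frame(b"\x45\x00\xc0\x01")  # the 0xc0 byte is escaped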
`END` appears at the beginning and at the end of each transmitted IP packet and the sender adds `ESC` before each `END` character inside each transmitted IP packet. SLIP only supports the transmission of IP packets and it assumes that the two communicating hosts/routers have been manually configured with each other's IP address. SLIP was mainly used over links whose bandwidth was often lower than 20 Kbps. On such a low bandwidth link, sending 20 bytes of IP header followed by 20 bytes of TCP header for each TCP segment takes a lot of time. This initiated the development of a family of techniques to efficiently compress the TCP/IP headers. The first header compression technique proposed in :rfc:`1144` was designed to exploit the redundancy between several consecutive segments that belong to the same TCP connection. In all these segments, the IP addresses and port numbers are always the same. Furthermore, fields such as the sequence and acknowledgement numbers do not change in a random way. :rfc:`1144` defined simple techniques to reduce the redundancy found in successive segments. The development of header compression techniques continued and there are still improvements being developed now :rfc:`5795`. While SLIP was implemented and used in some environments, it had several limitations discussed in :rfc:`1055`. The `Point-to-Point Protocol` (PPP) was designed shortly after and is specified in :rfc:`1548`. PPP aims to support IP and other network layer protocols over various types of serial lines. PPP is in fact a family of three protocols that are used together : - + #. The `Point-to-Point Protocol` defines the framing technique to transport network layer packets. - #. The `Link Control Protocol` that is used to negotiate options and authenticate the session by using username and password or other types of credentials + #. The `Link Control Protocol` that is used to negotiate options and authenticate the session by using username and password or other types of credentials. #. The `Network Control Protocol` that is specific for each network layer protocol. It is used to negotiate options that are specific for each protocol. For example, IPv4's NCP :rfc:`1548` can negotiate the IPv4 address to be used and the IPv4 address of the DNS resolver. IPv6's NCP is defined in :rfc:`5072`. The PPP framing :rfc:`1662` was inspired by the datalink layer protocols standardised by ITU-T and ISO. A typical PPP frame is composed of the fields shown in the figure below. A PPP frame starts with a one byte flag containing `01111110`. PPP can use bit stuffing or character stuffing depending on the environment where the protocol is used. The address and control fields are present for backward compatibility reasons. The 16 bit Protocol field contains the identifier [#fpppid]_ of the network layer protocol that is carried in the PPP frame. `0x002d` is used for an IPv4 packet compressed with :rfc:`1144` while `0x002f` is used for an uncompressed IPv4 packet. `0xc021` is used by the Link Control Protocol while `0xc023` is used by the Password Authentication Protocol (PAP). `0x0057` is used for IPv6 packets. PPP supports variable length packets, but LCP can negotiate a maximum packet length. The PPP frame ends with a Frame Check Sequence. The default is a 16 bits CRC, but some implementations can negotiate a 32 bits CRC. The frame ends with the `01111110` flag. - + ..
figure:: /../book/lan/pkt/ppp.png :align: center :scale: 100 @@ -28,10 +28,10 @@ The PPP framing :rfc:`1662` was inspired by the datalink layer protocols standar .. index:: Extensible Authentication Protocol, EAP -PPP played a key role in allowing Internet Service Providers to provide dial-up access over modems in the late 1990s and early 2000s. ISPs operated modem banks connected to the telephone network. For these ISPs, a key issue was to authenticate each user connected through the telephone network. This authentication was performed by using the `Extensible Authentication Protocol` (EAP) defined in :rfc:`3748`. EAP is a simple, but extensible protocol that was initially used by access routers to authenticate the users connected through dialup lines. Several authentication methods, starting from the simple username/password pairs to more complex schemes have been defined and implemented. When ISPs started to upgrade their physical infrastructure to provide Internet access over `Asymmetric Digital Subscriber Lines` (ADSL), they tried to reuse their existing authentication (and billing) systems. To meet these requirements, the IETF developed specifications to allow PPP frames to be transported over other networks than the point-to-point links for which PPP was designed. Nowadays, most ADSL deployments use PPP over either ATM :rfc:`2364` or Ethernet :rfc:`2516`. +PPP played a key role in allowing Internet Service Providers to provide dial-up access over modems in the late 1990s and early 2000s. ISPs operated modem banks connected to the telephone network. For these ISPs, a key issue was to authenticate each user connected through the telephone network. This authentication was performed by using the `Extensible Authentication Protocol` (EAP) defined in :rfc:`3748`. EAP is a simple, but extensible protocol that was initially used by access routers to authenticate the users connected through dialup lines. Several authentication methods, starting from the simple username/password pairs to more complex schemes have been defined and implemented. When ISPs started to upgrade their physical infrastructure to provide Internet access over `Asymmetric Digital Subscriber Lines` (ADSL), they tried to reuse their existing authentication (and billing) systems. To meet these requirements, the IETF developed specifications to allow PPP frames to be transported over other networks than the point-to-point links for which PPP was designed. Nowadays, most ADSL deployments use PPP over either ATM :rfc:`2364` or Ethernet :rfc:`2516`. .. rubric:: Footnotes -.. [#flapb] `LAPB `_ and `HDLC `_ were widely used datalink layer protocols. +.. [#flapb] `LAPB `_ and `HDLC `_ were widely used datalink layer protocols. .. [#fpppid] The IANA maintains the registry of all assigned PPP protocol fields at : http://www.iana.org/assignments/ppp-numbers diff --git a/book-2nd/protocols/routing.rst b/book-2nd/protocols/routing.rst index 066700b..0088992 100644 --- a/book-2nd/protocols/routing.rst +++ b/book-2nd/protocols/routing.rst @@ -1,20 +1,20 @@ .. Copyright |copy| 2013 by Olivier Bonaventure .. This file is licensed under a `creative commons licence `_ -.. warning:: +.. warning:: This is an unpolished draft of the second edition of this ebook. If you find any error or have suggestions to improve the text, please create an issue via https://github.com/obonaventure/cnp3/issues/new Routing in IP networks ====================== -In a large IP network such as the global Internet, routers need to exchange routing information. 
The Internet is an interconnection of networks, often called domains, that are under different responsibilities. As of this writing, the Internet is composed on more than 40,000 different domains and this number is still growing [#fas]_. A domain can be a small enterprise that manages a few routers in a single building, a larger enterprise with a hundred routers at multiple locations, or a large Internet Service Provider managing thousands of routers. Two classes of routing protocols are used to allow these domains to efficiently exchange routing information. +In a large IP network such as the global Internet, routers need to exchange routing information. The Internet is an interconnection of networks, often called domains, that are under the responsibility of different organisations. As of this writing, the Internet is composed of more than 40,000 different domains and this number is still growing [#fas]_. A domain can be a small enterprise that manages a few routers in a single building, a larger enterprise with a hundred routers at multiple locations, or a large Internet Service Provider managing thousands of routers. Two classes of routing protocols are used to allow these domains to efficiently exchange routing information. .. figure:: /../book/network/png/network-fig-093-c.png :align: center :scale: 70 - + Organisation of a small Internet @@ -29,7 +29,7 @@ A very important difference between intradomain and interdomain routing are the When we consider the interconnection of domains that are managed by different organisations, this is no longer true. Each domain implements its own routing policy. A routing policy is composed of three elements : an `import filter` that specifies which routes can be accepted by a domain, an `export filter` that specifies which routes can be advertised by a domain and a ranking algorithm that selects the best route when a domain knows several routes towards the same destination prefix. As we will see later, another important difference is that the objective of the interdomain routing protocol is to find the `cheapest` route towards each destination. There is only one interdomain routing protocol : :term:`BGP`. -Intradomain routing =================== In this section, we briefly describe the key features of the two main intradomain unicast routing protocols : RIP and OSPF. The basic principles of distance vector and link-state routing have been presented earlier. @@ -41,7 +41,7 @@ RIP The Routing Information Protocol (RIP) is the simplest routing protocol that was standardised for the TCP/IP protocol suite. RIP is defined in :rfc:`2453`. Additional information about RIP may be found in [Malkin1999]_. -RIP routers periodically exchange RIP messages. The format of these messages is shown below. A RIP message is sent inside a UDP segment whose destination port is set to `521`. A RIP message contains several fields. The `Cmd` field indicates whether the RIP message is a request or a response. When a router boots, its routing table is empty and it cannot forward any packet. To speedup the discovery of the network, it can send a request message to the RIP IPv6 multicast address, ``FF02::9``. All RIP routers listen to this multicast address and any router attached to the subnet will reply by sending its own routing table as a sequence of RIP messages. In steady state, routers multicast one of more RIP response messages every 30 seconds. These messages contain the distance vectors that summarize the router's routing table.
+RIP routers periodically exchange RIP messages. The format of these messages is shown below. A RIP message is sent inside a UDP segment whose destination port is set to `521`. A RIP message contains several fields. The `Cmd` field indicates whether the RIP message is a request or a response. When a router boots, its routing table is empty and it cannot forward any packet. To speed up the discovery of the network, it can send a request message to the RIP IPv6 multicast address, ``FF02::9``. All RIP routers listen to this multicast address and any router attached to the subnet will reply by sending its own routing table as a sequence of RIP messages. In steady state, routers multicast one or more RIP response messages every 30 seconds. These messages contain the distance vectors that summarize the router's routing table. The current version of RIP is version 2 defined in :rfc:`2453` for IPv4 and :rfc:`2080` for IPv6.

 .. figure:: pkt/ripng.png
    :align: center

@@ -61,7 +61,7 @@ Each RIP message contains a set of route entries. Each route entry is encoded as
 .. note:: A note on timers

-    The first RIP implementations sent their distance vector exactly every 30 seconds. This worked well in most networks, but some researchers noticed that routers were sometimes overloaded because they were processing too many distance vectors at the same time [FJ1994]_. They collected packet traces in these networks and found that after some time the routers' timers became synchronised, i.e. almost all routers were sending their distance vectors at almost the same time. This synchronisation of the transmission times of the distance vectors caused an overload on the routers' CPU but also increased the convergence time of the protocol in some cases. This was mainly due to the fact that all routers set their timers to the same expiration time after having processed the received distance vectors. `Sally Floyd`_ and `Van Jacobson`_ proposed in [FJ1994]_ a simple solution to solve this synchronisation problem. Instead of advertising their distance vector exactly after 30 seconds, a router should send its next distance vector after a delay chosen randomly in the [15,45] interval :rfc:`2080`. This randomisation of the delays prevents the synchronisation that occurs with a fixed delay and is now a recommended practice for protocol designers.
+    The first RIP implementations sent their distance vector exactly every 30 seconds. This worked well in most networks, but some researchers noticed that routers were sometimes overloaded because they were processing too many distance vectors at the same time [FJ1994]_. They collected packet traces in these networks and found that after some time the routers' timers became synchronised, i.e. almost all routers were sending their distance vectors at almost the same time. This synchronisation of the transmission times of the distance vectors caused an overload on the routers' CPU but also increased the convergence time of the protocol in some cases. This was mainly due to the fact that all routers set their timers to the same expiration time after having processed the received distance vectors. `Sally Floyd`_ and `Van Jacobson`_ proposed in [FJ1994]_ a simple solution to solve this synchronisation problem. Instead of advertising their distance vector exactly after 30 seconds, a router should send its next distance vector after a delay chosen randomly in the [15,45] interval :rfc:`2080`. This randomisation of the delays prevents the synchronisation that occurs with a fixed delay and is now a recommended practice for protocol designers.
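The randomised timer is easy to sketch in a few lines. The fragment below is only an illustration of the idea, not part of RIP itself: the `send_distance_vectors` function and the three-iteration loop are assumptions made for the example.

.. code-block:: python

    import random
    import time

    N = 30  # nominal advertisement period, in seconds

    def send_distance_vectors():
        # placeholder for the code that builds the RIP response
        # messages and multicasts them to FF02::9
        print("distance vectors sent")

    for _ in range(3):
        # draw each delay uniformly in [N/2, 3N/2] = [15, 45] instead of
        # sleeping exactly N seconds, to avoid synchronised routers
        time.sleep(random.uniform(N / 2, 3 * N / 2))
        send_distance_vectors()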
 .. index:: OSPF, Open Shortest Path First

@@ -72,31 +72,31 @@ Link-state routing protocols are used in IP networks. Open Shortest Path First (
 .. index:: OSPF area

-Compared to the basics of link-state routing protocols that we discussed in section :ref:`linkstate`, there are some particularities of OSPF that are worth discussing. First, in a large network, flooding the information about all routers and links to thousands of routers or more may be costly as each router needs to store all the information about the entire network. A better approach would be to introduce hierarchical routing. Hierarchical routing divides the network into regions. All the routers inside a region have detailed information about the topology of the region but only learn aggregated information about the topology of the other regions and their interconnections. OSPF supports a restricted variant of hierarchical routing. In OSPF's terminology, a region is called an `area`.
+Compared to the basics of link-state routing protocols that we discussed in section :ref:`linkstate`, there are some particularities of OSPF that are worth discussing. First, in a large network, flooding the information about all routers and links to thousands of routers or more may be costly as each router needs to store all the information about the entire network. A better approach would be to introduce hierarchical routing. Hierarchical routing divides the network into regions. All the routers inside a region have detailed information about the topology of the region but only learn aggregated information about the topology of the other regions and their interconnections. OSPF supports a restricted variant of hierarchical routing. In OSPF's terminology, a region is called an `area`.

-OSPF imposes restrictions on how a network can be divided into areas. An area is a set of routers and links that are grouped together. Usually, the topology of an area is chosen so that a packet sent by one router inside the area can reach any other router in the area without leaving the area [#fvirtual]_ . An OSPF area contains two types of routers :rfc:`2328`:
+OSPF imposes restrictions on how a network can be divided into areas. An area is a set of routers and links that are grouped together. Usually, the topology of an area is chosen so that a packet sent by one router inside the area can reach any other router in the area without leaving the area [#fvirtual]_. An OSPF area contains two types of routers :rfc:`2328`:

- - Internal router : A router whose directly connected networks belong to the area
- - Area border routers : A router that is attached to several areas.
+ - Internal router : A router whose directly connected networks belong to the area
+ - Area border router : A router that is attached to several areas.

 For example, the network shown in the figure below has been divided into three areas : `area 1`, containing routers `R1`, `R3`, `R4`, `R5` and `RA`, `area 2` containing `R7`, `R8`, `R9`, `R10`, `RB` and `RC`. OSPF areas are identified by a 32 bit integer, which is sometimes represented as an IP address. Among the OSPF areas, `area 0`, also called the `backbone area`, has a special role. The backbone area groups all the area border routers (routers `RA`, `RB` and `RC` in the figure below) and the routers that are directly connected to the backbone routers but do not belong to another area (router `RD` in the figure below). An important restriction imposed by OSPF is that the path between two routers that belong to two different areas (e.g. `R1` and `R8` in the figure below) must pass through the backbone area.
 .. figure:: /../book/network/png/network-fig-100-c.png
    :align: center
    :scale: 70
-
-   OSPF areas
+
+   OSPF areas

 Inside each non-backbone area, routers distribute the topology of the area by exchanging link state packets with the other routers in the area. The internal routers do not know the topology of other areas, but each router knows how to reach the backbone area. Inside an area, the routers only exchange link-state packets for all destinations that are reachable inside the area. In OSPF, the inter-area routing is done by exchanging distance vectors. This is illustrated by the network topology shown below.

 .. figure:: /protocols/figures/ospf-area.png
    :align: center
    :scale: 40
-
-   Hierarchical routing with OSPF
+
+   Hierarchical routing with OSPF

 Let us first consider OSPF routing inside `area 2`. All routers in the area learn a route towards `2001:db8:1234::/48` and `2001:db8:5678::/48`. The two area border routers, `RB` and `RC`, create network summary advertisements. Assuming that all links have a unit link metric, these would be:
-
+
 - `RB` advertises `2001:db8:1234::/48` at a distance of `2` and `2001:db8:5678::/48` at a distance of `3`
 - `RC` advertises `2001:db8:5678::/48` at a distance of `2` and `2001:db8:1234::/48` at a distance of `3`

@@ -133,14 +133,14 @@ The second OSPF particularity that is worth discussing is the support of Local A
     R1--lan;
     R2--lan;
     R3--lan;
-    R4--lan;
+    R4--lan;
 }

 A first solution to support such a LAN with a link-state routing protocol would be to consider that a LAN is equivalent to a full-mesh of point-to-point links as if each router can directly reach any other router on the LAN. However, this approach has two important drawbacks :

 #. Each router must exchange HELLOs and link state packets with all the other routers on the LAN. This increases the number of OSPF packets that are sent and processed by each router.
-#. Remote routers, when looking at the topology distributed by OSPF, consider that there is a full-mesh of links between all the LAN routers. Such a full-mesh implies a lot of redundancy in case of failure, while in practice the entire LAN may completely fail. In case of a failure of the entire LAN, all routers need to detect the failures and flood link state packets before the LAN is completely removed from the OSPF topology by remote routers.
+#. Remote routers, when looking at the topology distributed by OSPF, consider that there is a full-mesh of links between all the LAN routers. Such a full-mesh implies a lot of redundancy in case of failure, while in practice the entire LAN may completely fail. In case of a failure of the entire LAN, all routers need to detect the failures and flood link state packets before the LAN is completely removed from the OSPF topology by remote routers.

 To better represent LANs and reduce the number of OSPF packets that are exchanged, OSPF handles LAN differently. When OSPF routers boot on a LAN, they elect [#felection]_ one of them as the `Designated Router (DR)` :rfc:`2328`. The `DR` router `represents` the local area network, and advertises the LAN's subnet. Furthermore, LAN routers only exchange HELLO packets with the `DR`. Thanks to the utilisation of a `DR`, the topology of the LAN appears as a set of point-to-point links connected to the `DR` router.
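The election mentioned above can be sketched as follows. This is a deliberately simplified illustration of the :rfc:`2328` procedure: each router is reduced to a hypothetical `(priority, router_id)` pair and the identifier is used as a tie-breaker.

.. code-block:: python

    def elect_dr(routers):
        # routers with priority 0 never become DR; among the others the
        # highest priority wins, the highest identifier breaking ties
        eligible = [r for r in routers if r[0] > 0]
        return max(eligible) if eligible else None

    # hypothetical LAN with four routers given as (priority, router_id)
    lan = [(1, 1), (10, 2), (10, 3), (0, 4)]
    print(elect_dr(lan))  # (10, 3)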
@@ -152,7 +152,7 @@ To better represent LANs and reduce the number of OSPF packets that are exchange
 - the routers that are adjacent to the failure detect it quickly. The default solution is to rely on the regular exchange of HELLO packets. However, the interval between successive HELLOs is often set to 10 seconds... Setting the HELLO timer down to a few milliseconds is difficult as HELLO packets are created and processed by the main CPU of the routers and these routers cannot easily generate and process a HELLO packet every millisecond on each of their interfaces. A better solution is to use a dedicated failure detection protocol such as the Bidirectional Forwarding Detection (BFD) protocol defined in [KW2009]_ that can be implemented directly on the router interfaces. Another solution to be able to detect the failure is to instrument the physical and the datalink layer so that they can interrupt the router when a link fails. Unfortunately, such a solution cannot be used on all types of physical and datalink layers.
 - the routers that have detected the failure flood their updated link state packets in the network
- - all routers update their routing table
+ - all routers update their routing table

 A last, but operationally important, point needs to be discussed about intradomain routing protocols such as OSPF and IS-IS. Intradomain routing protocols always select the shortest path for each destination. In practice, there are often several equal paths towards the same destination. When a router computes several equal cost paths towards one destination, it can use these paths in different ways.

@@ -161,7 +161,7 @@ A first approach is to select one of the equal cost paths (e.g. the first or the
 A second approach is to install all equal cost paths [#fmaxpaths]_ in the forwarding table and load-balance the packets on the different paths. Consider the case where a router has `N` different outgoing interfaces to reach destination `d`. A first possibility to load-balance the traffic among these interfaces is to use `round-robin`. `Round-robin` allows to equally balance the packets among the `N` outgoing interfaces. This equal load-balancing is important in practice because it allows to better spread the load throughout the network. However, few networks use this `round-robin` strategy to load-balance traffic on routers. The main drawback of `round-robin` is that packets that belong to the same flow (e.g. TCP connection) may be forwarded over different paths. If packets belonging to the same TCP connection are sent over different paths, they will probably experience different delays and arrive out-of-sequence at their destination. When a TCP receiver detects out-of-order segments, it sends duplicate acknowledgements that may cause the sender to initiate a fast retransmission and enter congestion avoidance. Thus, out-of-order segments may lead to lower TCP performance. This is annoying for a load-balancing technique whose objective is to improve the network performance by spreading the load.

-To efficiently spread the load over different paths, routers need to implement `per-flow` load-balancing. This implies that they must forward all the packets that belong to the same flow on the same path. Since a TCP connection is always identified by the four-tuple (source and destination addresses, source and destination ports), one possibility would be to select an outgoing interface upon arrival of the first packet of the flow and store this decision in the router's memory. Unfortunately, such a solution does not scale since the required memory grows with the number of TCP connections that pass through the router.
+To efficiently spread the load over different paths, routers need to implement `per-flow` load-balancing. This implies that they must forward all the packets that belong to the same flow on the same path. Since a TCP connection is always identified by the four-tuple (source and destination addresses, source and destination ports), one possibility would be to select an outgoing interface upon arrival of the first packet of the flow and store this decision in the router's memory. Unfortunately, such a solution does not scale since the required memory grows with the number of TCP connections that pass through the router.

 Fortunately, it is possible to perform `per-flow` load balancing without maintaining any state on the router. Most routers today use hash functions for this purpose :rfc:`2991`. When a packet arrives, the router extracts the Next Header information and the four-tuple from the packet and computes :
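The formula itself falls outside the hunk shown below, but the :rfc:`2991` idea boils down to hashing these five values and taking the result modulo the number of interfaces. A rough sketch, using CRC-32 as a stand-in for the router's hash function:

.. code-block:: python

    import zlib

    def select_interface(next_header, src, dst, sport, dport, n):
        # all packets of a flow hash to the same value and therefore
        # always leave over the same interface, without per-flow state
        key = f"{next_header}|{src}|{dst}|{sport}|{dport}".encode()
        return zlib.crc32(key) % n

    # every segment of this TCP connection uses the same interface
    print(select_interface(6, "2001:db8::1", "2001:db8::2", 49152, 80, 4))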
@@ -178,7 +178,7 @@ In this formula, `N` is the number of outgoing interfaces on the equal cost path
 .. [#fvirtual] OSPF can support `virtual links` to connect routers together that belong to the same area but are not directly connected. However, this goes beyond this introduction to OSPF.

-.. [#felection] The OSPF Designated Router election procedure is defined in :rfc:`2328`. Each router can be configured with a router priority that influences the election process since the router with the highest priority is preferred when an election is run.
+.. [#felection] The OSPF Designated Router election procedure is defined in :rfc:`2328`. Each router can be configured with a router priority that influences the election process since the router with the highest priority is preferred when an election is run.

 .. [#fmaxpaths] In some networks, there are several dozens of paths towards a given destination. Some routers, due to hardware limitations, cannot install more than 8 or 16 paths in their forwarding table. In this case, a subset of the computed paths is installed in the forwarding table.

diff --git a/book-2nd/protocols/rpc.rst b/book-2nd/protocols/rpc.rst
index bab8a4f..2ee39e4 100644
--- a/book-2nd/protocols/rpc.rst
+++ b/book-2nd/protocols/rpc.rst
@@ -17,7 +17,7 @@ In traditional programming languages, `procedure calls` allow programmers to bet
 This model was developed with a single host in mind. How should it be modified if the caller and the callee are different hosts connected through a network ? Since the two hosts can be different, the two main problems are the fact they do not share the same memory and that they do not necessarily use the same representation for numbers, characters, ... Let us examine how the five steps identified above can be supported through a network.

-The first problem to be solved is how to transfer the information from the caller to the callee. This problem is not simple and includes two sub-problems. The first subproblem is the encoding of the information. How to encode the values of the parameters so that they can be transferred correctly through the network ? The second problem is how to reach the callee through the network ? The callee is identified by a procedure name, but to use the transport service, we need to convert this name into an address and a port number.
+The first problem to be solved is how to transfer the information from the caller to the callee. This problem is not simple and includes two sub-problems. The first subproblem is the encoding of the information. How to encode the values of the parameters so that they can be transferred correctly through the network ? The second subproblem is how to reach the callee through the network ? The callee is identified by a procedure name, but to use the transport service, we need to convert this name into an address and a port number.

 .. index:: XDR

 Encoding data
 -------------

 The encoding problem exists in a wide range of applications. In the previous sections, we have described how character-based encodings are used by email and http. Although standard encoding techniques such as ASN.1 [Dubuisson2000]_ have been defined to cover most application needs, many applications have defined their specific encoding. `Remote Procedure Calls` are no exception to this rule. The three most popular encoding methods are probably XDR :rfc:`1832`, used by ONC-RPC :rfc:`1831`, XML, used by XML-RPC, and JSON :rfc:`4627`.

-The eXternal Data Representation (XDR) Standard, defined in :rfc:`1832` is an early specification that describes how information exchanged during Remote Procedure Calls should be encoded before being transmitted through a network. Since the transport service allows to transfer a block of bytes (with the connectionless service) or a stream of bytes (by using the connection-oriented service), XDR maps each datatype onto a sequence of bytes. The caller encodes each data in the appropriate sequence and the callee decodes the received information. Here are a few examples extracted from :rfc:`1832` to illustrate how this encoding/decoding can be performed.
+The eXternal Data Representation (XDR) Standard, defined in :rfc:`1832`, is an early specification that describes how information exchanged during Remote Procedure Calls should be encoded before being transmitted through a network. Since the transport service allows to transfer a block of bytes (with the connectionless service) or a stream of bytes (by using the connection-oriented service), XDR maps each datatype onto a sequence of bytes. The caller encodes each data in the appropriate sequence and the callee decodes the received information. Here are a few examples extracted from :rfc:`1832` to illustrate how this encoding/decoding can be performed.

 For basic data types, :rfc:`1832` simply maps their representation into a sequence of bytes. For example, a 32 bits integer is transmitted as follows (with the most significant byte first, which corresponds to big-endian encoding).

 .. figure:: /protocols/pkt/xdr-integer.png
    :align: center
-
+

 XDR also supports 64 bits integers and booleans. The booleans are mapped onto integers (`0` for `false` and `1` for `true`). For the floating point numbers, the encoding defined in the IEEE standard is used.

 .. figure:: /protocols/pkt/xdr-integer-64.png
    :align: center

-In this representation, the first bit (`S`) is the sign (`0` represents positive). The next 11 bits represent the exponent of the number (`E`), in base 2, and the remaining 52 bits are the fractional part of the number (`F`). The floating point number that corresponds to this representation is :math:`(-1)^{S} \times 2^{E-1023} \times 1.F`. XDR also allows to encode complex data types. A first example is the string of bytes. A string of bytes is composed of two parts : a length (encoded as an integer) and a sequence of bytes. For performance reasons, the encoding of a string is aligned to 32 bits boundaries. This implies that some padding bytes may be inserted during the encoding operation is the length of the string is not a multiple of 4. The structure of the string is shown below (source :rfc:`1832`).
+In this representation, the first bit (`S`) is the sign (`0` represents positive). The next 11 bits represent the exponent of the number (`E`), in base 2, and the remaining 52 bits are the fractional part of the number (`F`). The floating point number that corresponds to this representation is :math:`(-1)^{S} \times 2^{E-1023} \times 1.F`. XDR also allows to encode complex data types. A first example is the string of bytes. A string of bytes is composed of two parts : a length (encoded as an integer) and a sequence of bytes. For performance reasons, the encoding of a string is aligned to 32 bits boundaries. This implies that some padding bytes may be inserted during the encoding operation if the length of the string is not a multiple of 4. The structure of the string is shown below (source :rfc:`1832`).

 .. figure:: /protocols/pkt/xdr-double.png
    :align: center
-In some situations, it is necessary to encode fixed or variable length arrays. XDR :rfc:`1832` supports such arrays. For example, the encoding below corresponds to a variable length array containing n elements. The encoded representation starts with an integer that contains the number of elements and follows with all elements in sequence. It is also possible to encode a fixed-length array. In this case, the first integer is missing.
+In some situations, it is necessary to encode fixed or variable length arrays. XDR :rfc:`1832` supports such arrays. For example, the encoding below corresponds to a variable length array containing n elements. The encoded representation starts with an integer that contains the number of elements and follows with all elements in sequence. It is also possible to encode a fixed-length array. In this case, the first integer is missing.

 .. figure:: /protocols/pkt/xdr-array.png
    :align: center

-XDR also supports the definition of unions, structures, ... Additional details are provided in :rfc:`1832`.
+XDR also supports the definition of unions, structures, ... Additional details are provided in :rfc:`1832`.

 A second popular method to encode data is the JavaScript Object Notation (JSON). This syntax was initially defined to allow applications written in JavaScript to exchange data, but it has now wider usages. JSON :rfc:`4627` is a text-based representation. The simplest data type is the integer. It is represented as a sequence of digits in ASCII. Strings can also be encoded by using JSON. A JSON string always starts and ends with a quote character (`"`) as in the C language. As in the C language, some characters (like `"` or `\\`) must be escaped if they appear in a string. :rfc:`4627` describes this in detail. Booleans are also supported by using the strings `false` and `true`. Like XDR, JSON supports more complex data types. A structure or object is defined as a comma separated list of elements enclosed in curly brackets.
 :rfc:`4627` provides the following example as an illustration.

@@ -73,14 +73,14 @@ A second popular method to encode data is the JavaScript Object Notation (JSON).
         },
         "ID": 1234
       }
-   }
+   }

 This object has one field named `Image`. It has five attributes. The first one, `Width`, is an integer set to 800. The third one is a string. The fourth attribute, `Thumbnail`, is also an object composed of three different attributes, one string and two integers. JSON can also be used to encode arrays or lists. In this case, square brackets are used as delimiters. The snippet below shows an array which contains the prime integers that are smaller than ten.

 .. code-block:: javascript

-   {
+   {
     "Primes" : [ 2, 3, 5, 7 ]
    }

@@ -110,7 +110,7 @@ Upon reception of this JSON structure, the callee parses the object, locates the
 - `result`: if the request succeeded, this member contains the result of the request (in our example, value `4`).
 - `error`: if the method called does not exist or its execution causes an error, the `result` element will be replaced by an `error` element which contains the following members :

-  - `code`: a number that indicates the type of error. Several error codes are defined in [JSON-RPC2]_. For example, `-32700` indicates an error in parsing the request, `-32602` indicates invalid parameters and `-32601` indicates that the method could not be found on the server. Other error codes are listed in [JSON-RPC2]_.
+  - `code`: a number that indicates the type of error. Several error codes are defined in [JSON-RPC2]_. For example, `-32700` indicates an error in parsing the request, `-32602` indicates invalid parameters and `-32601` indicates that the method could not be found on the server.
  - `message`: a string (limited to one sentence) that provides a short description of the error.
  - `data`: an optional field that provides additional information about the error.

@@ -118,7 +118,7 @@ Coming back to our example with the call for the `sum` procedure, it would retur
 .. code-block:: javascript

-   { "jsonrpc": "2.0", "result": 4, "id": 1}
+   { "jsonrpc": "2.0", "result": 4, "id": 1}

 If the `sum` method is not implemented on the server, it would reply with the following response.

@@ -128,7 +128,7 @@ If the `sum` method is not implemented on the server, it would reply with the fo
 { "jsonrpc": "2.0", "error": {"code": -32601, "message": "Method not found"}, "id": "1"}

-The `id` field, which is present in the request and the response plays the same role as the identifier field in the DNS message. It allows the caller to match the response with the request that it sent. This `id` is very important when JSON-RPC is used over the connectionless service which is unreliable. If a request is sent, it may need to be retransmitted and it is possible that a callee will receive twice the same request (e.g. if the response for the first request was lost). In the DNS, when a request is lost, it can be retransmitted without causing any difficulty. However with remote procedure calls in general, losses can cause some problems. Consider a method which is used to deposit money on a bank account. If the request is lost, it will be retransmitted and the deposit will be eventually performed. However, if the response is lost, the caller will also retransmit its request. This request will be received by the callee that will deposit the money again. To prevent this problem from affecting the application, either the programmer must ensure that the remote procedures that it calls can be safely called multiple times or the application must verify whether the request has been transmitted earlier. In most deployments, the programmers use remote methods that can be safely called multiple times without breaking the application logic.
+The `id` field, which is present in the request and the response, plays the same role as the identifier field in the DNS message. It allows the caller to match the response with the request that it sent. This `id` is very important when JSON-RPC is used over the connectionless service which is unreliable. If a request is sent, it may need to be retransmitted and it is possible that a callee will receive the same request twice (e.g. if the response for the first request was lost). In the DNS, when a request is lost, it can be retransmitted without causing any difficulty. However, with remote procedure calls in general, losses can cause some problems. Consider a method which is used to deposit money on a bank account. If the request is lost, it will be retransmitted and the deposit will eventually be performed. However, if the response is lost, the caller will also retransmit its request. This request will be received by the callee that will deposit the money again. To prevent this problem from affecting the application, either the programmer must ensure that the remote procedures that it calls can be safely called multiple times or the application must verify whether the request has been transmitted earlier. In most deployments, the programmers use remote methods that can be safely called multiple times without breaking the application logic.
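A simple way to obtain this `at-most-once` behaviour is to cache the responses by their `id`, as sketched below. The `deposit` method and the in-memory cache are assumptions made for the example, not part of JSON-RPC itself.

.. code-block:: python

    import json

    processed = {}  # responses already sent, keyed by request id

    def deposit(account, amount):
        print("depositing", amount, "on", account)
        return "ok"

    methods = {"deposit": deposit}

    def handle(raw):
        request = json.loads(raw)
        rid = request["id"]
        if rid in processed:
            # duplicate request : resend the stored response instead
            # of performing the deposit a second time
            return processed[rid]
        result = methods[request["method"]](*request["params"])
        response = json.dumps({"jsonrpc": "2.0", "result": result, "id": rid})
        processed[rid] = response
        return response

    r = '{"jsonrpc": "2.0", "method": "deposit", "params": ["acct", 10], "id": 1}'
    print(handle(r))
    print(handle(r))  # retransmitted request : same response, one deposit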
 .. index:: portmapper

diff --git a/book-2nd/protocols/sctp.rst b/book-2nd/protocols/sctp.rst
index 627aec7..d5a5d05 100644
--- a/book-2nd/protocols/sctp.rst
+++ b/book-2nd/protocols/sctp.rst
@@ -1,6 +1,6 @@
 .. Copyright |copy| 2013 by Olivier Bonaventure
 .. This file is licensed under a `creative commons licence `_
-.. Part of this text has been extracted from . Recent Advances in Reliable Transport Protocols (Costin Raiciu , Olivier Bonaventure, Janardhan Iyengar), http://www.sigcomm.org/content/ebook
+.. Part of this text has been extracted from Recent Advances in Reliable Transport Protocols (Costin Raiciu, Olivier Bonaventure, Janardhan Iyengar), http://www.sigcomm.org/content/ebook

 .. index:: SCTP
 .. _SCTP:

@@ -16,12 +16,12 @@ One of the first motivations for SCTP was the need to efficiently support multih
 A second motivation for designing SCTP was to provide a different service than TCP's bytestream to the applications. A first service brought by SCTP is the ability to exchange messages instead of only a stream of bytes. This is a major modification which has many benefits for applications. Unfortunately, there are many deployed applications that have been designed under the assumption of the bytestream service. Rewriting them to benefit from a message-mode service will require a lot of effort. It seems unlikely as of this writing to expect old applications to be rewritten to fully support SCTP and use it. However, some new applications are considering using SCTP instead of TCP. Voice over IP signaling protocols are a frequently cited example. The Real-Time Communication in Web-browsers working group is also considering the utilization of SCTP for some specific data channels [JLT2013]_.
 From a service viewpoint, a second advantage of SCTP compared to TCP is its ability to support several simultaneous streams. Consider a web application that needs to retrieve five objects from a remote server. With TCP, one possibility is to open one TCP connection for each object, send a request over each connection and retrieve one object per connection. This is the solution used by HTTP/1.0 as explained earlier. The drawback of this approach is that the application needs to maintain several concurrent TCP connections. Another solution is possible with HTTP/1.1 [NGB+1997]_. With HTTP/1.1, the client can use pipelining to send several HTTP Requests without waiting for the answer of each request. The server replies to these requests in sequence, one after the other. If the server replies to the requests in sequence, this may lead to `head-of-line blocking` problems. Consider that the objects have different sizes. The first object is a large 10 MBytes image while the other objects are small javascript files. In this case, delivering the objects in sequence will cause a very long delay for the javascript files since they will only be transmitted once the large image has been sent.

-With SCTP, `head-of-line blocking` can be mitigated. SCTP can open a single connection and divide it in five logical streams so that the five objects are sent in parallel over the single connection. SCTP controls the transmission of the segments over the connection and ensures that the data is delivered efficiently to the application. In the example above, the small javascript files could be delivered as independent messages before the large image.
+With SCTP, `head-of-line blocking` can be mitigated. SCTP can open a single connection and divide it in five logical streams so that the five objects are sent in parallel over the single connection. SCTP controls the transmission of the segments over the connection and ensures that the data is delivered efficiently to the application. In the example above, the small javascript files could be delivered as independent messages before the large image.

 Another extension to SCTP :rfc:`3758` supports partially-reliable delivery. With this
-extension, an SCTP sender can be instructed to "expire" data based on
+extension, an SCTP sender can be instructed to "expire" data based on
 one of several events, such as a timeout. When the data has expired,
-the sender can signal the SCTP receiver to move on without waiting for
+the sender can signal the SCTP receiver to move on without waiting for
 the `expired` data. This partially reliable service could be useful to provide timed delivery for example. With this service, there is an upper limit on the time

@@ -40,11 +40,11 @@ the data is discarded by the sender without causing any stall in the stream.
 SCTP segments
 -------------

-SCTP entities exchange segments. In contrast with TCP that uses a simple segment format with a limited space for the options, the designers of SCTP have learned from the experience of using and extending TCP during almost two decades. An SCTP segment is always composed of a fixed size `common header` followed
-by a variable number of chunks. The `common header` is 12 bytes long and contains four fields. The first two fields and the `Source` and `Destination` ports that allow to identify the SCTP connection. The `Verification tag` is a field that is set during connection establishment and placed in all segments exchanged during a connection to validate the received segments. The last field of the common header is a 32bits CRC. This CRC is computed over the entire segment (common header and all chunks). It is computed by the sender and verified by the receiver. Note that although this field is named `Checksum` :rfc:`4960` it is computed by using the CRC-32 algorithm that has much stronger error detection capabilities than the Internet checksum algorithm used by TCP [SGP98]_.
+SCTP entities exchange segments. In contrast with TCP that uses a simple segment format with a limited space for the options, the designers of SCTP have learned from the experience of using and extending TCP during almost two decades. An SCTP segment is always composed of a fixed size `common header` followed
+by a variable number of chunks. The `common header` is 12 bytes long and contains four fields. The first two fields are the `Source` and `Destination` ports that allow to identify the SCTP connection. The `Verification tag` is a field that is set during connection establishment and placed in all segments exchanged during a connection to validate the received segments. The last field of the common header is a 32bits CRC. This CRC is computed over the entire segment (common header and all chunks). It is computed by the sender and verified by the receiver. Note that although this field is named `Checksum` :rfc:`4960` it is computed by using the CRC-32 algorithm that has much stronger error detection capabilities than the Internet checksum algorithm used by TCP [SGP98]_.

 .. figure:: /protocols/pkt/sctp-header-chunks.png
-
+
    The SCTP segment format
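The common header is simple enough to be built in a few lines. The sketch below only packs and unpacks the four fields; the CRC is left at zero here, whereas a real implementation computes it over the entire segment.

.. code-block:: python

    import struct

    COMMON_HEADER = ">HHII"  # source port, destination port, tag, CRC

    def common_header(sport, dport, verification_tag, crc=0):
        # 12 bytes : two 16 bits ports, the 32 bits verification tag
        # and the 32 bits CRC
        return struct.pack(COMMON_HEADER, sport, dport, verification_tag, crc)

    header = common_header(5000, 80, 0x1234ABCD)
    print(struct.unpack(COMMON_HEADER, header))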
 .. index:: SCTP chunk

@@ -57,7 +57,7 @@ The SCTP chunks play a key role in the extensibility of SCTP. In TCP, the extens
    The SCTP chunk format

-The first byte indicates the chunk type. 15 chunk types are defined in :rfc:`4960` and new ones can be easily added. The low-order 16 bits of the first word contain the length of the chunk in bytes. The presence of the length field ensures that any SCTP implementation will be able to correctly parse any received SCTP segment, even if it contains unknown or new chunks. To further ease the processing of unknown chunks, :rfc:`4960` uses the first two bits of the chunk type to specify how an SCTP implementation should react when receiving an unknown chunk. If the two high-order bits of the type of the unknown are set to ``00``, then the entire SCTP segment containing the chunk should be discarded. It is expected that all SCTP implementations are capable of recognizing and processing these chunks. If the first two bits of the chunk type are set to ``01`` the SCTP segment must be discarded and an error reported to the sender. If the two high order bits of the type are set to ``10`` (resp. ``11``), the chunk must be ignored, but the processing of the other chunks in the SCTP segment continues (resp. and an error is reported). The second byte contains flags that are used for some chunks.
+The first byte indicates the chunk type. 15 chunk types are defined in :rfc:`4960` and new ones can be easily added. The low-order 16 bits of the first word contain the length of the chunk in bytes. The presence of the length field ensures that any SCTP implementation will be able to correctly parse any received SCTP segment, even if it contains unknown or new chunks. To further ease the processing of unknown chunks, :rfc:`4960` uses the first two bits of the chunk type to specify how an SCTP implementation should react when receiving an unknown chunk. If the two high-order bits of the type of the unknown chunk are set to ``00``, then the entire SCTP segment containing the chunk should be discarded. It is expected that all SCTP implementations are capable of recognizing and processing these chunks. If the first two bits of the chunk type are set to ``01``, the SCTP segment must be discarded and an error reported to the sender. If the two high order bits of the type are set to ``10`` (resp. ``11``), the chunk must be ignored, but the processing of the other chunks in the SCTP segment continues (resp. and an error is reported). The second byte contains flags that are used for some chunks.
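The two high-order bits translate directly into code. The sketch below walks through the chunks of a received segment; the `known` set and the helper functions are placeholders invented for the example.

.. code-block:: python

    import struct

    known = {0, 1, 2}  # chunk types this sketch pretends to support

    def handle_chunk(ctype, flags, value):
        print("chunk", ctype, "flags", flags, "length", len(value))

    def report_error(ctype):
        print("unknown chunk", ctype, "reported to the sender")

    def process_chunks(segment):
        offset = 12  # the chunks follow the 12 bytes common header
        while offset + 4 <= len(segment):
            ctype, flags, length = struct.unpack_from(">BBH", segment, offset)
            if length < 4:
                return  # malformed chunk, stop processing
            if ctype in known:
                handle_chunk(ctype, flags, segment[offset + 4:offset + length])
            elif ctype >> 6 == 0b00:
                return               # discard the entire segment
            elif ctype >> 6 == 0b01:
                report_error(ctype)
                return               # discard the segment and report
            elif ctype >> 6 == 0b11:
                report_error(ctype)  # report, but keep processing
            # 0b10 : skip the unknown chunk silently and continue
            offset += (length + 3) & ~3  # chunks are 4 bytes aligned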
 Connection establishment
 ------------------------

@@ -82,7 +82,7 @@ The SCTP connection establishment uses several chunks to specify the values of s
     |||;

-The first segment contains the ``INIT`` chunk. To establish an SCTP connection with a server, the client first creates some local state for this connection. The most important parameter of the ``INIT`` chunk is the `Initiation tag`. This value is a random number that is used to identify the connection on the client host for its entire lifetime. This `Initiation tag` is placed as the `Verification tag` in all segments sent by the server. This is an important change compared to TCP where only the source and destination ports are used to identify a given connection. The `INIT`` chunk may also contain the other addresses owned by the client. The server responds by sending an ``INIT-ACK`` chunk. This chunk also contains an `Initiation tag` chosen by the server and a copy of the `Initiation tag` chosen by the client. The ``INIT`` and ``INIT-ACK`` chunks also contain an initial sequence number. A key difference between TCP's three-way handshake and SCTP's four-way handshake is that an SCTP server does not create any state when receiving an ``INIT`` chunk. For this, the server places inside the ``INIT-ACK`` reply a `State cookie` chunk. This `State cookie` is an opaque block of data that contains information computed from the ``INIT`` and ``INIT-ACK`` chunks that the server would have had stored locally, some lifetime information and a signature. The format of the `State cookie` is flexible and the server could in theory place almost any information inside this chunk. The only requirement is that the `State cookie` must be echoed back by the client to confirm the establishment of the connection. Upon reception of the ``COOKIE-ECHO`` chunk, the server verifies the signature of the `State cookie`. The client may provide some user data and an initial sequence number inside the ``COOKIE-ECHO`` chunk. The server then responds with a ``COOKIE-ACK`` chunk that acknowledges the ``COOKIE-ECHO`` chunk. The SCTP connection between the client and the server is now established. This four-way handshake is both more secure and more flexible than the three-way handshake used by TCP. The detailed formats of the ``INIT``, ``INIT-ACK``, ``COOKIE-ECHO`` and ``COOKIE-ACK`` chunks may be found in :rfc:`4960`.
+The first segment contains the ``INIT`` chunk. To establish an SCTP connection with a server, the client first creates some local state for this connection. The most important parameter of the ``INIT`` chunk is the `Initiation tag`. This value is a random number that is used to identify the connection on the client host for its entire lifetime. This `Initiation tag` is placed as the `Verification tag` in all segments sent by the server. This is an important change compared to TCP where only the source and destination ports are used to identify a given connection. The ``INIT`` chunk may also contain the other addresses owned by the client. The server responds by sending an ``INIT-ACK`` chunk. This chunk also contains an `Initiation tag` chosen by the server and a copy of the `Initiation tag` chosen by the client. The ``INIT`` and ``INIT-ACK`` chunks also contain an initial sequence number. A key difference between TCP's three-way handshake and SCTP's four-way handshake is that an SCTP server does not create any state when receiving an ``INIT`` chunk. For this, the server places inside the ``INIT-ACK`` reply a `State cookie` chunk. This `State cookie` is an opaque block of data that contains information computed from the ``INIT`` and ``INIT-ACK`` chunks that the server would have had stored locally, some lifetime information and a signature. The format of the `State cookie` is flexible and the server could in theory place almost any information inside this chunk. The only requirement is that the `State cookie` must be echoed back by the client to confirm the establishment of the connection. Upon reception of the ``COOKIE-ECHO`` chunk, the server verifies the signature of the `State cookie`. The client may provide some user data and an initial sequence number inside the ``COOKIE-ECHO`` chunk. The server then responds with a ``COOKIE-ACK`` chunk that acknowledges the ``COOKIE-ECHO`` chunk. The SCTP connection between the client and the server is now established. This four-way handshake is both more secure and more flexible than the three-way handshake used by TCP. The detailed formats of the ``INIT``, ``INIT-ACK``, ``COOKIE-ECHO`` and ``COOKIE-ACK`` chunks may be found in :rfc:`4960`.
 Reliable data transfer
 ----------------------

@@ -98,7 +98,7 @@ SCTP provides a slightly different service model :rfc:`3286`. Once an SCTP conne
 .. index:: SCTP TSN, Transmission Sequence Number

-An SCTP DATA chunk contains several fields as shown in the figure above. The detailed description of this chunk may be found in :rfc:`4960`. For simplicity, we focus on an SCTP connection that supports a single stream. SCTP uses the `Transmission Sequence Number` (TSN) to sequence the data chunks that are sent. The TSN is also used to reorder the received DATA chunks and detect lost chunks. This TSN is encoded as a 32 bits field, as the sequence number by the TCP. However, the TSN is only incremented by one for each data chunk. This implies that the TSN space does not wrap as quickly as the TCP sequence number. When a small message needs to be sent, the SCTP entity creates a new data chunk with the next available TSN and places the data inside the chunk. A single SCTP segment may contain several data chunks, e.g. when small messages are transmitted. Each message is identified by its TSN and within a stream all messages are delivered in sequence. If the message to be transmitted is larger than the underlying network packet, SCTP needs to fragment the message in several chunks that are placed in subsequent segments. The packing of the message in successive segments must still enable the receiver to detect the message boundaries. This is achieved by using the ``B`` and ``E`` bits of the second high-order byte of the data chunk. The ``B`` (Begin) bit is set when the first byte of the User data field of the data chunk is the first byte of the message. The ``E`` (End) bit is set when the last byte of the User data field of the data chunk is the last byte of the message. A small message is always a sent as chunk whose ``B`` and ``E`` bits are set to `1`. A message which is larger than one network packet will be fragmented in several chunks. Consider for example a message that needs to be divided in three chunks sent in three different SCTP segments. The first chunk will have its ``B`` bit set to ``1`` and its ``E`` bit set to ``0`` and a TSN (say `x`). The second chunk will have both its ``B`` and ``E`` bits set to ``0`` and its TSN will be `x+1`. The third, and last, chunk will have its ``B`` bit set to ``0``, its ``E`` bit set to ``1`` and its TSN will be `x+2`. All the chunks that correspond to a given message must have successive TSNs. The ``B`` and ``E`` bits allow the receiver to recover the message from the received data chunks.
+An SCTP DATA chunk contains several fields as shown in the figure above. The detailed description of this chunk may be found in :rfc:`4960`. For simplicity, we focus on an SCTP connection that supports a single stream. SCTP uses the `Transmission Sequence Number` (TSN) to sequence the data chunks that are sent. The TSN is also used to reorder the received DATA chunks and detect lost chunks. This TSN is encoded as a 32 bits field, like the TCP sequence number. However, the TSN is only incremented by one for each data chunk. This implies that the TSN space does not wrap as quickly as the TCP sequence number. When a small message needs to be sent, the SCTP entity creates a new data chunk with the next available TSN and places the data inside the chunk. A single SCTP segment may contain several data chunks, e.g. when small messages are transmitted. Each message is identified by its TSN and within a stream all messages are delivered in sequence. If the message to be transmitted is larger than the underlying network packet, SCTP needs to fragment the message in several chunks that are placed in subsequent segments. The packing of the message in successive segments must still enable the receiver to detect the message boundaries. This is achieved by using the ``B`` and ``E`` bits of the second high-order byte of the data chunk. The ``B`` (Begin) bit is set when the first byte of the User data field of the data chunk is the first byte of the message. The ``E`` (End) bit is set when the last byte of the User data field of the data chunk is the last byte of the message. A small message is always sent as a chunk whose ``B`` and ``E`` bits are set to `1`. A message which is larger than one network packet will be fragmented in several chunks. Consider for example a message that needs to be divided in three chunks sent in three different SCTP segments. The first chunk will have its ``B`` bit set to ``1`` and its ``E`` bit set to ``0`` and a TSN (say `x`). The second chunk will have both its ``B`` and ``E`` bits set to ``0`` and its TSN will be `x+1`. The third, and last, chunk will have its ``B`` bit set to ``0``, its ``E`` bit set to ``1`` and its TSN will be `x+2`. All the chunks that correspond to a given message must have successive TSNs. The ``B`` and ``E`` bits allow the receiver to recover the message from the received data chunks.
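The reassembly logic implied by the ``B`` and ``E`` bits fits in a few lines, assuming the receiver has already put the data chunks back in TSN order:

.. code-block:: python

    def reassemble(chunks):
        # chunks is a list of (tsn, b, e, payload) tuples in TSN order
        messages, fragments = [], []
        for tsn, b, e, payload in chunks:
            if b:
                fragments = []  # first fragment of a new message
            fragments.append(payload)
            if e:
                messages.append(b"".join(fragments))  # message complete
        return messages

    chunks = [(10, 1, 1, b"small"),                       # B=E=1
              (11, 1, 0, b"la"), (12, 0, 0, b"rge"), (13, 0, 1, b"!")]
    print(reassemble(chunks))  # [b'small', b'large!']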
 .. index:: SCTP SACK chunk, SCTP Selective Acknowledgement chunk

@@ -108,7 +108,7 @@ The data chunks are only one part of the reliable data transfert. To reliably tr
    The SCTP Sack chunk

-This chunk is sent by a sender when it needs to send feedback about the reception of data chunks or its buffer space to the remote sender. The `Cumulative TSN ack` contains the TSN of the last data chunk that was received in sequence. This cumulative indicates which TSN has been reliably received by the receiver. The evolution of this field shows the progress of the reliable transmission. This is the first feedback provided by SCTP. Note that in SCTP the acknowledgements are at the chunk level and not at the byte level in contrast with TCP. While SCTP transfers messages divided in chunks, buffer space is still measured in bytes and not in variable-length messages or chunks. The `Advertised Receiver Window Credit` field of the Sack chunk provides the current receive window of the receiver. This window is measured in bytes and its left edge is the last byte of the last in-sequence data chunk.
+This chunk is sent by a sender when it needs to send feedback about the reception of data chunks or its buffer space to the remote sender. The `Cumulative TSN ack` contains the TSN of the last data chunk that was received in sequence. This cumulative ack indicates which TSN has been reliably received by the receiver. The evolution of this field shows the progress of the reliable transmission. This is the first feedback provided by SCTP. Note that in SCTP the acknowledgements are at the chunk level and not at the byte level in contrast with TCP. While SCTP transfers messages divided in chunks, buffer space is still measured in bytes and not in variable-length messages or chunks. The `Advertised Receiver Window Credit` field of the Sack chunk provides the current receive window of the receiver. This window is measured in bytes and its left edge is the last byte of the last in-sequence data chunk.

 The Sack chunk also provides information about the received out-of-sequence chunks (if any). The Sack chunk contains gap blocks that are in principle similar to the TCP Sack option. However, there are some differences between TCP and SCTP. The Sack option used by TCP has a limited size. This implies that if there are many gaps that need to be reported, a TCP receiver must decide which gaps to include in the SACK option. The SCTP Sack chunk is only limited by the network packet length, which is not a problem in practice. A second difference is that SCTP can also provide feedback about the reception of duplicate chunks. If several copies of the same data chunk have been received, this probably indicates a bad heuristic on the sender. The last part of the Sack chunk provides the list of duplicate TSNs received to enable a sender to tune its retransmission mechanism based on this information. Some details on a possible use of this field may be found in :rfc:`3708`. The last difference with the TCP SACK option is that the gaps are encoded as deltas relative to the `Cumulative TSN ack`. These deltas are encoded as 16 bits integers and allow to reduce the length of the chunk.
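Encoding the gaps as deltas is straightforward. The sketch below derives the gap blocks from the cumulative TSN and a set of out-of-sequence TSNs; it illustrates the encoding only, not the actual chunk layout.

.. code-block:: python

    def gap_blocks(cumulative, received):
        # group the out-of-sequence TSNs into (start, end) ranges and
        # express them as deltas relative to the cumulative TSN ack
        blocks, start, previous = [], None, None
        for tsn in sorted(t for t in received if t > cumulative):
            if previous is not None and tsn == previous + 1:
                previous = tsn
                continue
            if start is not None:
                blocks.append((start - cumulative, previous - cumulative))
            start = previous = tsn
        if start is not None:
            blocks.append((start - cumulative, previous - cumulative))
        return blocks

    # everything up to TSN 100 was received in sequence
    print(gap_blocks(100, {103, 104, 107}))  # [(3, 4), (7, 7)]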
@@ -119,7 +119,7 @@ Connection release
 ------------------

-SCTP uses a different approach to terminante connections. When an application requests a shutdown of a connection, SCTP performs a three-way handshake. This handshake uses the ``SHUTDOWN``, ``SHUTDOWN-ACK`` and ``SHUTDOWN-COMPLETE`` chunks. The ``SHUTDOWN`` chunk is sent once all outgoing data has been acknowledged. It contains the last cumulative sequence number. Upon reception of a ``SHUTDOWN`` chunk, an SCTP entity informs its application that it cannot accept anymore data over this connection. It then ensures that all outstanding data have been delivered correctly. At that point, it sends a ``SHUTDOWN-ACK`` to confirm the reception of the ``SHUTDOWN`` segment. The three-way handshake completes with the transmission of the ``SHUTDOWN-COMPLETE`` chunk :rfc:`4960`.
+SCTP uses a different approach to terminate connections. When an application requests a shutdown of a connection, SCTP performs a three-way handshake. This handshake uses the ``SHUTDOWN``, ``SHUTDOWN-ACK`` and ``SHUTDOWN-COMPLETE`` chunks. The ``SHUTDOWN`` chunk is sent once all outgoing data has been acknowledged. It contains the last cumulative sequence number. Upon reception of a ``SHUTDOWN`` chunk, an SCTP entity informs its application that it cannot accept any more data over this connection. It then ensures that all outstanding data have been delivered correctly. At that point, it sends a ``SHUTDOWN-ACK`` to confirm the reception of the ``SHUTDOWN`` segment. The three-way handshake completes with the transmission of the ``SHUTDOWN-COMPLETE`` chunk :rfc:`4960`.

 .. msc::

      client=>server [ label = "SHUTDOWN(TSN=last)", arcskip="1" ];
      server=>client [ label = "SHUTDOWN-ACK", arcskip="1"];
-     |||;
+     |||;
      client=>server [ label = "SHUTDOWN-COMPLETE", arcskip="1" ];
      |||;

-Note that in contrast with TCP's four-way handshake, the utilisation of a three-way handshake to close an SCTP connection implies that the client (resp. server) may close the connection when the application at the other end has still some data to transmit. Upon reception of the ``SHUTDOWN`` chunk, an SCTP entity must stop accepting new data from the application, but it still needs to retransmit the unacknowledged data chunks (the ``SHUTDOWN`` chunk may be placed in the same segment as a ``Sack`` chunk that indicates gaps in the received chunks).
+Note that in contrast with TCP's four-way handshake, the utilisation of a three-way handshake to close an SCTP connection implies that the client (resp. server) may close the connection when the application at the other end still has some data to transmit. Upon reception of the ``SHUTDOWN`` chunk, an SCTP entity must stop accepting new data from the application, but it still needs to retransmit the unacknowledged data chunks (the ``SHUTDOWN`` chunk may be placed in the same segment as a ``Sack`` chunk that indicates gaps in the received chunks).

 SCTP also provides the equivalent to TCP's ``RST`` segment. The ``ABORT`` chunk can be used to refuse a connection, react to the reception of an invalid segment or immediately close a connection (e.g. due to lack of resources).

diff --git a/book-2nd/protocols/ssh.rst b/book-2nd/protocols/ssh.rst
index cba41c8..13b7bf2 100644
--- a/book-2nd/protocols/ssh.rst
+++ b/book-2nd/protocols/ssh.rst
@@ -14,9 +14,9 @@
 and 1970s, the mainframes and the emerging minicomputers were composed of a central unit and a set of terminals connected through serial lines or modems. The simplest protocol that was designed to access remote computers over a network is probably :term:`telnet` :rfc:`854`.

-:term:`telnet` runs over TCP and a telnet server listens on port `23` by
-default. The TCP connection used by telnet is bidirectional, both the client
-and the server can send data over it. The data exchanged over such a
+:term:`telnet` runs over TCP and a telnet server listens on port `23` by
+default. The TCP connection used by telnet is bidirectional: both the client
+and the server can send data over it.
 The data exchanged over such a connection is essentially the characters that are typed by the user on the client machine and the text output of the processes running on the server machine with a few exceptions (e.g. control characters, characters to control

@@ -45,7 +45,7 @@ similar protocols [Ylonen1996]_
 :term:`ssh` became quickly popular and system administrators encouraged its usage. The original version of :term:`ssh` was freely available. After a few years, its author created a company to distribute it commercially, but other programmers continued to
-develop an open-source version of :term`ssh` called
+develop an open-source version of :term:`ssh` called
 `OpenSSH `_. Over the years, :term:`ssh` evolved and became a flexible application whose usage extends beyond remote

@@ -55,24 +55,24 @@ how it differs from :term:`telnet`. Entire books have been written to describe
 :term:`ssh` in detail [BS2005]_. An overview of the protocol appeared in [Stallings2009]_.

-The :term:`ssh` protocol runs directly above the TCP protocol.
+The :term:`ssh` protocol runs directly above the TCP protocol.
 Once the TCP bytestream has been established, the client and the server exchange messages. The first message exchanged is an ASCII line that announces the version of the protocol and the version of the software implementation used by the client and the server. These two lines are useful when debugging interoperability
-problems and other issues.
+problems and other issues.

 The next message is the ``SSH_MSG_KEX_INIT`` message that is used
-to negotiate the cryptographic algorithms that will be used for the
+to negotiate the cryptographic algorithms that will be used for the
 ``ssh`` session. It is very important for security protocols to include mechanisms that enable a negotiation of the cryptographic algorithms that are used for several reasons. First, these algorithms provide different levels of security. Some algorithms might be considered totally secure and are recommended today while they could become deprecated a few years later after the publication of some
-attacks. Second, these algorithms provide different levels of
-performance and have different CPU and memory impacts.
+attacks. Second, these algorithms provide different levels of
+performance and have different CPU and memory impacts.

 In practice, an ``ssh`` implementation supports four types of cryptographic algorithms :

  - key exchange
  - encryption
  - Message Authentication Code (MAC)
- - compression
+ - compression

 The IANA_ maintains a `list of the cryptographic algorithms `_
 that can be used by ``ssh`` implementations. For each type of algorithm, the client provides an ordered list of the algorithms that it supports and agrees to use. The server compares the received list with its own list.
-The outcome of the negotiation is a set of four algorithms [#fnull]_
+The outcome of the negotiation is a set of four algorithms [#fnull]_
 that will be combined for this session.

 .. msc::

@@ -124,28 +124,28 @@ used to negotiate a secret key that will be shared by the client and
 the server. These key exchange algorithms include some variations over the basic algorithms. As an example, let us analyse how the Diffie Hellman key exchange algorithm is used within the
-``ssh`` protocol. In this case, each host has both a private and a public key.
+``ssh`` protocol. In this case, each host has both a private and a public key.

- - the client generates the random number :math:`a` and sends
+ - the client generates the random number :math:`a` and sends
   :math:`A=g^{a} mod p` to the server
  - the server generates the random number :math:`b`. It then computes
-  :math:`B=g^{b} mod p`, :math:`K=B^{a} mod p` and signs with its private
-  key :math:`hash(V_{Client} || V_{Server} || KEX\_INIT_{Client} || KEX\_INIT_{Server} || Server_{pub} || A || B || K )`
+  :math:`B=g^{b} mod p`, :math:`K=A^{b} mod p` and signs with its private
+  key :math:`hash(V_{Client} || V_{Server} || KEX\_INIT_{Client} || KEX\_INIT_{Server} || Server_{pub} || A || B || K )`
   where :math:`V_{Client}` (resp. :math:`V_{Server}`) is the initial
-  messages sent by the client (resp. server), :math:`KEX\_INIT_{Client}`
+  message sent by the client (resp. server), :math:`KEX\_INIT_{Client}`
   (resp. :math:`KEX\_INIT_{Server}`) is the key exchange message sent by
-  the client (resp. server) and :math:`A`, :math:`B` and :math:`K` are the
+  the client (resp. server) and :math:`A`, :math:`B` and :math:`K` are the
   messages of the Diffie Hellman key exchange
- - the client can recompute :math:`K=A^{b} mod p` and verify the
+ - the client can recompute :math:`K=B^{a} mod p` and verify the
   signature provided by the server
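Note that the corrected computations above swap what the original text stated: the server, which knows :math:`b`, computes :math:`K=A^{b} mod p`, while the client, which knows :math:`a`, computes :math:`K=B^{a} mod p`. The exchange can be illustrated with toy numbers; this sketch uses a deliberately tiny group, whereas a real implementation uses the large negotiated parameters.

.. code-block:: python

    import random

    p, g = 23, 5                      # toy group parameters

    a = random.randrange(1, p - 1)    # client's secret
    A = pow(g, a, p)                  # sent to the server

    b = random.randrange(1, p - 1)    # server's secret
    B = pow(g, b, p)                  # sent back to the client
    K_server = pow(A, b, p)           # server side : K = A^b mod p

    K_client = pow(B, a, p)           # client side : K = B^a mod p
    assert K_client == K_server       # both ends now share K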
- - the client generates the random number :math:`a` and sends
+ - the client generates the random number :math:`a` and sends
   :math:`A=g^{a} mod p` to the server
 - the server generates the random number :math:`b`. It then computes
-   :math:`B=g^{b} mod p`, :math:`K=B^{a} mod p` and signs with its private
-   key :math:`hash(V_{Client} || V_{Server} || KEX\_INIT_{Client} || KEX\_INIT_{Server} || Server_{pub} || A || B || K )`
+   :math:`B=g^{b} mod p`, :math:`K=A^{b} mod p` and signs with its private
+   key :math:`hash(V_{Client} || V_{Server} || KEX\_INIT_{Client} || KEX\_INIT_{Server} || Server_{pub} || A || B || K )`
   where :math:`V_{Client}` (resp. :math:`V_{Server}`) is the initial
-   messages sent by the client (resp. server), :math:`KEX\_INIT_{Client}`
+   message sent by the client (resp. server), :math:`KEX\_INIT_{Client}`
   (resp. :math:`KEX\_INIT_{Server}`) is the key exchange message sent by
-   the client (resp. server) and :math:`A`, :math:`B` and :math:`K` are the
+   the client (resp. server) and :math:`A`, :math:`B` and :math:`K` are the
   messages of the Diffie Hellman key exchange
-  - the client can recompute :math:`K=A^{b} mod p` and verify the
+  - the client can recompute :math:`K=B^{a} mod p` and verify the
   signature provided by the server
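As announced above, the Diffie Hellman part of these steps can be illustrated with a small numerical example. It is only a sketch : the parameters below are deliberately tiny (a real implementation uses groups of 2048 bits or more) and the signature computed by the server is omitted.

.. code-block:: python

    # Toy Diffie Hellman exchange with small public parameters
    p, g = 2147483647, 5      # public prime modulus and base
    a, b = 77777, 33333       # private random numbers of client and server

    A = pow(g, a, p)          # A = g^a mod p, sent by the client
    B = pow(g, b, p)          # B = g^b mod p, sent by the server

    K_server = pow(A, b, p)   # K = A^b mod p, computed by the server
    K_client = pow(B, a, p)   # K = B^a mod p, recomputed by the client
    assert K_client == K_server  # both hosts now share the secret K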
This is a slightly modified authenticated Diffie Hellman key exchange
-with two interesting points. The first point is that
+with two interesting points. The first point is that
when the server authenticates the key exchange it does not provide a
-certificate. This is because ``ssh`` assumes that the client will store
+certificate. This is because ``ssh`` assumes that the client will store
inside its cache the public key of the servers that it uses on a regular basis. This assumption is valid for a protocol like ``ssh``
-because users typically use it to interact with a small number of
+because users typically use it to interact with a small number of
servers, usually a few or a few tens. Storing this information does not require a lot of storage. In practice, most ``ssh`` clients will agree to connect to remote servers without knowing their public key before
@@ -155,7 +155,7 @@ with a fingerprint of the key, either as a sequence of letters or as an ASCII art which can be posted on the web or elsewhere [#fdnsssh]_ by the system administrator of the server. If a client connects to a server whose public key does not match the stored one, a stronger warning is
-issued because this could indicate a man-in-the-middle attack or that
+issued because this could indicate a man-in-the-middle attack or that
the remote server has been compromised. It can also indicate that the server has been upgraded and that a new key has been generated during this upgrade.

@@ -169,26 +169,26 @@ active attacker modifies the messages sent by the communicating hosts (typically the client) to request the utilisation of weaker encryption algorithms. Consider a client that supports two encryption schemes. The preferred one uses 128 bits secret keys and the second one is an old
-encryption scheme that uses 48 bits keys. This second algorithm is
+encryption scheme that uses 48 bits keys. This second algorithm is
kept for backward compatibility with older implementations. If an attacker can remove the preferred algorithm from the list of encryption algorithms supported by the client, he can force the server to use a weaker encryption scheme that will be easier to break.

Thanks to the hash that covers all the messages exchanged by the server, the downgrade attack cannot occur against ``ssh``. Algorithm agility is
-a key requirement for security protocols that need to evolve when
+a key requirement for security protocols that need to evolve when
encryption algorithms are broken by researchers. This agility cannot be used without care, and signing a hash of all the messages exchanged
-is a technique that is frequently used to prevent downgrade attacks.
+is a technique that is frequently used to prevent downgrade attacks.

.. note:: Single use keys

-   Thanks to the Diffie Hellman key exchange, the client and the
+   Thanks to the Diffie Hellman key exchange, the client and the
   server share key :math:`K`. A naive implementation would probably directly use this key for all the cryptographic algorithms that have been negotiated for this session. Like most security protocols, ``ssh`` does not directly use key :math:`K`. Instead, it uses
-   the negotiated hash function with different parameters [fsshkeys]_
+   the negotiated hash function with different parameters [#fsshkeys]_
   to allow the client and the server to compute six keys from :math:`K` :

@@ -196,14 +196,14 @@ is a technique that is frequently used to prevent downgrade attacks.

      it sends
    - a key used by the client (resp. server) to authenticate the data that it sends
-    - a key used by the client (resp. server) to initialise the
+    - a key used by the client (resp. server) to initialize the
      negotiated encryption scheme (if required by this scheme)

   It is common practice among designers of security protocols to never use the same key for different purposes. For example, allowing the client and the server to use the same key to encrypt data could enable an attacker to launch a replay attack by resending to the
-   client data that it has itself encrypted.
+   client data that it has itself encrypted.

At this point, all the messages sent over the TCP connection will be encrypted

@@ -214,21 +214,21 @@ that are encoded according to the Binary Packet Protocol defined in

 - ``length`` : this is the length of the message in bytes, excluding the MAC and length fields
 - ``padding length`` : this is the number of random bytes that have been added
-   at the end of the message.
+   at the end of the message.
 - ``payload`` : the data (after optional compression) passed by the user
 - ``padding`` : random bytes added in each message (at least four) to ensure that the message length is a multiple of the block size used by the negotiated encryption algorithm
- - ``MAC`` : this field is present if a Message Authentication Code has been
+ - ``MAC`` : this field is present if a Message Authentication Code has been
   negotiated for the session (in practice, using ``ssh`` without authentication is risky and this field should always be present). Note that to compute the MAC, an ``ssh`` implementation must maintain a message counter. This counter is incremented by one every time a message is sent and the MAC is computed with the negotiated authentication
-   algorithm using the MAC key over the concatenation of
-   the message counter and the unencrypted message.
+   algorithm using the MAC key over the concatenation of
+   the message counter and the unencrypted message.
   The message counter is not transmitted,
-   but the recipient can easily recover its value. The ``MAC`` is computed as
+   but the recipient can easily recover its value. The ``MAC`` is computed as
   :math:`mac = MAC(key, sequence\_number || unencrypted\_message)` where the key is the negotiated authentication key.
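The following sketch shows what this computation could look like when the negotiated algorithm is ``hmac-sha1``. The helper name is an assumption made for the illustration; only the formula above comes from the protocol specification.

.. code-block:: python

    import hashlib
    import hmac
    import struct

    def compute_mac(mac_key, counter, unencrypted_message):
        # mac = MAC(key, sequence_number || unencrypted_message); the 32-bit
        # message counter is never transmitted, both entities reconstruct it
        # by counting the messages that they have exchanged
        data = struct.pack('!I', counter) + unencrypted_message
        return hmac.new(mac_key, data, hashlib.sha1).digest()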
@@ -240,17 +240,17 @@ that are encoded according to the Binary Packet Protocol defined in

  (MAC) to authenticate the messages that are sent. A naive implementation of such a MAC would be to simply use a hash function like SHA-1. However, such a construction would not be safe from a security viewpoint. Internet
-  protocols usually rely on the HMAC construction defined in :rfc:`2104`.
+  protocols usually rely on the HMAC construction defined in :rfc:`2104`.
  It works with any hash function (`H`) and a key (`K`). As an example, let us consider HMAC with the SHA-1 hash function. SHA-1 operates on 64 bytes blocks and this block size will play an important role in the operation
-  of HMAC. We first require the key to as long as the block size. Since this
+  of HMAC. We first require the key to be as long as the block size. Since this
  key is the output of the key generation algorithm, this is one parameter
-  of this algorithm.
+  of this algorithm.

-  HMAC uses two padding strings : `ipad` (resp. `opad`) which is a
+  HMAC uses two padding strings : `ipad` (resp. `opad`) which is a
  string containing 64 times byte ``0x36`` (resp. byte ``0x5C``). The HMAC
-  is then computed as :math:`H[K \oplus opad, H(K \oplus ipad, data) ]`
+  is then computed as :math:`H[K \oplus opad, H(K \oplus ipad, data) ]`
  where :math:`\oplus` denotes the bitwise XOR operation. This computation has been shown to be stronger than the naive :math:`H(K,data)` against some types of cryptographic attacks.

@@ -262,10 +262,10 @@ to mention how users are authenticated by the server. The ``ssh`` protocol supports the classical username/password authentication (but both the username and the password are transmitted over the secure encrypted channel). In addition, ``ssh`` supports two authentication mechanisms that
-rely on public keys. To use the first one, each user needs to generate
+rely on public keys. To use the first one, each user needs to generate
his/her own public/private key pair and store the public key on the server. To be authenticated, the user needs to sign a message containing his/her
-public key by using his/her private key. The server can easily verify the
+public key by using his/her private key. The server can easily verify the
validity of the signature since it already knows the user's public key. The second authentication scheme is designed for hosts that trust each other. Each host has a public/private key pair and stores the public keys
@@ -276,7 +276,7 @@ command on ``computer2``, she can create an ``ssh`` session on this computer and type (again) her password. With the host-based authentication scheme, ``computer1`` signs a message with its private key to confirm that it has already authenticated Alice. ``computer2`` would then accept
-Alice's session without asking her credentials.
+Alice's session without asking for her credentials.

The ``ssh`` protocol includes other features that are beyond the scope of this book. Additional details may be found in [BS2005]_.

.. rubric:: Footnotes

-.. [#fnull] For some of the algorithms, it is possible to negotiate the
+.. [#fnull] For some of the algorithms, it is possible to negotiate the
   utilisation of no algorithm. This happens frequently for the compression algorithm that is not always used.
For this, both the client and the server must announce ``null`` in their ordered list of supported algorithms. -.. [#fdnsssh] For example, :rfc:`4255` describes a DNS record that can be +.. [#fdnsssh] For example, :rfc:`4255` describes a DNS record that can be used to associate an ``ssh`` fingerprint to a DNS name. .. [#fsshkeys] The exact algorithms used for the computation of these - keys are defined in :rfc:`4253` + keys are defined in :rfc:`4253` .. include:: /links.rst diff --git a/book-2nd/protocols/tcp.rst b/book-2nd/protocols/tcp.rst index dcb5e12..f412a4a 100644 --- a/book-2nd/protocols/tcp.rst +++ b/book-2nd/protocols/tcp.rst @@ -22,7 +22,7 @@ TCP provides a reliable bytestream, connection-oriented transport service on top On the global Internet, most of the applications used in the wide area rely on TCP. Many studies [#ftcpusage]_ have reported that TCP was responsible for more than 90% of the data exchanged in the global Internet. .. index:: TCP header - + To provide this service, TCP relies on a simple segment format that is shown in the figure below. Each TCP segment contains a header described below and, optionally, a payload. The default length of the TCP header is twenty bytes, but some TCP headers contain options. .. figure:: /../book/transport/pkt/tcp.png @@ -43,25 +43,25 @@ A TCP header contains the following fields : - the `sequence number` (32 bits), `acknowledgement number` (32 bits) and `window` (16 bits) fields are used to provide a reliable data transfer, using a window-based protocol. In a TCP bytestream, each byte of the stream consumes one sequence number. Their utilisation will be described in more detail in section :ref:`TCPReliable` - the `Urgent pointer` is used to indicate that some data should be considered as urgent in a TCP bytestream. However, it is rarely used in practice and will not be described here. Additional details about the utilisation of this pointer may be found in :rfc:`793`, :rfc:`1122` or [Stevens1994]_ - - the flags field contains a set of bit flags that indicate how a segment should be interpreted by the TCP entity receiving it : + - the flags field contains a set of bit flags that indicate how a segment should be interpreted by the TCP entity receiving it : - the `SYN` flag is used during connection establishment - the `FIN` flag is used during connection release - the `RST` is used in case of problems or when an invalid segment has been received - when the `ACK` flag is set, it indicates that the `acknowledgment` field contains a valid number. Otherwise, the content of the `acknowledgment` field must be ignored by the receiver - the `URG` flag is used together with the `Urgent pointer` - - the `PSH` flag is used as a notification from the sender to indicate to the receiver that it should pass all the data it has received to the receiving process. However, in practice TCP implementations do not allow TCP users to indicate when the `PSH` flag should be set and thus there are few real utilizations of this flag. + - the `PSH` flag is used as a notification from the sender to indicate to the receiver that it should pass all the data it has received to the receiving process. However, in practice TCP implementations do not allow TCP users to indicate when the `PSH` flag should be set and thus there are few real utilizations of this flag. - the `checksum` field contains the value of the Internet checksum computed over the entire TCP segment and a pseudo-header as with UDP - the `Reserved` field was initially reserved for future utilization. 
It is now used by :rfc:`3168`.
 - the `TCP Header Length` (THL) or `Data Offset` field is a four bits field that indicates the size of the TCP header in 32 bit words. As this field cannot contain a value larger than 15, the maximum size of the TCP header is thus 60 bytes.
 - the `Optional header extension` is used to add optional information to the TCP header. Thanks to this header extension, it is possible to add new fields to the TCP header that were not planned in the original specification. This allowed TCP to evolve since the early eighties. The details of the TCP header extension are explained in sections :ref:`TCPOpen` and :ref:`TCPReliable`.
-
+
.. _fig-tcpports:

.. figure:: /../book/transport/svg/tcp-ports.png
   :align: center
-   :scale: 70
+   :scale: 70

   Utilization of the TCP source and destination ports

@@ -95,41 +95,41 @@ This segment is often called a `SYN+ACK` segment. The acknowledgment confirms to

 - the `ACK` flag set
 - the `acknowledgment number` set to the `sequence number` of the received `SYN+ACK` segment incremented by 1 :math:`\pmod{2^{32}}`

-At this point, the TCP connection is open and both the client and the server are allowed to send TCP segments containing data. This is illustrated in the figure below.
+At this point, the TCP connection is open and both the client and the server are allowed to send TCP segments containing data. This is illustrated in the figure below.

.. figure:: /../book/transport/png/transport-fig-059-c.png
   :align: center
-   :scale: 70
+   :scale: 70

   Establishment of a TCP connection

-In the figure above, the connection is considered to be established by the client once it has received the `SYN+ACK` segment, while the server considers the connection to be established upon reception of the `ACK` segment. The first data segment sent by the client (server) has its `sequence number` set to `x+1` (resp. `y+1`).
+In the figure above, the connection is considered to be established by the client once it has received the `SYN+ACK` segment, while the server considers the connection to be established upon reception of the `ACK` segment. The first data segment sent by the client (server) has its `sequence number` set to `x+1` (resp. `y+1`).

.. index:: TCP Initial Sequence Number

.. note:: Computing TCP's initial sequence number

 In the original TCP specification :rfc:`793`, each TCP entity maintained a clock to compute the initial sequence number (:term:`ISN`) placed in the `SYN` and `SYN+ACK` segments. This made the ISN predictable and caused a security issue. The typical security problem was the following. Consider a server that trusts a host based on its IP address and allows the system administrator to log in from this host without giving a password [#frlogin]_. Consider now an attacker who knows this particular configuration and is able to send IP packets having the client's address as source. He can send fake TCP segments to the server, but does not receive the server's answers. If he can predict the `ISN` that is chosen by the server, he can send a fake `SYN` segment and shortly after the fake `ACK` segment confirming the reception of the `SYN+ACK` segment sent by the server. Once the TCP connection is open, he can use it to send any command to the server. To counter this attack, current TCP implementations add randomness to the `ISN`. One of the solutions, proposed in :rfc:`1948`, is to compute the `ISN` as ::
-
+
  ISN = M + H(localhost, localport, remotehost, remoteport, secret).

 where `M` is the current value of the TCP clock and `H` is a cryptographic hash function. `localhost` and `remotehost` (resp. `localport` and `remoteport`) are the IP addresses (resp. port numbers) of the local and remote host and `secret` is a random number only known by the server. This method allows the server to use different ISNs for different clients at the same time. `Measurements `_ performed with the first implementations of this technique showed that it was difficult to implement it correctly, but today's TCP implementations now generate good ISNs.
-
+
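 A minimal sketch of this computation is shown below. The choice of hash function, the form of the secret and the clock granularity (:rfc:`793` suggested incrementing the clock every 4 microseconds) are assumptions made for the illustration :

 .. code-block:: python

    import hashlib
    import time

    SECRET = b'random-value-only-known-by-this-host'

    def compute_isn(localhost, localport, remotehost, remoteport):
        # M : TCP clock incremented every 4 microseconds, modulo 2^32
        m = int(time.monotonic() * 250_000) % 2**32
        # H : cryptographic hash of the connection identifier and the secret
        ident = f'{localhost}:{localport}:{remotehost}:{remoteport}'.encode()
        h = hashlib.sha256(ident + SECRET)
        return (m + int.from_bytes(h.digest()[:4], 'big')) % 2**32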
.. index:: TCP RST

A server could, of course, refuse to open a TCP connection upon reception of a `SYN` segment. This refusal may be due to various reasons. There may be no server process that is listening on the destination port of the `SYN` segment. The server could always refuse connection establishments from this particular client (e.g. due to security reasons) or the server may not have enough resources to accept a new TCP connection at that time. In this case, the server would reply with a TCP segment having its `RST` flag set and containing the `sequence number` of the received `SYN` segment as its `acknowledgment number`. This is illustrated in the figure below. We discuss the other utilizations of the TCP `RST` flag later (see :ref:`TCPRelease`).

.. figure:: /../book/transport/png/transport-fig-061-c.png
   :align: center
-   :scale: 70
+   :scale: 70

   TCP connection establishment rejected by peer

-TCP connection establishment can be described as the four state Finite State Machine shown below. In this FSM, `!X` (resp. `?Y`) indicates the transmission of segment `X` (resp. reception of segment `Y`) during the corresponding transition. `Init` is the initial state.
+TCP connection establishment can be described as the four-state Finite State Machine shown below. In this FSM, `!X` (resp. `?Y`) indicates the transmission of segment `X` (resp. reception of segment `Y`) during the corresponding transition. `Init` is the initial state.

.. figure:: /../book/transport/png/transport-fig-063-c.png
   :align: center
-   :scale: 70
+   :scale: 70

   TCP FSM for connection establishment

@@ -139,7 +139,7 @@ Apart from these two paths in the TCP connection establishment FSM, there is a t

.. figure:: /../book/transport/png/transport-fig-062-c.png
   :align: center
-   :scale: 70
+   :scale: 70

   Simultaneous establishment of a TCP connection

@@ -148,7 +148,7 @@

 .. topic:: Denial of Service attacks

- When a TCP entity opens a TCP connection, it creates a Transmission Control Block (:term:`TCB`). The TCB contains the entire state that is maintained by the TCP entity for each TCP connection. During connection establishment, the TCB contains the local IP address, the remote IP address, the local port number, the remote port number, the current local sequence number, the last sequence number received from the remote entity. Until the mid 1990s, TCP implementations had a limit on the number of TCP connections that could be in the `SYN RCVD` state at a given time. Many implementations set this limit to about 100 TCBs. This limit was considered sufficient even for heavily load http servers given the small delay between the reception of a `SYN` segment and the reception of the `ACK` segment that terminates the establishment of the TCP connection. When the limit of 100 TCBs in the `SYN Rcvd` state is reached, the TCP entity discards all received TCP `SYN` segments that do not correspond to an existing TCB.
+ When a TCP entity opens a TCP connection, it creates a Transmission Control Block (:term:`TCB`).
The TCB contains the entire state that is maintained by the TCP entity for each TCP connection. During connection establishment, the TCB contains the local IP address, the remote IP address, the local port number, the remote port number, the current local sequence number and the last sequence number received from the remote entity. Until the mid 1990s, TCP implementations had a limit on the number of TCP connections that could be in the `SYN RCVD` state at a given time. Many implementations set this limit to about 100 TCBs. This limit was considered sufficient even for heavily loaded HTTP servers given the small delay between the reception of a `SYN` segment and the reception of the `ACK` segment that terminates the establishment of the TCP connection. When the limit of 100 TCBs in the `SYN Rcvd` state is reached, the TCP entity discards all received TCP `SYN` segments that do not correspond to an existing TCB.

 This limit of 100 TCBs in the `SYN Rcvd` state was chosen to protect the TCP entity from the risk of overloading its memory with too many TCBs in the `SYN Rcvd` state. However, it was also the reason for a new type of Denial of Service (DoS) attack :rfc:`4987`. A DoS attack is defined as an attack where an attacker can render a resource unavailable in the network. For example, an attacker may cause a DoS attack on a 2 Mbps link used by a company by sending more than 2 Mbps of packets through this link. In this case, the DoS attack was more subtle. As a TCP entity discards all received `SYN` segments as soon as it has 100 TCBs in the `SYN Rcvd` state, an attacker simply had to send a few hundred `SYN` segments every second to a server and never reply to the received `SYN+ACK` segments. To avoid being caught, attackers were of course sending these `SYN` segments with a different address than their own IP address [#fspoofing]_. On most TCP implementations, once a TCB entered the `SYN Rcvd` state, it remained in this state for several seconds, waiting for a retransmission of the initial `SYN` segment. This attack was later called a `SYN flood` attack and the servers of the ISP named panix were among the first to `be affected `_ by this attack.

@@ -156,18 +156,18 @@

 - the high order bits contain the low order bits of a counter that is incremented slowly
 - the low order bits contain a hash value computed over the local and remote IP addresses and ports and a random secret only known to the server
-
+
 The advantage of the `SYN cookies`_ is that by using them, the server does not need to create a :term:`TCB` upon reception of the `SYN` segment and can still check the returned `ACK` segment by recomputing the `SYN cookie`. The main disadvantage is that they are not fully compatible with the TCP options. This is why they are not enabled by default on a typical system.
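 One possible `SYN cookie` construction along these lines is sketched below. The hash function, the field widths and the 64 second counter period are assumptions made for the example, not values mandated by any implementation :

 .. code-block:: python

    import hashlib
    import time

    SECRET = b'random-secret-only-known-to-the-server'

    def syn_cookie(local_addr, local_port, remote_addr, remote_port):
        # high order bits : low order bits of a slowly incremented counter
        counter = (int(time.time()) >> 6) & 0xFF
        # low order bits : hash of the addresses, the ports and the secret
        ident = f'{local_addr}:{local_port}:{remote_addr}:{remote_port}'.encode()
        low = int.from_bytes(hashlib.sha256(ident + SECRET).digest()[:3], 'big')
        # the 32-bit result is used as the ISN placed in the SYN+ACK segment
        return (counter << 24) | low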
 .. topic:: Retransmitting the first `SYN` segment

- As IP provides an unreliable connectionless service, the `SYN` and `SYN+ACK` segments sent to open a TCP connection could be lost. Current TCP implementations start a retransmission timer when they send the first `SYN` segment. This timer is often set to three seconds for the first retransmission and then doubles after each retransmission :rfc:`2988`. TCP implementations also enforce a maximum number of retransmissions for the initial `SYN` segment.
+ As IP provides an unreliable connectionless service, the `SYN` and `SYN+ACK` segments sent to open a TCP connection could be lost. Current TCP implementations start a retransmission timer when they send the first `SYN` segment. This timer is often set to three seconds for the first retransmission and then doubles after each retransmission :rfc:`2988`. TCP implementations also enforce a maximum number of retransmissions for the initial `SYN` segment.

.. index:: TCP Options

-As explained earlier, TCP segments may contain an optional header extension. In the `SYN` and `SYN+ACK` segments, these options are used to negotiate some parameters and the utilisation of extensions to the basic TCP specification.
+As explained earlier, TCP segments may contain an optional header extension. In the `SYN` and `SYN+ACK` segments, these options are used to negotiate some parameters and the utilisation of extensions to the basic TCP specification.

.. index:: TCP MSS, Maximum Segment Size, MSS

@@ -183,7 +183,7 @@ The TCP options are encoded by using a Type Length Value format where :

 :rfc:`793` defines the Maximum Segment Size (MSS) TCP option that must be understood by all TCP implementations. This option (type 2) has a length of 4 bytes and contains a 16 bits word that indicates the MSS supported by the sender of the `SYN` segment. The MSS option can only be used in TCP segments having the `SYN` flag set.

-:rfc:`793` also defines two special options that must be supported by all TCP implementations. The first option is `End of option`. It is encoded as a single byte having value `0x00` and can be used to ensure that the TCP header extension ends on a 32 bits boundary. The `No-Operation` option, encoded as a single byte having value `0x01`, can be used when the TCP header extension contains several TCP options that should be aligned on 32 bit boundaries. All other options [#ftcpoptions]_ are encoded by using the TLV format.
+:rfc:`793` also defines two special options that must be supported by all TCP implementations. The first option is `End of option`. It is encoded as a single byte having value `0x00` and can be used to ensure that the TCP header extension ends on a 32 bits boundary. The `No-Operation` option, encoded as a single byte having value `0x01`, can be used when the TCP header extension contains several TCP options that should be aligned on 32 bit boundaries. All other options [#ftcpoptions]_ are encoded by using the TLV format.

.. note:: The robustness principle

@@ -197,7 +197,7 @@ The TCP options are encoded by using a Type Length Value format where :

TCP reliable data transfer
--------------------------

-The original TCP data transfer mechanisms were defined in :rfc:`793`. Based on the experience of using TCP on the growing global Internet, this part of the TCP specification has been updated and improved several times, always while preserving the backward compatibility with older TCP implementations. In this section, we review the main data transfer mechanisms used by TCP.
+The original TCP data transfer mechanisms were defined in :rfc:`793`. Based on the experience of using TCP on the growing global Internet, this part of the TCP specification has been updated and improved several times, always preserving backward compatibility with older TCP implementations. In this section, we review the main data transfer mechanisms used by TCP.

TCP is a window-based transport protocol that provides a bi-directional byte stream service. This has several implications for the fields of the TCP header and the mechanisms used by TCP.
The three fields of the TCP header are : @@ -213,20 +213,20 @@ TCP is a window-based transport protocol that provides a bi-directional byte str - the local IP address - the remote IP address - - the local TCP port number + - the local TCP port number - the remote TCP port number - - the current state of the TCP FSM - - the `maximum segment size` (MSS) + - the current state of the TCP FSM + - the `maximum segment size` (MSS) - `snd.nxt` : the sequence number of the next byte in the byte stream (the first byte of a new data segment that you send uses this sequence number) - `snd.una` : the earliest sequence number that has been sent but has not yet been acknowledged - `snd.wnd` : the current size of the sending window (in bytes) - `rcv.nxt` : the sequence number of the next byte that is expected to be received from the remote host - `rcv.wnd` : the current size of the receive window advertised by the remote host - `sending buffer` : a buffer used to store all unacknowledged data - - `receiving buffer` : a buffer to store all data received from the remote host that has not yet been delivered to the user. Data may be stored in the `receiving buffer` because either it was not received in sequence or because the user is too slow to process it + - `receiving buffer` : a buffer to store all data received from the remote host that has not yet been delivered to the user. Data may be stored in the `receiving buffer` because either it was not received in sequence or because the user is too slow to process it -The original TCP specification can be categorised as a transport protocol that provides a byte stream service and uses `go-back-n`. +The original TCP specification can be categorised as a transport protocol that provides a byte stream service and uses `go-back-n`. To send new data on an established connection, a TCP entity performs the following operations on the corresponding TCB. It first checks that the `sending buffer` does not contain more data than the receive window advertised by the remote host (`rcv.wnd`). If the window is not full, up to `MSS` bytes of data are placed in the payload of a TCP segment. The `sequence number` of this segment is the sequence number of the first byte of the payload. It is set to the first available sequence number : `snd.nxt` and `snd.nxt` is incremented by the length of the payload of the TCP segment. The `acknowledgement number` of this segment is set to the current value of `rcv.nxt` and the `window` field of the TCP segment is computed based on the current occupancy of the `receiving buffer`. The data is kept in the `sending buffer` in case it needs to be retransmitted later. @@ -237,9 +237,9 @@ Segment transmission strategies .. index:: Nagle algorithm -In a transport protocol such as TCP that offers a bytestream, a practical issue that was left as an implementation choice in :rfc:`793` is to decide when a new TCP segment containing data must be sent. There are two simple and extreme implementation choices. The first implementation choice is to send a TCP segment as soon as the user has requested the transmission of some data. This allows TCP to provide a low delay service. However, if the user is sending data one byte at a time, TCP would place each user byte in a segment containing 20 bytes of TCP header [#fnagleip]_. This is a huge overhead that is not acceptable in wide area networks. A second simple solution would be to only transmit a new TCP segment once the user has produced MSS bytes of data. 
This solution reduces the overhead, but at the cost of a potentially very high delay.
+In a transport protocol such as TCP that offers a bytestream, a practical issue that was left as an implementation choice in :rfc:`793` is to decide when a new TCP segment containing data must be sent. There are two simple and extreme implementation choices. The first implementation choice is to send a TCP segment as soon as the user has requested the transmission of some data. This allows TCP to provide a low delay service. However, if the user is sending data one byte at a time, TCP would place each user byte in a segment containing 20 bytes of TCP header [#fnagleip]_. This is a huge overhead that is not acceptable in wide area networks. A second simple solution would be to only transmit a new TCP segment once the user has produced MSS bytes of data. This solution reduces the overhead, but at the cost of a potentially very high delay.

-An elegant solution to this problem was proposed by John Nagle in :rfc:`896`. John Nagle observed that the overhead caused by the TCP header was a problem in wide area connections, but less in local area connections where the available bandwidth is usually higher. He proposed the following rules to decide to send a new data segment when a new data has been produced by the user or a new ack segment has been received
+An elegant solution to this problem was proposed by John Nagle in :rfc:`896`. John Nagle observed that the overhead caused by the TCP header was a problem in wide area connections, but less in local area connections where the available bandwidth is usually higher. He proposed the following rules to decide whether to send a new data segment when new data has been produced by the user or a new ack segment has been received :

.. code-block:: python

@@ -252,7 +252,7 @@ An elegant solution to this problem was proposed by John Nagle in :rfc:`896`. Jo

       send one TCP segment containing all buffered data

 The first rule ensures that a TCP connection used for bulk data transfer always sends full TCP segments. The second rule sends one partially filled TCP segment every round-trip-time.
-
+
.. index:: packet size distribution

This algorithm, called the Nagle algorithm, takes a few lines of code in all TCP implementations. These lines of code have a huge impact on the packets that are exchanged in TCP/IP networks. Researchers have analysed the distribution of the packet sizes by capturing and analysing all the packets passing through a given link. These studies have shown several important results :

@@ -263,7 +263,7 @@ This algorithm, called the Nagle algorithm, takes a few lines of code in all TCP

.. The figure below provides a distribution of the packet sizes measured on a link. It shows a three-modal distribution of the packet size. 50% of the packets contain pure TCP acknowledgements and occupy 40 bytes. About 20% of the packets contain about 500 bytes [#fmss500]_ of user data and 12% of the packets contain 1460 bytes of user data. However, most of the user data is transported in large packets. This packet size distribution has implications on the design of routers as we discuss in the next chapter.

-`Recent measurements `_ indicate that these packet size distributions are still valid in today's Internet, although the packet distribution tends to become bimodal with small packets corresponding to TCP pure acks and large 1440-bytes packets carrying most of the user data [SMASU2012]_.
+`Recent measurements `_ indicate that these packet size distributions are still valid in today's Internet, although the packet distribution tends to become bimodal with small packets corresponding to TCP pure acks and large 1440-bytes packets carrying most of the user data [SMASU2012]_.

@@ -272,29 +272,29 @@ This algorithm, called the Nagle algorithm, takes a few lines of code in all TCP

TCP windows
-----------

-From a performance point of view, one of the main limitations of the original TCP specification is the 16 bits `window` field in the TCP header. As this field indicates the current size of the receive window in bytes, it limits the TCP receive window at 65535 bytes. This limitation was not a severe problem when TCP was designed since at that time high-speed wide area networks offered a maximum bandwidth of 56 kbps. However, in today's network, this limitation is not acceptable anymore. The table below provides the rough [#faveragebandwidth]_ maximum throughput that can be achieved by a TCP connection with a 64 KBytes window in function of the connection's round-trip-time
+From a performance point of view, one of the main limitations of the original TCP specification is the 16 bits `window` field in the TCP header. As this field indicates the current size of the receive window in bytes, it limits the TCP receive window to 65535 bytes. This limitation was not a severe problem when TCP was designed since at that time high-speed wide area networks offered a maximum bandwidth of 56 kbps. However, in today's networks, this limitation is not acceptable anymore. The table below provides the rough [#faveragebandwidth]_ maximum throughput that can be achieved by a TCP connection with a 64 KBytes window as a function of the connection's round-trip-time :

-======== ==================
- RTT      Maximum Throughput
-======== ==================
+======== ==================
+ RTT      Maximum Throughput
+======== ==================
 1 msec    524 Mbps
 10 msec   52.4 Mbps
 100 msec  5.24 Mbps
 500 msec  1.05 Mbps
-======== ==================
+======== ==================

-To solve this problem, a backward compatible extension that allows TCP to use larger receive windows was proposed in :rfc:`1323`. Today, most TCP implementations support this option. The basic idea is that instead of storing `snd.wnd` and `rcv.wnd` as 16 bits integers in the :term:`TCB`, they should be stored as 32 bits integers. As the TCP segment header only contains 16 bits to place the window field, it is impossible to copy the value of `snd.wnd` in each sent TCP segment. Instead the header contains `snd.wnd >> S` where `S` is the scaling factor ( :math:`0 \le S \le 14`) negotiated during connection establishment. The client adds its proposed scaling factor as a TCP option in the `SYN` segment. If the server supports :rfc:`1323`, it places in the `SYN+ACK` segment the scaling factor that it uses when advertising its own receive window. The local and remote scaling factors are included in the :term:`TCB`. If the server does not support :rfc:`1323`, it ignores the received option and no scaling is applied.
+To solve this problem, a backward compatible extension that allows TCP to use larger receive windows was proposed in :rfc:`1323`. Today, most TCP implementations support this option. The basic idea is that instead of storing `snd.wnd` and `rcv.wnd` as 16 bits integers in the :term:`TCB`, they should be stored as 32 bits integers. As the TCP segment header only contains 16 bits to place the window field, it is impossible to copy the value of `snd.wnd` in each sent TCP segment. Instead, the header contains `snd.wnd >> S` where `S` is the scaling factor ( :math:`0 \le S \le 14`) negotiated during connection establishment. The client adds its proposed scaling factor as a TCP option in the `SYN` segment. If the server supports :rfc:`1323`, it places in the `SYN+ACK` segment the scaling factor that it uses when advertising its own receive window. The local and remote scaling factors are included in the :term:`TCB`. If the server does not support :rfc:`1323`, it ignores the received option and no scaling is applied.
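As a rough illustration of this mechanism (the buffer size and scaling factor below are only an example, not values taken from an actual implementation), the following snippet shows how a 1 MByte receive window could be advertised with a scaling factor of `5` and reconstructed by the peer :

.. code-block:: python

    def advertised_window(rcv_wnd, scale):
        # only 16 bits are available in the header : advertise rcv.wnd >> S
        return min(rcv_wnd >> scale, 0xFFFF)

    rcv_wnd = 1_000_000                 # 1 MByte receive buffer
    scale = 5                           # scaling factor carried in the SYN option
    adv = advertised_window(rcv_wnd, scale)
    print(adv, adv << scale)            # 31250 on the wire, 1000000 at the peer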
By using the window scaling extensions defined in :rfc:`1323`, TCP implementations can use a receive buffer of up to 1 GByte. With such a receive buffer, the maximum throughput that can be achieved by a single TCP connection becomes :

-======== ==================
- RTT      Maximum Throughput
-======== ==================
+======== ==================
+ RTT      Maximum Throughput
+======== ==================
 1 msec    8590 Gbps
 10 msec   859 Gbps
 100 msec  86 Gbps
 500 msec  17 Gbps
-======== ==================
+======== ==================

These throughputs are acceptable in today's networks. However, there are already servers having 10 Gbps interfaces... Early TCP implementations had fixed receiving and sending buffers [#ftcphosts]_. Today's high performance implementations are able to automatically adjust the size of the sending and receiving buffer to better support high bandwidth flows [SMM1998]_.

@@ -309,37 +309,37 @@ A good setting of the retransmission timeout clearly depends on an accurate esti

.. figure:: /../book/transport/png/transport-fig-070-c.png
   :align: center
-   :scale: 70
+   :scale: 70

-   Evolution of the round-trip-time between two hosts
+   Evolution of the round-trip-time between two hosts

The easiest solution to measure the round-trip-time on a TCP connection is to measure the delay between the transmission of a data segment and the reception of a corresponding acknowledgement [#frttmes]_. As illustrated in the figure below, this measurement works well when there are no segment losses.

.. figure:: /../book/transport/png/transport-fig-072-c.png
   :align: center
-   :scale: 70
+   :scale: 70

-   How to measure the round-trip-time ?
+   How to measure the round-trip-time?

However, when a data segment is lost, as illustrated in the bottom part of the figure, the measurement is ambiguous as the sender cannot determine whether the received acknowledgement was triggered by the first transmission of segment `123` or its retransmission. Using incorrect round-trip-time estimations could lead to incorrect values of the retransmission timeout. For this reason, Phil Karn and Craig Partridge proposed, in [KP91]_, to ignore the round-trip-time measurements performed during retransmissions.

To avoid this ambiguity in the estimation of the round-trip-time when segments are retransmitted, recent TCP implementations rely on the `timestamp option` defined in :rfc:`1323`. This option allows a TCP sender to place two 32 bit timestamps in each TCP segment that it sends. The first timestamp, TS Value (`TSval`), is chosen by the sender of the segment. It could for example be the current value of its real-time clock [#ftimestamp]_. The second value, TS Echo Reply (`TSecr`), is the last `TSval` that was received from the remote host and stored in the :term:`TCB`.
The figure below shows how the utilization of this timestamp option allows for the disambiguation of the round-trip-time measurement when there are retransmissions.
-
+
.. figure:: /../book/transport/png/transport-fig-073-c.png
   :align: center
-   :scale: 70
+   :scale: 70

-   Disambiguating round-trip-time measurements with the :rfc:`1323` timestamp option
+   Disambiguating round-trip-time measurements with the :rfc:`1323` timestamp option

-Once the round-trip-time measurements have been collected for a given TCP connection, the TCP entity must compute the retransmission timeout. As the round-trip-time measurements may change during the lifetime of a connection, the retransmission timeout may also change. At the beginning of a connection [#ftcbtouch]_ , the TCP entity that sends a `SYN` segment does not know the round-trip-time to reach the remote host and the initial retransmission timeout is usually set to 3 seconds :rfc:`2988`.
+Once the round-trip-time measurements have been collected for a given TCP connection, the TCP entity must compute the retransmission timeout. As the round-trip-time measurements may change during the lifetime of a connection, the retransmission timeout may also change. At the beginning of a connection [#ftcbtouch]_, the TCP entity that sends a `SYN` segment does not know the round-trip-time to reach the remote host and the initial retransmission timeout is usually set to 3 seconds :rfc:`2988`.

-The original TCP specification proposed in :rfc:`793` to include two additional variables in the TCB :
+The original TCP specification proposed in :rfc:`793` to include two additional variables in the TCB :

 - `srtt` : the smoothed round-trip-time computed as :math:`srtt=(\alpha \times srtt)+( (1-\alpha) \times rtt)` where `rtt` is the round-trip-time measured according to the above procedure and :math:`\alpha` a smoothing factor (e.g. 0.8 or 0.9)
 - `rto` : the retransmission timeout is computed as :math:`rto=\min(60,\max(1,\beta \times srtt))` where :math:`\beta` is used to take into account the delay variance (value : 1.3 to 2.0). The `60` and `1` constants are used to ensure that the `rto` is not larger than one minute nor smaller than 1 second.
-
+
However, in practice, this computation for the retransmission timeout did not work well. The main problem was that the computed `rto` did not correctly take into account the variations in the measured round-trip-time. `Van Jacobson` proposed in his seminal paper [Jacobson1988]_ an improved algorithm to compute the `rto` and implemented it in the BSD Unix distribution. This algorithm is now part of the TCP standard :rfc:`2988`.

Jacobson's algorithm uses two state variables, `srtt` the smoothed `rtt` and `rttvar` the estimation of the variance of the `rtt` and two parameters : :math:`\alpha` and :math:`\beta`. When a TCP connection starts, the first `rto` is set to `3` seconds. When a first estimation of the `rtt` is available, the `srtt`, `rttvar` and `rto` are computed as follows :

@@ -356,7 +356,7 @@ Then, when other rtt measurements are collected, `srtt` and `rttvar` are updated

 :math:`rttvar=(1-\beta) \times rttvar + \beta \times |srtt - rtt|`

 :math:`srtt=(1-\alpha) \times srtt + \alpha \times rtt`
-
+
 :math:`rto=srtt + 4 \times rttvar`

The proposed values for the parameters are :math:`\alpha=\frac{1}{8}` and :math:`\beta=\frac{1}{4}`. This allows a TCP implementation, implemented in the kernel, to perform the `rtt` computation by using shift operations instead of the more costly floating point operations [Jacobson1988]_.
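The sketch below summarises this computation. It is only an illustration of the formulas above, using floating point arithmetic for readability where a kernel would use the shift-based integer version :

.. code-block:: python

    def rto_update(rtt, state=None, alpha=1/8, beta=1/4):
        # state is None before the first rtt measurement (rto starts at 3 s)
        if state is None:
            srtt, rttvar = rtt, rtt / 2
        else:
            srtt, rttvar = state
            rttvar = (1 - beta) * rttvar + beta * abs(srtt - rtt)
            srtt = (1 - alpha) * srtt + alpha * rtt
        rto = srtt + 4 * rttvar
        return (srtt, rttvar), rto

    state, rto = rto_update(0.100)          # first measurement : 100 msec
    state, rto = rto_update(0.120, state)   # rto follows the rtt variations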
The figure below illustrates the computation of the `rto` upon `rtt` changes. @@ -364,16 +364,16 @@ The proposed values for the parameters are :math:`\alpha=\frac{1}{8}` and :math: .. figure:: /../book/transport/png/transport-fig-071-c.png :align: center - :scale: 70 + :scale: 70 Example computation of the `rto` - + Advanced retransmission strategies ---------------------------------- .. index:: exponential backoff - + The default go-back-n retransmission strategy was defined in :rfc:`793`. When the retransmission timer expires, TCP retransmits the first unacknowledged segment (i.e. the one having sequence number `snd.una`). After each expiration of the retransmission timeout, :rfc:`2988` recommends to double the value of the retransmission timeout. This is called an `exponential backoff`. This doubling of the retransmission timeout after a retransmission was included in TCP to deal with issues such as network/receiver overload and incorrect initial estimations of the retransmission timeout. If the same segment is retransmitted several times, the retransmission timeout is doubled after every retransmission until it reaches a configured maximum. :rfc:`2988` suggests a maximum retransmission timeout of at least 60 seconds. Once the retransmission timeout reaches this configured maximum, the remote host is considered to be unreachable and the TCP connection is closed. @@ -385,7 +385,7 @@ This retransmission strategy has been refined based on the experience of using T reception of a data segment: if pkt.seq==rcv.nxt: # segment received in sequence - if delayedack : + if delayedack : send pure ack segment cancel acktimer delayedack=False @@ -396,30 +396,30 @@ This retransmission strategy has been refined based on the experience of using T send pure ack segment if delayedack: delayedack=False - cancel acktimer + cancel acktimer transmission of a data segment: # piggyback ack if delayedack: delayedack=False cancel acktimer - + acktimer expiration: send pure ack segment delayedack=False Due to this delayed acknowledgement strategy, during a bulk transfer, a TCP implementation usually acknowledges every second TCP segment received. -The default go-back-n retransmission strategy used by TCP has the advantage of being simple to implement, in particular on the receiver side, but when there are losses, a go-back-n strategy provides a lower performance than a selective repeat strategy. The TCP developers have designed several extensions to TCP to allow it to use a selective repeat strategy while maintaining backward compatibility with older TCP implementations. These TCP extensions assume that the receiver is able to buffer the segments that it receives out-of-sequence. +The default go-back-n retransmission strategy used by TCP has the advantage of being simple to implement, in particular on the receiver side, but when there are losses, a go-back-n strategy provides a lower performance than a selective repeat strategy. The TCP developers have designed several extensions to TCP to allow it to use a selective repeat strategy while maintaining backward compatibility with older TCP implementations. These TCP extensions assume that the receiver is able to buffer the segments that it receives out-of-sequence. .. index:: TCP fast retransmit -The first extension that was proposed is the fast retransmit heuristic. This extension can be implemented on TCP senders and thus does not require any change to the protocol. It only assumes that the TCP receiver is able to buffer out-of-sequence segments. 
+The first extension that was proposed is the fast retransmit heuristic. This extension can be implemented on TCP senders and thus does not require any change to the protocol. It only assumes that the TCP receiver is able to buffer out-of-sequence segments.

From a performance point of view, one issue with TCP's `retransmission timeout` is that when there are isolated segment losses, the TCP sender often remains idle waiting for the expiration of its retransmission timeouts. Such isolated losses are frequent in the global Internet [Paxson99]_. A heuristic to deal with isolated losses without waiting for the expiration of the retransmission timeout has been included in many TCP implementations since the early 1990s. To understand this heuristic, let us consider the figure below that shows the segments exchanged over a TCP connection when an isolated segment is lost.

-.. figure:: /../book/transport/png/transport-fig-074-c.png
+.. figure:: /../book/transport/png/transport-fig-074-c.png
   :align: center
-   :scale: 70
+   :scale: 70

   Detecting isolated segment losses

@@ -441,9 +441,9 @@ This heuristic requires an additional variable in the TCB (`dupacks`). Most impl

The figure below illustrates the operation of the `fast retransmit` heuristic.

-.. figure:: /../book/transport/png/transport-fig-075-c.png
+.. figure:: /../book/transport/png/transport-fig-075-c.png
   :align: center
-   :scale: 70
+   :scale: 70

   TCP fast retransmit heuristics

@@ -452,23 +452,23 @@

When losses are not isolated or when the windows are small, the performance of the `fast retransmit` heuristic decreases. In such environments, it is necessary to allow a TCP sender to use a selective repeat strategy instead of the default go-back-n strategy. Implementing selective-repeat requires a change to the TCP protocol as the receiver needs to be able to inform the sender of the out-of-order segments that it has already received. This can be done by using the Selective Acknowledgements (SACK) option defined in :rfc:`2018`. This TCP option is negotiated during the establishment of a TCP connection. If both TCP hosts support the option, SACK blocks can be attached by the receiver to the segments that it sends. SACK blocks allow a TCP receiver to indicate the blocks of data that it has received correctly but out of sequence. The figure below illustrates the utilisation of the SACK blocks.

-.. figure:: /../book/transport/png/transport-fig-076-c.png
+.. figure:: /../book/transport/png/transport-fig-076-c.png
   :align: center
-   :scale: 70
+   :scale: 70

   TCP selective acknowledgements

A SACK option contains one or more blocks. A block corresponds to all the sequence numbers between the `left edge` and the `right edge` of the block. The two edges of the block are encoded as 32 bit numbers (the same size as the TCP sequence number) in a SACK option. As the SACK option contains one byte to encode its type and one byte for its length, a SACK option containing `b` blocks is encoded as a sequence of :math:`2+8 \times b` bytes. In practice, the size of the SACK option can be problematic as the optional TCP header extension cannot be longer than 40 bytes. As the SACK option is usually combined with the :rfc:`1323` timestamp extension, this implies that a TCP segment cannot usually contain more than three SACK blocks. This limitation implies that a TCP receiver cannot always place, in the SACK option that it sends, information about all the received blocks.
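The arithmetic behind this limit can be checked with a few lines of code. The 12 bytes assumed for the timestamp option (10 bytes plus two ``No-Operation`` bytes for alignment) are the usual layout, not a requirement :

.. code-block:: python

    def sack_option_length(b):
        # one type byte, one length byte and two 32-bit edges per block
        return 2 + 8 * b

    OPTION_SPACE = 60 - 20      # maximum TCP header minus the fixed header
    TIMESTAMPS = 10 + 2         # timestamp option padded with two NOPs

    for b in range(1, 5):
        used = TIMESTAMPS + sack_option_length(b)
        print(b, "blocks :", used, "bytes,",
              "fits" if used <= OPTION_SPACE else "does not fit")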
To deal with the limited size of the SACK option, a TCP receiver currently having more than 3 blocks inside its receiving buffer must select the blocks to place in the SACK option. A good heuristic is to put in the SACK option the blocks that have most recently changed, as the sender is likely to be already aware of the older blocks.

When a sender receives a SACK option indicating a new block and thus a new possible segment loss, it usually does not retransmit the missing segments immediately. To deal with reordering, a TCP sender can use a heuristic similar to `fast retransmit` by retransmitting a gap only once it has received three SACK options indicating this gap. It should be noted that the SACK option does not supersede the `acknowledgement number` of the TCP header. A TCP sender can only remove data from its sending buffer once they have been acknowledged by TCP's cumulative acknowledgements. This design was chosen for two reasons. First, it allows the receiver to discard parts of its receiving buffer when it is running out of memory without losing data. Second, as the SACK option is not transmitted reliably, the cumulative acknowledgements are still required to deal with losses of `ACK` segments carrying only SACK information. Thus, the SACK option only serves as a hint to allow the sender to optimise its retransmissions.

.. Protection against wrapped sequence numbers
-
+
.. todo

.. Many researchers have worked on techniques to improve the data transfer mechanisms used by TCP.

@@ -492,33 +492,33 @@ The abrupt connection release mechanism is very simple and relies on a single se

 - by extension, some implementations respond with an `RST` segment to a segment that is received on an existing connection but with an invalid header :rfc:`3360`. This causes the corresponding connection to be closed and has caused security attacks :rfc:`4953`
 - by extension, some implementations send an `RST` segment when they need to close an existing TCP connection (e.g. because there are not enough resources to support this connection or because the remote host is considered to be unreachable). Measurements have shown that this usage of TCP `RST` is widespread [AW05]_

-When an `RST` segment is sent by a TCP entity, it should contain the current value of the `sequence number` for the connection (or 0 if it does not belong to any existing connection) and the `acknowledgement number` should be set to the next expected in-sequence `sequence number` on this connection.
+When an `RST` segment is sent by a TCP entity, it should contain the current value of the `sequence number` for the connection (or 0 if it does not belong to any existing connection) and the `acknowledgement number` should be set to the next expected in-sequence `sequence number` on this connection.

.. note:: TCP `RST` wars

 .. index:: Robustness principle

- TCP implementers should ensure that two TCP entities never enter a TCP `RST` war where host `A` is sending a `RST` segment in response to a previous `RST` segment that was sent by host `B` in response to a TCP `RST` segment sent by host `A` ...
To avoid such an infinite exchange of `RST` segments that do not carry data, a TCP entity is *never* allowed to send a `RST` segment in response to another `RST` segment. + TCP implementers should ensure that two TCP entities never enter a TCP `RST` war where host `A` is sending a `RST` segment in response to a previous `RST` segment that was sent by host `B` in response to a TCP `RST` segment sent by host `A` ... To avoid such an infinite exchange of `RST` segments that do not carry data, a TCP entity is *never* allowed to send a `RST` segment in response to another `RST` segment. -The normal way of terminating a TCP connection is by using the graceful TCP connection release. This mechanism uses the `FIN` flag of the TCP header and allows each host to release its own direction of data transfer. As for the `SYN` flag, the utilisation of the `FIN` flag in the TCP header consumes one sequence number. The figure :ref:`fig-tcprelease` shows the part of the TCP FSM used when a TCP connection is released. + +The normal way of terminating a TCP connection is by using the graceful TCP connection release. This mechanism uses the `FIN` flag of the TCP header and allows each host to release its own direction of data transfer. As for the `SYN` flag, the utilisation of the `FIN` flag in the TCP header consumes one sequence number. The figure :ref:`fig-tcprelease` shows the part of the TCP FSM used when a TCP connection is released. .. _fig-tcprelease: .. figure:: /../book/transport/png/transport-fig-067-c.png :align: center - :scale: 70 + :scale: 70 FSM for TCP connection release Starting from the `Established` state, there are two main paths through this FSM. -The first path is when the host receives a segment with sequence number `x` and the `FIN` flag set. The utilisation of the `FIN` flag indicates that the byte before `sequence number` `x` was the last byte of the byte stream sent by the remote host. Once all of the data has been delivered to the user, the TCP entity sends an `ACK` segment whose `ack` field is set to :math:`(x+1) \pmod{2^{32}}` to acknowledge the `FIN` segment. The `FIN` segment is subject to the same retransmission mechanisms as a normal TCP segment. In particular, its transmission is protected by the retransmission timer. At this point, the TCP connection enters the `CLOSE\_WAIT` state. In this state, the host can still send data to the remote host. Once all its data have been sent, it sends a `FIN` segment and enter the `LAST\_ACK` state. In this state, the TCP entity waits for the acknowledgement of its `FIN` segment. It may still retransmit unacknowledged data segments e.g. if the retransmission timer expires. Upon reception of the acknowledgement for the `FIN` segment, the TCP connection is completely closed and its :term:`TCB` can be discarded. +The first path is when the host receives a segment with sequence number `x` and the `FIN` flag set. The utilisation of the `FIN` flag indicates that the byte before `sequence number` `x` was the last byte of the byte stream sent by the remote host. Once all of the data has been delivered to the user, the TCP entity sends an `ACK` segment whose `ack` field is set to :math:`(x+1) \pmod{2^{32}}` to acknowledge the `FIN` segment. The `FIN` segment is subject to the same retransmission mechanisms as a normal TCP segment. In particular, its transmission is protected by the retransmission timer. At this point, the TCP connection enters the `CLOSE\_WAIT` state. In this state, the host can still send data to the remote host. 
-The normal way of terminating a TCP connection is by using the graceful TCP connection release. This mechanism uses the `FIN` flag of the TCP header and allows each host to release its own direction of data transfer. As for the `SYN` flag, the utilisation of the `FIN` flag in the TCP header consumes one sequence number. The figure :ref:`fig-tcprelease` shows the part of the TCP FSM used when a TCP connection is released. 
+
+The normal way of terminating a TCP connection is by using the graceful TCP connection release. This mechanism uses the `FIN` flag of the TCP header and allows each host to release its own direction of data transfer. As for the `SYN` flag, the utilisation of the `FIN` flag in the TCP header consumes one sequence number. The figure :ref:`fig-tcprelease` shows the part of the TCP FSM used when a TCP connection is released.

.. _fig-tcprelease:

.. figure:: /../book/transport/png/transport-fig-067-c.png
   :align: center
-   :scale: 70 
+   :scale: 70

   FSM for TCP connection release

Starting from the `Established` state, there are two main paths through this FSM.

-The first path is when the host receives a segment with sequence number `x` and the `FIN` flag set. The utilisation of the `FIN` flag indicates that the byte before `sequence number` `x` was the last byte of the byte stream sent by the remote host. Once all of the data has been delivered to the user, the TCP entity sends an `ACK` segment whose `ack` field is set to :math:`(x+1) \pmod{2^{32}}` to acknowledge the `FIN` segment. The `FIN` segment is subject to the same retransmission mechanisms as a normal TCP segment. In particular, its transmission is protected by the retransmission timer. At this point, the TCP connection enters the `CLOSE\_WAIT` state. In this state, the host can still send data to the remote host. Once all its data have been sent, it sends a `FIN` segment and enter the `LAST\_ACK` state. In this state, the TCP entity waits for the acknowledgement of its `FIN` segment. It may still retransmit unacknowledged data segments e.g. if the retransmission timer expires. Upon reception of the acknowledgement for the `FIN` segment, the TCP connection is completely closed and its :term:`TCB` can be discarded. 
+The first path is when the host receives a segment with sequence number `x` and the `FIN` flag set. The utilisation of the `FIN` flag indicates that the byte before `sequence number` `x` was the last byte of the byte stream sent by the remote host. Once all of the data has been delivered to the user, the TCP entity sends an `ACK` segment whose `ack` field is set to :math:`(x+1) \pmod{2^{32}}` to acknowledge the `FIN` segment. The `FIN` segment is subject to the same retransmission mechanisms as a normal TCP segment. In particular, its transmission is protected by the retransmission timer. At this point, the TCP connection enters the `CLOSE\_WAIT` state. In this state, the host can still send data to the remote host. Once all its data have been sent, it sends a `FIN` segment and enters the `LAST\_ACK` state. In this state, the TCP entity waits for the acknowledgement of its `FIN` segment. It may still retransmit unacknowledged data segments e.g. if the retransmission timer expires. Upon reception of the acknowledgement for the `FIN` segment, the TCP connection is completely closed and its :term:`TCB` can be discarded.

The second path is when the host has transmitted all data. Assume that the last transmitted sequence number is `z`. Then, the host sends a `FIN` segment with sequence number :math:`(z+1) \pmod{2^{32}}` and enters the `FIN_WAIT1` state. In this state, it can retransmit unacknowledged segments but cannot send new data segments. It waits for an acknowledgement of its `FIN` segment (i.e. sequence number :math:`(z+1) \pmod{2^{32}}`), but may receive a `FIN` segment sent by the remote host. In the first case, the TCP connection enters the `FIN\_WAIT2` state. In this state, new data segments from the remote host are still accepted until the reception of the `FIN` segment. The acknowledgement for this `FIN` segment is sent once all data received before the `FIN` segment have been delivered to the user and the connection enters the `TIME\_WAIT` state. In the second case, a `FIN` segment is received and the connection enters the `Closing` state once all data received from the remote host have been delivered to the user. In this state, no new data segments can be sent and the host waits for an acknowledgement of its `FIN` segment before entering the `TIME\_WAIT` state.
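The two paths can be summarised as a small transition table. The sketch below covers only the release-related transitions described in this section; it is an illustration, not a complete TCP FSM.

.. code-block:: python

   # Release part of the TCP FSM, as a (state, event) -> state table.
   RELEASE_FSM = {
       ("Established", "recv FIN"): "Close_Wait",     # first path
       ("Close_Wait", "send FIN"): "Last_Ack",
       ("Last_Ack", "recv ACK of FIN"): "Closed",     # TCB discarded
       ("Established", "send FIN"): "Fin_Wait1",      # second path
       ("Fin_Wait1", "recv ACK of FIN"): "Fin_Wait2",
       ("Fin_Wait1", "recv FIN"): "Closing",          # simultaneous close
       ("Fin_Wait2", "recv FIN"): "Time_Wait",
       ("Closing", "recv ACK of FIN"): "Time_Wait",
   }

   def next_state(state, event):
       return RELEASE_FSM[(state, event)]

   assert next_state("Established", "recv FIN") == "Close_Wait"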
-The `TIME\_WAIT` state is different from the other states of the TCP FSM. A TCP entity enters this state after having sent the last `ACK` segment on a TCP connection. This segment indicates to the remote host that all the data that it has sent have been correctly received and that it can safely release the TCP connection and discard the corresponding :term:`TCB`. After having sent the last `ACK` segment, a TCP connection enters the `TIME\_WAIT` and remains in this state for :math:`2*MSL` seconds. During this period, the TCB of the connection is maintained. This ensures that the TCP entity that sent the last `ACK` maintains enough state to be able to retransmit this segment if this `ACK` segment is lost and the remote host retransmits its last `FIN` segment or another one. The delay of :math:`2*MSL` seconds ensures that any duplicate segments on the connection would be handled correctly without causing the transmission of an `RST` segment. Without the `TIME\_WAIT` state and the :math:`2*MSL` seconds delay, the connection release would not be graceful when the last `ACK` segment is lost. 
+The `TIME\_WAIT` state is different from the other states of the TCP FSM. A TCP entity enters this state after having sent the last `ACK` segment on a TCP connection. This segment indicates to the remote host that all the data that it has sent have been correctly received and that it can safely release the TCP connection and discard the corresponding :term:`TCB`. After having sent the last `ACK` segment, a TCP connection enters the `TIME\_WAIT` state and remains in this state for :math:`2*MSL` seconds. During this period, the TCB of the connection is maintained. This ensures that the TCP entity that sent the last `ACK` maintains enough state to be able to retransmit this segment if this `ACK` segment is lost and the remote host retransmits its last `FIN` segment or another one. The delay of :math:`2*MSL` seconds ensures that any duplicate segments on the connection would be handled correctly without causing the transmission of an `RST` segment. Without the `TIME\_WAIT` state and the :math:`2*MSL` seconds delay, the connection release would not be graceful when the last `ACK` segment is lost.
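The cost of this delay can be estimated with simple arithmetic. The sketch below assumes the `MSL` value of two minutes suggested in :rfc:`793` and the IANA ephemeral port range; real implementations use other values and various optimisations, so the result is only an order of magnitude.

.. code-block:: python

   # Rough estimate of the rate at which a host can open successive
   # connections towards the same remote address and port when each
   # closed connection leaves a local port in the TIME_WAIT state.
   MSL = 120                        # seconds, value suggested in RFC 793
   TIME_WAIT = 2 * MSL              # duration of the TIME_WAIT state
   EPHEMERAL_PORTS = 65535 - 49152  # IANA ephemeral range : 16383 ports

   rate = EPHEMERAL_PORTS / TIME_WAIT
   print(round(rate), "new connections per second")  # about 68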
.. note:: TIME\_WAIT on busy TCP servers

@@ -529,7 +529,7 @@ The `TIME\_WAIT` state is different from the other states of the TCP FSM. A TCP

.. note TCP RST attacks
   Explain TCP reset and the risks of attacks rfc4953
- 
+
.. rubric:: Footnotes

@@ -543,7 +543,7 @@ The `TIME\_WAIT` state is different from the other states of the TCP FSM. A TCP

.. [#frlogin] On many departmental networks containing Unix workstations, it was common to allow users on one of the hosts to use rlogin :rfc:`1258` to run commands on any of the workstations of the network without giving any password. In this case, the remote workstation "authenticated" the client host based on its IP address. This was a bad practice from a security viewpoint.

-.. [#ftcpboth] Of course, such a simultaneous TCP establishment can only occur if the source port chosen by the client is equal to the destination port chosen by the server. This may happen when a host can serve both as a client as a server or in peer-to-peer applications when the communicating hosts do not use ephemeral port numbers. 
+.. [#ftcpboth] Of course, such a simultaneous TCP establishment can only occur if the source port chosen by the client is equal to the destination port chosen by the server. This may happen when a host can serve both as a client and as a server or in peer-to-peer applications when the communicating hosts do not use ephemeral port numbers.

.. [#fspoofing] Sending a packet with a different source IP address than the address allocated to the host is called sending a :term:`spoofed packet`.

@@ -551,11 +551,11 @@ The `TIME\_WAIT` state is different from the other states of the TCP FSM. A TCP

.. [#fackflag] In practice, only the `SYN` segments do not have their `ACK` flag set.

-.. [#ftcpurgent] A complete TCP implementation contains additional information in its TCB, notably to support the `urgent` pointer. However, this part of TCP is not discussed in this book. Refer to :rfc:`793` and :rfc:`2140` for more details about the TCB. 
+.. [#ftcpurgent] A complete TCP implementation contains additional information in its TCB, notably to support the `urgent` pointer. However, this part of TCP is not discussed in this book. Refer to :rfc:`793` and :rfc:`2140` for more details about the TCB.

.. [#fmss] In theory, TCP implementations could send segments as large as the MSS advertised by the remote host during connection establishment. In practice, most implementations use as MSS the minimum between the received MSS and their own MSS. This avoids fragmentation in the underlying IP layer and is discussed in the next chapter.

-.. [#fnagleip] This TCP segment is then placed in an IP header. We describe IPv6 in the next chapter. The minimum size of the IPv6 (resp. IPv4) header is 40 bytes (resp. 20 bytes). 
+.. [#fnagleip] This TCP segment is then placed in an IP header. We describe IPv6 in the next chapter. The minimum size of the IPv6 (resp. IPv4) header is 40 bytes (resp. 20 bytes).

.. [#faveragebandwidth] A precise estimation of the maximum bandwidth that can be achieved by a TCP connection should take into account the overhead of the TCP and IP headers as well.

@@ -566,7 +566,7 @@ The `TIME\_WAIT` state is different from the other states of the TCP FSM. A TCP

.. [#ftimestamp] Some security experts have raised concerns that using the real-time clock to set the `TSval` in the timestamp option can leak information such as the system's up-time. Solutions proposed to solve this problem may be found in [CNPI09]_

-.. [#ftcbtouch] As a TCP client often establishes several parallel or successive connections with the same server, :rfc:`2140` has proposed to reuse for a new connection some information that was collected in the TCB of a previous connection, such as the measured rtt. However, this solution has not been widely implemented. 
+.. [#ftcbtouch] As a TCP client often establishes several parallel or successive connections with the same server, :rfc:`2140` has proposed to reuse for a new connection some information that was collected in the TCB of a previous connection, such as the measured rtt. However, this solution has not been widely implemented.

.. [#fdelack] If the destination is using delayed acknowledgements, the sending host sends two data segments after each acknowledgement.

diff --git a/book-2nd/protocols/transport-service.rst b/book-2nd/protocols/transport-service.rst
index 3617e88..71a1ff6 100644
--- a/book-2nd/protocols/transport-service.rst
+++ b/book-2nd/protocols/transport-service.rst
@@ -5,7 +5,7 @@
The application layer
*********************

-.. warning:: 
+.. warning::
   This is an unpolished draft of the second edition of this ebook. If you find any error or have suggestions to improve the text, please create an issue via https://github.com/obonaventure/cnp3/issues?milestone=5

@@ -14,9 +14,9 @@ Networked applications rely on the transport service. As explained earlier, ther

- the `connectionless` service
- the `connection-oriented` or `byte-stream` service

-The connectionless service allows applications to easily exchange messages or Service Data Units. On the Internet, this service is provided by the UDP protocol that will be explained in the next chapter. The connectionless transport service on the Internet is unreliable, but is able to detect transmission errors. This implies that an application will not receive data that has been corrupted due to transmission errors. 
+The connectionless service allows applications to easily exchange messages or Service Data Units. On the Internet, this service is provided by the UDP protocol that will be explained in the next chapter. The connectionless transport service on the Internet is unreliable, but is able to detect transmission errors. This implies that an application will not receive data that has been corrupted due to transmission errors.
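The socket API gives direct access to this connectionless service. The short sketch below exchanges a single SDU between two UDP sockets on the same host; the port number `11111` is an arbitrary choice for the example.

.. code-block:: python

   # Two applications exchanging one SDU over the connectionless
   # service provided by UDP.
   import socket

   receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
   receiver.bind(("127.0.0.1", 11111))  # port on which the app listens

   sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
   sender.sendto(b"Hello", ("127.0.0.1", 11111))  # one SDU, sent unreliably

   sdu, peer = receiver.recvfrom(1024)  # the SDU and the sender's address
   print(sdu, peer)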
-The connectionless transport service allows networked application to exchange messages. Several networked applications may be running at the same time on a single host. Each of these applications must be able to exchange SDUs with remote applications. To enable these exchanges of SDUs, each networked application running on a host is identified by the following information : 
+The connectionless transport service allows networked applications to exchange messages. Several networked applications may be running at the same time on a single host. Each of these applications must be able to exchange SDUs with remote applications. To enable these exchanges of SDUs, each networked application running on a host is identified by the following information :

- the `host` on which the application is running
- the `port number` on which the application `listens` for SDUs

@@ -26,14 +26,14 @@ On the Internet, the `port number` is an integer and the `host` is identified by

- `IP version 4` addresses that are 32 bits wide
- `IP version 6` addresses that are 128 bits wide

-IPv4 addresses are usually represented by using a dotted decimal representation where each decimal number corresponds to one byte of the address, e.g. `203.0.113.56`. IPv6 addresses are usually represented as a set of hexadecimal numbers separated by semicolons, e.g. `2001:db8:3080:2:217:f2ff:fed6:65c0`. Today, most Internet hosts have one IPv4 address. A small fraction of them also have an IPv6 address. In the future, we can expect that more and more hosts will have IPv6 addresses and that some of them will not have an IPv4 address anymore. A host that only has an IPv4 address cannot communicate with a host having only an IPv6 address. The figure below illustrates two that are using the datagram service provided by UDP on hosts that are using IPv4 addresses. 
+IPv4 addresses are usually represented by using a dotted decimal representation where each decimal number corresponds to one byte of the address, e.g. `203.0.113.56`. IPv6 addresses are usually represented as a set of hexadecimal numbers separated by colons, e.g. `2001:db8:3080:2:217:f2ff:fed6:65c0`. Today, most Internet hosts have one IPv4 address. A small fraction of them also have an IPv6 address. In the future, we can expect that more and more hosts will have IPv6 addresses and that some of them will not have an IPv4 address anymore. A host that only has an IPv4 address cannot communicate with a host having only an IPv6 address. The figure below illustrates two applications that are using the datagram service provided by UDP on hosts that are using IPv4 addresses.

.. figure:: /../book/application/png/app-fig-002-c.png
   :align: center
-   :scale: 80 
+   :scale: 80

-   The connectionless or datagram service 
+   The connectionless or datagram service

.. note:: Textual representation of IPv6 addresses

   - 2001:db8:0:0:8:800:200c:417a
   - fe80:0:0:0:219:e3ff:fed7:1204

-   IPv6 addresses often contain a long sequence of bits set to `0`. In this case, a compact notation has been defined. With this notation, `::` is used to indicate one or more groups of 16 bits blocks containing only bits set to `0`. For example, 
-
+   IPv6 addresses often contain a long sequence of bits set to `0`. In this case, a compact notation has been defined. With this notation, `::` is used to indicate one or more groups of 16-bit blocks containing only bits set to `0`. For example,
+
   - 2001:db8:0:0:8:800:200c:417a is represented as `2001:db8::8:800:200c:417a`
-   - ff01:0:0:0:0:0:0:101 is represented as `ff01::101` 
+   - ff01:0:0:0:0:0:0:101 is represented as `ff01::101`
   - 0:0:0:0:0:0:0:1 is represented as `::1`
   - 0:0:0:0:0:0:0:0 is represented as `\:\:`
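This compact notation is implemented by standard libraries. For instance, Python's standard `ipaddress` module can convert between the full and the compressed textual forms:

.. code-block:: python

   # Compressing and expanding IPv6 addresses with the standard
   # `ipaddress` module.
   import ipaddress

   addr = ipaddress.ip_address("2001:db8:0:0:8:800:200c:417a")
   print(addr.compressed)  # 2001:db8::8:800:200c:417a
   print(ipaddress.ip_address("ff01::101").exploded)
   # ff01:0000:0000:0000:0000:0000:0000:0101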
-The second transport service is the connection-oriented service. On the Internet, this service is often called the `byte-stream service` as it creates a reliable byte stream between the two applications that are linked by a transport connection. Like the datagram service, the networked applications that use the byte-stream service are identified by the host on which they run and a port number. These hosts can be identified by an address or a name. The figure below illustrates two applications that are using the byte-stream service provided by the TCP protocol on IPv6 hosts. The byte stream service provided by TCP is reliable and bidirectional. 
+The second transport service is the connection-oriented service. On the Internet, this service is often called the `byte-stream service` as it creates a reliable byte stream between the two applications that are linked by a transport connection. Like the datagram service, the networked applications that use the byte-stream service are identified by the host on which they run and a port number. These hosts can be identified by an address or a name. The figure below illustrates two applications that are using the byte-stream service provided by the TCP protocol on IPv6 hosts. The byte stream service provided by TCP is reliable and bidirectional.

.. figure:: /../book/application/png/app-fig-003-c.png
   :align: center
-   :scale: 80 
-
-   The connection-oriented or byte-stream service 
+   :scale: 80

+   The connection-oriented or byte-stream service

From 56e813aba47a2907a131c0259cbc89c77fbaade8 Mon Sep 17 00:00:00 2001
From: magalii
Date: Tue, 29 Jan 2019 13:06:59 +0100
Subject: [PATCH 3/3] Correction to throughput formula

---
 book-2nd/protocols/congestion.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/book-2nd/protocols/congestion.rst b/book-2nd/protocols/congestion.rst
index 4841443..90906f0 100644
--- a/book-2nd/protocols/congestion.rst
+++ b/book-2nd/protocols/congestion.rst
@@ -186,7 +186,7 @@ As the losses are equally spaced, the congestion window always starts at some va

However, given the regular losses that we consider, the number of segments that are sent between two losses (i.e. during a cycle) is by definition equal to :math:`\frac{1}{p}`. Thus, :math:`W=\sqrt{\frac{8}{3 \times p}}=\frac{k}{\sqrt{p}}`. The throughput (in bytes per second) of the TCP connection is equal to the number of segments transmitted divided by the duration of the cycle :

- :math:`Throughput=\frac{area \times MSS}{time} = \frac{ \frac{3 \times W^2}{8}}{\frac{W}{2} \times rtt}`
+ :math:`Throughput=\frac{area \times MSS}{time} = \frac{ \frac{3 \times W^2}{8} \times MSS}{\frac{W}{2} \times rtt}`

or, after having eliminated `W`, :math:`Throughput=\sqrt{\frac{3}{2}} \times \frac{MSS}{rtt \times \sqrt{p}}`
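To make the corrected formula concrete, here is a quick numerical check; the values chosen for `MSS`, `rtt` and `p` are arbitrary illustrations:

.. code-block:: python

   # Numerical illustration of the TCP throughput formula above.
   from math import sqrt

   MSS = 1460   # bytes
   rtt = 0.1    # seconds (100 ms)
   p = 0.0001   # one loss every 10,000 segments

   throughput = sqrt(3 / 2) * MSS / (rtt * sqrt(p))
   print(round(throughput), "bytes per second")  # about 1.79 MB/s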