Independent Submission                                  C. Filsfils, Ed.
Request for Comments: 8604                           Cisco Systems, Inc.
Category: Informational                                       S. Previdi
ISSN: 2070-1721                                      Huawei Technologies
                                                           G. Dawra, Ed.
                                                                LinkedIn
                                                           W. Henderickx
                                                                   Nokia
                                                               D. Cooper
                                                             CenturyLink
                                                               June 2019

       Interconnecting Millions of Endpoints with Segment Routing

Abstract

   This document describes an application of Segment Routing to scale
   the network to support hundreds of thousands of network nodes, and
   tens of millions of physical underlay endpoints.  This use case can
   be applied to the interconnection of massive-scale Data Centers
   (DCs) and/or large aggregation networks.  Forwarding tables of
   midpoint and leaf nodes only require a few tens of thousands of
   entries.  This may be achieved by the inherently scalable nature of
   Segment Routing and the design proposed in this document.

Status of This Memo

   This document is not an Internet Standards Track specification; it is
   published for informational purposes.

   This is a contribution to the RFC Series, independently of any other
   RFC stream.  The RFC Editor has chosen to publish this document at
   its discretion and makes no statement about its value for
   implementation or deployment.  Documents approved for publication by
   the RFC Editor are not candidates for any level of Internet Standard;
   see Section 2 of RFC 7841.

   Information about the current status of this document, any errata,
   and how to provide feedback on it may be obtained at
   https://www.rfc-editor.org/info/rfc8604.

Copyright Notice

   Copyright (c) 2019 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   2
   2.  Terminology . . . . . . . . . . . . . . . . . . . . . . . . .   2
   3.  Reference Design  . . . . . . . . . . . . . . . . . . . . . .   3
   4.  Control Plane . . . . . . . . . . . . . . . . . . . . . . . .   4
   5.  Illustration of the Scale . . . . . . . . . . . . . . . . . .   5
   6.  Design Options  . . . . . . . . . . . . . . . . . . . . . . .   6
     6.1.  Segment Routing Global Block (SRGB) Size  . . . . . . . .   6
     6.2.  Redistribution of Routes for Agg Nodes  . . . . . . . . .   6
     6.3.  Sizing and Hierarchy  . . . . . . . . . . . . . . . . . .   6
     6.4.  Local Segments to Hosts/Servers . . . . . . . . . . . . .   7
     6.5.  Compressed SRTE Policies  . . . . . . . . . . . . . . . .   7
   7.  Deployment Model  . . . . . . . . . . . . . . . . . . . . . .   7
   8.  Benefits  . . . . . . . . . . . . . . . . . . . . . . . . . .   8
     8.1.  Simplified Operations . . . . . . . . . . . . . . . . . .   8
     8.2.  Inter-domain SLAs . . . . . . . . . . . . . . . . . . . .   8
     8.3.  Scale . . . . . . . . . . . . . . . . . . . . . . . . . .   8
     8.4.  ECMP  . . . . . . . . . . . . . . . . . . . . . . . . . .   8
   9.  IANA Considerations . . . . . . . . . . . . . . . . . . . . .   9
   10. Manageability Considerations  . . . . . . . . . . . . . . . .   9
   11. Security Considerations . . . . . . . . . . . . . . . . . . .   9
   12. Informative References  . . . . . . . . . . . . . . . . . . .   9
   Acknowledgements  . . . . . . . . . . . . . . . . . . . . . . . .  10
   Contributors  . . . . . . . . . . . . . . . . . . . . . . . . . .  10
   Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . .  10

1.  Introduction

   This document describes how Segment Routing (SR) can be used to
   interconnect millions of endpoints.

2.  Terminology

   The following terms and abbreviations are used in this document:

      Term          Definition
      ---------------------------------------------------------
      Agg           Aggregation
      BGP           Border Gateway Protocol
      DC            Data Center
      DCI           Data Center Interconnect
      ECMP          Equal-Cost Multipath
      FIB           Forwarding Information Base
      LDP           Label Distribution Protocol
      LFIB          Label Forwarding Information Base
      MPLS          Multiprotocol Label Switching
      PCE           Path Computation Element
      PCEP          Path Computation Element Communication Protocol
      PW            Pseudowire
      SLA           Service Level Agreement
      SR            Segment Routing
      SRTE Policy   Segment Routing Traffic Engineering Policy
      TE            Traffic Engineering
      TI-LFA        Topology Independent Loop-Free Alternate

3.  Reference Design

   The network diagram below illustrates the reference network topology
   used in this document:

           +-------+ +--------+ +--------+ +-------+ +-------+
           A       DCI1       Agg1       Agg3      DCI3      Z
           |  DC1  | |   M1   | |   C    | |   M2  | |  DC2  |
           |       DCI2       Agg2       Agg4      DCI4      |
           +-------+ +--------+ +--------+ +-------+ +-------+

                       Figure 1: Reference Topology

   The following apply to the reference topology above:

   o  Independent ISIS-OSPF/SR instance in core (C) region.

   o  Independent ISIS-OSPF/SR instance in Metro1 (M1) region.

   o  Independent ISIS-OSPF/SR instance in Metro2 (M2) region.

   o  BGP/SR in DC1.

   o  BGP/SR in DC2.

   o  Agg routes (Agg1, Agg2, Agg3, Agg4) are redistributed from C to M
      (M1 and M2) and from M to DC domains.

   o  No other route is advertised or redistributed between regions.

   o  The same homogeneous Segment Routing Global Block (SRGB) is used
      throughout the domains (e.g., 16000-23999).

   o  Unique SRGB sub-ranges are allocated to each metro (M) and core
      (C) domain:

      *  The 16000-16999 range is allocated to the core (C)
         domain/region.

      *  The 17000-17999 range is allocated to the M1 domain/region.

      *  The 18000-18999 range is allocated to the M2 domain/region.

      *  Specifically, the Agg1 router has Segment Identifier (SID)
         16001 allocated, and the Agg2 router has SID 16002 allocated.

      *  Specifically, the Agg3 router has SID 16003 allocated, and the
         anycast SID for Agg3 and Agg4 is 16006.

      *  Specifically, the DCI3 router has SID 18003 allocated, and the
         anycast SID for DCI3 and DCI4 is 18006.

      *  Specifically, at the Agg1 router, the binding SID 4001 leads
         to the DCI pair (DCI3, DCI4) via a specific low-latency path
         {16002, 16003, 18006}.

   o  The same SRGB sub-range is reused within each DC (DC1 and DC2)
      region (e.g., 20000-23999).  Specifically, nodes A and Z both
      have SID 20001 allocated to them.
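
   The sub-range allocation above can be summarized in a short,
   illustrative sketch.  The Python fragment below is not part of the
   design or of any router implementation; the names SUB_RANGES and
   region_of are invented for this example only.

   # Illustrative only: the SRGB sub-range plan of this section,
   # expressed as data, plus a lookup from a prefix SID to its region.
   SRGB = (16000, 23999)          # same homogeneous SRGB in all domains

   SUB_RANGES = {
       "Core (C)":  (16000, 16999),
       "Metro M1":  (17000, 17999),
       "Metro M2":  (18000, 18999),
       "DC (reused in DC1 and DC2)": (20000, 23999),
   }

   def region_of(sid):
       """Return the region whose SRGB sub-range contains this SID."""
       for region, (low, high) in SUB_RANGES.items():
           if low <= sid <= high:
               return region
       raise ValueError(f"SID {sid} is outside the SRGB {SRGB}")

   assert region_of(16001) == "Core (C)"      # Agg1
   assert region_of(18006) == "Metro M2"      # anycast SID, DCI3/DCI4
   assert region_of(20001).startswith("DC")   # node A and node Z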

4.  Control Plane

   This section provides a high-level description of how a control
   plane could be implemented using protocol components already defined
   in other RFCs.

   The mechanism through which SRTE Policies are defined, computed, and
   programmed in the source nodes is outside the scope of this
   document.

   Typically, a controller or a service orchestration system programs
   node A with a PW to a remote next-hop node Z with a given SLA
   contract (e.g., low-latency path, disjointness from a specific core
   plane, disjointness from a different PW service).

   Node A automatically detects that node Z is not reachable.  It then
   automatically sends a PCEP request to an SR PCE for an SRTE policy
   that provides reachability information for node Z with the requested
   SLA.

   The SR PCE [RFC4655] is made of two components: a multi-domain
   topology and a computation engine.  The multi-domain topology is
   continuously refreshed through BGP - Link State (BGP-LS) feeds
   [RFC7752] from each domain.  The computation engine is designed to
   implement TE algorithms and provide output in SR Path format.  Upon
   receiving the PCEP request [RFC5440], the SR PCE computes the
   requested path.  The path is expressed through a list of segments
   (e.g., {16003, 18006, 20001}) and provided to node A.

   The SR PCE logs the request as a stateful query and hence is able to
   recompute the path at each network topology change.

   Node A receives the PCEP reply with the path (expressed as a segment
   list).  Node A installs the received SRTE policy in the data plane.
   Node A then automatically steers the PW into that SRTE policy.
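
   The workflow above can be sketched schematically as follows.  The
   Python fragment below is purely illustrative: the classes SrPce and
   IngressNode and their methods are hypothetical and do not correspond
   to any real PCEP or router API; the PCEP exchange is modeled as a
   simple function call.

   # Illustrative only: hypothetical classes modeling the sequence of
   # events described above; this is not a real PCEP or router API.
   class SrPce:
       """Multi-domain topology plus computation engine."""
       def __init__(self, topology):
           self.topology = topology      # refreshed via BGP-LS feeds
           self.stateful_queries = []    # recomputed on topology change

       def compute_path(self, src, dst, sla):
           # A real PCE would run a TE algorithm over the topology;
           # here we return the example segment list from the text.
           self.stateful_queries.append((src, dst, sla))
           return [16003, 18006, 20001]

   class IngressNode:
       """Node A: requests, installs, and steers into SRTE policies."""
       def __init__(self, name, pce, local_routes):
           self.name, self.pce = name, pce
           self.local_routes = set(local_routes)
           self.policies = {}            # installed SRTE policies

       def program_pw(self, dst, sla):
           if dst not in self.local_routes:          # Z not reachable
               segments = self.pce.compute_path(self.name, dst, sla)
               self.policies[(dst, sla)] = segments  # install policy
           return self.policies[(dst, sla)]          # PW steered here

   pce = SrPce(topology="multi-domain view")
   node_a = IngressNode("A", pce, ["Agg1", "Agg2", "Agg3", "Agg4"])
   assert (node_a.program_pw("Z", "low-latency")
           == [16003, 18006, 20001])

   The sketch only captures the sequence of events (detect missing
   reachability, request a path with an SLA, install the returned
   segment list, steer the PW); all protocol details are omitted.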

5.  Illustration of the Scale

   According to the reference topology shown in Figure 1, the following
   assumptions are made:

   o  There is one core domain, and there are 100 leaf (metro) domains.

   o  The core domain includes 200 nodes.

   o  Two nodes connect each leaf (metro) domain.  Each node connecting
      a leaf domain has a SID allocated.  Each pair of nodes connecting
      a leaf domain also has a common anycast SID.  This yields up to
      300 prefix segments in total.

   o  A core node connects only one leaf domain.

   o  Each leaf domain has 6,000 leaf-node segments.  Each leaf node
      has 500 endpoints attached and thus 500 adjacency segments.  This
      yields a total of 3 million endpoints for a leaf domain.

   Based on the above, the network scaling numbers are as follows:

   o  6,000 leaf-node segments multiplied by 100 leaf domains:
      600,000 nodes.

   o  600,000 nodes multiplied by 500 endpoints: 300 million endpoints.

   The node scaling numbers are as follows:

   o  Leaf-node segment scale: 6,000 leaf-node segments + 300 core-node
      segments + 500 adjacency segments = 6,800 segments.

   o  Core-node segment scale: 6,000 leaf-domain segments + 300 core-
      domain segments = 6,300 segments.

   In the above calculations, the link-adjacency segments are not taken
   into account.  These are local segments and, typically, less than 100
   per node.

   It has to be noted that, depending on leaf-node FIB capabilities,
   leaf domains could be split into multiple smaller domains.  In the
   above example, the leaf domains could be split into six smaller
   domains so that each leaf node only needs to learn 1,000 leaf-node
   segments + 300 core-node segments + 500 adjacency segments, yielding
   a total of 1,800 segments.
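
   The arithmetic above can be restated as a short, illustrative
   calculation.  The constants below simply encode the assumptions of
   this section; the variable names are invented for this example.

   # Illustrative restatement of the scaling arithmetic above.
   LEAF_DOMAINS         = 100
   LEAF_NODES_PER_LEAF  = 6_000
   ENDPOINTS_PER_NODE   = 500      # also 500 adjacency segments/node
   CORE_NODE_SEGMENTS   = 300      # 200 node SIDs + 100 anycast SIDs

   # Network-wide scale
   leaf_nodes = LEAF_NODES_PER_LEAF * LEAF_DOMAINS        # 600,000
   endpoints  = leaf_nodes * ENDPOINTS_PER_NODE           # 300,000,000

   # Per-node segment scale (link-adjacency segments not counted)
   leaf_node_scale = (LEAF_NODES_PER_LEAF + CORE_NODE_SEGMENTS
                      + ENDPOINTS_PER_NODE)               # 6,800
   core_node_scale = LEAF_NODES_PER_LEAF + CORE_NODE_SEGMENTS  # 6,300

   # Splitting each leaf domain into six smaller domains
   split_scale = (LEAF_NODES_PER_LEAF // 6 + CORE_NODE_SEGMENTS
                  + ENDPOINTS_PER_NODE)                   # 1,800

   assert (leaf_nodes, endpoints) == (600_000, 300_000_000)
   assert (leaf_node_scale, core_node_scale) == (6_800, 6_300)
   assert split_scale == 1_800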

6.  Design Options

   This section describes multiple design options to illustrate scale
   as described in the previous section.

6.1.  Segment Routing Global Block (SRGB) Size

   In the simplified illustrations in this document, we picked a small
   homogeneous SRGB range of 16000-23999.  In practice, a large-scale
   design would use a bigger range, such as 16000-80000 or even larger.
   A larger range provides allocations for various TE applications
   within a given domain.

6.2.  Redistribution of Routes for Agg Nodes

   The operator might choose to not redistribute the routes for Agg
   nodes into the Metro/DC domains.  In that case, more segments are
   required in order to express an inter-domain path.

   For example, node A would use an SRTE Policy {DCI1, Agg1, Agg3,
   DCI3, Z} in order to reach Z instead of {Agg3, DCI3, Z} in the
   reference design.
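
   The difference is simply the length of the segment list, as the
   illustrative comparison below shows (node names only, no label
   values; the variable names are invented for this example).

   # Illustrative only: the same inter-domain path expressed with and
   # without redistribution of the Agg routes (node names, not labels).
   with_redistribution    = ["Agg3", "DCI3", "Z"]
   without_redistribution = ["DCI1", "Agg1", "Agg3", "DCI3", "Z"]

   assert len(without_redistribution) == len(with_redistribution) + 2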

6.3.  Sizing and Hierarchy

   The operator is free to choose among a small number of larger leaf
   domains, a large number of small leaf domains, or a mix of small and
   large core/leaf domains.

   The operator is free to use a two-tier (Core/Metro) or three-tier
   (Core/Metro/DC) design.

6.4.  Local Segments to Hosts/Servers

   Local segments can be programmed at any leaf node (e.g., node Z) in
   order to identify locally attached hosts (or Virtual Machines
   (VMs)).  For example, if node Z has bound a local segment 40001 to a
   local host ZH1, then node A uses the following SRTE Policy in order
   to reach that host: {16006, 18006, 20001, 40001}.  Such a local
   segment could represent the NID (Network Interface Device) in the
   context of the service provider access network or a VM in the
   context of the DC network.
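
   The following illustrative fragment shows how the local segment is
   simply appended to the inter-domain segment list; the helper
   policy_to_host and the table LOCAL_SEGMENTS_ON_Z are hypothetical
   and exist only for this example.

   # Illustrative only: appending a local segment bound by the leaf
   # node; policy_to_host and LOCAL_SEGMENTS_ON_Z are hypothetical.
   PATH_TO_Z = [16006, 18006, 20001]      # anycast Agg3/4, DCI3/4, Z
   LOCAL_SEGMENTS_ON_Z = {"ZH1": 40001}   # host/VM/NID bound at node Z

   def policy_to_host(path_to_leaf, host):
       """Segment list from node A to a host attached to node Z."""
       return path_to_leaf + [LOCAL_SEGMENTS_ON_Z[host]]

   assert (policy_to_host(PATH_TO_Z, "ZH1")
           == [16006, 18006, 20001, 40001])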

6.5.  Compressed SRTE Policies

   As an example and according to Section 3, we assume that node A can
   reach node Z (e.g., with a low-latency SLA contract) via the SRTE
   policy that consists of the path Agg1, Agg2, Agg3, DCI3/4 (anycast),
   Z.  The path is represented by the segment list {16001, 16002,
   16003, 18006, 20001}.

   It is clear that the control-plane solution can install an SRTE
   Policy {16002, 16003, 18006} at Agg1, collect the binding SID
   allocated by Agg1 to that policy (e.g., 4001), and hence program
   node A with the compressed SRTE Policy {16001, 4001, 20001}.

   From node A, 16001 leads to Agg1.  Once at Agg1, 4001 leads to the
   DCI pair (DCI3, DCI4) via a specific low-latency path {16002, 16003,
   18006}.  Once at that DCI pair, 20001 leads to Z.

   Binding SIDs allocated to "intermediate" SRTE Policies achieve the
   compression of end-to-end SRTE Policies.

   The segment list {16001, 4001, 20001} expresses the same path as
   {16001, 16002, 16003, 18006, 20001} but with two fewer segments.

   The binding SID also provides inherent churn protection.

   When the core topology changes, the control plane can update the
   low-latency SRTE Policy from Agg1 to the DCI pair to DC2 without
   updating the SRTE Policy from A to Z.
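
   The compression described in this section can be sketched as a
   simple list substitution.  The helper compress below is
   hypothetical; it only illustrates how the sub-path programmed at
   Agg1 is replaced by the binding SID that Agg1 allocated to it.

   # Illustrative only: replacing a contiguous sub-path with the
   # binding SID allocated to it by the intermediate node (Agg1).
   FULL_PATH = [16001, 16002, 16003, 18006, 20001]    # A -> Z

   def compress(segment_list, sub_path, binding_sid):
       """Return segment_list with sub_path replaced by binding_sid."""
       for i in range(len(segment_list) - len(sub_path) + 1):
           if segment_list[i:i + len(sub_path)] == sub_path:
               return (segment_list[:i] + [binding_sid]
                       + segment_list[i + len(sub_path):])
       return segment_list

   # Agg1 installs SRTE Policy {16002, 16003, 18006} and allocates
   # binding SID 4001 to it; node A is programmed with the result.
   assert (compress(FULL_PATH, [16002, 16003, 18006], 4001)
           == [16001, 4001, 20001])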

7.  Deployment Model

   It is expected that this design will be used in "green field"
   deployments as well as in interworking ("brown field") deployments
   with an MPLS design across multiple domains.

8.  Benefits

   The design options illustrated in this document allow
   interconnections on a very large scale.  Millions of endpoints
   across different domains can be interconnected.

8.1.  Simplified Operations

   Two control-plane protocols not needed in this design are LDP and
   RSVP-TE.  No new protocol has been introduced.  The design leverages
   the core IP protocols ISIS, OSPF, BGP, and PCEP with straightforward
   SR extensions.

8.2.  Inter-domain SLAs

   Fast reroute and resiliency are provided by TI-LFA with sub-50-ms
   fast reroute upon failure of a link, node, or Shared Risk Link Group
   (SRLG).  TI-LFA is described in [SR-TI-LFA].

   The use of anycast SIDs also provides improved availability and
   resiliency.

   Inter-domain SLAs can be delivered (e.g., latency vs. cost-optimized
   paths, disjointness from backbone planes, disjointness from other
   services, disjointness between primary and backup paths).

   Existing inter-domain solutions do not provide any support for SLA
   contracts.  They just provide best-effort reachability across
   domains.

8.3.  Scale

   In addition to having eliminated the need for LDP and RSVP-TE, per-
   service midpoint states have also been removed from the network.

8.4.  ECMP

   Each policy (intra-domain or inter-domain, with or without TE) is
   expressed as a list of segments.  Since each segment is optimized
   for ECMP, the entire policy is optimized for ECMP.  The ECMP benefit
   of an anycast prefix segment should also be considered (e.g., 16001
   load-shares across any gateway from the M1 leaf domain to the Core,
   and 16002 load-shares across any gateway from the Core to the M1
   leaf domain).

9.  IANA Considerations

   This document has no IANA actions.

10.  Manageability Considerations

   This document describes an application of SR over the MPLS data
   plane.  SR does not introduce any changes in the MPLS data plane.
   The manageability considerations described in [RFC8402] apply to the
   MPLS data plane when used with SR.

11.  Security Considerations

   This document does not introduce additional security requirements and
   mechanisms other than those described in [RFC8402].

12.  Informative References

   [RFC4655]  Farrel, A., Vasseur, J., and J. Ash, "A Path Computation
              Element (PCE)-Based Architecture", RFC 4655,
              DOI 10.17487/RFC4655, August 2006,
              <https://www.rfc-editor.org/info/rfc4655>.

   [RFC5440]  Vasseur, JP., Ed. and JL. Le Roux, Ed., "Path Computation
              Element (PCE) Communication Protocol (PCEP)", RFC 5440,
              DOI 10.17487/RFC5440, March 2009,
              <https://www.rfc-editor.org/info/rfc5440>.

   [RFC7752]  Gredler, H., Ed., Medved, J., Previdi, S., Farrel, A., and
              S. Ray, "North-Bound Distribution of Link-State and
              Traffic Engineering (TE) Information Using BGP", RFC 7752,
              DOI 10.17487/RFC7752, March 2016,
              <https://www.rfc-editor.org/info/rfc7752>.

   [RFC8402]  Filsfils, C., Ed., Previdi, S., Ed., Ginsberg, L.,
              Decraene, B., Litkowski, S., and R. Shakir, "Segment
              Routing Architecture", RFC 8402, DOI 10.17487/RFC8402,
              July 2018, <https://www.rfc-editor.org/info/rfc8402>.

   [SR-TI-LFA]
              Litkowski, S., Bashandy, A., Filsfils, C., Decraene, B.,
              Francois, P., Voyer, D., Clad, F., and P. Camarillo,
              "Topology Independent Fast Reroute using Segment Routing",
              Work in Progress, draft-ietf-rtgwg-segment-routing-ti-lfa-
              01, March 2019.

Acknowledgements

   We would like to thank Giles Heron, Alexander Preusche, Steve
   Braaten, and Francis Ferguson for their contributions to the content
   of this document.

Contributors

   The following people substantially contributed to the editing of this
   document:

   Dennis Cai
   Individual

   Tim Laberge
   Individual

   Steven Lin
   Google Inc.

   Bruno Decraene
   Orange

   Luay Jalil
   Verizon

   Jeff Tantsura
   Individual

   Rob Shakir
   Google Inc.

Authors' Addresses

   Clarence Filsfils (editor)
   Cisco Systems, Inc.
   Brussels
   Belgium

   Email: cfilsfil@cisco.com

   Stefano Previdi
   Huawei Technologies

   Email: stefano@previdi.net

   Gaurav Dawra (editor)
   LinkedIn
   United States of America

   Email: gdawra.ietf@gmail.com

   Wim Henderickx
   Nokia
   Copernicuslaan 50
   Antwerp  2018
   Belgium

   Email: wim.henderickx@nokia.com

   Dave Cooper
   CenturyLink

   Email: Dave.Cooper@centurylink.com