Independent Submission                                  C. Filsfils, Ed.
Request for Comments: 8604                           Cisco Systems, Inc.
Category: Informational                                       S. Previdi
ISSN: 2070-1721                                      Huawei Technologies
                                                           G. Dawra, Ed.
                                                                LinkedIn
                                                           W. Henderickx
                                                                   Nokia
                                                               D. Cooper
                                                             CenturyLink
                                                               June 2019

       Interconnecting Millions of Endpoints with Segment Routing

Abstract

   This document describes an application of Segment Routing to scale
   the network to support hundreds of thousands of network nodes and
   tens of millions of physical underlay endpoints.  This use case can
   be applied to the interconnection of massive-scale Data Centers (DCs)
   and/or large aggregation networks.  Forwarding tables of midpoint and
   leaf nodes only require a few tens of thousands of entries.  This may
   be achieved by the inherently scalable nature of Segment Routing and
   the design proposed in this document.

Status of This Memo

   This document is not an Internet Standards Track specification; it is
   published for informational purposes.

   This is a contribution to the RFC Series, independently of any other
   RFC stream.  The RFC Editor has chosen to publish this document at
   its discretion and makes no statement about its value for
   implementation or deployment.  Documents approved for publication by
   the RFC Editor are not candidates for any level of Internet Standard;
   see Section 2 of RFC 7841.

   Information about the current status of this document, any errata,
   and how to provide feedback on it may be obtained at
   https://www.rfc-editor.org/info/rfc8604.

Copyright Notice

   Copyright (c) 2019 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.

Table of Contents

   1.  Introduction
   2.  Terminology
   3.  Reference Design
   4.  Control Plane
   5.  Illustration of the Scale
   6.  Design Options
     6.1.  Segment Routing Global Block (SRGB) Size
     6.2.  Redistribution of Routes for Agg Nodes
     6.3.  Sizing and Hierarchy
     6.4.  Local Segments to Hosts/Servers
     6.5.  Compressed SRTE Policies
   7.  Deployment Model
   8.  Benefits
     8.1.  Simplified Operations
     8.2.  Inter-domain SLAs
     8.3.  Scale
     8.4.  ECMP
   9.  IANA Considerations
   10. Manageability Considerations
   11. Security Considerations
   12. Informative References
   Acknowledgements
   Contributors
   Authors' Addresses

1.  Introduction

   This document describes how Segment Routing (SR) can be used to
   interconnect millions of endpoints.

2.  Terminology

   The following terms and abbreviations are used in this document:

   Term          Definition
   ---------------------------------------------------------
   Agg           Aggregation
   BGP           Border Gateway Protocol
   DC            Data Center
   DCI           Data Center Interconnect
   ECMP          Equal-Cost Multipath
   FIB           Forwarding Information Base
   LDP           Label Distribution Protocol
   LFIB          Label Forwarding Information Base
   MPLS          Multiprotocol Label Switching
   PCE           Path Computation Element
   PCEP          Path Computation Element Communication Protocol
   PW            Pseudowire
   SLA           Service Level Agreement
   SR            Segment Routing
   SRTE Policy   Segment Routing Traffic Engineering Policy
   TE            Traffic Engineering
   TI-LFA        Topology Independent Loop-Free Alternate

3.  Reference Design

   The network diagram below illustrates the reference network topology
   used in this document:

   +-------+ +--------+ +--------+ +-------+ +-------+
   A       DCI1       Agg1       Agg3      DCI3      Z
   |  DC1  | |   M1   | |   C    | |  M2   | |  DC2  |
   |       DCI2       Agg2       Agg4      DCI4      |
   +-------+ +--------+ +--------+ +-------+ +-------+

                  Figure 1: Reference Topology

   The following apply to the reference topology above:

   o  Independent ISIS-OSPF/SR instance in core (C) region.

   o  Independent ISIS-OSPF/SR instance in Metro1 (M1) region.

   o  Independent ISIS-OSPF/SR instance in Metro2 (M2) region.

   o  BGP/SR in DC1.

   o  BGP/SR in DC2.

   o  Agg routes (Agg1, Agg2, Agg3, Agg4) are redistributed from C to M
      (M1 and M2) and from M to DC domains.

   o  No other route is advertised or redistributed between regions.

   o  The same homogeneous Segment Routing Global Block (SRGB) is used
      throughout the domains (e.g., 16000-23999).

   o  Unique SRGB sub-ranges are allocated to each metro (M) and core
      (C) domain:

      *  The 16000-16999 range is allocated to the core (C)
         domain/region.

      *  The 17000-17999 range is allocated to the M1 domain/region.

      *  The 18000-18999 range is allocated to the M2 domain/region.

      *  Specifically, the Agg1 router has Segment Identifier (SID)
         16001 allocated, and the Agg2 router has SID 16002 allocated.

      *  Specifically, the Agg3 router has SID 16003 allocated, and the
         anycast SID for Agg3 and Agg4 is 16006.

      *  Specifically, the DCI3 router has SID 18003 allocated, and the
         anycast SID for DCI3 and DCI4 is 18006.

      *  Specifically, at the Agg1 router, the binding SID 4001 leads to
         DCI pair (DCI3, DCI4) via a specific low-latency path {16002,
         16003, 18006}.

   o  The same SRGB sub-range (e.g., 20000-23999) is reused within each
      DC (DC1 and DC2) region.  Specifically, nodes A and Z both have
      SID 20001 allocated to them.

4.  Control Plane

   This section provides a high-level description of how a control plane
   could be implemented using protocol components already defined in
   other RFCs.

   The mechanism through which SRTE Policies are defined, computed, and
   programmed in the source nodes is outside the scope of this document.

   Typically, a controller or a service orchestration system programs
   node A with a PW to a remote next-hop node Z with a given SLA
   contract (e.g., low-latency path, disjointness from a specific core
   plane, disjointness from a different PW service).

   Node A automatically detects that node Z is not reachable.  It then
   automatically sends a PCEP request to an SR PCE for an SRTE policy
   that provides reachability information for node Z with the requested
   SLA.

   The SR PCE [RFC4655] is made of two components: a multi-domain
   topology and a computation engine.  The multi-domain topology is
   continuously refreshed through BGP - Link State (BGP-LS) feeds
   [RFC7752] from each domain.  The computation engine is designed to
   implement TE algorithms and provide output in SR Path format.  Upon
   receiving the PCEP request [RFC5440], the SR PCE computes the
   requested path.  The path is expressed through a list of segments
   (e.g., {16003, 18006, 20001}) and provided to node A.

   The SR PCE logs the request as a stateful query and hence is able to
   recompute the path at each network topology change.

   Node A receives the PCEP reply with the path (expressed as a segment
   list).  Node A installs the received SRTE policy in the data plane.
   Node A then automatically steers the PW into that SRTE policy.

5.  Illustration of the Scale

   According to the reference topology shown in Figure 1, the following
   assumptions are made:

   o  There is one core domain, and there are 100 leaf (metro) domains.

   o  The core domain includes 200 nodes.

   o  Two nodes connect each leaf (metro) domain.  Each node connecting
      a leaf domain has a SID allocated.  Each pair of nodes connecting
      a leaf domain also has a common anycast SID.  This yields up to
      300 prefix segments in total.

   o  A core node connects only one leaf domain.

   o  Each leaf domain has 6,000 leaf-node segments.  Each leaf node has
      500 endpoints attached and thus 500 adjacency segments.  This
      yields a total of 3 million endpoints for a leaf domain.

   Based on the above, the network scaling numbers are as follows:

   o  6,000 leaf-node segments multiplied by 100 leaf domains: 600,000
      nodes.

   o  600,000 nodes multiplied by 500 endpoints: 300 million endpoints.

   The node scaling numbers are as follows:

   o  Leaf-node segment scale: 6,000 leaf-node segments + 300 core-node
      segments + 500 adjacency segments = 6,800 segments.

   o  Core-node segment scale: 6,000 leaf-domain segments + 300 core-
      domain segments = 6,300 segments.

   In the above calculations, the link-adjacency segments are not taken
   into account.  These are local segments and, typically, fewer than
   100 per node.

   It has to be noted that, depending on leaf-node FIB capabilities,
   leaf domains could be split into multiple smaller domains.  In the
   above example, the leaf domains could be split into six smaller
   domains so that each leaf node only needs to learn 1,000 leaf-node
   segments + 300 core-node segments + 500 adjacency segments, yielding
   a total of 1,800 segments.

6.  Design Options

   This section describes multiple design options to illustrate scale as
   described in the previous section.

6.1.  Segment Routing Global Block (SRGB) Size

   In the simplified illustrations in this document, we picked a small
   homogeneous SRGB range of 16000-23999.  In practice, a large-scale
   design would use a bigger range, such as 16000-80000 or even larger.
   A larger range provides allocations for various TE applications
   within a given domain.

6.2.  Redistribution of Routes for Agg Nodes

   The operator might choose to not redistribute the routes for Agg
   nodes into the Metro/DC domains.  In that case, more segments are
   required in order to express an inter-domain path.

   For example, node A would use an SRTE Policy {DCI1, Agg1, Agg3, DCI3,
   Z} in order to reach Z instead of {Agg3, DCI3, Z} in the reference
   design.

6.3.  Sizing and Hierarchy

   The operator is free to choose among a small number of larger leaf
   domains, a large number of small leaf domains, or a mix of small and
   large core/leaf domains.

   The operator is free to use a two-tier (Core/Metro) or three-tier
   (Core/Metro/DC) design.

6.4.  Local Segments to Hosts/Servers

   Local segments can be programmed at any leaf node (e.g., node Z) in
   order to identify locally attached hosts (or Virtual Machines (VMs)).
   For example, if node Z has bound a local segment 40001 to a local
   host ZH1, then node A uses the following SRTE Policy in order to
   reach that host: {16006, 18006, 20001, 40001}.  Such a local segment
   could represent the NID (Network Interface Device) in the context of
   the service provider access network, or a VM in the context of the DC
   network.

6.5.  Compressed SRTE Policies

   As an example and according to Section 3, we assume that node A can
   reach node Z (e.g., with a low-latency SLA contract) via the SRTE
   policy that consists of the path Agg1, Agg2, Agg3, DCI3/4 (anycast),
   Z.  The path is represented by the segment list {16001, 16002, 16003,
   18006, 20001}.

   It is clear that the control-plane solution can install an SRTE
   Policy {16002, 16003, 18006} at Agg1, collect the binding SID
   allocated by Agg1 to that policy (e.g., 4001), and hence program node
   A with the compressed SRTE Policy {16001, 4001, 20001}.

   From node A, 16001 leads to Agg1.  Once at Agg1, 4001 leads to the
   DCI pair (DCI3, DCI4) via a specific low-latency path {16002, 16003,
   18006}.  Once at that DCI pair, 20001 leads to Z.

   Binding SIDs allocated to "intermediate" SRTE Policies achieve the
   compression of end-to-end SRTE Policies.

   The segment list {16001, 4001, 20001} expresses the same path as
   {16001, 16002, 16003, 18006, 20001} but with two fewer segments.

   The binding SID also provides inherent churn protection.

   When the core topology changes, the control plane can update the low-
   latency SRTE Policy from Agg1 to the DCI pair to DC2 without updating
   the SRTE Policy from A to Z.

7.  Deployment Model

   It is expected that this design will be used in "green field"
   deployments as well as interworking ("brown field") deployments with
   an MPLS design across multiple domains.

8.  Benefits

   The design options illustrated in this document allow
   interconnections on a very large scale.  Millions of endpoints across
   different domains can be interconnected.

8.1.  Simplified Operations

   Two control-plane protocols not needed in this design are LDP and
   RSVP-TE.  No new protocol has been introduced.  The design leverages
   the core IP protocols ISIS, OSPF, BGP, and PCEP with straightforward
   SR extensions.
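   As an illustration of the binding-SID compression described in
   Section 6.5, the following minimal Python sketch expands the
   compressed segment list {16001, 4001, 20001} back into the full list
   it stands for.  The SID values come from the reference design in
   Section 3; the table names (BINDING_SIDS, SID_TO_NODE) and the
   expand() function are hypothetical and exist only for this example.
   The sketch models the bookkeeping only, not any real data-plane or
   PCEP behavior.

```python
# Hypothetical model of binding-SID expansion; all names are illustrative.

# Binding SIDs are local to the node that allocated them:
# (node, binding SID) -> segment list of the intermediate SRTE Policy.
BINDING_SIDS = {
    ("Agg1", 4001): [16002, 16003, 18006],
}

# Node reached by a prefix SID, limited to the one mapping this example needs.
SID_TO_NODE = {16001: "Agg1"}


def expand(segment_list):
    """Return the segment list with binding SIDs replaced by their policies."""
    expanded = []
    current_node = None  # node at which the next segment is processed, if known
    for sid in segment_list:
        policy = BINDING_SIDS.get((current_node, sid))
        if policy is not None:
            # The binding SID is replaced by the intermediate policy's segments.
            expanded.extend(policy)
        else:
            expanded.append(sid)
        # Track where the packet is now; unknown SIDs leave the node unknown.
        current_node = SID_TO_NODE.get(sid)
    return expanded


compressed = [16001, 4001, 20001]
full = expand(compressed)
saved = len(full) - len(compressed)  # segments saved at node A
```

   Because the lookup is keyed on the node as well as the SID, 4001 is
   only expanded when it is processed at Agg1.  This mirrors the churn
   protection noted in Section 6.5: updating the entry held for Agg1
   changes the expanded path without touching the list programmed at
   node A.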
8.2.  Inter-domain SLAs

   Fast reroute and resiliency are provided by TI-LFA with sub-50-ms
   fast reroute upon failure of a link, node, or Shared Risk Link Group
   (SRLG).  TI-LFA is described in [SR-TI-LFA].

   The use of anycast SIDs also provides improved availability and
   resiliency.

   Inter-domain SLAs can be delivered (e.g., latency vs. cost-optimized
   paths, disjointness from backbone planes, disjointness from other
   services, disjointness between primary and backup paths).

   Existing inter-domain solutions do not provide any support for SLA
   contracts.  They just provide best-effort reachability across
   domains.

8.3.  Scale

   In addition to having eliminated the need for LDP and RSVP-TE, per-
   service midpoint states have also been removed from the network.

8.4.  ECMP

   Each policy (intra-domain or inter-domain, with or without TE) is
   expressed as a list of segments.  Since each segment is optimized for
   ECMP, the entire policy is optimized for ECMP.  The benefit of an
   anycast prefix segment optimized for ECMP should also be considered
   (e.g., 16001 load-shares across any gateway from the M1 leaf domain
   to the Core and 16002 load-shares across any gateway from the Core to
   the M1 leaf domain).

9.  IANA Considerations

   This document has no IANA actions.

10.  Manageability Considerations

   This document describes an application of SR over the MPLS data
   plane.  SR does not introduce any changes in the MPLS data plane.
   The manageability considerations described in [RFC8402] apply to the
   MPLS data plane when used with SR.

11.  Security Considerations

   This document does not introduce additional security requirements and
   mechanisms other than those described in [RFC8402].

12.  Informative References

   [RFC4655]  Farrel, A., Vasseur, J., and J. Ash, "A Path Computation
              Element (PCE)-Based Architecture", RFC 4655,
              DOI 10.17487/RFC4655, August 2006,
              <https://www.rfc-editor.org/info/rfc4655>.

   [RFC5440]  Vasseur, JP., Ed. and JL. Le Roux, Ed., "Path Computation
              Element (PCE) Communication Protocol (PCEP)", RFC 5440,
              DOI 10.17487/RFC5440, March 2009,
              <https://www.rfc-editor.org/info/rfc5440>.

   [RFC7752]  Gredler, H., Ed., Medved, J., Previdi, S., Farrel, A., and
              S. Ray, "North-Bound Distribution of Link-State and
              Traffic Engineering (TE) Information Using BGP", RFC 7752,
              DOI 10.17487/RFC7752, March 2016,
              <https://www.rfc-editor.org/info/rfc7752>.

   [RFC8402]  Filsfils, C., Ed., Previdi, S., Ed., Ginsberg, L.,
              Decraene, B., Litkowski, S., and R. Shakir, "Segment
              Routing Architecture", RFC 8402, DOI 10.17487/RFC8402,
              July 2018, <https://www.rfc-editor.org/info/rfc8402>.

   [SR-TI-LFA]
              Litkowski, S., Bashandy, A., Filsfils, C., Decraene, B.,
              Francois, P., Voyer, D., Clad, F., and P. Camarillo,
              "Topology Independent Fast Reroute using Segment Routing",
              Work in Progress, draft-ietf-rtgwg-segment-routing-ti-
              lfa-01, March 2019.

Acknowledgements

   We would like to thank Giles Heron, Alexander Preusche, Steve
   Braaten, and Francis Ferguson for their contributions to the content
   of this document.

Contributors

   The following people substantially contributed to the editing of this
   document:

   Dennis Cai
   Individual

   Tim Laberge
   Individual

   Steven Lin
   Google Inc.

   Bruno Decraene
   Orange

   Luay Jalil
   Verizon

   Jeff Tantsura
   Individual

   Rob Shakir
   Google Inc.

Authors' Addresses

   Clarence Filsfils (editor)
   Cisco Systems, Inc.
   Brussels
   Belgium

   Email: cfilsfil@cisco.com


   Stefano Previdi
   Huawei Technologies

   Email: stefano@previdi.net


   Gaurav Dawra (editor)
   LinkedIn
   United States of America

   Email: gdawra.ietf@gmail.com


   Wim Henderickx
   Nokia
   Copernicuslaan 50
   Antwerp  2018
   Belgium

   Email: wim.henderickx@nokia.com


   Dave Cooper
   CenturyLink

   Email: Dave.Cooper@centurylink.com