Independent Submission                                      P. Garg, Ed.
Request for Comments: 7637                                  Y. Wang, Ed.
Category: Informational                                        Microsoft
ISSN: 2070-1721                                           September 2015

NVGRE: Network Virtualization Using Generic Routing Encapsulation

Abstract

This document describes the usage of the Generic Routing Encapsulation (GRE) header for Network Virtualization (NVGRE) in multi-tenant data centers. Network Virtualization decouples virtual networks and addresses from physical network infrastructure, providing isolation and concurrency between multiple virtual networks on the same physical network infrastructure. This document also introduces a Network Virtualization framework to illustrate the use cases, but the focus is on specifying the data-plane aspect of NVGRE.

Status of This Memo

This document is not an Internet Standards Track specification; it is published for informational purposes. This is a contribution to the RFC Series, independently of any other RFC stream. The RFC Editor has chosen to publish this document at its discretion and makes no statement about its value for implementation or deployment. Documents approved for publication by the RFC Editor are not a candidate for any level of Internet Standard; see Section 2 of RFC 5741. Information about the current status of this document, any errata, and how to provide feedback on it may be obtained at http://www.rfc-editor.org/info/rfc7637.

Copyright Notice

Copyright (c) 2015 IETF Trust and the persons identified as the document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document.

Table of Contents

1. Introduction ....................................................2
   1.1. Terminology ................................................4
2. Conventions Used in This Document ...............................4
3. Network Virtualization Using GRE (NVGRE) ........................4
   3.1. NVGRE Endpoint .............................................5
   3.2. NVGRE Frame Format .........................................5
   3.3. Inner Tag as Defined by IEEE 802.1Q ........................8
   3.4. Reserved VSID ..............................................8
4.
NVGRE Deployment Considerations .................................9
   4.1. ECMP Support ...............................................9
   4.2. Broadcast and Multicast Traffic ............................9
   4.3. Unicast Traffic ............................................9
   4.4. IP Fragmentation ..........................................10
   4.5. Address/Policy Management and Routing .....................10
   4.6. Cross-Subnet, Cross-Premise Communication .................10
   4.7. Internet Connectivity .....................................12
   4.8. Management and Control Planes .............................12
   4.9. NVGRE-Aware Devices .......................................12
   4.10. Network Scalability with NVGRE ...........................13
5. Security Considerations ........................................14
6. Normative References ...........................................14
7. Informative References .........................................15
Contributors ......................................................16
Authors' Addresses ................................................17

1.
Introduction

Conventional data center network designs cater to largely static workloads and cause fragmentation of network and server capacity [6] [7]. There are several issues that limit dynamic allocation and consolidation of capacity. Layer 2 networks use the Rapid Spanning Tree Protocol (RSTP), which is designed to eliminate loops by blocking redundant paths. These eliminated paths translate to wasted capacity and a highly oversubscribed network. There are alternative approaches such as the Transparent Interconnection of Lots of Links (TRILL) that address this problem [13].

The network utilization inefficiencies are exacerbated by network fragmentation due to the use of VLANs for broadcast isolation. VLANs are used for traffic management and also as the mechanism for providing security and performance isolation among services belonging to different tenants. The Layer 2 network is carved into smaller-sized subnets (typically, one subnet per VLAN), with VLAN tags configured on all the Layer 2 switches connected to server racks that host a given tenant's services. The current VLAN limits theoretically allow for 4,000 such subnets to be created. The size of the broadcast domain is typically restricted due to the overhead of broadcast traffic. The 4,000-subnet limit on VLANs is no longer sufficient in a shared infrastructure servicing multiple tenants.

Data center operators must be able to achieve high utilization of server and network capacity. In order to achieve efficiency, it should be possible to assign workloads that operate in a single Layer 2 network to any server in any rack in the network. It should also be possible to migrate workloads to any server anywhere in the network while retaining the workloads' addresses. This can be achieved today by stretching VLANs; however, when workloads migrate, the network needs to be reconfigured, and that is typically error prone.
By decoupling the workload's location on the LAN from its network address, the network administrator configures the network once, not every time a service migrates. This decoupling enables any server to become part of any server resource pool. The following are key design objectives for next-generation data centers:

a) location-independent addressing

b) the ability to scale the number of logical Layer 2 / Layer 3 networks, irrespective of the underlying physical topology or the number of VLANs

c) preserving Layer 2 semantics for services and allowing them to retain their addresses as they move within and across data centers

d) providing broadcast isolation as workloads move around without burdening the network control plane

This document describes use of the Generic Routing Encapsulation (GRE) header [3] [4] for network virtualization. Network virtualization decouples a virtual network from the underlying physical network infrastructure by virtualizing network addresses. Combined with a management and control plane for the virtual-to-physical mapping, network virtualization can enable flexible virtual machine placement and movement and provide network isolation for a multi-tenant data center.

Network virtualization enables customers to bring their own address spaces into a multi-tenant data center, while the data center administrators can place the customer virtual machines anywhere in the data center without reconfiguring their network switches or routers, irrespective of the customer address spaces.

1.1. Terminology

Please refer to RFCs 7364 [10] and 7365 [11] for more formal definitions of terminology. The following terms are used in this document.
Customer Address (CA): This is the virtual IP address assigned and configured on the virtual Network Interface Controller (NIC) within each VM. This is the only address visible to VMs and applications running within VMs.

Network Virtualization Edge (NVE): This is an entity that performs the network virtualization encapsulation and decapsulation.

Provider Address (PA): This is the IP address used in the physical network. PAs are associated with VM CAs through the network virtualization mapping policy.

Virtual Machine (VM): This is an instance of an OS running on top of the hypervisor over a physical machine or server. Multiple VMs can share the same physical server via the hypervisor, yet are completely isolated from each other in terms of CPU usage, storage, and other OS resources.

Virtual Subnet Identifier (VSID): This is a 24-bit ID that uniquely identifies a virtual subnet or virtual Layer 2 broadcast domain.

2. Conventions Used in This Document

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [1]. In this document, these words will appear with that interpretation only when in ALL CAPS. Lowercase uses of these words are not to be interpreted as carrying the significance defined in RFC 2119.

3. Network Virtualization Using GRE (NVGRE)

This section describes Network Virtualization using GRE (NVGRE). Network virtualization involves creating virtual Layer 2 topologies on top of a physical Layer 3 network. Connectivity in the virtual topology is provided by tunneling Ethernet frames in GRE over IP over the physical network. In NVGRE, every virtual Layer 2 network is associated with a 24-bit identifier, called a Virtual Subnet Identifier (VSID).
A VSID is carried in an outer header as defined in Section 3.2. This allows unique identification of a tenant's virtual subnet to various devices in the network. A 24-bit VSID supports up to 16 million virtual subnets in the same management domain, in contrast to only 4,000 that is achievable with VLANs. Each VSID represents a virtual Layer 2 broadcast domain, which can be used to identify a virtual subnet of a given tenant. To support multi-subnet virtual topology, data center administrators can configure routes to facilitate communication between virtual subnets of the same tenant.

GRE is a Proposed Standard from the IETF [3] [4] and provides a way for encapsulating an arbitrary protocol over IP. NVGRE leverages the GRE header to carry VSID information in each packet. The VSID information in each packet can be used to build multi-tenant-aware tools for traffic analysis, traffic inspection, and monitoring.

The following sections detail the packet format for NVGRE; describe the functions of an NVGRE endpoint; illustrate typical traffic flow both within and across data centers; and discuss address/policy management and deployment considerations.

3.1. NVGRE Endpoint

NVGRE endpoints are the ingress/egress points between the virtual and the physical networks. The NVGRE endpoints are the NVEs as defined in the Network Virtualization over Layer 3 (NVO3) Framework document [11]. Any physical server or network device can be an NVGRE endpoint. One common deployment is for the endpoint to be part of a hypervisor. The primary function of this endpoint is to encapsulate/decapsulate Ethernet data frames to and from the GRE tunnel, ensure Layer 2 semantics, and apply isolation policy scoped on VSID. The endpoint can optionally participate in routing and function as a gateway in the virtual topology.
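The identifier-space comparison above (16 million virtual subnets versus roughly 4,000 VLANs) is simple arithmetic that can be checked directly; this short sketch is illustrative and not part of the specification:

```python
# A 24-bit VSID versus a 12-bit IEEE 802.1Q VLAN ID: the two
# subnet-identifier spaces compared in Section 3. Reserved values
# are ignored in this rough count.
vsid_space = 2 ** 24   # NVGRE Virtual Subnet Identifier
vlan_space = 2 ** 12   # 802.1Q VLAN ID

print(vsid_space)      # 16777216, i.e. "up to 16 million virtual subnets"
print(vlan_space)      # 4096, the ~4,000-VLAN limit
```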
To encapsulate an Ethernet frame, the endpoint needs to know the location information for the destination address in the frame. This information can be provisioned via a management plane or obtained via a combination of control-plane distribution or data-plane learning approaches. This document assumes that the location information, including VSID, is available to the NVGRE endpoint.

3.2. NVGRE Frame Format

The GRE header format as specified in RFCs 2784 [3] and 2890 [4] is used for communication between NVGRE endpoints. NVGRE leverages the Key extension specified in RFC 2890 [4] to carry the VSID. The packet format for Layer 2 encapsulation in GRE is shown in Figure 1.

Outer Ethernet Header:
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|         (Outer) Destination MAC Address                       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|(Outer)Destination MAC Address | (Outer)Source MAC Address     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                (Outer) Source MAC Address                     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Optional Ethertype=C-Tag 802.1Q|  Outer VLAN Tag Information   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|         Ethertype 0x0800      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Outer IPv4 Header:
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Version|  HL   |Type of Service|          Total Length         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|         Identification        |Flags|      Fragment Offset    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  Time to Live | Protocol 0x2F |         Header Checksum       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    (Outer) Source Address                     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                 (Outer) Destination Address                   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

GRE Header:
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|0| |1|0|    Reserved0    | Ver |      Protocol Type 0x6558     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|            Virtual Subnet ID (VSID)           |    FlowID     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Inner Ethernet Header:
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|         (Inner) Destination MAC Address                       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|(Inner)Destination MAC Address | (Inner)Source MAC Address     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                (Inner) Source MAC Address                     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|         Ethertype 0x0800      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Inner IPv4 Header:
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Version|  HL   |Type of Service|          Total Length         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|         Identification        |Flags|      Fragment Offset    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  Time to Live |   Protocol    |         Header Checksum       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        Source Address                         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                      Destination Address                      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|            Options            |            Padding            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                      Original IP Payload                      |
|                                                               |
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 1: GRE Encapsulation Frame Format

Note: HL stands for Header Length.

The outer/delivery headers include the outer Ethernet header and the outer IP header:

o The outer Ethernet header: The source Ethernet address in the outer frame is set to the MAC address associated with the NVGRE endpoint.
The destination endpoint may or may not be on the same physical subnet. The destination Ethernet address is set to the MAC address of the next-hop IP address for the destination NVE. The outer VLAN tag information is optional and can be used for traffic management and broadcast scalability on the physical network.

o The outer IP header: Both IPv4 and IPv6 can be used as the delivery protocol for GRE. The IPv4 header is shown for illustrative purposes. Henceforth, the IP address in the outer frame is referred to as the Provider Address (PA). There can be one or more PAs associated with an NVGRE endpoint, with policy controlling the choice of which PA to use for a given Customer Address (CA) for a customer VM.

In the GRE header:

o The C (Checksum Present) and S (Sequence Number Present) bits in the GRE header MUST be zero.

o The K (Key Present) bit in the GRE header MUST be set to one. The 32-bit Key field in the GRE header is used to carry the Virtual Subnet ID (VSID) and the FlowID:

   - Virtual Subnet ID (VSID): This is a 24-bit value that is used to identify the NVGRE-based Virtual Layer 2 Network.

   - FlowID: This is an 8-bit value that is used to provide per-flow entropy for flows in the same VSID. The FlowID MUST NOT be modified by transit devices. The encapsulating NVE SHOULD provide as much entropy as possible in the FlowID. If a FlowID is not generated, it MUST be set to all zeros.

o The Protocol Type field in the GRE header is set to 0x6558 (Transparent Ethernet Bridging) [2].

In the inner headers (headers of the GRE payload):

o The inner Ethernet frame comprises an inner Ethernet header followed by an optional inner IP header, followed by the IP payload. The inner frame could be any Ethernet data frame, not just IP. Note that the inner Ethernet frame's Frame Check Sequence (FCS) is not encapsulated.
o For illustrative purposes, IPv4 headers are shown as the inner IP headers, but IPv6 headers may be used. Henceforth, the IP address contained in the inner frame is referred to as the Customer Address (CA).

3.3. Inner Tag as Defined by IEEE 802.1Q

The inner Ethernet header of NVGRE MUST NOT contain the tag as defined by IEEE 802.1Q [5]. The encapsulating NVE MUST remove any existing IEEE 802.1Q tag before encapsulation of the frame in NVGRE. A decapsulating NVE MUST drop the frame if the inner Ethernet frame contains an IEEE 802.1Q tag.

3.4. Reserved VSID

The VSID range from 0-0xFFF is reserved for future use. The VSID 0xFFFFFF is reserved for vendor-specific NVE-to-NVE communication. The sender NVE SHOULD verify the receiver NVE's vendor before sending a packet using this VSID; however, such a verification mechanism is out of scope of this document. Implementations SHOULD choose a mechanism that meets their requirements.

4. NVGRE Deployment Considerations

4.1. ECMP Support

Equal-Cost Multipath (ECMP) may be used to provide load balancing. If ECMP is used, it is RECOMMENDED that the ECMP hash is calculated either using the outer IP frame fields and entire Key field (32 bits) or the inner IP and transport frame fields.

4.2. Broadcast and Multicast Traffic

To support broadcast and multicast traffic inside a virtual subnet, one or more administratively scoped multicast addresses [8] [9] can be assigned for the VSID. All multicast or broadcast traffic originating from within a VSID is encapsulated and sent to the assigned multicast address. From an administrative standpoint, it is possible for network operators to configure a PA multicast address for each multicast address that is used inside a VSID; this facilitates optimal multicast handling.
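The header layout of Section 3.2 and the reserved ranges of Section 3.4 can be captured in a few lines of code. The following Python sketch is illustrative only (it is not part of the specification, and the helper names are invented here); it packs and unpacks the 8-byte GRE header as NVGRE constrains it, with the C and S bits zero, the K bit set, Protocol Type 0x6558, and the Key field carrying the VSID and FlowID:

```python
import struct

GRE_PROTO_TEB = 0x6558  # Transparent Ethernet Bridging
GRE_FLAGS_KEY = 0x2000  # K bit set; C and S bits MUST be zero

def vsid_is_reserved(vsid: int) -> bool:
    """VSIDs 0-0xFFF are reserved for future use; 0xFFFFFF is
    reserved for vendor-specific NVE-to-NVE communication."""
    return vsid <= 0xFFF or vsid == 0xFFFFFF

def pack_nvgre_header(vsid: int, flow_id: int = 0) -> bytes:
    """Build the 8-byte NVGRE GRE header: flags/version, Protocol
    Type, then the 32-bit Key = 24-bit VSID | 8-bit FlowID."""
    if not 0 <= vsid <= 0xFFFFFF:
        raise ValueError("VSID is a 24-bit value")
    if not 0 <= flow_id <= 0xFF:
        raise ValueError("FlowID is an 8-bit value")
    key = (vsid << 8) | flow_id
    return struct.pack("!HHI", GRE_FLAGS_KEY, GRE_PROTO_TEB, key)

def unpack_nvgre_header(hdr: bytes) -> tuple:
    """Return (vsid, flow_id); reject headers whose C/K/S bits or
    Protocol Type do not match the NVGRE requirements."""
    flags, proto, key = struct.unpack("!HHI", hdr[:8])
    if proto != GRE_PROTO_TEB or (flags & 0xB000) != GRE_FLAGS_KEY:
        raise ValueError("not an NVGRE header")
    return key >> 8, key & 0xFF
```

For example, `pack_nvgre_header(0x123456, 0xAB)` yields the eight bytes `20 00 65 58 12 34 56 ab`: flags with only the K bit set, the Transparent Ethernet Bridging Ethertype, and the VSID/FlowID packed into the Key field.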
Depending on the hardware capabilities of the physical network devices and the physical network architecture, multiple virtual subnets may use the same physical IP multicast address.

Alternatively, based upon the configuration at the NVE, broadcast and multicast in the virtual subnet can be supported using N-way unicast. In N-way unicast, the sender NVE would send one encapsulated packet to every NVE in the virtual subnet. The sender NVE can encapsulate and send the packet as described in Section 4.3 ("Unicast Traffic"). This alleviates the need for multicast support in the physical network.

4.3. Unicast Traffic

The NVGRE endpoint encapsulates a Layer 2 packet in GRE using the source PA associated with the endpoint with the destination PA corresponding to the location of the destination endpoint. As outlined earlier, there can be one or more PAs associated with an endpoint, and policy will control which ones get used for communication. The encapsulated GRE packet is bridged and routed normally by the physical network to the destination PA. Bridging uses the outer Ethernet encapsulation for scope on the LAN. The only requirement is bidirectional IP connectivity from the underlying physical network. On the destination, the NVGRE endpoint decapsulates the GRE packet to recover the original Layer 2 frame. Traffic flows similarly on the reverse path.

4.4. IP Fragmentation

Section 5.1 of RFC 2003 [12] specifies mechanisms for handling fragmentation when encapsulating IP within IP. The subset of mechanisms NVGRE selects are intended to ensure that NVGRE-encapsulated frames are not fragmented after encapsulation en route to the destination NVGRE endpoint and that traffic sources can leverage Path MTU discovery. A sender NVE MUST NOT fragment NVGRE packets. A receiver NVE MAY discard fragmented NVGRE packets.
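The encapsulation overhead that the physical MTU must absorb can be worked out directly. This sketch (illustrative, assuming an IPv4 outer header with no options and no outer VLAN tag) tallies the bytes added by NVGRE:

```python
# NVGRE encapsulation overhead, assuming an IPv4 outer header with no
# options and no outer VLAN tag (all sizes in bytes). The outer
# Ethernet header is the underlay's own framing and is not counted
# against the underlay IP MTU.
OUTER_IPV4 = 20      # outer IPv4 header, protocol 0x2F (GRE)
GRE_HEADER = 8       # flags/version, Protocol Type, 32-bit Key field
INNER_ETHERNET = 14  # encapsulated Ethernet header (FCS not carried)

overhead = OUTER_IPV4 + GRE_HEADER + INNER_ETHERNET
print(overhead)        # 42

# To carry a 1500-byte inner IP packet without fragmenting the NVGRE
# packet, the underlay IP MTU must therefore be at least:
print(1500 + overhead) # 1542
```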
It is RECOMMENDED that the MTU of the physical network accommodates the larger frame size due to encapsulation. Path MTU or configuration via control plane can be used to meet this requirement.

4.5. Address/Policy Management and Routing

Address acquisition is beyond the scope of this document and can be obtained statically, dynamically, or using stateless address autoconfiguration. CA and PA space can be either IPv4 or IPv6. In fact, the address families don't have to match; for example, a CA can be IPv4 while the PA is IPv6, and vice versa.

4.6. Cross-Subnet, Cross-Premise Communication

One application of this framework is that it provides a seamless path for enterprises looking to expand their virtual machine hosting capabilities into public clouds. Enterprises can bring their entire IP subnet(s) and isolation policies, thus making the transition to or from the cloud simpler. It is possible to move portions of an IP subnet to the cloud; however, that requires additional configuration on the enterprise network and is not discussed in this document. Enterprises can continue to use existing communications models like site-to-site VPN to secure their traffic.

A VPN gateway is used to establish a secure site-to-site tunnel over the Internet, and all the enterprise services running in virtual machines in the cloud use the VPN gateway to communicate back to the enterprise. For simplicity, we use a VPN gateway configured as a VM (shown in Figure 2) to illustrate cross-subnet, cross-premise communication.
+-----------------------+        +-----------------------+
|        Server 1       |        |        Server 2       |
| +--------+ +--------+ |        | +-------------------+ |
| |  VM1   | |  VM2   | |        | |    VPN Gateway    | |
| | IP=CA1 | | IP=CA2 | |        | | Internal External | |
| |        | |        | |        | | IP=CAg   IP=GAdc  | |
| +--------+ +--------+ |        | +-------------------+ |
|       Hypervisor      |        |     Hypervisor     :  |
+-----------------------+        +--------------------:--+
        | IP=PA1                         | IP=PA4     :
        |                                |            :
        |   +-------------------------+  |            :  VPN
        +---|      Layer 3 Network    |--+            :  Tunnel
            +-------------------------+               :
                         |                            :
  +---------------------------------------------------:--+
  |                                                   :  |
  |                      Internet                     :  |
  |                                                   :  |
  +---------------------------------------------------:--+
                         |                            v
                         |                +-------------------+
                         +----------------|    VPN Gateway    |
                                          | External IP=GAcorp|
                                          +-------------------+
                                                    |
                                        +-----------------------+
                                        | Corp Layer 3 Network  |
                                        |     (In CA Space)     |
                                        +-----------------------+
                                                    |
                                      +---------------------------+
                                      |         Server X          |
                                      | +----------+ +----------+ |
                                      | | Corp VMe1| | Corp VMe2| |
                                      | | IP=CAe1  | | IP=CAe2  | |
                                      | +----------+ +----------+ |
                                      |         Hypervisor        |
                                      +---------------------------+

        Figure 2: Cross-Subnet, Cross-Premise Communication

The packet flow is similar to the unicast traffic flow between VMs; the key difference in this case is that the packet needs to be sent to a VPN gateway before it gets forwarded to the destination. As part of routing configuration in the CA space, a per-tenant VPN gateway is provisioned for communication back to the enterprise. The example illustrates an outbound connection between VM1 inside the data center and VMe1 inside the enterprise network. When the outbound packet from CA1 to CAe1 reaches the hypervisor on Server 1, the NVE in Server 1 can perform the equivalent of a route lookup on the packet. The cross-premise packet will match the default gateway rule, as CAe1 is not part of the tenant virtual network in the data center.
The virtualization policy will indicate the packet to be encapsulated and sent to the PA of the tenant VPN gateway (PA4) running as a VM on Server 2. The packet is decapsulated on Server 2 and delivered to the VM gateway. The gateway in turn validates and sends the packet on the site-to-site VPN tunnel back to the enterprise network. As the communication here is external to the data center, the PA address for the VPN tunnel is globally routable. The outer header of this packet is sourced from GAdc destined to GAcorp. This packet is routed through the Internet to the enterprise VPN gateway, which is the other end of the site-to-site tunnel; at that point, the VPN gateway decapsulates the packet and sends it inside the enterprise where the CAe1 is routable on the network. The reverse path is similar once the packet reaches the enterprise VPN gateway.

4.7. Internet Connectivity

To enable connectivity to the Internet, an Internet gateway is needed that bridges the virtualized CA space to the public Internet address space. The gateway needs to perform translation between the virtualized world and the Internet. For example, the NVGRE endpoint can be part of a load balancer or a NAT that replaces the VPN Gateway on Server 2 shown in Figure 2.

4.8. Management and Control Planes

There are several protocols that can manage and distribute policy; however, it is outside the scope of this document. Implementations SHOULD choose a mechanism that meets their scale requirements.

4.9. NVGRE-Aware Devices

One example of a typical deployment consists of virtualized servers deployed across multiple racks connected by one or more layers of Layer 2 switches, which in turn may be connected to a Layer 3 routing domain.
Even though routing in the physical infrastructure will work without any modification with NVGRE, devices that perform specialized processing in the network need to be able to parse GRE to get access to tenant-specific information. Devices that understand and parse the VSID can provide rich multi-tenant-aware services inside the data center. As outlined earlier, it is imperative to exploit multiple paths inside the network through techniques such as ECMP. The Key field (a 32-bit field, including both the VSID and the optional FlowID) can provide additional entropy to the switches to exploit path diversity inside the network. A diverse ecosystem is expected to emerge as more and more devices become multi-tenant aware. In the interim, without requiring any hardware upgrades, there are alternatives to exploit path diversity with GRE by associating multiple PAs with NVGRE endpoints with policy controlling the choice of which PA to use.

It is expected that communication can span multiple data centers and also cross the virtual/physical boundary. Typical scenarios that require virtual-to-physical communication include access to storage and databases. Scenarios demanding lossless Ethernet functionality may not be amenable to NVGRE, as traffic is carried over an IP network. NVGRE endpoints mediate between the network-virtualized and non-network-virtualized environments. This functionality can be incorporated into Top-of-Rack switches, storage appliances, load balancers, routers, etc., or built as a stand-alone appliance.

It is imperative to consider the impact of any solution on host performance.
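One way an encapsulating NVE might derive the per-flow entropy that the Key field carries is to fold the inner flow's 5-tuple into the 8-bit FlowID. The hash choice and field selection below are illustrative assumptions, not something NVGRE specifies:

```python
import zlib

def flow_id(src_ip: str, dst_ip: str, proto: int,
            src_port: int, dst_port: int) -> int:
    """Fold the inner flow's 5-tuple into the 8-bit FlowID so that
    switches hashing the 32-bit Key field see per-flow entropy.
    Any stable hash would do; CRC-32 is used here for illustration."""
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    return zlib.crc32(key) & 0xFF

# Two flows between the same pair of VMs generally land on different
# FlowIDs, letting ECMP place them on different underlay paths.
a = flow_id("10.0.0.1", "10.0.0.2", 6, 49152, 443)
b = flow_id("10.0.0.1", "10.0.0.2", 6, 49153, 443)
```

Because the FlowID is computed once at the encapsulating NVE and MUST NOT be modified in transit, packets of one flow keep a stable Key value and therefore a stable ECMP path.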
Today's server operating systems employ sophisticated acceleration techniques such as checksum offload, Large Send Offload (LSO), Receive Segment Coalescing (RSC), Receive Side Scaling (RSS), Virtual Machine Queue (VMQ), etc. These technologies should become NVGRE aware. IPsec Security Associations (SAs) can be offloaded to the NIC so that computationally expensive cryptographic operations are performed at line rate in the NIC hardware. These SAs are based on the IP addresses of the endpoints. As each packet on the wire gets translated, the NVGRE endpoint SHOULD intercept the offload requests and do the appropriate address translation. This will ensure that IPsec continues to be usable with network virtualization while taking advantage of hardware offload capabilities for improved performance.

4.10. Network Scalability with NVGRE

One of the key benefits of using NVGRE is the IP address scalability and, in turn, the MAC address table scalability that can be achieved. An NVGRE endpoint can use one PA to represent multiple CAs. This lowers the burden on the MAC address table sizes at the Top-of-Rack switches. One obvious benefit is in the context of server virtualization, which has increased the demands on the network infrastructure. By embedding an NVGRE endpoint in a hypervisor, it is possible to scale significantly. This framework enables location information to be preconfigured inside an NVGRE endpoint, thus allowing broadcast ARP traffic to be proxied locally. This approach can scale to large-sized virtual subnets. These virtual subnets can be spread across multiple Layer 3 physical subnets. It allows workloads to be moved around without imposing a huge burden on the network control plane. By eliminating most broadcast traffic and converting others to multicast, the routers and switches can function more optimally by building efficient multicast trees.
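The local ARP proxying mentioned above amounts to answering address resolution from preconfigured mapping state instead of flooding the virtual subnet. A minimal sketch, with a hypothetical table layout and helper name of our own invention:

```python
# Hypothetical sketch: an NVGRE endpoint answering ARP locally from
# preconfigured location state rather than broadcasting the request.
# Keys are (VSID, CA IP address); values are tenant VM MAC addresses.
arp_table = {
    (0x123456, "10.1.0.5"): "00:11:22:33:44:55",
}

def proxy_arp(vsid: int, target_ip: str):
    """Return the MAC address to answer with, or None to fall back
    (conceptually) to the virtual subnet's broadcast handling."""
    return arp_table.get((vsid, target_ip))
```

With the table populated from the management or control plane, an ARP request for 10.1.0.5 in virtual subnet 0x123456 is answered locally and never reaches the physical network.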
By using server and network capacity efficiently, it is possible to drive down the cost of building and managing data centers.

5. Security Considerations

This proposal extends the Layer 2 subnet across the data center and increases the scope for spoofing attacks. Mitigations of such attacks are possible with authentication/encryption using IPsec or any other IP-based mechanism. The control plane for policy distribution is expected to be secured by using any of the existing security protocols. Further management traffic can be isolated in a separate subnet/VLAN.

The checksum in the GRE header is not supported. The mitigation of this is to deploy an NVGRE-based solution in a network that provides error detection along the NVGRE packet path, for example, using Ethernet Cyclic Redundancy Check (CRC) or IPsec or any other error detection mechanism.

6. IANA Considerations

This document has no IANA actions.

7. References

7.1. Normative References

[1] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997, <http://www.rfc-editor.org/info/rfc2119>.

[2] IANA, "IEEE 802 Numbers", <http://www.iana.org/assignments/ieee-802-numbers>.

[3] Farinacci, D., Li, T., Hanks, S., Meyer, D., and P. Traina, "Generic Routing Encapsulation (GRE)", RFC 2784, DOI 10.17487/RFC2784, March 2000, <http://www.rfc-editor.org/info/rfc2784>.

[4] Dommety, G., "Key and Sequence Number Extensions to GRE", RFC 2890, DOI 10.17487/RFC2890, September 2000, <http://www.rfc-editor.org/info/rfc2890>.

[5] IEEE, "IEEE Standard for Local and metropolitan area networks--Media Access Control (MAC) Bridges and Virtual Bridged Local Area Networks", IEEE Std 802.1Q.

7.2. Informative References

[6] Greenberg, A., et al., "VL2: A Scalable and Flexible Data Center Network", Communications of the ACM, DOI 10.1145/1897852.1897877, 2011.

[7] Greenberg, A., et al., "The Cost of a Cloud: Research Problems in Data Center Networks", ACM SIGCOMM Computer Communication Review, DOI 10.1145/1496091.1496103, 2009.

[8] Hinden, R. and S. Deering, "IP Version 6 Addressing Architecture", RFC 4291, DOI 10.17487/RFC4291, February 2006, <http://www.rfc-editor.org/info/rfc4291>.

[9] Meyer, D., "Administratively Scoped IP Multicast", BCP 23, RFC 2365, DOI 10.17487/RFC2365, July 1998, <http://www.rfc-editor.org/info/rfc2365>.

[10] Narten, T., Ed., Gray, E., Ed., Black, D., Fang, L., Kreeger, L., and M. Napierala, "Problem Statement: Overlays for Network Virtualization", RFC 7364, DOI 10.17487/RFC7364, October 2014, <http://www.rfc-editor.org/info/rfc7364>.

[11] Lasserre, M., Balus, F., Morin, T., Bitar, N., and Y. Rekhter, "Framework for Data Center (DC) Network Virtualization", RFC 7365, DOI 10.17487/RFC7365, October 2014, <http://www.rfc-editor.org/info/rfc7365>.

[12] Perkins, C., "IP Encapsulation within IP", RFC 2003, DOI 10.17487/RFC2003, October 1996, <http://www.rfc-editor.org/info/rfc2003>.

[13] Touch, J. and R. Perlman, "Transparent Interconnection of Lots of Links (TRILL): Problem and Applicability Statement", RFC 5556, DOI 10.17487/RFC5556, May 2009, <http://www.rfc-editor.org/info/rfc5556>.

Contributors
Murari Sridharan
Microsoft Corporation
1 Microsoft Way
Redmond, WA 98052
United States
Email: muraris@microsoft.com

Albert Greenberg
Microsoft Corporation
1 Microsoft Way
Redmond, WA 98052
United States
Email: albert@microsoft.com

Narasimhan Venkataramiah
Microsoft Corporation
1 Microsoft Way
Redmond, WA 98052
United States
Email: navenkat@microsoft.com

Kenneth Duda
Arista Networks, Inc.
5470 Great America Pkwy
Santa Clara, CA 95054
United States
Email: kduda@aristanetworks.com

Ilango Ganga
Intel Corporation
2200 Mission College Blvd.
M/S: SC12-325
Santa Clara, CA 95054
United States
Email: ilango.s.ganga@intel.com

Geng Lin
Google
1600 Amphitheatre Parkway
Mountain View, CA 94043
United States
Email: genglin@google.com

Mark Pearson
Hewlett-Packard Co.
8000 Foothills Blvd.
Roseville, CA 95747
United States
Email: mark.pearson@hp.com

Patricia Thaler
Broadcom Corporation
3151 Zanker Road
San Jose, CA 95134
United States
Email: pthaler@broadcom.com

Chait Tumuluri
Emulex Corporation
3333 Susan Street
Costa Mesa, CA 92626
United States
Email: chait@emulex.com

Authors' Addresses

Pankaj Garg (editor)
Microsoft Corporation
1 Microsoft Way
Redmond, WA 98052
United States
Email: pankajg@microsoft.com

Yu-Shun Wang (editor)
Microsoft Corporation
1 Microsoft Way
Redmond, WA 98052
United States
Email: yushwang@microsoft.com