l3vpn D. Rao Internet-Draft V. Jain Intended status: Standards Track Cisco Systems Expires: January 17, 2014 July 16, 2013 L3VPN Virtual Network Overlay Multicast draft-drao-l3vpn-virtual-network-overlay-multicast-00 Abstract Virtual network overlays are extremely useful for supporting multicast applications in multi-tenant data center networks, by distributing the per-tenant multicast state and forwarding actions at the network edges with minimal control plane load and simpler forwarding in the core. A virtual overlay network may use existing encapsulations such as MPLS-in-GRE or newer IP based encapsulations such as VXLAN and NVGRE. IP multicast sources and receivers are very commonly spread out across multiple subnets. Sources and receivers may also be spread within and outside a single network domain such as a data center. Hence, a Layer-3 multicast paradigm is the most suitable and efficient approach for delivery of IP multicast traffic. BGP based MVPNs provide a good basis for providing a solution to support IP multicast across these overlay networks. An appropriate subset of the MVPN control plane and procedures are sufficient to support the requirements, providing for a simpler model of operation. This document describes the use of BGP based MVPNs alongwith the new IP-based virtual network overlay encapsulations to provide a Layer-3 virtualization solution for IP multicast traffic, and specifies mechanisms to use the new encapsulations while continuing to leverage the BGP MVPN control plane techniques and extensions. Status of This Memo This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet- Drafts is at http://datatracker.ietf.org/drafts/current/. Rao & Jain Expires January 17, 2014 [Page 1] Internet-Draft July 2013 Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." This Internet-Draft will expire on January 17, 2014. Copyright Notice Copyright (c) 2013 IETF Trust and the persons identified as the document authors. All rights reserved. This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License. Table of Contents 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 2. Requirements Language . . . . . . . . . . . . . . . . . . . . 3 3. Solution Requirements . . . . . . . . . . . . . . . . . . . . 4 4. Network Topology . . . . . . . . . . . . . . . . . . . . . . 4 5. Solution Scenarios . . . . . . . . . . . . . . . . . . . . . 5 6. Fine-grained Transit Pruning . . . . . . . . . . . . . . . . 6 7. Originating interest at a receiver edge device . . . . . . . 6 8. Source mapping from a sender edge device . . . . . . . . . . 7 9. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 8 10. Security Considerations . . . . . . . . . . . . . . . . . . . 8 11. Change Log . . . . . . . . . . . . . . . . . . . . . . . . . 8 12. References . . . . . . . . . . . . . . . . . . . . . . . . . 8 12.1. Normative References . . . . . . . . . . . . . . . . . . 8 12.2. Informative References . . . . . . . . . . . . . . . . . 8 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 8 1. Introduction Rao & Jain Expires January 17, 2014 [Page 2] Internet-Draft July 2013 Virtual network overlays are extremely useful for supporting multicast applications in multi-tenant data center networks, by distributing the per-tenant multicast state and forwarding actions at the network edges with minimal control plane load and simpler forwarding in the core. A virtual overlay network may use existing encapsulations such as MPLS over GRE or newer IP based encapsulations such as VXLAN and NVGRE. IP multicast sources and receivers are very commonly spread out across multiple subnets. Sources and receivers may also be spread within and outside a single network domain such as a data center. Hence, a Layer-3 multicast paradigm is the most suitable and efficient approach for delivery of multicast traffic. To send multicast data to multiple receivers across the overlay, packets are sent on a core tree (P-tunnel) that is typically realized using an IP multicast encapsulation such as the ones mentioned above. In order to reduce the number of multicast P-tunnels that need to be set up and maintained in the core network, aggregate core trees may be used, with multiple VPNs being supported over a given P-tunnel. The P-tunnel encapsulations such as MPLS-in-GRE, VXLAN and NVGRE support a VN-ID or VPN label in the encapsulation header which can be used to distinguish the VPN. BGP based MVPNs provide a good basis for providing a solution to support IP multicast across these overlay networks. An appropriate subset of the MVPN control plane and procedures are sufficient to support the requirements, providing for a simpler model of operation. This document describes the use of BGP based MVPNs alongwith the IP- based virtual network overlay encapsulations to provide a Layer-3 virtualization solution for IP multicast traffic, and specifies mechanisms to use the new encapsulations while continuing to leverage the BGP MVPN control plane techniques and extensions. This mechanism provides an efficient incremental solution to support forwarding for IP traffic, irrespective of whether it is destined within or across an IP subnet boundary. 2. Requirements Language The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" are to be interpreted as described in [RFC2119] only when they appear in all upper case. They may also appear in lower or mixed case as English words, without any normative meaning. Rao & Jain Expires January 17, 2014 [Page 3] Internet-Draft July 2013 3. Solution Requirements 1. Support for large number of senders and receivers for a given group 2. Receivers and groups spread out across large number of PEs or edge nodes 3. Support for large number of C-multicast routes 4. Support for Multi-tenancy 5. Support for optimal multicast replication within and across subnets 6. Support for optimal pruning on transit nodes 7. Minimal control plane churn 8. Minimal state in core 9. Support for multiple overlay encapsulations 10. Support for redundancy and load-balancing 4. Network Topology In an environment such as the data center where overlay networks are used, the IP multicast sources and receivers are hosts typically resident on servers attached to network access devices such as virtual software routers or physical access switches. These hosts may belong to different tenants who require isolation and segregation in forwarding state and traffic. At the same time, the physical transport or core network must be shared for traffic from multiple tenants. This requires the edge devices to support multiple VPNs facing the access interfaces, with a shared overlay encapsulation towards the core. Due to the high degree of traffic that flows among the hosts on various servers within the data center, a DC environment is likely to use uniform, densely meshed topologies, such as a spine-leaf topology, which provide high redundancy and bandwidth. The servers may be dually attached to a pair of access switches for edge redundancy. Rao & Jain Expires January 17, 2014 [Page 4] Internet-Draft July 2013 In order to scale the core multicast, as well as to efficiently support the common case where a large number of senders and receivers are present in a VPN, shared trees are used for the overlay. Bidir multicast is the preferred mode. Also, to support a large number of VPNs efficiently, aggeegate trees are used with multiple VPNs being multiplexed over a core tree. The various IP overlay encapsulations such as VXLAN, NVGRE or MPLS-in-GRE contain a VN-ID or VPN label in their header which is used as a distinguisher for a MVPN in the data plane. 5. Solution Scenarios There are a couple of overall scenarios which are applicable for these virtual overlay networks. One mode of operation is where all overlay multicast traffic is encapsulated within a fixed number of core trees or P-tunnels that all edge nodes or PEs can be a part of. Then both source and receiver edge devices can independently encapsulate and decapsulate overlay traffic, without requiring additional signaling among them. In this mode, source edge devices map a given C-flow from a locally attached source onto a specific core multicast tree or P-tunnel based on a local mapping decision. Receiver edge devices join all the available P-tunnels, and hence are able to receive traffic from these source edge devices. They then discard the traffic that they do not have local receivers for, and replicate the interesting flows towards local receivers. It is, however, desirable that all receiver edge devices do not receive all traffic and have to filter the unnecessary flows. This is especially applicable in high-bandwidth traffic environments. In order to support this ability, it is required to have signaling from a receiver edge device indicating its interest in specific C-groups. In addition, high-bandwidth sourced traffic flows may be sent on a specific P-tunnel which is determined by the source edge device, and requires interested receiver edge devices to join that core tree. In such cases, it is also beneficial if the sourced multicast traffic is sent out into the overlay only if there are known to be receivers at other edge devices or external to the overlay. For all such cases, it is required to have signaling from both source and receiver edge nodes. Signaling involves host group interest from receiver edge nodes as well as the C-flow to P-tunnel mapping signaling from source edge nodes. Rao & Jain Expires January 17, 2014 [Page 5] Internet-Draft July 2013 6. Fine-grained Transit Pruning Typically, when an overlay is used, the transit nodes in the physical topology only participate in the underlay control plane and forward all traffic based on the encpasulated packets. They are unaware of the inner header or payload. This allows them to be simpler and scale better. One consequence of using an overlay is that multicast traffic needs to make it to the receiver edge nodes before they can be pruned in case there are no downstream receivers. A widely deployed option to avoid redundant traffic from getting to all the receiver edge nodes is to use a larger number of core multicast trees, thereby reducing the ratio of overlay flows to each multicast tree, and allowing the receiver edge nodes to only join the multicast trees that the interesting flows map to. This has the side effect of increasing the underlay multicast state in the core, and hence the load on the underlay multicast protocols such as PIM. It also does not provide for complete pruning of multicast traffic. However, it is possible in certain topologies to support an alternative mechanism for fine-grained pruning of multi-destination traffic. In the uniform, meshed topologies that are used as mentioned earlier, certain transit nodes can use the receiver interest information sent by the receiver edge devices for the overlay, to filter traffic on outgoing links towards the receiver edge nodes on a per-VPN or more fine-grained basis. This allows the core multicast trees built by PIM to be smaller in number. The transit nodes do need to run MP-BGP to obtain the overlay multicast group information sent by the various receiver nodes. The transit nodes will typically obtain this information from the RRs. This assumes the core devices can look into the inner headers of packets to prune traffic based on either VN-ID or *G/S,Gs. In order for the join routes to be sent to the transit nodes, the appropriate RTs will be provisioned on the transit nodes. 7. Originating interest at a receiver edge device Rao & Jain Expires January 17, 2014 [Page 6] Internet-Draft July 2013 A receiver edge device will originate shared or source tree joins on the overlay for receivers that are locally attached or downstream to it. This will be triggered by locally received IGMP or PIM joins. The receiver edge device will also typically act as a PIM first-hop or last-hop router with attached sources and receivers. These join routes need to be propagated towards potential sources as well as the RPs. In a virtual network overlay topology, sources are attached to edge devices. So the other edge devices must receive these join routes. For a situation where there are multiple sources which are spread out across a large number of edge devices, or for a case where the groups are Bidir groups, a shared tree join route may be propagated to all edge devices. For high bandwidth sources, it is desirable to direct the specific group or source joins to the appropriate source leafs. More granular filtering is needed in this case - in addition to VRF, group based active source discovery and advertisement is used to control join propagation. A receiver leaf uses C-multicast route of type Shared Tree Join (C-RP, C-G) or Source Tree Join (C-S, C-G). To be able to propagate the shared joins to all edge devices, the join routes may be originated with an RD that is specific to the originating device. This RD can be the same value as that used by unicast routes. The routes are sent with a RT that all interested edge devices may use as an import RT for this VPN, if they need to receive the join routes. Route propagation is constrained based on policy (RT) along the path. 8. Source mapping from a sender edge device An edge device that is attached to a source may signal the active sources information on the overlay, along with the core multicast group that the edge device decides to use. Receiver edge nodes can use this information to join the core multicast trees. This mapping is advertised by using the S-PMSI A-D routes, with the PMSI Tunnel Attribute type indicating the appropriate overlay encapsulation type - VXLAN or NVGRE. Aggregate trees will be used, with the VN-ID being signaled along with the route. Rao & Jain Expires January 17, 2014 [Page 7] Internet-Draft July 2013 9. IANA Considerations To be completed 10. Security Considerations To be completed 11. Change Log 12. References 12.1. Normative References [RFC1771] Rekhter, Y. and T. Li, "A Border Gateway Protocol 4 (BGP-4)", RFC 1771, March 1995. [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997. [RFC2784] Farinacci, D., Li, T., Hanks, S., Meyer, D., and P. Traina, "Generic Routing Encapsulation (GRE)", RFC 2784, March 2000. [RFC5512] Mohapatra, P. and E. Rosen, "The BGP Encapsulation Subsequent Address Family Identifier (SAFI) and the BGP Tunnel Encapsulation Attribute", RFC 5512, April 2009. 12.2. Informative References [I-D.mahalingam-dutt-dcops-vxlan] Mahalingam, M., Dutt, D., Duda, K., Agarwal, P., Kreeger, L., Sridhar, T., Bursell, M., and C. Wright, "VXLAN: A Framework for Overlaying Virtualized Layer 2 Networks over Layer 3 Networks", draft-mahalingam-dutt-dcops-vxlan-02 (work in progress), August 2012. [I-D.sridharan-virtualization-nvgre] Sridharan, M., Greenberg, A., Venkataramaiah, N., Wang, Y., Duda, K., Ganga, I., Lin, G., Pearson, M., Thaler, P., and C. Tumuluri, "NVGRE: Network Virtualization using Generic Routing Encapsulation", draft-sridharan- virtualization-nvgre-02 (work in progress), February 2013. Authors' Addresses Rao & Jain Expires January 17, 2014 [Page 8] Internet-Draft July 2013 Dhananjaya Rao Cisco Systems 170 W. Tasman Drive San Jose, CA 95124 95134 USA Email: dhrao@cisco.com Vipin Jain Cisco Systems 170 W. Tasman Drive San Jose, CA 95124 95134 USA Email: vipijain@cisco.com Rao & Jain Expires January 17, 2014 [Page 9]