I2RS Working Group R. Krishnan Internet Draft Brocade Communications Category: Informational A. Ghanwani Dell S. Kini Ericsson D. Mcdysan Verizon Expires: April 2014 October 13, 2013 I2RS Large Flow Use Case draft-krishnan-i2rs-large-flow-use-case-00 Abstract Demands on networking bandwidth are growing exponentially due to applications such as large file transfers and those with rich media. Link Aggregation Group (LAG) and Equal Cost Multipath (ECMP) are extensively deployed in networks to scale the bandwidth. However, the flow based load balancing techniques used today make inefficient use of the bandwidth in the presence of long lived large flows. This draft presents a use-case to improve the efficiency under such conditions and aims to drive requirements for the I2RS WG. Status of this Memo This Internet-Draft is submitted to IETF in full conformance with the provisions of BCP 78 and BCP 79. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet- Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt. The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html. Krishnan Expires April 2014 [Page 1] Internet-Draft I2RS Large Flow September 2013 This Internet-Draft will expire on April, 2014. Copyright Notice Copyright (c) 2013 IETF Trust and the persons identified as the document authors. All rights reserved. This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Conventions used in this document The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC-2119 [RFC 2119]. Table of Contents 1. Introduction...................................................3 1.1. Acronyms..................................................4 1.2. Terminology...............................................4 2. Large Flow Recognition and Signaling...........................5 2.1. Network-based Recognition of Large Flows..................5 2.2. Application-based Signaling of Large Flows................5 3. Flow Rebalancing...............................................5 3.1. Local Rebalancing.........................................5 3.2. Global Rebalancing........................................6 3.2.1. IP Networks..........................................6 3.2.2. MPLS Networks........................................7 4. Operational Considerations.....................................7 5. IANA Considerations............................................7 6. Security Considerations........................................8 7. Acknowledgements...............................................8 8. References.....................................................8 8.1. Normative References......................................8 8.2. Informative References....................................8 Authors' Addresses................................................8 Krishnan Expires April 2014 [Page 2] Internet-Draft I2RS Large Flow September 2013 1. Introduction Networks extensively deploy LAG and ECMP for bandwidth scaling. Network traffic can be predominantly categorized into two traffic types: long-lived large flows and other flows (which include long- lived small flows, short-lived small/large flows) [OPSAWG-large- flow]. Stateless hash-based techniques [ITCOM, RFC 2991, RFC 2992, and RFC 6790] are often used to distribute both long-lived large flows and other flows over the components in a LAG/ECMP. However the traffic may not be evenly distributed over the component links due to the traffic pattern. This draft describes long-lived large flow load balancing techniques for achieving the best network bandwidth utilization with LAG/ECMP and the corresponding I2RS requirements. Some of these techniques have been described in detail in [OPSAWG-large-flow]. We describe methods that can be used locally within a single router, as well as methods that can be applied across multiple network elements, where the network is under the control of single administrative entity. We refer to the former as local load balancing and the latter as global load balancing. A combination of local and global load balancing helps in achieving the best network bandwidth utilization and latency for a given network topology. From a router standpoint, long-lived large flows are typically identified using one or more fields from the packet header from the following list: . Layer 2: source MAC address, destination MAC address, VLAN ID. . IP header: IP Protocol, IP source address, IP destination address, flow label (IPv6 only), TCP/UDP source port, TCP/UDP destination port. . MPLS Labels. For tunneling protocols like GRE, VXLAN, NVGRE, STT, etc., flow identification is possible based on inner and/or outer headers. The above list is not exhaustive. In the remainder of this document, consistent with [OPSAWG-large- flow], we use the term "large flow" to refer to "long-lived large flows," and we use the term "small flow" to refer to any of the three other types of flows identified above. Krishnan Expires April 2014 [Page 3] Internet-Draft I2RS Large Flow September 2013 At a high-level, the technique involves recognizing large flows and rebalancing them to achieve optimal load balancing. Large flows may be recognized within a router, or using the aid of an external management entity such as an IPFIX [RFC 7011] collector or a sFlow [sFlow-v5] collector. Once a large flow has been recognized, it must be signaled to a management entity that makes the rebalancing decision. Finally, the rebalancing decision is communicated to the routers to program the forwarding plane. In subsequent sections, we describe the requirements with recognition and rebalancing as they pertain to I2RS. 1.1. Acronyms ECMP: Equal Cost Multi-path GRE: Generic Routing Encapsulation LAG: Link Aggregation Group LSR: Label Switch Router MPLS: Multiprotocol Label Switching NVGRE: Network Virtualization using Generic Routing Encapsulation PBR: Policy Based Routing QoS: Quality of Service STT: Stateless Transport Tunneling VXLAN: Virtual Extensible LAN 1.2. Terminology Large flow(s): long-lived large flow(s) Small flow(s): long-lived small flow(s) and short-lived small/large flow(s) Krishnan Expires April 2014 [Page 4] Internet-Draft I2RS Large Flow September 2013 2. Large Flow Recognition and Signaling 2.1. Network-based Recognition of Large Flows The first step is recognizing large flows. There are two ways for recognizing large flows as described in [OPSAWG-large-flow]. The first method is automatic hardware-based recognition in which the large flows are identified in hardware. Once a large flow is recognized, it needs to be communicated to a management entity (such as an SDN controller) that is capable of making rebalancing decisions. This communication is out of scope for I2RS and can be handled using protocols such as IPFIX [RFC 7011]. The next method is where sFlow or IPFIX packet sampling can be used to convey packet samples to an external entity such as sFlow or IPFIX collector. The external entity recognizes large flows and this entity signals the large flows to another management entity that is capable of making rebalancing decisions (such as an SDN controller). Once again, this communication is out of scope of the I2RS. 2.2. Application-based Signaling of Large Flows Instead of having the network recognize large flows, the large flow can be signaled by an application that is known to instantiate large flows, e.g. a backup operation, and may perhaps indicate other parameters such as the latency desired. Such flows would once again need to be signaled to the management entity capable of routing or rebalancing decisions. This communication is also outside the scope of I2RS. 3. Flow Rebalancing 3.1. Local Rebalancing In the case of local rebalancing, the utilization of the component links that are part of the LAG or ECMP are monitored and the flows are redistributed among the member links to ensure optimal load balancing across all of the component links. Typically, this involves redirecting large flows to individual ECMP or LAG components, and potentially adjusting the weights used to distribute small flows across these components, using mechanisms specified in [OPSAWG-large-flow]. This approach works regardless of whether the underlying network is IP or MPLS. Krishnan Expires April 2014 [Page 5] Internet-Draft I2RS Large Flow September 2013 To achieve this, there are two requirements for I2RS: . For redirecting large flows to a specific member, a PBR entry is required with a key that identifies the flow and a corresponding nexthop that identifies the specific LAG or ECMP component. . For adjusting the weights used to distribute traffic across components of the LAG or ECMP, a mechanism is needed that identifies ECMP entries and is able to associate weights that can be programmed for each of the components. To do this in a scalable fashion, it would be useful to have the notion of an ECMP group that is used by multiple routes. At the RIB level, the nexthop information is typically specified as an outgoing IP interface. However, in an L2/L3 switch, the IP interface may be a VLAN, and in turn there are several "bridge ports" that are members of the VLAN. Typically, the route entry is resolved to a specific port before the forwarding table is programmed. If the bridge port is a LAG, there will be member ports associated with that LAG. Typically, the resolution down to an individual port is done via means not specified in any standard. From the standpoint of I2RS, the ability to address individual ports in a router is desirable. This requires the I2RS topology to be aware of LAG members, and the ability of routers to accept route or PBR entries that map to a specific member port within a LAG. 3.2. Global Rebalancing 3.2.1. IP Networks For IP networks, this involves programming a globally optimal path for the large flow. The globally optimal path is programmed in the IP network using hop-by-hop PBR rules. For IP networks, this involves creating a globally optimal path [HEDERA-dynamic-flow-scheduling] using a network management entity which hosts an I2RS client. The globally optimal path is programmed in the IP network using hop-by-hop PBR rules. The weights of the ECMP table for different nexthops should be adjusted to factor the long-lived large flows - this is explained below with an example. As an example, consider a 4 way ECMP at node n1 with IP nexthops n11, n12, n13, n14 using links l1, l2, l3, l4 each of capacity 10 Krishnan Expires April 2014 [Page 6] Internet-Draft I2RS Large Flow September 2013 Gbps. Say, a long-lived large flow of average bandwidth 2 Gbps is admitted to one of the links l3. The ECMP nexthop table needs to be adjusted to approximately account for the long-lived large flow so that the other flows do not overload link l3 which is already used by the large flow. The ECMP nexthop table will be programmed as w1*n11, w2*n12, w3*n13, w4*n14 where w1=w2=w4=1 and w3=0.8; this needs to be done for all the routes using the same set of nexthops. Now, if there are other set of nexthops from node n1 using link l3, they should also be adjusted. Say, there is another set of IP ECMP nexthops n13, n14, n15, n16 using links l3, l4, l5, l6. The ECMP nexthop table will be programmed as w1*n13, w2*n14, w3*n15, w4*n16 where w2=w3=w4=1 and w1=0.8; this needs to be done for all the routes using the same set of nexthops. In practice, there could be multiple large flows on a single link and the ECMP nexthop table must be adjusted to factor all of these flows. As mentioned in Section 3.1. , it would be useful to have a way of addressing an ECMP group, so that all routes sharing an ECMP group are addressed together. 3.2.2. MPLS Networks There are several ways to address global load rebalancing in MPLS networks. For example: . Have multiple LSPs between ingress and egress routers. In this case, having a PBR entry at the edge LSR that forwards the large flow to specific LSP known to have the necessary bandwidth is needed. . Program a new LSP for a given large flow. Here the requirements for I2RS would be providing the ability to program PBR entries at the edge LSR, and the programming new LSPs in the network. 4. Operational Considerations Operational considerations would be similar to those specified in [OPSAWG-large-flow]. 5. IANA Considerations None. Krishnan Expires April 2014 [Page 7] Internet-Draft I2RS Large Flow September 2013 6. Security Considerations This draft specifies a use case for I2RS and does not introduce any new security requirements beyond those already under consideration for I2RS. 7. Acknowledgements The authors would like to sincerely acknowledge Dan Romascanu for his help with understanding the relationship between the Interfaces MIB and LAG. The authors would also like to thank Prasad Jogalekar, Mukhtiar Shaikh and Muhammad Durrani for the discussions on the topic of global load balancing. 8. References 8.1. Normative References 8.2. Informative References [OPSAWG-large-flow] Krishnan, R. et al., "Mechanisms for Optimal LAG/ECMP Component Link Utilization in Networks," February 2014. [HEDERA-dynamic-flow-scheduling] Al-Fares, M. et al., "Hedera: Dynamic Flow Scheduling for Data Center Networks", December 2009 [sFlow-v5] Phaal, P. and M. Lavine, "sFlow version 5," July 2004. [RFC 7011] Claise, B., "Specification of the IP Flow Information Export (IPFIX) Protocol for the Exchange of Flow Information,", September 2013 [RFC 2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels,", March 1997 Authors' Addresses Ram Krishnan Brocade Communications ramk@brocade.com Anoop Ghanwani Dell anoop@alumni.duke.edu Sriganesh Kini Ericsson Krishnan Expires April 2014 [Page 8] Internet-Draft I2RS Large Flow September 2013 sriganesh.kini@ericsson.com Dave Mcdysan Verizon dave.mcdysan@verizon.com Krishnan Expires April 2014 [Page 9]