Network Working Group                                           S. Whyte
Internet-Draft                                               Google Inc.
Intended status: Informational                                  M. Hines
Expires: April 24, 2014                                        W. Kumari
                                                            Google, Inc.
                                                        October 21, 2013


                  Bulk Network Data Collection System
              draft-swhyte-i2rs-data-collection-system-00

Abstract

Collecting large amounts of data from network infrastructure devices has never been very easy. Existing methods generate CPU and memory loads that may be unacceptable, the output varies across implementations and can be difficult to parse, and these methods are often difficult to scale. I2RS programmatic interfacing with the routing system may exacerbate this problem: state needs to be collected from nodes and fed to consumers participating in the control plane that may not be physically close to the nodes. This state includes not only control plane information, but elements of the data plane that have a direct impact on control plane behavior, like traffic engineering.

This document outlines a set of use cases requiring a flexible framework to collect routing system data, and the features and functionality needed to make such a framework useful for these use cases.

Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119].

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time.
It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on April 24, 2014.

Copyright Notice

Copyright (c) 2013 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

Table of Contents

   1. Introduction
   2. Desired functionality
      2.1. Database Model
      2.2. Pub-Sub
      2.3. Capability Negotiation
      2.4. Format Agnostic
      2.5. Transport Options
      2.6. Filtering
      2.7. Timestamps
      2.8. Introspection
      2.9. Registration
   3. Use cases
      3.1. Push
         3.1.1. Interface counters
         3.1.2. Thresholds
         3.1.3. Streaming
      3.2. Pull
         3.2.1. Interface counters
         3.2.2. RIB Dump
         3.2.3. Arbitrary data collection
      3.3. Dynamic subscriptions
   4. Subscriber versus consumer
      4.1. Remapping
   5. Errors
   6. IANA Considerations
   7. Security Considerations
   8. Acknowledgements
   9. References
      9.1. Normative References
      9.2. Informative References
   Authors' Addresses

1. Introduction

Managing and monitoring a network requires getting state out of it. You can't manage what you don't measure, as the saying goes. Currently there is a limited set of tools to get data off of network nodes, and they do not lend themselves to programmatic access.

The primary tool today is SNMP. SNMP can be used both to push data off a node (via traps/notifications) and to pull data off the box (via queries). SNMP queries have a variety of issues, not the least of which is that the protocol specification requires data structures to be created on demand on network nodes, and these structures do not match how the device's operating system stores the same data. Fixing this problem has the immediate benefit of reducing CPU and memory consumption on the monitored network devices, greatly increasing the deployability and relevance of a solution. SNMP traps/notifications suffer from a lack of introspection; the network management system (NMS) must be preconfigured to understand what information is being reported.

Other tools include CLI scraping and Syslog. CLI scraping is a low-level pull mechanism and essentially the opposite of programmatic access. Any change in CLI implementation, whether it's a simple whitespace correction, re-ordering of configuration stanzas, typographical errors, or even unit changes, can require a rewriting of monitoring software. This is compounded by the fact that there is no standardized CLI specification, so a network with multiple vendors in it requires these rewrites for each vendor's CLI changes.

Syslog is another way to push data off of a network node. Syslog has been around a long time, and while current standards provide structured data output, very few implementations currently exist on network nodes. For the most part NMSes must be trained to consume and interpret different implementations of syslog.

2. Desired functionality

Collecting large data sets with high frequency and resolution, with minimal impact to a device's CPU and memory, is the primary objective. Aspects of the overall data collection system, such as availability, reliability, or scaling, are out of scope as they deal with the data once it has left the network node. We are only focusing on getting data off the node in an easily machine-parsable format.

2.1. Database Model

A database model is desired, whereby a network node can describe the data it has available and the structure of that data. This gives the implementor the ability to present a database model that is optimal for the node's internal data structure implementations. The NMS consumes and understands the database model only after it has been trained to do so by incorporating a published version of the database model from the vendor.

It should be noted that all existing data collection methods outlined earlier require explicit knowledge of the method's implementation for integration into an NMS.
We do not propose a solution that eliminates this, because heterogeneity of the data is not required, as we can see from existing implementations. Rather, capability negotiation and flexible formats and transports, outlined below, are desired, enabling the primary objective of getting large data sets off the nodes with as little impact as possible.

2.2. Pub-Sub

An underlying pub-sub model is desired for a variety of features. It provides a security model for authorization, it supports intermediaries allowing the system to scale as needed, and it provides both push and pull methods of data distribution.

In the context of this draft, a pub-sub model is a general concept indicating information flow. Specific system details are obviously critical yet belong in a data model document.

The high-level desire is to have network nodes as publishers, with an NMS implementing subscribers. Conceptually, they are connected by a message bus, a layer of indirection between the publishers and subscribers. Having a message bus allows publisher fan-in, subscriber fan-out, and a number of other useful features outside the scope of this document. The message bus is frequently referred to as a broker inside pub-sub models.

Having a message bus abstraction allows for considerable flexibility in NMS design as well. Placement of brokers in the network, their redundancy, availability, and scaling per publisher or subscriber can all be tailored to suit an individual network's needs, from extremely simple (flat) to extremely complex with multiple layers of hierarchy.

Many implementations of pub-sub models exist, scaling both in number of subscriptions and in number of messages, both of which should be considered carefully in the I2RS context.

2.3. Capability Negotiation

Capability negotiation allows a node to inform a subscriber of a number of options.
Two extremely important options would be the transport protocols and formats supported. Other aspects such as security options and error handling would also be negotiated during this phase.

The capability negotiation phase is done via a control channel opened for the purpose of registering subscriptions with the node. This control channel should use TCP.

2.4. Format Agnostic

From the I2RS perspective, this framework should be format agnostic. If a node advertises the ability to present data in XML and the subscriber agrees, then XML can be used. Other formats of interest are JSON, HTML, and protobufs. There is even interest in /proc/net formatted output, which would help an NMS based on this framework integrate into existing server configuration management systems.

[ Editor note: even ASN.1 should be an acceptable format. This would potentially allow an extremely easy deployment into an existing SNMP-based NMS.]

2.5. Transport Options

Because the focus of this framework ends at getting data off the box as quickly as possible, implementations should have the freedom to choose a transport that meets their system design needs and not be restricted by a specific format. During the negotiation phase a node should advertise all the transport options it provides and allow the subscriber to select what it needs.

Given that the time-value of different data elements coming off the node can be quite different, it should be possible to request multiple transports and associate each subscription with the transport protocol of choice.

2.6. Filtering

Once a network node has provided its database model to a subscriber, the subscriber needs a way to select parts of the model for subscription, and it needs to be able to request multiple subscriptions at a time.

This framework should provide a standard filtering mechanism so that, independent of the database model structure and contents, a subscriber can select interesting items to collect and bucket them based on standard parameters such as frequency of collection, underlying transport required, whether the data is to be pushed or pulled, or even streaming or one-shot.

2.7. Timestamps

Every piece of data collected by this framework needs a timestamp associated with it indicating when the node made it available for collection. This is not required on a per-variable basis; for example, data organized into a table only requires a timestamp associated with the table.

This is not to say that additional timestamps are not useful for certain data sets, nor that timestamps with other semantics (for example, collection time versus advertisement time) cannot be used, but rather that those additional timestamps are better placed in the database model supported by the device.

2.8. Introspection

This framework should support introspection of the database model. Introspection provides support for data verification, easier inclusion of legacy data, and easier merging of data streams.

2.9. Registration

After capabilities and a database model have been exchanged, and a filter has been used to select elements of the model to subscribe to, the framework should support a standard way to register for all the data desired, using whatever capabilities were advertised by the node. Once registration is complete, the control channel can be closed. Ensuring subscriptions are correct, complete, and replicated or not is up to the overall system and not the network node.

3. Use cases

Following are example use cases outlining the utility of subscribing to data with different parameters.

3.1. Push

Pushing data off the box can be done synchronously at fixed intervals, or asynchronously in an ad-hoc fashion. All data pushed is set up via registered subscriptions.

3.1.1. Interface counters

Interface counters provide a use case demonstrating the need to push data off of a network node at specific intervals. In this proposed framework, a node would advertise its database model including all the interfaces it has to offer and what it can count on each. A subscriber would select the interfaces and the counters on each that it is interested in via a filter, use the filter to group them according to available parameters, and register with the node to have them published at agreed-upon intervals.

3.1.2. Thresholds

Another use case demonstrating a push capability is thresholding. Assuming a node advertises the capability to record and track a threshold for a particular data type, it would use the registered subscription to push relevant data to the subscriber whenever the threshold was crossed.

As an example, a subscriber may want to set a threshold for memory consumed: if the available device memory falls below a threshold, the subscriber should be informed so that the operator can investigate the issue manually or programmatically.

3.1.3. Streaming

Streaming data, such as RIB information, will be critical to supporting I2RS functionality. In this use case, a subscriber may desire to have all updates to a RIB streamed into the collection system, in as close to real-time as possible.

3.2. Pull

Pulling data off the node will always be a one-shot function. As such it is probably the most heavy-handed way to get data into the collection system, as it requires all the overhead of setting up and tearing down the control channel, exchanging the database model, creating a filter, and receiving the data. Nevertheless, it can be a valuable option and should be supported.

N.B.: It is certainly possible to cache requests on publishers and have them "replayed" via a subscription identifier. However, the capability to track the state required to do so may not be available on a node, and this is somewhat counter to the overall goal of minimizing impact to the node. Having this capability as an optional parameter of a database model is worth exploring.

3.2.1. Interface counters

Similar to the interface counter example above, except in this case the registration includes a parameter indicating the data should be collected immediately and sent only once.

3.2.2. RIB Dump

Getting a snapshot of the node's current RIB can be useful for a variety of reasons. Similar to the streaming RIB example above, in this example the subscriber would register for a one-shot dump of the RIB, collected and sent immediately.

3.2.3. Arbitrary data collection

Once the NMS understands a node's database model, it should be able to register for one-shot collection of any subset of that database model. Given the overheads involved, this would best be restricted to one-off collection needs, such as troubleshooting, but the use case need is solid.

3.3. Dynamic subscriptions

This framework should support dynamic subscription capabilities with pre-existing monitoring protocols that currently require static configuration. For example, if a node's database model indicates it supports IPFIX, a subscriber should be able to set up a streaming IPFIX feed using the standard registration process outlined above. BMP and the like should also be available via this mechanism.

4. Subscriber versus consumer

It should be noted that because overall data collection system architecture is out of scope, it is opaque to this framework whether a subscriber is also the consumer of data. In order to maximize design options, including scalability of the overall system, both options should be supported.
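
As a rough illustration of how a subscriber-side library might represent the registration parameters discussed in Sections 2.3 through 2.9 and exercised by the use cases in Section 3, the following Python sketch may help. It is not part of any proposed data model: the class names, field names, model paths, and the transport and format labels are all hypothetical. The consumer field illustrates the distinction made in this section, namely that the endpoint receiving the data need not be the subscriber that registered for it.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Tuple


@dataclass(frozen=True)
class Capabilities:
    """Options a node advertises during capability negotiation (assumed shape)."""
    transports: FrozenSet[str]
    formats: FrozenSet[str]


@dataclass(frozen=True)
class Subscription:
    """One registered subscription; every field name here is hypothetical."""
    paths: Tuple[str, ...]              # parts of the database model picked by the filter
    transport: str                      # should be one of the advertised transports
    data_format: str                    # should be one of the advertised formats
    push: bool = True                   # False means the data is pulled
    one_shot: bool = False              # collect immediately and send only once
    interval_s: Optional[float] = None  # fixed push interval; None means ad hoc or streaming
    consumer: Optional[str] = None      # delivery endpoint; None means the subscriber itself


def check(sub: Subscription, caps: Capabilities) -> List[str]:
    """Return the reasons a node could reject this registration (empty if none)."""
    problems = []
    if sub.transport not in caps.transports:
        problems.append(f"transport {sub.transport!r} not advertised")
    if sub.data_format not in caps.formats:
        problems.append(f"format {sub.data_format!r} not advertised")
    if not sub.push and not sub.one_shot:
        problems.append("pull is always one-shot")
    if sub.one_shot and sub.interval_s is not None:
        problems.append("one-shot collection cannot have an interval")
    return problems


# Hypothetical advertisement from a node.
caps = Capabilities(transports=frozenset({"tcp", "udp"}),
                    formats=frozenset({"json", "protobuf"}))

# Section 3.1.1: interface counters pushed at an agreed-upon interval.
counters = Subscription(paths=("interfaces/*/counters",), transport="udp",
                        data_format="protobuf", interval_s=30.0)

# Section 3.2.2: one-shot RIB dump, delivered to a consumer other than the subscriber.
rib_dump = Subscription(paths=("rib/ipv4",), transport="tcp", data_format="json",
                        push=False, one_shot=True, consumer="collector.example.net")

# XML was not advertised and a pull must be one-shot, so this has two problems.
bad = Subscription(paths=("rib/ipv4",), transport="tcp", data_format="xml", push=False)

for sub in (counters, rib_dump, bad):
    print(sub.paths, check(sub, caps))
```

Running the sketch prints an empty problem list for the first two registrations and two problems for the third. Checking on the subscriber side is only one possible design choice; a node would presumably still validate registrations itself.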

4.1. Remapping

Remapping in this context is the ability to modify a node's database model and request that the modified model be used in subscriptions. While this has interesting properties, it strays far from the primary objective of getting data off of nodes as fast as possible with as little impact as possible, and thus should be considered out of scope.

5. Errors

Errors happen. Many classes of errors and their handling are already well understood and don't need to be reiterated here. There are certainly failure modes that may be unique to I2RS or this framework, however, and we should be prepared to incorporate solutions for those.

For example, providing a method for a node and a subscriber to agree on resolution steps after defined error events would be very useful. A subscriber may want certain subscriptions to be available for pulling if the push mechanism fails. There may also be value in defining how a subscriber can probe the transport layer, such that publisher responses can assist in troubleshooting protocol-specific failures.

The framework needs to support standardized handling of stale data. This class of error will largely be related to handling changes and exceptions in the database models exchanged. For example, what happens when a node's physical configuration changes and part of an existing subscription becomes invalid? Similar thought needs to be given to logical changes, such as the disappearance of a BGP speaker.

6. IANA Considerations

This document makes no request of the IANA.

7. Security Considerations

I2RS provides security requirements; any security requirements raised by this framework should be encompassed there.

[TODO(WK, SW): This section needs more work / text ]

8. Acknowledgements

The authors wish to acknowledge the contributions of a number of folk, including [TODO(WK, SW): Remember to add folk! ]

9. References

9.1. Normative References

[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.

9.2. Informative References

[DeBoer] De Boer, M. and J. Bosma, "Discovering Path MTU black holes on the Internet using RIPE Atlas", July 2012.

Authors' Addresses

Scott Whyte
Google Inc.
1600 Amphitheatre Parkway
Mountain View, California 94043
USA

Email: swhyte@google.com


Marcus Hines
Google, Inc.
1600 Amphitheatre Parkway
Mountain View, California 94043
USA

Email: hines@google.com


Warren Kumari
Google, Inc.
1600 Amphitheatre Parkway
Mountain View, California 94043
USA

Email: warren@kumari.net