Network Working Group                                           S. Whyte
Internet-Draft                                               Google Inc.
Intended status: Informational                                  M. Hines
Expires: April 24, 2014                                        W. Kumari
                                                            Google, Inc.
                                                        October 21, 2013


                  Bulk Network Data Collection System
              draft-swhyte-i2rs-data-collection-system-00

Abstract

   Collecting large amounts of data from network infrastructure devices
   has never been very easy.  Existing methods generate CPU and memory
   loads that may be unacceptable, the output varies across
   implementations and can be difficult to parse, and these methods are
   often difficult to scale.  I2RS programmatic interfacing with the
   routing system may exacerbate this problem: state needs to be
   collected from nodes and fed to consumers participating in the
   control plane that may not be physically close to the nodes.  This
   state includes not only control plane information, but elements of
   the data plane that have a direct impact on control plane behavior,
   like traffic engineering.

   This document outlines a set of use cases requiring a flexible
   framework to collect routing system data, and the features and
   functionality needed to make such a framework useful for these use
   cases.

Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any


Whyte, et al.            Expires April 24, 2014                 [Page 1]

Internet-Draft            Bulk Data Collection              October 2013


   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on April 24, 2014.

Copyright Notice

   Copyright (c) 2013 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   3
   2.  Desired functionality . . . . . . . . . . . . . . . . . . . .   3
     2.1.  Database Model  . . . . . . . . . . . . . . . . . . . . .   4
     2.2.  Pub-Sub . . . . . . . . . . . . . . . . . . . . . . . . .   4
     2.3.  Capability Negotiation  . . . . . . . . . . . . . . . . .   5
     2.4.  Format Agnostic . . . . . . . . . . . . . . . . . . . . .   5
     2.5.  Transport Options . . . . . . . . . . . . . . . . . . . .   5
     2.6.  Filtering . . . . . . . . . . . . . . . . . . . . . . . .   5
     2.7.  Timestamps  . . . . . . . . . . . . . . . . . . . . . . .   6
     2.8.  Introspection . . . . . . . . . . . . . . . . . . . . . .   6
     2.9.  Registration  . . . . . . . . . . . . . . . . . . . . . .   6
   3.  Use cases . . . . . . . . . . . . . . . . . . . . . . . . . .   6
     3.1.  Push  . . . . . . . . . . . . . . . . . . . . . . . . . .   7
       3.1.1.  Interface counters  . . . . . . . . . . . . . . . . .   7
       3.1.2.  Thresholds  . . . . . . . . . . . . . . . . . . . . .   7
       3.1.3.  Streaming . . . . . . . . . . . . . . . . . . . . . .   7
     3.2.  Pull  . . . . . . . . . . . . . . . . . . . . . . . . . .   7
       3.2.1.  Interface counters  . . . . . . . . . . . . . . . . .   8
       3.2.2.  RIB Dump  . . . . . . . . . . . . . . . . . . . . . .   8
       3.2.3.  Arbitrary data collection . . . . . . . . . . . . . .   8
     3.3.  Dynamic subscriptions . . . . . . . . . . . . . . . . . .   8
   4.  Subscriber versus consumer  . . . . . . . . . . . . . . . . .   8
     4.1.  Remapping . . . . . . . . . . . . . . . . . . . . . . . .   8
   5.  Errors  . . . . . . . . . . . . . . . . . . . . . . . . . . .   9
   6.  IANA Considerations . . . . . . . . . . . . . . . . . . . . .   9
   7.  Security Considerations . . . . . . . . . . . . . . . . . . .   9
   8.  Acknowledgements  . . . . . . . . . . . . . . . . . . . . . .   9
   9.  References  . . . . . . . . . . . . . . . . . . . . . . . . .  10
     9.1.  Normative References  . . . . . . . . . . . . . . . . . .  10
     9.2.  Informative References  . . . . . . . . . . . . . . . . .  10
   Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . .  10

1.  Introduction

   Managing and monitoring a network requires getting state out of it.
   You can't manage what you don't measure, as the saying goes.
   Currently there is a limited set of tools to get data off network
   nodes, and they do not lend themselves to programmatic access.

   The primary tool today is SNMP.  SNMP can be used both to push data
   off a node (via traps/notifications) and to pull data off the node
   (via queries).  SNMP queries have a variety of issues, not the least
   of which is that the protocol specification requires data structures
   to be created on demand on network nodes, and these structures do not
   match how the device's operating system stores the same data.  Fixing
   this problem has the immediate benefit of reducing CPU and memory
   consumption on the monitored network devices, greatly increasing the
   deployability and relevance of a solution.  SNMP traps/notifications
   suffer from a lack of introspection; the network management system
   (NMS) must be preconfigured to understand what information is being
   reported.

   Other tools include CLI scraping and Syslog.  CLI scraping is a
   low-level pull mechanism and essentially the opposite of programmatic
   access.  Any change in CLI implementation, whether it is a simple
   whitespace correction, re-ordering of configuration stanzas,
   typographical errors, or even unit changes, can require rewriting
   the monitoring software.  This is compounded by the fact that there
   is no standardized CLI specification, so a network with multiple
   vendors requires such rewrites for each vendor's CLI changes.

   Syslog is another way to push data off a network node.  Syslog has
   been around a long time, and while current standards provide
   structured data output, very few implementations of them exist on
   network nodes today.  For the most part, NMSes must be taught how
   to consume and interpret different implementations of syslog.

2.  Desired functionality

   Collecting large data sets with high frequency and resolution, with
   minimal impact on a device's CPU and memory, is the primary
   objective.  Aspects of the overall data collection system, such as
   availability, reliability, or scaling, are out of scope, as they
   deal with the data once it has left the network node.

   We are focusing only on getting data off the node in an easily
   machine-parsable format.

2.1.  Database Model

   A database model is desired, whereby a network node can describe the
   data it has available and the structure of that data.  This gives
   the implementor the ability to present a database model that is
   optimal for the node's internal data structure implementations.  The
   NMS consumes and understands the database model only after it has
   been trained to do so by incorporating a published version of the
   database model from the vendor.

   It should be noted that all existing data collection methods outlined
   earlier require explicit knowledge of the method's implementation for
   integration into an NMS.  We do not propose a solution that
   eliminates this, because homogeneity of the data is not required, as
   we can see from existing implementations.  Rather, capability
   negotiation and flexible formats and transports, outlined below, are
   desired, enabling the primary objective of getting large data sets
   off the nodes with as little impact as possible.
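
   As an illustration only, a database model could be as simple as a
   nested mapping that names each table and the type of each column.
   The layout, the table names, and the type names below are
   assumptions made for this sketch; this document does not define
   them.

```python
# Hypothetical database model a node might advertise.
# Every table, column, and type name here is an illustrative assumption.
NODE_MODEL = {
    "interfaces": {
        "key": "name",
        "columns": {
            "name": "string",
            "in_octets": "counter64",
            "out_octets": "counter64",
        },
    },
    "rib": {
        "key": "prefix",
        "columns": {"prefix": "string", "next_hop": "string"},
    },
}


def describe(model):
    """Return sorted 'table.column: type' strings for every column."""
    lines = []
    for table, spec in model.items():
        for column, type_name in spec["columns"].items():
            lines.append(f"{table}.{column}: {type_name}")
    return sorted(lines)


description = describe(NODE_MODEL)
```

   An NMS that has incorporated the vendor's published model could use
   a listing like this to decide what it is able to subscribe to.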

2.2.  Pub-Sub

   An underlying pub-sub model is desired for a variety of features.  It
   provides a security model for authorization, it supports
   intermediaries allowing the system to scale as needed, and it
   provides both push and pull methods of data distribution.

   In the context of this draft, a pub-sub model is a general concept
   indicating information flow.  Specific system details are obviously
   critical yet belong in a data model document.  The high level desire
   is to have network nodes as publishers, with an NMS implementing
   subscribers.  Conceptually, they are connected by a message bus, a
   layer of indirection between the publishers and subscribers.  Having
   a message bus allows publisher fan-in, subscriber fan-out, and a
   number of other useful features outside the scope of this document.
   The message bus is frequently referred to as a broker inside pub-sub
   models.

   Having a message bus abstraction allows for considerable flexibility
   in NMS design as well.  Placement of brokers in the network, their
   redundancy, availability, and scaling per publisher or subscriber can
   all be tailored to suit an individual network's needs, from extremely
   simple (flat) to extremely complex with multiple layers of hierarchy.
   Many implementations of pub-sub models exist, scaling both in number
   of subscriptions and in number of messages, both of which should be
   considered carefully in the I2RS context.
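
   The broker concept can be sketched with a small in-memory message
   bus.  This is a toy under stated assumptions, not a proposed design:
   the Broker class, its method names, and the topic string are
   hypothetical, and a real broker would sit on the network between
   nodes and the NMS.

```python
class Broker:
    """Hypothetical in-memory message bus; not a real product or API."""

    def __init__(self):
        self._subscribers = {}  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        """Register a callback to receive every message on a topic."""
        self._subscribers.setdefault(topic, []).append(callback)

    def publish(self, topic, message):
        """Deliver message to each subscriber; return the fan-out count."""
        callbacks = self._subscribers.get(topic, [])
        for callback in callbacks:
            callback(message)
        return len(callbacks)


broker = Broker()
nms_a, nms_b = [], []  # two subscribers that simply store what arrives
broker.subscribe("node1/interfaces", nms_a.append)
broker.subscribe("node1/interfaces", nms_b.append)
delivered = broker.publish("node1/interfaces", {"name": "eth0", "in_octets": 10})
```

   The publisher sends one message and the broker performs the
   subscriber fan-out, which is the indirection described above.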


2.3.  Capability Negotiation

   Capability negotiation allows a node to inform a subscriber of a
   number of options.  Two extremely important options would be
   transport protocols and formats supported.  Other aspects such as
   security options and error handling would also be negotiated during
   this phase.

   The capability negotiation phase is done via a control channel opened
   for the purpose of registering subscriptions with the node.  This
   control channel should use TCP.
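
   One possible shape for the negotiation, shown purely as a sketch:
   the node advertises lists of options and the subscriber picks its
   first preference that the node supports.  The option names
   ("transports", "formats") and the selection rule are assumptions of
   this example, not requirements of the framework.

```python
def negotiate(advertised, preferred):
    """Pick, per option, the subscriber's first preference the node offers."""
    agreed = {}
    for option, choices in preferred.items():
        offered = advertised.get(option, [])
        match = next((c for c in choices if c in offered), None)
        if match is None:
            raise ValueError(f"no common value for {option!r}")
        agreed[option] = match
    return agreed


# Hypothetical capability sets exchanged over the control channel.
NODE_CAPABILITIES = {"transports": ["tcp", "udp"], "formats": ["xml", "json"]}
SUBSCRIBER_PREFERENCES = {
    "transports": ["sctp", "tcp"],
    "formats": ["protobuf", "json", "xml"],
}

agreed = negotiate(NODE_CAPABILITIES, SUBSCRIBER_PREFERENCES)
```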

2.4.  Format Agnostic

   From the I2RS perspective, this framework should be format agnostic.
   If a node advertises the ability to present data in XML and the
   subscriber agrees, then XML can be used.  Other formats of interest
   are JSON, HTML, and protobufs.  There is even interest in /proc/net
   formatted output, which would help an NMS based on this framework
   integrate into existing server configuration management systems.

   [ Editor note: even ASN.1 should be an acceptable format.  This would
   potentially allow an extremely easy deployment into an existing
   SNMP-based NMS.]
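
   To make the format-agnostic idea concrete, the following sketch
   encodes one record in either JSON or XML depending on the negotiated
   format.  The encode function, the record fields, and the XML element
   names are assumptions of this example; only the Python standard
   library is used.

```python
import json
import xml.etree.ElementTree as ET


def encode(record, fmt):
    """Serialize a flat record in the negotiated format."""
    if fmt == "json":
        return json.dumps(record, sort_keys=True)
    if fmt == "xml":
        root = ET.Element("record")
        for key in sorted(record):
            ET.SubElement(root, key).text = str(record[key])
        return ET.tostring(root, encoding="unicode")
    raise ValueError(f"unsupported format: {fmt}")


record = {"name": "eth0", "in_octets": 10}  # hypothetical counter record
as_json = encode(record, "json")
as_xml = encode(record, "xml")
```

   The data source is the same in both cases; only the last step
   differs, which is what lets the framework stay neutral on format.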

2.5.  Transport Options

   Because the focus of this framework ends at getting data off the box
   as quickly as possible, implementations should have the freedom to
   choose a transport that meets their system design needs and not be
   restricted by a specific format.

   During the negotiation phase a node should advertise all the
   transport options it provides and allow the subscriber to select what
   it needs.

   Given that the time value of different data elements coming off the
   node can vary considerably, it should be possible to request multiple
   transports and associate each subscription with the transport
   protocol of choice.

2.6.  Filtering

   Once a network node has provided its database model to a subscriber,
   the subscriber needs a way to select parts of the model for
   subscription, and it needs to be able to request multiple
   subscriptions at a time.


   This framework should provide a standard filtering mechanism so that,
   independent of the database model structure and contents, a
   subscriber can select interesting items to collect and bucket them
   based on standard parameters such as frequency of collection,
   underlying transport required, whether the data is to be pushed or
   pulled, or even streaming or one-shot.
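
   A filter of this kind might be sketched as a list of path patterns,
   each tied to a bucket of collection parameters.  The path syntax,
   the glob-style patterns, and the (mode, transport, interval) tuple
   are all assumptions of this example rather than a proposed filter
   language.

```python
from fnmatch import fnmatchcase

# Hypothetical paths derived from a node's database model.
MODEL_PATHS = [
    "interfaces/eth0/in_octets",
    "interfaces/eth0/out_octets",
    "interfaces/eth1/in_octets",
    "rib/ipv4",
]

# Each rule pairs a pattern with (mode, transport, interval in seconds).
RULES = [
    ("interfaces/*/in_octets", ("push", "udp", 10)),
    ("rib/*", ("push", "tcp", 0)),
]


def bucket(paths, rules):
    """Group paths by the parameters of the first rule that matches them."""
    buckets = {}
    for path in paths:
        for pattern, params in rules:
            if fnmatchcase(path, pattern):
                buckets.setdefault(params, []).append(path)
                break
    return buckets


buckets = bucket(MODEL_PATHS, RULES)
```

   Paths that match no rule, such as the out_octets counter here, are
   simply not selected.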

2.7.  Timestamps

   Every piece of data collected by this framework needs a timestamp
   associated with it indicating when the node made it available for
   collection.  This is not required on a per-variable basis; for
   example, data organized into a table only requires a timestamp
   associated with the table.

   This is not to say that additional timestamps are not useful for
   certain data sets, nor that timestamps with other semantics (for
   example, collection time versus advertisement time) cannot be used,
   but rather that those additional timestamps are better placed in the
   database model supported by the device.
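
   A per-table timestamp could look like the following sketch, in which
   the whole table carries one timestamp and the rows carry none.  The
   snapshot_table function and the field names are hypothetical; the
   clock is injectable only so the example is reproducible.

```python
import time


def snapshot_table(name, rows, clock=time.time):
    """Wrap rows with one timestamp taken when the table is made available."""
    return {
        "table": name,
        "timestamp": clock(),
        "rows": [dict(row) for row in rows],
    }


table = snapshot_table(
    "interfaces",
    [{"name": "eth0", "in_octets": 10}, {"name": "eth1", "in_octets": 7}],
    clock=lambda: 1382313600.0,  # fixed clock, for a repeatable example
)
```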

2.8.  Introspection

   This framework should support introspection of the database model.
   Introspection provides support for data verification, easier
   inclusion of legacy data, and easier merging of data streams.
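
   As one example of data verification through introspection, a
   subscriber could check each received record against the column types
   the model declares.  The type names and the verify function are
   assumptions of this sketch.

```python
# Assumed mapping from model type names to Python types.
PYTHON_TYPES = {"string": str, "counter64": int}


def verify(record, columns):
    """Return a list of problems found when checking record against columns."""
    problems = []
    for column, type_name in columns.items():
        if column not in record:
            problems.append(f"missing {column}")
        elif not isinstance(record[column], PYTHON_TYPES[type_name]):
            problems.append(f"{column} is not {type_name}")
    for column in record:
        if column not in columns:
            problems.append(f"unexpected {column}")
    return problems


COLUMNS = {"name": "string", "in_octets": "counter64"}  # from the model
good = verify({"name": "eth0", "in_octets": 10}, COLUMNS)
bad = verify({"name": "eth0", "in_octets": "10", "mtu": 1500}, COLUMNS)
```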

2.9.  Registration

   After capabilities and a database model have been exchanged, and a
   filter used to select elements of the model to subscribe to, the
   framework should support a standard way to register for all the data
   desired, using whatever capabilities were advertised by the node.

   Once registration is complete, the control channel can be closed.
   Ensuring subscriptions are correct, complete, and replicated or not,
   is up to the overall system and not the network node.
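
   Registration might then amount to handing the node a small record
   per subscription, which the node checks against what it advertised.
   The Subscription fields, the Node class, and the integer identifiers
   are hypothetical choices made for this sketch.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Subscription:
    paths: tuple              # model elements selected by the filter
    mode: str                 # "push" or "pull"
    transport: str            # one of the node's advertised transports
    interval_s: float = 0.0   # 0 means one-shot or event-driven


class Node:
    """Hypothetical node-side registry of subscriptions."""

    def __init__(self, capabilities):
        self.capabilities = capabilities
        self.subscriptions = {}
        self._ids = count(1)

    def register(self, sub):
        """Accept a subscription that uses only advertised capabilities."""
        if sub.transport not in self.capabilities["transports"]:
            raise ValueError(f"transport not advertised: {sub.transport}")
        sub_id = next(self._ids)
        self.subscriptions[sub_id] = sub
        return sub_id


node = Node({"transports": ["tcp", "udp"]})
sub_id = node.register(
    Subscription(("interfaces/eth0/in_octets",), "push", "udp", 10.0)
)
```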

3.  Use cases

   Following are example use cases outlining the utility of subscribing
   to data with different parameters.


3.1.  Push

   Pushing data off the box can be done synchronously at fixed
   intervals, or asynchronously in an ad-hoc fashion.  All data pushed
   is set up via registered subscriptions.

3.1.1.  Interface counters

   Interface counters provide a use case demonstrating the need to push
   data off of a network node at specific intervals.  In this proposed
   framework, a node would advertise its database model including all
   the interfaces it has to offer and what it can count on each.  A
   subscriber would select the interfaces, and the counters on each,
   that it is interested in via a filter, use the filter to group them
   according to available parameters, and register with the node to have
   them published at agreed-upon intervals.
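
   On the node side, publishing grouped counters at agreed intervals
   could reduce to asking which groups are due at each tick.  The group
   names, the whole-second ticks, and the due_groups function are
   assumptions of this sketch.

```python
def due_groups(intervals, tick):
    """Return the groups whose interval divides the current tick.

    intervals maps a group name to its interval in whole seconds;
    tick is the number of whole seconds since registration.
    """
    return sorted(name for name, every in intervals.items() if tick % every == 0)


# Hypothetical groups built by the subscriber's filter.
INTERVALS = {"core-links": 10, "edge-links": 60}

at_30 = due_groups(INTERVALS, 30)
at_60 = due_groups(INTERVALS, 60)
```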

3.1.2.  Thresholds

   Another use case demonstrating a push capability is thresholding.
   Assuming a node advertises the capability to record and track a
   threshold for a particular data type, it would use the registered
   subscription to push relevant data to the subscriber whenever the
   threshold is crossed.  As an example, a subscriber may want to set a
   threshold for memory consumed: if the available device memory falls
   below the threshold, the subscriber should be informed so that the
   operator can investigate the issue manually or programmatically.
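
   The memory example can be sketched as an edge-triggered check: the
   node pushes only when the value moves from at-or-above the threshold
   to below it, not on every low sample.  Edge triggering is an
   assumption of this sketch; the text above does not say whether
   repeated low samples should be reported.

```python
def crossings(samples, threshold):
    """Yield (index, value) each time a sample first falls below threshold."""
    below = False
    for index, value in enumerate(samples):
        now_below = value < threshold
        if now_below and not below:
            yield index, value
        below = now_below


# Hypothetical free-memory samples in megabytes.
free_memory_mb = [500, 300, 90, 80, 200, 50]
events = list(crossings(free_memory_mb, threshold=100))
```

   With these samples, two notifications would be pushed: one when free
   memory first drops to 90 and one when it drops again to 50.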

3.1.3.  Streaming

   Streaming data, such as RIB information, will be critical to
   supporting I2RS functionality.  In this use case, a subscriber may
   desire to have all updates to a RIB streamed into the collection
   system, in as close to real-time as possible.
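
   A streamed RIB feed can be pictured as a sequence of add and
   withdraw events that the collection system applies to its own copy.
   The event tuple layout and the operation names are assumptions of
   this sketch, and the prefixes are documentation addresses.

```python
def apply_updates(rib, updates):
    """Apply streamed (operation, prefix, next_hop) events to a RIB copy."""
    for operation, prefix, next_hop in updates:
        if operation == "add":
            rib[prefix] = next_hop
        elif operation == "withdraw":
            rib.pop(prefix, None)
        else:
            raise ValueError(f"unknown operation: {operation}")
    return rib


stream = [
    ("add", "192.0.2.0/24", "198.51.100.1"),
    ("add", "203.0.113.0/24", "198.51.100.2"),
    ("withdraw", "192.0.2.0/24", None),
]
replica = apply_updates({}, stream)
```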

3.2.  Pull

   Pulling data off the node will always be a one-shot function.  As
   such it is probably the most heavy-handed way to get data into the
   collection system, as it requires all the overhead of setting up and
   tearing down the control channel, exchanging the database model,
   creating a filter, and receiving the data.  Nevertheless, it can be a
   valuable option and should be supported.

   Note that it is certainly possible to cache requests on publishers
   and have them "replayed" via a subscription identifier.  However, the
   capability to track the state required to do so may not be available
   on a node, and this is somewhat counter to the overall goal of
   minimizing impact to the node.  Having this capability as an optional
   parameter of a database model is worth exploring.
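
   The optional replay capability might look like the following sketch,
   in which a publisher that supports caching returns an identifier
   with each one-shot pull and one that does not returns none.  The
   Publisher class and its methods are hypothetical.

```python
class Publisher:
    """Hypothetical publisher that can optionally cache one-shot requests."""

    def __init__(self, data, cache_requests=False):
        self._data = data
        self._cache = {} if cache_requests else None
        self._next_id = 1

    def pull(self, paths):
        """One-shot pull: return (subscription id or None, selected data)."""
        sub_id = None
        if self._cache is not None:
            sub_id = self._next_id
            self._next_id += 1
            self._cache[sub_id] = tuple(paths)
        return sub_id, {path: self._data[path] for path in paths}

    def replay(self, sub_id):
        """Re-run a cached request against the node's current data."""
        if self._cache is None or sub_id not in self._cache:
            raise KeyError(f"no cached request {sub_id}")
        return {path: self._data[path] for path in self._cache[sub_id]}


counters = {"eth0/in_octets": 10, "eth0/out_octets": 4}
caching = Publisher(counters, cache_requests=True)
sub_id, first = caching.pull(["eth0/in_octets"])
counters["eth0/in_octets"] = 25          # the counter advances on the node
second = caching.replay(sub_id)

stateless_id, _ = Publisher(counters).pull(["eth0/out_octets"])
```

   The cache is the per-request state the node must hold, which is the
   cost weighed in the note above.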

3.2.1.  Interface counters

   This is similar to the interface counter example above, except that
   in this case the registration includes a parameter indicating the
   data should be collected immediately and sent only once.

3.2.2.  RIB Dump

   Getting a snapshot of the node's current RIB can be useful for a
   variety of reasons.  Similar to collecting RIB information above, in
   this example the subscriber would register for a one-shot dump of the
   RIB, collected and sent immediately.

3.2.3.  Arbitrary data collection

   Once the NMS understands a node's database model, it should be able
   to register for one-shot collection of any subset of that database
   model.  Given the overheads involved, this would best be restricted
   to one-off collection needs, such as troubleshooting, but the need
   for this use case is solid.

3.3.  Dynamic subscriptions

   This framework should support dynamic subscription capabilities with
   pre-existing monitoring protocols that currently require static
   configuration.  For example, if a node's database model indicates it
   supports IPFIX, a subscriber should be able to set up a streaming
   IPFIX feed using the standard registration process outlined above.
   BMP and the like should also be available via this mechanism.

4.  Subscriber versus consumer

   It should be noted that because overall data collection system
   architecture is out of scope, it is opaque to this framework whether
   a subscriber is also the consumer of data.  In order to maximize
   design options, including scalability of the overall system, both
   options should be supported.

4.1.  Remapping

   Remapping in this context is the ability to modify a node's database
   model and request the modified model be used in subscriptions.  While
   this has interesting properties, it strays far from the primary
   objective of getting data off of nodes as fast and with as little
   impact as possible, and thus should be considered out of scope.


5.  Errors

   Errors happen.  Many classes of errors and their handling are already
   well-understood and don't need to be re-iterated here.  There are
   certainly failure modes that may be unique to I2RS or this framework,
   however, and we should be prepared to incorporate solutions for
   those.

   For example, providing a method for a node and a subscriber to agree
   on resolution steps after defined error events would be very useful.
   A subscriber may want certain subscriptions to be available for
   pulling if the push mechanism fails.
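
   One agreed resolution step could be sketched as follows: when a push
   fails at the transport, the node holds the payload so the subscriber
   can pull it later.  The deliver function, the use of ConnectionError
   to signal a transport failure, and the in-memory queue are
   assumptions of this example.

```python
def deliver(payload, push, pull_queue):
    """Try to push; on transport failure keep the payload for a later pull."""
    try:
        push(payload)
        return "pushed"
    except ConnectionError:
        pull_queue.append(payload)
        return "queued-for-pull"


def broken_push(payload):
    """Stand-in for a transport whose peer is unreachable."""
    raise ConnectionError("subscriber unreachable")


received, held = [], []
ok_result = deliver({"seq": 1}, received.append, held)
failed_result = deliver({"seq": 2}, broken_push, held)
```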

   There may also be value in defining how a subscriber can probe the
   transport layer, such that publisher responses can assist in
   troubleshooting protocol-specific failures.

   The framework needs to support standardized handling of stale data.
   This class of error will largely be related to handling changes and
   exceptions in the database models exchanged.  For example, what
   happens when a node's physical configuration changes and part of an
   existing subscription becomes invalid?  Similar thought needs to be
   given to logical changes, such as the disappearance of a BGP speaker.
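
   Detecting the stale part of a subscription could be as simple as
   comparing the subscribed elements with the node's current model, as
   in this sketch.  The path strings and the split_subscription
   function are hypothetical, and what to do with the stale elements is
   left open, as it is in the text above.

```python
def split_subscription(subscribed_paths, model_paths):
    """Separate paths still present in the model from those that are stale."""
    current = set(model_paths)
    valid = [path for path in subscribed_paths if path in current]
    stale = [path for path in subscribed_paths if path not in current]
    return valid, stale


subscribed = ["interfaces/eth0/in_octets", "interfaces/eth1/in_octets"]
# Hypothetical model after the line card carrying eth1 is removed.
model_after_change = ["interfaces/eth0/in_octets", "interfaces/eth0/out_octets"]

valid, stale = split_subscription(subscribed, model_after_change)
```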

6.  IANA Considerations

   This document makes no request of the IANA.

7.  Security Considerations

   I2RS provides security requirements; any security requirements raised
   by this framework should be encompassed there.

   [TODO(WK, SW): This section needs more work / text ]

8.  Acknowledgements

   The authors wish to acknowledge the contributions of a number of
   folk, including

   [TODO(WK, SW): Remember to add folk! ]


9.  References

9.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

9.2.  Informative References

   [DeBoer]   De Boer, M. and J. Bosma, "Discovering Path MTU black
              holes on the Internet using RIPE Atlas", July 2012,
              <http://www.nlnetlabs.nl/downloads/publications/pmtu-black-holes-msc-thesis.pdf>.

Authors' Addresses

   Scott Whyte
   Google Inc.
   1600 Amphitheatre Parkway
   Mountain View, California  94043
   USA

   Email: swhyte@google.com


   Marcus Hines
   Google, Inc.
   1600 Amphitheatre Parkway
   Mountain View, California  94043
   USA

   Email: hines@google.com


   Warren Kumari
   Google, Inc.
   1600 Amphitheatre Parkway
   Mountain View, California  94043
   USA

   Email: warren@kumari.net

