rfc9232.original | rfc9232.txt | |||
---|---|---|---|---|
OPSAWG H. Song | Internet Engineering Task Force (IETF) H. Song | |||
Internet-Draft Futurewei | Request for Comments: 9232 Futurewei | |||
Intended status: Informational F. Qin | Category: Informational F. Qin | |||
Expires: 6 June 2022 China Mobile | ISSN: 2070-1721 China Mobile | |||
P. Martinez-Julia | P. Martinez-Julia | |||
NICT | NICT | |||
L. Ciavaglia | L. Ciavaglia | |||
Rakuten Mobile | Rakuten Mobile | |||
A. Wang | A. Wang | |||
China Telecom | China Telecom | |||
3 December 2021 | May 2022 | |||
Network Telemetry Framework | Network Telemetry Framework | |||
draft-ietf-opsawg-ntf-13 | ||||
Abstract | Abstract | |||
Network telemetry is a technology for gaining network insight and | Network telemetry is a technology for gaining network insight and | |||
facilitating efficient and automated network management. It | facilitating efficient and automated network management. It | |||
encompasses various techniques for remote data generation, | encompasses various techniques for remote data generation, | |||
collection, correlation, and consumption. This document describes an | collection, correlation, and consumption. This document describes an | |||
architectural framework for network telemetry, motivated by | architectural framework for network telemetry, motivated by | |||
challenges that are encountered as part of the operation of networks | challenges that are encountered as part of the operation of networks | |||
and by the requirements that ensue. This document clarifies the | and by the requirements that ensue. This document clarifies the | |||
terminologies and classifies the modules and components of a network | terminology and classifies the modules and components of a network | |||
telemetry system from different perspectives. The framework and | telemetry system from different perspectives. The framework and | |||
taxonomy help to set a common ground for the collection of related | taxonomy help to set a common ground for the collection of related | |||
work and provide guidance for related technique and standard | work and provide guidance for related technique and standard | |||
developments. | developments. | |||
Status of This Memo | Status of This Memo | |||
This Internet-Draft is submitted in full conformance with the | This document is not an Internet Standards Track specification; it is | |||
provisions of BCP 78 and BCP 79. | published for informational purposes. | |||
Internet-Drafts are working documents of the Internet Engineering | ||||
Task Force (IETF). Note that other groups may also distribute | ||||
working documents as Internet-Drafts. The list of current Internet- | ||||
Drafts is at https://datatracker.ietf.org/drafts/current/. | ||||
Internet-Drafts are draft documents valid for a maximum of six months | This document is a product of the Internet Engineering Task Force | |||
and may be updated, replaced, or obsoleted by other documents at any | (IETF). It represents the consensus of the IETF community. It has | |||
time. It is inappropriate to use Internet-Drafts as reference | received public review and has been approved for publication by the | |||
material or to cite them other than as "work in progress." | Internet Engineering Steering Group (IESG). Not all documents | |||
| | approved by the IESG are candidates for any level of Internet | |||
| | Standard; see Section 2 of RFC 7841. | |||
This Internet-Draft will expire on 6 June 2022. | Information about the current status of this document, any errata, | |||
| | and how to provide feedback on it may be obtained at | |||
| | https://www.rfc-editor.org/info/rfc9232. | |||
Copyright Notice | Copyright Notice | |||
Copyright (c) 2021 IETF Trust and the persons identified as the | Copyright (c) 2022 IETF Trust and the persons identified as the | |||
document authors. All rights reserved. | document authors. All rights reserved. | |||
This document is subject to BCP 78 and the IETF Trust's Legal | This document is subject to BCP 78 and the IETF Trust's Legal | |||
Provisions Relating to IETF Documents (https://trustee.ietf.org/ | Provisions Relating to IETF Documents | |||
license-info) in effect on the date of publication of this document. | (https://trustee.ietf.org/license-info) in effect on the date of | |||
Please review these documents carefully, as they describe your rights | publication of this document. Please review these documents | |||
and restrictions with respect to this document. Code Components | carefully, as they describe your rights and restrictions with respect | |||
extracted from this document must include Revised BSD License text as | to this document. Code Components extracted from this document must | |||
described in Section 4.e of the Trust Legal Provisions and are | include Revised BSD License text as described in Section 4.e of the | |||
provided without warranty as described in the Revised BSD License. | Trust Legal Provisions and are provided without warranty as described | |||
| | in the Revised BSD License. | |||
Table of Contents | Table of Contents | |||
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 | 1. Introduction | |||
1.1. Applicability Statement . . . . . . . . . . . . . . . . . 4 | 1.1. Applicability Statement | |||
1.2. Glossary . . . . . . . . . . . . . . . . . . . . . . . . 4 | 1.2. Glossary | |||
2. Background . . . . . . . . . . . . . . . . . . . . . . . . . 6 | 2. Background | |||
2.1. Telemetry Data Coverage . . . . . . . . . . . . . . . . . 7 | 2.1. Telemetry Data Coverage | |||
2.2. Use Cases . . . . . . . . . . . . . . . . . . . . . . . . 8 | 2.2. Use Cases | |||
2.3. Challenges . . . . . . . . . . . . . . . . . . . . . . . 9 | 2.3. Challenges | |||
2.4. Network Telemetry . . . . . . . . . . . . . . . . . . . . 11 | 2.4. Network Telemetry | |||
2.5. The Necessity of a Network Telemetry Framework . . . . . 13 | 2.5. The Necessity of a Network Telemetry Framework | |||
3. Network Telemetry Framework . . . . . . . . . . . . . . . . . 14 | 3. Network Telemetry Framework | |||
3.1. Top Level Modules . . . . . . . . . . . . . . . . . . . . 15 | 3.1. Top-Level Modules | |||
3.1.1. Management Plane Telemetry . . . . . . . . . . . . . 18 | 3.1.1. Management Plane Telemetry | |||
3.1.2. Control Plane Telemetry . . . . . . . . . . . . . . . 18 | 3.1.2. Control Plane Telemetry | |||
3.1.3. Forwarding Plane Telemetry . . . . . . . . . . . . . 19 | 3.1.3. Forwarding Plane Telemetry | |||
3.1.4. External Data Telemetry . . . . . . . . . . . . . . . 21 | 3.1.4. External Data Telemetry | |||
3.2. Second Level Function Components . . . . . . . . . . . . 22 | 3.2. Second-Level Function Components | |||
3.3. Data Acquisition Mechanism and Type Abstraction . . . . . 24 | 3.3. Data Acquisition Mechanism and Type Abstraction | |||
3.4. Mapping Existing Mechanisms into the Framework . . . . . 26 | 3.4. Mapping Existing Mechanisms into the Framework | |||
4. Evolution of Network Telemetry Applications . . . . . . . . . 27 | 4. Evolution of Network Telemetry Applications | |||
5. Security Considerations . . . . . . . . . . . . . . . . . . . 28 | 5. Security Considerations | |||
6. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 29 | 6. IANA Considerations | |||
7. Contributors . . . . . . . . . . . . . . . . . . . . . . . . 29 | 7. Informative References | |||
8. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 29 | Appendix A. A Survey on Existing Network Telemetry Techniques | |||
9. Informative References . . . . . . . . . . . . . . . . . . . 29 | A.1. Management Plane Telemetry | |||
Appendix A. A Survey on Existing Network Telemetry Techniques . 35 | A.1.1. Push Extensions for NETCONF | |||
A.1. Management Plane Telemetry . . . . . . . . . . . . . . . 35 | A.1.2. gRPC Network Management Interface | |||
A.1.1. Push Extensions for NETCONF . . . . . . . . . . . . . 35 | A.2. Control Plane Telemetry | |||
A.1.2. gRPC Network Management Interface . . . . . . . . . . 36 | A.2.1. BGP Monitoring Protocol | |||
A.2. Control Plane Telemetry . . . . . . . . . . . . . . . . . 36 | A.3. Data Plane Telemetry | |||
A.2.1. BGP Monitoring Protocol . . . . . . . . . . . . . . . 36 | A.3.1. Alternate-Marking (AM) Technology | |||
A.3. Data Plane Telemetry . . . . . . . . . . . . . . . . . . 36 | A.3.2. Dynamic Network Probe | |||
A.3.1. The Alternate Marking (AM) technology . . . . . . . . 36 | A.3.3. IP Flow Information Export (IPFIX) Protocol | |||
A.3.2. Dynamic Network Probe . . . . . . . . . . . . . . . . 38 | A.3.4. In Situ OAM | |||
A.3.3. IP Flow Information Export (IPFIX) Protocol . . . . . 38 | A.3.5. Postcard-Based Telemetry | |||
A.3.4. In-Situ OAM . . . . . . . . . . . . . . . . . . . . . 38 | A.3.6. Existing OAM for Specific Data Planes | |||
A.3.5. Postcard Based Telemetry . . . . . . . . . . . . . . 39 | A.4. External Data and Event Telemetry | |||
A.3.6. Existing OAM for Specific Data Planes . . . . . . . . 39 | A.4.1. Sources of External Events | |||
A.4. External Data and Event Telemetry . . . . . . . . . . . . 39 | A.4.2. Connectors and Interfaces | |||
A.4.1. Sources of External Events . . . . . . . . . . . . . 39 | Acknowledgments | |||
A.4.2. Connectors and Interfaces . . . . . . . . . . . . . . 41 | Contributors | |||
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 41 | Authors' Addresses | |||
1. Introduction | 1. Introduction | |||
Network visibility is the ability of management tools to see the | Network visibility is the ability of management tools to see the | |||
state and behavior of a network, which is essential for successful | state and behavior of a network, which is essential for successful | |||
network operation. Network Telemetry revolves around network data | network operation. Network telemetry revolves around network data | |||
that can help provide insights about the current state of the | that 1) can help provide insights about the current state of the | |||
network, including network devices, forwarding, control, and | network, including network devices, forwarding, control, and | |||
management planes, and that can be generated and obtained through a | management planes; 2) can be generated and obtained through a variety | |||
variety of techniques, including but not limited to network | of techniques, including but not limited to network instrumentation | |||
instrumentation and measurements, and that can be processed for | and measurements; and 3) can be processed for purposes ranging from | |||
purposes ranging from service assurance to network security using a | service assurance to network security using a wide variety of data | |||
wide variety of data analytical techniques. In this document, | analytical techniques. In this document, network telemetry refers to | |||
Network Telemetry refer to both the data itself (i.e., "Network | both the data itself (i.e., "Network Telemetry Data") and the | |||
Telemetry Data"), and the techniques and processes used to generate, | techniques and processes used to generate, export, collect, and | |||
export, collect, and consume that data for use by potentially | consume that data for use by potentially automated management | |||
automated management applications. Network telemetry extends beyond | applications. Network telemetry extends beyond the classical network | |||
the classical network Operations, Administration, and Management | Operations, Administration, and Management (OAM) techniques and | |||
(OAM) techniques and expects to support better flexibility, | expects to support better flexibility, scalability, accuracy, | |||
scalability, accuracy, coverage, and performance. | coverage, and performance. | |||
However, the term "network telemetry" lacks an unambiguous | However, the term "network telemetry" lacks an unambiguous | |||
definition. The scope and coverage of it cause confusion and | definition. The scope and coverage of it cause confusion and | |||
misunderstandings. It is beneficial to clarify the concept and | misunderstandings. It is beneficial to clarify the concept and | |||
provide a clear architectural framework for network telemetry, so we | provide a clear architectural framework for network telemetry, so we | |||
can articulate the technical field, and better align the related | can articulate the technical field and better align the related | |||
techniques and standard works. | techniques and standard works. | |||
To fulfill such an undertaking, we first discuss some key | To fulfill such an undertaking, we first discuss some key | |||
characteristics of network telemetry which set a clear distinction | characteristics of network telemetry that set a clear distinction | |||
from the conventional network OAM and show that some conventional OAM | from the conventional network OAM and show that some conventional OAM | |||
technologies can be considered a subset of the network telemetry | technologies can be considered a subset of the network telemetry | |||
technologies. We then provide an architectural framework for network | technologies. We then provide an architectural framework for network | |||
telemetry which includes four modules, each concerned with a | telemetry that includes four modules, each associated with a | |||
different category of telemetry data and corresponding procedures. | different category of telemetry data and corresponding procedures. | |||
All the modules are internally structured in the same way, including | All the modules are internally structured in the same way, including | |||
components that allow the operator to configure data sources in | components that allow the operator to configure data sources in | |||
regard to what data to generate and how to make that available to | regard to what data to generate and how to make that available to | |||
client applications, components that instrument the underlying data | client applications, components that instrument the underlying data | |||
sources, and components that perform the actual rendering, encoding, | sources, and components that perform the actual rendering, encoding, | |||
and exporting of the generated data. We show how the network | and exporting of the generated data. We show how the network | |||
telemetry framework can benefit the current and future network | telemetry framework can benefit current and future network | |||
operations. Based on the distinction of modules and function | operations. Based on the distinction of modules and function | |||
components, we can map the existing and emerging techniques and | components, we can map the existing and emerging techniques and | |||
protocols into the framework. The framework can also simplify | protocols into the framework. The framework can also simplify | |||
designing, maintaining, and understanding a network telemetry system. | designing, maintaining, and understanding a network telemetry system. | |||
In addition, we outline the evolution stages of the network telemetry | In addition, we outline the evolution stages of the network telemetry | |||
system and discuss the potential security concerns. | system and discuss the potential security concerns. | |||
The purpose of the framework and taxonomy is to set a common ground | The purpose of the framework and taxonomy is to set a common ground | |||
for the collection of related work and provide guidance for future | for the collection of related work and provide guidance for future | |||
technique and standard developments. To the best of our knowledge, | technique and standard developments. To the best of our knowledge, | |||
skipping to change at page 4, line 35 | skipping to change at line 175 | |||
The network telemetry framework presented in this document must not | The network telemetry framework presented in this document must not | |||
be applied to generating, exporting, collecting, analyzing, or | be applied to generating, exporting, collecting, analyzing, or | |||
retaining individual user data or any data that can identify end | retaining individual user data or any data that can identify end | |||
users or characterize their behavior without consent. Based on this | users or characterize their behavior without consent. Based on this | |||
principle, the network telemetry framework is not applicable to | principle, the network telemetry framework is not applicable to | |||
networks whose endpoints represent individual users, such as general- | networks whose endpoints represent individual users, such as general- | |||
purpose access networks. | purpose access networks. | |||
1.2. Glossary | 1.2. Glossary | |||
Before further discussion, we list some key terminology and acronyms | Before further discussion, we list some key terminology and | |||
used in this document. We make an intended differentiation between | abbreviations used in this document. There is an intended | |||
the terms of network telemetry and OAM. However, it should be | differentiation between the terms of network telemetry and OAM. | |||
understood that there is not a hard-line distinction between the two | However, it should be understood that there is not a hard-line | |||
concepts. Rather, network telemetry is considered as an extension of | distinction between the two concepts. Rather, network telemetry is | |||
OAM. It covers all the existing OAM protocols but puts more emphasis | considered an extension of OAM. It covers all the existing OAM | |||
on the newer and emerging techniques and protocols concerning all | protocols but puts more emphasis on the newer and emerging techniques | |||
aspects of network data from acquisition to consumption. | and protocols concerning all aspects of network data from acquisition | |||
| | to consumption. | |||
AI: Artificial Intelligence. In the network domain, AI refers to | AI: Artificial Intelligence. In the network domain, AI | |||
the machine-learning based technologies for automated network | refers to machine-learning-based technologies for | |||
operation and other tasks. | automated network operation and other tasks. | |||
AM: Alternate Marking, a flow performance measurement method, | AM: Alternate Marking. A flow performance measurement | |||
specified in [RFC8321]. | method, as specified in [RFC8321]. | |||
BMP: BGP Monitoring Protocol, specified in [RFC7854]. | BMP: BGP Monitoring Protocol. Specified in [RFC7854]. | |||
DPI: Deep Packet Inspection, referring to the techniques that | DPI: Deep Packet Inspection. Refers to the techniques that | |||
examines packet beyond packet L3/L4 headers. | examine packets beyond packet L3/L4 headers. | |||
gNMI: gRPC Network Management Interface, a network management | gNMI: gRPC Network Management Interface. A network management | |||
protocol from OpenConfig Operator Working Group, mainly | protocol from the OpenConfig Operator Working Group, | |||
contributed by Google. See [gnmi] for details. | mainly contributed by Google. See [gnmi] for details. | |||
GPB: Google Protocol Buffer, an extensible mechanism for serializing | GPB: Google Protocol Buffer. An extensible mechanism for | |||
structured data. See [gpb] for details. | serializing structured data. See [gpb] for details. | |||
gRPC: gRPC Remote Procedure Call, an open source high performance | gRPC: gRPC Remote Procedure Call. An open-source high- | |||
RPC framework that gNMI is based on. See [grpc] for details. | performance RPC framework that gNMI is based on. See | |||
| | [grpc] for details. | |||
IPFIX: IP Flow Information Export Protocol, specified in [RFC7011]. | IPFIX: IP Flow Information Export Protocol. Specified in | |||
| | [RFC7011]. | |||
IOAM: In-situ OAM [I-D.ietf-ippm-ioam-data], a dataplane on-path | IOAM: In situ OAM [RFC9197]. A data plane on-path telemetry | |||
telemetry technique. | technique. | |||
JSON: An open standard file format and data interchange format that | JSON: JavaScript Object Notation. An open standard file format | |||
uses human-readable text to store and transmit data objects, | and data interchange format that uses human-readable text | |||
specified in [RFC8259]. | to store and transmit data objects, as specified in | |||
| | [RFC8259]. | |||
MIB: Management Information Base, a database used for managing the | MIB: Management Information Base. A database used for | |||
entities in a network. | managing the entities in a network. | |||
NETCONF: Network Configuration Protocol, specified in [RFC6241]. | NETCONF: Network Configuration Protocol. Specified in [RFC6241]. | |||
NetFlow: A Cisco protocol for flow record collecting, described in | NetFlow: A Cisco protocol used for flow record collecting, as | |||
[RFC3954]. | described in [RFC3954]. | |||
Network Telemetry: The process and instrumentation for acquiring and | Network Telemetry: The process and instrumentation for acquiring and | |||
utilizing network data remotely for network monitoring and | utilizing network data remotely for network monitoring | |||
operation. A general term for a large set of network visibility | and operation. A general term for a large set of network | |||
techniques and protocols, concerning aspects like data generation, | visibility techniques and protocols, concerning aspects | |||
collection, correlation, and consumption. Network telemetry | like data generation, collection, correlation, and | |||
addresses the current network operation issues and enables smooth | consumption. Network telemetry addresses current network | |||
evolution toward future intent-driven autonomous networks. | operation issues and enables smooth evolution toward | |||
| | future intent-driven autonomous networks. | |||
NMS: Network Management System, referring to applications that allow | NMS: Network Management System. Refers to applications that | |||
network administrators to manage a network. | allow network administrators to manage a network. | |||
OAM: Operations, Administration, and Maintenance. A group of | OAM: Operations, Administration, and Maintenance. A group of | |||
network management functions that provide network fault | network management functions that provide network fault | |||
indication, fault localization, performance information, and data | indication, fault localization, performance information, | |||
and diagnosis functions. Most conventional network monitoring | and data and diagnosis functions. Most conventional | |||
techniques and protocols belong to network OAM. | network monitoring techniques and protocols belong to | |||
| | network OAM. | |||
PBT: Postcard-Based Telemetry, a dataplane on-path telemetry | PBT: Postcard-Based Telemetry. A data plane on-path telemetry | |||
technique. A representative technique is described in | technique. A representative technique is described in | |||
[I-D.ietf-ippm-ioam-direct-export]. | [IPPM-IOAM-DIRECT-EXPORT]. | |||
RESTCONF: An HTTP-based protocol that provides a programmatic | RESTCONF: An HTTP-based protocol that provides a programmatic | |||
interface for accessing data defined in YANG, using the datastore | interface for accessing data defined in YANG, using the | |||
concepts defined in NETCONF, as specified in [RFC8040]. | datastore concepts defined in NETCONF, as specified in | |||
| | [RFC8040]. | |||
SMIv2: Structure of Management Information Version 2, defining MIB | SMIv2: Structure of Management Information Version 2. Defines | |||
objects, specified in [RFC2578]. | MIB objects, as specified in [RFC2578]. | |||
SNMP: Simple Network Management Protocol. Version 1, 2, and 3 are | SNMP: Simple Network Management Protocol. Versions 1, 2, and 3 | |||
specified in [RFC1157], [RFC3416], and [RFC3411], respectively. | are specified in [RFC1157], [RFC3416], and [RFC3411], | |||
| | respectively. | |||
XML: Extensible Markup Language is a markup language for data | XML: Extensible Markup Language. A markup language for data | |||
encoding that is both human-readable and machine-readable, | encoding that is both human readable and machine | |||
specified by W3C [xml]. | readable, as specified by W3C [W3C.REC-xml-20081126]. | |||
YANG: YANG is a data modeling language for the definition of data | YANG: YANG is a data modeling language for the definition of | |||
sent over network management protocols such as the NETCONF and | data sent over network management protocols such as | |||
RESTCONF. YANG is defined in [RFC6020] and [RFC7950]. | NETCONF and RESTCONF. YANG is defined in [RFC6020] and | |||
| | [RFC7950]. | |||
YANG ECA: A YANG model for Event-Condition-Action policies, defined | YANG ECA: A YANG model for Event-Condition-Action policies, as | |||
in [I-D.wwx-netmod-event-yang]. | defined in [NETMOD-ECA-POLICY]. | |||
YANG-Push: A mechanism that allows subscriber applications to | YANG-Push: A mechanism that allows subscriber applications to | |||
request a stream of updates from a YANG datastore on a network | request a stream of updates from a YANG datastore on a | |||
device. Details are specified in [RFC8641] and [RFC8639]. | network device. Details are specified in [RFC8639] and | |||
| | [RFC8641]. | |||
2. Background | 2. Background | |||
The term "big data" is used to describe the extremely large volume of | The term "big data" is used to describe the extremely large volume of | |||
data sets that can be analyzed computationally to reveal patterns, | data sets that can be analyzed computationally to reveal patterns, | |||
trends, and associations. Networks are undoubtedly a source of big | trends, and associations. Networks are undoubtedly a source of big | |||
data because of their scale and the volume of network traffic they | data because of their scale and the volume of network traffic they | |||
forward. When a network's endpoints do not represent individual | forward. When a network's endpoints do not represent individual | |||
users (e.g. in industrial, datacenter, and infrastructure contexts), | users (e.g., in industrial, data-center, and infrastructure | |||
network operations can often benefit from large-scale data collection | contexts), network operations can often benefit from large-scale data | |||
without breaching user privacy. | collection without breaching user privacy. | |||
Today one can access advanced big data analytics capability through a | Today, one can access advanced big data analytics capability through | |||
plethora of commercial and open source platforms (e.g., Apache | a plethora of commercial and open-source platforms (e.g., Apache | |||
Hadoop), tools (e.g., Apache Spark), and techniques (e.g., machine | Hadoop), tools (e.g., Apache Spark), and techniques (e.g., machine | |||
learning). Thanks to the advance of computing and storage | learning). Thanks to the advance of computing and storage | |||
technologies, network big data analytics gives network operators an | technologies, network big data analytics give network operators an | |||
opportunity to gain network insights and move towards network | opportunity to gain network insights and move towards network | |||
autonomy. Some operators start to explore the application of | autonomy. Some operators start to explore the application of | |||
Artificial Intelligence (AI) to make sense of network data. Software | Artificial Intelligence (AI) to make sense of network data. Software | |||
tools can use the network data to detect and react on network faults, | tools can use the network data to detect and react on network faults, | |||
anomalies, and policy violations, as well as predicting future | anomalies, and policy violations, as well as predict future events. | |||
events. In turn, the network policy updates for planning, intrusion | In turn, the network policy updates for planning, intrusion | |||
prevention, optimization, and self-healing may be applied. | prevention, optimization, and self-healing may be applied. | |||
It is conceivable that an autonomic network [RFC7575] is the logical | It is conceivable that an autonomic network [RFC7575] is the logical | |||
next step for network evolution following Software Defined Networking | next step for network evolution following Software-Defined Networking | |||
(SDN), aiming to reduce (or even eliminate) human labor, make more | (SDN), which aims to reduce (or even eliminate) human labor, make | |||
efficient use of network resources, and provide better services more | more efficient use of network resources, and provide better services | |||
aligned with customer requirements. The IETF ANIMA working group is | more aligned with customer requirements. The IETF ANIMA Working | |||
dedicated to developing and maintaining protocols and procedures for | Group is dedicated to developing and maintaining protocols and | |||
automated network management and control of professionally-managed | procedures for automated network management and control of | |||
networks. The related technique of Intent-based Networking (IBN) | professionally managed networks. The related technique of | |||
[I-D.irtf-nmrg-ibn-concepts-definitions] requires network visibility | Intent-Based Networking (IBN) [NMRG-IBN-CONCEPTS-DEFINITIONS] | |||
and telemetry data in order to ensure that the network is behaving as | requires network visibility and telemetry data in order to ensure | |||
intended. | that the network is behaving as intended. | |||
However, while the data processing capability is improved and | However, while the data processing capability is improved and | |||
applications require more data to function better, the networks lag | applications require more data to function better, the networks lag | |||
behind in extracting and translating network data into useful and | behind in extracting and translating network data into useful and | |||
actionable information in efficient ways. The system bottleneck is | actionable information in efficient ways. The system bottleneck is | |||
shifting from data consumption to data supply. Both the number of | shifting from data consumption to data supply. Both the number of | |||
network nodes and the traffic bandwidth keep increasing at a fast | network nodes and the traffic bandwidth keep increasing at a fast | |||
pace. The network configuration and policy change at smaller time | pace. The network configuration and policy change at smaller time | |||
slots than before. More subtle events and fine-grained data through | slots than before. More subtle events and fine-grained data through | |||
all network planes need to be captured and exported in real time. In | all network planes need to be captured and exported in real time. In | |||
a nutshell, it is a challenge to get enough high-quality data out of | a nutshell, it is a challenge to get enough high-quality data out of | |||
the network in a manner that is efficient, timely, and flexible. | the network in a manner that is efficient, timely, and flexible. | |||
Therefore, we need to survey the existing technologies and protocols | Therefore, we need to survey the existing technologies and protocols | |||
and identify any potential gaps. | and identify any potential gaps. | |||
In the remainder of this section, first we clarify the scope of | In the remainder of this section, we first clarify the scope of | |||
network data (i.e., telemetry data) relevant in this document. Then, | network data (i.e., telemetry data) relevant in this document. Then, | |||
we discuss several key use cases for today's and future network | we discuss several key use cases for network operations of today and | |||
operations. Next, we show why the current network OAM techniques and | the future. Next, we show why the current network OAM techniques and | |||
protocols are insufficient for these use cases. The discussion | protocols are insufficient for these use cases. The discussion | |||
underlines the need for new methods, techniques, and protocols, as | underlines the need for new methods, techniques, and protocols, as | |||
well as the extensions of existing ones, which we assign under the | well as the extensions of existing ones, which we assign under the | |||
umbrella term - Network Telemetry. | umbrella term "Network Telemetry". | |||
2.1. Telemetry Data Coverage | 2.1. Telemetry Data Coverage | |||
Any information that can be extracted from networks (including data | Any information that can be extracted from networks (including the | |||
plane, control plane, and management plane) and used to gain | data plane, control plane, and management plane) and used to gain | |||
visibility or as basis for actions is considered telemetry data. It | visibility or as a basis for actions is considered telemetry data. | |||
includes statistics, event records and logs, snapshots of state, | It includes statistics, event records and logs, snapshots of state, | |||
configuration data, etc. It also covers the outputs of any active | configuration data, etc. It also covers the outputs of any active | |||
and passive measurements [RFC7799]. In some cases, raw data is | and passive measurements [RFC7799]. In some cases, raw data is | |||
processed in network before being sent to a data consumer. Such | processed in network before being sent to a data consumer. Such | |||
processed data is also considered telemetry data. The value of | processed data is also considered telemetry data. The value of | |||
telemetry data varies. In some cases, if the cost is acceptable, | telemetry data varies. In some cases, if the cost is acceptable, | |||
less but higher quality data are preferred than lots of low quality | less but higher-quality data are preferred rather than a lot of low- | |||
data. A classification of telemetry data is provided in Section 3. | quality data. A classification of telemetry data is provided in | |||
To preserve the privacy of end-users, no user packet content should | Section 3. To preserve the privacy of end users, no user packet | |||
be collected. Specifically, the data objects generated, exported, | content should be collected. Specifically, the data objects | |||
and collected by a network telemetry application should not include | generated, exported, and collected by a network telemetry application | |||
any packet payload from traffic associated with end-users systems. | should not include any packet payload from traffic associated with | |||
| end-user systems. | |||
2.2. Use Cases | 2.2. Use Cases | |||
The following set of use cases is essential for network operations. | The following set of use cases is essential for network operations. | |||
While the list is by no means exhaustive, it is enough to highlight | While the list is by no means exhaustive, it is enough to highlight | |||
the requirements for data velocity, variety, volume, and veracity, | the requirements for data velocity, variety, volume, and veracity, | |||
the attributes of big data, in networks. | the attributes of big data, in networks. | |||
* Security: Network intrusion detection and prevention systems need | * Security: Network intrusion detection and prevention systems need | |||
to monitor network traffic and activities and act upon anomalies. | to monitor network traffic and activities and act upon anomalies. | |||
Given increasingly sophisticated attack vectors coupled with | Given increasingly sophisticated attack vectors coupled with | |||
increasingly severe consequences of security breaches, new tools | increasingly severe consequences of security breaches, new tools | |||
and techniques need to be developed, relying on wider and deeper | and techniques need to be developed, relying on wider and deeper | |||
visibility into networks. The ultimate goal is to achieve | visibility into networks. The ultimate goal is to achieve | |||
security with no, or only minimal, human intervention, and without | security with no, or only minimal, human intervention and without | |||
disrupting legitimate traffic flows. | disrupting legitimate traffic flows. | |||
* Policy and Intent Compliance: Network policies are the rules that | * Policy and Intent Compliance: Network policies are the rules that | |||
constrain the services for network access, provide service | constrain the services for network access, provide service | |||
differentiation, or enforce specific treatment on the traffic. | differentiation, or enforce specific treatment on the traffic. | |||
For example, a service function chain is a policy that requires | For example, a service function chain is a policy that requires | |||
the selected flows to pass through a set of ordered network | the selected flows to pass through a set of ordered network | |||
functions. Intent, as defined in | functions. Intent, as defined in [NMRG-IBN-CONCEPTS-DEFINITIONS], | |||
[I-D.irtf-nmrg-ibn-concepts-definitions], is a set of operational | is a set of operational goals that a network should meet and | |||
goals that a network should meet and outcomes that a network is | outcomes that a network is supposed to deliver, defined in a | |||
supposed to deliver, defined in a declarative manner without | declarative manner without specifying how to achieve or implement | |||
specifying how to achieve or implement them. An intent requires a | them. An intent requires a complex translation and mapping | |||
complex translation and mapping process before being applied on | process before being applied on networks. While a policy or | |||
networks. While a policy or intent is enforced, the compliance | intent is enforced, the compliance needs to be verified and | |||
needs to be verified and monitored continuously by relying on | monitored continuously by relying on visibility that is provided | |||
visibility that is provided through network telemetry data. Any | through network telemetry data. Any violation must be reported | |||
violation must be reported immediately, potentially resulting in | immediately - this will alert the network administrator to the | |||
updates to how the policy or intent is applied in the network to | policy or intent violation and will potentially result in updates | |||
ensure that it remains in force, or otherwise alerting the network | to how the policy or intent is applied in the network to ensure | |||
administrator to the policy or intent violation. | that it remains in force. | |||
* SLA Compliance: A Service-Level Agreement (SLA) is a service | * SLA Compliance: A Service Level Agreement (SLA) is a service | |||
contract between a service provider and a client, which include | contract between a service provider and a client, which includes | |||
the metrics for the service measurement and remedy/penalty | the metrics for the service measurement and remedy/penalty | |||
procedures when the service level misses the agreement. Users | procedures when the service level misses the agreement. Users | |||
need to check if they get the service as promised and network | need to check if they get the service as promised, and network | |||
operators need to evaluate how they can deliver services that can | operators need to evaluate how they can deliver services that meet | |||
meet the SLA based on realtime network telemetry data, including | the SLA based on real-time network telemetry data, including data | |||
data from network measurements. | from network measurements. | |||
* Root Cause Analysis: Many network failure can be the effect of a | * Root Cause Analysis: Many network failures can be the effect of a | |||
sequence of chained events. Troubleshooting and recovery require | sequence of chained events. Troubleshooting and recovery require | |||
quick identification of the root cause of any observable issues. | quick identification of the root cause of any observable issues. | |||
However, the root cause is not always straightforward to identify, | However, the root cause is not always straightforward to identify, | |||
especially when the failure is sporadic and the number of event | especially when the failure is sporadic and the number of event | |||
messages, both related and unrelated to the same cause, is | messages, both related and unrelated to the same cause, is | |||
overwhelming. While technologies such as machine learning can be | overwhelming. While technologies such as machine learning can be | |||
used for root cause analysis, it is up to the network to sense and | used for root cause analysis, it is up to the network to sense and | |||
provide the relevant diagnostic data which are either actively fed | provide the relevant diagnostic data that are either actively fed | |||
into, or passively retrieved by, the root cause analysis | into or passively retrieved by the root cause analysis | |||
applications. | applications. | |||
* Network Optimization: This covers all short-term and long-term | * Network Optimization: This covers all short-term and long-term | |||
network optimization techniques, including load balancing, Traffic | network optimization techniques, including load balancing, Traffic | |||
Engineering (TE), and network planning. Network operators are | Engineering (TE), and network planning. Network operators are | |||
motivated to optimize their network utilization and differentiate | motivated to optimize their network utilization and differentiate | |||
services for better Return On Investment (ROI) or lower Capital | services for better Return on Investment (ROI) or lower Capital | |||
Expenditures (CAPEX). The first step is to know the real-time | Expenditure (CAPEX). The first step is to know the real-time | |||
network conditions before applying policies for traffic | network conditions before applying policies for traffic | |||
manipulation. In some cases, micro-bursts need to be detected in | manipulation. In some cases, microbursts need to be detected in a | |||
a very short time-frame so that fine-grained traffic control can | very short time frame so that fine-grained traffic control can be | |||
be applied to avoid network congestion. Long-term planning of | applied to avoid network congestion. Long-term planning of | |||
network capacity and topology requires analysis of real-world | network capacity and topology requires analysis of real-world | |||
network telemetry data that is obtained over long periods of time. | network telemetry data that is obtained over long periods of time. | |||
* Event Tracking and Prediction: The visibility into traffic path | * Event Tracking and Prediction: The visibility into traffic path | |||
and performance is critical for services and applications that | and performance is critical for services and applications that | |||
rely on healthy network operation. Numerous related network | rely on healthy network operation. Numerous related network | |||
events are of interest to network operators. For example, Network | events are of interest to network operators. For example, network | |||
operators want to learn where and why packets are dropped for an | operators want to learn where and why packets are dropped for an | |||
application flow. They also want to be warned of issues in | application flow. They also want to be warned of issues in | |||
advance, so proactive actions can be taken to avoid catastrophic | advance, so proactive actions can be taken to avoid catastrophic | |||
consequences. | consequences. | |||
2.3. Challenges | 2.3. Challenges | |||
For a long time, network operators have relied upon SNMP [RFC3416], | For a long time, network operators have relied upon SNMP [RFC3416], | |||
Command-Line Interface (CLI), or Syslog [RFC5424] to monitor the | Command-Line Interface (CLI), or Syslog [RFC5424] to monitor the | |||
network. Some other OAM techniques as described in [RFC7276] are | network. Some other OAM techniques as described in [RFC7276] are | |||
also used to facilitate network troubleshooting. These conventional | also used to facilitate network troubleshooting. These conventional | |||
techniques are not sufficient to support the above use cases for the | techniques are not sufficient to support the above use cases for the | |||
following reasons: | following reasons: | |||
* Most use cases need to continuously monitor the network and | * Most use cases need to continuously monitor the network and | |||
dynamically refine the data collection in real-time. Poll-based | dynamically refine the data collection in real time. Poll-based | |||
low-frequency data collection is ill-suited for these | low-frequency data collection is ill-suited for these | |||
applications. Subscription-based streaming data directly pushed | applications. Subscription-based streaming data directly pushed | |||
from the data source (e.g., the forwarding chip) is preferred to | from the data source (e.g., the forwarding chip) is preferred to | |||
provide sufficient data quantity and precision at scale. | provide sufficient data quantity and precision at scale. | |||
* Comprehensive data is needed, ranging from packet processing | * Comprehensive data is needed, ranging from packet processing | |||
engines to traffic manager, from line cards to main control board, | engines to traffic managers, line cards to main control boards, | |||
from user flows to control protocol packets, from device | user flows to control protocol packets, device configurations to | |||
configurations to operations, and from physical layer to | operations, and physical layers to application layers. | |||
application layer. Conventional OAM only covers a narrow range of | Conventional OAM only covers a narrow range of data (e.g., SNMP | |||
data (e.g., SNMP only handles data from the Management Information | only handles data from the Management Information Base (MIB)). | |||
Base (MIB)). Classical network devices cannot provide all the | Classical network devices cannot provide all the necessary probes. | |||
necessary probes. More open and programmable network devices are | More open and programmable network devices are therefore needed. | |||
therefore needed. | ||||
* Many application scenarios need to correlate network-wide data | * Many application scenarios need to correlate network-wide data | |||
from multiple sources (i.e., from distributed network devices, | from multiple sources (i.e., from distributed network devices, | |||
different components of a network device, or different network | different components of a network device, or different network | |||
planes). A piecemeal solution is often lacking the capability to | planes). A piecemeal solution is often lacking the capability to | |||
consolidate the data from multiple sources. The composition of a | consolidate the data from multiple sources. The composition of a | |||
complete solution, as partly proposed by Autonomic Resource | complete solution, as partly proposed by Autonomic Resource | |||
Control Architecture(ARCA) | Control Architecture (ARCA) [NMRG-ANTICIPATED-ADAPTATION], will be | |||
[I-D.pedro-nmrg-anticipated-adaptation], will be empowered and | empowered and guided by a comprehensive framework. | |||
guided by a comprehensive framework. | ||||
* Some conventional OAM techniques (e.g., CLI and Syslog) lack a | * Some conventional OAM techniques (e.g., CLI and Syslog) lack a | |||
formal data model. The unstructured data hinder the tool | formal data model. The unstructured data hinder the tool | |||
automation and application extensibility. Standardized data | automation and application extensibility. Standardized data | |||
models are essential to support the programmable networks. | models are essential to support the programmable networks. | |||
* Although some conventional OAM techniques support data push (e.g., | * Although some conventional OAM techniques support data push (e.g., | |||
SNMP Trap [RFC2981][RFC3877], Syslog, and sFlow [RFC3176]), the | SNMP Trap [RFC2981][RFC3877], Syslog, and sFlow [RFC3176]), the | |||
pushed data are limited to only predefined management plane | pushed data are limited to only predefined management plane | |||
warnings (e.g., SNMP Trap) or sampled user packets (e.g., sFlow). | warnings (e.g., SNMP Trap) or sampled user packets (e.g., sFlow). | |||
Network operators require the data with arbitrary source, | Network operators require the data with arbitrary source, | |||
granularity, and precision which are beyond the capability of the | granularity, and precision, which is beyond the capability of the | |||
existing techniques. | existing techniques. | |||
* The conventional passive measurement techniques can either consume | * Conventional passive measurement techniques can either consume | |||
excessive network resources and produce excessive redundant data, | excessive network resources and produce excessive redundant data | |||
or lead to inaccurate results; on the other hand, the conventional | or lead to inaccurate results; on the other hand, conventional | |||
active measurement techniques can interfere with the user traffic | active measurement techniques can interfere with the user traffic, | |||
and their results are indirect. Techniques that can collect | and their results are indirect. Techniques that can collect | |||
direct and on-demand data from user traffic are more favorable. | direct and on-demand data from user traffic are more favorable. | |||
These challenges were addressed by newer standards and techniques | These challenges were addressed by newer standards and techniques | |||
(e.g., IPFIX/Netflow, Packet Sampling (PSAMP), IOAM, and YANG-Push) | (e.g., IPFIX/Netflow, Packet Sampling (PSAMP), IOAM, and YANG-Push), | |||
and more are emerging. These standards and techniques need to be | and more are emerging. These standards and techniques need to be | |||
recognized and accommodated in a new framework. | recognized and accommodated in a new framework. | |||
2.4. Network Telemetry | 2.4. Network Telemetry | |||
Network telemetry has emerged as a mainstream technical term to refer | Network telemetry has emerged as a mainstream technical term to refer | |||
to the network data collection and consumption techniques. Several | to the network data collection and consumption techniques. Several | |||
network telemetry techniques and protocols (e.g., IPFIX [RFC7011] and | network telemetry techniques and protocols (e.g., IPFIX [RFC7011] and | |||
gRPC [grpc]) have been widely deployed. Network telemetry allows | gRPC [grpc]) have been widely deployed. Network telemetry allows | |||
separate entities to acquire data from network devices so that data | separate entities to acquire data from network devices so that data | |||
can be visualized and analyzed to support network monitoring and | can be visualized and analyzed to support network monitoring and | |||
operation. Network telemetry covers the conventional network OAM and | operation. Network telemetry covers the conventional network OAM and | |||
has a wider scope. For instance, it is expected that network | has a wider scope. For instance, it is expected that network | |||
telemetry can provide the necessary network insight for autonomous | telemetry can provide the necessary network insight for autonomous | |||
networks and address the shortcomings of conventional OAM techniques. | networks and address the shortcomings of conventional OAM techniques. | |||
Network telemetry usually assumes machines as data consumers rather | Network telemetry usually assumes machines as data consumers rather | |||
than human operators. Hence, the network telemetry can directly | than human operators. Hence, network telemetry can directly trigger | |||
trigger the automated network operation, while in contrast some | the automated network operation, while in contrast, some conventional | |||
conventional OAM tools were designed and used to help human operators | OAM tools were designed and used to help human operators to monitor | |||
to monitor and diagnose the networks and guide manual network | and diagnose the networks and guide manual network operations. Such | |||
operations. Such a proposition leads to very different techniques. | a proposition leads to very different techniques. | |||
Although new network telemetry techniques are emerging and subject to | Although new network telemetry techniques are emerging and subject to | |||
continuous evolution, several characteristics of network telemetry | continuous evolution, several characteristics of network telemetry | |||
have been well accepted. Note that network telemetry is intended to | have been well accepted. Note that network telemetry is intended to | |||
be an umbrella term covering a wide spectrum of techniques, so the | be an umbrella term covering a wide spectrum of techniques, so the | |||
following characteristics are not expected to be held by every | following characteristics are not expected to be held by every | |||
specific technique. | specific technique. | |||
* Push and Streaming: Instead of polling data from network devices, | * Push and Streaming: Instead of polling data from network devices, | |||
telemetry collectors subscribe to streaming data pushed from data | telemetry collectors subscribe to streaming data pushed from data | |||
sources in network devices. | sources in network devices. | |||
* Volume and Velocity: The telemetry data is intended to be consumed | * Volume and Velocity: Telemetry data is intended to be consumed by | |||
by machines rather than by human being. Therefore, the data | machines rather than by human beings. Therefore, the data volume | |||
volume can be huge and the processing is optimized for the needs | can be huge, and the processing is optimized for the needs of | |||
of automation in realtime. | automation in real time. | |||
* Normalization and Unification: Telemetry aims to address the | * Normalization and Unification: Telemetry aims to address the | |||
overall network automation needs. Efforts are made to normalize | overall network automation needs. Efforts are made to normalize | |||
the data representation and unify the protocols, so as to simplify | the data representation and unify the protocols, so as to simplify | |||
data analysis and provide integrated analysis across heterogeneous | data analysis and provide integrated analysis across heterogeneous | |||
devices and data sources across a network. | devices and data sources across a network. | |||
* Model-based: The telemetry data is modeled in advance which allows | * Model-Based: Telemetry data is modeled in advance, which allows | |||
applications to configure and consume data with ease. | applications to configure and consume data with ease. | |||
* Data Fusion: The data for a single application can come from | * Data Fusion: The data for a single application can come from | |||
multiple data sources (e.g., cross-domain, cross-device, and | multiple data sources (e.g., cross-domain, cross-device, and | |||
cross-layer) based on common naming/ID and needs to be correlated | cross-layer) that are based on a common name/ID and need to be | |||
to take effect. | correlated to take effect. | |||
* Dynamic and Interactive: Since the network telemetry means to be | * Dynamic and Interactive: Since the network telemetry means to be | |||
used in a closed control loop for network automation, it needs to | used in a closed control loop for network automation, it needs to | |||
run continuously and adapt to the dynamic and interactive queries | run continuously and adapt to the dynamic and interactive queries | |||
from the network operation controller. | from the network operation controller. | |||
In addition, an ideal network telemetry solution may also have the | In addition, an ideal network telemetry solution may also have the | |||
following features or properties: | following features or properties: | |||
* In-Network Customization: The data that is generated can be | * In-Network Customization: The data that is generated can be | |||
customized in network at run-time to cater to the specific need of | customized in network at runtime to cater to the specific need of | |||
applications. This needs the support of a programmable data plane | applications. This needs the support of a programmable data | |||
which allows probes with custom functions to be deployed at | plane, which allows probes with custom functions to be deployed at | |||
flexible locations. | flexible locations. | |||
* In-Network Data Aggregation and Correlation: Network devices and | * In-Network Data Aggregation and Correlation: Network devices and | |||
aggregation points can work out which events and what data needs | aggregation points can work out which events and what data needs | |||
to be stored, reported, or discarded thus reducing the load on the | to be stored, reported, or discarded, thus reducing the load on | |||
central collection and processing points while still ensuring that | the central collection and processing points while still ensuring | |||
the right information is ready to be processed in a timely way. | that the right information is ready to be processed in a timely | |||
| way. | |||
* In-Network Processing: Sometimes it is not necessary or feasible | * In-Network Processing: Sometimes it is not necessary or feasible | |||
to gather all information to a central point to be processed and | to gather all information to a central point to be processed and | |||
acted upon. It is possible for the data processing to be done in | acted upon. It is possible for the data processing to be done in | |||
network, allowing reactive actions to be taken locally. | network, allowing reactive actions to be taken locally. | |||
* Direct Data Plane Export: The data originated from the data plane | * Direct Data Plane Export: The data originated from data plane | |||
forwarding chips can be directly exported to the data consumer for | forwarding chips can be directly exported to the data consumer for | |||
efficiency, especially when the data bandwidth is large and the | efficiency, especially when the data bandwidth is large and real- | |||
real-time processing is required. | time processing is required. | |||
* In-band Data Collection: In addition to the passive and active | * In-Band Data Collection: In addition to the passive and active | |||
data collection approaches, the new hybrid approach allows to | data collection approaches, the new hybrid approach allows to | |||
directly collect data for any target flow on its entire forwarding | directly collect data for any target flow on its entire forwarding | |||
path [I-D.song-opsawg-ifit-framework]. | path [OPSAWG-IFIT-FRAMEWORK]. | |||
It is worth noting that a network telemetry system should not be | It is worth noting that a network telemetry system should not be | |||
intrusive to normal network operations by avoiding the pitfall of the | intrusive to normal network operations by avoiding the pitfall of the | |||
"observer effect". That is, it should not change the network | "observer effect". That is, it should not change the network | |||
behavior and affect the forwarding performance. Moreover, high- | behavior and affect the forwarding performance. Moreover, high- | |||
volume telemetry traffic may cause network congestion unless proper | volume telemetry traffic may cause network congestion unless proper | |||
isolation or traffic engineering techniques are in place, or | isolation or traffic engineering techniques are in place, or | |||
congestion control mechanisms ensure that telemetry traffic backs off | congestion control mechanisms ensure that telemetry traffic backs off | |||
if it exceeds the network capacity. [RFC8084] and [RFC8085] are | if it exceeds the network capacity. [RFC8084] and [RFC8085] are | |||
relevant Best Current Practices (BCP) in this space. | relevant Best Current Practices (BCPs) in this space. | |||
Although in many cases a system for network telemetry involves a | Although in many cases a system for network telemetry involves a | |||
remote data collecting and consuming entity, it is important to | remote data collecting and consuming entity, it is important to | |||
understand that there are no inherent assumptions about how a system | understand that there are no inherent assumptions about how a system | |||
should be architected. While a network architecture with centralized | should be architected. While a network architecture with a | |||
controller (e.g., SDN) seems a natural fit for network telemetry, | centralized controller (e.g., SDN) seems to be a natural fit for | |||
network telemetry can work in distributed fashions as well. For | network telemetry, network telemetry can work in distributed fashions | |||
example, telemetry data producers and consumers can have a peer-to- | as well. For example, telemetry data producers and consumers can | |||
peer relationship, in which a network node can be the direct consumer | have a peer-to-peer relationship, in which a network node can be the | |||
of telemetry data from other nodes. | direct consumer of telemetry data from other nodes. | |||
2.5. The Necessity of a Network Telemetry Framework | 2.5. The Necessity of a Network Telemetry Framework | |||
Network data analytics (e.g., machine learning) is applied for | Network data analytics (e.g., machine learning) is applied for | |||
network operation automation, relying on abundant and coherent data | network operation automation, relying on abundant and coherent data | |||
from networks. Data acquisition that is limited to a single source | from networks. Data acquisition that is limited to a single source | |||
and static in nature will in many cases not be sufficient to meet an | and static in nature will in many cases not be sufficient to meet an | |||
application's telemetry data needs. As a result, multiple data | application's telemetry data needs. As a result, multiple data | |||
sources, involving a variety of techniques and standards, will need | sources, involving a variety of techniques and standards, will need | |||
to be integrated. It is desirable to have a framework that | to be integrated. It is desirable to have a framework that | |||
classifies and organizes different telemetry data source and types, | classifies and organizes different telemetry data sources and types, | |||
defines different components of a network telemetry system and their | defines different components of a network telemetry system and their | |||
interactions, and helps coordinate and integrate multiple telemetry | interactions, and helps coordinate and integrate multiple telemetry | |||
approaches across layers. This allows flexible combinations of data | approaches across layers. This allows flexible combinations of data | |||
for different applications, while normalizing and simplifying | for different applications, while normalizing and simplifying | |||
interfaces. In detail, such a framework would benefit the | interfaces. In detail, such a framework would benefit the | |||
development of network operation applications for the following | development of network operation applications for the following | |||
reasons: | reasons: | |||
* Future networks, autonomous or otherwise, depend on holistic and | * Future networks, autonomous or otherwise, depend on holistic and | |||
comprehensive network visibility. The use cases and applications | comprehensive network visibility. Use cases and applications are | |||
are better to be supported uniformly and coherently using an | better when supported uniformly and coherently using an | |||
integrated, converged mechanism and common telemetry data | integrated, converged mechanism and common telemetry data | |||
representations wherever feasible. Therefore, the protocols and | representations wherever feasible. Therefore, the protocols and | |||
mechanisms should be consolidated into a minimum yet comprehensive | mechanisms should be consolidated into a minimum yet comprehensive | |||
set. A telemetry framework can help to normalize the technique | set. A telemetry framework can help to normalize the technique | |||
developments. | developments. | |||
* Network visibility presents multiple viewpoints. For example, the | * Network visibility presents multiple viewpoints. For example, the | |||
device viewpoint takes the network infrastructure as the | device viewpoint takes the network infrastructure as the | |||
monitoring object from which the network topology and device | monitoring object from which the network topology and device | |||
status can be acquired; the traffic viewpoint takes the flows or | status can be acquired, and the traffic viewpoint takes the flows | |||
packets as the monitoring object from which the traffic quality | or packets as the monitoring object from which the traffic quality | |||
and path can be acquired. An application may need to switch its | and path can be acquired. An application may need to switch its | |||
viewpoint during operation. It may also need to correlate a | viewpoint during operation. It may also need to correlate a | |||
service and its impact on user experience to acquire the | service and its impact on user experience (UE) to acquire the | |||
comprehensive information. | comprehensive information. | |||
* Applications require network telemetry to be elastic in order to | * Applications require network telemetry to be elastic in order to | |||
make efficient use of network resources and reduce the impact of | make efficient use of network resources and reduce the impact of | |||
processing related to network telemetry on network performance. | processing related to network telemetry on network performance. | |||
For example, routine network monitoring should cover the entire | For example, routine network monitoring should cover the entire | |||
network with a low data sampling rate. Only when issues arise or | network with a low data sampling rate. Only when issues arise or | |||
critical trends emerge should telemetry data sources be modified | critical trends emerge should telemetry data sources be modified | |||
and telemetry data rates boosted as needed. | and telemetry data rates be boosted as needed. | |||
* Efficient data aggregation is critical for applications to reduce | * Efficient data aggregation is critical for applications to reduce | |||
the overall quantity of data and improve the accuracy of analysis. | the overall quantity of data and improve the accuracy of analysis. | |||
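The elasticity requirement in the list above (routine monitoring at a low sampling rate, boosted only when issues arise) can be illustrated with a minimal sketch. The class name, the rates, and the loss threshold are hypothetical values chosen for illustration; they are not defined by this document.

```python
from dataclasses import dataclass


@dataclass
class SamplingPolicy:
    """Hypothetical elastic sampling policy; all values are assumptions."""
    baseline_rate: float = 0.001   # fraction of packets sampled in routine monitoring
    boosted_rate: float = 0.1      # fraction sampled while an issue is suspected
    loss_threshold: float = 0.01   # observed loss ratio that triggers the boost

    def rate_for(self, observed_loss: float) -> float:
        # Boost the data rate only when the monitored indicator crosses the threshold.
        if observed_loss >= self.loss_threshold:
            return self.boosted_rate
        return self.baseline_rate


policy = SamplingPolicy()
routine = policy.rate_for(observed_loss=0.0)
incident = policy.rate_for(observed_loss=0.05)
```

A real system would likely add hysteresis so that the rate does not oscillate when the indicator hovers near the threshold.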
A telemetry framework collects together all the telemetry-related | A telemetry framework collects all the telemetry-related works from | |||
works from different sources and working groups within IETF. This | different sources and working groups within the IETF. This makes it | |||
makes it possible to assemble a comprehensive network telemetry | possible to assemble a comprehensive network telemetry system and to | |||
system and to avoid repetitious or redundant work. The framework | avoid repetitious or redundant work. The framework should cover the | |||
should cover the concepts and components from the standardization | concepts and components from the standardization perspective. This | |||
perspective. This document describes the modules which make up a | document describes the modules that make up a network telemetry | |||
network telemetry framework and decomposes the telemetry system into | framework and decomposes the telemetry system into a set of distinct | |||
a set of distinct components that existing and future work can easily | components that existing and future work can easily map to. | |||
map to. | ||||
3. Network Telemetry Framework | 3. Network Telemetry Framework | |||
The top level network telemetry framework partitions the network | The top-level network telemetry framework partitions the network | |||
telemetry into four modules based on the telemetry data object source | telemetry into four modules based on the telemetry data object source | |||
and represents their relationship. Once the network operation | and represents their relationship. Once the network operation | |||
applications acquire the data from these modules, they can apply data | applications acquire the data from these modules, they can apply data | |||
analytics and take actions. At the next level, the framework | analytics and take actions. At the next level, the framework | |||
decomposes each module into separate components. Each of the modules | decomposes each module into separate components. Each of these | |||
follows the same underlying structure, with one component dedicated | modules follows the same underlying structure, with one component | |||
to the configuration of data subscriptions and data sources, a second | dedicated to the configuration of data subscriptions and data | |||
component dedicated to encoding and exporting data, and a third | sources, a second component dedicated to encoding and exporting data, | |||
component instrumenting the generation of telemetry related to the | and a third component instrumenting the generation of telemetry | |||
underlying resources. Throughout the framework, the same set of | related to the underlying resources. Throughout the framework, the | |||
abstract data acquiring mechanisms and data types (Section 3.3) are | same set of abstract data-acquiring mechanisms and data types | |||
applied. The two-level architecture with the uniform data | (Section 3.3) are applied. The two-level architecture with the | |||
abstraction helps accurately pinpoint a protocol or technique to its | uniform data abstraction helps accurately pinpoint a protocol or | |||
position in a network telemetry system or disaggregate a network | technique to its position in a network telemetry system or | |||
telemetry system into manageable parts. | disaggregates a network telemetry system into manageable parts. | |||
3.1. Top Level Modules | 3.1. Top-Level Modules | |||
Telemetry can be applied on the forwarding plane, the control plane, | Telemetry can be applied on the forwarding plane, control plane, and | |||
and the management plane in a network, as well as other sources out | management plane in a network, as well as on other sources out of the | |||
of the network, as shown in Figure 1. Therefore, we categorize the | network, as shown in Figure 1. Therefore, we categorize the network | |||
network telemetry into four distinct modules (management plane, | telemetry into four distinct modules (management plane, control | |||
control plane, forwarding plane, and external data and event | plane, forwarding plane, and external data and event telemetry) with | |||
telemetry) with each having its own interface to Network Operation | each having its own interface to network operation applications. | |||
Applications. | ||||
+------------------------------+ | +------------------------------+ | |||
| | | | | | |||
| Network Operation |<-------+ | | Network Operation |<-------+ | |||
| Applications | | | | Applications | | | |||
| | | | | | | | |||
+------------------------------+ | | +------------------------------+ | | |||
^ ^ ^ | | ^ ^ ^ | | |||
| | | | | | | | | | |||
V V | V | V V | V | |||
skipping to change at page 15, line 39 ¶ | skipping to change at line 709 ¶ | |||
| Management | ^ V | | Telemetry | | | Management | ^ V | | Telemetry | | |||
| Plane +-------|-------+ | | | | Plane +-------|-------+ | | | |||
| Telemetry | V | +-----------+ | | Telemetry | V | +-----------+ | |||
| | Forwarding | | | | Forwarding | | |||
| | Plane | | | | Plane | | |||
| <---> | | | <---> | | |||
| | Telemetry | | | | Telemetry | | |||
| | | | | | | | |||
+--------------+---------------+ | +--------------+---------------+ | |||
Figure 1: Modules in Layer Category of NTF | Figure 1: Modules in Layer Category of the Network Telemetry | |||
Framework | ||||
The rationale of this partition lies in the different telemetry data | The rationale of this partition lies in the different telemetry data | |||
objects which result in different data source and export locations. | objects that result in different data sources and export locations. | |||
Such differences have profound implications on in-network data | Such differences have profound implications on in-network data | |||
programming and processing capability, data encoding and transport | programming and processing capability, data encoding and the | |||
protocol, and required data bandwidth and latency. Data can be sent | transport protocol, and required data bandwidth and latency. Data | |||
directly, or proxied via the control and management planes. There | can be sent directly or proxied via the control and management | |||
are advantages/disadvantages to both approaches. | planes. There are advantages/disadvantages to both approaches. | |||
Note that in some cases the network controller itself may be the | Note that in some cases, the network controller itself may be the | |||
source of telemetry data that is unique to it or derived from the | source of telemetry data that is unique to it or derived from the | |||
telemetry data collected from the network elements. Some of the | telemetry data collected from the network elements. Some of the | |||
principles and taxonomy specific to the control plane and management | principles and taxonomy specific to the control plane and management | |||
plane telemetry could also be applied to the controller when it is | plane telemetry could also be applied to the controller when it is | |||
required to provide the telemetry data to Network Operation | required to provide the telemetry data to network operation | |||
Applications hosted outside. The scope of the document is focused on | applications hosted outside. The scope of this document is focused | |||
the network elements telemetry and further details related to | on the network elements telemetry, and further details related to | |||
controllers are thus out of scope. | controllers are thus out of scope. | |||
We summarize the major differences of the four modules in the | We summarize the major differences of the four modules in Table 1. | |||
following table. They are compared from six angles: | They are compared from six angles: | |||
* Data Object | * Data Object | |||
* Data Export Location | * Data Export Location | |||
* Data Model | * Data Model | |||
* Data Encoding | * Data Encoding | |||
* Telemetry Application Protocol | * Telemetry Application Protocol | |||
skipping to change at page 16, line 34 ¶ | skipping to change at line 754 ¶ | |||
Data Object is the target and source of each module. Because the | Data Object is the target and source of each module. Because the | |||
data source varies, the location where data is most conveniently | data source varies, the location where data is most conveniently | |||
exported also varies. For example, forwarding plane data mainly | exported also varies. For example, forwarding plane data mainly | |||
originates as data exported from the forwarding Application-Specific | originates as data exported from the forwarding Application-Specific | |||
Integrated Circuits (ASICs), while control plane data mainly | Integrated Circuits (ASICs), while control plane data mainly | |||
originates from the protocol daemons running on the control CPU(s). | originates from the protocol daemons running on the control CPU(s). | |||
For convenience and efficiency, it is preferred to export the data | For convenience and efficiency, it is preferred to export the data | |||
off the device from locations near the source. Because the locations | off the device from locations near the source. Because the locations | |||
that can export data have different capabilities, different choices | that can export data have different capabilities, different choices | |||
of data model, encoding, and transport method are made to balance the | of data models, encoding, and transport methods are made to balance | |||
performance and cost. For example, the forwarding chip has high | the performance and cost. For example, the forwarding chip has high | |||
throughput but limited capacity for processing complex data and | throughput but limited capacity for processing complex data and | |||
maintaining state, while the main control CPU is capable of complex | maintaining state, while the main control CPU is capable of complex | |||
data and state processing, but has limited bandwidth for high | data and state processing but has limited bandwidth for high | |||
throughput data. As a result, the suitable telemetry protocol for | throughput data. As a result, the suitable telemetry protocol for | |||
each module can be different. Some representative techniques are | each module can be different. Some representative techniques are | |||
shown in the corresponding table blocks to highlight the technical | shown in the corresponding table blocks to highlight the technical | |||
diversity of these modules. Note that the selected techniques just | diversity of these modules. Note that the selected techniques just | |||
reflect the de facto state of the art and are by no means exhaustive | reflect the de facto state of the art and are by no means exhaustive | |||
(e.g., IPFIX can also be implemented over TCP and SCTP, but that is | (e.g., IPFIX can also be implemented over TCP and SCTP, but that is | |||
not recommended for forwarding plane). The key point is that one | not recommended for the forwarding plane). The key point is that one | |||
cannot expect to use a universal protocol to cover all the network | cannot expect to use a universal protocol to cover all the network | |||
telemetry requirements. | telemetry requirements. | |||
+-----------+-------------+-------------+--------------+----------+ | +=============+===============+==========+==========+===============+ | |||
| Module |Management |Control |Forwarding |External | | |Module |Management |Control |Forwarding|External Data | | |||
| |Plane |Plane |Plane |Data | | | |Plane |Plane |Plane | | | |||
+-----------+-------------+-------------+--------------+----------+ | +=============+===============+==========+==========+===============+ | |||
|Object |config. & |control |flow & packet |terminal, | | |Object |configuration |control |flow and |terminal, | | |||
| |operation |protocol & |QoS, traffic |social & | | | |and operation |protocol |packet |social, and | | |||
| |state |signaling, |stat., buffer |environ- | | | |state |and |QoS, |environmental | | |||
| | |RIB |& queue stat.,|mental | | | | |signaling,|traffic | | | |||
| | | |ACL, FIB | | | | | |RIB |stat., | | | |||
+-----------+-------------+-------------+--------------+----------+ | | | | |buffer and| | | |||
|Export |main control |main control |fwding chip |various | | | | | |queue | | | |||
|Location |CPU |CPU, |or linecard | | | | | | |stat., | | | |||
| | |linecard CPU |CPU; main | | | | | | |FIB, | | | |||
| | |or forwarding|control CPU | | | | | | |Access | | | |||
| | |chip |unlikely | | | | | | |Control | | | |||
+-----------+-------------+-------------+--------------+----------+ | | | | |List (ACL)| | | |||
|Data |YANG, MIB, |YANG, |YANG |YANG, | | +-------------+---------------+----------+----------+---------------+ | |||
|Model |syslog |custom |custom, |custom | | |Export |main control |main |forwarding|various | | |||
+-----------+-------------+-------------+--------------+----------+ | |Location |CPU |control |chip or | | | |||
|Data |GPB, JSON, |GPB, JSON, |plain text |GPB, JSON | | | | |CPU, |linecard | | | |||
|Encoding |XML |XML, | |XML, plain| | | | |linecard |CPU; main | | | |||
| | |plain text | |text | | | | |CPU, or |control | | | |||
+-----------+-------------+-------------+--------------+----------+ | | | |forwarding|CPU | | | |||
|Application|gRPC,NETCONF,|gRPC,NETCONF,|IPFIX, traffic|gRPC | | | | |chip |unlikely | | | |||
|Protocol |RESTCONF |IPFIX,traffic|mirroring, | | | +-------------+---------------+----------+----------+---------------+ | |||
| | |mirroring |gRPC, NETFLOW | | | |Data Model |YANG, MIB, |YANG, |YANG, |YANG, custom | | |||
+-----------+-------------+-------------+--------------+----------+ | | |syslog |custom |custom | | | |||
|Data |HTTP(S), TCP |HTTP(S), TCP,|UDP |HTTP(S), | | +-------------+---------------+----------+----------+---------------+ | |||
|Transport | |UDP | |TCP, UDP | | |Data Encoding|GPB, JSON, XML |GPB, JSON,|plain text|GPB, JSON, XML,| | |||
+-----------+-------------+-------------+--------------+----------+ | | | |XML, plain| |plain text | | |||
| | |text | | | | ||||
+-------------+---------------+----------+----------+---------------+ | ||||
|Application |gRPC, NETCONF, |gRPC, |IPFIX, |gRPC | | ||||
|Protocol |RESTCONF |NETCONF, |traffic | | | ||||
| | |IPFIX, |mirroring,| | | ||||
| | |traffic |gRPC, | | | ||||
| | |mirroring |NETFLOW | | | ||||
+-------------+---------------+----------+----------+---------------+ | ||||
|Data |HTTP(S), TCP |HTTP(S), |UDP |HTTP(S), TCP, | | ||||
|Transport | |TCP, UDP | |UDP | | ||||
+-------------+---------------+----------+----------+---------------+ | ||||
Figure 2: Comparison of the Data Object Modules | Table 1: Comparison of Data Object Modules | |||
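As a rough illustration of the point that no universal protocol covers all telemetry requirements, the Data Transport and Data Encoding rows of the table above can be transcribed into a small lookup structure. The dictionary layout and the helper names are assumptions made for this sketch; only the cell values come from the table.

```python
# Cell values transcribed from the Data Transport and Data Encoding rows.
TRANSPORT = {
    "management plane": {"HTTP(S)", "TCP"},
    "control plane": {"HTTP(S)", "TCP", "UDP"},
    "forwarding plane": {"UDP"},
    "external data": {"HTTP(S)", "TCP", "UDP"},
}

ENCODING = {
    "management plane": {"GPB", "JSON", "XML"},
    "control plane": {"GPB", "JSON", "XML", "plain text"},
    "forwarding plane": {"plain text"},
    "external data": {"GPB", "JSON", "XML", "plain text"},
}


def common_to_all(row):
    """Return the options that every module lists in the given table row."""
    return set.intersection(*row.values())


def modules_using(row, option):
    """Return the modules (sorted by name) that list the given option."""
    return sorted(module for module, options in row.items() if option in options)


shared_transports = common_to_all(TRANSPORT)
shared_encodings = common_to_all(ENCODING)
udp_modules = modules_using(TRANSPORT, "UDP")
```

With these representative entries, no single transport or encoding is listed for all four modules, which matches the observation made in the text.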
Note that the interaction with the applications that consume network | Note that the interaction with the applications that consume network | |||
telemetry data can be indirect. Some in-device data transfer is | telemetry data can be indirect. Some in-device data transfer is | |||
possible. For example, in the management plane telemetry, the | possible. For example, in the management plane telemetry, the | |||
management plane will need to acquire data from the data plane. Some | management plane will need to acquire data from the data plane. Some | |||
operational states can only be derived from data plane data sources | operational states can only be derived from data plane data sources | |||
such as the interface status and statistics. As another example, | such as the interface status and statistics. As another example, | |||
obtaining control plane telemetry data may require the ability to | obtaining control plane telemetry data may require the ability to | |||
access the Forwarding Information Base (FIB) of the data plane. | access the Forwarding Information Base (FIB) of the data plane. | |||
skipping to change at page 18, line 13 ¶ | skipping to change at line 835 ¶ | |||
the control plane telemetry. | the control plane telemetry. | |||
The requirements and challenges for each module are summarized as | The requirements and challenges for each module are summarized as | |||
follows (note that the requirements may pertain across all telemetry | follows (note that the requirements may pertain across all telemetry | |||
modules; however, we emphasize those that are most pronounced for a | modules; however, we emphasize those that are most pronounced for a | |||
particular plane). | particular plane). | |||
3.1.1. Management Plane Telemetry | 3.1.1. Management Plane Telemetry | |||
The management plane of network elements interacts with the Network | The management plane of network elements interacts with the Network | |||
Management System (NMS), and provides information such as performance | Management System (NMS) and provides information such as performance | |||
data, network logging data, network warning and defects data, and | data, network logging data, network warning and defects data, and | |||
network statistics and state data. The management plane includes | network statistics and state data. The management plane includes | |||
many protocols, including the classical SNMP and syslog. Regardless | many protocols, including the classical SNMP and syslog. Regardless | |||
of the protocol, management plane telemetry must address the following | of the protocol, management plane telemetry must address the following | |||
requirements: | requirements: | |||
* Convenient Data Subscription: An application should have the | * Convenient Data Subscription: An application should have the | |||
freedom to choose which data is exported (see section 4.3) and the | freedom to choose which data is exported (see Section 3.3) and the | |||
means and frequency of how that data is exported (e.g., on-change | means and frequency of how that data is exported (e.g., on-change | |||
or periodic subscription). | or periodic subscription). | |||
* Structured Data: For automatic network operation, machines will | * Structured Data: For automatic network operation, machines will | |||
replace human for network data comprehension. Data modeling | replace humans for network data comprehension. Data modeling | |||
languages, such as YANG, can efficiently describe structured data | languages, such as YANG, can efficiently describe structured data | |||
and normalize data encoding and transformation. | and normalize data encoding and transformation. | |||
* High Speed Data Transport: In order to keep up with the velocity | * High-Speed Data Transport: In order to keep up with the velocity | |||
of information, a data source needs to be able to send large | of information, a data source needs to be able to send large | |||
amounts of data at high frequency. Compact encoding formats or | amounts of data at high frequency. Compact encoding formats or | |||
data compression schemes are needed to reduce the quantity of data | data compression schemes are needed to reduce the quantity of data | |||
and improve the data transport efficiency. The subscription mode, | and improve the data transport efficiency. The subscription mode, | |||
by replacing the query mode, reduces the interactions between | by replacing the query mode, reduces the interactions between | |||
clients and servers and helps to improve the data source's | clients and servers and helps to improve the data source's | |||
efficiency. | efficiency. | |||
* Network Congestion Avoidance: The application must protect the | * Network Congestion Avoidance: The application must protect the | |||
network from congestion by congestion control mechanisms or at | network from congestion with congestion control mechanisms or, at | |||
least circuit breakers. [RFC8084] and [RFC8085] provide some | minimum, with circuit breakers. [RFC8084] and [RFC8085] provide | |||
solutions in this space. | some solutions in this space. | |||
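The two subscription styles named in the first requirement above (on-change and periodic) can be sketched as follows. This is a minimal illustration under stated assumptions: the Subscription class, its fields, and the example paths are hypothetical and do not correspond to any particular protocol or data model.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Subscription:
    """Hypothetical subscription record; names are illustrative only."""
    path: str                      # data item the application subscribed to
    mode: str                      # "periodic" or "on-change"
    period_s: float = 0.0          # export interval, used in periodic mode
    last_sent: Optional[float] = None
    last_value: Any = None

    def should_export(self, now: float, value: Any) -> bool:
        """Decide whether the data source should push an update at time `now`."""
        if self.last_sent is None:
            due = True  # always send the first sample
        elif self.mode == "on-change":
            due = value != self.last_value
        else:
            due = now - self.last_sent >= self.period_s
        if due:
            self.last_sent = now
            self.last_value = value
        return due


periodic = Subscription(path="/interfaces/eth0/counters", mode="periodic", period_s=10.0)
periodic_decisions = [periodic.should_export(t, 1) for t in (0.0, 5.0, 10.0)]

on_change = Subscription(path="/interfaces/eth0/oper-status", mode="on-change")
change_decisions = [on_change.should_export(t, v) for t, v in ((0.0, "up"), (1.0, "up"), (2.0, "down"))]
```

In both modes the data source pushes updates without being queried each time, which is how the subscription mode reduces client-server interactions.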
3.1.2. Control Plane Telemetry | 3.1.2. Control Plane Telemetry | |||
The control plane telemetry refers to the health condition monitoring | The control plane telemetry refers to the health condition monitoring | |||
of different network control protocols at all layers of the protocol | of different network control protocols at all layers of the protocol | |||
stack. Keeping track of the operational status of these protocols is | stack. Keeping track of the operational status of these protocols is | |||
beneficial for detecting, localizing, and even predicting various | beneficial for detecting, localizing, and even predicting various | |||
network issues, as well as network optimization, in real-time and | network issues, as well as for network optimization, in real time and | |||
with fine granularity. Some particular challenges and issues faced | with fine granularity. Some particular challenges and issues faced | |||
by the control plane telemetry are as follows: | by the control plane telemetry are as follows: | |||
* One challenging problem for the control plane telemetry is how to | * How to correlate the End-to-End (E2E) Key Performance Indicators | |||
correlate the End-to-End (E2E) Key Performance Indicators (KPI) to | (KPIs) to a specific layer's KPIs. For example, IPTV users may | |||
a specific layer's KPIs. For example, IPTV users may describe | describe their UE by the video smoothness and definition. Then in | |||
their User Experience (UE) by the video smoothness and definition. | case of an unusually poor UE KPI or a service disconnection, it is | |||
Then in case of an unusually poor UE KPI or a service | non-trivial to delimit and pinpoint the issue in the responsible | |||
disconnection, it is non-trivial to delimit and pinpoint the issue | protocol layer (e.g., the transport layer or the network layer), | |||
in the responsible protocol layer (e.g., the Transport Layer or | the responsible protocol (e.g., IS-IS or BGP at the network | |||
the Network Layer), the responsible protocol (e.g., ISIS or BGP at | layer), and finally the responsible device(s) with specific | |||
the Network Layer), and finally the responsible device(s) with | reasons. | |||
specific reasons. | ||||
* Conventional OAM-based approaches for control plane KPI | * Conventional OAM-based approaches for control plane KPI | |||
measurement include Ping (L3), Traceroute (L3), Y.1731 [y1731] | measurement, which include Ping (L3), Traceroute (L3), Y.1731 | |||
(L2), and so on. One common issue behind these methods is that | [y1731] (L2), and so on. One common issue behind these methods is | |||
they only measure the KPIs instead of reflecting the actual | that they only measure the KPIs instead of reflecting the actual | |||
running status of these protocols, making them less effective or | running status of these protocols, making them less effective or | |||
efficient for control plane troubleshooting and network | efficient for control plane troubleshooting and network | |||
optimization. | optimization. | |||
* An example of the control plane telemetry is the BGP monitoring | * How more research is needed for the BGP monitoring protocol (BMP). | |||
protocol (BMP). It is currently used for monitoring the BGP | BMP is an example of the control plane telemetry; it is currently | |||
routes and enables rich applications, such as BGP peer analysis, | used for monitoring BGP routes and enables rich applications, such | |||
AS analysis, prefix analysis, and security analysis. However, the | as BGP peer analysis, Autonomous System (AS) analysis, prefix | |||
monitoring of other layers, protocols and the cross-layer, cross- | analysis, and security analysis. However, the monitoring of other | |||
protocol KPI correlations are still in their infancy (e.g., IGP | layers, protocols, and the cross-layer, cross-protocol KPI | |||
monitoring is not as extensive as BMP), which require further | correlations are still in their infancy (e.g., IGP monitoring is | |||
research. | not as extensive as BMP), which requires further research. | |||
* The requirement and solutions for network congestion avoidance are | Note that the requirement and solutions for network congestion | |||
also applicable to the control plane telemetry. | avoidance are also applicable to the control plane telemetry. | |||
3.1.3. Forwarding Plane Telemetry | 3.1.3. Forwarding Plane Telemetry | |||
An effective forwarding plane telemetry system relies on the data | An effective forwarding plane telemetry system relies on the data | |||
that the network device can expose. The quality, quantity, and | that the network device can expose. The quality, quantity, and | |||
timeliness of data must meet some stringent requirements. This | timeliness of data must meet some stringent requirements. This | |||
raises some challenges to the network data plane devices where the | raises some challenges for the network data plane devices where the | |||
first-hand data originates. | first-hand data originates. | |||
* A data plane device's main function is user traffic processing and | * A data plane device's main function is user traffic processing and | |||
forwarding. While supporting network visibility is important, the | forwarding. While supporting network visibility is important, the | |||
telemetry is just an auxiliary function, and it should strive to | telemetry is just an auxiliary function, and it should strive to | |||
not impede normal traffic processing and forwarding (i.e., the | not impede normal traffic processing and forwarding (i.e., the | |||
forwarding behavior should not be altered and the trade-off | forwarding behavior should not be altered, and the trade-off | |||
between forwarding performance and telemetry should be well- | between forwarding performance and telemetry should be well- | |||
balanced). | balanced). | |||
* Network operation applications require end-to-end visibility | * Network operation applications require end-to-end visibility | |||
across various sources, which can result in a huge volume of data. | across various sources, which can result in a huge volume of data. | |||
However, the sheer quantity of data must not exhaust the network | However, the sheer quantity of data must not exhaust the network | |||
bandwidth, regardless of the data delivery approach (i.e., whether | bandwidth, regardless of the data delivery approach (i.e., whether | |||
through in-band or out-of-band channels). | through in-band or out-of-band channels). | |||
* The data plane devices must provide timely data with the minimum | * The data plane devices must provide timely data with the minimum | |||
possible delay. Long processing, transport, storage, and analysis | possible delay. Long processing, transport, storage, and analysis | |||
delay can impact the effectiveness of the control loop and even | delay can impact the effectiveness of the control loop and even | |||
render the data useless. | render the data useless. | |||
* The data should be structured and labeled, and easy for | * The data should be structured, labeled, and easy for applications | |||
applications to parse and consume. At the same time, the data | to parse and consume. At the same time, the data types needed by | |||
types needed by applications can vary significantly. The data | applications can vary significantly. The data plane devices need | |||
plane devices need to provide enough flexibility and | to provide enough flexibility and programmability to support the | |||
programmability to support the precise data provision for | precise data provision for applications. | |||
applications. | ||||
* The data plane telemetry should support incremental deployment and | * The data plane telemetry should support incremental deployment and | |||
work even if some devices are unaware of the system. | work even if some devices are unaware of the system. | |||
* The requirement and solutions for network congestion avoidance are | * The requirement and solutions for network congestion avoidance are | |||
also applicable to the forwarding plane telemetry. | also applicable to the forwarding plane telemetry. | |||
Although not specific to the forwarding plane, these challenges are | Although not specific to the forwarding plane, these challenges are | |||
more difficult to the forwarding plane because of the limited | more difficult for the forwarding plane because of the limited | |||
resource and flexibility. Data plane programmability is essential to | resources and flexibility. Data plane programmability is essential | |||
support network telemetry. Newer data plane forwarding chips are | to support network telemetry. Newer data plane forwarding chips are | |||
equipped with advanced telemetry features and provide flexibility to | equipped with advanced telemetry features and provide flexibility to | |||
support customized telemetry functions. | support customized telemetry functions. | |||
Technique Taxonomy: concerning about how one instruments the | Technique Taxonomy: This pertains to how one instruments the | |||
telemetry, there can be multiple possible dimensions to classify the | telemetry; there can be multiple possible dimensions to classify the | |||
forwarding plane telemetry techniques. | forwarding plane telemetry techniques. | |||
* Active, Passive, and Hybrid: This dimension concerns about the | * Active, Passive, and Hybrid: This dimension pertains to the end- | |||
end-to-end measurement. Active and passive methods (as well as | to-end measurement. Active and passive methods (as well as the | |||
the hybrid types) are well documented in [RFC7799]. Passive | hybrid types) are well documented in [RFC7799]. Passive methods | |||
methods include TCPDUMP, IPFIX [RFC7011], sFlow, and traffic | include TCPDUMP, IPFIX [RFC7011], sFlow, and traffic mirroring. | |||
mirroring. These methods usually have low data coverage, and | These methods usually have low data coverage, and improving that | |||
improving that coverage comes at a very high bandwidth cost. | coverage comes at a very high bandwidth cost. On the other | |||
On the other hand, active methods include Ping, OWAMP [RFC4656], | hand, active methods include Ping, the One-Way Active Measurement | |||
TWAMP [RFC5357], STAMP [RFC8762], and Cisco's SLA Protocol | Protocol (OWAMP) [RFC4656], the Two-Way Active Measurement | |||
[RFC6812]. These methods are intrusive and only provide indirect | Protocol (TWAMP) [RFC5357], the Simple Two-way Active Measurement | |||
network measurements. Hybrid methods, including in-situ OAM | Protocol (STAMP) [RFC8762], and Cisco's SLA Protocol [RFC6812]. | |||
[I-D.ietf-ippm-ioam-data], Alternate-Marking (AM) [RFC8321], and | These methods are intrusive and only provide indirect network | |||
Multipoint Alternate Marking [RFC8889], provide a well-balanced | measurements. Hybrid methods, including IOAM [RFC9197], Alternate | |||
and more flexible approach. However, these methods are also more | Marking (AM) [RFC8321], and Multipoint Alternate Marking | |||
complex to implement. | [RFC8889], provide a well-balanced and more flexible approach. | |||
However, these methods are also more complex to implement. | ||||
* In-Band and Out-of-Band: Telemetry data carried in user packets | * In-Band and Out-of-Band: Telemetry data carried in user packets | |||
before being exported to a data collector is considered in-band | before being exported to a data collector is considered in-band | |||
(e.g., in-situ OAM [I-D.ietf-ippm-ioam-data]). Telemetry data | (e.g., IOAM [RFC9197]). Telemetry data that is directly exported | |||
that is directly exported to a data collector without modifying | to a data collector without modifying user packets is considered | |||
user packets is considered out-of-band (e.g., the postcard-based | out-of-band (e.g., the postcard-based approach described in | |||
approach described in Appendix A.3.5). It is also possible to | Appendix A.3.5). It is also possible to have hybrid methods, | |||
have hybrid methods, where only the telemetry instruction or | where only the telemetry instruction or partial data is carried by | |||
partial data is carried by user packets (e.g., AM [RFC8321]). | user packets (e.g., AM [RFC8321]). | |||
* End-to-End and In-Network: End-to-End methods start from, and end | * End-to-End and In-Network: End-to-end methods start from, and end | |||
at, the network end hosts (e.g., Ping). In-Network methods work | at, the network end hosts (e.g., Ping). In-network methods work | |||
in networks and are transparent to end hosts. However, if needed, | in networks and are transparent to end hosts. However, if needed, | |||
In-Network methods can be easily extended into end hosts. | in-network methods can be easily extended into end hosts. | |||
* Data Subject: Depending on the telemetry objective, the methods | * Data Subject: Depending on the telemetry objective, the methods | |||
can be flow-based (e.g., in-situ OAM [I-D.ietf-ippm-ioam-data]), | can be flow based (e.g., IOAM [RFC9197]), path based (e.g., | |||
path-based (e.g., Traceroute), and node-based (e.g., IPFIX | Traceroute), and node based (e.g., IPFIX [RFC7011]). The various | |||
[RFC7011]). The various data objects can be packet, flow record, | data objects can be packet, flow record, measurement, states, and | |||
measurement, states, and signal. | signal. | |||
3.1.4. External Data Telemetry | 3.1.4. External Data Telemetry | |||
Events that occur outside the boundaries of the network system are | Events that occur outside the boundaries of the network system are | |||
another important source of network telemetry. Correlating both | another important source of network telemetry. Correlating both | |||
internal telemetry data and external events with the requirements of | internal telemetry data and external events with the requirements of | |||
network systems, as presented in | network systems, as presented in [NMRG-ANTICIPATED-ADAPTATION], | |||
[I-D.pedro-nmrg-anticipated-adaptation], provides a strategic and | provides a strategic and functional advantage to management | |||
functional advantage to management operations. | operations. | |||
As with other sources of telemetry information, the data and events | As with other sources of telemetry information, the data and events | |||
must meet strict requirements, especially in terms of timeliness, | must meet strict requirements, especially in terms of timeliness, | |||
which is essential to properly incorporate external event information | which is essential to properly incorporate external event information | |||
into network management applications. The specific challenges are | into network management applications. The specific challenges are | |||
described as follows: | described as follows: | |||
* The role of the external event detector can be played by multiple | * The role of the external event detector can be played by multiple | |||
elements, including hardware (e.g., physical sensors, such as | elements, including hardware (e.g., physical sensors, such as | |||
seismometers) and software (e.g., Big Data sources that can | seismometers) and software (e.g., big data sources that can | |||
analyze streams of information, such as Twitter messages). Thus, | analyze streams of information, such as Twitter messages). Thus, | |||
the transmitted data must support different shapes but, at the | the transmitted data must support different shapes but, at the | |||
same time, follow a common but extensible schema. | same time, follow a common but extensible schema. | |||
* Since the main function of the external event detectors is to | * Since the main function of the external event detectors is to | |||
perform the notifications, their timeliness is assumed. However, | perform the notifications, their timeliness is assumed. However, | |||
once messages have been dispatched, they must be quickly collected | once messages have been dispatched, they must be quickly collected | |||
and inserted into the control plane with variable priority, which | and inserted into the control plane with variable priority, which | |||
is higher for important sources and events and lower for secondary | is higher for important sources and events and lower for secondary | |||
ones. | ones. | |||
* The schema used by external detectors must be easily adopted by | * The schema used by external detectors must be easily adopted by | |||
current and future devices and applications. Therefore, it must | current and future devices and applications. Therefore, it must | |||
be easily mapped to current data models, such as in terms of YANG. | be easily mapped to current data models, such as in terms of YANG. | |||
* As the communication with external entities outside the boundary | * As the communication with external entities outside the boundary | |||
of a provider network may be realized over the Internet, the risk | of a provider network may be realized over the Internet, the risk | |||
of congestion is even more relevant in this context and proper | of congestion is even more relevant in this context and proper | |||
counter-measures must be taken. Solutions such as network | countermeasures must be taken. Solutions such as network | |||
transport circuit breakers are needed as well. | transport circuit breakers are needed as well. | |||
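The variable-priority insertion described in the list above can be sketched as a priority queue on the collector side. This is only an illustration under stated assumptions: the class name ExternalEventQueue, the numeric priority scale (lower number means more important), and the sample sources and events are hypothetical and do not come from the framework itself.

```python
import heapq
import itertools


class ExternalEventQueue:
    """Hypothetical collector-side queue: lower number = higher priority."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker that preserves arrival order

    def push(self, priority, source, event):
        # The sequence number keeps equal-priority events in arrival order.
        heapq.heappush(self._heap, (priority, next(self._seq), source, event))

    def pop(self):
        priority, _, source, event = heapq.heappop(self._heap)
        return priority, source, event

    def __len__(self):
        return len(self._heap)


queue = ExternalEventQueue()
queue.push(5, "social-media-stream", "traffic spike rumor")
queue.push(0, "seismometer", "ground motion detected")
queue.push(5, "social-media-stream", "venue announcement")

# Drain the queue: the important source is served first, and secondary
# events follow in the order in which they arrived.
drained = [queue.pop() for _ in range(len(queue))]
```

A real deployment would also need the congestion countermeasures mentioned above, which this sketch does not attempt to model.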
Organizing both internal and external telemetry information together | Organizing both internal and external telemetry information together | |||
will be key for the general exploitation of the management | will be key for the general exploitation of the management | |||
possibilities of current and future network systems, as reflected in | possibilities of current and future network systems, as reflected in | |||
the incorporation of cognitive capabilities to new hardware and | the incorporation of cognitive capabilities to new hardware and | |||
software (virtual) elements. | software (virtual) elements. | |||
3.2. Second Level Function Components | 3.2. Second-Level Function Components | |||
The telemetry module at each plane can be further partitioned into | The telemetry module at each plane can be further partitioned into | |||
five distinct conceptual components: | five distinct conceptual components: | |||
* Data Query, Analysis, and Storage: This component works at the | * Data Query, Analysis, and Storage: This component works at the | |||
network operation application block in Figure 1. It is normally a | network operation application block in Figure 1. It is normally a | |||
part of the network management system at the receiver side. On | part of the network management system at the receiver side. On | |||
the one hand, it is responsible for issuing data requirements. | one hand, it is responsible for issuing data requirements. The | |||
The data of interest can be modeled data through configuration or | data of interest can be modeled data through configuration or | |||
custom data through programming. The data requirements can be | custom data through programming. The data requirements can be | |||
queries for one-shot data or subscriptions for events or streaming | queries for one-shot data or subscriptions for events or streaming | |||
data. On the other hand, it receives, stores, and processes the | data. On the other hand, it receives, stores, and processes the | |||
returned data from network devices. Data analysis can be | returned data from network devices. Data analysis can be | |||
interactive to initiate further data queries. This component can | interactive to initiate further data queries. This component can | |||
reside in either network devices or remote controllers. It can be | reside in either network devices or remote controllers. It can be | |||
centralized and distributed, and involve one or more instances. | centralized and distributed and involve one or more instances. | |||
* Data Configuration and Subscription: This component manages data | * Data Configuration and Subscription: This component manages data | |||
queries on devices. It determines the protocol and channel for | queries on devices. It determines the protocol and channel for | |||
applications to acquire desired data. This component is also | applications to acquire desired data. This component is also | |||
responsible for configuring the desired data that might not be | responsible for configuring the desired data that might not be | |||
directly available from data sources. The subscription data can | directly available from data sources. The subscription data can | |||
be described by models, templates, or programs. | be described by models, templates, or programs. | |||
* Data Encoding and Export: This component determines how telemetry | * Data Encoding and Export: This component determines how telemetry | |||
data is delivered to the data analysis and storage component with | data is delivered to the data analysis and storage component with | |||
skipping to change at page 23, line 30 ¶ | skipping to change at line 1075 ¶ | |||
vary due to the data export location. | vary due to the data export location. | |||
* Data Generation and Processing: The requested data needs to be | * Data Generation and Processing: The requested data needs to be | |||
captured, filtered, processed, and formatted in network devices | captured, filtered, processed, and formatted in network devices | |||
from raw data sources. This may involve in-network computing and | from raw data sources. This may involve in-network computing and | |||
processing on either the fast path or the slow path in network | processing on either the fast path or the slow path in network | |||
devices. | devices. | |||
* Data Object and Source: This component determines the monitoring | * Data Object and Source: This component determines the monitoring | |||
objects and original data sources provisioned in the device. A | objects and original data sources provisioned in the device. A | |||
data source usually just provides raw data which needs further | data source usually just provides raw data that needs further | |||
processing. Each data source can be considered a probe. Some | processing. Each data source can be considered a probe. Some | |||
data sources can be dynamically installed, while others will be | data sources can be dynamically installed, while others will be | |||
more static. | more static. | |||
+----------------------------------------+ | +----------------------------------------+ | |||
+----------------------------------------+ | | +----------------------------------------+ | | |||
| | | | | | | | |||
| Data Query, Analysis, & Storage | | | | Data Query, Analysis, & Storage | | | |||
| | + | | | + | |||
+-------+++ -----------------------------+ | +-------+++ -----------------------------+ | |||
||| ^^^ | ||| ^^^ | |||
||| ||| | ||| ||| | |||
||V ||| | ||V ||| | |||
+--+V--------------------+++------------+ | +--+V--------------------+++------------+ | |||
+-----V---------------------+------------+ | | +-----V---------------------+------------+ | | |||
+---------------------+-------+----------+ | | | +---------------------+-------+----------+ | | | |||
| Data Configuration | | | | | | Data Configuration | | | | | |||
| & Subscription | Data Encoding | | | | | & Subscription | Data Encoding | | | | |||
| (model, template, | & Export | | | | | (model, template, | & Export | | | | |||
| & program) | | | | | | & program) | | | | | |||
+---------------------+------------------| | | | +---------------------+------------------| | | | |||
| | | | | | | | | | |||
| Data Generation | | | | | Data Generation | | | | |||
| & Processing | | | | | & Processing | | | | |||
| | | | | | | | | | |||
+----------------------------------------| | | | +----------------------------------------| | | | |||
| | | | | | | | | | |||
| Data Object and Source | |-+ | | Data Object and Source | |-+ | |||
| |-+ | | |-+ | |||
+----------------------------------------+ | +----------------------------------------+ | |||
Figure 3: Components in the Network Telemetry Framework | Figure 2: Components in the Network Telemetry Framework | |||
3.3. Data Acquisition Mechanism and Type Abstraction | 3.3. Data Acquisition Mechanism and Type Abstraction | |||
Broadly speaking, network data can be acquired through subscription | Broadly speaking, network data can be acquired through subscription | |||
(push) and query (poll). A subscription is a contract between | (push) and query (poll). A subscription is a contract between | |||
publisher and subscriber. After initial setup, the subscribed data | publisher and subscriber. After initial setup, the subscribed data | |||
is automatically delivered to registered subscribers until the | is automatically delivered to registered subscribers until the | |||
subscription expires. There are two variations of subscription. The | subscription expires. There are two variations of subscription. The | |||
subscriptions can be either pre-defined, or the subscribers are | subscriptions can be predefined, or the subscribers are allowed to | |||
allowed to configure and tailor the published data to their specific | configure and tailor the published data to their specific needs. | |||
needs. | ||||
In contrast, queries are used when a client expects immediate and | In contrast, queries are used when a client expects immediate and | |||
one-off feedback from network devices. The queried data may be | one-off feedback from network devices. The queried data may be | |||
directly extracted from some specific data source, or synthesized and | directly extracted from some specific data source or synthesized and | |||
processed from raw data. Queries work well for interactive network | processed from raw data. Queries work well for interactive network | |||
telemetry applications. | telemetry applications. | |||
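The difference between subscription (push) and query (poll) can be sketched as follows. This is a minimal in-memory illustration, not a real telemetry protocol: the TelemetrySource class, its method names, and the path string "if0/rx-packets" are all hypothetical, and cancelling the subscription stands in for its expiry.

```python
class TelemetrySource:
    """Hypothetical device-side data source supporting poll and push."""

    def __init__(self):
        self._values = {}
        self._subscribers = {}  # path -> list of callbacks

    def query(self, path):
        """Poll: return the current value once, on demand."""
        return self._values.get(path)

    def subscribe(self, path, callback):
        """Push: register a callback and return a function that cancels it."""
        callbacks = self._subscribers.setdefault(path, [])
        callbacks.append(callback)
        return lambda: callbacks.remove(callback)

    def update(self, path, value):
        """Record a new value and deliver it to every registered subscriber."""
        self._values[path] = value
        for callback in list(self._subscribers.get(path, [])):
            callback(path, value)


source = TelemetrySource()
received = []
cancel = source.subscribe("if0/rx-packets", lambda p, v: received.append((p, v)))

source.update("if0/rx-packets", 10)   # delivered to the subscriber
source.update("if0/rx-packets", 25)   # delivered to the subscriber
cancel()                              # the subscription ends
source.update("if0/rx-packets", 40)   # no longer delivered

polled = source.query("if0/rx-packets")  # one-off query returns the latest value
```

The subscriber sees each change as it happens without asking again, while the query only returns whatever value is current at the moment it is issued.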
In general, data can be pulled (i.e., queried) whenever needed, but | In general, data can be pulled (i.e., queried) whenever needed, but | |||
in many cases, pushing the data (i.e., subscription) is more | in many cases, pushing the data (i.e., subscription) is more | |||
efficient, and can reduce the latency of a client detecting a change. | efficient, and it can reduce the latency of a client detecting a | |||
From the data consumer point of view, there are four types of data | change. From the data consumer point of view, there are four types | |||
from network devices that a telemetry data consumer can subscribe to or | of data from network devices that a telemetry data consumer can | |||
query: | subscribe to or query: | |||
* Simple Data: The data that are steadily available from some | * Simple Data: Data that are steadily available from some datastore | |||
datastore or static probes in network devices. | or static probes in network devices. | |||
* Derived Data: The data need to be synthesized or processed in | * Derived Data: Data that need to be synthesized or processed in the | |||
network from raw data from one or more network devices. The data | network from raw data from one or more network devices. The data | |||
processing function can be statically or dynamically loaded into | processing function can be statically or dynamically loaded into | |||
network devices. | network devices. | |||
* Event-triggered Data: The data are conditionally acquired based on | * Event-triggered Data: Data that are conditionally acquired based | |||
the occurrence of some events. An example of event-triggered data | on the occurrence of some events. An example of event-triggered | |||
could be an interface changing operational state between up and | data could be an interface changing operational state between up | |||
down. Such data can be actively pushed through subscription or | and down. Such data can be actively pushed through subscription | |||
passively polled through query. There are many ways to model | or passively polled through query. There are many ways to model | |||
events, including using Finite State Machine (FSM) or Event | events, including using Finite State Machine (FSM) or Event | |||
Condition Action (ECA) [I-D.wwx-netmod-event-yang]. | Condition Action (ECA) [NETMOD-ECA-POLICY]. | |||
* Streaming Data: The data are continuously generated. It can be | * Streaming Data: Data that are continuously generated. It can be a | |||
time series or the dump of databases. For example, an interface | time series or the dump of databases. For example, an interface | |||
packet counter is exported every second. The streaming data | packet counter is exported every second. The streaming data | |||
reflect realtime network states and metrics and require large | reflect real-time network states and metrics and require large | |||
bandwidth and processing power. The streaming data are always | bandwidth and processing power. The streaming data are always | |||
actively pushed to the subscribers. | actively pushed to the subscribers. | |||
The above telemetry data types are not mutually exclusive. Rather, | The above telemetry data types are not mutually exclusive. Rather, | |||
they are often composite. Derived data is composed of simple data; | they are often composite. Derived data is composed of simple data; | |||
Event-triggered data can be simple or derived; streaming data can be | event-triggered data can be simple or derived; and streaming data can | |||
based on some recurring event. The relationships of these data types | be based on some recurring event. The relationships of these data | |||
are illustrated in Figure 4. | types are illustrated in Figure 3. | |||
+----------------------+ +-----------------+ | +----------------------+ +-----------------+ | |||
| Event-triggered Data |<----+ Streaming Data | | | Event-Triggered Data |<----+ Streaming Data | | |||
+-------+---+----------+ +-----+---+-------+ | +-------+---+----------+ +-----+---+-------+ | |||
| | | | | | | | | | |||
| | | | | | | | | | |||
| | +--------------+ | | | | | +--------------+ | | | |||
| +-->| Derived Data |<--+ | | | +-->| Derived Data |<--+ | | |||
| +------+------ + | | | +------+------ + | | |||
| | | | | | | | |||
| V | | | V | | |||
| +--------------+ | | | +--------------+ | | |||
+------>| Simple Data |<------+ | +------>| Simple Data |<------+ | |||
+--------------+ | +--------------+ | |||
Figure 4: Data Type Relationship | Figure 3: Data Type Relationship | |||
Subscription usually deals with event-triggered data and streaming | Subscription usually deals with event-triggered data and streaming | |||
data, and query usually deals with simple data and derived data. But | data, and query usually deals with simple data and derived data. But | |||
the other ways are also possible. Advanced network telemetry | the other ways are also possible. Advanced network telemetry | |||
techniques are designed mainly for event-triggered or streaming data | techniques are designed mainly for event-triggered or streaming data | |||
subscription, and derived data query. | subscription and derived data query. | |||
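The way the data types in this section build on one another can be sketched with a small example. It is an illustration only: the sample counter values, the function names, and the threshold of 1000 packets per second are hypothetical.

```python
# Simple data: raw counter samples as (timestamp in seconds, packet count).
samples = [(0, 1000), (1, 1500), (2, 1600), (3, 4600)]


def derive_rates(samples):
    """Derived data: packets per second between consecutive samples."""
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        rates.append((t1, (c1 - c0) / (t1 - t0)))
    return rates


def threshold_events(rates, limit):
    """Event-triggered data: reported only when a rate exceeds the limit."""
    return [(t, r) for t, r in rates if r > limit]


# Streaming data: the full time series of derived rates, one per interval.
streamed = derive_rates(samples)

# Event-triggered data built on top of the derived series.
events = threshold_events(streamed, limit=1000)
```

Here the streamed series carries every interval, while the event list carries only the interval in which the rate crossed the threshold.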
3.4. Mapping Existing Mechanisms into the Framework | 3.4. Mapping Existing Mechanisms into the Framework | |||
The following table shows how the existing mechanisms (mainly | The following table shows how the existing mechanisms (mainly | |||
published in IETF and with the emphasis on the latest new | published in IETF and with the emphasis on the latest new | |||
technologies) are positioned in the framework. Given the vast body | technologies) are positioned in the framework. Given the vast body | |||
of existing work, we cannot provide an exhaustive list, so the | of existing work, we cannot provide an exhaustive list, so the | |||
mechanisms in the tables should be considered as just examples. | mechanisms in the tables should be considered as just examples. | |||
Also, some comprehensive protocols and techniques may cover multiple | Also, some comprehensive protocols and techniques may cover multiple | |||
aspects or modules of the framework, so a name in a block only | aspects or modules of the framework, so a name in a block only | |||
emphasizes one particular characteristic of it. More details about | emphasizes one particular characteristic of it. More details about | |||
some listed mechanisms can be found in Appendix A. | some listed mechanisms can be found in Appendix A. | |||
+-------------+-----------------+---------------+--------------+ | +===============+=================+================+============+ | |||
| | Management | Control | Forwarding | | | | Management | Control Plane | Forwarding | | |||
| | Plane | Plane | Plane | | | | Plane | | Plane | | |||
+-------------+-----------------+---------------+--------------+ | +===============+=================+================+============+ | |||
| data config.| gNMI, NETCONF, | gNMI, NETCONF,| NETCONF, | | | data | gNMI, NETCONF, | gNMI, NETCONF, | NETCONF, | | |||
| & subscribe | RESTCONF, SNMP, | RESTCONF, | RESTCONF, | | | configuration | RESTCONF, SNMP, | RESTCONF, | RESTCONF, | | |||
| | YANG-Push | YANG-Push | YANG-Push | | | and subscribe | YANG-Push | YANG-Push | YANG-Push | | |||
+-------------+-----------------+---------------+--------------+ | +---------------+-----------------+----------------+------------+ | |||
| data gen. & | MIB, | YANG | IOAM, PSAMP | | | data | MIB, YANG | YANG | IOAM, | | |||
| process | YANG | | PBT, AM, | | | generation | | | PSAMP, | | |||
+-------------+-----------------+---------------+--------------+ | | and process | | | PBT, AM | | |||
| data encode.| gRPC, HTTP, TCP | BMP, TCP | IPFIX, UDP | | +---------------+-----------------+----------------+------------+ | |||
| & export | | | | | | data encoding | gRPC, HTTP, TCP | BMP, TCP | IPFIX, UDP | | |||
+-------------+-----------------+---------------+--------------+ | | and export | | | | | |||
Figure 5: Existing Work Mapping | +---------------+-----------------+----------------+------------+ | |||
Table 2: Existing Work Mapping | ||||
Although the framework is generally suitable for any network | Although the framework is generally suitable for any network | |||
environments, the multi-domain telemetry has some unique challenges | environments, the multi-domain telemetry has some unique challenges | |||
which deserve further architectural consideration, which is out of | that deserve further architectural consideration, which is out of the | |||
the scope of this document. | scope of this document. | |||
4. Evolution of Network Telemetry Applications | 4. Evolution of Network Telemetry Applications | |||
Network telemetry is an evolving technical area. As the network | Network telemetry is an evolving technical area. As the network | |||
moves towards the automated operation, network telemetry applications | moves towards the automated operation, network telemetry applications | |||
undergo several stages of evolution which add new layer of | undergo several stages of evolution, which add a new layer of | |||
requirements to the underlying network telemetry techniques. Each | requirements to the underlying network telemetry techniques. Each | |||
stage is built upon the techniques adopted by the previous stages | stage is built upon the techniques adopted by the previous stages | |||
plus some new requirements. | plus some new requirements. | |||
Stage 0 - Static Telemetry: The telemetry data source and type are | Stage 0 - Static Telemetry: The telemetry data source and type are | |||
determined at design time. The network operator can only | determined at design time. The network operator can only | |||
configure how to use it with limited flexibility. | configure how to use it with limited flexibility. | |||
Stage 1 - Dynamic Telemetry: The custom telemetry data can be | Stage 1 - Dynamic Telemetry: The custom telemetry data can be | |||
dynamically programmed or configured at runtime without | dynamically programmed or configured at runtime without | |||
interrupting the network operation, allowing a trade-off among | interrupting the network operation, allowing a trade-off among | |||
resource, performance, flexibility, and coverage. | resource, performance, flexibility, and coverage. | |||
Stage 2 - Interactive Telemetry: The network operator can | Stage 2 - Interactive Telemetry: The network operator can | |||
continuously customize and fine-tune the telemetry data in real | continuously customize and fine-tune the telemetry data in real | |||
time to reflect the network operation's visibility requirements. | time to reflect the network operation's visibility requirements. | |||
Compared with Stage 1, the changes are frequent based on the real- | Compared with Stage 1, the changes are frequent based on the real- | |||
time feedback. At this stage, some tasks can be automated, but | time feedback. At this stage, some tasks can be automated, but | |||
human operators still need to sit in the middle to make decisions. | human operators still need to sit in the middle to make decisions. | |||
Stage 3 - Closed-loop Telemetry: The telemetry is free from the | Stage 3 - Closed-Loop Telemetry: The telemetry is free from the | |||
interference of human operators, except for generating the | interference of human operators, except for generating the | |||
reports. The intelligent network operation engine automatically | reports. The intelligent network operation engine automatically | |||
issues the telemetry data requests, analyzes the data, and updates | issues the telemetry data requests, analyzes the data, and updates | |||
the network operations in closed control loops. | the network operations in closed control loops. | |||
Existing technologies are ready for stage 0 and stage 1. Individual | Existing technologies are ready for Stages 0 and 1. Individual | |||
stage 2 and stage 3 applications are also possible now. However, the | applications for Stages 2 and 3 are also possible now. However, the | |||
future autonomic networks may need a comprehensive operation | future autonomic networks may need a comprehensive operation | |||
management system which works at stage 2 and stage 3 to cover all the | management system that works at Stages 2 and 3 to cover all the | |||
network operation tasks. A well-defined network telemetry framework | network operation tasks. A well-defined network telemetry framework | |||
is the first step towards this direction. | is the first step towards this direction. | |||
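A Stage 3 control loop, in which measurements drive the next telemetry request without a human in the middle, can be sketched as follows. This is a toy illustration under stated assumptions: the function name closed_loop_step, the utilization thresholds, and the interval bounds are hypothetical and are not defined by this document.

```python
def closed_loop_step(utilization, interval, high=0.8, low=0.3,
                     min_interval=1, max_interval=60):
    """Analyze one measurement and return the next export interval (seconds)."""
    if utilization > high:
        return max(min_interval, interval // 2)  # observe more closely
    if utilization < low:
        return min(max_interval, interval * 2)   # back off to save resources
    return interval                              # keep the current setting


interval = 16
history = []
for utilization in [0.2, 0.5, 0.9, 0.95, 0.1]:
    # Each measurement immediately adjusts the next telemetry request.
    interval = closed_loop_step(utilization, interval)
    history.append(interval)
```

In a Stage 2 system, the same decision would be proposed to a human operator rather than applied automatically.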
5. Security Considerations | 5. Security Considerations | |||
The complexity of network telemetry raises significant security | The complexity of network telemetry raises significant security | |||
implications. For example, telemetry data can be manipulated to | implications. For example, telemetry data can be manipulated to | |||
exhaust various network resources at each plane as well as the data | exhaust various network resources at each plane as well as the data | |||
consumer; falsified or tampered data can mislead the decision-making | consumer; falsified or tampered data can mislead the decision-making | |||
and paralyze networks; wrong configuration and programming for | process and paralyze networks; and wrong configuration and | |||
telemetry is equally harmful. The telemetry data is highly | programming for telemetry is equally harmful. The telemetry data is | |||
sensitive, which exposes a lot of information about the network and | highly sensitive, which exposes a lot of information about the | |||
its configuration. Some of that information can make designing | network and its configuration. Some of that information can make | |||
attacks against the network much easier (e.g., exact details of what | designing attacks against the network much easier (e.g., exact | |||
software and patches have been installed), and allows an attacker to | details of what software and patches have been installed) and allows | |||
determine whether a device may be subject to unprotected security | an attacker to determine whether a device may be subject to | |||
vulnerabilities. | unprotected security vulnerabilities. | |||
Given that this document has proposed a framework for network | Given that this document has proposed a framework for network | |||
telemetry and the telemetry mechanisms discussed are more extensive | telemetry and the telemetry mechanisms discussed are more extensive | |||
(in both message frequency and traffic amount) than the conventional | (in both message frequency and traffic amount) than the conventional | |||
network OAM concepts, we must also reflect that various new security | network OAM concepts, we must also anticipate that new security | |||
considerations may also arise. A number of techniques already exist | considerations may also arise. A number of techniques already | |||
for securing the forwarding plane, the control plane, and the | exist for securing the forwarding plane, control plane, and | |||
management plane in a network, but it is important to consider if any | management plane in a network, but it is important to consider if any | |||
new threat vectors are now being enabled via the use of network | new threat vectors are now being enabled via the use of network | |||
telemetry procedures and mechanisms. | telemetry procedures and mechanisms. | |||
This document proposes a conceptual architecture for collecting, | This document proposes a conceptual architecture for collecting, | |||
transporting, and analyzing a wide variety of data sources in support | transporting, and analyzing a wide variety of data sources in support | |||
of network applications. The protocols, data formats, and | of network applications. The protocols, data formats, and | |||
configurations chosen to implement this framework will dictate the | configurations chosen to implement this framework will dictate the | |||
specific security considerations. These considerations may include: | specific security considerations. These considerations may include: | |||
* Telemetry framework trust and policy model; | * Telemetry framework trust and policy models; | |||
* Role management and access control for enabling and disabling | * Role management and access control for enabling and disabling | |||
telemetry capabilities; | telemetry capabilities; | |||
* Protocol transport used for telemetry data and its inherent | * Protocol transport used for telemetry data and its inherent | |||
security capabilities; | security capabilities; | |||
* Telemetry data stores, storage encryption, methods of access, and | * Telemetry data stores, storage encryption, methods of access, and | |||
retention practices; | retention practices; | |||
* Tracking telemetry events and any abnormalities that might | * Tracking telemetry events and any abnormalities that might | |||
identify malicious attacks using telemetry interfaces. | identify malicious attacks using telemetry interfaces. | |||
* Authentication and integrity protection of telemetry data to make | * Authentication and integrity protection of telemetry data to make | |||
data more trustworthy. | data more trustworthy; and | |||
* Segregating the telemetry data traffic from the data traffic | * Segregating the telemetry data traffic from the data traffic | |||
carried over the network (e.g., historically management access and | carried over the network (e.g., historically management access and | |||
management data may be carried via an independent management | management data may be carried via an independent management | |||
network). | network). | |||
Some security considerations highlighted above may be minimized or | Some security considerations highlighted above may be minimized or | |||
negated with policy management of network telemetry. In a network | negated with policy management of network telemetry. In a network | |||
telemetry deployment it would be advantageous to separate telemetry | telemetry deployment, it would be advantageous to separate telemetry | |||
capabilities into different classes of policies, i.e., Role Based | capabilities into different classes of policies, i.e., Role-Based | |||
Access Control and Event-Condition-Action policies. Also, potential | Access Control and Event-Condition-Action policies. Also, potential | |||
conflicts between network telemetry mechanisms must be detected | conflicts between network telemetry mechanisms must be detected | |||
accurately and resolved quickly to avoid unnecessary network | accurately and resolved quickly to avoid unnecessary network | |||
telemetry traffic propagation escalating into an unintended or | telemetry traffic propagation escalating into an unintended or | |||
intended denial of service attack. | intended denial-of-service attack. | |||
Further study of the security issues will be required, and it is | Further study of the security issues will be required, and it is | |||
expected that the security mechanisms and protocols are developed and | expected that the security mechanisms and protocols are developed and | |||
deployed along with a network telemetry system. | deployed along with a network telemetry system. | |||
6. IANA Considerations | 6. IANA Considerations | |||
This document includes no request to IANA. | This document has no IANA actions. | |||
7. Contributors | ||||
The other contributors of this document are Tianran Zhou, Zhenbin Li, | ||||
Zhenqiang Li, Daniel King, Adrian Farrel, and Alexander Clemm. | ||||
8. Acknowledgments | ||||
We would like to thank Rob Wilton, Greg Mirsky, Randy Presuhn, Joe | ||||
Clarke, Victor Liu, James Guichard, Uri Blumenthal, Giuseppe | ||||
Fioccola, Yunan Gu, Parviz Yegani, Young Lee, Qin Wu, Gyan Mishra, | ||||
Ben Schwartz, Alexey Melnikov, Michael Scharf, Dhruv Dhody, Martin | ||||
Duke, Roman Danyliw, Warren Kumari, Sheng Jiang, Lars Eggert, Eric | ||||
Vyncke, Jean-Michel Combes, Erik Kline, Benjamin Kaduk, and many | ||||
others who have provided helpful comments and suggestions to improve | ||||
this document. | ||||
9. Informative References | 7. Informative References | |||
[gnmi] "gNMI - gRPC Network Management Interface", | [gnmi] Shakir, R., Shaikh, A., Borman, P., Hines, M., Lebsack, | |||
<https://github.com/openconfig/reference/tree/master/rpc/ | C., and C. Marrow, "gRPC Network Management Interface", | |||
gnmi>. | IETF 98, March 2017, | |||
<https://datatracker.ietf.org/meeting/98/materials/slides- | ||||
98-rtgwg-gnmi-intro-draft-openconfig-rtgwg-gnmi-spec-00>. | ||||
[gpb] "Google Protocol Buffers", | [gpb] Google Developers, "Protocol Buffers", | |||
<https://developers.google.com/protocol-buffers>. | <https://developers.google.com/protocol-buffers>. | |||
[grpc] "gRPC, A high performance, open-source universal RPC | [grpc] gRPC, "gRPC: A high performance, open source universal RPC | |||
framework", <https://grpc.io>. | framework", <https://grpc.io>. | |||
[I-D.ietf-grow-bmp-local-rib] | [IPPM-IOAM-DIRECT-EXPORT] | |||
Evens, T., Bayraktar, S., Bhardwaj, M., and P. Lucente, | ||||
"Support for Local RIB in BGP Monitoring Protocol (BMP)", | ||||
Work in Progress, Internet-Draft, draft-ietf-grow-bmp- | ||||
local-rib-13, 31 August 2021, | ||||
<https://www.ietf.org/archive/id/draft-ietf-grow-bmp- | ||||
local-rib-13.txt>. | ||||
[I-D.ietf-ippm-ioam-data] | ||||
Brockners, F., Bhandari, S., and T. Mizrahi, "Data Fields | ||||
for In-situ OAM", Work in Progress, Internet-Draft, draft- | ||||
ietf-ippm-ioam-data-16, 8 November 2021, | ||||
<https://www.ietf.org/archive/id/draft-ietf-ippm-ioam- | ||||
data-16.txt>. | ||||
[I-D.ietf-ippm-ioam-direct-export] | ||||
Song, H., Gafni, B., Zhou, T., Li, Z., Brockners, F., | Song, H., Gafni, B., Zhou, T., Li, Z., Brockners, F., | |||
Bhandari, S., Sivakolundu, R., and T. Mizrahi, "In-situ | Bhandari, S., Ed., Sivakolundu, R., and T. Mizrahi, Ed., | |||
OAM Direct Exporting", Work in Progress, Internet-Draft, | "In-situ OAM Direct Exporting", Work in Progress, | |||
draft-ietf-ippm-ioam-direct-export-07, 13 October 2021, | Internet-Draft, draft-ietf-ippm-ioam-direct-export-07, 13 | |||
<https://www.ietf.org/archive/id/draft-ietf-ippm-ioam- | October 2021, <https://datatracker.ietf.org/doc/html/ | |||
direct-export-07.txt>. | draft-ietf-ippm-ioam-direct-export-07>. | |||
[I-D.ietf-netconf-distributed-notif] | [IPPM-POSTCARD-BASED-TELEMETRY] | |||
Song, H., Mirsky, G., Filsfils, C., Abdelsalam, A., Zhou, | ||||
T., Li, Z., Mishra, G., Shin, J., and K. Lee, "In-Situ OAM | ||||
Marking-based Direct Export", Work in Progress, Internet- | ||||
Draft, draft-song-ippm-postcard-based-telemetry-12, 12 May | ||||
2022, <https://datatracker.ietf.org/doc/html/draft-song- | ||||
ippm-postcard-based-telemetry-12>. | ||||
[NETCONF-DISTRIB-NOTIF] | ||||
Zhou, T., Zheng, G., Voit, E., Graf, T., and P. Francois, | Zhou, T., Zheng, G., Voit, E., Graf, T., and P. Francois, | |||
"Subscription to Distributed Notifications", Work in | "Subscription to Distributed Notifications", Work in | |||
Progress, Internet-Draft, draft-ietf-netconf-distributed- | Progress, Internet-Draft, draft-ietf-netconf-distributed- | |||
notif-02, 6 May 2021, <https://www.ietf.org/archive/id/ | notif-03, 10 January 2022, | |||
draft-ietf-netconf-distributed-notif-02.txt>. | <https://datatracker.ietf.org/doc/html/draft-ietf-netconf- | |||
distributed-notif-03>. | ||||
[I-D.ietf-netconf-udp-notif] | [NETCONF-UDP-NOTIF] | |||
Zheng, G., Zhou, T., Graf, T., Francois, P., Feng, A. H., | Zheng, G., Zhou, T., Graf, T., Francois, P., Feng, A. H., | |||
and P. Lucente, "UDP-based Transport for Configured | and P. Lucente, "UDP-based Transport for Configured | |||
Subscriptions", Work in Progress, Internet-Draft, draft- | Subscriptions", Work in Progress, Internet-Draft, draft- | |||
ietf-netconf-udp-notif-04, 21 October 2021, | ietf-netconf-udp-notif-05, 4 March 2022, | |||
<https://www.ietf.org/archive/id/draft-ietf-netconf-udp- | <https://datatracker.ietf.org/doc/html/draft-ietf-netconf- | |||
notif-04.txt>. | udp-notif-05>. | |||
[I-D.irtf-nmrg-ibn-concepts-definitions] | [NETMOD-ECA-POLICY] | |||
Clemm, A., Ciavaglia, L., Granville, L. Z., and J. | Wu, Q., Bryskin, I., Birkholz, H., Liu, X., and B. Claise, | |||
Tantsura, "Intent-Based Networking - Concepts and | "A YANG Data model for ECA Policy Management", Work in | |||
Definitions", Work in Progress, Internet-Draft, draft- | Progress, Internet-Draft, draft-ietf-netmod-eca-policy-01, | |||
irtf-nmrg-ibn-concepts-definitions-05, 2 September 2021, | 19 February 2021, <https://datatracker.ietf.org/doc/html/ | |||
<https://www.ietf.org/archive/id/draft-irtf-nmrg-ibn- | draft-ietf-netmod-eca-policy-01>. | |||
concepts-definitions-05.txt>. | ||||
[I-D.pedro-nmrg-anticipated-adaptation] | [NMRG-ANTICIPATED-ADAPTATION] | |||
Martinez-Julia, P., "Exploiting External Event Detectors | Martinez-Julia, P., Ed., "Exploiting External Event | |||
to Anticipate Resource Requirements for the Elastic | Detectors to Anticipate Resource Requirements for the | |||
Adaptation of SDN/NFV Systems", Work in Progress, | Elastic Adaptation of SDN/NFV Systems", Work in Progress, | |||
Internet-Draft, draft-pedro-nmrg-anticipated-adaptation- | Internet-Draft, draft-pedro-nmrg-anticipated-adaptation- | |||
02, 29 June 2018, <https://www.ietf.org/archive/id/draft- | 02, 29 June 2018, <https://datatracker.ietf.org/doc/html/ | |||
pedro-nmrg-anticipated-adaptation-02.txt>. | draft-pedro-nmrg-anticipated-adaptation-02>. | |||
[I-D.song-ippm-postcard-based-telemetry] | ||||
Song, H., Mirsky, G., Filsfils, C., Abdelsalam, A., Zhou, | ||||
T., Li, Z., Shin, J., and K. Lee, "In-Situ OAM Marking- | ||||
based Direct Export", Work in Progress, Internet-Draft, | ||||
draft-song-ippm-postcard-based-telemetry-11, 15 November | ||||
2021, <https://www.ietf.org/archive/id/draft-song-ippm- | ||||
postcard-based-telemetry-11.txt>. | ||||
[I-D.song-opsawg-dnp4iq] | [NMRG-IBN-CONCEPTS-DEFINITIONS] | |||
Song, H. and J. Gong, "Requirements for Interactive Query | Clemm, A., Ciavaglia, L., Granville, L. Z., and J. | |||
with Dynamic Network Probes", Work in Progress, Internet- | Tantsura, "Intent-Based Networking - Concepts and | |||
Draft, draft-song-opsawg-dnp4iq-01, 19 June 2017, | Definitions", Work in Progress, Internet-Draft, draft- | |||
<https://www.ietf.org/archive/id/draft-song-opsawg-dnp4iq- | irtf-nmrg-ibn-concepts-definitions-09, 24 March 2022, | |||
01.txt>. | <https://datatracker.ietf.org/doc/html/draft-irtf-nmrg- | |||
ibn-concepts-definitions-09>. | ||||
[I-D.song-opsawg-ifit-framework] | [OPSAWG-DNP4IQ] | |||
Song, H., Qin, F., Chen, H., Jin, J., and J. Shin, "In- | Song, H., Ed. and J. Gong, "Requirements for Interactive | |||
situ Flow Information Telemetry", Work in Progress, | Query with Dynamic Network Probes", Work in Progress, | |||
Internet-Draft, draft-song-opsawg-ifit-framework-16, 21 | Internet-Draft, draft-song-opsawg-dnp4iq-01, 19 June 2017, | |||
October 2021, <https://www.ietf.org/archive/id/draft-song- | <https://datatracker.ietf.org/doc/html/draft-song-opsawg- | |||
opsawg-ifit-framework-16.txt>. | dnp4iq-01>. | |||
[I-D.wwx-netmod-event-yang] | [OPSAWG-IFIT-FRAMEWORK] | |||
Wu, Q., Bryskin, I., Birkholz, H., Liu, X., and B. Claise, | Song, H., Qin, F., Chen, H., Jin, J., and J. Shin, "A | |||
"A YANG Data model for ECA Policy Management", Work in | Framework for In-situ Flow Information Telemetry", Work in | |||
Progress, Internet-Draft, draft-wwx-netmod-event-yang-10, | Progress, Internet-Draft, draft-song-opsawg-ifit- | |||
1 November 2020, <https://www.ietf.org/archive/id/draft- | framework-17, 22 February 2022, | |||
wwx-netmod-event-yang-10.txt>. | <https://datatracker.ietf.org/doc/html/draft-song-opsawg- | |||
ifit-framework-17>. | ||||
[RFC1157] Case, J., Fedor, M., Schoffstall, M., and J. Davin, | [RFC1157] Case, J., Fedor, M., Schoffstall, M., and J. Davin, | |||
"Simple Network Management Protocol (SNMP)", RFC 1157, | "Simple Network Management Protocol (SNMP)", RFC 1157, | |||
DOI 10.17487/RFC1157, May 1990, | DOI 10.17487/RFC1157, May 1990, | |||
<https://www.rfc-editor.org/info/rfc1157>. | <https://www.rfc-editor.org/info/rfc1157>. | |||
[RFC2578] McCloghrie, K., Ed., Perkins, D., Ed., and J. | [RFC2578] McCloghrie, K., Ed., Perkins, D., Ed., and J. | |||
Schoenwaelder, Ed., "Structure of Management Information | Schoenwaelder, Ed., "Structure of Management Information | |||
Version 2 (SMIv2)", STD 58, RFC 2578, | Version 2 (SMIv2)", STD 58, RFC 2578, | |||
DOI 10.17487/RFC2578, April 1999, | DOI 10.17487/RFC2578, April 1999, | |||
skipping to change at page 35, line 22 ¶ | skipping to change at line 1578 ¶ | |||
Hybrid Performance Monitoring", RFC 8889, | Hybrid Performance Monitoring", RFC 8889, | |||
DOI 10.17487/RFC8889, August 2020, | DOI 10.17487/RFC8889, August 2020, | |||
<https://www.rfc-editor.org/info/rfc8889>. | <https://www.rfc-editor.org/info/rfc8889>. | |||
[RFC8924] Aldrin, S., Pignataro, C., Ed., Kumar, N., Ed., Krishnan, | [RFC8924] Aldrin, S., Pignataro, C., Ed., Kumar, N., Ed., Krishnan, | |||
R., and A. Ghanwani, "Service Function Chaining (SFC) | R., and A. Ghanwani, "Service Function Chaining (SFC) | |||
Operations, Administration, and Maintenance (OAM) | Operations, Administration, and Maintenance (OAM) | |||
Framework", RFC 8924, DOI 10.17487/RFC8924, October 2020, | Framework", RFC 8924, DOI 10.17487/RFC8924, October 2020, | |||
<https://www.rfc-editor.org/info/rfc8924>. | <https://www.rfc-editor.org/info/rfc8924>. | |||
[xml] "Extensible Markup Language (XML) 1.0 (Fifth Edition)", | [RFC9069] Evens, T., Bayraktar, S., Bhardwaj, M., and P. Lucente, | |||
<https://www.w3.org/TR/2008/REC-xml-20081126/>. | "Support for Local RIB in the BGP Monitoring Protocol | |||
(BMP)", RFC 9069, DOI 10.17487/RFC9069, February 2022, | ||||
<https://www.rfc-editor.org/info/rfc9069>. | ||||
[y1731] "ITU-T Y.1731: OAM Functions and Mechanisms for Ethernet | [RFC9197] Brockners, F., Ed., Bhandari, S., Ed., and T. Mizrahi, | |||
based networks, 2015", | Ed., "Data Fields for In Situ Operations, Administration, | |||
and Maintenance (IOAM)", RFC 9197, DOI 10.17487/RFC9197, | ||||
May 2022, <https://www.rfc-editor.org/info/rfc9197>. | ||||
[W3C.REC-xml-20081126] | ||||
Bray, T., Paoli, J., Sperberg-McQueen, M., Maler, E., and | ||||
F. Yergeau, "Extensible Markup Language (XML) 1.0 (Fifth | ||||
Edition)", World Wide Web Consortium Recommendation REC- | ||||
xml-20081126, November 2008, | ||||
<https://www.w3.org/TR/2008/REC-xml-20081126>. | ||||
[y1731] ITU-T, "Operations, administration and maintenance (OAM) | ||||
functions and mechanisms for Ethernet-based networks", | ||||
ITU-T Recommendation G.8013/Y.1731, August 2015, | ||||
<https://www.itu.int/rec/T-REC-Y.1731/en>. | <https://www.itu.int/rec/T-REC-Y.1731/en>. | |||
Appendix A. A Survey on Existing Network Telemetry Techniques | Appendix A. A Survey on Existing Network Telemetry Techniques | |||
In this non-normative appendix, we provide an overview of some | In this non-normative appendix, we provide an overview of some | |||
existing techniques and standard proposals for each network telemetry | existing techniques and standard proposals for each network telemetry | |||
module. | module. | |||
A.1. Management Plane Telemetry | A.1. Management Plane Telemetry | |||
A.1.1. Push Extensions for NETCONF | A.1.1. Push Extensions for NETCONF | |||
NETCONF [RFC6241] is a popular network management protocol | NETCONF [RFC6241] is a popular network management protocol | |||
recommended by the IETF. Its core strength is for managing | recommended by the IETF. Its core strength is for managing | |||
configuration, but can also be used for data collection. YANG-Push | configuration, but it can also be used for data collection. | |||
[RFC8641] [RFC8639] extends NETCONF and enables subscriber | YANG-Push [RFC8639] [RFC8641] extends NETCONF and enables subscriber | |||
applications to request a continuous, customized stream of updates | applications to request a continuous, customized stream of updates | |||
from a YANG datastore. Providing such visibility into changes made | from a YANG datastore. Providing such visibility into changes made | |||
upon YANG configuration and operational objects enables new | upon YANG configuration and operational objects enables new | |||
capabilities based on the remote mirroring of configuration and | capabilities based on the remote mirroring of configuration and | |||
operational state. Moreover, distributed data collection mechanism | operational state. Moreover, a distributed data collection mechanism | |||
[I-D.ietf-netconf-distributed-notif] via UDP based publication | [NETCONF-DISTRIB-NOTIF] via a UDP-based publication channel | |||
channel [I-D.ietf-netconf-udp-notif] provides enhanced efficiency for | [NETCONF-UDP-NOTIF] provides enhanced efficiency for the NETCONF- | |||
the NETCONF based telemetry. | based telemetry. | |||
A.1.2. gRPC Network Management Interface | A.1.2. gRPC Network Management Interface | |||
gRPC Network Management Interface (gNMI) [gnmi] is a network | gRPC Network Management Interface (gNMI) [gnmi] is a network | |||
management protocol based on the gRPC [grpc] RPC (Remote Procedure | management protocol based on the gRPC [grpc] Remote Procedure Call | |||
Call) framework. With a single gRPC service definition, both | (RPC) framework. With a single gRPC service definition, both | |||
configuration and telemetry can be covered. gRPC is an HTTP/2 | configuration and telemetry can be covered. gRPC is an open-source | |||
[RFC7540]-based open-source micro-service communication framework. | micro-service communication framework based on HTTP/2 [RFC7540]. It | |||
It provides a number of capabilities which are well-suited for | provides a number of capabilities that are well-suited for network | |||
network telemetry, including: | telemetry, including: | |||
* Full-duplex streaming transport model combined with a binary | * A full-duplex streaming transport model; when combined with a | |||
encoding mechanism provides good telemetry efficiency. | binary encoding mechanism, it provides good telemetry efficiency. | |||
* gRPC provides higher-level features consistency across platforms | * A higher-level feature consistency across platforms that common | |||
that common HTTP/2 libraries typically do not. This | HTTP/2 libraries typically do not provide. This characteristic is | |||
characteristic is especially valuable for the fact that telemetry | especially valuable for the fact that telemetry data collectors | |||
data collectors normally reside on a large variety of platforms. | normally reside on a large variety of platforms. | |||
* The built-in load-balancing and failover mechanism. | * A built-in load-balancing and failover mechanism. | |||
A.2. Control Plane Telemetry | A.2. Control Plane Telemetry | |||
A.2.1. BGP Monitoring Protocol | A.2.1. BGP Monitoring Protocol | |||
BGP Monitoring Protocol (BMP) [RFC7854] is used to monitor BGP | BMP [RFC7854] is used to monitor BGP sessions and is intended to | |||
sessions and is intended to provide a convenient interface for | provide a convenient interface for obtaining route views. | |||
obtaining route views. | ||||
The BGP routing information is collected from the monitored device(s) | BGP routing information is collected from the monitored device(s) to | |||
to the BMP monitoring station by setting up the BMP TCP session. The | the BMP monitoring station by setting up the BMP TCP session. The | |||
BGP peers are monitored by the BMP Peer Up and Peer Down | BGP peers are monitored by the BMP Peer Up and Peer Down | |||
Notifications. The BGP routes (including Adjacency_RIB_In [RFC7854], | notifications. The BGP routes (including Adj_RIB_In [RFC7854], | |||
Adjacency_RIB_out [RFC8671], and Local_Rib | Adj_RIB_out [RFC8671], and local RIB [RFC9069]) are encapsulated in | |||
[I-D.ietf-grow-bmp-local-rib]) are encapsulated in the BMP Route | the BMP Route Monitoring Message and the BMP Route Mirroring Message, | |||
Monitoring Message and the BMP Route Mirroring Message, providing | providing both an initial table dump and real-time route updates. In | |||
both an initial table dump and real-time route updates. In addition, | addition, BGP statistics are reported through the BMP Stats Report | |||
BGP statistics are reported through the BMP Stats Report Message, | Message, which could be either timer-triggered or event-driven. | |||
which could be either timer-triggered or event-driven. Future BMP | Future BMP extensions could further enrich BGP monitoring | |||
extensions could further enrich BGP monitoring applications. | applications. | |||
A.3. Data Plane Telemetry | A.3. Data Plane Telemetry | |||
A.3.1. The Alternate Marking (AM) technology | A.3.1. Alternate-Marking (AM) Technology | |||
The Alternate Marking method enables efficient measurements of packet | The Alternate-Marking method enables efficient measurements of packet | |||
loss, delay, and jitter both in IP and Overlay Networks, as presented | loss, delay, and jitter both in IP and Overlay Networks, as presented | |||
in [RFC8321] and [RFC8889]. | in [RFC8321] and [RFC8889]. | |||
This technique can be applied to point-to-point and multipoint-to- | This technique can be applied to point-to-point and multipoint-to- | |||
multipoint flows. Alternate Marking creates batches of packets by | multipoint flows. Alternate Marking creates batches of packets by | |||
alternating the value of 1 bit (or a label) of the packet header. | alternating the value of 1 bit (or a label) of the packet header. | |||
These batches of packets are unambiguously recognized over the | These batches of packets are unambiguously recognized over the | |||
network and the comparison of packet counters for each batch allows | network, and the comparison of packet counters for each batch allows | |||
the packet loss calculation. The same idea can be applied to delay | the packet loss calculation. The same idea can be applied to delay | |||
measurement by selecting ad hoc packets with a marking bit dedicated | measurement by selecting ad hoc packets with a marking bit dedicated | |||
for delay measurements. | for delay measurements. | |||
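The batch comparison described above can be sketched as follows. This is a minimal illustration, not the procedure specified in [RFC8321]: it assumes in-order delivery, assumes that no entire batch is lost, and models each packet only by its marking bit. The function names (mark_stream, batch_counts, batch_loss) are hypothetical.

```python
from itertools import groupby


def mark_stream(num_packets, batch_size):
    """Marking bit of each sent packet; the bit flips every batch_size packets."""
    return [(i // batch_size) % 2 for i in range(num_packets)]


def batch_counts(observed_bits):
    """Count packets in each run of identical marking bits (one run per batch)."""
    return [sum(1 for _ in run) for _, run in groupby(observed_bits)]


def batch_loss(upstream_bits, downstream_bits):
    """Per-batch packet loss between two measurement points."""
    up = batch_counts(upstream_bits)
    down = batch_counts(downstream_bits)
    if len(up) != len(down):
        # Assumption of this sketch: both points see the same batch boundaries.
        raise ValueError("batch boundaries differ; a whole batch may be lost")
    return [u - d for u, d in zip(up, down)]


# Twelve packets in batches of four; packets 1 and 6 are lost in transit.
sent = mark_stream(12, 4)
received = [bit for i, bit in enumerate(sent) if i not in {1, 6}]
print(batch_loss(sent, received))  # prints [1, 1, 0]
```

Because each measurement point only keeps one counter per batch, the two points never need to exchange per-packet state; they compare counters after a batch has closed.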
Alternate Marking method needs two counters each marking period for | The Alternate-Marking method needs two counters each marking period | |||
each flow under monitor. For instance, by considering n measurement | for each flow under monitor. For instance, by considering n | |||
points and m monitored flows, the order of magnitude of the packet | measurement points and m monitored flows, the order of magnitude of | |||
counters for each time interval is n*m*2 (1 per color). | the packet counters for each time interval is n*m*2 (1 per color). | |||
Since networks offer rich sets of network performance measurement | Since networks offer rich sets of network performance measurement | |||
data (e.g., packet counters), conventional approaches run into | data (e.g., packet counters), conventional approaches run into | |||
limitations. The bottleneck is the generation and export of the data | limitations. The bottleneck is the generation and export of the data | |||
and the amount of data that can be reasonably collected from the | and the amount of data that can be reasonably collected from the | |||
network. In addition, management tasks related to determining and | network. In addition, management tasks related to determining and | |||
configuring which data to generate lead to significant deployment | configuring which data to generate lead to significant deployment | |||
challenges. | challenges. | |||
The Multipoint Alternate Marking approach, described in [RFC8889], | The Multipoint Alternate-Marking approach, described in [RFC8889], | |||
aims to resolve this issue and make the performance monitoring more | aims to resolve this issue and make the performance monitoring more | |||
flexible in case a detailed analysis is not needed. | flexible in case a detailed analysis is not needed. | |||
An application orchestrates network performance measurements tasks | An application orchestrates network performance measurement tasks | |||
across the network to allow for optimized monitoring. The | across the network to allow for optimized monitoring. The | |||
application can choose how roughly or precisely to configure | application can choose how roughly or precisely to configure | |||
measurement points depending on the application's requirements. | measurement points depending on the application's requirements. | |||
Using Alternate Marking, it is possible to monitor a Multipoint | Using Alternate Marking, it is possible to monitor a Multipoint | |||
Network without in depth examination by using the Network Clustering | Network without in-depth examination by using Network Clustering | |||
(subnetworks that are portions of the entire network that preserve | (subnetworks that are portions of the entire network that preserve | |||
the same property of the entire network, called clusters). So in the | the same property of the entire network, called clusters). So in the | |||
case that there is packet loss or the delay is too high then the | case where there is packet loss or the delay is too high, the | |||
specific filtering criteria could be applied to gather a more | specific filtering criteria could be applied to gather a more | |||
detailed analysis by using a different combination of clusters up to | detailed analysis by using a different combination of clusters up to | |||
a per-flow measurement as described in Alternate-Marking (AM) | a per-flow measurement as described in the Alternate-Marking document | |||
[RFC8321]. | [RFC8321]. | |||
In summary, an application can configure end-to-end network | In summary, an application can configure end-to-end network | |||
monitoring. If the network does not experience issues, this | monitoring. If the network does not experience issues, this | |||
approximate monitoring is good enough and is very cheap in terms of | approximate monitoring is good enough and is very cheap in terms of | |||
network resources. However, in case of problems, the application | network resources. However, in case of problems, the application | |||
becomes aware of the issues from this approximate monitoring and, in | becomes aware of the issues from this approximate monitoring and, in | |||
order to localize the portion of the network that has issues, | order to localize the portion of the network that has issues, | |||
configures the measurement points more extensively, allowing more | configures the measurement points more extensively, allowing more | |||
detailed monitoring to be performed. After the detection and | detailed monitoring to be performed. After the detection and | |||
resolution of the problem, the initial approximate monitoring can be | resolution of the problem, the initial approximate monitoring can be | |||
used again. | used again. | |||
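The coarse-then-fine strategy summarized above can be sketched as follows. This is only an illustration under stated assumptions: it relies on the cluster property from [RFC8889] that, absent loss, the packets entering a cluster equal the packets leaving it, and the names cluster_loss and clusters_to_refine, the threshold parameter, and the counter layout are all hypothetical.

```python
def cluster_loss(input_counts, output_counts):
    """Packets that entered a cluster but did not leave it in one marking period."""
    return sum(input_counts) - sum(output_counts)


def clusters_to_refine(clusters, threshold=0):
    """Return the clusters whose loss exceeds the threshold.

    clusters maps a cluster name to (input_counts, output_counts), where each
    element is a per-measurement-point packet counter for the same batch.
    Only the returned clusters need more detailed (e.g., per-flow) monitoring.
    """
    return [
        name
        for name, (inputs, outputs) in clusters.items()
        if cluster_loss(inputs, outputs) > threshold
    ]


counters = {
    "cluster-1": ([100, 50], [150]),   # no loss: keep the approximate monitoring
    "cluster-2": ([80], [40, 37]),     # 3 packets missing: monitor more closely
}
print(clusters_to_refine(counters))  # prints ['cluster-2']
```

Once the problem in a flagged cluster is resolved, the application would drop back to the cluster-level counters, which is the cheap steady state described above.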
A.3.2. Dynamic Network Probe | A.3.2. Dynamic Network Probe | |||
Hardware-based Dynamic Network Probe (DNP) [I-D.song-opsawg-dnp4iq] | A hardware-based Dynamic Network Probe (DNP) [OPSAWG-DNP4IQ] provides | |||
proposes a programmable means to customize the data that an | a programmable means to customize the data that an application | |||
application collects from the data plane. A direct benefit of DNP is | collects from the data plane. A direct benefit of DNP is the | |||
the reduction of the exported data. A full DNP solution covers | reduction of the exported data. A full DNP solution covers several | |||
several components including data source, data subscription, and data | components including data source, data subscription, and data | |||
generation. The data subscription needs to define the derived data | generation. The data subscription needs to define the derived data | |||
which can be composed and derived from the raw data sources. The | that can be composed and derived from raw data sources. The data | |||
data generation takes advantage of the moderate in-network computing | generation takes advantage of the moderate in-network computing to | |||
to produce the desired data. | produce the desired data. | |||
While DNP can introduce unforeseeable flexibility to the data plane | While DNP can introduce unforeseeable flexibility to the data plane | |||
telemetry, it also faces some challenges. It requires a flexible | telemetry, it also faces some challenges. It requires a flexible | |||
data plane that can be dynamically reprogrammed at run-time. The | data plane that can be dynamically reprogrammed at runtime. The | |||
programming API is yet to be defined. | programming Application Programming Interface (API) is yet to be | |||
defined. | ||||
A.3.3. IP Flow Information Export (IPFIX) Protocol | A.3.3. IP Flow Information Export (IPFIX) Protocol | |||
Traffic on a network can be seen as a set of flows passing through | Traffic on a network can be seen as a set of flows passing through | |||
network elements. IP Flow Information Export (IPFIX) [RFC7011] | network elements. IPFIX [RFC7011] provides a means of transmitting | |||
provides a means of transmitting traffic flow information for | traffic flow information for administrative or other purposes. A | |||
administrative or other purposes. A typical IPFIX enabled system | typical IPFIX-enabled system includes a pool of Metering Processes | |||
includes a pool of Metering Processes that collects data packets at | that collects data packets at one or more Observation Points, | |||
one or more Observation Points, optionally filters them and | optionally filters them, and aggregates information about these | |||
aggregates information about these packets. An Exporter then gathers | packets. An Exporter then gathers each of the Observation Points | |||
each of the Observation Points together into an Observation Domain | together into an Observation Domain and sends this information via | |||
and sends this information via the IPFIX protocol to a Collector. | the IPFIX protocol to a Collector. | |||
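The roles described above can be sketched as follows. This is a simplified model and not the IPFIX wire format of [RFC7011]: the packet dictionaries, the 5-tuple flow key, the counter names, and the functions meter and export are assumptions made for illustration.

```python
from collections import defaultdict


def meter(packets, keep=lambda pkt: True):
    """Hypothetical Metering Process: filter packets, then aggregate per flow."""
    flows = defaultdict(lambda: {"packets": 0, "octets": 0})
    for pkt in packets:
        if not keep(pkt):
            continue  # optional filtering step
        key = (pkt["src"], pkt["dst"], pkt["sport"], pkt["dport"], pkt["proto"])
        flows[key]["packets"] += 1
        flows[key]["octets"] += pkt["length"]
    return dict(flows)


def export(observation_points):
    """Hypothetical Exporter: gather per-point flow tables into one domain view."""
    records = []
    for point, flows in observation_points.items():
        for key, counters in flows.items():
            records.append({"observation_point": point, "flow": key, **counters})
    return records


packets = [
    {"src": "10.0.0.1", "dst": "10.0.0.2", "sport": 1234, "dport": 80,
     "proto": 6, "length": 100},
    {"src": "10.0.0.1", "dst": "10.0.0.2", "sport": 1234, "dport": 80,
     "proto": 6, "length": 200},
    {"src": "10.0.0.3", "dst": "10.0.0.2", "sport": 5353, "dport": 53,
     "proto": 17, "length": 60},
]

# Keep only TCP (protocol 6) at a single Observation Point named "op1".
tcp_flows = meter(packets, keep=lambda pkt: pkt["proto"] == 6)
records = export({"op1": tcp_flows})
print(records)
```

A Collector would receive the list produced by export; the transport and template handling that IPFIX defines for that step are outside this sketch.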
A.3.4. In-Situ OAM | A.3.4. In Situ OAM | |||
Classical passive and active monitoring and measurement techniques | Classical passive and active monitoring and measurement techniques | |||
are either inaccurate or resource-consuming. It is preferable to | are either inaccurate or resource consuming. It is preferable to | |||
directly acquire data associated with a flow's packets when the | directly acquire data associated with a flow's packets when the | |||
packets pass through a network. In-situ OAM (iOAM) | packets pass through a network. IOAM [RFC9197], a data generation | |||
[I-D.ietf-ippm-ioam-data], a data generation technique, embeds a new | technique, embeds a new instruction header to user packets, and the | |||
instruction header to user packets and the instruction directs the | instruction directs the network nodes to add the requested data to | |||
network nodes to add the requested data to the packets. Thus, at the | the packets. Thus, at the path's end, the packet's experience gained | |||
path end, the packet's experience gained on the entire forwarding | on the entire forwarding path can be collected. Such firsthand data | |||
path can be collected. Such firsthand data is invaluable to many | is invaluable to many network OAM applications. | |||
network OAM applications. | ||||
However, iOAM also faces some challenges. The issues on performance | However, IOAM also faces some challenges. The issues on performance | |||
impact, security, scalability and overhead limits, encapsulation | impact, security, scalability and overhead limits, encapsulation | |||
difficulties in some protocols, and cross-domain deployment need to | difficulties in some protocols, and cross-domain deployment need to | |||
be addressed. | be addressed. | |||
A.3.5. Postcard Based Telemetry | A.3.5. Postcard-Based Telemetry | |||
The postcard-based telemetry, as embodied in IOAM DEX | The postcard-based telemetry, as embodied in IOAM Direct Export (DEX) | |||
[I-D.ietf-ippm-ioam-direct-export] and IOAM Marking | [IPPM-IOAM-DIRECT-EXPORT] and IOAM Marking | |||
[I-D.song-ippm-postcard-based-telemetry], is a complementary | [IPPM-POSTCARD-BASED-TELEMETRY], is a complementary technique to the | |||
technique to the passport-based IOAM. PBT directly exports data at | passport-based IOAM [RFC9197]. PBT directly exports data at each | |||
each node through an independent packet. At the cost of higher | node through an independent packet. At the cost of higher bandwidth | |||
bandwidth overhead and the need for data correlation, PBT shows | overhead and the need for data correlation, PBT shows several unique | |||
several unique advantages. It can also help to identify packet drop | advantages. It can also help to identify packet drop location in | |||
location in case a packet is dropped on its forwarding path. | case a packet is dropped on its forwarding path. | |||
A.3.6. Existing OAM for Specific Data Planes | A.3.6. Existing OAM for Specific Data Planes | |||
Various data planes raise unique OAM requirements. IETF has | Various data planes raise unique OAM requirements. IETF has | |||
published OAM technique and framework documents (e.g., [RFC8924] and | published OAM technique and framework documents (e.g., [RFC8924] and | |||
[RFC5085]) targeting different data planes such as Multi-Protocol | [RFC5085]) targeting different data planes such as Multiprotocol | |||
Label Switching (MPLS), L2 Virtual Private Network (L2-VPN), Network | Label Switching (MPLS), L2 Virtual Private Network (VPN), Network | |||
Virtualization Overlays (NVO3), Virtual Extensible LAN (VXLAN), Bit | Virtualization over Layer 3 (NVO3), Virtual Extensible LAN (VXLAN), | |||
Indexed Explicit Replication (BIER), Service Function Chaining (SFC), | Bit Index Explicit Replication (BIER), Service Function Chaining | |||
Segment Routing (SR), and Deterministic Networking (DETNET). The | (SFC), Segment Routing (SR), and Deterministic Networking (DETNET). | |||
aforementioned data plane telemetry techniques can be used to enhance | The aforementioned data plane telemetry techniques can be used to | |||
the OAM capability on such data planes. | enhance the OAM capability on such data planes. | |||
A.4. External Data and Event Telemetry | A.4. External Data and Event Telemetry | |||
A.4.1. Sources of External Events | A.4.1. Sources of External Events | |||
To ensure that the information provided by external event detectors | To ensure that the information provided by external event detectors | |||
and used by the network management solutions is meaningful for | and used by the network management solutions is meaningful for | |||
management purposes, the network telemetry framework must ensure that | management purposes, the network telemetry framework must ensure that | |||
such detectors (sources) are easily connected to the management | such detectors (sources) are easily connected to the management | |||
solutions (sinks). This requires the specification of a list of | solutions (sinks). This requires the specification of a list of | |||
potential external data sources that could be of interest in network | potential external data sources that could be of interest in network | |||
management and match it to the connectors and/or interfaces required | management and matching it to the connectors and/or interfaces | |||
to connect them. | required to connect them. | |||
Categories of external event sources that may be of interest to | Categories of external event sources that may be of interest to | |||
network management include:: | network management include: | |||
* Smart objects and sensors. With the consolidation of the Internet | * Smart objects and sensors. With the consolidation of the Internet | |||
of Things~(IoT) any network system will have many smart objects | of Things (IoT), any network system will have many smart objects | |||
attached to its physical surroundings and logical operation | attached to its physical surroundings and logical operation | |||
environments. Most of these objects will be essentially based on | environments. Most of these objects will be essentially based on | |||
sensors of many kinds (e.g., temperature, humidity, presence) and | sensors of many kinds (e.g., temperature, humidity, and presence), | |||
the information they provide can be very useful for the management | and the information they provide can be very useful for the | |||
of the network, even when they are not specifically deployed for | management of the network, even when they are not specifically | |||
such purpose. Elements of this source type will usually provide a | deployed for such purpose. Elements of this source type will | |||
specific protocol for interaction, especially one of those | usually provide a specific protocol for interaction, especially | |||
protocols related to IoT, such as the Constrained Application | one of the protocols related to IoT, such as the Constrained | |||
Protocol (CoAP). | Application Protocol (CoAP). | |||
* Online news reporters. Several online news services have the | * Online news reporters. Several online news services have the | |||
ability to provide enormous quantity of information about | ability to provide an enormous quantity of information about | |||
different events occurring in the world. Some of those events can | different events occurring in the world. Some of those events can | |||
impact on the network system managed by a specific framework and, | have an impact on the network system managed by a specific | |||
therefore, such information may be of interest to the management | framework; therefore, such information may be of interest to the | |||
solution. For instance, diverse security reports, such as the | management solution. For instance, diverse security reports, such | |||
Common Vulnerabilities and Exposures (CVE), can be issued by the | as Common Vulnerabilities and Exposures (CVEs), can be issued by | |||
corresponding authority and used by the management solution to | the corresponding authority and used by the management solution to | |||
update the managed system if needed. Instead of a specific | update the managed system, if needed. Instead of a specific | |||
protocol and data format, the sources of this kind of information | protocol and data format, the sources of this kind of information | |||
usually follow a relaxed but structured format. This format will | usually follow a relaxed but structured format. This format will | |||
be part of both the ontology and information model of the | be part of both the ontology and information model of the | |||
telemetry framework. | telemetry framework. | |||
* Global event analyzers. The advance of Big Data analyzers | * Global event analyzers. The advance of big data analyzers | |||
provides a huge amount of information and, more interestingly, the | provides a huge amount of information and, more interestingly, the | |||
identification of events detected by analyzing many data streams | identification of events detected by analyzing many data streams | |||
from different origins. In contrast with the other types of | from different origins. In contrast with the other types of | |||
sources, which are focused on specific events, the detectors of | sources, which are focused on specific events, the detectors of | |||
this source type will detect generic events. For example, during | this source type will detect generic events. For example, during | |||
a sport event some unexpected movement makes it fascinating and | a sports event, some unexpected movement makes it fascinating, and | |||
many people connect to sites that are reporting on the event. The | many people connect to sites that are reporting on the event. The | |||
underlying networks supporting the services that cover the event | underlying networks supporting the services that cover the event | |||
can be affected by such situation, so their management solutions | can be affected by such situation, so their management solutions | |||
should be aware of it. In contrast with the other source types, a | should be aware of it. In contrast with the other source types, a | |||
new information model, format, and reporting protocol is required | new information model, format, and reporting protocol is required | |||
to integrate the detectors of this type with the management | to integrate the detectors of this type with the management | |||
solution. | solution. | |||
Additional types of detector types can be added to the system, but | Additional detector types can be added to the system, but generally | |||
they will be generally the result of composing the properties offered | they will be the result of composing the properties offered by these | |||
by these main classes. | main classes. | |||
A.4.2. Connectors and Interfaces | A.4.2. Connectors and Interfaces | |||
For allowing external event detectors to be properly integrated with | For allowing external event detectors to be properly integrated with | |||
other management solutions, both elements must expose interfaces and | other management solutions, both elements must expose interfaces and | |||
protocols that are subject to their particular objective. Since | protocols that are subject to their particular objective. Since | |||
external event detectors will be focused on providing their | external event detectors will be focused on providing their | |||
information to their main consumers, which generally will not be | information to their main consumers, which generally will not be | |||
limited to the network management solutions, the framework must | limited to the network management solutions, the framework must | |||
include the definition of the required connectors for ensuring the | include the definition of the required connectors for ensuring the | |||
interconnection between detectors (sources) and their consumers | interconnection between detectors (sources) and their consumers | |||
within the management systems (sinks) are effective. | within the management systems (sinks) are effective. | |||
In some situations, the interconnection between the external event | In some situations, the interconnection between external event | |||
detectors and the management system is via the management plane. For | detectors and the management system is via the management plane. For | |||
those situations there will be a special connector that provides the | those situations, there will be a special connector that provides the | |||
typical interfaces found in most other elements connected to the | typical interfaces found in most other elements connected to the | |||
management plane. For instance, the interfaces could accomplish this | management plane. For instance, the interfaces could accomplish this | |||
with a specific data model (YANG) and specific telemetry protocol, | with a specific data model (YANG) and specific telemetry protocol, | |||
such as NETCONF, YANG-Push, or gRPC. | such as NETCONF, YANG-Push, or gRPC. | |||
Acknowledgments | ||||
We would like to thank Rob Wilton, Greg Mirsky, Randy Presuhn, Joe | ||||
Clarke, Victor Liu, James Guichard, Uri Blumenthal, Giuseppe | ||||
Fioccola, Yunan Gu, Parviz Yegani, Young Lee, Qin Wu, Gyan Mishra, | ||||
Ben Schwartz, Alexey Melnikov, Michael Scharf, Dhruv Dhody, Martin | ||||
Duke, Roman Danyliw, Warren Kumari, Sheng Jiang, Lars Eggert, Éric | ||||
Vyncke, Jean-Michel Combes, Erik Kline, Benjamin Kaduk, and many | ||||
others who have provided helpful comments and suggestions to improve | ||||
this document. | ||||
Contributors | ||||
The other contributors of this document are Tianran Zhou, Zhenbin Li, | ||||
Zhenqiang Li, Daniel King, Adrian Farrel, and Alexander Clemm. | ||||
Authors' Addresses | Authors' Addresses | |||
Haoyu Song | Haoyu Song | |||
Futurewei | Futurewei | |||
United States of America | United States of America | |||
Email: haoyu.song@futurewei.com | Email: haoyu.song@futurewei.com | |||
Fengwei Qin | Fengwei Qin | |||
China Mobile | China Mobile | |||
P.R. China | China | |||
Email: qinfengwei@chinamobile.com | Email: qinfengwei@chinamobile.com | |||
Pedro Martinez-Julia | Pedro Martinez-Julia | |||
NICT | NICT | |||
Japan | Japan | |||
Email: pedro@nict.go.jp | Email: pedro@nict.go.jp | |||
Laurent Ciavaglia | Laurent Ciavaglia | |||
Rakuten Mobile | Rakuten Mobile | |||
France | France | |||
Email: laurent.ciavaglia@rakuten.com | Email: laurent.ciavaglia@rakuten.com | |||
Aijun Wang | Aijun Wang | |||
China Telecom | China Telecom | |||
P.R. China | China | |||
Email: wangaj3@chinatelecom.cn | ||||
Email: wangaj.bri@chinatelecom.cn | ||||
End of changes. 219 change blocks. | ||||
703 lines changed or deleted | 722 lines changed or added | |||