rfc9417.original | rfc9417.txt | |||
---|---|---|---|---|
OPSAWG B. Claise | Internet Engineering Task Force (IETF) B. Claise | |||
Internet-Draft J. Quilbeuf | Request for Comments: 9417 J. Quilbeuf | |||
Intended status: Informational Huawei | Category: Informational Huawei | |||
Expires: 7 July 2023 D. Lopez | ISSN: 2070-1721 D. Lopez | |||
Telefonica I+D | Telefonica I+D | |||
D. Voyer | D. Voyer | |||
Bell Canada | Bell Canada | |||
T. Arumugam | T. Arumugam | |||
Cisco Systems, Inc. | Consultant | |||
3 January 2023 | June 2023 | |||
Service Assurance for Intent-based Networking Architecture | Service Assurance for Intent-Based Networking Architecture | |||
draft-ietf-opsawg-service-assurance-architecture-13 | ||||
Abstract | Abstract | |||
This document describes an architecture that aims at assuring that | This document describes an architecture that provides some assurance | |||
service instances are running as expected. As services rely upon | that service instances are running as expected. As services rely | |||
multiple sub-services provided by a variety of elements including the | upon multiple subservices provided by a variety of elements, | |||
underlying network devices and functions, getting the assurance of a | including the underlying network devices and functions, getting the | |||
healthy service is only possible with a holistic view of all involved | assurance of a healthy service is only possible with a holistic view | |||
elements. This architecture not only helps to correlate the service | of all involved elements. This architecture not only helps to | |||
degradation with symptoms of a specific network component but also to | correlate the service degradation with symptoms of a specific network | |||
list the services impacted by the failure or degradation of a | component, but it also lists the services impacted by the failure or | |||
specific network component. | degradation of a specific network component. | |||
Status of This Memo | Status of This Memo | |||
This Internet-Draft is submitted in full conformance with the | This document is not an Internet Standards Track specification; it is | |||
provisions of BCP 78 and BCP 79. | published for informational purposes. | |||
Internet-Drafts are working documents of the Internet Engineering | ||||
Task Force (IETF). Note that other groups may also distribute | ||||
working documents as Internet-Drafts. The list of current Internet- | ||||
Drafts is at https://datatracker.ietf.org/drafts/current/. | ||||
Internet-Drafts are draft documents valid for a maximum of six months | This document is a product of the Internet Engineering Task Force | |||
and may be updated, replaced, or obsoleted by other documents at any | (IETF). It represents the consensus of the IETF community. It has | |||
time. It is inappropriate to use Internet-Drafts as reference | received public review and has been approved for publication by the | |||
material or to cite them other than as "work in progress." | Internet Engineering Steering Group (IESG). Not all documents | |||
approved by the IESG are candidates for any level of Internet | ||||
Standard; see Section 2 of RFC 7841. | ||||
This Internet-Draft will expire on 7 July 2023. | Information about the current status of this document, any errata, | |||
and how to provide feedback on it may be obtained at | ||||
https://www.rfc-editor.org/info/rfc9417. | ||||
Copyright Notice | Copyright Notice | |||
Copyright (c) 2023 IETF Trust and the persons identified as the | Copyright (c) 2023 IETF Trust and the persons identified as the | |||
document authors. All rights reserved. | document authors. All rights reserved. | |||
This document is subject to BCP 78 and the IETF Trust's Legal | This document is subject to BCP 78 and the IETF Trust's Legal | |||
Provisions Relating to IETF Documents (https://trustee.ietf.org/ | Provisions Relating to IETF Documents | |||
license-info) in effect on the date of publication of this document. | (https://trustee.ietf.org/license-info) in effect on the date of | |||
Please review these documents carefully, as they describe your rights | publication of this document. Please review these documents | |||
and restrictions with respect to this document. Code Components | carefully, as they describe your rights and restrictions with respect | |||
extracted from this document must include Revised BSD License text as | to this document. Code Components extracted from this document must | |||
described in Section 4.e of the Trust Legal Provisions and are | include Revised BSD License text as described in Section 4.e of the | |||
provided without warranty as described in the Revised BSD License. | Trust Legal Provisions and are provided without warranty as described | |||
in the Revised BSD License. | ||||
Table of Contents | Table of Contents | |||
1. Terminology | 1. Introduction | |||
2. Introduction | 2. Terminology | |||
3. A Functional Architecture | 3. A Functional Architecture | |||
3.1. Translating a Service Instance Configuration into an | 3.1. Translating a Service Instance Configuration into an | |||
Assurance Graph | Assurance Graph | |||
3.1.1. Circular Dependencies | 3.1.1. Circular Dependencies | |||
3.2. Intent and Assurance Graph | 3.2. Intent and Assurance Graph | |||
3.3. Subservices | 3.3. Subservices | |||
3.4. Building the Expression Graph from the Assurance Graph | 3.4. Building the Expression Graph from the Assurance Graph | |||
3.5. Open Interfaces with YANG Modules | 3.5. Open Interfaces with YANG Modules | |||
3.6. Handling Maintenance Windows | 3.6. Handling Maintenance Windows | |||
3.7. Flexible Functional Architecture | 3.7. Flexible Functional Architecture | |||
3.8. Time window for symptoms history | 3.8. Time Window for Symptoms' History | |||
3.9. New Assurance Graph Generation | 3.9. New Assurance Graph Generation | |||
4. Security Considerations | 4. IANA Considerations | |||
5. IANA Considerations | 5. Security Considerations | |||
6. Contributors | 6. References | |||
7. References | 6.1. Normative References | |||
7.1. Normative References | 6.2. Informative References | |||
7.2. Informative References | Acknowledgements | |||
Appendix A. Changes between revisions | Contributors | |||
Acknowledgements | Authors' Addresses | |||
Authors' Addresses | ||||
1. Terminology | ||||
SAIN agent: A functional component that communicates with a device, a | ||||
set of devices, or another agent to build an expression graph from a | ||||
received assurance graph and perform the corresponding computation of | ||||
the health status and symptoms. A SAIN agent might be running | ||||
directly on the device it monitors. | ||||
Assurance case: "An assurance case is a structured argument, | ||||
supported by evidence, intended to justify that a system is | ||||
acceptably assured relative to a concern (such as safety or security) | ||||
in the intended operating environment" [Piovesan2017]. | ||||
Service instance: A specific instance of a service. | ||||
Intent: "A set of operational goals (that a network should meet) and | ||||
outcomes (that a network is supposed to deliver), defined in a | ||||
declarative manner without specifying how to achieve or implement | ||||
them" [RFC9315]. | ||||
Subservice: Part or functionality of the network system that can be | ||||
independently assured as a single entity in assurance graph. | ||||
Assurance graph: A Directed Acyclic Graph (DAG) representing the | ||||
assurance case for one or several service instances. The nodes (also | ||||
known as vertices in the context of DAG) are the service instances | ||||
themselves and the subservices, the edges indicate a dependency | ||||
relation. | ||||
SAIN collector: A functional component that fetches or receives the | ||||
computer-consumable output of the SAIN agent(s) and process it | ||||
locally (including displaying it in a user-friendly form). | ||||
DAG: Directed Acyclic Graph. | ||||
ECMP: Equal Cost Multiple Paths | ||||
Expression graph: A generic term for a DAG representing a computation | ||||
in SAIN. More specific terms are: | ||||
* Subservice expressions: Is an expression graph representing all | ||||
the computations to execute for a subservice. | ||||
* Service expressions: Is an expression graph representing all the | ||||
computations to execute for a service instance, i.e., including | ||||
the computations for all dependent subservices. | ||||
* Global computation graph: Is an expression graph representing all | ||||
the computations to execute for all service instances (i.e., all | ||||
computations performed). | ||||
Dependency: The directed relationship between subservice instances in | ||||
the assurance graph. | ||||
Metric: A piece of information retrieved from the network running the | ||||
assured service. | ||||
Metric engine: A functional component, part of the SAIN agent, that | ||||
maps metrics to a list of candidate metric implementations depending | ||||
on the network element. | ||||
Metric implementation: Actual way of retrieving a metric from a | ||||
network element. | ||||
Network service YANG module: describes the characteristics of a | ||||
service as agreed upon with consumers of that service [RFC8199]. | ||||
Service orchestrator: Quoting RFC8199, "Network Service YANG Modules | ||||
describe the characteristics of a service, as agreed upon with | ||||
consumers of that service. That is, a service module does not expose | ||||
the detailed configuration parameters of all participating network | ||||
elements and features but describes an abstract model that allows | ||||
instances of the service to be decomposed into instance data | ||||
according to the Network Element YANG Modules of the participating | ||||
network elements. The service-to-element decomposition is a separate | ||||
process; the details depend on how the network operator chooses to | ||||
realize the service. For the purpose of this document, the term | ||||
"orchestrator" is used to describe a system implementing such a | ||||
process." | ||||
SAIN orchestrator: A functional component that is in charge of | ||||
fetching the configuration specific to each service instance and | ||||
converting it into an assurance graph. | ||||
Health status: Score and symptoms indicating whether a service | ||||
instance or a subservice is "healthy". A non-maximal score must | ||||
always be explained by one or more symptoms. | ||||
Health score: Integer ranging from 0 to 100 indicating the health of | ||||
a subservice. A score of 0 means that the subservice is broken, a | ||||
score of 100 means that the subservice in question is operating as | ||||
expected. The special value -1 can be used to specify that no value | ||||
could be computed for that health-score, for instance if some metric | ||||
needed for that computation could not be collected. | ||||
Strongly connected component: subset of a directed graph such that | ||||
there is a (directed) path from any node of the subset to any other | ||||
node. A DAG does not contain any strongly connected component. | ||||
Symptom: Reason explaining why a service instance or a subservice is | ||||
not completely healthy. | ||||
2. Introduction | 1. Introduction | |||
Network service YANG modules [RFC8199] describe the configuration, | Network Service YANG Modules [RFC8199] describe the configuration, | |||
state data, operations, and notifications of abstract representations | state data, operations, and notifications of abstract representations | |||
of services implemented on one or multiple network elements. | of services implemented on one or multiple network elements. | |||
Service orchestrators use Network service YANG modules that will | Service orchestrators use Network Service YANG Modules that will | |||
infer network-wide configuration and, therefore the invocation of the | infer network-wide configuration and, therefore, the invocation of | |||
appropriate device modules (Section 3 of [RFC8969]). Knowing that a | the appropriate device modules (Section 3 of [RFC8969]). Knowing | |||
configuration is applied doesn't imply that the provisioned service | that a configuration is applied doesn't imply that the provisioned | |||
instance is up and running as expected. For instance, the service | service instance is up and running as expected. For instance, the | |||
might be degraded because of a failure in the network, the service | service might be degraded because of a failure in the network, the | |||
quality may be degraded, or a service function may be reachable at | service quality may be degraded, or a service function may be | |||
the IP level but does not provide its intended function. Thus, the | reachable at the IP level but does not provide its intended function. | |||
network operator must monitor the service's operational data at the | Thus, the network operator must monitor the service's operational | |||
same time as the configuration (Section 3.3 of [RFC8969]). To feed | data at the same time as the configuration (Section 3.3 of | |||
that task, the industry has been standardizing on telemetry to push | [RFC8969]). To fuel that task, the industry has been standardizing | |||
network element performance information (e.g., | on telemetry to push network element performance information (e.g., | |||
[I-D.ietf-opsawg-yang-vpn-service-pm]). | [RFC9375]). | |||
A network administrator needs to monitor their network and services | A network administrator needs to monitor its network and services as | |||
as a whole, independently of the management protocols. With | a whole, independently of the management protocols. With different | |||
different protocols come different data models, and different ways to | protocols come different data models and different ways to model the | |||
model the same type of information. When network administrators deal | same type of information. When network administrators deal with | |||
with multiple management protocols, the network management entities | multiple management protocols, the network management entities have | |||
have to perform the difficult and time-consuming job of mapping data | to perform the difficult and time-consuming job of mapping data | |||
models: e.g., the model used for configuration with the model used | models, e.g., the model used for configuration with the model used | |||
for monitoring when separate models or protocols are used. This | for monitoring when separate models or protocols are used. This | |||
problem is compounded by a large, disparate set of data sources (MIB | problem is compounded by a large, disparate set of data sources | |||
modules, YANG models [RFC7950], IPFIX information elements [RFC7011], | (e.g., MIB modules, YANG data models [RFC7950], IP Flow Information | |||
syslog plain text [RFC5424], TACACS+ [RFC8907], RADIUS [RFC2865], | Export (IPFIX) information elements [RFC7011], syslog plain text | |||
etc.). In order to avoid this data model mapping, the industry | [RFC5424], Terminal Access Controller Access-Control System Plus | |||
converged on model-driven telemetry to stream the service operational | (TACACS+) [RFC8907], RADIUS [RFC2865], etc.). In order to avoid this | |||
data, reusing the YANG models used for configuration. Model-driven | data model mapping, the industry converged on model-driven telemetry | |||
telemetry greatly facilitates the notion of closed-loop automation | to stream the service operational data, reusing the YANG data models | |||
whereby events and updated operational state streamed from the | used for configuration. Model-driven telemetry greatly facilitates | |||
network drive remediation changes back into the network. | the notion of closed-loop automation, whereby events and updated | |||
operational states streamed from the network drive remediation change | ||||
back into the network. | ||||
However, it proves difficult for network operators to correlate the | However, it proves difficult for network operators to correlate the | |||
service degradation with the network root cause. For example, "Why | service degradation with the network root cause, for example, "Why | |||
does my layer 3 virtual private network (L3VPN) fail to connect?" or | does my layer 3 virtual private network (L3VPN) fail to connect?" or | |||
"Why is this specific service not highly responsive?". The reverse, | "Why is this specific service not highly responsive?" The reverse, | |||
i.e., which services are impacted when a network component fails or | i.e., which services are impacted when a network component fails or | |||
degrades, is also important for operators. For example, "Which | degrades, is also important for operators, for example, "Which | |||
services are impacted when this specific optic decibel milliwatt | services are impacted when this specific optic decibel milliwatt | |||
(dBm) begins to degrade?", "Which applications are impacted by an | (dBm) begins to degrade?", "Which applications are impacted by an | |||
imbalance in this equal cost multiple paths (ECMP) bundle?", or "Is | imbalance in this Equal-Cost Multipath (ECMP) bundle?", or "Is that | |||
that issue actually impacting any other customers?". This task | issue actually impacting any other customers?" This task usually | |||
usually falls under the so-called "Service Impact Analysis" | falls under the so-called "Service Impact Analysis" functional block. | |||
functional block. | ||||
In this document, we propose an architecture implementing Service | This document defines an architecture implementing Service Assurance | |||
Assurance for Intent-Based Networking (SAIN). Intent-based | for Intent-based Networking (SAIN). Intent-based approaches are | |||
approaches are often declarative, starting from a statement of "The | often declarative, starting from a statement of "The service works as | |||
service works as expected" and trying to enforce it. However, some | expected" and trying to enforce it. However, some already-defined | |||
already defined services might have been designed using a different | services might have been designed using a different approach. | |||
approach. Aligned with Section 3.3 of [RFC7149], and instead of | Aligned with Section 3.3 of [RFC7149], and instead of requiring a | |||
requiring a declarative intent as a starting point, this architecture | declarative intent as a starting point, this architecture focuses on | |||
focuses on already defined services and tries to infer the meaning of | already-defined services and tries to infer the meaning of "The | |||
"The service works as expected". To do so, the architecture works | service works as expected". To do so, the architecture works from an | |||
from an assurance graph, deduced from the configuration pushed to the | assurance graph, deduced from the configuration pushed to the device | |||
device for enabling the service instance. If the SAIN orchestrator | for enabling the service instance. If the SAIN orchestrator supports | |||
supports it, the service model (Section 2 of [RFC8309]) or the | it, the service model (Section 2 of [RFC8309]) or the network model | |||
network model (Section 2.1 of [RFC8969]) can also be used to build | (Section 2.1 of [RFC8969]) can also be used to build the assurance | |||
the assurance graph. In that case and if the service model includes | graph. In that case and if the service model includes the | |||
the declarative intent as well, the SAIN orchestrator can rely on the | declarative intent as well, the SAIN orchestrator can rely on the | |||
declared intent instead of inferring it. The assurance graph may | declared intent instead of inferring it. The assurance graph may | |||
also be explicitly completed to add an intent not exposed in the | also be explicitly completed to add an intent not exposed in the | |||
service model itself. | service model itself. | |||
The assurance graph of a service instance is decomposed into | The assurance graph of a service instance is decomposed into | |||
components, which are then assured independently. The top of the | components, which are then assured independently. The top of the | |||
assurance graph represents the service instance to assure, and its | assurance graph represents the service instance to assure, and its | |||
children represent components identified as its direct dependencies; | children represent components identified as its direct dependencies; | |||
each component can have dependencies as well. Components involved in | each component can have dependencies as well. Components involved in | |||
the assurance graph of a service are called subservices. The SAIN | the assurance graph of a service are called subservices. The SAIN | |||
orchestrator updates automatically the assurance graph when the | orchestrator updates the assurance graph automatically when the | |||
service instance is modified. | service instance is modified. | |||
When a service is degraded, the SAIN architecture will highlight | When a service is degraded, the SAIN architecture will highlight | |||
where in the assurance service graph to look, as opposed to going hop | where in the assurance service graph to look, as opposed to going hop | |||
by hop to troubleshoot the issue. More precisely, the SAIN | by hop to troubleshoot the issue. More precisely, the SAIN | |||
architecture will associate to each service instance a list of | architecture will associate to each service instance a list of | |||
symptoms originating from specific subservices, corresponding to | symptoms originating from specific subservices, corresponding to | |||
components of the network. These components are good candidates for | components of the network. These components are good candidates for | |||
explaining the source of a service degradation. Not only can this | explaining the source of a service degradation. Not only can this | |||
architecture help to correlate service degradation with network root | architecture help to correlate service degradation with network root | |||
cause/symptoms, but it can deduce from the assurance graph the list | cause/symptoms, but it can deduce from the assurance graph the list | |||
of service instances impacted by a component degradation/failure. | of service instances impacted by a component degradation/failure. | |||
This added value informs the operational team where to focus its | This added value informs the operational team where to focus its | |||
attention for maximum return. Indeed, the operational team is likely | attention for maximum return. Indeed, the operational team is likely | |||
to focus their priority on the degrading/failing components impacting | to focus their priority on the degrading/failing components impacting | |||
the highest number of their customers, especially the ones with the | the highest number of their customers, especially the ones with the | |||
SLA contracts involving penalties in case of failure. | Service-Level Agreement (SLA) contracts involving penalties in case | |||
of failure. | ||||
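The reverse lookup described above (which service instances are impacted when a component degrades) can be sketched as a traversal of the assurance graph against the direction of its dependency edges. The following Python fragment is only an illustration: the node names, the "service/" and "subservice/" prefixes, and the dictionary layout are assumptions made for this example and are not defined by the architecture or by its companion YANG modules.

```python
from collections import defaultdict

# Hypothetical assurance graph: each node maps to the nodes it depends on.
# All names below are illustrative and do not come from the document.
DEPENDS_ON = {
    "service/l3vpn-a": ["subservice/tunnel-1"],
    "service/l3vpn-b": ["subservice/tunnel-1", "subservice/tunnel-2"],
    "subservice/tunnel-1": ["subservice/interface-pe1-eth0"],
    "subservice/tunnel-2": ["subservice/interface-pe2-eth0"],
    "subservice/interface-pe1-eth0": [],
    "subservice/interface-pe2-eth0": [],
}


def impacted_services(depends_on, failed_node):
    """Return, sorted, the service instances that transitively depend on failed_node."""
    # Reverse the dependency edges: for each node, record who depends on it.
    dependents = defaultdict(set)
    for node, deps in depends_on.items():
        for dep in deps:
            dependents[dep].add(node)

    # Walk upwards from the failed node, collecting every ancestor once.
    seen, stack = set(), [failed_node]
    while stack:
        current = stack.pop()
        for parent in dependents[current]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)

    # Assumption of this sketch: service instances carry a "service/" prefix.
    return sorted(n for n in seen if n.startswith("service/"))
```

In this toy graph, a symptom on "subservice/interface-pe1-eth0" reaches both service instances through "subservice/tunnel-1", whereas a symptom on "subservice/interface-pe2-eth0" reaches only "service/l3vpn-b". Counting the impacted instances per component is one possible way to rank where the operational team looks first.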
This architecture provides the building blocks to assure both | This architecture provides the building blocks to assure both | |||
physical and virtual entities and is flexible with respect to | physical and virtual entities and is flexible with respect to | |||
services and subservices, of (distributed) graphs, and of components | services and subservices of (distributed) graphs and components | |||
(Section 3.7). | (Section 3.7). | |||
The architecture presented in this document is implemented by a set | The architecture presented in this document is implemented by a set | |||
of YANG modules defined in a companion document | of YANG modules defined in a companion document [RFC9418]. These | |||
[I-D.ietf-opsawg-service-assurance-yang]. These YANG modules | YANG modules properly define the interfaces between the various | |||
properly define the interfaces between the various components of the | components of the architecture to foster interoperability. | |||
architecture in order to foster interoperability. | ||||
2. Terminology | ||||
SAIN agent: A functional component that communicates with a device, | ||||
a set of devices, or another agent to build an expression graph | ||||
from a received assurance graph and perform the corresponding | ||||
computation of the health status and symptoms. A SAIN agent might | ||||
be running directly on the device it monitors. | ||||
Assurance case: "An assurance case is a structured argument, | ||||
supported by evidence, intended to justify that a system is | ||||
acceptably assured relative to a concern (such as safety or | ||||
security) in the intended operating environment" [Piovesan2017]. | ||||
Service instance: A specific instance of a service. | ||||
Intent: "A set of operational goals (that a network should meet) and | ||||
outcomes (that a network is supposed to deliver) defined in a | ||||
declarative manner without specifying how to achieve or implement | ||||
them" [RFC9315]. | ||||
Subservice: A part or functionality of the network system that can | ||||
be independently assured as a single entity in an assurance graph. | ||||
Assurance graph: A Directed Acyclic Graph (DAG) representing the | ||||
assurance case for one or several service instances. The nodes | ||||
(also known as vertices in the context of DAG) are the service | ||||
instances themselves and the subservices; the edges indicate a | ||||
dependency relation. | ||||
SAIN collector: A functional component that fetches or receives the | ||||
computer-consumable output of the SAIN agent(s) and processes it | ||||
locally (including displaying it in a user-friendly form). | ||||
DAG: Directed Acyclic Graph. | ||||
ECMP: Equal-Cost Multipath. | ||||
Expression graph: A generic term for a DAG representing a | ||||
computation in SAIN. More specific terms are listed below: | ||||
Subservice expressions: | ||||
An expression graph representing all the computations to | ||||
execute for a subservice. | ||||
Service expressions: | ||||
An expression graph representing all the computations to | ||||
execute for a service instance, i.e., including the | ||||
computations for all dependent subservices. | ||||
Global computation graph: | ||||
An expression graph representing all the computations to | ||||
execute for all service instances (i.e., all computations | ||||
performed). | ||||
Dependency: The directed relationship between subservice instances | ||||
in the assurance graph. | ||||
Metric: A piece of information retrieved from the network running | ||||
the assured service. | ||||
Metric engine: A functional component, part of the SAIN agent, that | ||||
maps metrics to a list of candidate metric implementations, | ||||
depending on the network element. | ||||
Metric implementation: The actual way of retrieving a metric from a | ||||
network element. | ||||
Network Service YANG Module: The characteristics of a service, as | ||||
agreed upon with consumers of that service [RFC8199]. | ||||
Service orchestrator: "Network Service YANG Modules describe the | ||||
characteristics of a service, as agreed upon with consumers of | ||||
that service. That is, a service module does not expose the | ||||
detailed configuration parameters of all participating network | ||||
elements and features but describes an abstract model that allows | ||||
instances of the service to be decomposed into instance data | ||||
according to the Network Element YANG Modules of the participating | ||||
network elements. The service-to-element decomposition is a | ||||
separate process; the details depend on how the network operator | ||||
chooses to realize the service. For the purpose of this document, | ||||
the term "orchestrator" is used to describe a system implementing | ||||
such a process" [RFC8199]. | ||||
SAIN orchestrator: A functional component that is in charge of | ||||
fetching the configuration specific to each service instance and | ||||
converting it into an assurance graph. | ||||
Health status: The score and symptoms indicating whether a service | ||||
instance or a subservice is "healthy". A non-maximal score must | ||||
always be explained by one or more symptoms. | ||||
Health score: An integer ranging from 0 to 100 that indicates the | ||||
health of a subservice. A score of 0 means that the subservice is | ||||
broken, a score of 100 means that the subservice in question is | ||||
operating as expected, and the special value -1 can be used to | ||||
specify that no value could be computed for that health score, for | ||||
instance, if some metric needed for that computation could not be | ||||
collected. | ||||
Strongly connected component: A subset of a directed graph such that | ||||
there is a (directed) path from any node of the subset to any | ||||
other node. A DAG does not contain any strongly connected | ||||
component. | ||||
Symptom: A reason explaining why a service instance or a subservice | ||||
is not completely healthy. | ||||
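Since the assurance graph is defined as a DAG, an implementation may want to reject a candidate graph that contains a cycle. A minimal sketch follows, assuming the graph is held as a Python dictionary mapping each node to the nodes it depends on. That representation and the function name are assumptions of this example. The check relies on graphlib from the Python standard library (version 3.9 or later).

```python
from graphlib import CycleError, TopologicalSorter


def is_valid_assurance_graph(depends_on):
    """Return True if the dependency mapping is acyclic, i.e., a DAG."""
    try:
        # static_order() is lazy; consuming it forces the cycle check,
        # which raises CycleError when the graph contains a cycle.
        tuple(TopologicalSorter(depends_on).static_order())
    except CycleError:
        return False
    return True
```

A mapping such as {"svc": ["a"], "a": ["b"], "b": []} passes the check, while {"a": ["b"], "b": ["a"]} fails it because "a" and "b" can each be reached from the other.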
3. A Functional Architecture | 3. A Functional Architecture | |||
The goal of SAIN is to assure that service instances are operating as | The goal of SAIN is to assure that service instances are operating as | |||
expected (i.e., the observed service is matching the expected | expected (i.e., the observed service is matching the expected | |||
service) and if not, to pinpoint what is wrong. More precisely, SAIN | service) and, if not, to pinpoint what is wrong. More precisely, | |||
computes a score for each service instance and outputs symptoms | SAIN computes a score for each service instance and outputs symptoms | |||
explaining that score. The only valid situation where no symptoms | explaining that score. The only valid situation where no symptoms | |||
are returned is when the score is maximal, indicating that no issues | are returned is when the score is maximal, indicating that no issues | |||
were detected for that service instance. The score augmented with | were detected for that service instance. The score augmented with | |||
the symptoms is called the health status. The exact meaning of the | the symptoms is called the health status. The exact meaning of the | |||
health score value is out of scope of this document. However the | health score value is out of scope of this document. However, the | |||
following constraints should be followed: the higher the score, the | following constraints should be followed: the higher the score, the | |||
better the service health is; the two extrema being 0 meaning the | better the service health is and the two extrema are 0 meaning the | |||
service is completely broken and 100 meaning the service is | service is completely broken, and 100 meaning the service is | |||
completely operational. | completely operational. | |||
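The constraints on the health status stated above can be captured in a small data structure. The sketch below is illustrative only: the class name, field names, and error messages are assumptions, and the normative definitions are left to the companion YANG modules. It also assumes that the special value -1 counts as a non-maximal score and therefore needs an accompanying symptom.

```python
from dataclasses import dataclass

MAX_SCORE = 100    # service completely operational
MIN_SCORE = 0      # service completely broken
NOT_COMPUTED = -1  # special value: no score could be computed


@dataclass(frozen=True)
class HealthStatus:
    """A health score together with the symptoms that explain it."""

    score: int
    symptoms: tuple = ()

    def __post_init__(self):
        in_range = MIN_SCORE <= self.score <= MAX_SCORE
        if self.score != NOT_COMPUTED and not in_range:
            raise ValueError("score must be -1 or within 0..100")
        # Only a maximal score may come without symptoms. Treating -1 as
        # non-maximal is an assumption of this sketch.
        if self.score != MAX_SCORE and not self.symptoms:
            raise ValueError("a non-maximal score needs at least one symptom")
```

With this sketch, HealthStatus(100) is accepted without symptoms, HealthStatus(40, ("Interface flapping",)) is accepted, and HealthStatus(40) raises ValueError.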
The SAIN architecture is a generic architecture, which generates an | The SAIN architecture is a generic architecture, which generates an | |||
assurance graph from service instance(s), as specified in | assurance graph from service instance(s), as specified in | |||
Section 3.1). This architecture is applicable to multiple | Section 3.1. This architecture is applicable to not only multiple | |||
environments (e.g. wireline, wireless), but also different domains | environments (e.g., wireline and wireless) but also different domains | |||
(e.g. 5G network function virtualization (NFV) domain with a virtual | (e.g., 5G network function virtualization (NFV) domain with a virtual | |||
infrastructure manager (VIM), etc.), and as already noted, for | infrastructure manager (VIM), etc.) and, as already noted, for | |||
physical or virtual devices, as well as virtual functions. Thanks to | physical or virtual devices, as well as virtual functions. Thanks to | |||
the distributed graph design principle, graphs from different | the distributed graph design principle, graphs from different | |||
environments/orchestrator can be combined to obtain the graph of a | environments and orchestrators can be combined to obtain the graph of | |||
service instance that spans over multiple domains. | a service instance that spans over multiple domains. | |||
As an example of a service, let us consider a point-to-point level 2 | As an example of a service, let us consider a point-to-point layer 2 | |||
virtual private network (L2VPN). [RFC8466] specifies the parameters | virtual private network (L2VPN). [RFC8466] specifies the parameters | |||
for such a service. Examples of symptoms might be symptoms reported | for such a service. Examples of symptoms might be symptoms reported | |||
by specific subservices "Interface has high error rate" or "Interface | by specific subservices, including "Interface has high error rate", | |||
flapping", or "Device almost out of memory" as well as symptoms more | "Interface flapping", or "Device almost out of memory", as well as | |||
specific to the service such as "Site disconnected from VPN". | symptoms more specific to the service (such as "Site disconnected | |||
from VPN"). | ||||
To compute the health status of an instance of such a service, the | To compute the health status of an instance of such a service, the | |||
service definition is decomposed into an assurance graph formed by | service definition is decomposed into an assurance graph formed by | |||
subservices linked through dependencies. Each subservice is then | subservices linked through dependencies. Each subservice is then | |||
turned into an expression graph that details how to fetch metrics | turned into an expression graph that details how to fetch metrics | |||
from the devices and compute the health status of the subservice. | from the devices and compute the health status of the subservice. | |||
The subservice expressions are combined according to the dependencies | The subservice expressions are combined according to the dependencies | |||
between the subservices in order to obtain the expression graph which | between the subservices in order to obtain the expression graph that | |||
computes the health status of the service instance. | computes the health status of the service instance. | |||
The overall SAIN architecture is presented in Figure 1. Based on the | The overall SAIN architecture is presented in Figure 1. Based on the | |||
service configuration provided by the service orchestrator, the SAIN | service configuration provided by the service orchestrator, the SAIN | |||
orchestrator decomposes the assurance graph. It then sends to the | orchestrator decomposes the assurance graph. It then sends to the | |||
SAIN agents the assurance graph along with some other configuration | SAIN agents the assurance graph along with some other configuration | |||
options. The SAIN agents are responsible for building the expression | options. The SAIN agents are responsible for building the expression | |||
graph and computing the health statuses in a distributed manner. The | graph and computing the health statuses in a distributed manner. The | |||
collector is in charge of collecting and displaying the current | collector is in charge of collecting and displaying the current | |||
inferred health status of the service instances and subservices. The | inferred health status of the service instances and subservices. The | |||
collector also detects changes in the assurance graph structures, for | collector also detects changes in the assurance graph structures | |||
instance when a switchover from primary to backup path occurs, and | (e.g., an occurrence of a switchover from primary to backup path) and | |||
forwards to the orchestrator, which reconfigures the agents. | forwards the information to the orchestrator, which reconfigures the | |||
Finally, the automation loop is closed by having the SAIN collector | agents. Finally, the automation loop is closed by having the SAIN | |||
providing feedback to the network/service orchestrator. | collector provide feedback to the network/service orchestrator. | |||
In order to make agents, orchestrators and collectors from different | In order to make agents, orchestrators, and collectors from different | |||
vendors interoperable, their interface is defined as a YANG model in | vendors interoperable, their interface is defined as a YANG module in | |||
a companion document [I-D.ietf-opsawg-service-assurance-yang]. In | a companion document [RFC9418]. In Figure 1, the communications that | |||
Figure 1, the communications that are normalized by this YANG model | are normalized by this YANG module are tagged with a "Y". The use of | |||
are tagged with a "Y". The use of this YANG model is further | this YANG module is further explained in Section 3.5. | |||
explained in Section 3.5. | ||||
+-----------------+ | +-----------------+ | |||
| Service | | | Service | | |||
| Orchestrator |<----------------------+ | | Orchestrator |<----------------------+ | |||
| | | | | | | | |||
+-----------------+ | | +-----------------+ | | |||
| ^ | | | ^ | | |||
| | Network | | | | Network | | |||
| | Service | Feedback | | | Service | Feedback | |||
| | Instance | Loop | | | Instance | Loop | |||
| | Configuration | | | | Configuration | | |||
| | | | | | | | |||
| V | | | V | | |||
| +-----------------+ Graph +-------------------+ | | +-----------------+ Graph +-------------------+ | |||
| | SAIN | updates | SAIN | | | | SAIN | Updates | SAIN | | |||
| | Orchestrator |<--------| Collector | | | | Orchestrator |<--------| Collector | | |||
| +-----------------+ +-------------------+ | | +-----------------+ +-------------------+ | |||
| | ^ | | | ^ | |||
| Y| Configuration | Health Status | | Y| Configuration | Health Status | |||
| | (assurance graph) Y| (Score + Symptoms) | | | (Assurance Graph) Y| (Score + Symptoms) | |||
| V | Streamed | | V | Streamed | |||
| +-------------------+ | via Telemetry | | +-------------------+ | via Telemetry | |||
| |+-------------------+ | | | |+-------------------+ | | |||
| ||+-------------------+ | | | ||+-------------------+ | | |||
| +|| SAIN |-----------+ | | +|| SAIN |-----------+ | |||
| +| agent | | | +| Agent | | |||
| +-------------------+ | | +-------------------+ | |||
| ^ ^ ^ | | ^ ^ ^ | |||
| | | | | | | | | | |||
| | | | Metric Collection | | | | | Metric Collection | |||
V V V V | V V V V | |||
+-------------------------------------------------------------+ | +-------------------------------------------------------------+ | |||
| (Network) System | | | (Network) System | | |||
| | | | | | |||
+-------------------------------------------------------------+ | +-------------------------------------------------------------+ | |||
skipping to change at page 10, line 5 ¶ | skipping to change at line 407 ¶ | |||
In order to produce the score assigned to a service instance, the | In order to produce the score assigned to a service instance, the | |||
various involved components perform the following tasks: | various involved components perform the following tasks: | |||
* Analyze the configuration pushed to the network device(s) for | * Analyze the configuration pushed to the network device(s) for | |||
configuring the service instance. From there, determine which | configuring the service instance. From there, determine which | |||
information (called a metric) must be collected from the device(s) | information (called a metric) must be collected from the device(s) | |||
and which operations to apply to the metrics to compute the health | and which operations to apply to the metrics to compute the health | |||
status. | status. | |||
* Stream (via telemetry [RFC8641]) operational and config metric | * Stream (via telemetry, such as YANG-Push [RFC8641]) operational | |||
values when possible, else continuously poll. | and config metric values when possible, else continuously poll. | |||
* Continuously compute the health status of the service instances, | * Continuously compute the health status of the service instances | |||
based on the metric values. | based on the metric values. | |||
The SAIN architecture requires time synchronization, with Network | The SAIN architecture requires time synchronization, with the Network | |||
Time Protocol (NTP) [RFC5905] as a candidate, between all elements: | Time Protocol (NTP) [RFC5905] as a candidate, between all elements: | |||
monitored entities, SAIN agents, Service orchestrator, the SAIN | monitored entities, SAIN agents, service orchestrator, the SAIN | |||
collector, as well as the SAIN orchestrator. This guarantees the | collector, as well as the SAIN orchestrator. This guarantees the | |||
correlation of all symptoms in the system with the right | correlation of all symptoms in the system with the right | |||
assurance graph version. | assurance graph version. | |||
3.1. Translating a Service Instance Configuration into an Assurance | 3.1. Translating a Service Instance Configuration into an Assurance | |||
Graph | Graph | |||
In order to structure the assurance of a service instance, the SAIN | In order to structure the assurance of a service instance, the SAIN | |||
orchestrator decomposes the service instance into so-called | orchestrator decomposes the service instance into so-called | |||
subservice instances. Each subservice instance focuses on a specific | subservice instances. Each subservice instance focuses on a specific | |||
feature or subpart of the service. | feature or subpart of the service. | |||
The decomposition into subservices is an important function of the | The decomposition into subservices is an important function of the | |||
architecture, for the following reasons: | architecture for the following reasons: | |||
* The result of this decomposition provides a relational picture of | * The result of this decomposition provides a relational picture of | |||
a service instance, that can be represented as a graph (called | a service instance, which can be represented as a graph (called an | |||
assurance graph) to the operator. | assurance graph) to the operator. | |||
* Subservices provide a scope for particular expertise and thereby | * Subservices provide a scope for particular expertise and thereby | |||
enable contribution from external experts. For instance, the | enable contribution from external experts. For instance, the | |||
subservice dealing with the optics health should be reviewed and | subservice dealing with the optic's health should be reviewed and | |||
extended by an expert in optical interfaces. | extended by an expert in optical interfaces. | |||
* Subservices that are common to several service instances are | * Subservices that are common to several service instances are | |||
reused for reducing the amount of computation needed. For | reused for reducing the amount of computation needed. For | |||
instance, the subservice assuring a given interface is reused by | instance, the subservice assuring a given interface is reused by | |||
any service instance relying on that interface. | any service instance relying on that interface. | |||
The assurance graph of a service instance is a DAG representing the | The assurance graph of a service instance is a DAG representing the | |||
structure of the assurance case for the service instance. The nodes | structure of the assurance case for the service instance. The nodes | |||
of this graph are service instances or subservice instances. Each | of this graph are service instances or subservice instances. Each | |||
edge of this graph indicates a dependency between the two nodes at | edge of this graph indicates a dependency between the two nodes at | |||
its extremities: the service or subservice at the source of the edge | its extremities, i.e., the service or subservice at the source of the | |||
depends on the service or subservice at the destination of the edge. | edge depends on the service or subservice at the destination of the | |||
edge. | ||||
Figure 2 depicts a simplistic example of the assurance graph for a | Figure 2 depicts a simplistic example of the assurance graph for a | |||
tunnel service. The node at the top is the service instance, the | tunnel service. The node at the top is the service instance; the | |||
nodes below are its dependencies. In the example, the tunnel service | nodes below are its dependencies. In the example, the tunnel service | |||
instance depends on the "peer1" and "peer2" tunnel interfaces (the | instance depends on the "peer1" and "peer2" tunnel interfaces (the | |||
tunnel interfaces created on the peer1 and peer2 devices, | tunnel interfaces created on the peer1 and peer2 devices, | |||
respectively), which in turn depend on the respective physical | respectively), which in turn depend on the respective physical | |||
interfaces, which finally depend on the respective "peer1" and | interfaces, which finally depend on the respective "peer1" and | |||
"peer2" devices. The tunnel service instance also depends on the IP | "peer2" devices. The tunnel service instance also depends on the IP | |||
connectivity that depends on the IS-IS routing protocol. | connectivity that depends on the IS-IS routing protocol. | |||
+------------------+ | +------------------+ | |||
| Tunnel | | | Tunnel | | |||
skipping to change at page 12, line 7 ¶ | skipping to change at line 497 ¶ | |||
+-------------+ +-------------+ | +-------------+ +-------------+ | |||
| | | | | | | | | | |||
| Peer1 | | Peer2 | | | Peer1 | | Peer2 | | |||
| Device | | Device | | | Device | | Device | | |||
+-------------+ +-------------+ | +-------------+ +-------------+ | |||
Figure 2: Assurance Graph Example | Figure 2: Assurance Graph Example | |||
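As an illustration only, the tunnel example described above could be held as an adjacency mapping in which each edge points from a node to a node it depends on. The node labels and both helper functions are assumptions made for this sketch; the architecture does not prescribe a data structure.

```python
# Edges point from a (sub)service to the subservices it depends on.
# Node names are illustrative labels, not identifiers defined by SAIN.
ASSURANCE_GRAPH = {
    "tunnel-service": ["peer1-tunnel-if", "peer2-tunnel-if", "ip-connectivity"],
    "peer1-tunnel-if": ["peer1-phys-if"],
    "peer2-tunnel-if": ["peer2-phys-if"],
    "peer1-phys-if": ["peer1-device"],
    "peer2-phys-if": ["peer2-device"],
    "ip-connectivity": ["is-is"],
    "peer1-device": [],
    "peer2-device": [],
    "is-is": [],
}


def dependencies(graph, node):
    """Return every node reachable from `node`, i.e., everything it relies on."""
    seen = set()
    stack = list(graph[node])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(graph[current])
    return seen


def impacted_by(graph, failed):
    """Return every node whose health may be affected when `failed` degrades."""
    return {node for node in graph if failed in dependencies(graph, node)}
```

Walking the edges forward lists what a service instance relies on; walking them backward, as impacted_by does, lists the services and subservices affected by a degraded component.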
Depicting the assurance graph helps the operator to understand (and | Depicting the assurance graph helps the operator to understand (and | |||
assert) the decomposition. The assurance graph shall be maintained | assert) the decomposition. The assurance graph shall be maintained | |||
during normal operation with addition, modification and removal of | during normal operation with addition, modification, and removal of | |||
service instances. A change in the network configuration or topology | service instances. A change in the network configuration or topology | |||
shall automatically be reflected in the assurance graph. As a first | shall automatically be reflected in the assurance graph. As a first | |||
example, a change of routing protocol from IS-IS to OSPF would change | example, a change of the routing protocol from IS-IS to OSPF would | |||
the assurance graph accordingly. As a second example, assuming that | change the assurance graph accordingly. As a second example, assume | |||
ECMP is in place for the source router for that specific tunnel; in | that the ECMP is in place for the source router for that specific | |||
that case, multiple interfaces must now be monitored, on top of the | tunnel; in that case, multiple interfaces must now be monitored, in | |||
monitoring the ECMP health itself. | addition to monitoring the ECMP health itself. | |||
3.1.1. Circular Dependencies | 3.1.1. Circular Dependencies | |||
The edges of the assurance graph represent dependencies. An | The edges of the assurance graph represent dependencies. An | |||
assurance graph is a DAG if and only if there are no circular | assurance graph is a DAG if and only if there are no circular | |||
dependencies among the subservices, and every assurance graph should | dependencies among the subservices, and every assurance graph should | |||
avoid circular dependencies. However, in some cases, circular | avoid circular dependencies. However, in some cases, circular | |||
dependencies might appear in the assurance graph. | dependencies might appear in the assurance graph. | |||
First, the assurance graph of a whole system is obtained by combining | First, the assurance graph of a whole system is obtained by combining | |||
the assurance graph of every service running on that system. Here | the assurance graph of every service running on that system. Here, | |||
combining means that two subservices having the same type and the | combining means that two subservices having the same type and the | |||
same parameters are in fact the same subservice and thus a single | same parameters are in fact the same subservice and thus a single | |||
node in the graph. For instance, the subservice of type "device" | node in the graph. For instance, the subservice of type "device" | |||
with the only parameter (the device ID) set to "PE1" will appear only | with the only parameter (the device ID) set to "PE1" will appear only | |||
once in the whole assurance graph even if several service instances | once in the whole assurance graph, even if several service instances | |||
rely on that device. Now, if two engineers design assurance graphs | rely on that device. Now, if two engineers design assurance graphs | |||
for two different services, and engineer A decides that an interface | for two different services, and Engineer A decides that an interface | |||
depends on the link it is connected to, but engineer B decides that | depends on the link it is connected to, but Engineer B decides that | |||
the link depends on the interface it is connected to, then when | the link depends on the interface it is connected to, then when | |||
combining the two assurance graphs, we will have a circular | combining the two assurance graphs, we will have a circular | |||
dependency interface -> link -> interface. | dependency interface -> link -> interface. | |||
Another case possibly resulting in circular dependencies is when | Another case possibly resulting in circular dependencies is when | |||
subservices are not properly identified. Assume that we want to | subservices are not properly identified. Assume that we want to | |||
assure a cloud-based computing cluster that runs containers. We | assure a cloud-based computing cluster that runs containers. We | |||
could represent the cluster by a subservice and the network service | could represent the cluster by a subservice and the network service | |||
connecting containers on the cluster by another subservice. We will | connecting containers on the cluster by another subservice. We would | |||
likely model that the network service depends on the cluster, because | likely model that as the network service depending on the cluster, | |||
the network service runs in a container supported by the cluster. | because the network service runs in a container supported by the | |||
Conversely, the cluster depends on the network service for | cluster. Conversely, the cluster depends on the network service for | |||
connectivity between containers, which creates a circular dependency. | connectivity between containers, which creates a circular dependency. | |||
A finer decomposition might distinguish between the resources for | A finer decomposition might distinguish between the resources for | |||
executing containers (a part of our cluster subservice) and the | executing containers (a part of our cluster subservice) and the | |||
communication between the containers (which could be modelled in the | communication between the containers (which could be modeled in the | |||
same way as communication between routers). | same way as communication between routers). | |||
In any case, it is likely that circular dependencies will show up in | In any case, it is likely that circular dependencies will show up in | |||
the assurance graph. A first step would be to detect circular | the assurance graph. A first step would be to detect circular | |||
dependencies as soon as possible in the SAIN architecture. Such a | dependencies as soon as possible in the SAIN architecture. Such a | |||
detection could be carried out by the SAIN orchestrator. Whenever a | detection could be carried out by the SAIN orchestrator. Whenever a | |||
circular dependency is detected, the newly added service would not be | circular dependency is detected, the newly added service would not be | |||
monitored until more careful modelling or alignment between the | monitored until more careful modeling or alignment between the | |||
different teams (engineer A and B) remove the circular dependency. | different teams (Engineers A and B) remove the circular dependency. | |||
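The detection step can be sketched as follows, assuming that a subservice is identified by its type and parameters and that a graph is a set of (source, destination) dependency edges. The function names, the interface name "eth0", and the link ID "L1" are hypothetical; the text does not say how a SAIN orchestrator implements the check.

```python
def subservice(kind, **params):
    """Identity of a subservice: its type plus its parameters."""
    return (kind, tuple(sorted(params.items())))


def has_cycle(edges):
    """Depth-first search with three colors over a set of (src, dst) edges."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])
    WHITE, GREY, BLACK = 0, 1, 2
    color = {node: WHITE for node in graph}

    def visit(node):
        color[node] = GREY
        for nxt in graph[node]:
            if color[nxt] == GREY:      # back edge: a cycle exists
                return True
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in graph)


def try_add_service(monitored_edges, new_edges):
    """Merge `new_edges` only if the combined graph stays acyclic."""
    merged = monitored_edges | new_edges
    if has_cycle(merged):
        return monitored_edges, False   # new service is not monitored yet
    return merged, True


interface = subservice("interface", device="PE1", name="eth0")
link = subservice("link", id="L1")

graph_a = {(interface, link)}   # Engineer A: interface depends on link
graph_b = {(link, interface)}   # Engineer B: link depends on interface

monitored, accepted_a = try_add_service(set(), graph_a)
monitored, accepted_b = try_add_service(monitored, graph_b)
```

Because identical (type, parameters) pairs compare equal, the two fragments share their nodes once merged, which is what exposes the interface -> link -> interface cycle and causes the second graph to be rejected.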
As more elaborate solution we could consider a graph transformation: | As a more elaborate solution, we could consider a graph | |||
transformation: | ||||
* Decompose the graph into strongly connected components. | * Decompose the graph into strongly connected components. | |||
* For each strongly connected component: | * For each strongly connected component: | |||
- Remove all edges between nodes of the strongly connected | - remove all edges between nodes of the strongly connected | |||
component | component; | |||
- Add a new "synthetic" node for the strongly connected component | - add a new "synthetic" node for the strongly connected | |||
component; | ||||
- For each edge pointing to a node in the strongly connected | - for each edge pointing to a node in the strongly connected | |||
component, change the destination to the "synthetic" node | component, change the destination to the "synthetic" node; and | |||
- Add a dependency from the "synthetic" node to every node in the | - add a dependency from the "synthetic" node to every node in the | |||
strongly connected component. | strongly connected component. | |||
Such an algorithm would include all symptoms detected by any | Such an algorithm would include all symptoms detected by any | |||
subservice in one of the strongly component and make it available to | subservice in one of the strongly connected components and make it | |||
any subservice that depends on it. Figure 3 shows an example of such | available to any subservice that depends on it. Figure 3 shows an | |||
a transformation. On the left-hand side, the nodes c, d, e and f | example of such a transformation. On the left-hand side, the nodes | |||
form a strongly connected component. The status of node a should | c, d, e, and f form a strongly connected component. The status of | |||
depend on the status of nodes c, d, e, f, g, and h, but this is hard | node a should depend on the status of nodes c, d, e, f, g, and h, but | |||
to compute because of the circular dependency. On the right hand- | this is hard to compute because of the circular dependency. On the | |||
side, a depends on all these nodes as well, but there the circular | right-hand side, node a depends on all these nodes as well, but the | |||
dependency has been removed. | circular dependency has been removed. | |||
+---+ +---+ | +---+ +---+ | +---+ +---+ | +---+ +---+ | |||
| a | | b | | | a | | b | | | a | | b | | | a | | b | | |||
+---+ +---+ | +---+ +---+ | +---+ +---+ | +---+ +---+ | |||
| | | | | | | | | | | | |||
v v | v v | v v | v v | |||
+---+ +---+ | +------------+ | +---+ +---+ | +------------+ | |||
| c |--->| d | | | synthetic | | | c |--->| d | | | synthetic | | |||
+---+ +---+ | +------------+ | +---+ +---+ | +------------+ | |||
^ | | / | | \ | ^ | | / | | \ | |||
skipping to change at page 14, line 28 ¶ | skipping to change at line 602 ¶ | |||
+---+ +---+ | +---+ +---+ +---+ +---+ | +---+ +---+ | +---+ +---+ +---+ +---+ | |||
| | | | | | | | | | | | |||
v v | v v | v v | v v | |||
+---+ +---+ | +---+ +---+ | +---+ +---+ | +---+ +---+ | |||
| g | | h | | | g | | h | | | g | | h | | | g | | h | | |||
+---+ +---+ | +---+ +---+ | +---+ +---+ | +---+ +---+ | |||
Before After | Before After | |||
Transformation Transformation | Transformation Transformation | |||
Figure 3: Graph transformation | Figure 3: Graph Transformation | |||
We consider a concrete example to illustrate this transformation. | We consider a concrete example to illustrate this transformation. | |||
Let's assume that Engineer A is building an assurance graph dealing | Let's assume that Engineer A is building an assurance graph dealing | |||
with IS-IS and Engineer B is building an assurance graph dealing with | with IS-IS and Engineer B is building an assurance graph dealing with | |||
OSPF. The graph from Engineer A could contain the following: | OSPF. The graph from Engineer A could contain the following: | |||
+------------+ | +------------+ | |||
| IS-IS Link | | | IS-IS Link | | |||
+------------+ | +------------+ | |||
| | | | |||
v | v | |||
+------------+ | +------------+ | |||
| Phys. Link | | | Phys. Link | | |||
+------------+ | +------------+ | |||
| | | | | | |||
v v | v v | |||
+-------------+ +-------------+ | +-------------+ +-------------+ | |||
| Interface 1 | | Interface 2 | | | Interface 1 | | Interface 2 | | |||
+-------------+ +-------------+ | +-------------+ +-------------+ | |||
Figure 4: Fragment of assurance graph from Engineer A | Figure 4: Fragment of the Assurance Graph from Engineer A | |||
The graph from Engineer B could contain the following: | The graph from Engineer B could contain the following: | |||
+------------+ | +------------+ | |||
| OSPF Link | | | OSPF Link | | |||
+------------+ | +------------+ | |||
| | | | | | | | |||
v | v | v | v | |||
+-------------+ | +-------------+ | +-------------+ | +-------------+ | |||
| Interface 1 | | | Interface 2 | | | Interface 1 | | | Interface 2 | | |||
+-------------+ | +-------------+ | +-------------+ | +-------------+ | |||
| | | | | | | | |||
v v v | v v v | |||
+------------+ | +------------+ | |||
| Phys. Link | | | Phys. Link | | |||
+------------+ | +------------+ | |||
Figure 5: Fragment of assurance graph from Engineer B | Figure 5: Fragment of the Assurance Graph from Engineer B | |||
Each Interface subservice and the Physical Link subservice are common | The Interface subservices and the Physical Link subservice are common | |||
to both fragments above. Each of these subservice appears only once | to both fragments above. Each of these subservices appears only once | |||
in the graph merging the two fragments. Dependencies from both | in the graph merging the two fragments. Dependencies from both | |||
fragments are included in the merged graph, resulting in a circular | fragments are included in the merged graph, resulting in a circular | |||
dependency: | dependency: | |||
+------------+ +------------+ | +------------+ +------------+ | |||
| IS-IS Link | | OSPF Link |---+ | | IS-IS Link | | OSPF Link |---+ | |||
+------------+ +------------+ | | +------------+ +------------+ | | |||
| | | | | | | | | | |||
| +-------- + | | | | +-------- + | | | |||
v v | | | v v | | | |||
skipping to change at page 15, line 46 ¶ | skipping to change at line 668 ¶ | |||
| ^ | | | | | | ^ | | | | | |||
| | +-------+ | | | | | | +-------+ | | | | |||
v | v | v | | v | v | v | | |||
+-------------+ +-------------+ | | +-------------+ +-------------+ | | |||
| Interface 1 | | Interface 2 | | | | Interface 1 | | Interface 2 | | | |||
+-------------+ +-------------+ | | +-------------+ +-------------+ | | |||
^ | | ^ | | |||
| | | | | | |||
+------------------------------+ | +------------------------------+ | |||
Figure 6: Merging graphs from A and B | Figure 6: Merging Graphs from Engineers A and B | |||
The solution presented above would result in graph looking as | The solution presented above would result in a graph looking as | |||
follows, where a new "synthetic" node is included. Using that | follows, where a new "synthetic" node is included. Using that | |||
transformation, all dependencies are indirectly satisfied for the | transformation, all dependencies are indirectly satisfied for the | |||
nodes outside the circular dependency, in the sense that both IS-IS | nodes outside the circular dependency, in the sense that both IS-IS | |||
and OSPF links have indirect dependencies to the two interfaces and | and OSPF links have indirect dependencies to the two interfaces and | |||
the link. However, the dependencies between the link and the | the link. However, the dependencies between the link and the | |||
interfaces are lost as they were causing the circular dependency. | interfaces are lost since they were causing the circular dependency. | |||
+------------+ +------------+ | +------------+ +------------+ | |||
| IS-IS Link | | OSPF Link | | | IS-IS Link | | OSPF Link | | |||
+------------+ +------------+ | +------------+ +------------+ | |||
| | | | | | |||
v v | v v | |||
+------------+ | +------------+ | |||
| synthetic | | | synthetic | | |||
+------------+ | +------------+ | |||
| | | | |||
+-----------+-------------+ | +-----------+-------------+ | |||
| | | | | | | | |||
v v v | v v v | |||
+-------------+ +------------+ +-------------+ | +-------------+ +------------+ +-------------+ | |||
| Interface 1 | | Phys. Link | | Interface 2 | | | Interface 1 | | Phys. Link | | Interface 2 | | |||
+-------------+ +------------+ +-------------+ | +-------------+ +------------+ +-------------+ | |||
Figure 7: Removing circular dependencies after merging graphs | Figure 7: Removing Circular Dependencies after Merging Graphs | |||
from A and B | from Engineers A and B | |||
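The transformation steps can be sketched as below. This is one possible reading, not a normative algorithm: it uses Tarjan's algorithm to find strongly connected components, only rewrites components with more than one node, and names the new nodes "synthetic-N" (all three choices are assumptions). The input edges follow one reading of the fragments in Figures 4 and 5.

```python
def strongly_connected_components(graph):
    """Tarjan's algorithm; `graph` maps each node to its dependencies."""
    index, low, on_stack, stack, result = {}, {}, set(), [], []

    def visit(node):
        index[node] = low[node] = len(index)
        stack.append(node)
        on_stack.add(node)
        for nxt in graph[node]:
            if nxt not in index:
                visit(nxt)
                low[node] = min(low[node], low[nxt])
            elif nxt in on_stack:
                low[node] = min(low[node], index[nxt])
        if low[node] == index[node]:    # `node` is the root of a component
            component = set()
            while True:
                member = stack.pop()
                on_stack.discard(member)
                component.add(member)
                if member == node:
                    break
            result.append(component)

    for node in graph:
        if node not in index:
            visit(node)
    return result


def remove_cycles(graph):
    """Replace each multi-node component by a "synthetic" entry node."""
    owner = {}
    for number, component in enumerate(strongly_connected_components(graph)):
        if len(component) > 1:
            for member in component:
                owner[member] = f"synthetic-{number}"
    result = {}
    for src, dsts in graph.items():
        new_dsts = set()
        for dst in dsts:
            if dst not in owner:
                new_dsts.add(dst)            # edge to a node outside any cycle
            elif owner.get(src) != owner[dst]:
                new_dsts.add(owner[dst])     # redirect to the synthetic node
            # otherwise the edge is internal to the component and is dropped
        result[src] = new_dsts
    for member, synthetic in owner.items():
        result.setdefault(synthetic, set()).add(member)
    return result


merged = {
    "IS-IS Link": {"Phys. Link"},
    "OSPF Link": {"Interface 1", "Interface 2"},
    "Phys. Link": {"Interface 1", "Interface 2"},
    "Interface 1": {"Phys. Link"},
    "Interface 2": {"Phys. Link"},
}
acyclic = remove_cycles(merged)
```

On this input, both link subservices end up depending on a single synthetic node, which in turn depends on the two interfaces and the physical link, matching the shape of Figure 7.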
3.2. Intent and Assurance Graph | 3.2. Intent and Assurance Graph | |||
The SAIN orchestrator analyzes the configuration of a service | The SAIN orchestrator analyzes the configuration of a service | |||
instance to: | instance to do the following: | |||
* Try to capture the intent of the service instance, i.e., what is | * Try to capture the intent of the service instance, i.e., What is | |||
the service instance trying to achieve. At least, this requires | the service instance trying to achieve? At a minimum, this | |||
the SAIN orchestrator to know the YANG modules that are being | requires the SAIN orchestrator to know the YANG modules that are | |||
configured on the devices to enable the service. Note that if the | being configured on the devices to enable the service. Note that, | |||
service model or the network model is known to the SAIN | if the service model or the network model is known to the SAIN | |||
orchestrator, the latter can exploit it. In that case, the intent | orchestrator, the latter can exploit it. In that case, the intent | |||
could be directly extracted and include more details, such as the | could be directly extracted and include more details, such as the | |||
notion of sites for a VPN, which is out of scope of the device | notion of sites for a VPN, which is out of scope of the device | |||
configuration. | configuration. | |||
* Decompose the service instance into subservices representing the | * Decompose the service instance into subservices representing the | |||
network features on which the service instance relies. | network features on which the service instance relies. | |||
The SAIN orchestrator must be able to analyze configuration pushed to | The SAIN orchestrator must be able to analyze the configuration | |||
various devices for configuring a service instance and produce the | pushed to various devices of a service instance and produce the | |||
assurance graph for that service instance. | assurance graph for that service instance. | |||
To schematize what a SAIN orchestrator does, assume that the | To schematize what a SAIN orchestrator does, assume that a service | |||
configuration for a service instance touches two devices and | instance touches two devices and configures a virtual tunnel | |||
configure on each device a virtual tunnel interface. Then: | interface on each device. Then: | |||
* Capturing the intent would start by detecting that the service | * Capturing the intent would start by detecting that the service | |||
instance is actually a tunnel between the two devices, and stating | instance is actually a tunnel between the two devices and stating | |||
that this tunnel must be functional. This solution is minimally | that this tunnel must be operational. This solution is minimally | |||
invasive as it does not require modifying nor knowing the service | invasive, as it does not require modifying nor knowing the service | |||
model. If the service model or network model is known by the SAIN | model. If the service model or network model is known by the SAIN | |||
orchestrator, it can be used to further capture the intent and | orchestrator, it can be used to further capture the intent and | |||
include more information such as Service Level Objectives. For | include more information, such as Service-Level Objectives (e.g., | |||
instance, the latency and bandwidth requirements for the tunnel, | the latency and bandwidth requirements for the tunnel) if present | |||
if present in the service model | in the service model. | |||
* Decomposing the service instance into subservices would result in | * Decomposing the service instance into subservices would result in | |||
the assurance graph depicted in Figure 2, for instance. | the assurance graph depicted in Figure 2, for instance. | |||
The assurance graph, or more precisely the subservices and | The assurance graph, or more precisely the subservices and | |||
dependencies that a SAIN orchestrator can instantiate, should be | dependencies that a SAIN orchestrator can instantiate, should be | |||
curated. The organization of such a process is out-of-scope for this | curated. The organization of such a process (i.e., ensure that | |||
document and should aim to: | existing subservices are reused as much as possible and avoid | |||
circular dependencies) is out-of-scope for this document. | ||||
* Ensure that existing subservices are reused as much as possible. | ||||
* Avoid circular dependencies. | ||||
To be applied, SAIN requires a mechanism mapping a service instance | To be applied, SAIN requires a mechanism mapping a service instance | |||
to the configuration actually required on the devices for that | to the configuration actually required on the devices for that | |||
service instance to run. While the Figure 1 makes a distinction | service instance to run. While Figure 1 makes a distinction between | |||
between the SAIN orchestrator and a different component providing the | the SAIN orchestrator and a different component providing the service | |||
service instance configuration, in practice those two components are | instance configuration, in practice those two components are most | |||
mostly likely combined. The internals of the orchestrator are out of | likely combined. The internals of the orchestrator are out of scope | |||
scope of this document. | of this document. | |||
3.3. Subservices | 3.3. Subservices | |||
A subservice corresponds to subpart or a feature of the network | A subservice corresponds to a subpart or a feature of the network | |||
system that is needed for a service instance to function properly. | system that is needed for a service instance to function properly. | |||
In the context of SAIN, a subservice is associated with its assurance, | In the context of SAIN, a subservice is associated with its assurance, | |||
that is the method for assuring that a subservice behaves correctly. | which is the method for assuring that a subservice behaves correctly. | |||
Subservices, just as with services, have high-level parameters that | Subservices, just as with services, have high-level parameters that | |||
specify the instance to be assured. The needed parameters depend on | specify the instance to be assured. The needed parameters depend on | |||
the subservice type. For example, assuring a device requires a | the subservice type. For example, assuring a device requires a | |||
specific deviceId as parameter. For example, assuring an interface | specific deviceId as a parameter and assuring an interface requires a | |||
requires a specific combination of deviceId and interfaceId. | specific combination of deviceId and interfaceId. | |||
When designing a new type of subservice, one should carefully define | When designing a new type of subservice, one should carefully define | |||
what is the assured object or functionality. Then, the parameters | what is the assured object or functionality. Then, the parameters | |||
must be chosen as a minimal set that completely identify the object | must be chosen as a minimal set that completely identifies the object | |||
(see examples from the previous paragraph). Parameters cannot change | (see examples from the previous paragraph). Parameters cannot change | |||
during the lifecycle of a subservice. For instance, an IP address is | during the life cycle of a subservice. For instance, an IP address | |||
a good parameter when assuring a connectivity towards that address | is a good parameter when assuring a connectivity towards that address | |||
(i.e. a given device can reach a given IP address), however it's not | (i.e., a given device can reach a given IP address); however, it's | |||
a good parameter to identify an interface as the IP address assigned | not a good parameter to identify an interface, as the IP address | |||
to that interface can be changed. | assigned to that interface can be changed. | |||
A subservice is also characterized by a list of metrics to fetch and | A subservice is also characterized by a list of metrics to fetch and | |||
a list of operations to apply to these metrics in order to infer a | a list of operations to apply to these metrics in order to infer a | |||
health status. | health status. | |||
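As a rough illustration of the subservice description above, the following Python sketch models a subservice as a type plus a minimal, immutable set of identifying parameters and a list of metrics. The class, function, and metric names are hypothetical and are not taken from the YANG modules; only deviceId and interfaceId come from the text.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)  # frozen: parameters cannot change during the life cycle
class Subservice:
    type: str
    # Sorted (name, value) pairs: the minimal set identifying the assured object.
    parameters: Tuple[Tuple[str, str], ...]
    # Names of the device-independent metrics to fetch (hypothetical names).
    metrics: Tuple[str, ...] = ()

    @property
    def key(self):
        """Identity of the assured object: its type and its parameters."""
        return (self.type, self.parameters)


def make_subservice(type_, metrics=(), **params):
    """Build a subservice; parameter order does not matter."""
    return Subservice(type_, tuple(sorted(params.items())), tuple(metrics))


device = make_subservice("device", deviceId="Peer1")
interface = make_subservice(
    "interface", metrics=("oper-status",), deviceId="Peer1", interfaceId="eth0"
)
```

Making the instance immutable is one possible way to reflect the rule that parameters cannot change during the life cycle of a subservice.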
3.4. Building the Expression Graph from the Assurance Graph | 3.4. Building the Expression Graph from the Assurance Graph | |||
From the assurance graph is derived a so-called global computation | From the assurance graph, a so-called global computation graph is | |||
graph. First, each subservice instance is transformed into a set of | derived. First, each subservice instance is transformed into a set | |||
subservice expressions that take metrics and constants as input | of subservice expressions that take metrics and constants as input | |||
(i.e., sources of the DAG) and produce the status of the subservice, | (i.e., sources of the DAG) and produce the status of the subservice | |||
based on some heuristics. For instance, the health of an interface | based on some heuristics. For instance, the health of an interface | |||
is 0 (minimal score) with the symptom "interface admin-down" if the | is 0 (minimal score) with the symptom "interface admin-down" if the | |||
interface is disabled in the configuration. Then for each service | interface is disabled in the configuration. Then, for each service | |||
instance, the service expressions are constructed by combining the | instance, the service expressions are constructed by combining the | |||
subservice expressions of its dependencies. The way service | subservice expressions of its dependencies. The way service | |||
expressions are combined depends on the dependency types (impacting | expressions are combined depends on the dependency types (impacting | |||
or informational). Finally, the global computation graph is built by | or informational). Finally, the global computation graph is built by | |||
combining the service expressions, to get a global view of all | combining the service expressions to get a global view of all | |||
subservices. In other words, the global computation graph encodes | subservices. In other words, the global computation graph encodes | |||
all the operations needed to produce health statuses from the | all the operations needed to produce health statuses from the | |||
collected metrics. | collected metrics. | |||
The two types of dependencies for combining subservices are: | The two types of dependencies for combining subservices are: | |||
Informational Dependency: Type of dependency whose health score | Informational Dependency: | |||
does not impact the health score of its parent subservice or | The type of dependency whose health score does not impact the | |||
service instance(s) in the assurance graph. However, the symptoms | health score of its parent subservice or service instance(s) in | |||
should be taken into account in the parent service instance or | the assurance graph. However, the symptoms should be taken into | |||
subservice instance(s), for informational reasons. | account in the parent service instance or subservice instance(s) | |||
for informational reasons. | ||||
Impacting Dependency: Type of dependency whose score impacts the | Impacting Dependency: | |||
score of its parent subservice or service instance(s) in the | The type of dependency whose health score impacts the health score | |||
assurance graph. The symptoms are taken into account in the | of its parent subservice or service instance(s) in the assurance | |||
parent service instance or subservice instance(s), as the | graph. The symptoms are taken into account in the parent service | |||
impacting reasons. | instance or subservice instance(s) as the impacting reasons. | |||
The set of dependency type presented here is not exhaustive. More | The set of dependency types presented here is not exhaustive. More | |||
specific dependency types can be defined by extending the YANG model. | specific dependency types can be defined by extending the YANG | |||
For instance, a connectivity subservice depending on several path | module. For instance, a connectivity subservice depending on several | |||
subservices is only partially impacted if only one of these paths | path subservices is partially impacted if only one of these paths | |||
fails. Adding these new dependency types requires defining the | fails. Adding these new dependency types requires defining the | |||
corresponding operation for combining statuses of subservices. | corresponding operation for combining statuses of subservices. | |||
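The two dependency types can be sketched as follows. This is a minimal illustration, not the operation defined by the architecture: it assumes that health scores range from 0 (minimal score) to an assumed maximum of 100 and that an impacting dependency lowers the parent score to the minimum of the two scores. The names Status and combine are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

IMPACTING = "impacting"
INFORMATIONAL = "informational"


@dataclass
class Status:
    score: int  # assumption: 0 is the minimal score, 100 the maximal one
    symptoms: List[str] = field(default_factory=list)


def combine(own: Status, deps: List[Tuple[str, Status]]) -> Status:
    """Combine a parent's own status with the statuses of its dependencies."""
    score = own.score
    symptoms = list(own.symptoms)
    for dep_type, dep in deps:
        if dep_type == IMPACTING:
            # Assumed combination rule: the worst impacting score wins.
            score = min(score, dep.score)
        # Symptoms are kept for both types, tagged with the reason.
        symptoms.extend(f"[{dep_type}] {s}" for s in dep.symptoms)
    return Status(score, symptoms)


down = Status(0, ["interface admin-down"])
impacted = combine(Status(100), [(IMPACTING, down)])
informed = combine(Status(100), [(INFORMATIONAL, down)])
```

A more specific dependency type, such as the partial impact of one failed path among several, would need its own combination rule in place of the minimum.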
Subservices shall not be dependent on the protocol used to retrieve | Subservices shall not be dependent on the protocol used to retrieve | |||
the metrics. To justify this, let's consider the interface | the metrics. To justify this, let's consider the interface | |||
operational status. Depending on the device capabilities, this | operational status. Depending on the device capabilities, this | |||
status can be collected by an industry-accepted YANG module (IETF, | status can be collected by an industry-accepted YANG module (e.g., | |||
OpenConfig [OpenConfig]), by a vendor-specific YANG module, or even | IETF or OpenConfig [OpenConfig]), by a vendor-specific YANG module, | |||
by a MIB module. If the subservice was dependent on the mechanism to | or even by a MIB module. If the subservice was dependent on the | |||
collect the operational status, then we would need multiple | mechanism to collect the operational status, then we would need | |||
subservice definitions in order to support all different mechanisms. | multiple subservice definitions in order to support all different | |||
This also implies that, while waiting for all the metrics to be | mechanisms. This also implies that, while waiting for all the | |||
available via standard YANG modules, SAIN agents might have to | metrics to be available via standard YANG modules, SAIN agents might | |||
retrieve metric values via non-standard YANG models, via MIB modules, | have to retrieve metric values via nonstandard YANG data models, MIB | |||
Command Line Interface (CLI), etc., effectively implementing a | modules, the Command-Line Interface (CLI), etc., effectively | |||
normalization layer between data models and information models. | implementing a normalization layer between data models and | |||
information models. | ||||
In order to keep subservices independent of metric collection method, | In order to keep subservices independent of metric collection method | |||
or, expressed differently, to support multiple combinations of | (or, expressed differently, to support multiple combinations of | |||
platforms, OSes, and even vendors, the architecture introduces the | platforms, OSes, and even vendors), the architecture introduces the | |||
concept of "metric engine". The metric engine maps each device- | concept of "metric engine". The metric engine maps each device- | |||
independent metric used in the subservices to a list of device- | independent metric used in the subservices to a list of device- | |||
specific metric implementations that precisely define how to fetch | specific metric implementations that precisely define how to fetch | |||
values for that metric. The mapping is parameterized by the | values for that metric. The mapping is parameterized by the | |||
characteristics (model, OS version, etc.) of the device from which | characteristics (i.e., model, OS version, etc.) of the device from | |||
the metrics are fetched. This metric engine is included in the SAIN | which the metrics are fetched. This metric engine is included in the | |||
agent. | SAIN agent. | |||
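A metric engine of this kind could be sketched as a lookup table, shown below. All class names, the metric name, and the device models are hypothetical; the descriptions stand in for whatever retrieval method (YANG module, MIB module, CLI) an implementation would actually use.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Device:
    model: str
    os_version: str


@dataclass(frozen=True)
class MetricImplementation:
    description: str                    # how the value is fetched
    supports: Callable[[Device], bool]  # device characteristics it applies to


class MetricEngine:
    """Maps a device-independent metric to device-specific implementations."""

    def __init__(self) -> None:
        self._implementations: Dict[str, List[MetricImplementation]] = {}

    def register(self, metric: str, impl: MetricImplementation) -> None:
        self._implementations.setdefault(metric, []).append(impl)

    def resolve(self, metric: str, device: Device) -> MetricImplementation:
        # The first registered implementation matching the device is used.
        for impl in self._implementations.get(metric, []):
            if impl.supports(device):
                return impl
        raise LookupError(f"no implementation of {metric!r} for {device}")


engine = MetricEngine()
engine.register(
    "interface-oper-status",
    MetricImplementation("standard YANG module", lambda d: d.model == "model-A"),
)
engine.register(
    "interface-oper-status",
    MetricImplementation("MIB module", lambda d: d.model == "model-B"),
)
```

With this layout, the subservice definition only names "interface-oper-status", and supporting a new platform means registering one more implementation.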
3.5. Open Interfaces with YANG Modules | 3.5. Open Interfaces with YANG Modules | |||
The interfaces between the architecture components are open thanks to | The interfaces between the architecture components are open thanks to | |||
the YANG modules specified in | the YANG modules specified in [RFC9418]; they specify objects for | |||
[I-D.ietf-opsawg-service-assurance-yang]; they specify objects for | ||||
assuring network services based on their decomposition into so-called | assuring network services based on their decomposition into so-called | |||
subservices, according to the SAIN architecture. | subservices, according to the SAIN architecture. | |||
These modules are intended for the following use cases: | These modules are intended for the following use cases: | |||
* Assurance graph configuration: | * Assurance graph configuration: | |||
- Subservices: configure a set of subservices to assure, by | - Subservices: Configure a set of subservices to assure by | |||
specifying their types and parameters. | specifying their types and parameters. | |||
- Dependencies: configure the dependencies between the | - Dependencies: Configure the dependencies between the | |||
subservices, along with their types. | subservices, along with their types. | |||
* Assurance telemetry: export the health status of the subservices, | * Assurance telemetry: Export the health status of the subservices, | |||
along with the observed symptoms. | along with the observed symptoms. | |||
Some examples of YANG instances can be found in Appendix A of | Some examples of YANG instances can be found in Appendix A of | |||
[I-D.ietf-opsawg-service-assurance-yang]. | [RFC9418]. | |||
3.6. Handling Maintenance Windows | 3.6. Handling Maintenance Windows | |||
Whenever network components are under maintenance, the operator wants | Whenever network components are under maintenance, the operator wants | |||
to inhibit the emission of symptoms from those components. A typical | to inhibit the emission of symptoms from those components. A typical | |||
use case is device maintenance, during which the device is not | use case is device maintenance, during which the device is not | |||
supposed to be operational. As such, symptoms related to the device | supposed to be operational. As such, symptoms related to the device | |||
health should be ignored. Symptoms related to the device-specific | health should be ignored. Symptoms related to the device-specific | |||
subservices, such as the interfaces, might also be ignored because | subservices, such as the interfaces, might also be ignored because | |||
their state changes are probably the consequence of the maintenance. | their state changes are probably the consequence of the maintenance. | |||
The ietf-service-assurance model proposed in | The ietf-service-assurance model described in [RFC9418] enables | |||
[I-D.ietf-opsawg-service-assurance-yang] enables flagging subservices | flagging subservices as under maintenance and, in that case, requires | |||
as under maintenance, and, in that case, requires a string that | a string that identifies the person or process that requested the | |||
identifies the person or process who requested the maintenance. When | maintenance. When a service or subservice is flagged as under | |||
a service or subservice is flagged as under maintenance, it must | maintenance, it must report a generic "Under Maintenance" symptom for | |||
report a generic "Under Maintenance" symptom, for propagation towards | propagation towards subservices that depend on this specific | |||
subservices that depend on this specific subservice. Any other | subservice. Any other symptom from this service or by one of its | |||
symptom from this service, or by one of its impacting dependencies | impacting dependencies must not be reported. | |||
must not be reported. | ||||
We illustrate this mechanism on three independent examples based on | We illustrate this mechanism on three independent examples based on | |||
the assurance graph depicted in Figure 2: | the assurance graph depicted in Figure 2: | |||
* Device maintenance, for instance upgrading the device OS. The | * Device maintenance, for instance, upgrading the device OS. The | |||
operator flags the subservice "Peer1" device as under maintenance. | operator flags the subservice "Peer1" device as under maintenance. | |||
This inhibits the emission of symptoms, except "Under | This inhibits the emission of symptoms, except "Under Maintenance" | |||
Maintenance", from "Peer1 Physical Interface", "Peer1 Tunnel | from "Peer1 Physical Interface", "Peer1 Tunnel Interface", and | |||
Interface" and "Tunnel Service Instance". All other subservices | "Tunnel Service Instance". All other subservices are unaffected. | |||
are unaffected. | ||||
* Interface maintenance, for instance replacing a broken optic. The | * Interface maintenance, for instance, replacing a broken optic. | |||
operator flags the subservice "Peer1 Physical Interface" as under | The operator flags the subservice "Peer1 Physical Interface" as | |||
maintenance. This inhibits the emission of symptoms, except | under maintenance. This inhibits the emission of symptoms, except | |||
"Under Maintenance" from "Peer1 Tunnel Interface" and "Tunnel | "Under Maintenance" from "Peer1 Tunnel Interface" and "Tunnel | |||
Service Instance". All other subservices are unaffected. | Service Instance". All other subservices are unaffected. | |||
* Routing protocol maintenance, for instance modifying parameters or | * Routing protocol maintenance, for instance, modifying parameters | |||
redistribution. The operator marks the subservice "IS-IS Routing | or redistribution. The operator marks the subservice "IS-IS | |||
Protocol" as under maintenance. This inhibits the emission of | Routing Protocol" as under maintenance. This inhibits the | |||
symptoms, except "Under Maintenance", from "IP connectivity" and | emission of symptoms, except "Under Maintenance" from "IP | |||
"Tunnel Service Instance". All other subservices are unaffected. | connectivity" and "Tunnel Service Instance". All other | |||
subservices are unaffected. | ||||
In each example above, the subservice under maintenance is completely | In each example above, the subservice under maintenance is completely | |||
impacting the service instance, putting it under maintenance as well. | impacting the service instance, putting it under maintenance as well. | |||
There are use cases where the subservice under maintenance only | There are use cases where the subservice under maintenance only | |||
partially impacts the service instance. For instance, consider a | partially impacts the service instance. For instance, consider a | |||
service instance supported by both a primary and backup path. If a | service instance supported by both a primary and backup path. If a | |||
subservice impacting the primary path is under maintenance, the | subservice impacting the primary path is under maintenance, the | |||
service instance might still be functional but degraded. In that | service instance might still be functional but degraded. In that | |||
case, the status of the service instance might include "Primary path | case, the status of the service instance might include "Primary path | |||
Under Maintenance", "No redundancy" as well as other symptoms from | Under Maintenance", "No redundancy", as well as other symptoms from | |||
the backup path to explain the lower health score. In general, the | the backup path to explain the lower health score. In general, the | |||
computation of the service instance status from the subservices is | computation of the service instance status from the subservices is | |||
done in the SAIN collector, whose implementation is out of scope for | done in the SAIN collector, whose implementation is out of scope for | |||
this document. | this document. | |||
The maintenance of a subservice might modify or hide modifications of | The maintenance of a subservice might modify or hide modifications of | |||
the structure of the assurance graph. Therefore, unflagging a | the structure of the assurance graph. Therefore, unflagging a | |||
subservice as under maintenance should trigger an update of the | subservice as under maintenance should trigger an update of the | |||
assurance graph. | assurance graph. | |||
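The propagation of the "Under Maintenance" flag can be sketched as a traversal of the assurance graph. The dependency table below is an assumption reconstructed from the three examples above rather than a copy of Figure 2, the Peer2 side is assumed to be symmetric, and every dependency is treated as fully impacting. The function name is hypothetical.

```python
from typing import Dict, List, Set

# Assumed dependencies: each entry lists what a subservice depends on.
DEPENDS_ON: Dict[str, List[str]] = {
    "Tunnel Service Instance": [
        "Peer1 Tunnel Interface",
        "Peer2 Tunnel Interface",
        "IP connectivity",
    ],
    "Peer1 Tunnel Interface": ["Peer1 Physical Interface"],
    "Peer1 Physical Interface": ["Peer1"],
    "Peer2 Tunnel Interface": ["Peer2 Physical Interface"],
    "Peer2 Physical Interface": ["Peer2"],
    "IP connectivity": ["IS-IS Routing Protocol"],
}


def inhibited_by(flagged: str, depends_on: Dict[str, List[str]] = DEPENDS_ON) -> Set[str]:
    """Return the subservices that depend, directly or not, on `flagged`.

    These report only "Under Maintenance" while `flagged` is under maintenance.
    """
    affected: Set[str] = set()
    frontier = [flagged]
    while frontier:
        current = frontier.pop()
        for parent, children in depends_on.items():
            if current in children and parent not in affected:
                affected.add(parent)
                frontier.append(parent)
    return affected
```

For example, inhibited_by("IS-IS Routing Protocol") yields "IP connectivity" and "Tunnel Service Instance", matching the third example. Partially impacting dependencies, such as the primary and backup path case, would need a different rule.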
3.7. Flexible Functional Architecture | 3.7. Flexible Functional Architecture | |||
The SAIN architecture is flexible in terms of components. While the | The SAIN architecture is flexible in terms of components. While the | |||
SAIN architecture in Figure 1 makes a distinction between two | SAIN architecture in Figure 1 makes a distinction between two | |||
components, the service orchestrator and the SAIN orchestrator, in | components, the service orchestrator and the SAIN orchestrator, in | |||
practice those two components are mostly likely combined. Similarly, | practice the two components are most likely combined. Similarly, the | |||
the SAIN agents are displayed in Figure 1 as being separate | SAIN agents are displayed in Figure 1 as being separate components. | |||
components. Practically, the SAIN agents could be either independent | In practice, the SAIN agents could be either independent components | |||
components or directly integrated in monitored entities. A practical | or directly integrated in monitored entities. A practical example is | |||
example is an agent in a router. | an agent in a router. | |||
The SAIN architecture is also flexible in terms of services and | The SAIN architecture is also flexible in terms of services and | |||
subservices. In the proposed architecture, the SAIN orchestrator is | subservices. In the defined architecture, the SAIN orchestrator is | |||
coupled to a service orchestrator which defines the kinds of services | coupled to a service orchestrator, which defines the kinds of | |||
that the architecture handles. Most examples in this document deal | services that the architecture handles. Most examples in this | |||
with the notion of Network Service YANG modules, with well-known | document deal with the notion of Network Service YANG Modules with | |||
services such as L2VPN or tunnels. However, the concept of services | well-known services, such as L2VPN or tunnels. However, the concept | |||
is general enough to cross into different domains. One of them is | of services is general enough to cross into different domains. One | |||
the domain of service management on network elements, which also | of them is the domain of service management on network elements, | |||
require their own assurance. Examples include a DHCP server on a | which also require their own assurance. Examples include a DHCP | |||
Linux server, a data plane, an IPFIX export, etc. The notion of | server on a Linux server, a data plane, an IPFIX export, etc. The | |||
"service" is generic in this architecture and depends on the service | notion of "service" is generic in this architecture and depends on | |||
orchestrator and underlying network system, as illustrated by the | the service orchestrator and underlying network system, as | |||
following examples: | illustrated by the following examples: | |||
* if a main service orchestrator coordinates several lower level | * If a main service orchestrator coordinates several lower-level | |||
controllers, a service for the controller can be a subservice from | controllers, a service for the controller can be a subservice from | |||
the point of view of the orchestrator. | the point of view of the orchestrator. | |||
* A DHCP server/data plane/IPFIX export can be considered as | * A DHCP server / data plane / IPFIX export can be considered | |||
subservices for a device. | subservices for a device. | |||
* A routing instance can be considered as a subservice for a L3VPN. | * A routing instance can be considered a subservice for an L3VPN. | |||
* A tunnel can be considered as a subservice for an application in | * A tunnel can be considered a subservice for an application in the | |||
the cloud. | cloud. | |||
* A service function can be considered as a subservice for a service | * A service function can be considered a subservice for a service | |||
function chain [RFC7665]. | function chain [RFC7665]. | |||
The assurance graph is created to be flexible and open, regardless of | The assurance graph is created to be flexible and open, regardless of | |||
the subservice types, locations, or domains. | the subservice types, locations, or domains. | |||
The SAIN architecture is also flexible in terms of distributed | The SAIN architecture is also flexible in terms of distributed | |||
graphs. As shown in Figure 1, the architecture comprises several | graphs. As shown in Figure 1, the architecture comprises several | |||
agents. Each agent is responsible for handling a subgraph of the | agents. Each agent is responsible for handling a subgraph of the | |||
assurance graph. The collector is responsible for fetching the sub- | assurance graph. The collector is responsible for fetching the | |||
graphs from the different agents and gluing them together. As an | subgraphs from the different agents and gluing them together. As an | |||
example, in the graph from Figure 2, the subservices relative to Peer | example, in the graph from Figure 2, the subservices relative to Peer | |||
1 might be handled by a different agent than the subservices relative | 1 might be handled by a different agent than the subservices relative | |||
to Peer 2 and the Connectivity and IS-IS subservices might be handled | to Peer 2, and the Connectivity and IS-IS subservices might be | |||
by yet another agent. The agents will export their partial graph and | handled by yet another agent. The agents will export their partial | |||
the collector will stitch them together as dependencies of the | graph, and the collector will stitch them together as dependencies of | |||
service instance. | the service instance. | |||
And finally, the SAIN architecture is flexible in terms of what it | And finally, the SAIN architecture is flexible in terms of what it | |||
monitors. Most, if not all examples, in this document refer to | monitors. Most, if not all, examples in this document refer to | |||
physical components, but this is not a constraint. Indeed, the | physical components, but this is not a constraint. Indeed, the | |||
assurance of virtual components would follow the same principles and | assurance of virtual components would follow the same principles, and | |||
an assurance graph composed of virtualized components (or a mix of | an assurance graph composed of virtualized components (or a mix of | |||
virtualized and physical ones) is supported by this architecture. | virtualized and physical ones) is supported by this architecture. | |||
3.8. Time window for symptoms history | 3.8. Time Window for Symptoms' History | |||
The health status reported via the YANG modules contains, for each | The health status reported via the YANG modules contains, for each | |||
subservice, the list of symptoms. Symptoms have a start and end | subservice, the list of symptoms. Symptoms have a start and end | |||
date, making it possible to report symptoms that are no longer | date, making it possible to report symptoms that are no longer | |||
occurring. | occurring. | |||
The SAIN agent might have to remove some symptoms for specific | The SAIN agent might have to remove some symptoms for specific | |||
subservices, because there are outdated and not relevant any | subservices because they are outdated and no longer relevant | |||
longer, or simply because the SAIN agent needs to free up some space. | or simply because the SAIN agent needs to free up some space. | |||
Regardless of the reason, it's important for a SAIN collector | Regardless of the reason, it's important for a SAIN collector | |||
(re-)connecting to a SAIN agent to understand the effect of this | connecting/reconnecting to a SAIN agent to understand the effect of | |||
garbage collection. | this garbage collection. | |||
Therefore, the SAIN agent contains a YANG object specifying the date | Therefore, the SAIN agent contains a YANG object specifying the date | |||
and time at which the symptoms' history starts for the subservice | and time at which the symptoms' history starts for the subservice | |||
instances. The subservice reports only symptoms that are occurring | instances. The subservice reports only symptoms that are occurring | |||
or that have been occurring after the history start date. | or that have been occurring after the history start date. | |||
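The reporting rule above can be sketched as a simple filter. This is an illustration under assumptions: a symptom without an end date is taken to be still occurring, and the class and function names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class Symptom:
    description: str
    start: datetime
    end: Optional[datetime] = None  # assumption: None means still occurring


def reportable(symptoms: List[Symptom], history_start: datetime) -> List[Symptom]:
    """Keep symptoms still occurring or that occurred after the history start."""
    return [s for s in symptoms if s.end is None or s.end > history_start]
```

A collector that reconnects can compare the history start date with the time of its last contact to tell whether some symptoms may have been garbage collected in between.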
3.9. New Assurance Graph Generation | 3.9. New Assurance Graph Generation | |||
The assurance graph will change over time, because services and | The assurance graph will change over time, because services and | |||
subservices come and go (changing the dependencies between | subservices come and go (changing the dependencies between | |||
subservices), or as a result of resolving maintenance issues. | subservices) or as a result of resolving maintenance issues. | |||
Therefore, an assurance graph version must be maintained, along with | Therefore, an assurance graph version must be maintained, along with | |||
the date and time of its last generation. The date and time of the | the date and time of its last generation. The date and time of the | |||
last change of a subservice instance (again, in its dependencies or | last change of a subservice instance (again, in its dependencies or | |||
maintenance status) might be kept. From a client point of view, an | maintenance status) might be kept. From a client point of view, an | |||
assurance graph change is triggered by the value of the assurance- | assurance graph change is triggered by the value of the assurance- | |||
graph-version and assurance-graph-last-change YANG leaves. At that | graph-version and assurance-graph-last-change YANG leaves. At that | |||
point in time, the client (collector) follows the following process: | point in time, the client (collector) follows the following process: | |||
* Keep the previous assurance-graph-last-change value (let's call it | * Keep the previous assurance-graph-last-change value (let's call it | |||
time T) | time T). | |||
* Run through all subservice instances and process the subservice | * Run through all the subservice instances and process the | |||
instances for which the last-change is newer that the time T | subservice instances for which the last-change is newer than the | |||
time T. | ||||
* Keep the new assurance-graph-last-change as the new referenced | * Keep the new assurance-graph-last-change as the new referenced | |||
date and time | date and time. | |||
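The three collector steps above can be sketched in Python as follows. This is a minimal illustration, not part of either document version: the class names (SubserviceInstance, AssuranceGraph, Collector) and the sync method are hypothetical, and only the leaf names assurance-graph-version, assurance-graph-last-change, and last-change come from the text. The sketch assumes the collector has already retrieved the graph, for example through a YANG-based query.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class SubserviceInstance:
    # Hypothetical local mirror of one subservice instance.
    subservice_id: str
    last_change: datetime  # mirrors the per-subservice "last-change" leaf


@dataclass
class AssuranceGraph:
    # Hypothetical local mirror of the graph-level leaves.
    version: int            # mirrors "assurance-graph-version"
    last_change: datetime   # mirrors "assurance-graph-last-change"
    subservices: List[SubserviceInstance]


class Collector:
    """Tracks the reference time T between two retrievals of the graph."""

    def __init__(self) -> None:
        self.known_version: Optional[int] = None
        self.reference_time: Optional[datetime] = None

    def sync(self, graph: AssuranceGraph) -> List[str]:
        """Return the ids of the subservice instances to (re)process."""
        # No change is signalled if neither graph-level leaf has moved.
        if (graph.version == self.known_version
                and graph.last_change == self.reference_time):
            return []

        # Step 1: keep the previous assurance-graph-last-change value (T).
        previous = self.reference_time

        # Step 2: select the instances whose last-change is newer than T.
        # On the very first retrieval there is no T, so all are selected.
        changed = [
            s.subservice_id
            for s in graph.subservices
            if previous is None or s.last_change > previous
        ]

        # Step 3: keep the new value as the new reference date and time.
        self.known_version = graph.version
        self.reference_time = graph.last_change
        return changed
```

Comparing each last-change against T, rather than diffing the whole graph, keeps the per-update work proportional to the number of subservice instances and avoids reprocessing the instances that did not change.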
4. Security Considerations | 4. IANA Considerations | |||
This document has no IANA actions. | ||||
5. Security Considerations | ||||
The SAIN architecture helps operators to reduce the mean time to | The SAIN architecture helps operators to reduce the mean time to | |||
detect and mean time to repair. However, the SAIN agents must be | detect and the mean time to repair. However, the SAIN agents must be | |||
secured: a compromised SAIN agent may be sending wrong root causes or | secured; a compromised SAIN agent may be sending incorrect root | |||
symptoms to the management systems. Securing the agents falls back | causes or symptoms to the management systems. Securing the agents | |||
to ensuring the integrity and confidentiality of the assurance graph. | falls back to ensuring the integrity and confidentiality of the | |||
This can be partially achieved by correctly setting permissions of | assurance graph. This can be partially achieved by correctly setting | |||
each node in the YANG model as described in Section 6 of | permissions of each node in the YANG data model, as described in | |||
[I-D.ietf-opsawg-service-assurance-yang]. | Section 6 of [RFC9418]. | |||
Except for the configuration of telemetry, the agents do not need | Except for the configuration of telemetry, the agents do not need | |||
"write access" to the devices they monitor. This configuration is | "write access" to the devices they monitor. This configuration is | |||
applied with a YANG module, whose protection is covered by Secure | applied with a YANG module, whose protection is covered by Secure | |||
Shell (SSH) [RFC6242] for NETCONF or TLS [RFC8446] for RESTCONF. | Shell (SSH) [RFC6242] for the Network Configuration Protocol | |||
Devices should be configured so that agents have their own | (NETCONF) or TLS [RFC8446] for RESTCONF. Devices should be | |||
credentials with write access only for the YANG nodes configuring the | configured so that agents have their own credentials with write | |||
telemetry. | access only for the YANG nodes configuring the telemetry. | |||
The data collected by SAIN could potentially be compromising to the | The data collected by SAIN could potentially be compromising to the | |||
network or provide more insight into how the network is designed. | network or provide more insight into how the network is designed. | |||
Considering the data that SAIN requires (including CLI access in some | Considering the data that SAIN requires (including CLI access in some | |||
cases), one should weigh data access concerns with the impact that | cases), one should weigh data access concerns with the impact that | |||
reduced visibility will have on being able to rapidly identify root | reduced visibility will have on being able to rapidly identify root | |||
causes. | causes. | |||
For building the assurance graph, the SAIN orchestrator needs to | For building the assurance graph, the SAIN orchestrator needs to | |||
obtain the configuration from the service orchestrator. The latter | obtain the configuration from the service orchestrator. The latter | |||
should restrict access of the SAIN orchestrator to information needed | should restrict access of the SAIN orchestrator to information needed | |||
to build the assurance graph. | to build the assurance graph. | |||
If a closed loop system relies on this architecture then the well | If a closed loop system relies on this architecture, then the well- | |||
known issue of those systems also applies, i.e., a lying device or | known issue of those systems also applies, i.e., a lying device or | |||
compromised agent could trigger partial reconfiguration of the | compromised agent could trigger partial reconfiguration of the | |||
service or network. The SAIN architecture neither augments nor | service or network. The SAIN architecture neither augments nor | |||
reduces this risk. An extension of SAIN, out of scope for this | reduces this risk. An extension of SAIN, which is out of scope for | |||
document, could detect discrepancies between symptoms reported by | this document, could detect discrepancies between symptoms reported | |||
different agents and thus detect anomalies if an agent or a device is | by different agents, and thus detect anomalies if an agent or a | |||
lying. | device is lying. | |||
If NTP service goes down, the devices clocks might lose their | If NTP service goes down, the devices clocks might lose their | |||
synchronization. In that case, correlating information from | synchronization. In that case, correlating information from | |||
different devices, such as detecting symptoms about a link or | different devices, such as detecting symptoms about a link or | |||
correlating symptoms from different devices, will give inaccurate | correlating symptoms from different devices, will give inaccurate | |||
results. | results. | |||
5. IANA Considerations | 6. References | |||
This document includes no request to IANA. | ||||
6. Contributors | ||||
* Youssef El Fathi | ||||
* Eric Vyncke | ||||
7. References | ||||
7.1. Normative References | ||||
[I-D.ietf-opsawg-service-assurance-yang] | 6.1. Normative References | |||
Claise, B., Quilbeuf, J., Lucente, P., Fasano, P., and T. | ||||
Arumugam, "YANG Modules for Service Assurance", Work in | ||||
Progress, Internet-Draft, draft-ietf-opsawg-service- | ||||
assurance-yang-10, 28 November 2022, | ||||
<https://www.ietf.org/archive/id/draft-ietf-opsawg- | ||||
service-assurance-yang-10.txt>. | ||||
[RFC8309] Wu, Q., Liu, W., Farrel, A., and RFC Publisher, "Service | [RFC8309] Wu, Q., Liu, W., and A. Farrel, "Service Models | |||
Models Explained", RFC 8309, DOI 10.17487/RFC8309, January | Explained", RFC 8309, DOI 10.17487/RFC8309, January 2018, | |||
2018, <https://www.rfc-editor.org/info/rfc8309>. | <https://www.rfc-editor.org/info/rfc8309>. | |||
[RFC8969] Wu, Q., Ed., Boucadair, M., Ed., Lopez, D., Xie, C., Geng, | [RFC8969] Wu, Q., Ed., Boucadair, M., Ed., Lopez, D., Xie, C., and | |||
L., and RFC Publisher, "A Framework for Automating Service | L. Geng, "A Framework for Automating Service and Network | |||
and Network Management with YANG", RFC 8969, | Management with YANG", RFC 8969, DOI 10.17487/RFC8969, | |||
DOI 10.17487/RFC8969, January 2021, | January 2021, <https://www.rfc-editor.org/info/rfc8969>. | |||
<https://www.rfc-editor.org/info/rfc8969>. | ||||
7.2. Informative References | [RFC9418] Claise, B., Quilbeuf, J., Lucente, P., Fasano, P., and T. | |||
Arumugam, "YANG Modules for Service Assurance", RFC 9418, | ||||
DOI 10.17487/RFC9418, June 2023, | ||||
<https://www.rfc-editor.org/info/rfc9418>. | ||||
[I-D.ietf-opsawg-yang-vpn-service-pm] | 6.2. Informative References | |||
Wu, B., Wu, Q., Boucadair, M., de Dios, O. G., and B. Wen, | ||||
"A YANG Model for Network and VPN Service Performance | ||||
Monitoring", Work in Progress, Internet-Draft, draft-ietf- | ||||
opsawg-yang-vpn-service-pm-15, 11 November 2022, | ||||
<https://www.ietf.org/archive/id/draft-ietf-opsawg-yang- | ||||
vpn-service-pm-15.txt>. | ||||
[OpenConfig] | [OpenConfig] | |||
"OpenConfig", <https://openconfig.net>. | "OpenConfig", <https://openconfig.net>. | |||
[Piovesan2017] | [Piovesan2017] | |||
Piovesan, A. and E. Griffor, "Reasoning About Safety and | Piovesan, A. and E. Griffor, "7 - Reasoning About Safety | |||
Security: The Logic of Assurance", 2017, | and Security: The Logic of Assurance", | |||
DOI 10.1016/B978-0-12-803773-7.00007-3, 2017, | ||||
<https://doi.org/10.1016/B978-0-12-803773-7.00007-3>. | <https://doi.org/10.1016/B978-0-12-803773-7.00007-3>. | |||
[RFC2865] Rigney, C., Willens, S., Rubens, A., Simpson, W., and RFC | [RFC2865] Rigney, C., Willens, S., Rubens, A., and W. Simpson, | |||
Publisher, "Remote Authentication Dial In User Service | "Remote Authentication Dial In User Service (RADIUS)", | |||
(RADIUS)", RFC 2865, DOI 10.17487/RFC2865, June 2000, | RFC 2865, DOI 10.17487/RFC2865, June 2000, | |||
<https://www.rfc-editor.org/info/rfc2865>. | <https://www.rfc-editor.org/info/rfc2865>. | |||
[RFC5424] Gerhards, R. and RFC Publisher, "The Syslog Protocol", | [RFC5424] Gerhards, R., "The Syslog Protocol", RFC 5424, | |||
RFC 5424, DOI 10.17487/RFC5424, March 2009, | DOI 10.17487/RFC5424, March 2009, | |||
<https://www.rfc-editor.org/info/rfc5424>. | <https://www.rfc-editor.org/info/rfc5424>. | |||
[RFC5905] Mills, D., Martin, J., Ed., Burbank, J., Kasch, W., and | [RFC5905] Mills, D., Martin, J., Ed., Burbank, J., and W. Kasch, | |||
RFC Publisher, "Network Time Protocol Version 4: Protocol | "Network Time Protocol Version 4: Protocol and Algorithms | |||
and Algorithms Specification", RFC 5905, | Specification", RFC 5905, DOI 10.17487/RFC5905, June 2010, | |||
DOI 10.17487/RFC5905, June 2010, | ||||
<https://www.rfc-editor.org/info/rfc5905>. | <https://www.rfc-editor.org/info/rfc5905>. | |||
[RFC6242] Wasserman, M. and RFC Publisher, "Using the NETCONF | [RFC6242] Wasserman, M., "Using the NETCONF Protocol over Secure | |||
Protocol over Secure Shell (SSH)", RFC 6242, | Shell (SSH)", RFC 6242, DOI 10.17487/RFC6242, June 2011, | |||
DOI 10.17487/RFC6242, June 2011, | ||||
<https://www.rfc-editor.org/info/rfc6242>. | <https://www.rfc-editor.org/info/rfc6242>. | |||
[RFC7011] Claise, B., Ed., Trammell, B., Ed., Aitken, P., and RFC | [RFC7011] Claise, B., Ed., Trammell, B., Ed., and P. Aitken, | |||
Publisher, "Specification of the IP Flow Information | "Specification of the IP Flow Information Export (IPFIX) | |||
Export (IPFIX) Protocol for the Exchange of Flow | Protocol for the Exchange of Flow Information", STD 77, | |||
Information", STD 77, RFC 7011, DOI 10.17487/RFC7011, | RFC 7011, DOI 10.17487/RFC7011, September 2013, | |||
September 2013, <https://www.rfc-editor.org/info/rfc7011>. | <https://www.rfc-editor.org/info/rfc7011>. | |||
[RFC7149] Boucadair, M., Jacquenet, C., and RFC Publisher, | [RFC7149] Boucadair, M. and C. Jacquenet, "Software-Defined | |||
"Software-Defined Networking: A Perspective from within a | Networking: A Perspective from within a Service Provider | |||
Service Provider Environment", RFC 7149, | Environment", RFC 7149, DOI 10.17487/RFC7149, March 2014, | |||
DOI 10.17487/RFC7149, March 2014, | ||||
<https://www.rfc-editor.org/info/rfc7149>. | <https://www.rfc-editor.org/info/rfc7149>. | |||
[RFC7665] Halpern, J., Ed., Pignataro, C., Ed., and RFC Publisher, | [RFC7665] Halpern, J., Ed. and C. Pignataro, Ed., "Service Function | |||
"Service Function Chaining (SFC) Architecture", RFC 7665, | Chaining (SFC) Architecture", RFC 7665, | |||
DOI 10.17487/RFC7665, October 2015, | DOI 10.17487/RFC7665, October 2015, | |||
<https://www.rfc-editor.org/info/rfc7665>. | <https://www.rfc-editor.org/info/rfc7665>. | |||
[RFC7950] Bjorklund, M., Ed. and RFC Publisher, "The YANG 1.1 Data | [RFC7950] Bjorklund, M., Ed., "The YANG 1.1 Data Modeling Language", | |||
Modeling Language", RFC 7950, DOI 10.17487/RFC7950, August | RFC 7950, DOI 10.17487/RFC7950, August 2016, | |||
2016, <https://www.rfc-editor.org/info/rfc7950>. | <https://www.rfc-editor.org/info/rfc7950>. | |||
[RFC8199] Bogdanovic, D., Claise, B., Moberg, C., and RFC Publisher, | [RFC8199] Bogdanovic, D., Claise, B., and C. Moberg, "YANG Module | |||
"YANG Module Classification", RFC 8199, | Classification", RFC 8199, DOI 10.17487/RFC8199, July | |||
DOI 10.17487/RFC8199, July 2017, | 2017, <https://www.rfc-editor.org/info/rfc8199>. | |||
<https://www.rfc-editor.org/info/rfc8199>. | ||||
[RFC8446] Rescorla, E. and RFC Publisher, "The Transport Layer | [RFC8446] Rescorla, E., "The Transport Layer Security (TLS) Protocol | |||
Security (TLS) Protocol Version 1.3", RFC 8446, | Version 1.3", RFC 8446, DOI 10.17487/RFC8446, August 2018, | |||
DOI 10.17487/RFC8446, August 2018, | ||||
<https://www.rfc-editor.org/info/rfc8446>. | <https://www.rfc-editor.org/info/rfc8446>. | |||
[RFC8466] Wen, B., Fioccola, G., Ed., Xie, C., Jalil, L., and RFC | [RFC8466] Wen, B., Fioccola, G., Ed., Xie, C., and L. Jalil, "A YANG | |||
Publisher, "A YANG Data Model for Layer 2 Virtual Private | Data Model for Layer 2 Virtual Private Network (L2VPN) | |||
Network (L2VPN) Service Delivery", RFC 8466, | Service Delivery", RFC 8466, DOI 10.17487/RFC8466, October | |||
DOI 10.17487/RFC8466, October 2018, | 2018, <https://www.rfc-editor.org/info/rfc8466>. | |||
<https://www.rfc-editor.org/info/rfc8466>. | ||||
[RFC8641] Clemm, A., Voit, E., and RFC Publisher, "Subscription to | [RFC8641] Clemm, A. and E. Voit, "Subscription to YANG Notifications | |||
YANG Notifications for Datastore Updates", RFC 8641, | for Datastore Updates", RFC 8641, DOI 10.17487/RFC8641, | |||
DOI 10.17487/RFC8641, September 2019, | September 2019, <https://www.rfc-editor.org/info/rfc8641>. | |||
<https://www.rfc-editor.org/info/rfc8641>. | ||||
[RFC8907] Dahm, T., Ota, A., Medway Gash, D.C., Carrel, D., Grant, | [RFC8907] Dahm, T., Ota, A., Medway Gash, D.C., Carrel, D., and L. | |||
L., and RFC Publisher, "The Terminal Access Controller | Grant, "The Terminal Access Controller Access-Control | |||
Access-Control System Plus (TACACS+) Protocol", RFC 8907, | System Plus (TACACS+) Protocol", RFC 8907, | |||
DOI 10.17487/RFC8907, September 2020, | DOI 10.17487/RFC8907, September 2020, | |||
<https://www.rfc-editor.org/info/rfc8907>. | <https://www.rfc-editor.org/info/rfc8907>. | |||
[RFC9315] Clemm, A., Ciavaglia, L., Granville, L. Z., Tantsura, J., | [RFC9315] Clemm, A., Ciavaglia, L., Granville, L. Z., and J. | |||
and RFC Publisher, "Intent-Based Networking - Concepts and | Tantsura, "Intent-Based Networking - Concepts and | |||
Definitions", RFC 9315, DOI 10.17487/RFC9315, October | Definitions", RFC 9315, DOI 10.17487/RFC9315, October | |||
2022, <https://www.rfc-editor.org/info/rfc9315>. | 2022, <https://www.rfc-editor.org/info/rfc9315>. | |||
Appendix A. Changes between revisions | [RFC9375] Wu, B., Ed., Wu, Q., Ed., Boucadair, M., Ed., Gonzalez de | |||
Dios, O., and B. Wen, "A YANG Data Model for Network and | ||||
[[RFC editor: please remove this section before publication.]] | VPN Service Performance Monitoring", RFC 9375, | |||
DOI 10.17487/RFC9375, April 2023, | ||||
v12 - 13 | <https://www.rfc-editor.org/info/rfc9375>. | |||
* Addressing IESG telechat feedback | ||||
v11 - 12 | ||||
* Addressing comments from Last call | ||||
v10 - v11 | ||||
* Adding reference to example of network performance model | ||||
v09 - v10 | ||||
* Addressing comments from Rob Wilton | ||||
v08 - v09 | ||||
* Addressing comments from Michael Richardson | ||||
v07 - v08 | ||||
* Propagating removal of under-maintenance flag from the YANG module | ||||
v06-07 | ||||
Addressing comments from Dhruv Dhody and applying pending changes | ||||
v03 - v04 | ||||
* Address comments from Mohamed Boucadair | ||||
v00 - v01 | ||||
* Cover the feedback received during the WG call for adoption | ||||
Acknowledgements | Acknowledgements | |||
The authors would like to thank Stephane Litkowski, Charles Eckel, | The authors would like to thank Stephane Litkowski, Charles Eckel, | |||
Rob Wilton, Vladimir Vassiliev, Gustavo Alburquerque, Stefan Vallin, | Rob Wilton, Vladimir Vassiliev, Gustavo Alburquerque, Stefan Vallin, | |||
Eric Vyncke, Mohamed Boucadair, Dhruv Dhody, Michael Richardson and | Éric Vyncke, Mohamed Boucadair, Dhruv Dhody, Michael Richardson, and | |||
Rob Wilton for their reviews and feedback. | Rob Wilton for their reviews and feedback. | |||
Contributors | ||||
* Youssef El Fathi | ||||
* Éric Vyncke | ||||
Authors' Addresses | Authors' Addresses | |||
Benoit Claise | Benoit Claise | |||
Huawei | Huawei | |||
Email: benoit.claise@huawei.com | Email: benoit.claise@huawei.com | |||
Jean Quilbeuf | Jean Quilbeuf | |||
Huawei | Huawei | |||
Email: jean.quilbeuf@huawei.com | Email: jean.quilbeuf@huawei.com | |||
Diego R. Lopez | Diego R. Lopez | |||
Telefonica I+D | Telefonica I+D | |||
Don Ramon de la Cruz, 82 | Don Ramon de la Cruz, 82 | |||
Madrid 28006 | 28006 Madrid | |||
Spain | Spain | |||
Email: diego.r.lopez@telefonica.com | Email: diego.r.lopez@telefonica.com | |||
Dan Voyer | Dan Voyer | |||
Bell Canada | Bell Canada | |||
Canada | Canada | |||
Email: daniel.voyer@bell.ca | Email: daniel.voyer@bell.ca | |||
Thangam Arumugam | Thangam Arumugam | |||
Cisco Systems, Inc. | Consultant | |||
Milpitas (California), | Milpitas, California | |||
United States of America | United States of America | |||
Email: tarumuga@cisco.com | Email: thangavelu@yahoo.com | |||
End of changes. 151 change blocks. | ||||
591 lines changed or deleted | 545 lines changed or added | |||