rfc9485.original | rfc9485.txt | |||
---|---|---|---|---|
Network Working Group C. Bormann | Internet Engineering Task Force (IETF) C. Bormann | |||
Internet-Draft Universität Bremen TZI | Request for Comments: 9485 Universität Bremen TZI | |||
Intended status: Standards Track T. Bray | Category: Standards Track T. Bray | |||
Expires: 31 December 2023 Textuality | ISSN: 2070-1721 Textuality | |||
29 June 2023 | October 2023 | |||
I-Regexp: An Interoperable Regexp Format | I-Regexp: An Interoperable Regular Expression Format | |||
draft-ietf-jsonpath-iregexp-08 | ||||
Abstract | Abstract | |||
This document specifies I-Regexp, a flavor of regular expressions | This document specifies I-Regexp, a flavor of regular expression that | |||
that is limited in scope with the goal of interoperation across many | is limited in scope with the goal of interoperation across many | |||
different regular-expression libraries. | different regular expression libraries. | |||
About This Document | ||||
This note is to be removed before publishing as an RFC. | ||||
Status information for this document may be found at | ||||
https://datatracker.ietf.org/doc/draft-ietf-jsonpath-iregexp/. | ||||
Discussion of this document takes place on the JSONPath Working Group | ||||
mailing list (mailto:JSONPath@ietf.org), which is archived at | ||||
https://mailarchive.ietf.org/arch/browse/JSONPath/. Subscribe at | ||||
https://www.ietf.org/mailman/listinfo/JSONPath/. | ||||
Source for this draft and an issue tracker can be found at | ||||
https://github.com/ietf-wg-jsonpath/iregexp. | ||||
Status of This Memo | Status of This Memo | |||
This Internet-Draft is submitted in full conformance with the | This is an Internet Standards Track document. | |||
provisions of BCP 78 and BCP 79. | ||||
Internet-Drafts are working documents of the Internet Engineering | ||||
Task Force (IETF). Note that other groups may also distribute | ||||
working documents as Internet-Drafts. The list of current Internet- | ||||
Drafts is at https://datatracker.ietf.org/drafts/current/. | ||||
Internet-Drafts are draft documents valid for a maximum of six months | This document is a product of the Internet Engineering Task Force | |||
and may be updated, replaced, or obsoleted by other documents at any | (IETF). It represents the consensus of the IETF community. It has | |||
time. It is inappropriate to use Internet-Drafts as reference | received public review and has been approved for publication by the | |||
material or to cite them other than as "work in progress." | Internet Engineering Steering Group (IESG). Further information on | |||
Internet Standards is available in Section 2 of RFC 7841. | ||||
This Internet-Draft will expire on 31 December 2023. | Information about the current status of this document, any errata, | |||
and how to provide feedback on it may be obtained at | ||||
https://www.rfc-editor.org/info/rfc9485. | ||||
Copyright Notice | Copyright Notice | |||
Copyright (c) 2023 IETF Trust and the persons identified as the | Copyright (c) 2023 IETF Trust and the persons identified as the | |||
document authors. All rights reserved. | document authors. All rights reserved. | |||
This document is subject to BCP 78 and the IETF Trust's Legal | This document is subject to BCP 78 and the IETF Trust's Legal | |||
Provisions Relating to IETF Documents (https://trustee.ietf.org/ | Provisions Relating to IETF Documents | |||
license-info) in effect on the date of publication of this document. | (https://trustee.ietf.org/license-info) in effect on the date of | |||
Please review these documents carefully, as they describe your rights | publication of this document. Please review these documents | |||
and restrictions with respect to this document. Code Components | carefully, as they describe your rights and restrictions with respect | |||
extracted from this document must include Revised BSD License text as | to this document. Code Components extracted from this document must | |||
described in Section 4.e of the Trust Legal Provisions and are | include Revised BSD License text as described in Section 4.e of the | |||
provided without warranty as described in the Revised BSD License. | Trust Legal Provisions and are provided without warranty as described | |||
in the Revised BSD License. | ||||
Table of Contents | Table of Contents | |||
1. Introduction | 1. Introduction | |||
1.1. Terminology | 1.1. Terminology | |||
2. Objectives | 2. Objectives | |||
3. I-Regexp Syntax | 3. I-Regexp Syntax | |||
3.1. Checking Implementations | 3.1. Checking Implementations | |||
4. I-Regexp Semantics | 4. I-Regexp Semantics | |||
5. Mapping I-Regexp to Regexp Dialects | 5. Mapping I-Regexp to Regexp Dialects | |||
5.1. Multi-Character Escapes | 5.1. Multi-Character Escapes | |||
5.2. XSD Regexps | 5.2. XSD Regexps | |||
5.3. ECMAScript Regexps | 5.3. ECMAScript Regexps | |||
5.4. PCRE, RE2, Ruby Regexps | 5.4. PCRE, RE2, and Ruby Regexps | |||
6. Motivation and Background | 6. Motivation and Background | |||
6.1. Implementing I-Regexp | 6.1. Implementing I-Regexp | |||
7. IANA Considerations | 7. IANA Considerations | |||
8. Security considerations | 8. Security Considerations | |||
9. References | 9. References | |||
9.1. Normative References | 9.1. Normative References | |||
9.2. Informative References | 9.2. Informative References | |||
Appendix A. Regexps and Similar Constructs in Recent Published | Acknowledgements | |||
RFCs | Authors' Addresses | |||
Acknowledgements | ||||
Authors' Addresses | ||||
1. Introduction | 1. Introduction | |||
This specification describes an interoperable regular expression | This specification describes an interoperable regular expression | |||
("regexp") flavor, I-Regexp. | (abbreviated as "regexp") flavor, I-Regexp. | |||
I-Regexp does not provide advanced regular expression features such | I-Regexp does not provide advanced regular expression features such | |||
as capture groups, lookahead, or backreferences. It supports only a | as capture groups, lookahead, or backreferences. It supports only a | |||
Boolean matching capability, i.e., testing whether a given regular | Boolean matching capability, i.e., testing whether a given regular | |||
expression matches a given piece of text. | expression matches a given piece of text. | |||
I-Regexp supports the entire repertoire of Unicode characters | I-Regexp supports the entire repertoire of Unicode characters | |||
(Unicode scalar values); both the I-Regexp strings themselves and the | (Unicode scalar values); both the I-Regexp strings themselves and the | |||
strings they are matched against are sequences of Unicode scalar | strings they are matched against are sequences of Unicode scalar | |||
values (often represented in UTF-8 encoding form [STD63] for | values (often represented in UTF-8 encoding form [RFC3629] for | |||
interchange). | interchange). | |||
I-Regexp is a subset of XSD regular expressions [XSD-2]. | I-Regexp is a subset of XML Schema Definition (XSD) regular | |||
expressions [XSD-2]. | ||||
This document includes guidance for converting I-Regexps for use with | This document includes guidance for converting I-Regexps for use with | |||
several well-known regular expression idioms. | several well-known regular expression idioms. | |||
The development of I-Regexp was motivated by the work of the JSONPath | The development of I-Regexp was motivated by the work of the JSONPath | |||
Working Group. The Working Group wanted to include in its | Working Group (WG). The WG wanted to include support for the use of | |||
specification [I-D.ietf-jsonpath-base] support for the use of regular | regular expressions in JSONPath filters in its specification | |||
expressions in JSONPath filters, but was unable to find a useful | [JSONPATH-BASE], but was unable to find a useful specification for | |||
specification for regular expressions which would be interoperable | regular expressions that would be interoperable across the popular | |||
across the popular libraries. | libraries. | |||
1.1. Terminology | 1.1. Terminology | |||
This document uses the abbreviation "regexp" for what are usually | This document uses the abbreviation "regexp" for what is usually | |||
called regular expressions in programming. "I-Regexp" is used as a | called a "regular expression" in programming. The term "I-Regexp" is | |||
noun meaning a character string (sequence of Unicode scalar values) | used as a noun meaning a character string (sequence of Unicode scalar | |||
that conforms to the requirements in this specification; the plural | values) that conforms to the requirements in this specification; the | |||
is "I-Regexps". | plural is "I-Regexps". | |||
This specification uses Unicode terminology. A good entry point into | This specification uses Unicode terminology; a good entry point is | |||
that is provided by [UNICODE-GLOSSARY]. | provided by [UNICODE-GLOSSARY]. | |||
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", | The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", | |||
"SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and | "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and | |||
"OPTIONAL" in this document are to be interpreted as described in | "OPTIONAL" in this document are to be interpreted as described in | |||
BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all | BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all | |||
capitals, as shown here. | capitals, as shown here. | |||
The grammatical rules in this document are to be interpreted as ABNF, | The grammatical rules in this document are to be interpreted as ABNF, | |||
as described in [RFC5234] and [RFC7405], where the "characters" of | as described in [RFC5234] and [RFC7405], where the "characters" of | |||
Section 2.3 of [RFC5234] are Unicode scalar values. | Section 2.3 of [RFC5234] are Unicode scalar values. | |||
2. Objectives | 2. Objectives | |||
I-Regexps should handle the vast majority of practical cases where a | I-Regexps should handle the vast majority of practical cases where a | |||
matching regexp is needed in a data model specification or a query | matching regexp is needed in a data-model specification or a query- | |||
language expression. | language expression. | |||
The editors of this document conducted a survey of the regexp syntax | At the time of writing, an editor of this document conducted a survey | |||
used in published RFCs. All examples found there should be covered | of the regexp syntax used in recently published RFCs. All examples | |||
by I-Regexps, both syntactically and with their intended semantics. | found there should be covered by I-Regexps, both syntactically and | |||
The exception is the use of multi-character escapes, for which | with their intended semantics. The exception is the use of multi- | |||
workaround guidance is provided in Section 5. | character escapes, for which workaround guidance is provided in | |||
Section 5. | ||||
3. I-Regexp Syntax | 3. I-Regexp Syntax | |||
An I-Regexp MUST conform to the ABNF specification in Figure 1. | An I-Regexp MUST conform to the ABNF specification in Figure 1. | |||
i-regexp = branch *( "|" branch ) | i-regexp = branch *( "|" branch ) | |||
branch = *piece | branch = *piece | |||
piece = atom [ quantifier ] | piece = atom [ quantifier ] | |||
quantifier = ( "*" / "+" / "?" ) / range-quantifier | quantifier = ( "*" / "+" / "?" ) / range-quantifier | |||
range-quantifier = "{" QuantExact [ "," [ QuantExact ] ] "}" | range-quantifier = "{" QuantExact [ "," [ QuantExact ] ] "}" | |||
QuantExact = 1*%x30-39 ; '0'-'9' | QuantExact = 1*%x30-39 ; '0'-'9' | |||
atom = NormalChar / charClass / ( "(" i-regexp ")" ) | atom = NormalChar / charClass / ( "(" i-regexp ")" ) | |||
NormalChar = ( %x00-27 / "," / "-" / %x2F-3E ; '/'-'>' | NormalChar = ( %x00-27 / "," / "-" / %x2F-3E ; '/'-'>' | |||
/ %x40-5A ; '@'-'Z' | / %x40-5A ; '@'-'Z' | |||
/ %x5E-7A ; '^'-'z' | / %x5E-7A ; '^'-'z' | |||
/ %x7E-10FFFF ) | / %x7E-D7FF ; skip surrogate code points | |||
/ %xE000-10FFFF ) | ||||
charClass = "." / SingleCharEsc / charClassEsc / charClassExpr | charClass = "." / SingleCharEsc / charClassEsc / charClassExpr | |||
SingleCharEsc = "\" ( %x28-2B ; '('-'+' | SingleCharEsc = "\" ( %x28-2B ; '('-'+' | |||
/ "-" / "." / "?" / %x5B-5E ; '['-'^' | / "-" / "." / "?" / %x5B-5E ; '['-'^' | |||
/ %s"n" / %s"r" / %s"t" / %x7B-7D ; '{'-'}' | / %s"n" / %s"r" / %s"t" / %x7B-7D ; '{'-'}' | |||
) | ) | |||
charClassEsc = catEsc / complEsc | charClassEsc = catEsc / complEsc | |||
charClassExpr = "[" [ "^" ] ( "-" / CCE1 ) *CCE1 [ "-" ] "]" | charClassExpr = "[" [ "^" ] ( "-" / CCE1 ) *CCE1 [ "-" ] "]" | |||
CCE1 = ( CCchar [ "-" CCchar ] ) / charClassEsc | CCE1 = ( CCchar [ "-" CCchar ] ) / charClassEsc | |||
CCchar = ( %x00-2C / %x2E-5A ; '.'-'Z' | CCchar = ( %x00-2C / %x2E-5A ; '.'-'Z' | |||
/ %x5E-10FFFF ) / SingleCharEsc | / %x5E-D7FF ; skip surrogate code points | |||
/ %xE000-10FFFF ) / SingleCharEsc | ||||
catEsc = %s"\p{" charProp "}" | catEsc = %s"\p{" charProp "}" | |||
complEsc = %s"\P{" charProp "}" | complEsc = %s"\P{" charProp "}" | |||
charProp = IsCategory | charProp = IsCategory | |||
IsCategory = Letters / Marks / Numbers / Punctuation / Separators / | IsCategory = Letters / Marks / Numbers / Punctuation / Separators / | |||
Symbols / Others | Symbols / Others | |||
Letters = %s"L" [ ( %s"l" / %s"m" / %s"o" / %s"t" / %s"u" ) ] | Letters = %s"L" [ ( %s"l" / %s"m" / %s"o" / %s"t" / %s"u" ) ] | |||
Marks = %s"M" [ ( %s"c" / %s"e" / %s"n" ) ] | Marks = %s"M" [ ( %s"c" / %s"e" / %s"n" ) ] | |||
Numbers = %s"N" [ ( %s"d" / %s"l" / %s"o" ) ] | Numbers = %s"N" [ ( %s"d" / %s"l" / %s"o" ) ] | |||
Punctuation = %s"P" [ ( %x63-66 ; 'c'-'f' | Punctuation = %s"P" [ ( %x63-66 ; 'c'-'f' | |||
/ %s"i" / %s"o" / %s"s" ) ] | / %s"i" / %s"o" / %s"s" ) ] | |||
Separators = %s"Z" [ ( %s"l" / %s"p" / %s"s" ) ] | Separators = %s"Z" [ ( %s"l" / %s"p" / %s"s" ) ] | |||
Symbols = %s"S" [ ( %s"c" / %s"k" / %s"m" / %s"o" ) ] | Symbols = %s"S" [ ( %s"c" / %s"k" / %s"m" / %s"o" ) ] | |||
Others = %s"C" [ ( %s"c" / %s"f" / %s"n" / %s"o" ) ] | Others = %s"C" [ ( %s"c" / %s"f" / %s"n" / %s"o" ) ] | |||
Figure 1: I-Regexp Syntax in ABNF | Figure 1: I-Regexp Syntax in ABNF | |||
As an additional restriction, charClassExpr is not allowed to match | As an additional restriction, charClassExpr is not allowed to match | |||
[^], which according to this grammar would parse as a positive | [^], which, according to this grammar, would parse as a positive | |||
character class containing the single character ^. | character class containing the single character ^. | |||
This is essentially XSD regexp without character class subtraction, | This is essentially an XSD regexp without: | |||
without multi-character escapes such as \s, \S, and \w, and without | ||||
Unicode blocks. | * character class subtraction, | |||
* multi-character escapes such as \s, \S, and \w, and | ||||
* Unicode blocks. | ||||
An I-Regexp implementation MUST be a complete implementation of this | An I-Regexp implementation MUST be a complete implementation of this | |||
limited subset. In particular, full support for the Unicode | limited subset. In particular, full support for the Unicode | |||
functionality defined in this specification is REQUIRED; the | functionality defined in this specification is REQUIRED. The | |||
implementation MUST NOT limit itself to 7- or 8-bit character sets | implementation: | |||
such as ASCII and MUST support the Unicode character property set in | ||||
character classes. | * MUST NOT limit itself to 7- or 8-bit character sets such as ASCII, | |||
and | ||||
* MUST support the Unicode character property set in character | ||||
classes. | ||||
3.1. Checking Implementations | 3.1. Checking Implementations | |||
A _checking_ I-Regexp implementation is one that checks a supplied | A _checking_ I-Regexp implementation is one that checks a supplied | |||
regexp for compliance with this specification and reports any | regexp for compliance with this specification and reports any | |||
problems. Checking implementations give their users confidence that | problems. Checking implementations give their users confidence that | |||
they didn't accidentally insert non-interoperable syntax, so checking | they didn't accidentally insert syntax that is not interoperable, so | |||
is RECOMMENDED. Exceptions to this rule may be made for low-effort | checking is RECOMMENDED. Exceptions to this rule may be made for | |||
implementations that map I-Regexp to another regexp library by simple | low-effort implementations that map I-Regexp to another regexp | |||
steps such as performing the mapping operations discussed in | library by simple steps such as performing the mapping operations | |||
Section 5; here, the effort needed to do full checking may dwarf the | discussed in Section 5. Here, the effort needed to do full checking | |||
rest of the implementation effort. Implementations SHOULD document | might dwarf the rest of the implementation effort. Implementations | |||
whether they are checking or not. | SHOULD document whether or not they are checking. | |||
Specifications that employ I-Regexp may want to define in which cases | Specifications that employ I-Regexp may want to define in which cases | |||
their implementations can work with a non-checking I-Regexp | their implementations can work with a non-checking I-Regexp | |||
implementation and when full checking is needed, possibly in the | implementation and when full checking is needed, possibly in the | |||
process of defining their own implementation classes. | process of defining their own implementation classes. | |||
4. I-Regexp Semantics | 4. I-Regexp Semantics | |||
This syntax is a subset of that of [XSD-2]. Implementations which | This syntax is a subset of that of [XSD-2]. Implementations that | |||
interpret I-Regexps MUST yield Boolean results as specified in | interpret I-Regexps MUST yield Boolean results as specified in | |||
[XSD-2]. (See also Section 5.2.) | [XSD-2]. (See also Section 5.2.) | |||
5. Mapping I-Regexp to Regexp Dialects | 5. Mapping I-Regexp to Regexp Dialects | |||
The material in this section is non-normative, provided as guidance | The material in this section is not normative; it is provided as | |||
to developers who want to use I-Regexps in the context of other | guidance to developers who want to use I-Regexps in the context of | |||
regular expression dialects. | other regular expression dialects. | |||
5.1. Multi-Character Escapes | 5.1. Multi-Character Escapes | |||
Common multi-character escapes (MCEs), and character classes built | I-Regexp does not support common multi-character escapes (MCEs) and | |||
around them, which are not supported in I-Regexp, can usually be | character classes built around them. These can usually be replaced | |||
replaced as shown for example in Table 1. | as shown by the examples in Table 1. | |||
+===========+==============+ | +============+===============+ | |||
| MCE/class | Replace with | | | MCE/class: | Replace with: | | |||
+===========+==============+ | +============+===============+ | |||
| \S | [^ \t\n\r] | | | \S | [^ \t\n\r] | | |||
+-----------+--------------+ | +------------+---------------+ | |||
| [\S ] | [^\t\n\r] | | | [\S ] | [^\t\n\r] | | |||
+-----------+--------------+ | +------------+---------------+ | |||
| \d | [0-9] | | | \d | [0-9] | | |||
+-----------+--------------+ | +------------+---------------+ | |||
Table 1: Example | Table 1: Example | |||
substitutes for multi- | Substitutes for Multi- | |||
character escapes | Character Escapes | |||
Note that the semantics of \d in XSD regular expressions is that of | Note that the semantics of \d in XSD regular expressions is that of | |||
\p{Nd}; however, this would include all Unicode characters that are | \p{Nd}; however, this would include all Unicode characters that are | |||
digits in various writing systems, which is almost certainly not what | digits in various writing systems, which is almost certainly not what | |||
is required in IETF publications. | is required in IETF publications. | |||
The construct \p{IsBasicLatin} is essentially a reference to legacy | The construct \p{IsBasicLatin} is essentially a reference to legacy | |||
ASCII, it can be replaced by the character class [\u0000-\u007f]. | ASCII; it can be replaced by the character class [\u0000-\u007f]. | |||
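The difference between [0-9] and a Unicode-wide digit class can be illustrated with a minimal sketch. It uses Python's re module as a stand-in, which is an assumption: Python is not an XSD engine, but for str patterns its \d matches any character in Unicode category Nd, which resembles the \p{Nd} semantics described above. The helper names are hypothetical.

```python
import re
import unicodedata

# ARABIC-INDIC DIGIT THREE: a decimal digit (category Nd) outside ASCII.
ARABIC_INDIC_THREE = "\u0663"


def is_ascii_digit(ch: str) -> bool:
    """True if ch matches the Table 1 replacement class [0-9]."""
    return re.fullmatch(r"[0-9]", ch) is not None


def is_unicode_digit(ch: str) -> bool:
    """True if ch matches Python's \\d, which covers category Nd for str."""
    return re.fullmatch(r"\d", ch) is not None


if __name__ == "__main__":
    print(unicodedata.category(ARABIC_INDIC_THREE))
    print(is_unicode_digit(ARABIC_INDIC_THREE))
    print(is_ascii_digit(ARABIC_INDIC_THREE))
```

A specification that only intends ASCII digits would accept the non-ASCII digit under the Unicode-wide class, while [0-9] rejects it.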
5.2. XSD Regexps | 5.2. XSD Regexps | |||
Any I-Regexp also is an XSD Regexp [XSD-2], so the mapping is an | Any I-Regexp is also an XSD regexp [XSD-2], so the mapping is an | |||
identity function. | identity function. | |||
Note that a few errata for [XSD-2] have been fixed in [XSD11-2], | Note that a few errata for [XSD-2] have been fixed in [XSD-1.1-2]; | |||
which is therefore also included as a normative reference. XSD 1.1 | therefore, it is also included in the Normative References | |||
is less widely implemented than XSD 1.0, and implementations of XSD | (Section 9.1). XSD 1.1 is less widely implemented than XSD 1.0, and | |||
1.0 are likely to include these bugfixes, so for the intents and | implementations of XSD 1.0 are likely to include these bugfixes; for | |||
purposes of this specification an implementation of XSD 1.0 regexps | the intents and purposes of this specification, an implementation of | |||
is equivalent to an implementation of XSD 1.1 regexps. | XSD 1.0 regexps is equivalent to an implementation of XSD 1.1 | |||
regexps. | ||||
5.3. ECMAScript Regexps | 5.3. ECMAScript Regexps | |||
Perform the following steps on an I-Regexp to obtain an ECMAScript | Perform the following steps on an I-Regexp to obtain an ECMAScript | |||
regexp [ECMA-262]: | regexp [ECMA-262]: | |||
* For any unescaped dots (.) outside character classes (first | * For any unescaped dots (.) outside character classes (first | |||
alternative of charClass production): replace dot by [^\n\r]. | alternative of charClass production), replace the dot with | |||
[^\n\r]. | ||||
* Envelope the result in ^(?: and )$. | * Envelope the result in ^(?: and )$. | |||
The ECMAScript regexp is to be interpreted as a Unicode pattern ("u" | The ECMAScript regexp is to be interpreted as a Unicode pattern ("u" | |||
flag; see Section 21.2.2 "Pattern Semantics" of [ECMA-262]). | flag; see Section 21.2.2 "Pattern Semantics" of [ECMA-262]). | |||
Note that where a regexp literal is required, the actual regexp needs | Note that where a regexp literal is required, the actual regexp needs | |||
to be enclosed in /. | to be enclosed in /. | |||
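The two steps above can be sketched as a small string transformation. This is an illustration only, written in Python and assuming the input is already a valid I-Regexp (it does no checking); the function name iregexp_to_ecmascript is hypothetical. It relies on the grammar in Figure 1, under which unescaped "[" and "]" cannot occur inside a character class expression.

```python
# Replacement for an unescaped dot outside character classes.
DOT_REPLACEMENT = r"[^\n\r]"


def iregexp_to_ecmascript(iregexp: str) -> str:
    """Return an ECMAScript pattern source for a (presumed valid) I-Regexp."""
    out = []
    in_class = False
    i = 0
    while i < len(iregexp):
        ch = iregexp[i]
        # Copy escapes such as \. or \[ unchanged, together with the
        # character they escape.
        if ch == "\\" and i + 1 < len(iregexp):
            out.append(iregexp[i:i + 2])
            i += 2
            continue
        if ch == "[":
            in_class = True
        elif ch == "]":
            in_class = False
        if ch == "." and not in_class:
            out.append(DOT_REPLACEMENT)
        else:
            out.append(ch)
        i += 1
    # Envelope the result so the whole input has to match.
    return "^(?:" + "".join(out) + ")$"


if __name__ == "__main__":
    print(iregexp_to_ecmascript(r"a.b"))
    print(iregexp_to_ecmascript(r"[.a]\.."))
```

The returned string is the pattern source; on the ECMAScript side it would be compiled with the "u" flag, for example through the RegExp constructor.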
5.4. PCRE, RE2, Ruby Regexps | 5.4. PCRE, RE2, and Ruby Regexps | |||
Perform the same steps as in Section 5.3 to obtain a valid regexp in | To obtain a valid regexp in Perl Compatible Regular Expressions | |||
PCRE [PCRE2], the Go programming language [RE2], and the Ruby | (PCRE) [PCRE2], the Go programming language's RE2 regexp library | |||
programming language, except that the last step is: | [RE2], and the Ruby programming language, perform the same steps as | |||
in Section 5.3, except that the last step is: | ||||
* Enclose the regexp in \A(?: and )\z. | * Enclose the regexp in \A(?: and )\z. | |||
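The text does not state why absolute anchors are used here. One practical difference between the two anchor styles can be shown with a minimal sketch, using Python's re module as a stand-in. This is an assumption on two counts: Python is not one of the dialects named above, and it spells the end-of-string anchor \Z where those dialects use \z. The helper names are hypothetical.

```python
import re


def matches_with_dollar(body: str, text: str) -> bool:
    """Match using the ^(?: ... )$ envelope."""
    return re.search("^(?:" + body + ")$", text) is not None


def matches_with_absolute(body: str, text: str) -> bool:
    """Match using absolute anchors (Python spelling: \\A and \\Z)."""
    return re.search(r"\A(?:" + body + r")\Z", text) is not None


if __name__ == "__main__":
    # In Python, "$" also matches just before a trailing newline,
    # so the first call accepts input that the second one rejects.
    print(matches_with_dollar("abc", "abc\n"))
    print(matches_with_absolute("abc", "abc\n"))
```

Engines in which "$" behaves this way would accept strings that the I-Regexp itself does not match, which the absolute anchors avoid.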
6. Motivation and Background | 6. Motivation and Background | |||
While regular expressions originally were intended to describe a | While regular expressions originally were intended to describe a | |||
formal language to support a Boolean matching function, they have | formal language to support a Boolean matching function, they have | |||
been enhanced with parsing functions that support the extraction and | been enhanced with parsing functions that support the extraction and | |||
replacement of arbitrary portions of the matched text. With this | replacement of arbitrary portions of the matched text. With this | |||
accretion of features, parsing regexp libraries have become more | accretion of features, parsing-regexp libraries have become more | |||
susceptible to bugs and surprising performance degradations which can | susceptible to bugs and surprising performance degradations that can | |||
be exploited in Denial of Service attacks by an attacker who controls | be exploited in denial-of-service attacks by an attacker who controls | |||
the regexp submitted for processing. I-Regexp is designed to offer | the regexp submitted for processing. I-Regexp is designed to offer | |||
interoperability, and to be less vulnerable to such attacks, with the | interoperability and to be less vulnerable to such attacks, with the | |||
trade-off that its only function is to offer a boolean response as to | trade-off that its only function is to offer a Boolean response as to | |||
whether a character sequence is matched by a regexp. | whether a character sequence is matched by a regexp. | |||
6.1. Implementing I-Regexp | 6.1. Implementing I-Regexp | |||
XSD regexps are relatively easy to implement or map to widely | XSD regexps are relatively easy to implement or map to widely | |||
implemented parsing regexp dialects, with these notable exceptions: | implemented parsing-regexp dialects, with these notable exceptions: | |||
* Character class subtraction. This is a very useful feature in | * Character class subtraction. This is a very useful feature in | |||
many specifications, but it is unfortunately mostly absent from | many specifications, but it is unfortunately mostly absent from | |||
parsing regexp dialects. Thus, it is omitted from I-Regexp. | parsing-regexp dialects. Thus, it is omitted from I-Regexp. | |||
* Multi-character escapes. \d, \w, \s and their uppercase | * Multi-character escapes. \d, \w, \s and their uppercase | |||
complement classes exhibit a large amount of variation between | complement classes exhibit a large amount of variation between | |||
regexp flavors. Thus, they are omitted from I-Regexp. | regexp flavors. Thus, they are omitted from I-Regexp. | |||
* Not all regexp implementations support accesses to Unicode tables | * Not all regexp implementations support access to Unicode tables | |||
that enable executing constructs such as \p{Nd}, although the | that enable executing constructs such as \p{Nd}, although the | |||
\p/\P feature in general is now quite widely available. While in | \p/\P feature in general is now quite widely available. While, in | |||
principle it is possible to translate these into character-class | principle, it is possible to translate these into character-class | |||
matches, this also requires access to those tables. Thus, regexp | matches, this also requires access to those tables. Thus, regexp | |||
libraries in severely constrained environments may not be able to | libraries in severely constrained environments may not be able to | |||
support I-Regexp conformance. | support I-Regexp conformance. | |||
7. IANA Considerations | 7. IANA Considerations | |||
This document makes no requests of IANA. | This document has no IANA actions. | |||
8. Security considerations | 8. Security Considerations | |||
While technically out of scope of this specification, Section 10 | While technically out of the scope of this specification, Section 10 | |||
(Security Considerations) of [STD63] applies to implementations. | ("Security Considerations") of [RFC3629] applies to implementations. | |||
Particular note needs to be taken of the last paragraph of Section 3 | Particular note needs to be taken of the last paragraph of Section 3 | |||
(UTF-8 definition) of [STD63]; an I-Regexp implementation may need to | ("UTF-8 definition") of [RFC3629]; an I-Regexp implementation may | |||
mitigate limitations of the platform implementation in this regard. | need to mitigate limitations of the platform implementation in this | |||
regard. | ||||
As discussed in Section 6, more complex regexp libraries may contain | As discussed in Section 6, more complex regexp libraries may contain | |||
exploitable bugs leading to crashes and remote code execution. There | exploitable bugs, which can lead to crashes and remote code | |||
is also the problem that such libraries often have hard-to-predict | execution. There is also the problem that such libraries often have | |||
performance characteristics, leading to attacks that overload an | performance characteristics that are hard to predict, leading to | |||
implementation by matching against an expensive attacker-controlled | attacks that overload an implementation by matching against an | |||
regexp. | expensive attacker-controlled regexp. | |||
I-Regexps have been designed to allow implementation in a way that is | I-Regexps have been designed to allow implementation in a way that is | |||
resilient to both threats; this objective needs to be addressed | resilient to both threats; this objective needs to be addressed | |||
throughout the implementation effort. Non-checking implementations | throughout the implementation effort. Non-checking implementations | |||
(see Section 3.1) are likely to expose security limitations of any | (see Section 3.1) are likely to expose security limitations of any | |||
regexp engine they use, which may be less problematic if that engine | regexp engine they use, which may be less problematic if that engine | |||
has been built with security considerations in mind (e.g., [RE2]); a | has been built with security considerations in mind (e.g., [RE2]). | |||
checking implementation is still RECOMMENDED. | In any case, a checking implementation is still RECOMMENDED. | |||
Implementations that specifically implement the I-Regexp subset can, | Implementations that specifically implement the I-Regexp subset can, | |||
with care, be designed to generally run in linear time and space in | with care, be designed to generally run in linear time and space in | |||
the input, and to detect when that would not be the case (see below). | the input and to detect when that would not be the case (see below). | |||
Existing regexp engines should be able to easily handle most | Existing regexp engines should be able to easily handle most | |||
I-Regexps (after the adjustments discussed in Section 5), but may | I-Regexps (after the adjustments discussed in Section 5), but may | |||
consume excessive resources for some types of I-Regexps or outright | consume excessive resources for some types of I-Regexps or outright | |||
reject them because they cannot guarantee efficient execution. (Note | reject them because they cannot guarantee efficient execution. (Note | |||
that different versions of the same regexp library may be more or | that different versions of the same regexp library may be more or | |||
less vulnerable to excessive resource consumption for these cases.) | less vulnerable to excessive resource consumption for these cases.) | |||
Specifically, range quantifiers (as in a{2,4}) provide particular | Specifically, range quantifiers (as in a{2,4}) provide particular | |||
challenges for both existing and I-Regexp focused implementations. | challenges for both existing and I-Regexp focused implementations. | |||
These may therefore limit range quantifiers in composability | Implementations may therefore limit range quantifiers in | |||
(disallowing nested range quantifiers such as (a{2,4}){2,4}) or range | composability (disallowing nested range quantifiers such as | |||
(disallowing very large ranges such as a{20,200000}), or detect and | (a{2,4}){2,4}) or range (disallowing very large ranges such as | |||
reject any excessive resource consumption caused by them. | a{20,200000}), or detect and reject any excessive resource | |||
consumption caused by range quantifiers. | ||||
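
The two limits named above (composability and range) can be checked
before a regexp is handed to an engine. The following Python is a
minimal sketch only, not part of either document version. The function
name range_quantifier_problems and the max_bound default of 1000 are
assumptions chosen for illustration, and the input is assumed to have
already passed a checking implementation for the I-Regexp ABNF.

```python
import re

# Matches a range quantifier such as {2}, {2,} or {2,4}.
_RANGE = re.compile(r"\{([0-9]+)(?:,([0-9]*))?\}")


def range_quantifier_problems(regexp, max_bound=1000):
    """Return (offset, kind) pairs; kind is "large" or "nested".

    Hypothetical helper: max_bound is an arbitrary illustrative limit.
    Assumes regexp is a syntactically valid I-Regexp; some malformed
    input raises ValueError, but this is not a full syntax check.
    """
    problems = []
    frames = [False]           # per open group: range quantifier seen inside?
    prev_group_ranged = False  # previous atom was a group holding a range?
    i = 0
    while i < len(regexp):
        ch = regexp[i]
        after_ranged_group, prev_group_ranged = prev_group_ranged, False
        if ch == "\\":
            # Skip \p{..} / \P{..} as a unit, other escapes as two chars.
            if regexp[i + 1:i + 3] in ("p{", "P{"):
                i = regexp.index("}", i) + 1
            else:
                i += 2
        elif ch == "[":
            # Skip a character class expression; braces inside are literal.
            i += 1
            while i < len(regexp) and regexp[i] != "]":
                i += 2 if regexp[i] == "\\" else 1
            i += 1
        elif ch == "(":
            frames.append(False)
            i += 1
        elif ch == ")":
            if len(frames) == 1:
                raise ValueError("unbalanced ')' at offset %d" % i)
            inner = frames.pop()
            frames[-1] = frames[-1] or inner
            prev_group_ranged = inner
            i += 1
        elif ch == "{":
            m = _RANGE.match(regexp, i)
            if m is None:
                raise ValueError("bad range quantifier at offset %d" % i)
            bounds = [int(g) for g in m.groups() if g]
            if max(bounds) > max_bound:
                problems.append((i, "large"))
            if after_ranged_group:
                problems.append((i, "nested"))
            frames[-1] = True
            i = m.end()
        else:
            i += 1
    return problems
```

An implementation might reject a regexp for which this list is
non-empty, or route it to an engine with resource limits. A scan like
this only covers the two syntactic limits. Detecting excessive resource
consumption at match time is a separate mechanism.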
I-Regexp implementations that are used to evaluate regexps from | I-Regexp implementations that are used to evaluate regexps from | |||
untrusted sources need to be robust to these cases. Implementers | untrusted sources need to be robust in these cases. Implementers | |||
using existing regexp libraries are encouraged to check their | using existing regexp libraries are encouraged: | |||
documentation to see if mitigations are configurable, such as limits | ||||
in resource consumption, and to document their own degree of | * to check their documentation to see if mitigations are | |||
robustness resulting from employing such mitigations. | configurable, such as limits in resource consumption, and | |||
* to document their own degree of robustness resulting from | ||||
employing such mitigations. | ||||
9. References | 9. References | |||
9.1. Normative References | 9.1. Normative References | |||
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate | [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate | |||
Requirement Levels", BCP 14, RFC 2119, | Requirement Levels", BCP 14, RFC 2119, | |||
DOI 10.17487/RFC2119, March 1997, | DOI 10.17487/RFC2119, March 1997, | |||
<https://www.rfc-editor.org/rfc/rfc2119>. | <https://www.rfc-editor.org/info/rfc2119>. | |||
[RFC5234] Crocker, D., Ed. and P. Overell, "Augmented BNF for Syntax | [RFC5234] Crocker, D., Ed. and P. Overell, "Augmented BNF for Syntax | |||
Specifications: ABNF", STD 68, RFC 5234, | Specifications: ABNF", STD 68, RFC 5234, | |||
DOI 10.17487/RFC5234, January 2008, | DOI 10.17487/RFC5234, January 2008, | |||
<https://www.rfc-editor.org/rfc/rfc5234>. | <https://www.rfc-editor.org/info/rfc5234>. | |||
[RFC7405] Kyzivat, P., "Case-Sensitive String Support in ABNF", | [RFC7405] Kyzivat, P., "Case-Sensitive String Support in ABNF", | |||
RFC 7405, DOI 10.17487/RFC7405, December 2014, | RFC 7405, DOI 10.17487/RFC7405, December 2014, | |||
<https://www.rfc-editor.org/rfc/rfc7405>. | <https://www.rfc-editor.org/info/rfc7405>. | |||
[RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC | [RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC | |||
2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, | 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, | |||
May 2017, <https://www.rfc-editor.org/rfc/rfc8174>. | May 2017, <https://www.rfc-editor.org/info/rfc8174>. | |||
[XSD-2] Malhotra, A., Ed. and P. V. Biron, Ed., "XML Schema Part | [XSD-1.1-2] | |||
2: Datatypes Second Edition", W3C REC REC-xmlschema- | Peterson, D., Ed., Gao, S., Ed., Malhotra, A., Ed., | |||
Sperberg-McQueen, C. M., Ed., Thompson, H., Ed., and P. | ||||
Biron, Ed., "W3C XML Schema Definition Language (XSD) 1.1 | ||||
Part 2: Datatypes", W3C REC REC-xmlschema11-2-20120405, | ||||
W3C REC-xmlschema11-2-20120405, 5 April 2012, | ||||
<https://www.w3.org/TR/2012/REC-xmlschema11-2-20120405/>. | ||||
[XSD-2] Biron, P., Ed. and A. Malhotra, Ed., "XML Schema Part 2: | ||||
Datatypes Second Edition", W3C REC REC-xmlschema- | ||||
2-20041028, W3C REC-xmlschema-2-20041028, 28 October 2004, | 2-20041028, W3C REC-xmlschema-2-20041028, 28 October 2004, | |||
<https://www.w3.org/TR/2004/REC-xmlschema-2-20041028/>. | <https://www.w3.org/TR/2004/REC-xmlschema-2-20041028/>. | |||
[XSD11-2] Malhotra, A., Ed., Peterson, D., Ed., Thompson, H., Ed., | ||||
Sperberg-McQueen, M., Ed., Biron, P. V., Ed., and S. Gao, | ||||
Ed., "W3C XML Schema Definition Language (XSD) 1.1 Part 2: | ||||
Datatypes", W3C REC REC-xmlschema11-2-20120405, W3C REC- | ||||
xmlschema11-2-20120405, 5 April 2012, | ||||
<https://www.w3.org/TR/2012/REC-xmlschema11-2-20120405/>. | ||||
9.2. Informative References | 9.2. Informative References | |||
[ECMA-262] Ecma International, "ECMAScript 2020 Language | [ECMA-262] Ecma International, "ECMAScript 2020 Language | |||
Specification", ECMA Standard ECMA-262, 11th Edition, June | Specification", Standard ECMA-262, 11th Edition, June | |||
2020, <https://www.ecma-international.org/wp- | 2020, <https://www.ecma-international.org/wp- | |||
content/uploads/ECMA-262.pdf>. | content/uploads/ECMA-262.pdf>. | |||
[I-D.ietf-jsonpath-base] | [JSONPATH-BASE] | |||
Gössner, S., Normington, G., and C. Bormann, "JSONPath: | Gössner, S., Ed., Normington, G., Ed., and C. Bormann, | |||
Query expressions for JSON", Work in Progress, Internet- | Ed., "JSONPath: Query expressions for JSON", Work in | |||
Draft, draft-ietf-jsonpath-base-14, 10 June 2023, | Progress, Internet-Draft, draft-ietf-jsonpath-base-20, 25 | |||
<https://datatracker.ietf.org/doc/html/draft-ietf- | August 2023, <https://datatracker.ietf.org/doc/html/draft- | |||
jsonpath-base-14>. | ietf-jsonpath-base-20>. | |||
[PCRE2] "Perl-compatible Regular Expressions (revised API: | [PCRE2] "Perl-compatible Regular Expressions (revised API: | |||
PCRE2)", n.d., <http://pcre.org/current/doc/html/>. | PCRE2)", <http://pcre.org/current/doc/html/>. | |||
[RE2] "RE2 is a fast, safe, thread-friendly alternative to | [RE2] "RE2 is a fast, safe, thread-friendly alternative to | |||
backtracking regular expression engines like those used in | backtracking regular expression engines like those used in | |||
PCRE, Perl, and Python. It is a C++ library.", n.d., | PCRE, Perl, and Python. It is a C++ library.", commit | |||
<https://github.com/google/re2>. | 73031bb, <https://github.com/google/re2>. | |||
[RFC3629] Yergeau, F., "UTF-8, a transformation format of ISO | ||||
10646", STD 63, RFC 3629, DOI 10.17487/RFC3629, November | ||||
2003, <https://www.rfc-editor.org/info/rfc3629>. | ||||
[RFC7493] Bray, T., Ed., "The I-JSON Message Format", RFC 7493, | [RFC7493] Bray, T., Ed., "The I-JSON Message Format", RFC 7493, | |||
DOI 10.17487/RFC7493, March 2015, | DOI 10.17487/RFC7493, March 2015, | |||
<https://www.rfc-editor.org/rfc/rfc7493>. | <https://www.rfc-editor.org/info/rfc7493>. | |||
[STD63] Yergeau, F., "UTF-8, a transformation format of ISO | ||||
10646", STD 63, RFC 3629, DOI 10.17487/RFC3629, November | ||||
2003, <https://www.rfc-editor.org/rfc/rfc3629>. | ||||
[UNICODE-GLOSSARY] | [UNICODE-GLOSSARY] | |||
Unicode, Inc., "Glossary of Unicode Terms", | Unicode, Inc., "Glossary of Unicode Terms", | |||
<https://unicode.org/glossary/>. | <https://unicode.org/glossary/>. | |||
Appendix A. Regexps and Similar Constructs in Recent Published RFCs | ||||
This section is to be removed before publishing as an RFC. | ||||
This appendix contains a number of regular expressions that have been | ||||
extracted from some recently published RFCs based on some ad-hoc | ||||
matching. Multi-line constructions were not included. With the | ||||
exception of some (often surprisingly dubious) usage of multi- | ||||
character escapes and a reference to the IsBasicLatin Unicode block, | ||||
all regular expressions validate against the ABNF in Figure 1. | ||||
rfc6021.txt 459 (([0-1](\.[1-3]?[0-9]))|(2\.(0|([1-9]\d*)))) | ||||
rfc6021.txt 513 \d*(\.\d*){1,127} | ||||
rfc6021.txt 529 \d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)? | ||||
rfc6021.txt 631 ([0-9a-fA-F]{2}(:[0-9a-fA-F]{2})*)? | ||||
rfc6021.txt 647 [0-9a-fA-F]{2}(:[0-9a-fA-F]{2}){5} | ||||
rfc6021.txt 933 ((:|[0-9a-fA-F]{0,4}):)([0-9a-fA-F]{0,4}:){0,5} | ||||
rfc6021.txt 938 (([^:]+:){6}(([^:]+:[^:]+)|(.*\..*)))| | ||||
rfc6021.txt 1026 ((:|[0-9a-fA-F]{0,4}):)([0-9a-fA-F]{0,4}:){0,5} | ||||
rfc6021.txt 1031 (([^:]+:){6}(([^:]+:[^:]+)|(.*\..*)))| | ||||
rfc6020.txt 6647 [0-9a-fA-F]* | ||||
rfc6095.txt 2544 \S(.*\S)? | ||||
rfc6110.txt 1583 [aeiouy]* | ||||
rfc6110.txt 3222 [A-Z][a-z]* | ||||
rfc6536.txt 1583 \* | ||||
rfc6536.txt 1632 [^\*].* | ||||
rfc6643.txt 524 \p{IsBasicLatin}{0,255} | ||||
rfc6728.txt 3480 \S+ | ||||
rfc6728.txt 3500 \S(.*\S)? | ||||
rfc6991.txt 477 (([0-1](\.[1-3]?[0-9]))|(2\.(0|([1-9]\d*)))) | ||||
rfc6991.txt 525 \d*(\.\d*){1,127} | ||||
rfc6991.txt 541 [a-zA-Z_][a-zA-Z0-9\-_.]* | ||||
rfc6991.txt 542 .|..|[^xX].*|.[^mM].*|..[^lL].* | ||||
rfc6991.txt 571 \d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)? | ||||
rfc6991.txt 665 ([0-9a-fA-F]{2}(:[0-9a-fA-F]{2})*)? | ||||
rfc6991.txt 693 [0-9a-fA-F]{2}(:[0-9a-fA-F]{2}){5} | ||||
rfc6991.txt 725 ([0-9a-fA-F]{2}(:[0-9a-fA-F]{2})*)? | ||||
rfc6991.txt 743 [0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}- | ||||
rfc6991.txt 1041 ((:|[0-9a-fA-F]{0,4}):)([0-9a-fA-F]{0,4}:){0,5} | ||||
rfc6991.txt 1046 (([^:]+:){6}(([^:]+:[^:]+)|(.*\..*)))| | ||||
rfc6991.txt 1099 [0-9\.]* | ||||
rfc6991.txt 1109 [0-9a-fA-F:\.]* | ||||
rfc6991.txt 1164 ((:|[0-9a-fA-F]{0,4}):)([0-9a-fA-F]{0,4}:){0,5} | ||||
rfc6991.txt 1169 (([^:]+:){6}(([^:]+:[^:]+)|(.*\..*)))| | ||||
rfc7407.txt 933 ([0-9a-fA-F]){2}(:([0-9a-fA-F]){2}){0,254} | ||||
rfc7407.txt 1494 ([0-9a-fA-F]){2}(:([0-9a-fA-F]){2}){4,31} | ||||
rfc7758.txt 703 \d{2}:\d{2}:\d{2}(\.\d+)? | ||||
rfc7758.txt 1358 \d{2}:\d{2}:\d{2}(\.\d+)? | ||||
rfc7895.txt 349 \d{4}-\d{2}-\d{2} | ||||
rfc7950.txt 8323 [0-9a-fA-F]* | ||||
rfc7950.txt 8355 [a-zA-Z_][a-zA-Z0-9\-_.]* | ||||
rfc7950.txt 8356 [xX][mM][lL].* | ||||
rfc8040.txt 4713 \d{4}-\d{2}-\d{2} | ||||
rfc8049.txt 6704 [A-Z]{2} | ||||
rfc8194.txt 629 \* | ||||
rfc8194.txt 637 [0-9]{8}\.[0-9]{6} | ||||
rfc8194.txt 905 Z|[\+\-]\d{2}:\d{2} | ||||
rfc8194.txt 963 (2((2[4-9])|(3[0-9]))\.).* | ||||
rfc8194.txt 974 (([fF]{2}[0-9a-fA-F]{2}):).* | ||||
rfc8299.txt 7986 [A-Z]{2} | ||||
rfc8341.txt 1878 \* | ||||
rfc8341.txt 1927 [^\*].* | ||||
rfc8407.txt 1723 [0-9\.]* | ||||
rfc8407.txt 1749 [a-zA-Z_][a-zA-Z0-9\-_.]* | ||||
rfc8407.txt 1750 .|..|[^xX].*|.[^mM].*|..[^lL].* | ||||
rfc8525.txt 550 \d{4}-\d{2}-\d{2} | ||||
rfc8776.txt 838 /?([a-zA-Z0-9\-_.]+)(/[a-zA-Z0-9\-_.]+)* | ||||
rfc8776.txt 874 ([a-zA-Z0-9\-_.]+:)* | ||||
rfc8819.txt 311 [\S ]+ | ||||
rfc8944.txt 596 [0-9a-fA-F]{2}(:[0-9a-fA-F]{2}){7} | ||||
Figure 2: Example regular expressions extracted from RFCs | ||||
Acknowledgements | Acknowledgements | |||
This specification has been motivated by the discussion in the IETF | Discussion in the IETF JSONPATH WG about whether to include a regexp | |||
JSONPATH WG about whether to include a regexp mechanism into the | mechanism into the JSONPath query expression specification and | |||
JSONPath query expression specification, as well as by previous | previous discussions about the YANG pattern and Concise Data | |||
discussions about the YANG pattern and CDDL .regexp features. | Definition Language (CDDL) .regexp features motivated this | |||
specification. | ||||
The basic approach for this specification was inspired by The I-JSON | The basic approach for this specification was inspired by "The I-JSON | |||
Message Format [RFC7493]. | Message Format" [RFC7493]. | |||
Authors' Addresses | Authors' Addresses | |||
Carsten Bormann | Carsten Bormann | |||
Universität Bremen TZI | Universität Bremen TZI | |||
Postfach 330440 | Postfach 330440 | |||
D-28359 Bremen | D-28359 Bremen | |||
Germany | Germany | |||
Phone: +49-421-218-63921 | Phone: +49-421-218-63921 | |||
Email: cabo@tzi.org | Email: cabo@tzi.org | |||
End of changes. 63 change blocks. | ||||
275 lines changed or deleted | 205 lines changed or added | |||