Internet Engineering Task Force (IETF)                        C. Bormann
Request for Comments: 9485                        Universität Bremen TZI
Category: Standards Track                                        T. Bray
ISSN: 2070-1721                                               Textuality
                                                            October 2023

          I-Regexp: An Interoperable Regular Expression Format

Abstract

   This document specifies I-Regexp, a flavor of regular expression that
   is limited in scope with the goal of interoperation across many
   different regular expression libraries.

Status of This Memo

   This is an Internet Standards Track document.

   This document is a product of the Internet Engineering Task Force
   (IETF).  It represents the consensus of the IETF community.  It has
   received public review and has been approved for publication by the
   Internet Engineering Steering Group (IESG).  Further information on
   Internet Standards is available in Section 2 of RFC 7841.

   Information about the current status of this document, any errata,
   and how to provide feedback on it may be obtained at
   https://www.rfc-editor.org/info/rfc9485.

Copyright Notice

   Copyright (c) 2023 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Revised BSD License text as described in Section 4.e of the
   Trust Legal Provisions and are provided without warranty as described
   in the Revised BSD License.

Table of Contents

   1.  Introduction
     1.1.  Terminology
   2.  Objectives
   3.  I-Regexp Syntax
     3.1.  Checking Implementations
   4.  I-Regexp Semantics
   5.  Mapping I-Regexp to Regexp Dialects
     5.1.  Multi-Character Escapes
     5.2.  XSD Regexps
     5.3.  ECMAScript Regexps
     5.4.  PCRE, RE2, and Ruby Regexps
   6.  Motivation and Background
     6.1.  Implementing I-Regexp
   7.  IANA Considerations
   8.  Security Considerations
   9.  References
     9.1.  Normative References
     9.2.  Informative References
   Appendix A.  Regexps and Similar Constructs in Recent Published RFCs
   Acknowledgements
   Authors' Addresses

1.  Introduction

   This specification describes an interoperable regular expression
   (abbreviated as "regexp") flavor, I-Regexp.

   I-Regexp does not provide advanced regular expression features such
   as capture groups, lookahead, or backreferences.  It supports only a
   Boolean matching capability, i.e., testing whether a given regular
   expression matches a given piece of text.

   I-Regexp supports the entire repertoire of Unicode characters
   (Unicode scalar values); both the I-Regexp strings themselves and the
   strings they are matched against are sequences of Unicode scalar
   values (often represented in UTF-8 encoding form [RFC3629] for
   interchange).

   I-Regexp is a subset of XML Schema Definition (XSD) regular
   expressions [XSD-2].

   This document includes guidance for converting I-Regexps for use with
   several well-known regular expression idioms.

   The development of I-Regexp was motivated by the work of the JSONPath
   Working Group (WG).  The WG wanted to include support for the use of
   regular expressions in JSONPath filters in its specification
   [JSONPATH-BASE], but was unable to find a useful specification for
   regular expressions that would be interoperable across the popular
   libraries.

1.1.  Terminology

   This document uses the abbreviation "regexp" for what is usually
   called a "regular expression" in programming.  The term "I-Regexp" is
   used as a noun meaning a character string (sequence of Unicode scalar
   values) that conforms to the requirements in this specification; the
   plural is "I-Regexps".

   This specification uses Unicode terminology; a good entry point is
   provided by [UNICODE-GLOSSARY].

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

   The grammatical rules in this document are to be interpreted as ABNF,
   as described in [RFC5234] and [RFC7405], where the "characters" of
   Section 2.3 of [RFC5234] are Unicode scalar values.
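
   As a non-normative illustration of what "Unicode scalar values"
   means for inputs, the following minimal Python sketch tests that a
   string contains no surrogate code points.  The function name
   is_scalar_value_string is hypothetical and not part of this
   specification.

```python
def is_scalar_value_string(text: str) -> bool:
    """Return True if every code point in text is a Unicode scalar value.

    Scalar values are all code points except the surrogate range
    U+D800 to U+DFFF, which Python strings are able to carry.
    """
    return not any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)
```

   Both an I-Regexp and the text it is matched against would be
   expected to pass such a test before further processing.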

2.  Objectives

   I-Regexps should handle the vast majority of practical cases where a
   matching regexp is needed in a data-model specification or a query-
   language expression.

   At the time of writing, an editor of this document conducted a survey
   of the regexp syntax used in recently published RFCs.  All examples
   found there should be covered by I-Regexps, both syntactically and
   with their intended semantics.  The exception is the use of multi-
   character escapes, for which workaround guidance is provided in
   Section 5.

3.  I-Regexp Syntax

   An I-Regexp MUST conform to the ABNF specification in Figure 1.

   i-regexp = branch *( "|" branch )
   branch = *piece
   piece = atom [ quantifier ]
   quantifier = ( "*" / "+" / "?" ) / range-quantifier
   range-quantifier = "{" QuantExact [ "," [ QuantExact ] ] "}"
   QuantExact = 1*%x30-39 ; '0'-'9'

   atom = NormalChar / charClass / ( "(" i-regexp ")" )
   NormalChar = ( %x00-27 / "," / "-" / %x2F-3E ; '/'-'>'
    / %x40-5A ; '@'-'Z'
    / %x5E-7A ; '^'-'z'
    / %x7E-D7FF ; skip surrogate code points
    / %xE000-10FFFF )
   charClass = "." / SingleCharEsc / charClassEsc / charClassExpr
   SingleCharEsc = "\" ( %x28-2B ; '('-'+'
    / "-" / "." / "?" / %x5B-5E ; '['-'^'
    / %s"n" / %s"r" / %s"t" / %x7B-7D ; '{'-'}'
    )
   charClassEsc = catEsc / complEsc
   charClassExpr = "[" [ "^" ] ( "-" / CCE1 ) *CCE1 [ "-" ] "]"
   CCE1 = ( CCchar [ "-" CCchar ] ) / charClassEsc
   CCchar = ( %x00-2C / %x2E-5A ; '.'-'Z'
    / %x5E-D7FF ; skip surrogate code points
    / %xE000-10FFFF ) / SingleCharEsc
   catEsc = %s"\p{" charProp "}"
   complEsc = %s"\P{" charProp "}"
   charProp = IsCategory
   IsCategory = Letters / Marks / Numbers / Punctuation / Separators /
       Symbols / Others
   Letters = %s"L" [ ( %s"l" / %s"m" / %s"o" / %s"t" / %s"u" ) ]
   Marks = %s"M" [ ( %s"c" / %s"e" / %s"n" ) ]
   Numbers = %s"N" [ ( %s"d" / %s"l" / %s"o" ) ]
   Punctuation = %s"P" [ ( %x63-66 ; 'c'-'f'
    / %s"i" / %s"o" / %s"s" ) ]
   Separators = %s"Z" [ ( %s"l" / %s"p" / %s"s" ) ]
   Symbols = %s"S" [ ( %s"c" / %s"k" / %s"m" / %s"o" ) ]
   Others = %s"C" [ ( %s"c" / %s"f" / %s"n" / %s"o" ) ]

                     Figure 1: I-Regexp Syntax in ABNF

   As an additional restriction, charClassExpr is not allowed to match
   [^], which, according to this grammar, would parse as a positive
   character class containing the single character ^.

   This is essentially an XSD regexp without:

   *  character class subtraction,

   *  multi-character escapes such as \s, \S, and \w, and

   *  Unicode blocks.

   An I-Regexp implementation MUST be a complete implementation of this
   limited subset.  In particular, full support for the Unicode
   functionality defined in this specification is REQUIRED.  The
   implementation:

   *  MUST NOT limit itself to 7- or 8-bit character sets such as ASCII,
      and

   *  MUST support the Unicode character property set in character
      classes.

3.1.  Checking Implementations

   A _checking_ I-Regexp implementation is one that checks a supplied
   regexp for compliance with this specification and reports any
   problems.  Checking implementations give their users confidence that
   they didn't accidentally insert syntax that is not interoperable, so
   checking is RECOMMENDED.  Exceptions to this rule may be made for
   low-effort implementations that map I-Regexp to another regexp
   library by simple steps such as performing the mapping operations
   discussed in Section 5.  Here, the effort needed to do full checking
   might dwarf the rest of the implementation effort.  Implementations
   SHOULD document whether or not they are checking.

   Specifications that employ I-Regexp may want to define in which cases
   their implementations can work with a non-checking I-Regexp
   implementation and when full checking is needed, possibly in the
   process of defining their own implementation classes.
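
   The syntax check performed by a checking implementation can be
   sketched as follows.  This is a non-normative, minimal Python
   illustration that transcribes the ABNF of Figure 1 into patterns
   for Python's "re" module; the name is_iregexp and the helper
   patterns are hypothetical.  It tests syntax only and does not
   verify constraints that the ABNF leaves to [XSD-2], such as the
   ordering of range endpoints in a character class.

```python
import re

# charProp: general categories allowed inside \p{...} and \P{...}.
_PROP = r"(?:L[lmotu]?|M[cen]?|N[dlo]?|P[c-fios]?|Z[lps]?|S[ckmo]?|C[cfno]?)"
_CLASS_ESC = r"\\[pP]\{" + _PROP + r"\}"        # catEsc / complEsc
_SINGLE_ESC = r"\\[()*+\-.?\[\\\]^nrt{|}]"      # SingleCharEsc
_NORMAL = r"[^()*+.?\[\\\]{|}\ud800-\udfff]"    # NormalChar
_CC_CHAR = r"(?:[^\-\[\\\]\ud800-\udfff]|" + _SINGLE_ESC + ")"
_CCE1 = "(?:" + _CC_CHAR + "(?:-" + _CC_CHAR + ")?|" + _CLASS_ESC + ")"
# charClassExpr, with the additional restriction that "[^]" is rejected.
_CLASS_EXPR = r"(?!\[\^\])\[\^?(?:-|" + _CCE1 + ")" + _CCE1 + r"*-?\]"

# An atom other than a parenthesized group.
_ATOM = re.compile("|".join(
    [_NORMAL, r"\.", _SINGLE_ESC, _CLASS_ESC, _CLASS_EXPR]))
_QUANT = re.compile(r"[*+?]|\{[0-9]+(?:,[0-9]*)?\}")


def is_iregexp(text: str) -> bool:
    """Return True if text matches the i-regexp rule of Figure 1."""
    depth = 0  # number of currently open groups
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "|":            # branches may be empty
            i += 1
            continue
        if ch == "(":            # a group starts a new i-regexp
            depth += 1
            i += 1
            continue
        if ch == ")":
            if depth == 0:
                return False
            depth -= 1
            i += 1
        else:
            match = _ATOM.match(text, i)
            if match is None:
                return False
            i = match.end()
        # A quantifier may follow only a complete atom or a closed group.
        quant = _QUANT.match(text, i)
        if quant:
            i = quant.end()
    return depth == 0
```

   With this sketch, is_iregexp("[0-9a-fA-F]{2}(:[0-9a-fA-F]{2}){5}")
   returns True, while is_iregexp("\\d+") returns False because \d is
   a multi-character escape.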

4.  I-Regexp Semantics

   This syntax is a subset of that of [XSD-2].  Implementations that
   interpret I-Regexps MUST yield Boolean results as specified in
   [XSD-2].  (See also Section 5.2.)

5.  Mapping I-Regexp to Regexp Dialects

   The material in this section is not normative; it is provided as
   guidance to developers who want to use I-Regexps in the context of
   other regular expression dialects.

5.1.  Multi-Character Escapes

   I-Regexp does not support common multi-character escapes (MCEs) and
   character classes built around them.  These can usually be replaced
   as shown by the examples in Table 1.

                      +============+===============+
                      | MCE/class: | Replace with: |
                      +============+===============+
                      | \S         | [^ \t\n\r]    |
                      +------------+---------------+
                      | [\S ]      | [^\t\n\r]     |
                      +------------+---------------+
                      | \d         | [0-9]         |
                      +------------+---------------+

                      Table 1: Example Substitutes for
                          Multi-Character Escapes

   Note that the semantics of \d in XSD regular expressions is that of
   \p{Nd}; however, this would include all Unicode characters that are
   digits in various writing systems, which is almost certainly not what
   is required in IETF publications.

   The construct \p{IsBasicLatin} is essentially a reference to legacy
   ASCII; it can be replaced by the character class [\u0000-\u007f].
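
   The substitutions of Table 1 can be applied mechanically.  The
   following Python sketch is a minimal, non-normative illustration;
   the name replace_mces is hypothetical.  It rewrites \S and \d
   outside character classes and the exact class [\S ], and it copies
   every other character class and escape unchanged, so MCEs inside
   other character classes are an unhandled case here.

```python
import re

# Tokens are tried in order; the exact class "[\S ]" must come before the
# generic character-class alternative.
_TOKEN = re.compile(
    r"""
      (?P<cls_S>\[\\S\ \])             # the character class [\S ]
    | (?P<cls>\[(?:\\.|[^\]\\])*\])    # any other class, copied as is
    | (?P<S>\\S)                       # \S outside a character class
    | (?P<d>\\d)                       # \d outside a character class
    | (?P<esc>\\.)                     # any other escape, copied as is
    """,
    re.VERBOSE | re.DOTALL,
)

# Replacement texts from Table 1 (\t, \n, \r stay as escapes).
_REPLACEMENTS = {
    "cls_S": r"[^\t\n\r]",
    "S": r"[^ \t\n\r]",
    "d": "[0-9]",
}


def replace_mces(regexp: str) -> str:
    """Apply the Table 1 substitutions to regexp."""
    def substitute(match):
        return _REPLACEMENTS.get(match.lastgroup, match.group(0))
    return _TOKEN.sub(substitute, regexp)
```

   For example, replace_mces applied to \d{4}-\d{2} yields
   [0-9]{4}-[0-9]{2}.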

5.2.  XSD Regexps

   Any I-Regexp is also an XSD regexp [XSD-2], so the mapping is an
   identity function.

   Note that a few errata for [XSD-2] have been fixed in [XSD-1.1-2];
   therefore, it is also included in the Normative References
   (Section 9.1).  XSD 1.1 is less widely implemented than XSD 1.0, and
   implementations of XSD 1.0 are likely to include these bugfixes; for
   the intents and purposes of this specification, an implementation of
   XSD 1.0 regexps is equivalent to an implementation of XSD 1.1
   regexps.

5.3.  ECMAScript Regexps

   Perform the following steps on an I-Regexp to obtain an ECMAScript
   regexp [ECMA-262]:

   *  For any unescaped dots (.) outside character classes (first
      alternative of charClass production), replace the dot with
      [^\n\r].

   *  Envelope the result in ^(?: and )$.

   The ECMAScript regexp is to be interpreted as a Unicode pattern ("u"
   flag; see Section 21.2.2 "Pattern Semantics" of [ECMA-262]).

   Note that where a regexp literal is required, the actual regexp needs
   to be enclosed in /.

5.4.  PCRE, RE2, and Ruby Regexps

   To obtain a valid regexp in Perl Compatible Regular Expressions
   (PCRE) [PCRE2], the Go programming language's RE2 regexp library
   [RE2], and the Ruby programming language, perform the same steps as
   in Section 5.3, except that the last step is:

   *  Enclose the regexp in \A(?: and )\z.
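
   The mapping steps of Sections 5.3 and 5.4 can be sketched as
   follows.  This is a minimal, non-normative Python illustration that
   assumes its input is already a valid I-Regexp; the function names
   to_ecmascript and to_pcre are hypothetical.

```python
def _replace_dots(iregexp: str) -> str:
    """Replace unescaped dots outside character classes with [^\\n\\r]."""
    out = []
    in_class = False
    i = 0
    while i < len(iregexp):
        ch = iregexp[i]
        if ch == "\\" and i + 1 < len(iregexp):
            # Copy an escape (such as \. or the start of \p{L}) unchanged.
            out.append(iregexp[i:i + 2])
            i += 2
            continue
        if ch == "[":
            in_class = True
        elif ch == "]":
            in_class = False
        if ch == "." and not in_class:
            out.append(r"[^\n\r]")
        else:
            out.append(ch)
        i += 1
    return "".join(out)


def to_ecmascript(iregexp: str) -> str:
    """Section 5.3: anchor with ^(?: and )$ (to be used with the u flag)."""
    return "^(?:" + _replace_dots(iregexp) + ")$"


def to_pcre(iregexp: str) -> str:
    """Section 5.4: same dot handling, anchored with \\A(?: and )\\z."""
    return r"\A(?:" + _replace_dots(iregexp) + r")\z"
```

   For the I-Regexp a.b, to_ecmascript returns ^(?:a[^\n\r]b)$ and
   to_pcre returns \A(?:a[^\n\r]b)\z.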

6.  Motivation and Background

   While regular expressions originally were intended to describe a
   formal language to support a Boolean matching function, they have
   been enhanced with parsing functions that support the extraction and
   replacement of arbitrary portions of the matched text.  With this
   accretion of features, parsing-regexp libraries have become more
   susceptible to bugs and surprising performance degradations that can
   be exploited in denial-of-service attacks by an attacker who controls
   the regexp submitted for processing.  I-Regexp is designed to offer
   interoperability and to be less vulnerable to such attacks, with the
   trade-off that its only function is to offer a Boolean response as to
   whether a character sequence is matched by a regexp.

6.1.  Implementing I-Regexp

   XSD regexps are relatively easy to implement or map to widely
   implemented parsing-regexp dialects, with these notable exceptions:

   *  Character class subtraction.  This is a very useful feature in
      many specifications, but it is unfortunately mostly absent from
      parsing-regexp dialects.  Thus, it is omitted from I-Regexp.

   *  Multi-character escapes.  \d, \w, \s and their uppercase
      complement classes exhibit a large amount of variation between
      regexp flavors.  Thus, they are omitted from I-Regexp.

   *  Not all regexp implementations support access to Unicode tables
      that enable executing constructs such as \p{Nd}, although the
      \p/\P feature in general is now quite widely available.  While, in
      principle, it is possible to translate these into character-class
      matches, this also requires access to those tables.  Thus, regexp
      libraries in severely constrained environments may not be able to
      support I-Regexp conformance.
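
   The translation of a category escape into character-class ranges
   mentioned in the last item can be sketched as follows.  This
   minimal, non-normative Python illustration uses the Unicode tables
   shipped in Python's "unicodedata" module, so its results depend on
   the Unicode version of the Python build; the name category_ranges
   is hypothetical.

```python
import sys
import unicodedata


def category_ranges(prefix: str) -> list:
    """Return inclusive (first, last) code point ranges for a category.

    prefix is a general category such as "Nd", or a single letter such
    as "L" that selects every category starting with that letter.
    """
    ranges = []
    start = None
    for cp in range(sys.maxunicode + 1):
        if unicodedata.category(chr(cp)).startswith(prefix):
            if start is None:
                start = cp          # open a new range
        elif start is not None:
            ranges.append((start, cp - 1))
            start = None
    if start is not None:           # range running to the last code point
        ranges.append((start, sys.maxunicode))
    return ranges
```

   The ranges returned for "Nd" include (0x30, 0x39), the ASCII
   digits, along with the digit ranges of other writing systems.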

7.  IANA Considerations

   This document has no IANA actions.

8.  Security Considerations

   While technically out of the scope of this specification, Section 10
   ("Security Considerations") of [RFC3629] applies to implementations.
   Particular note needs to be taken of the last paragraph of Section 3
   ("UTF-8 definition") of [RFC3629]; an I-Regexp implementation may
   need to mitigate limitations of the platform implementation in this
   regard.

   As discussed in Section 6, more complex regexp libraries may contain
   exploitable bugs, which can lead to crashes and remote code
   execution.  There is also the problem that such libraries often have
   performance characteristics that are hard to predict, leading to
   attacks that overload an implementation by matching against an
   expensive attacker-controlled regexp.

   I-Regexps have been designed to allow implementation in a way that is
   resilient to both threats; this objective needs to be addressed
   throughout the implementation effort.  Non-checking implementations
   (see Section 3.1) are likely to expose security limitations of any
   regexp engine they use, which may be less problematic if that engine
   has been built with security considerations in mind (e.g., [RE2]).
   In any case, a checking implementation is still RECOMMENDED.

   Implementations that specifically implement the I-Regexp subset can,
   with care, be designed to generally run in linear time and space in
   the input and to detect when that would not be the case (see below).

   Existing regexp engines should be able to easily handle most
   I-Regexps (after the adjustments discussed in Section 5), but may
   consume excessive resources for some types of I-Regexps or outright
   reject them because they cannot guarantee efficient execution.  (Note
   that different versions of the same regexp library may be more or
   less vulnerable to excessive resource consumption for these cases.)

   Specifically, range quantifiers (as in a{2,4}) provide particular
   challenges for both existing and I-Regexp focused implementations.
   Implementations may therefore limit range quantifiers in
   composability (disallowing nested range quantifiers such as
   (a{2,4}){2,4}) or range (disallowing very large ranges such as
   a{20,200000}), or detect and reject any excessive resource
   consumption caused by range quantifiers.
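
   One way to apply such limits is sketched below.  This minimal,
   non-normative Python illustration assumes its input has already
   been checked to be a valid I-Regexp.  The function name
   find_range_quantifier_risks and the default limit of 1000 are
   hypothetical choices made for illustration, not values taken from
   this specification.

```python
def find_range_quantifier_risks(iregexp: str, max_bound: int = 1000) -> list:
    """Report nested range quantifiers and bounds above max_bound."""
    problems = []
    # One flag per open group (index 0 is the top level): True once a
    # range quantifier has been seen inside that group.
    ranged = [False]
    last_group_ranged = False
    i = 0
    while i < len(iregexp):
        ch = iregexp[i]
        # True only if the previous token closed a group holding a range.
        group_ranged, last_group_ranged = last_group_ranged, False
        if ch == "\\":
            # The braces of \p{..} and \P{..} are not quantifiers.
            if iregexp[i + 1:i + 3] in ("p{", "P{"):
                i = iregexp.index("}", i)
            else:
                i += 1
        elif ch == "[":
            i += 1
            while i < len(iregexp) and iregexp[i] != "]":
                i += 2 if iregexp[i] == "\\" else 1
        elif ch == "(":
            ranged.append(False)
        elif ch == ")":
            last_group_ranged = ranged.pop()
            ranged[-1] = ranged[-1] or last_group_ranged
        elif ch == "{":
            end = iregexp.index("}", i)
            bounds = [int(b) for b in iregexp[i + 1:end].split(",") if b]
            if group_ranged:
                problems.append(f"nested range quantifier at offset {i}")
            if max(bounds) > max_bound:
                problems.append(f"range bound above {max_bound} at offset {i}")
            ranged[-1] = True
            i = end
        i += 1
    return problems
```

   The sketch returns an empty list for a{2,4} and one problem each
   for (a{2,4}){2,4} and a{20,200000}.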

   I-Regexp implementations that are used to evaluate regexps from
   untrusted sources need to be robust in these cases.  Implementers
   using existing regexp libraries are encouraged:

   *  to check their documentation to see if mitigations are
      configurable, such as limits in resource consumption, and

   *  to document their own degree of robustness resulting from
      employing such mitigations.

9.  References

9.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

   [RFC5234]  Crocker, D., Ed. and P. Overell, "Augmented BNF for Syntax
              Specifications: ABNF", STD 68, RFC 5234,
              DOI 10.17487/RFC5234, January 2008,
              <https://www.rfc-editor.org/info/rfc5234>.

   [RFC7405]  Kyzivat, P., "Case-Sensitive String Support in ABNF",
              RFC 7405, DOI 10.17487/RFC7405, December 2014,
              <https://www.rfc-editor.org/info/rfc7405>.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017, <https://www.rfc-editor.org/info/rfc8174>.

   [XSD-1.1-2]
              Peterson, D., Ed., Gao, S., Ed., Malhotra, A., Ed.,
              Sperberg-McQueen, C. M., Ed., Thompson, H., Ed., and P.
              Biron, Ed., "W3C XML Schema Definition Language (XSD) 1.1
              Part 2: Datatypes", W3C REC REC-xmlschema11-2-20120405,
              5 April 2012,
              <https://www.w3.org/TR/2012/REC-xmlschema11-2-20120405/>.

   [XSD-2]    Biron, P., Ed. and A. Malhotra, Ed., "XML Schema Part 2:
              Datatypes Second Edition", W3C REC REC-xmlschema-
              2-20041028, 28 October 2004,
              <https://www.w3.org/TR/2004/REC-xmlschema-2-20041028/>.

9.2.  Informative References

   [ECMA-262] Ecma International, "ECMAScript 2020 Language
              Specification", ECMA Standard ECMA-262, 11th Edition, June
              2020, <https://www.ecma-international.org/wp-
              content/uploads/ECMA-262.pdf>.

   [JSONPATH-BASE]
              Gössner, S., Ed., Normington, G., Ed., and C. Bormann,
              Ed., "JSONPath: Query expressions for JSON", Work in
              Progress, Internet-Draft, draft-ietf-jsonpath-base-20, 25
              August 2023, <https://datatracker.ietf.org/doc/html/draft-
              ietf-jsonpath-base-20>.

   [PCRE2]    "Perl-compatible Regular Expressions (revised API:
              PCRE2)", n.d., <http://pcre.org/current/doc/html/>.

   [RE2]      "RE2 is a fast, safe, thread-friendly alternative to
              backtracking regular expression engines like those used in
              PCRE, Perl, and Python. It is a C++ library.", n.d., commit
              73031bb, <https://github.com/google/re2>.

   [RFC3629]  Yergeau, F., "UTF-8, a transformation format of ISO
              10646", STD 63, RFC 3629, DOI 10.17487/RFC3629, November
              2003, <https://www.rfc-editor.org/info/rfc3629>.

   [RFC7493]  Bray, T., Ed., "The I-JSON Message Format", RFC 7493,
              DOI 10.17487/RFC7493, March 2015,
              <https://www.rfc-editor.org/info/rfc7493>.

   [UNICODE-GLOSSARY]
              Unicode, Inc., "Glossary of Unicode Terms",
              <https://unicode.org/glossary/>.

Appendix A.  Regexps and Similar Constructs in Recent Published RFCs

   This section is to be removed before publishing as an RFC.

   This appendix contains a number of regular expressions that have been
   extracted from some recently published RFCs based on some ad-hoc
   matching.  Multi-line constructions were not included.  With the
   exception of some (often surprisingly dubious) usage of multi-
   character escapes and a reference to the IsBasicLatin Unicode block,
   all regular expressions validate against the ABNF in Figure 1.

   rfc6021.txt  459 (([0-1](\.[1-3]?[0-9]))|(2\.(0|([1-9]\d*))))
   rfc6021.txt  513 \d*(\.\d*){1,127}
   rfc6021.txt  529 \d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?
   rfc6021.txt  631 ([0-9a-fA-F]{2}(:[0-9a-fA-F]{2})*)?
   rfc6021.txt  647 [0-9a-fA-F]{2}(:[0-9a-fA-F]{2}){5}
   rfc6021.txt  933 ((:|[0-9a-fA-F]{0,4}):)([0-9a-fA-F]{0,4}:){0,5}
   rfc6021.txt  938 (([^:]+:){6}(([^:]+:[^:]+)|(.*\..*)))|
   rfc6021.txt 1026 ((:|[0-9a-fA-F]{0,4}):)([0-9a-fA-F]{0,4}:){0,5}
   rfc6021.txt 1031 (([^:]+:){6}(([^:]+:[^:]+)|(.*\..*)))|
   rfc6020.txt 6647 [0-9a-fA-F]*
   rfc6095.txt 2544 \S(.*\S)?
   rfc6110.txt 1583 [aeiouy]*
   rfc6110.txt 3222 [A-Z][a-z]*
   rfc6536.txt 1583 \*
   rfc6536.txt 1632 [^\*].*
   rfc6643.txt  524 \p{IsBasicLatin}{0,255}
   rfc6728.txt 3480 \S+
   rfc6728.txt 3500 \S(.*\S)?
   rfc6991.txt  477 (([0-1](\.[1-3]?[0-9]))|(2\.(0|([1-9]\d*))))
   rfc6991.txt  525 \d*(\.\d*){1,127}
   rfc6991.txt  541 [a-zA-Z_][a-zA-Z0-9\-_.]*
   rfc6991.txt  542 .|..|[^xX].*|.[^mM].*|..[^lL].*
   rfc6991.txt  571 \d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?
   rfc6991.txt  665 ([0-9a-fA-F]{2}(:[0-9a-fA-F]{2})*)?
   rfc6991.txt  693 [0-9a-fA-F]{2}(:[0-9a-fA-F]{2}){5}
   rfc6991.txt  725 ([0-9a-fA-F]{2}(:[0-9a-fA-F]{2})*)?
   rfc6991.txt  743 [0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-
   rfc6991.txt 1041 ((:|[0-9a-fA-F]{0,4}):)([0-9a-fA-F]{0,4}:){0,5}
   rfc6991.txt 1046 (([^:]+:){6}(([^:]+:[^:]+)|(.*\..*)))|
   rfc6991.txt 1099 [0-9\.]*
   rfc6991.txt 1109 [0-9a-fA-F:\.]*
   rfc6991.txt 1164 ((:|[0-9a-fA-F]{0,4}):)([0-9a-fA-F]{0,4}:){0,5}
   rfc6991.txt 1169 (([^:]+:){6}(([^:]+:[^:]+)|(.*\..*)))|
   rfc7407.txt  933 ([0-9a-fA-F]){2}(:([0-9a-fA-F]){2}){0,254}
   rfc7407.txt 1494 ([0-9a-fA-F]){2}(:([0-9a-fA-F]){2}){4,31}
   rfc7758.txt  703 \d{2}:\d{2}:\d{2}(\.\d+)?
   rfc7758.txt 1358 \d{2}:\d{2}:\d{2}(\.\d+)?
   rfc7895.txt  349 \d{4}-\d{2}-\d{2}
   rfc7950.txt 8323 [0-9a-fA-F]*
   rfc7950.txt 8355 [a-zA-Z_][a-zA-Z0-9\-_.]*
   rfc7950.txt 8356 [xX][mM][lL].*
   rfc8040.txt 4713 \d{4}-\d{2}-\d{2}
   rfc8049.txt 6704 [A-Z]{2}
   rfc8194.txt  629 \*
   rfc8194.txt  637 [0-9]{8}\.[0-9]{6}
   rfc8194.txt  905 Z|[\+\-]\d{2}:\d{2}
   rfc8194.txt  963 (2((2[4-9])|(3[0-9]))\.).*
   rfc8194.txt  974 (([fF]{2}[0-9a-fA-F]{2}):).*
   rfc8299.txt 7986 [A-Z]{2}
   rfc8341.txt 1878 \*
   rfc8341.txt 1927 [^\*].*
   rfc8407.txt 1723 [0-9\.]*
   rfc8407.txt 1749 [a-zA-Z_][a-zA-Z0-9\-_.]*
   rfc8407.txt 1750 .|..|[^xX].*|.[^mM].*|..[^lL].*
   rfc8525.txt  550 \d{4}-\d{2}-\d{2}
   rfc8776.txt  838 /?([a-zA-Z0-9\-_.]+)(/[a-zA-Z0-9\-_.]+)*
   rfc8776.txt  874 ([a-zA-Z0-9\-_.]+:)*
   rfc8819.txt  311 [\S ]+
   rfc8944.txt  596 [0-9a-fA-F]{2}(:[0-9a-fA-F]{2}){7}

         Figure 2: Example regular expressions extracted from RFCs

Acknowledgements

   Discussion in the IETF JSONPATH WG about whether to include a regexp
   mechanism into the JSONPath query expression specification and
   previous discussions about the YANG pattern and Concise Data
   Definition Language (CDDL) .regexp features motivated this
   specification.

   The basic approach for this specification was inspired by "The I-JSON
   Message Format" [RFC7493].

Authors' Addresses

   Carsten Bormann
   Universität Bremen TZI
   Postfach 330440
   D-28359 Bremen
   Germany
   Phone: +49-421-218-63921
   Email: cabo@tzi.org

   Tim Bray
   Textuality
   Canada
   Email: tbray@textuality.com