Network Working Group                                          K. Davies
Internet-Draft                                                     ICANN
Intended status: Informational                                A. Freytag
Expires: March 27, 2014                                       ASMUS Inc.
                                                      September 23, 2013

Representing Label Generation Rulesets using XML
draft-davies-idntables-04

Abstract

This memo describes a method of representing the domain name registration policy for a zone administrator using Extensible Markup Language (XML). These policies, known as "Label Generation Rulesets" (LGRs), are particularly used for the implementation of Internationalised Domain Names (IDNs). The rulesets are used to implement and share policy on which specific Unicode code points are permitted for registrations, which alternative code points are considered variants, and what actions may be performed on those variants.

Status of this Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on March 27, 2014.

Copyright Notice

Copyright (c) 2013 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document.
Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

Table of Contents

1. Introduction
2. Design Goals
3. Requirements
4. LGR Format
   4.1. Namespace
   4.2. Basic structure
   4.3. Metadata
      4.3.1. The version element
      4.3.2. The date element
      4.3.3. The language element
      4.3.4. The domain element
      4.3.5. The description element
      4.3.6. The validity-start and validity-end elements
      4.3.7. The unicode-version element
      4.3.8. The references element
5. Code Point Rules
   5.1. Sequences
   5.2. Variants
      5.2.1. Basic variants
      5.2.2. Null variants
      5.2.3. Dispositions
      5.2.4. The ref attribute
      5.2.5. Conditional variants
      5.2.6. The comment attribute
   5.3. Code point tagging
6. Whole Label and Context Evaluation
   6.1. Basic concepts
   6.2. Character Classes
      6.2.1. Tag-based classes
      6.2.2. Unicode property based classes
      6.2.3. Explicitly declared classes
      6.2.4. Combined classes
   6.3. Whole Label and Context Rules
      6.3.1. The rule element
      6.3.2. Parameterized Context or When Rule
   6.4. Action elements
      6.4.1. Recommended Disposition Values
      6.4.2. Precedence
      6.4.3. Implied actions
7. Example table
8. Processing a label against an LGR
   8.1. Determining eligibility for a label
   8.2. Determining variants for a label
   8.3. Determining a disposition for a label or variant label
9. Conversion between other formats
10. IANA Considerations
11. Security Considerations
12. References
Appendix A. RelaxNG Schema
Appendix B. Acknowledgements
Appendix C. Editorial Notes
   C.1. Known Issues and Future Work
   C.2. Change History
Authors' Addresses

1. Introduction

This memo describes a method of using Extensible Markup Language (XML) to describe the algorithm used to determine whether a given domain label is permitted, and under which conditions, based on the code points it contains and their context. These algorithms comprise a list of permissible code points, variants, and a number of rules describing where certain relationships are applied. These algorithms form part of a zone administrator's policies, and can be referred to as Label Generation Rulesets (LGRs), or IDN tables.

Administrators of the zones for top-level domain registries have historically published their LGRs using ASCII text or HTML. The formatting of these documents has been loosely based on the format used for the Language Variant Table in [RFC3743]. [RFC4290] also provides a "model table format" that describes a similar set of functionality.

Through the first decade of IDN deployment, experience has shown that LGRs derived from these formats are difficult to implement and compare consistently because their formats differ. A universal format, such as one using a structured XML format, will assist by improving machine-readability, consistency, reusability and maintainability of LGRs. It also provides for more complex conditional implementation of variants that reflects the known requirements of current zone administrator policies.

While the predominant usage of this specification is to represent IDN label policy, the format is not limited to IDN usage and may also be used for describing ASCII domain name label rulesets.

2. Design Goals

The following items are explicit design goals of this format:

   o  The format MUST be implementable in software in a reasonably straightforward manner;

   o  The format SHOULD be able to be checked for formatting errors, such that common mistakes can be caught;

   o  An LGR MUST be able to express the set of valid code points that are allowed for registration under a specific zone administrator's policies;

   o  An LGR MUST be able to express computed alternatives to a given domain name based on a one-to-one or one-to-many relationship. These computed alternatives are commonly known as "variants";

   o  Variants SHOULD be able to be tagged with specific categories, such that the categories can be used to support registry policy (such as whether to list the computed variant in the zone, or to merely block it from registration);

   o  Variants MUST be able to be stipulated based on contextual information. For example, specific variants may only be applicable when they follow another specific code point, or when the code point is displayed in a specific presentation form;

   o  The data contained within an LGR MUST be unambiguous, such that independent implementations that utilise the contents will arrive at the same results;

   o  LGRs SHOULD be suitable for comparison and re-use, such that one could easily compare the contents of two or more to see the differences, to merge them, and so on;

   o  As many existing IDN tables as practicable SHOULD be able to be migrated to the LGR format with all applicable logic retained.

It is explicitly NOT the goal of this format to stipulate what code points should be listed in an LGR by a zone administrator. Which registration policies are used for a particular zone is outside the scope of this memo.

3. Requirements

To fulfil the known uses of LGRs, the existing corpus of published IDN tables was reviewed in preparing this specification. In addition, the requirements of ICANN's work to implement an LGR for the DNS Root Zone [LGR-PROCEDURE] were also considered. In Section B of that document, five specific requirements for an LGR methodology were identified:

   o  The ability to identify a set of code points that are permitted.

   o  The ability to represent a list of variants, if any, for each code point.

   o  A method of identifying code points that are related, using a tag.

   o  The ability to describe rules regarding the possible actions that may be performed on the resulting label (such as blocked, allocatable, etc.).

   o  The ability to describe rules that check for ill-formed combinations across the whole label.

Finally, the syntax and rules in [RFC5892] were reviewed.

4. LGR Format

An LGR is expressed as a well-formed XML Document [XML].

4.1. Namespace

The XML Namespace URI is [TBD].

4.2. Basic structure

The basic XML framework of the document is as follows:

      ...

Within the "lgr" element are several sub-elements. First is a "meta" element that contains all meta-data associated with the IDN table, such as its authorship, what it is used for, implementation notes and references. This is followed by a "data" element that contains the substantive code point data. Finally, an optional "rules" element contains information on whole-label evaluation rules, if any, along with any specific rules regarding the disposition of computed variants.

      ...
      ...
      ...

A document MUST contain exactly one "lgr" element. Each "lgr" element MUST contain exactly one "data" element, optionally preceded by one "meta" element and optionally followed by one "rules" element.

4.3. Metadata

The "meta" element is used to express meta-data associated with the LGR.
It can be used to identify the author or relevant contact person, explain the intended usage of the LGR, and provide implementation notes as well as references. The data contained within is not required by software consuming the LGR in order to calculate valid labels, or to calculate variants. However, the "unicode-version" element should be used to identify whether a consumer of the table has the right Unicode data to perform operations on the table.

4.3.1. The version element

The "version" element is used to uniquely identify each version of the LGR being represented. No specific format is required, but it is RECOMMENDED that it be a positive integer, which is incremented with each revision of the file.

An example of a typical first edition of a document:

      <version>1</version>

4.3.2. The date element

The "date" element is used to identify the date the LGR was posted. The contents of this element MUST be a valid ISO 8601 date string as described in [RFC3339].

Example of a date:

      <date>2009-11-01</date>

4.3.3. The language element

The "language" element signals that the LGR is associated with a specific language or script. The value of the language element must be a valid language tag as described in [RFC5646]. The tag may simply refer to a script if the LGR is not referring to a specific language.

Example of an English language LGR:

      <language>en</language>

If the LGR applies to a specific script, rather than a language, the "und" language tag should be used followed by the relevant [RFC5646] script subtag. For example, for a Cyrillic script LGR:

      <language>und-Cyrl</language>

If the LGR covers a specific set of multiple languages or scripts, the language element can be repeated.
However, for cases of insignificant admixture of characters from other scripts, the use of a single "language" element identifying the predominant script is RECOMMENDED. In the exceptional case where no script is predominant, use Zyyy (Common):

      <language>und-Zyyy</language>

Note that for the particular case of Japanese, a script subtag "Japn" exists that matches the mixture of scripts used in writing that language. The preferred language element would be:

      <language>und-Japn</language>

4.3.4. The domain element

This optional element refers to a domain to which this policy is applied.

      <domain>example.com</domain>

Multiple "domain" elements may be used to reflect a list of domains.

4.3.5. The description element

The "description" element is a free-form element that contains any additional relevant description that is useful for the user in its interpretation. Typically, this field contains authorship information, as well as additional context on how the LGR was formulated (such as citations and references), and how it has been applied.

The element has an optional "type" attribute, which gives the media type of the enclosed data. The attribute should be a valid MIME type. If supplied, the contents are assumed to be of that type. Typical types would be "text/plain" or "text/html". If no "type" attribute is specified, "text/plain" is assumed.

4.3.6. The validity-start and validity-end elements

The "validity-start" and "validity-end" elements are optional elements that describe, respectively, the time from which the contents of the LGR become valid (i.e. are used in registry policy) and the time at which the contents of the LGR cease to be used.
The times should conform to the format described in Section 5.6 of [RFC3339]. Each may consist of a date, or a date and time stamp.

4.3.7. The unicode-version element

Whenever an IDN table depends on character properties from a given version of the Unicode standard, the minimum version number MUST be listed. If any software processing the table does not have the minimum requisite version, it MUST NOT perform any operations relating to whole-label evaluation. This is because some Unicode code points may not have been assigned in an earlier version, leaving properties for these code points undefined. It is RECOMMENDED to only reference stable or immutable properties, as others may change between versions (see Section 6.2).

It is not necessary to include a "unicode-version" element for files that do not make use of Unicode properties. Because Unicode has been strictly additive from Version 1.1, the required minimum version for the repertoire can be uniquely determined by checking the code point values in any "cp" attributes against the "age" property in [UAX42].

4.3.8. The references element

An IDN table may define a list of references in an optional "references" element. The references element contains any number of "reference" elements, each containing an "id" attribute. It is RECOMMENDED that the "id" attribute be an integer. The "id" attributes MUST be unique. The value of the element may be a citation of a standard, dictionary or other specification, in any suitable format. A reference can be associated with many elements contained in the "data" or "rules" elements, by using an optional "ref" attribute.

      The Unicode Standard, Version 7.0
      Big-5: Computer Chinese Glyph and Character Code Mapping Table, Technical Report C-26, 1984
      ISO/IEC 10646:2012 3rd edition
      ...
      ...
      ...
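
The constraints on references lend themselves to a mechanical check. The following is a minimal sketch in Python, not part of the format itself. It assumes a sample document without an XML namespace (the namespace URI is still to be determined), it assumes the "references" element sits inside "meta", and it relies on "ref" being a space-delimited list of identifiers as described in Section 5.2.4. The function name check_references and the sample data are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; element nesting and the absence of a namespace
# are assumptions made for this sketch only.
SAMPLE = """
<lgr>
  <meta>
    <references>
      <reference id="0">The Unicode Standard, Version 7.0</reference>
      <reference id="1">ISO/IEC 10646:2012 3rd edition</reference>
    </references>
  </meta>
  <data>
    <char cp="0061" ref="0 1"/>
    <char cp="0062" ref="2"/>
  </data>
</lgr>
"""


def check_references(root):
    """Return a list of problems found with reference declarations and uses."""
    problems = []
    declared = set()
    # Every "reference" needs an "id", and the ids MUST be unique.
    for reference in root.iter("reference"):
        ref_id = reference.get("id")
        if ref_id is None:
            problems.append("reference element without an id attribute")
        elif ref_id in declared:
            problems.append(f"duplicate reference id {ref_id!r}")
        else:
            declared.add(ref_id)
    # Every identifier in a "ref" attribute should name a declared reference.
    for elem in root.iter():
        for token in (elem.get("ref") or "").split():
            if token not in declared:
                problems.append(f"<{elem.tag}> cites undeclared reference {token!r}")
    return problems


for problem in check_references(ET.fromstring(SAMPLE)):
    print(problem)  # <char> cites undeclared reference '2'
```

Since reference information is ignored when an LGR is machine-processed, a check of this kind would belong in an authoring or validation tool rather than in label processing.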
A "ref" attribute may not occur on elements that are named references to character classes and rules, nor on certain other element types. See the descriptions of these elements below.

5. Code Point Rules

The bulk of a label generation ruleset is a description of which code points are eligible for a given label. For rulesets that perform operations that result in potential variants, the code point-level relationships between variants also need to be described.

The code point data is collected within a "data" element. Within this element, a series of "char" and "range" elements describe eligible code points, or ranges of code points, respectively.

Discrete permissible code points or code point sequences are declared with a "char" element, e.g.

Ranges of permissible code points may be stipulated with a "range" element, e.g.

The range is inclusive of the first and last code points. Whether code points are specified individually or as part of a range makes no difference in processing the data, and tools reading or writing the XML format might not retain the distinction. Any attribute defined for a "range" element applies as if it were specified for each code point within the range.

Code points must be expressed in uppercase hexadecimal, zero padded to a minimum of 4 digits; in other words, according to the standard Unicode convention but without the prefix "U+". The rationale for not allowing other encoding formats, including native Unicode encoding in XML, is explored in [UAX42]. The XML conventions used in this format, including the element and attribute names, mirror that document where practical and reasonable to do so.

5.1. Sequences

A sequence of two or more code points may be specified in an LGR, for example, when defining the source for n:m variant mappings. Another use of sequences would be in cases when the exact sequence of code points is required to occur in order for the constituent elements to be eligible, such as when a specific code point is only eligible when preceded or followed by another code point. The following would define the eligibility of the MIDDLE DOT (U+00B7) only when both preceded and followed by the LATIN SMALL LETTER L (U+006C):

As an alternative to using sequences to define a required context, a "char" or "range" element may specify conditional context in a "when" attribute as described below in Section 5.2.5. The latter method is more flexible in that such conditional context is not limited to a specific code point, and it allows prohibited, as well as required, context to be specified.

5.2. Variants

While most LGRs typically only determine code point eligibility, others additionally specify a mapping of code points to other code points, known as "variants". What constitutes a variant is a matter of policy, and varies for each implementation. The following examples are intended to demonstrate the syntax; they are not necessarily typical.

5.2.1. Basic variants

Variants are specified using one or more "var" elements as children of a "char" element. For example, to map LATIN SMALL LETTER V (U+0076) as a variant of LATIN SMALL LETTER U (U+0075):

A sequence of multiple code points can be specified as a variant of a single code point.
For example, the sequence of LATIN SMALL LETTER O (U+006F) then LATIN SMALL LETTER E (U+0065) might hypothetically be specified as a variant for a LATIN SMALL LETTER O WITH DIAERESIS (U+00F6) as follows:

The "var" element specifies variants in only one direction, even though the variant relation is usually considered symmetric, that is, if A is a variant of B then B is typically also a variant of A. The format requires that the inverse of the variant is given explicitly to fully specify symmetric variant relations in the IDN table. This has the beneficial side effect of making the symmetry explicit:

Both the source and target of a variant mapping may be sequences. As it is not possible to specify variants for ranges, ranges cannot be used for characters for which variant relations need to be defined.

5.2.2. Null variants

To specify a null variant, which is a variant string that maps to no code point, use an empty cp attribute. For example, to map a string with a ZERO WIDTH NON-JOINER (U+200C) to the same string without the ZERO WIDTH NON-JOINER:

The symmetric form of a null variant cannot be expressed in the IDN table format.

5.2.3. Dispositions

Variants may be given dispositions. These describe the policy state for a variant label that was generated using a particular variant. The dispositions are the same as those described below in Section 6.4. A disposition may be of any value, but several conventional dispositions are predefined in that section to encourage consistent application. If these values can represent registry policy, they SHOULD be used.

Usually, if a variant label contains any instance of one of the blocked variants, the label would be blocked, but if it contains only instances of allocated variants, it could be allocated. See the discussion about implied actions in Section 6.4.3.

5.2.4. The ref attribute

Reference information may optionally be specified by a "ref" attribute, consisting of a space-delimited sequence of reference identifiers.

This facility is typically used to give source information for characters or variant relations. This information is ignored when machine-processing an LGR. Specifying a "ref" attribute on a range element is equivalent to specifying the same ref attribute on every single code point of the range.

The reference identifiers MUST match those declared in the "references" element (see Section 4.3.8).

In addition to "char", "range" and "var" elements in the data section, a ref attribute may be present for literals ("char" inside a rule) as well as rules and class definitions, but not for named references to them.

5.2.5. Conditional variants

Fundamentally, variants are mappings between two sequences of code points. However, in some instances, for a variant relationship to exist, some context external to the code point sequence must be considered. For example, in some cases the positional context determines whether two code point sequences are variants of each other.

An example is Arabic characters, which can have different forms based on position. This positional context cannot be derived solely from the code point, as the code point is the same for the various forms.

To specify a conditional variant relationship, the "when" attribute is used. The variant relationship exists when the condition in the "when" attribute is satisfied. A "not-when" attribute may be used for conditions that must not be satisfied. The value of each "when" or "not-when" attribute is a parameterized context rule as described below in Section 6.3.2.
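
The effect of the "when" and "not-when" attributes on a variant mapping can be illustrated with a short Python sketch. This is an illustration under stated assumptions, not a normative algorithm: the helper variant_applies is hypothetical, the evaluation of the named context rule is delegated to a caller-supplied function (rule evaluation itself is the subject of Section 6.3), and the sample markup borrows the rule names mentioned in this section purely for demonstration.

```python
import xml.etree.ElementTree as ET
from typing import Callable


def variant_applies(var: ET.Element, rule_matches: Callable[[str], bool]) -> bool:
    """Return True if the mapping in a "var" element exists in this context.

    rule_matches is a hypothetical, caller-supplied function that reports
    whether the named context rule is satisfied at the current position.
    """
    when = var.get("when")
    not_when = var.get("not-when")
    if when is not None and not rule_matches(when):
        return False  # the required context is absent
    if not_when is not None and rule_matches(not_when):
        return False  # the prohibited context is present
    return True  # unconditional, or every stated condition holds


# Hypothetical markup for this sketch; not taken from a published LGR.
char = ET.fromstring(
    '<char cp="0625">'
    '<var cp="0673" when="arabic-isolated"/>'
    '<var cp="0673" when="arabic-final"/>'
    "</char>"
)


def in_final_position(rule_name: str) -> bool:
    # Pretend the code point currently sits in final position.
    return rule_name == "arabic-final"


active = [v.get("when") for v in char.findall("var")
          if variant_applies(v, in_final_position)]
print(active)  # ['arabic-final']
```

Keeping rule evaluation behind a callback separates the question of whether a mapping exists from the question of how a context rule is matched, which mirrors the split between the data and rules parts of the format.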
Assuming the "rules" element contains suitably defined rules for "arabic-isolated" and "arabic-final", the following example shows how to mark ARABIC LETTER ALEF WITH WAVY HAMZA BELOW (U+0673) as a variant of ARABIC LETTER ALEF WITH HAMZA BELOW (U+0625), but only when it appears in isolated or final forms:

Only a single "when" or "not-when" attribute can be applied to any "var" element; however, multiple "var" elements using the same mapping but different "when" or "not-when" attributes may be specified.

While Arabic is currently the only script known for which such conditional variants are defined, there are other scripts, such as Mongolian, which share the concept of positional forms. By requiring explicit definitions for these rules, this mechanism can easily handle any additional types of conditional variants that are required.

As described in Section 5.1, a "when" or "not-when" attribute may also be specified on any "char" element in the data section to define required or prohibited contextual conditions under which a code point is valid.

5.2.6. The comment attribute

Any "char", "range" or "var" element may contain a comment in a "comment" attribute. The contents of a comment attribute are free-form plain text. Comments are ignored in machine processing of the table. Comment attributes may also be placed on certain elements in the "rules" section of the document, such as actions and literals ("char"), as well as definitions of classes and rules, but not on named references to them. Finally, in the metadata, the "version" and "reference" elements may have comment attributes to match the syntax in [RFC3743].

5.3. Code point tagging

Typically, LGRs are used to explicitly designate allowable code points, with any label with a code point not explicitly listed in the LGR being considered an ineligible label according to the ruleset.
For more complex registry rules, there may be a need to discern code points of certain types. This can be accomplished by applying a "tag" attribute, and then filtering results based on the tag using whole label evaluation.

Tag attributes may be of any value, and multiple values are separated by spaces.

A simple example would be to label preferred code points (as in [RFC3743]) by adding "preferred" to the tag, and then using a rule such as shown in Section 6.3.1 to filter out labels that consist entirely of such preferred code points.

6. Whole Label and Context Evaluation

6.1. Basic concepts

The code points in a label sometimes need to satisfy context-based rules, for example for the label to be considered valid, or to satisfy the context for a variant mapping (see the description of the "when" attribute in Section 6.3.2).

A Whole Label Evaluation rule (WLE) is applied to the whole label. It is used to validate both original labels and variant labels derived from them.

A conditional context rule is a specialized form of WLE specific to the context around a single code point or code point sequence. For example, if a rule is referenced in the "when" attribute of a variant mapping, it is used to describe the conditional context under which the particular variant mapping is defined to exist.

Each rule is defined in a "rule" element. A rule may contain the following as child elements:

   o  literal code points or code point sequences;

   o  character classes, which define sets of code points to be used for context comparisons; and

   o  context operators, which define when character classes and literals may appear.

6.2. Character Classes

Character classes are sets of characters that often share a particular property. They can be specified in several ways:

   1.  by defining the property via matching a tag in the code point data.
       All characters with the same tag attribute are part of the same class;

   2.  by referencing one of the Unicode character properties defined in the Unicode Character Database [UAX42];

   3.  by explicitly listing all the code points in the class; or

   4.  by defining the class as a combination of any number of these definitions or other classes.

A character class has an optional "name" attribute, consisting of a single identifier not containing spaces. If it is omitted, the class is anonymous and exists only inside the rule or combined class where it is defined. A named character class is defined independently and can be referenced by name by both rules and other character classes.

      ...

An empty "class" element with a name attribute is a reference to an existing named class. Such an element MUST NOT have either "comment" or "ref" attributes, as those may only be placed on a class definition.

6.2.1. Tag-based classes

The char element may contain a tag attribute that consists of one or more space-separated identifiers, for example:

This defines two tags for use with code point U+0061, the tag "letter" and the tag "lower". Implicitly, this defines two named character classes, the class "letter" and the class "lower", the first with 0061 and 4E00 as elements and the latter with 0061, but not 4E00, as an element. The document MUST NOT contain an explicitly named class definition of the same name as an implicitly named tag-derived class.

6.2.2. Unicode property based classes

A class is defined in terms of Unicode properties by giving the Unicode property alias and the property value or property value alias, separated by a colon.

The example above selects all characters for which the Unicode canonical combining class (ccc) value is 9. This value of the ccc is assigned to all characters that are viramas.
The string "ccc" is the short alias for the canonical combining class, as defined in the Unicode Character Database [UAX42].

Unicode properties may, in principle, change between versions of the Unicode Standard. However, the values assigned for a given version are fixed. If Unicode properties are used, a minimum Unicode version MUST be declared in the header. (Note that some Unicode properties are by definition stable across versions and do not change once assigned.)

6.2.3. Explicitly declared classes

A class of code points may also be declared by listing the code points that are members of the class. This is useful when tagging cannot be used because code points are not listed individually as part of the eligible set of code points for the given LGR, for example because they only occur in code point sequences.

To define a class in terms of an explicit list of code points:

This defines a class named "abc" containing the code points for characters "a", "b" and "c". The ordering of the code points is not material, but it is RECOMMENDED to list them in ascending order.

Range operators may also be used to represent any series of consecutive code points. The same declaration can be made as follows:

Range and code point declarations can be freely intermixed.

A shorthand notation exists where code points are directly represented by space-separated hexadecimal values, and ranges are represented by a start and end value separated by a hyphen. The following would be a more streamlined expression of the same class using the shorthand notation:

      <class name="abc">0061 0062-0063</class>

6.2.4. Combined classes

Classes may be combined using logical operators for inversion, union, intersection, difference and symmetric difference (exclusive-or).
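
As an informal illustration of the shorthand notation and of the binary set operations just named, the following Python sketch expands a shorthand string into a set of code points and combines two such sets. The function name parse_shorthand and the second class ("vowels") are hypothetical and exist only for this example. Inversion is left out because it needs a universe of code points to complement against, which this sketch does not attempt to define.

```python
def parse_shorthand(text):
    """Expand shorthand such as "0061 0062-0063" into a set of code points."""
    members = set()
    for token in text.split():
        start, _, end = token.partition("-")
        first = int(start, 16)
        last = int(end, 16) if end else first  # a lone value is a one-element range
        if last < first:
            raise ValueError(f"range {token!r} ends before it starts")
        members.update(range(first, last + 1))  # ranges are inclusive
    return members


abc = parse_shorthand("0061 0062-0063")   # a, b, c
vowels = parse_shorthand("0061 0065")     # a, e (hypothetical second class)

union = abc | vowels                  # in either class
intersection = abc & vowels           # in both classes
difference = abc - vowels             # in the first class but not the second
symmetric_difference = abc ^ vowels   # in exactly one of the two classes

print(sorted(f"{cp:04X}" for cp in symmetric_difference))
# ['0062', '0063', '0065']
```

Representing classes as plain sets keeps the combination operators trivial; an implementation handling very large ranges might prefer interval lists instead.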
+----------------------+------------------------+
| Logical Operation    | Operator element       |
+----------------------+------------------------+
| Inversion            | "not"                  |
| Union                | "union"                |
| Intersection         | "intersection"         |
| Difference           | "difference"           |
| Symmetric Difference | "symmetric-difference" |
+----------------------+------------------------+

Combinations can be anonymous or named.

This creates a named class that represents the union of classes "xxx" and "yyy", and which can be referenced by name in other classes or rules. Note that the reference to a named class is always via a "class" element, independent of how the character class was defined.

An "intersection", "symmetric-difference" or "difference" element MUST contain precisely two child elements, and a "not" element MUST contain precisely one, each being either a "class" element or one of the operator elements. A "union" element MUST contain two or more such elements.

6.3.  Whole Label and Context Rules

Each rule comprises a series of matching operators that must be satisfied in order to determine whether a label meets a given condition. Rules may reference other rules or character classes defined elsewhere in the table.

6.3.1.  The rule element

A matching rule is defined by a "rule" element, which contains combinations of character classes with literal code point sequences and context operators contained in child elements. In evaluating a rule, each child element is matched in order. Rules may optionally be named using a "name" attribute containing a single identifier string with no spaces.
If the "name" attribute is omitted, the rule is anonymous and may not be incorporated by reference into another rule or referenced by an action or "when" attribute.

A simple rule to match a label where all characters are members of the class "preferred":

Rules are paired with explicit and implied actions, triggering these actions when a rule matches a label. For example, a simple explicit action for the rule shown above would be:

which has the effect of setting the policy disposition for a label made up entirely of "preferred" code points to "preferred". Explicit actions are further discussed in Section 6.4, and the use of rules in conditional contexts for implied actions is discussed in Section 5.2.5 and Section 6.4.3.

6.3.1.1.  The count attribute

The number of times a specific character class or rule may appear in an expression defined by a rule is given by the "count" attribute. The attribute consists of a number, optionally followed by a "+" sign. The number MUST be an integer of value 0 or higher, and gives the number of times the class or rule may appear in matching. If the number is followed by a plus sign ("+"), any number of additional occurrences is allowed beyond the number stated. Therefore, "1" means exactly one occurrence, whereas "1+" indicates one or more occurrences. If no "count" attribute is specified, the number of occurrences is "1".

6.3.1.2.  The choice element

For cases where several alternatives could be chosen, the "choice" element can encode a list of choices:

Each child element of a "choice" represents one alternative. The first matching alternative determines the match for the "choice" element. To express a choice where one alternative consists of a sequence of elements, the elements can be wrapped in an anonymous rule.

6.3.1.3.  Literal code point sequences

A literal code point sequence matches a single code point or a sequence. It is defined by a "char" element, with the code point or sequence to be matched given by the "cp" attribute. When used as a literal, a "char" element may have a "count" attribute in addition to the "cp" attribute, and may carry comments or references, but no conditional contexts or child elements.

6.3.1.4.  The any element

The "any" element matches any single code point. It may have a "count" attribute. For an example, see Section 6.3.1.8.

The "any" element may have neither a "comment" nor a "ref" attribute.

6.3.1.5.  The start and end elements

To match the beginning or end of a label, use the "start" or "end" element.

Start and end elements do not have a "count" or any other attribute. When their use is not required, it is RECOMMENDED to use the "match" attribute instead. One case where start or end elements are required is when only some, but not all, of the alternatives in a "choice" need to match the beginning or end of a label.

6.3.1.6.  The match attribute

Whole Label Evaluation Rules in principle always apply to the entire label, but in practice some rules do not need to cover the whole label, for example a rule expressing a requirement not to start a label with a digit. Use the "match" attribute with the value "whole-label" to identify a rule applicable to the entire label. For other rules, use "from-start", "anywhere" and "to-end" to define rules that need to match in specific positions of the label. Certain parameterized context rules (see Section 6.3.2) have a "match" attribute value of "context". The default is "anywhere".

A "match" attribute present in the definition of a rule is ignored if the rule is referenced by name inside another rule. An anonymous rule may not have a "match" attribute.

6.3.1.7.  The name attribute

Rules and classes may be named using a "name" attribute and can be nested either directly or, if named, by reference.

Here is an example of a rule requiring that all labels be letters (optionally followed by combining marks) and possibly digits. The example shows rules and classes referenced by name.

6.3.1.8.  Example rule from IDNA2008

This section shows an example of the whole label evaluation rule from [RFC5892] forbidding the mixture of the Arabic-Indic and extended Arabic-Indic digits in the same label.

The preceding example also demonstrates several instances of the use of anonymous rules for grouping.

6.3.2.  Parameterized Context or When Rule

A special type of rule provides a context for evaluating the validity of a code point or variant mapping. This rule is invoked by the "when" attribute described in Section 5.2.5. For a context rule, the "match" attribute is normally "context".

Such "when rules" contain a special placeholder, represented by a "match" element (not to be confused with the "match" attribute). When evaluated, the "match" element is replaced by a literal corresponding to the "cp" attribute of the element for which the rule in its "when" attribute is being evaluated.

For example, the Greek lower numeral sign is invalid if not immediately preceding a character in the Greek script. This is most naturally addressed with a when rule using look-ahead:

...

In evaluating this rule, the "match" element is treated as if it were replaced by a literal for the Greek lower numeral sign.

The action implied by a context rule is always a disposition of "invalid" if the when rule is not matched. Unlike other rules, these rules may not be associated with arbitrary actions via "action" elements.

6.3.2.1.  The look-behind and look-ahead elements

Context rules use the "look-behind" and "look-ahead" elements to define context before and after the code point sequence matched by the "match" element. If the "match" element is omitted, neither the "look-behind" nor the "look-ahead" element may be present.

Here is an example of a rule that defines an "initial" context for an Arabic code point:

A when rule contains any combination of "look-behind", "match" and "look-ahead" elements, in that order. Each of these elements occurs at most once, and none has a "count" attribute. If a context rule contains a "look-ahead" or "look-behind" element, it MUST contain a "match" element. If a "match" element is present, the rule MUST have a "match" attribute with a value of "context".

6.3.2.2.  Omitting the match element

If the "match" element is omitted, the evaluation of the context rule is not tied to the position of the code point or sequence associated with the "when" attribute.

Katakana middle dot is invalid in any label not containing at least one Japanese character anywhere in the label. Because this requirement is independent of the position of the middle dot, the rule does not require a "match" element and the "match" attribute is "anywhere".

The Katakana middle dot is used only with Han, Katakana or Hiragana. The "when" rule requires that at least one code point in the label is in one of these scripts. (Note that the Katakana middle dot itself is of script Common.)

6.4.  Action elements

The purpose of a rule is to trigger a specific action. Often, the action simply results in blocking a label that does not match a rule. An example of an action invalidating a label:

An action may contain either a "match" or a "not-match" attribute, but not both.
Because rules may be compound rules that contain other rules, only a single rule may be named as the value of the "match" or "not-match" attribute.

An action may also contain one of a set of optional attributes that match against the variant dispositions, taken from the "disposition" attribute of any "var" element used in generating the variant label being evaluated.

Assuming all variants have been given suitable "disposition" attributes of "block" or "allocate", and that a rule is defined matching labels consisting entirely of code points tagged as "preferred", the following actions evaluate the disposition for the variant label:

The first action matches any variant label for which at least one of the code point variants carries the disposition "block". The second matches any variant label for which all of the code point variants carry the disposition "allocate". Neither action matches a label that is not a variant label. If necessary, repeat an action so that it applies to an ordinary label:

6.4.1.  Recommended Disposition Values

The precise nature of the policy action taken in response to a disposition, and the names of the corresponding "disposition" attribute values, are only partially defined here. It is strongly RECOMMENDED to use the following values only in their conventional sense.

invalid  The resulting string is not a valid label. This disposition may be assigned implicitly, see Section 6.4.3.

block  The resulting string is a valid label, but should be blocked from registration. This would typically apply to a derived variant that has no practical use, such as a confusingly similar but undesirable variant.

allocate  The resulting string should be reserved for use by the same operator as the origin string, but not automatically allocated for use.

activate  The resulting string should be activated for use.
(This is the typical default action if no tagging is used, and is known as a "preferred" variant in [RFC3743].)

6.4.2.  Precedence

Actions are applied in the order of their appearance in the file. This defines their relative precedence. The first action for which the rule is matched or not matched, as required, for a particular label defines the disposition for that label.

The conventional order of precedence for the actions defined here is "invalid", "block", "allocate", "activate". To define a different order of precedence, or when additional actions are defined, list the actions in the appropriate order.

6.4.3.  Implied actions

The context, or "when", rules carry an implied action with a disposition of "invalid". These rules are evaluated at the time a label's code points and variants are checked for validity (see Section 8), in other words, before any whole-label evaluation rules and with higher precedence.

The context rules for variant mappings are evaluated when variants are generated and/or when variant tables are made symmetric and transitive. They have an implied action with a disposition of "invalid", which means that a putative variant mapping does not exist in the given context. Note that such a non-existent variant mapping is different from a blocked variant, which is a variant code point mapping that exists, but results in a label that may not be allocated.

Variant mappings may be given a "disposition" attribute. An implied action relates these to the disposition for the entire variant label. For example, a variant label in which any variant code point is a blocked code point variant is blocked. The default order of precedence for evaluating dispositions is as given above. The default precedence applies if no actions are defined that match specific variant dispositions.
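
The precedence behaviour described in Sections 6.4.2 and 6.4.3 can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the tuple layout for actions, the function names, and the "ascii-lower" rule are hypothetical, and rules are modelled as plain Python predicates rather than parsed from XML.

```python
DEFAULT_PRECEDENCE = ["invalid", "block", "allocate", "activate"]


def first_matching_action(label, actions, rules):
    """Return the disposition of the first action whose condition holds.

    `actions` is an ordered list of (attribute, rule_name, disposition)
    tuples, where attribute is either "match" or "not-match".
    """
    for attribute, rule_name, disposition in actions:
        matched = rules[rule_name](label)
        if matched == (attribute == "match"):
            return disposition
    return None


def implied_disposition(variant_dispositions, precedence=DEFAULT_PRECEDENCE):
    """Return the most restrictive recorded disposition, or None.

    Dispositions that are not listed in `precedence` are ignored.
    """
    for disposition in precedence:
        if disposition in variant_dispositions:
            return disposition
    return None


# Hypothetical rule: the label consists only of lowercase ASCII letters.
rules = {
    "ascii-lower": lambda label: (
        label.isascii() and label.isalpha() and label.islower()
    ),
}

# Order matters: the first action whose condition holds wins.
actions = [
    ("not-match", "ascii-lower", "invalid"),
    ("match", "ascii-lower", "activate"),
]

print(first_matching_action("abc", actions, rules))
print(first_matching_action("ab1", actions, rules))
print(implied_disposition(["allocate", "block"]))
```

Keeping the actions in an ordered list mirrors the statement that order of appearance defines precedence. The default fallback simply scans the conventional order from most to least restrictive.
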

7.  Example table

The following presents a sample XML LGR that uses most of the elements and attributes defined in this specification in a somewhat typical context.

[XML example not reproduced. Surviving text content: 1; 2010-01-01; sv; example; "Swedish examples institute."; The Unicode Standard 6.3; RFC 5892; Big-5: Computer Chinese Glyph and Character Code Mapping Table, Technical Report C-26, 1984; 006E 0070-0078]

8.  Processing a label against an LGR

8.1.  Determining eligibility for a label

In order to use a table to test a specific domain label for membership in the LGR, a consumer of the LGR must iterate through each code point within a given U-label, and test that each code point is a member of the LGR. If any code point is not a member of the LGR, the label shall be deemed not eligible in accordance with the table.

A code point is deemed a member of the table when it is listed with the "char" element, and all necessary conditions listed in "when" or "not-when" attributes are satisfied.

8.2.  Determining variants for a label

For a given eligible label, the set of variants is deemed to be each possible permutation of "var" elements, whereby all "when" and "not-when" attributes are satisfied for each "var" element in the given permutation and all applicable whole label evaluation rules are satisfied, as follows:

o  Create each possible permutation of a label, by substituting each code point or code point sequence in turn by any defined variant mapping.

o  Apply variant mappings with "when" or "not-when" attributes only if the conditions are satisfied.

o  Record each of the "disposition" values on the variant mappings used in creating a given variant label.

o  Evaluate each variant against any actions for which the disposition is "invalid", and remove any that satisfy the conditions.

8.3.  Determining a disposition for a label or variant label

For a given label, the disposition is determined by evaluating, in order of their appearance, all actions for which the label or variant label satisfies the conditions.

o  For any label, the "disposition" value applies of the first action for which the label matches or does not match the whole label evaluation rule given in the "match" or "not-match" attribute for that action.

o  For any variant label, the "disposition" value applies of the first action for which the label matches or does not match the whole label evaluation rule given in the "match" or "not-match" attribute, and for which any or all of the recorded variant dispositions match the conditions for that action.

o  For any remaining variant label, assign the variant label the disposition matching the most restrictive disposition recorded for any of its variants. The order from most restrictive to least is "invalid", "block", "allocate", "activate".
o  Variant dispositions outside the predefined default set, and for which no action is defined, are ignored.

9.  Conversion between other formats

[RFC3743] and [RFC4290] each provide a grammar for IDN tables. These formats are unable to fully cater for the increased requirements of contemporary IDN variant policies.

This specification is a superset of the functionality provided by these IDN table formats, thus any table expressed in those formats can be expressed in this format. Automated conversion can be conducted between tables conformant with the grammar specified in each document.

10.  IANA Considerations

This document does not specify any IANA actions.

11.  Security Considerations

There are no security considerations for this memo.

12.  References

[LGR-PROCEDURE]  Internet Corporation for Assigned Names and Numbers, "Procedure to Develop and Maintain the Label Generation Rules for the Root Zone in Respect of IDNA Labels".

[RFC3339]  Klyne, G., Ed. and C. Newman, "Date and Time on the Internet: Timestamps", RFC 3339, July 2002.

[RFC3743]  Konishi, K., Huang, K., Qian, H., and Y. Ko, "Joint Engineering Team (JET) Guidelines for Internationalized Domain Names (IDN) Registration and Administration for Chinese, Japanese, and Korean", RFC 3743, April 2004.

[RFC4290]  Klensin, J., "Suggested Practices for Registration of Internationalized Domain Names (IDN)", RFC 4290, December 2005.

[RFC5564]  El-Sherbiny, A., Farah, M., Oueichek, I., and A. Al-Zoman, "Linguistic Guidelines for the Use of the Arabic Language in Internet Domains", RFC 5564, February 2010.

[RFC5646]  Phillips, A. and M. Davis, "Tags for Identifying Languages", BCP 47, RFC 5646, September 2009.

[RFC5892]  Faltstrom, P., "The Unicode Code Points and Internationalized Domain Names for Applications (IDNA)", RFC 5892, August 2010.

[UAX42]  Unicode Consortium, "Unicode Character Database in XML".

[XML]  "Extensible Markup Language (XML) 1.0".

Appendix A.  RelaxNG Schema

[TODO: this needs to be updated to reflect additions to the syntax.]

Appendix B.  Acknowledgements

This format builds upon the work on documenting IDN tables by many different registry operators. Notably, a comprehensive language table for Chinese, Japanese and Korean was developed by the "Joint Engineering Team" [RFC3743] that is the basis of many registry policies; and a set of guidelines for Arabic script registrations [RFC5564] was published by the Arabic-language community.

Contributions that have shaped this document have been provided by Francisco Arias, Mark Davis, Nicholas Ostler, Thomas Roessler, Steve Sheng, Michel Suignard, John Yunker and Andrew Sullivan.

Appendix C.  Editorial Notes

This appendix is to be removed prior to final publication.

C.1.  Known Issues and Future Work

o  A method of specifying the origin URI for a table, and an expiration or refresh policy, as metadata may be a useful way to declare how the table will be updated.

C.2.  Change History

-00  Initial draft.

-01  Add an XML Namespace, and fix other XML nits. Add support for sequences of code points. Improve on consistently using Unicode nomenclature.

-02  Add support for validity periods.

-03  Incorporate requirements from the Label Generation Ruleset Procedure for the DNS Root Zone. These requirements include a detailed grammar for specifying whole-label variants, and the ability to explicitly declare the actions associated with a specific variant. The document also consistently applies the term "Label Generation Ruleset", rather than "IDN table", to reflect the policy term now being used to describe these.

-04  Support reference information per [RFC3743]. Update description in response to feedback. Extend the context rules to "char" elements and allow for inverse matching ("not-when"). Extend the description of label processing and implied actions, and allow for actions that reference disposition attributes on any or all variant mappings used in the generation of a variant label.

Authors' Addresses

Kim Davies
Internet Corporation for Assigned Names and Numbers
12025 Waterfront Drive
Los Angeles, CA 90094
US

Phone: +1 310 301 5800
Email: kim.davies@icann.org
URI: http://www.iana.org/

Asmus Freytag
ASMUS Inc.

Email: asmus@unicode.org