Internet Engineering Task Force (IETF)                        M. Koster
Request for Comments: 9309                                    G. Illyes
Category: Standards Track                                     H. Zeller
ISSN: 2070-1721                                              L. Sassman
                                                             Google LLC
                                                         September 2022
                       Robots Exclusion Protocol

Abstract

This document specifies and extends the "Robots Exclusion Protocol"
method originally defined by Martijn Koster in 1994 for service
owners to control how content served by their services may be
accessed, if at all, by automatic clients known as crawlers.
Specifically, it adds definition language for the protocol,
instructions for handling errors, and instructions for caching.
Status of This Memo

This is an Internet Standards Track document.

This document is a product of the Internet Engineering Task Force
(IETF).  It represents the consensus of the IETF community.  It has
received public review and has been approved for publication by the
Internet Engineering Steering Group (IESG).  Further information on
Internet Standards is available in Section 2 of RFC 7841.

Information about the current status of this document, any errata,
and how to provide feedback on it may be obtained at
https://www.rfc-editor.org/info/rfc9309.
Copyright Notice

Copyright (c) 2022 IETF Trust and the persons identified as the
document authors.  All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(https://trustee.ietf.org/license-info) in effect on the date of
publication of this document.  Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document.  Code Components extracted from this document must
include Revised BSD License text as described in Section 4.e of the
Trust Legal Provisions and are provided without warranty as described
in the Revised BSD License.
Table of Contents

   1.  Introduction
     1.1.  Requirements Language
   2.  Specification
     2.1.  Protocol Definition
     2.2.  Formal Syntax
       2.2.1.  The User-Agent Line
       2.2.2.  The "Allow" and "Disallow" Lines
       2.2.3.  Special Characters
       2.2.4.  Other Records
     2.3.  Access Method
       2.3.1.  Access Results
         2.3.1.1.  Successful Access
         2.3.1.2.  Redirects
         2.3.1.3.  "Unavailable" Status
         2.3.1.4.  "Unreachable" Status
         2.3.1.5.  Parsing Errors
     2.4.  Caching
     2.5.  Limits
   3.  Security Considerations
   4.  IANA Considerations
   5.  Examples
     5.1.  Simple Example
     5.2.  Longest Match
   6.  References
     6.1.  Normative References
     6.2.  Informative References
   Authors' Addresses
1.  Introduction

This document applies to services that provide resources that clients
can access through URIs as defined in [RFC3986].  For example, in the
context of HTTP, a browser is a client that displays the content of a
web page.

Crawlers are automated clients.  Search engines, for instance, have
crawlers to recursively traverse links for indexing as defined in
[RFC8288].

It may be inconvenient for service owners if crawlers visit the
entirety of their URI space.  This document specifies the rules
originally defined by the "Robots Exclusion Protocol" [ROBOTSTXT]
that crawlers are requested to honor when accessing URIs.

These rules are not a form of access authorization.

1.1.  Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
"OPTIONAL" in this document are to be interpreted as described in
BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
capitals, as shown here.
2.  Specification

2.1.  Protocol Definition

The protocol language consists of rule(s) and group(s) that the
service makes available in a file named "robots.txt" as described in
Section 2.3:

Rule:  A line with a key-value pair that defines how a crawler may
   access URIs.  See Section 2.2.2.

Group:  One or more user-agent lines that are followed by one or more
   rules.  The group is terminated by a user-agent line or end of
   file.  See Section 2.2.1.  The last group may have no rules, which
   means it implicitly allows everything.
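
As a non-normative illustration, the sketches added throughout this
document use a small Python representation of parsed rules and
groups.  The shape below (the name example_groups is illustrative and
not part of the protocol) mirrors the two definitions above.

   # Illustrative in-memory shape used by the sketches in this document:
   #   groups      : list of (user_agents, rules) pairs
   #   user_agents : list of lowercased product tokens, e.g. ["examplebot"]
   #   rules       : list of (is_allow, path_pattern) pairs; an empty list
   #                 means the group implicitly allows everything
   example_groups = [
       (["examplebot"], [(False, "/foo"), (False, "/bar")]),
       (["*"], []),   # a last group with no rules: allows everything
   ]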
2.2.  Formal Syntax

Below is an Augmented Backus-Naur Form (ABNF) description, as
described in [RFC5234].

 robotstxt = *(group / emptyline)
 group = startgroupline                ; We start with a user-agent
                                       ; line
        *(startgroupline / emptyline)  ; ... and possibly more
                                       ; user-agent lines
        *(rule / emptyline)            ; followed by rules relevant
                                       ; for the preceding
                                       ; user-agent lines

 startgroupline = *WS "user-agent" *WS ":" *WS product-token EOL

 rule = *WS ("allow" / "disallow") *WS ":"
       *WS (path-pattern / empty-pattern) EOL

 ; parser implementors: define additional lines you need (for
 ; example, Sitemaps).

 product-token = identifier / "*"
 path-pattern = "/" *UTF8-char-noctl ; valid URI path pattern
 empty-pattern = *WS

 identifier = 1*(%x2D / %x41-5A / %x5F / %x61-7A)
 comment = "#" *(UTF8-char-noctl / WS / "#")
 emptyline = EOL

 EOL = *WS [comment] NL ; end-of-line may have
                        ; optional trailing comment
 NL = %x0D / %x0A / %x0D.0A
 WS = %x20 / %x09

 ; UTF8 derived from RFC 3629, but excluding control characters

 UTF8-char-noctl = UTF8-1-noctl / UTF8-2 / UTF8-3 / UTF8-4
 UTF8-1-noctl = %x21 / %x22 / %x24-7F ; excluding control, space, "#"
 UTF8-2 = %xC2-DF UTF8-tail
 UTF8-3 = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2UTF8-tail /
          %xED %x80-9F UTF8-tail / %xEE-EF 2UTF8-tail
 UTF8-4 = %xF0 %x90-BF 2UTF8-tail / %xF1-F3 3UTF8-tail /
          %xF4 %x80-8F 2UTF8-tail

 UTF8-tail = %x80-BF
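
As a non-normative sketch only, the following Python function parses
a robots.txt payload into the shape introduced in Section 2.1 along
the lines of the grammar above: consecutive user-agent lines open a
group, allow and disallow lines attach rules to it, comments and
unrecognized lines are skipped, and rules that precede the first
user-agent line are ignored.  The function name and the return shape
are assumptions of this sketch, not requirements of the protocol.

   def parse_robotstxt(text: str):
       """Illustrative parser: returns a list of (user_agents, rules)
       pairs, with product tokens lowercased."""
       groups = []                 # list of ([tokens], [(is_allow, pattern)])
       current = None              # group currently being built
       collecting_agents = False   # still reading user-agent lines?
       for raw_line in text.splitlines():
           line = raw_line.split("#", 1)[0].strip()   # drop comment and WS
           if ":" not in line:
               continue                               # empty or malformed line
           key, _, value = line.partition(":")
           key, value = key.strip().lower(), value.strip()
           if key == "user-agent":
               if current is None or not collecting_agents:
                   current = ([], [])                 # start a new group
                   groups.append(current)
                   collecting_agents = True
               current[0].append(value.lower())
           elif key in ("allow", "disallow"):
               if current is not None:                # ignore rules before any group
                   current[1].append((key == "allow", value))
               collecting_agents = False
           # other records (e.g. "sitemap:") are tolerated and ignored here
       return groups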
2.2.1.  The User-Agent Line

Crawlers set their own name, which is called a product token, to find
relevant groups.  The product token MUST contain only uppercase and
lowercase letters ("a-z" and "A-Z"), underscores ("_"), and hyphens
("-").  The product token SHOULD be a substring of the identification
string that the crawler sends to the service.  For example, in the
case of HTTP [RFC9110], the product token SHOULD be a substring in
the User-Agent header.  The identification string SHOULD describe the
purpose of the crawler.  Here's an example of a User-Agent HTTP
request header with a link pointing to a page describing the purpose
of the ExampleBot crawler, which appears as a substring in the
User-Agent HTTP header and as a product token in the robots.txt
user-agent line:

+==========================================+========================+
| User-Agent HTTP header                   | robots.txt user-agent  |
|                                          | line                   |
+==========================================+========================+
| User-Agent: Mozilla/5.0 (compatible;     | user-agent: ExampleBot |
| ExampleBot/0.1;                          |                        |
| https://www.example.com/bot.html)        |                        |
+------------------------------------------+------------------------+

    Figure 1: Example of a User-Agent HTTP header and robots.txt
          user-agent line for the ExampleBot product token

Note that the product token (ExampleBot) is a substring of the
User-Agent HTTP header.
Crawlers MUST use case-insensitive matching to find the group that
matches the product token and then obey the rules of the group.  If
there is more than one group matching the user-agent, the matching
groups' rules MUST be combined into one group and parsed according to
Section 2.2.2.

+========================================+========================+
| Two groups that match the same product | Merged group           |
| token exactly                          |                        |
+========================================+========================+
| user-agent: ExampleBot                 | user-agent: ExampleBot |
| disallow: /foo                         | disallow: /foo         |
| disallow: /bar                         | disallow: /bar         |
|                                        | disallow: /baz         |
| user-agent: ExampleBot                 |                        |
| disallow: /baz                         |                        |
+----------------------------------------+------------------------+

    Figure 2: Example of how to merge two robots.txt groups that
                  match the same product token

If no matching group exists, crawlers MUST obey the group with a
user-agent line with the "*" value, if present.

+==================================+======================+
| Two groups that don't explicitly | Applicable group for |
| match ExampleBot                 | ExampleBot           |
+==================================+======================+
| user-agent: *                    | user-agent: *        |
| disallow: /foo                   | disallow: /foo       |
| disallow: /bar                   | disallow: /bar       |
|                                  |                      |
| user-agent: BazBot               |                      |
| disallow: /baz                   |                      |
+----------------------------------+----------------------+

   Figure 3: Example of no matching groups other than the "*" for
                    the ExampleBot product token

If no group matches the product token and there is no group with a
user-agent line with the "*" value, or no groups are present at all,
no rules apply.
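
A non-normative sketch of this selection logic, assuming groups in
the shape produced by the parsing sketch in Section 2.2 (product
tokens already lowercased); select_rules is an illustrative name:

   def select_rules(groups, product_token: str):
       """Merge the rules of every group whose product token matches
       case-insensitively; fall back to the "*" group if none does.
       An empty result means no rules apply (everything is allowed)."""
       token = product_token.lower()
       matched = [rules for agents, rules in groups if token in agents]
       if not matched:
           matched = [rules for agents, rules in groups if "*" in agents]
       return [rule for rules in matched for rule in rules]

Merging all matching groups before evaluation is what makes the rule
combination shown in Figure 2 fall out naturally.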
2.2.2.  The "Allow" and "Disallow" Lines

These lines indicate whether accessing a URI that matches the
corresponding path is allowed or disallowed.

To evaluate if access to a URI is allowed, a crawler MUST match the
paths in "allow" and "disallow" rules against the URI.  The matching
SHOULD be case sensitive.  The matching MUST start with the first
octet of the path.  The most specific match found MUST be used.  The
most specific match is the match that has the most octets.  Duplicate
rules in a group MAY be deduplicated.  If an "allow" rule and a
"disallow" rule are equivalent, then the "allow" rule SHOULD be used.
If no match is found amongst the rules in a group for a matching
user-agent or there are no rules in the group, the URI is allowed.
The /robots.txt URI is implicitly allowed.

Octets in the URI and robots.txt paths outside the range of the ASCII
coded character set, and those in the reserved range defined by
[RFC3986], MUST be percent-encoded as defined by [RFC3986] prior to
comparison.

If a percent-encoded ASCII octet is encountered in the URI, it MUST
be unencoded prior to comparison, unless it is a reserved character
in the URI as defined by [RFC3986] or the character is outside the
unreserved character range.  The match evaluates positively if and
only if the end of the path from the rule is reached before a
difference in octets is encountered.
For example:

+==================+=======================+=======================+
| Path             | Encoded Path          | Path to Match         |
+==================+=======================+=======================+
| /foo/bar?baz=quz | /foo/bar?baz=quz      | /foo/bar?baz=quz      |
+------------------+-----------------------+-----------------------+
| /foo/bar?baz=    | /foo/bar?baz=         | /foo/bar?baz=         |
| https://foo.bar  | https%3A%2F%2Ffoo.bar | https%3A%2F%2Ffoo.bar |
+------------------+-----------------------+-----------------------+
| /foo/bar/        | /foo/bar/%E3%83%84    | /foo/bar/%E3%83%84    |
| U+E38384         |                       |                       |
+------------------+-----------------------+-----------------------+
| /foo/            | /foo/bar/%E3%83%84    | /foo/bar/%E3%83%84    |
| bar/%E3%83%84    |                       |                       |
+------------------+-----------------------+-----------------------+
| /foo/            | /foo/bar/%62%61%7A    | /foo/bar/baz          |
| bar/%62%61%7A    |                       |                       |
+------------------+-----------------------+-----------------------+

   Figure 4: Examples of matching percent-encoded URI components
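
A rough, non-normative sketch of this normalization in Python:
non-ASCII octets are percent-encoded and percent-escapes of
unreserved characters are decoded, so that rule paths and URI paths
compare consistently.  The handling of reserved characters such as
"?" and "=" is deliberately left out, and the function name
normalize_for_match is an assumption of this sketch.

   import re
   from urllib.parse import quote

   _UNRESERVED = re.compile(r"[A-Za-z0-9._~-]")

   def normalize_for_match(path: str) -> str:
       # Percent-encode non-ASCII octets as UTF-8; characters listed in
       # `safe` (reserved characters plus "%", so existing escapes
       # survive) are left untouched.
       encoded = quote(path, safe="%/?#[]@!$&'()*+,;=:")
       # Decode only those %XX escapes that stand for unreserved characters.
       def _decode(match):
           char = chr(int(match.group(1), 16))
           return char if _UNRESERVED.fullmatch(char) else match.group(0)
       return re.sub(r"%([0-9A-Fa-f]{2})", _decode, encoded)

   # Rows from the table above (illustrative):
   # normalize_for_match("/foo/bar/%62%61%7A") == "/foo/bar/baz"
   # normalize_for_match("/foo/bar/\u30c4")    == "/foo/bar/%E3%83%84"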
The crawler SHOULD ignore "disallow" and "allow" rules that are not
in any group (for example, any rule that precedes the first
user-agent line).

Implementors MAY bridge encoding mismatches if they detect that the
robots.txt file is not UTF-8 encoded.
2.2.3.  Special Characters

Crawlers MUST support the following special characters:

+===========+===================+==============================+
| Character | Description       | Example                      |
+===========+===================+==============================+
| #         | Designates a line | allow: / # comment in line   |
|           | comment.          |                              |
|           |                   | # comment on its own line    |
+-----------+-------------------+------------------------------+
| $         | Designates the    | allow: /this/path/exactly$   |
|           | end of the match  |                              |
|           | pattern.          |                              |
+-----------+-------------------+------------------------------+
| *         | Designates 0 or   | allow: /this/*/exactly       |
|           | more instances of |                              |
|           | any character.    |                              |
+-----------+-------------------+------------------------------+

     Figure 5: List of special characters in robots.txt files
If crawlers match special characters verbatim in the URI, crawlers
SHOULD use "%" encoding.  For example:

+============================+====================================+
| Percent-encoded Pattern    | URI                                |
+============================+====================================+
| /path/file-with-a-%2A.html | https://www.example.com/path/      |
|                            | file-with-a-*.html                 |
+----------------------------+------------------------------------+
| /path/foo-%24              | https://www.example.com/path/foo-$ |
+----------------------------+------------------------------------+

              Figure 6: Example of percent-encoding
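
Putting the special characters together with the longest-match
requirement of Section 2.2.2, here is a non-normative matching
sketch.  Patterns and paths are assumed to be already normalized as
in the previous sketch; the rule's pattern length is used as a simple
stand-in for "most octets", and ties between otherwise equal allow
and disallow matches go to allow.  The function names are
illustrative.

   import re

   def _pattern_to_regex(pattern: str):
       """"*" matches any run of characters, and a trailing "$" anchors
       the end of the pattern; everything else is matched literally,
       starting at the first octet of the path."""
       anchored = pattern.endswith("$")
       body = pattern[:-1] if anchored else pattern
       regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in body)
       return re.compile(regex + ("$" if anchored else ""))

   def is_allowed(rules, path: str) -> bool:
       """rules: list of (is_allow, pattern) pairs from the matched group."""
       if path == "/robots.txt":
           return True                       # implicitly allowed
       best_len, allowed = -1, True          # no matching rule: allowed
       for is_allow, pattern in rules:
           if pattern and _pattern_to_regex(pattern).match(path):
               length = len(pattern)
               if length > best_len or (length == best_len and is_allow):
                   best_len, allowed = length, is_allow
       return allowed

   # Example: a "disallow: *.gif$" rule blocks any path ending in ".gif":
   # is_allowed([(False, "*.gif$")], "/a/b.gif") -> False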
2.2.4.  Other Records

Crawlers MAY interpret other records that are not part of the
robots.txt protocol -- for example, "Sitemaps" [SITEMAPS].  Crawlers
MAY be lenient when interpreting other records.  For example,
crawlers may accept common misspellings of the record.

Parsing of other records MUST NOT interfere with the parsing of
explicitly defined records in Section 2.  For example, a "Sitemaps"
record MUST NOT terminate a group.
2.3.  Access Method

The rules MUST be accessible in a file named "/robots.txt" (all
lowercase) in the top-level path of the service.  The file MUST be
UTF-8 encoded (as defined in [RFC3629]) and Internet Media Type
"text/plain" (as defined in [RFC2046]).

As per [RFC3986], the URI of the robots.txt file is:

   "scheme:[//authority]/robots.txt"

For example, in the context of HTTP or FTP, the URI is:

   https://www.example.com/robots.txt

   ftp://ftp.example.com/robots.txt
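
As a small non-normative illustration, a crawler might derive the
robots.txt URI from a target URI like this (the function name is an
assumption of the sketch):

   from urllib.parse import urlsplit, urlunsplit

   def robots_txt_uri(target_uri: str) -> str:
       """Keep the scheme and authority of the target URI and replace
       the rest with the fixed top-level path /robots.txt."""
       parts = urlsplit(target_uri)
       return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

   # robots_txt_uri("https://www.example.com/a/b?c=d")
   #   -> "https://www.example.com/robots.txt"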
2.3.1.  Access Results

2.3.1.1.  Successful Access

If the crawler successfully downloads the robots.txt file, the
crawler MUST follow the parseable rules.
2.3.1.2.  Redirects

It's possible that a server responds to a robots.txt fetch request
with a redirect, such as HTTP 301 or HTTP 302 in the case of HTTP.
The crawlers SHOULD follow at least five consecutive redirects, even
across authorities (for example, hosts in the case of HTTP).

If a robots.txt file is reached within five consecutive redirects,
the robots.txt file MUST be fetched, parsed, and its rules followed
in the context of the initial authority.

If there are more than five consecutive redirects, crawlers MAY
assume that the robots.txt file is unavailable.
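
A non-normative sketch of the fetch step under these rules, written
against the third-party Python requests library; the function name,
the timeout, and the byte cap (taken from the limit in Section 2.5)
are assumptions of the sketch:

   from urllib.parse import urljoin

   import requests

   MAX_REDIRECTS = 5
   MAX_BYTES = 500 * 1024                 # parsing limit, see Section 2.5

   def fetch_robots_txt(uri: str):
       """Follow at most five consecutive redirects (possibly across
       authorities) and return (status_code, body) of the final response."""
       for _ in range(MAX_REDIRECTS + 1):
           resp = requests.get(uri, allow_redirects=False, timeout=30)
           if resp.is_redirect:
               uri = urljoin(uri, resp.headers["Location"])  # may change host
               continue
           return resp.status_code, resp.content[:MAX_BYTES]
       # More than five consecutive redirects: the file MAY be treated
       # as unavailable (Section 2.3.1.3).
       raise requests.TooManyRedirects("robots.txt redirect chain too long")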
2.3.1.3.  "Unavailable" Status

"Unavailable" means the crawler tries to fetch the robots.txt file
and the server responds with status codes indicating that the
resource in question is unavailable.  For example, in the context of
HTTP, such status codes are in the 400-499 range.

If a server status code indicates that the robots.txt file is
unavailable to the crawler, then the crawler MAY access any resources
on the server.

2.3.1.4.  "Unreachable" Status

If the robots.txt file is unreachable due to server or network
errors, this means the robots.txt file is undefined and the crawler
MUST assume complete disallow.  For example, in the context of HTTP,
server errors are identified by status codes in the 500-599 range.

If the robots.txt file is undefined for a reasonably long period of
time (for example, 30 days), crawlers MAY assume that the robots.txt
file is unavailable as defined in Section 2.3.1.3 or continue to use
a cached copy.
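
The outcomes in Sections 2.3.1.1 through 2.3.1.4 can be folded into
one small, non-normative decision function.  It reuses
parse_robotstxt from the sketch in Section 2.2 and assumes that a
status_code of None stands for a network-level failure:

   def effective_groups(status_code, body):
       """Map a fetch outcome to the groups to obey: a network error or
       a 5xx status means complete disallow, a 4xx status ("unavailable")
       means no restrictions, and a successful response is parsed."""
       if status_code is None or 500 <= status_code < 600:
           return [(["*"], [(False, "/")])]   # disallow everything
       if 400 <= status_code < 500:
           return []                          # no groups: allow everything
       return parse_robotstxt(body.decode("utf-8", errors="replace"))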
2.3.1.5.  Parsing Errors

Crawlers MUST try to parse each line of the robots.txt file.
Crawlers MUST use the parseable rules.

2.4.  Caching

Crawlers MAY cache the fetched robots.txt file's contents.  Crawlers
MAY use standard cache control as defined in [RFC9111].  Crawlers
SHOULD NOT use the cached version for more than 24 hours, unless the
robots.txt file is unreachable.
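
A non-normative sketch of such a cache: entries younger than 24 hours
are reused, and a stale copy is kept in service when a refetch fails
at the network level (the unreachable case above).  The refetch
callable and the OSError convention are assumptions of the sketch.

   import time

   CACHE_TTL = 24 * 60 * 60                  # 24 hours, in seconds

   _cache = {}                               # authority -> (fetched_at, groups)

   def cached_groups(authority: str, refetch):
       """Return cached groups while they are fresh; otherwise refetch.
       refetch() returns new groups or raises OSError on network failure."""
       entry = _cache.get(authority)
       if entry and time.time() - entry[0] < CACHE_TTL:
           return entry[1]
       try:
           groups = refetch()
       except OSError:
           if entry:
               return entry[1]               # unreachable: keep the stale copy
           raise
       _cache[authority] = (time.time(), groups)
       return groups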
2.5.  Limits

Crawlers SHOULD impose a parsing limit to protect their systems; see
Section 3.  The parsing limit MUST be at least 500 kibibytes [KiB].
3.  Security Considerations

The Robots Exclusion Protocol is not a substitute for valid content
security measures.  Listing paths in the robots.txt file exposes them
publicly and thus makes the paths discoverable.  To control access to
the URI paths in a robots.txt file, users of the protocol should
employ a valid security measure relevant to the application layer on
which the robots.txt file is served -- for example, in the case of
HTTP, HTTP Authentication as defined in [RFC9110].

To protect against attacks against their system, implementors of
robots.txt parsing and matching logic should take the following
considerations into account:

Memory management:  Section 2.5 defines the lower limit of bytes that
   must be processed, which inherently also protects the parser from
   out-of-memory scenarios.

Invalid characters:  Section 2.2 defines a set of characters that
   parsers and matchers can expect in robots.txt files.  Out-of-bound
   characters should be rejected as invalid, which limits the
   available attack vectors that attempt to compromise the system.

Untrusted content:  Implementors should treat the content of a
   robots.txt file as untrusted content, as defined by the
   specification of the application layer used.  For example, in the
   context of HTTP, implementors should follow the Security
   Considerations section of [RFC9110].
4.  IANA Considerations

This document has no IANA actions.
5.  Examples

5.1.  Simple Example

The following example shows:

*:  A group that's relevant to all user agents that don't have an
   explicitly defined matching group.  It allows access to the URLs
   with the /publications/ path prefix, and it restricts access to
   the URLs with the /example/ path prefix and to all URLs with a
   .gif suffix.  The "*" character designates any character,
   including the otherwise-required forward slash; see Section 2.2.

foobot:  A regular case.  A single user agent followed by rules.  The
   crawler only has access to two URL path prefixes on the site --
   /example/page.html and /example/allowed.gif.  The rules of the
   group are missing the optional space character, which is
   acceptable as defined in Section 2.2.

barbot and bazbot:  A group that's relevant for more than one user
   agent.  The crawlers are not allowed to access the URLs with the
   /example/page.html path prefix but otherwise have unrestricted
   access to the rest of the URLs on the site.

quxbot:  An empty group at the end of the file.  The crawler has
   unrestricted access to the URLs on the site.
   User-Agent: *
   Disallow: *.gif$
   Disallow: /example/
   Allow: /publications/

   User-Agent: foobot
   Disallow:/
   Allow:/example/page.html
   Allow:/example/allowed.gif

   User-Agent: barbot
   User-Agent: bazbot
   Disallow: /example/page.html

   User-Agent: quxbot

   End of robots.txt file

         Figure 7: A robots.txt file showing a simple example

5.2.  Longest Match

The following example shows that in the case of two rules, the
longest one MUST be used for matching.  In the following case,
/example/page/disallowed.gif MUST be used for the URI
example.com/example/page/disallowed.gif.
   User-Agent: foobot
   Allow: /example/page/
   Disallow: /example/page/disallowed.gif
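
Using the matching sketch from Section 2.2.3, the longest-match
outcome above can be checked directly (non-normative):

   rules = [(True, "/example/page/"),
            (False, "/example/page/disallowed.gif")]

   # The disallow pattern is the longer match (28 octets vs. 14), so it wins:
   assert is_allowed(rules, "/example/page/disallowed.gif") is False

   # Other paths under /example/page/ match only the shorter allow rule:
   assert is_allowed(rules, "/example/page/other.gif") is True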
6.  References

6.1.  Normative References
[RFC2046]  Freed, N. and N. Borenstein, "Multipurpose Internet Mail
           Extensions (MIME) Part Two: Media Types", RFC 2046,
           DOI 10.17487/RFC2046, November 1996,
           <https://www.rfc-editor.org/info/rfc2046>.

[RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
           Requirement Levels", BCP 14, RFC 2119,
           DOI 10.17487/RFC2119, March 1997,
           <https://www.rfc-editor.org/info/rfc2119>.
[RFC3629]  Yergeau, F., "UTF-8, a transformation format of ISO
           10646", STD 63, RFC 3629, DOI 10.17487/RFC3629, November
           2003, <https://www.rfc-editor.org/info/rfc3629>.

[RFC3986]  Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
           Resource Identifier (URI): Generic Syntax", STD 66,
           RFC 3986, DOI 10.17487/RFC3986, January 2005,
           <https://www.rfc-editor.org/info/rfc3986>.

[RFC5234]  Crocker, D., Ed. and P. Overell, "Augmented BNF for Syntax
           Specifications: ABNF", STD 68, RFC 5234,
           DOI 10.17487/RFC5234, January 2008,
           <https://www.rfc-editor.org/info/rfc5234>.

[RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
           2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
           May 2017, <https://www.rfc-editor.org/info/rfc8174>.

[RFC8288]  Nottingham, M., "Web Linking", RFC 8288,
           DOI 10.17487/RFC8288, October 2017,
           <https://www.rfc-editor.org/info/rfc8288>.

[RFC9110]  Fielding, R., Ed., Nottingham, M., Ed., and J. Reschke,
           Ed., "HTTP Semantics", STD 97, RFC 9110,
           DOI 10.17487/RFC9110, June 2022,
           <https://www.rfc-editor.org/info/rfc9110>.

[RFC9111]  Fielding, R., Ed., Nottingham, M., Ed., and J. Reschke,
           Ed., "HTTP Caching", STD 98, RFC 9111,
           DOI 10.17487/RFC9111, June 2022,
           <https://www.rfc-editor.org/info/rfc9111>.
6.2.  Informative References
[KiB]      "Kibibyte", Simple English Wikipedia, the free
           encyclopedia, 17 September 2020,
           <https://simple.wikipedia.org/wiki/Kibibyte>.

[ROBOTSTXT]
           "The Web Robots Pages (including /robots.txt)", 2007,
           <https://www.robotstxt.org/>.

[SITEMAPS] "What are Sitemaps? (Sitemap protocol)", April 2020,
           <https://www.sitemaps.org/index.html>.
Authors' Addresses

   Martijn Koster
   Suton Lane
   Wymondham, Norfolk
   NR18 9JG
   United Kingdom
   Email: m.koster@greenhills.co.uk

   Gary Illyes
   Google LLC
   Brandschenkestrasse 110
   CH-8002 Zürich
   Switzerland
   Email: garyillyes@google.com

   Henner Zeller
   Google LLC
   1600 Amphitheatre Pkwy
   Mountain View, CA 94043
   United States of America
   Email: henner@google.com

   Lizzi Sassman
   Google LLC
   Brandschenkestrasse 110
   CH-8002 Zürich
   Switzerland
   Email: lizzi@google.com