<?xml version="1.0"encoding="US-ASCII"?>encoding="UTF-8"?> <!DOCTYPE rfcSYSTEM "rfc2629.dtd"[ <!ENTITYRFC1945 PUBLIC "" "http://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.1945.xml">nbsp " "> <!ENTITYRFC2046 PUBLIC "" "http://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.2046.xml">zwsp "​"> <!ENTITYRFC2119 PUBLIC "" "http://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.2119.xml">nbhy "‑"> <!ENTITYRFC3629 PUBLIC "" "http://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.3629.xml"> <!ENTITY RFC3986 PUBLIC "" "http://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.3986.xml"> <!ENTITY RFC5234 PUBLIC "" "http://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.5234.xml"> <!ENTITY RFC8174 PUBLIC "" "http://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.8174.xml"> <!ENTITY RFC8288 PUBLIC "" "http://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.8288.xml"> <!ENTITY RFC9110 PUBLIC "" "http://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.9110.xml"> <!ENTITY RFC9111 PUBLIC "" "http://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.9111.xml">wj "⁠"> ]> <rfc xmlns:xi="http://www.w3.org/2001/XInclude" ipr="trust200902"category="std"docName="draft-koster-rep-12"> <?xml-stylesheet type="text/xsl" href="rfc2629.xslt" ?> <?rfc toc="yes" ?> <?rfc tocdepth="4" ?> <?rfc symrefs="yes" ?> <?rfc sortrefs="yes"?> <?rfc compact="yes" ?> <?rfc subcompact="no"?>number="9309" obsoletes="" updates="" submissionType="IETF" category="std" consensus="true" xml:lang="en" tocInclude="true" tocDepth="4" symRefs="true" sortRefs="true" version="3"> <!-- xml2rfc v2v3 conversion 3.13.0 --> <front> <titleabbrev="REP">Robotsabbrev="Robots Exclusion Protocol (REP)">Robots Exclusion Protocol</title> <seriesInfo name="RFC" value="9309"/> <author initials="M." surname="Koster" fullname="MartijnKoster" role="editor"> <organization>Stalworthy Computing, Ltd.</organization>Koster"> <address> <postal> <extaddr>Stalworthy Manor Farm</extaddr> <street>Suton Lane</street> <city>Wymondham, Norfolk</city> <code>NR18 9JG</code> <country>United Kingdom</country> </postal> <email>m.koster@greenhills.co.uk</email> </address> </author> <author initials="G." surname="Illyes" fullname="GaryIllyes" role="editor">Illyes"> <organization>GoogleLLC.</organization>LLC</organization> <address> <postal> <street>Brandschenkestrasse 110</street><city>Zurich</city><city>Zürich</city> <code>8002</code> <country>Switzerland</country> </postal> <email>garyillyes@google.com</email> </address> </author> <author initials="H." surname="Zeller" fullname="HennerZeller" role="editor">Zeller"> <organization>GoogleLLC.</organization>LLC</organization> <address> <postal> <street>1600 Amphitheatre Pkwy</street> <city>MountainView, CA</city>View</city> <region>CA</region> <code>94043</code><country>USA</country><country>United States of America</country> </postal> <email>henner@google.com</email> </address> </author> <author initials="L." 
surname="Sassman" fullname="LizziSassman" role="editor">Sassman"> <organization>GoogleLLC.</organization>LLC</organization> <address> <postal> <street>Brandschenkestrasse 110</street><city>Zurich</city><city>Zürich</city> <code>8002</code> <country>Switzerland</country> </postal> <email>lizzi@google.com</email> </address> </author> <date year="2022"month="July" day="06"/> <area>General</area> <keyword>internet-drafts</keyword>month="September"/> <keyword>robot</keyword> <keyword>crawler</keyword> <keyword>robots.txt</keyword> <abstract> <t> This document specifies and extends the"Robots"Robots ExclusionProtocol"Protocol" method originally defined by Martijn Koster in19961994 for service owners to control how content served by their services may be accessed, if at all, by automatic clients known as crawlers. Specifically, it adds definition language for theprotocol andprotocol, instructions for handlingerrorserrors, and instructions for caching. </t> </abstract> </front> <middle> <section anchor="introduction"title="Introduction">numbered="true" toc="default"> <name>Introduction</name> <t> This document applies to services that provide resources that clients can access through URIs as defined in <xreftarget="RFC3986"/>.target="RFC3986" format="default"/>. For example, in the context of HTTP, a browser is a client that displays the content of a web page. </t> <t> Crawlers are automated clients. Searchenginesengines, forinstanceinstance, have crawlers to recursively traverse links for indexing as defined in <xreftarget="RFC8288"/>.target="RFC8288" format="default"/>. </t> <t> It may be inconvenient for service owners if crawlers visit the entirety of their URI space. This document specifies the rules originally defined by the"Robots"Robots ExclusionProtocol"Protocol" <xreftarget="ROBOTSTXT"/>target="ROBOTSTXT" format="default"/> that crawlers are requested to honor when accessing URIs. </t> <t> These rules are not a form of access authorization. </t> <section anchor="requirements-language"title="Requirements Language"> <t> Thenumbered="true" toc="default"> <name>Requirements Language</name> <t>The key words"<bcp14>MUST</bcp14>", "<bcp14>MUST NOT</bcp14>", "<bcp14>REQUIRED</bcp14>", "<bcp14>SHALL</bcp14>", "<bcp14>SHALL NOT</bcp14>", "<bcp14>SHOULD</bcp14>", "<bcp14>SHOULD NOT</bcp14>", "<bcp14>RECOMMENDED</bcp14>", "<bcp14>NOT RECOMMENDED</bcp14>", "<bcp14>MAY</bcp14>","<bcp14>MUST</bcp14>", "<bcp14>MUST NOT</bcp14>", "<bcp14>REQUIRED</bcp14>", "<bcp14>SHALL</bcp14>", "<bcp14>SHALL NOT</bcp14>", "<bcp14>SHOULD</bcp14>", "<bcp14>SHOULD NOT</bcp14>", "<bcp14>RECOMMENDED</bcp14>", "<bcp14>NOT RECOMMENDED</bcp14>", "<bcp14>MAY</bcp14>", and"<bcp14>OPTIONAL</bcp14>""<bcp14>OPTIONAL</bcp14>" in this document are to be interpreted as described inBCP 14BCP 14 <xref target="RFC2119"/> <xref target="RFC8174"/> when, and only when, they appear in all capitals, as shownhere. </t>here.</t> </section> </section> <section anchor="specification"title="Specification">numbered="true" toc="default"> <name>Specification</name> <section anchor="protocol-definition"title="Protocol Definition">numbered="true" toc="default"> <name>Protocol Definition</name> <t> The protocol language consists of rule(s) and group(s) that the service makes available in a file named'robots.txt'"robots.txt" as described in <xref target="access-method"/>:format="default"/>: </t><t> <list style="symbols"> <t> Rule:<dl spacing="normal"> <dt> Rule:</dt><dd> A line with a key-value pair that defines how a crawler may access URIs. 
See <xref target="the-allow-and-disallow-lines"/>. </t> <t> Group:format="default"/>. </dd> <dt> Group:</dt><dd> One or more user-agent lines thatisare followed by one or more rules. The group is terminated by a user-agent line or end of file. See <xref target="the-user-agent-line"/>.format="default"/>. The last group may have no rules, which means it implicitly allows everything.</t> </list> </t></dd> </dl> </section> <section anchor="formal-syntax"title="Formal Syntax">numbered="true" toc="default"> <name>Formal Syntax</name> <t> Below is an Augmented Backus-Naur Form (ABNF) description, as described in <xreftarget="RFC5234"/>.target="RFC5234" format="default"/>. </t><figure><artwork> <![CDATA[<sourcecode name="" type="abnf"><![CDATA[ robotstxt = *(group / emptyline) group = startgroupline ; We start with a user-agent ; line *(startgroupline / emptyline) ; ... and possibly more ;user-agentsuser-agent lines *(rule / emptyline) ; followed by rules relevant ; forUAsthe preceding ; user-agent lines startgroupline = *WS "user-agent" *WS ":" *WS product-token EOL rule = *WS ("allow" / "disallow") *WS ":" *WS (path-pattern / empty-pattern) EOL ; parser implementors: define additional lines you need (for ; example,sitemaps).Sitemaps). product-token = identifier / "*" path-pattern = "/" *UTF8-char-noctl ; valid URI path pattern empty-pattern = *WS identifier = 1*(%x2D / %x41-5A / %x5F / %x61-7A) comment = "#" *(UTF8-char-noctl / WS / "#") emptyline = EOL EOL = *WS [comment] NL ; end-of-line may have ; optional trailing comment NL = %x0D / %x0A / %x0D.0A WS = %x20 / %x09 ; UTF8 derived fromRFC3629,RFC 3629, but excluding control characters UTF8-char-noctl = UTF8-1-noctl / UTF8-2 / UTF8-3 / UTF8-4 UTF8-1-noctl = %x21 / %x22 / %x24-7F ; excluding control, space,'#'"#" UTF8-2 = %xC2-DF UTF8-tail UTF8-3 = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2UTF8-tail / %xED %x80-9F UTF8-tail / %xEE-EF 2UTF8-tail UTF8-4 = %xF0 %x90-BF 2UTF8-tail / %xF1-F3 3UTF8-tail / %xF4 %x80-8F 2UTF8-tail UTF8-tail = %x80-BF]]> </artwork></figure>]]></sourcecode> <section anchor="the-user-agent-line"title="Thenumbered="true" toc="default"> <name>The User-AgentLine">Line</name> <t> Crawlers set their own name, which is called a product token, to find relevant groups. The product token <bcp14>MUST</bcp14> contain onlyupperuppercase and lowercase letters("a-z"("a-z" and"A-Z"),"A-Z"), underscores("_"),("_"), and hyphens("-").("-"). The product token <bcp14>SHOULD</bcp14> be a substring of the identification string that the crawler sends to theservice (forservice. For example, in the case ofHTTP,HTTP <xref target="RFC9110" format="default"/>, the product token <bcp14>SHOULD</bcp14> be a substring in theuser-agent header).User-Agent header. 
        <section anchor="the-user-agent-line" numbered="true" toc="default">
          <name>The User-Agent Line</name>
          <t>
            Crawlers set their own name, which is called a product token, to find relevant
            groups. The product token <bcp14>MUST</bcp14> contain only uppercase and lowercase
            letters ("a-z" and "A-Z"), underscores ("_"), and hyphens ("-"). The product token
            <bcp14>SHOULD</bcp14> be a substring of the identification string that the crawler
            sends to the service. For example, in the case of HTTP
            <xref target="RFC9110" format="default"/>, the product token <bcp14>SHOULD</bcp14>
            be a substring in the User-Agent header. The identification string
            <bcp14>SHOULD</bcp14> describe the purpose of the crawler. Here's an example of a
            User-Agent HTTP request header with a link pointing to a page describing the
            purpose of the ExampleBot crawler, which appears as a substring in the User-Agent
            HTTP header and as a product token in the robots.txt user-agent line:
          </t>
          <figure anchor="fig-1">
            <name>Example of a User-Agent HTTP header and robots.txt user-agent line for the
            ExampleBot product token</name>
            <artwork name="" type="" align="center" alt=""><![CDATA[
+======================================+========================+
| User-Agent HTTP header               | robots.txt user-agent  |
|                                      | line                   |
+======================================+========================+
| User-Agent: Mozilla/5.0 (compatible; | user-agent: ExampleBot |
| ExampleBot/0.1;                      |                        |
| https://www.example.com/bot.html)    |                        |
+--------------------------------------+------------------------+
]]></artwork>
          </figure>
          <t>
            Note that the product token (ExampleBot) is a substring of the User-Agent HTTP
            header.
          </t>
          <t>
            Crawlers <bcp14>MUST</bcp14> use case-insensitive matching to find the group that
            matches the product token and then obey the rules of the group. If there is more
            than one group matching the user-agent, the matching groups' rules
            <bcp14>MUST</bcp14> be combined into one group and parsed according to
            <xref target="the-allow-and-disallow-lines" format="default"/>.
          </t>
          <figure anchor="fig-2">
            <name>Example of how to merge two robots.txt groups that match the same product
            token</name>
            <artwork name="" type="" align="center" alt=""><![CDATA[
+========================================+========================+
| Two groups that match the same product | Merged group           |
| token exactly                          |                        |
+========================================+========================+
| user-agent: ExampleBot                 | user-agent: ExampleBot |
| disallow: /foo                         | disallow: /foo         |
| disallow: /bar                         | disallow: /bar         |
|                                        | disallow: /baz         |
| user-agent: ExampleBot                 |                        |
| disallow: /baz                         |                        |
+----------------------------------------+------------------------+
]]></artwork>
          </figure>
          <t>
            If no matching group exists, crawlers <bcp14>MUST</bcp14> obey the group with a
            user-agent line with the "*" value, if present.
          </t>
          <figure anchor="fig-3">
            <name>Example of no matching groups other than the "*" for the ExampleBot product
            token</name>
            <artwork name="" type="" align="center" alt=""><![CDATA[
+==================================+======================+
| Two groups that don't explicitly | Applicable group for |
| match ExampleBot                 | ExampleBot           |
+==================================+======================+
| user-agent: *                    | user-agent: *        |
| disallow: /foo                   | disallow: /foo       |
| disallow: /bar                   | disallow: /bar       |
|                                  |                      |
| user-agent: BazBot               |                      |
| disallow: /baz                   |                      |
+----------------------------------+----------------------+
]]></artwork>
          </figure>
          <t>
            If no group matches the product token and there is no group with a user-agent line
            with the "*" value, or no groups are present at all, no rules apply.
          </t>
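          <t>
            The following non-normative Python sketch illustrates the group selection
            described above; the function name find_applicable_rules and the (key, value)
            record format are assumptions of this illustration. Matching is case-insensitive,
            the rules of all groups matching the product token are merged, and the "*" group
            is used only when no other group matches.
          </t>
          <sourcecode name="" type="python"><![CDATA[
# Non-normative illustration of group selection and merging.
# find_applicable_rules and the (key, value) record format are
# assumptions of this sketch, not part of the protocol itself.
from typing import List, Tuple

def find_applicable_rules(records: List[Tuple[str, str]],
                          product_token: str) -> List[Tuple[str, str]]:
    """Return the merged allow/disallow rules that apply to product_token."""
    exact: List[Tuple[str, str]] = []     # rules from groups naming the token
    wildcard: List[Tuple[str, str]] = []  # rules from "*" groups
    current_agents: List[str] = []
    in_rules = False
    for key, value in records:
        if key == "user-agent":
            if in_rules:                  # a user-agent line after rules starts a new group
                current_agents = []
                in_rules = False
            current_agents.append(value.lower())
        else:                             # an "allow" or "disallow" rule
            in_rules = True
            if product_token.lower() in current_agents:
                exact.append((key, value))
            if "*" in current_agents:
                wildcard.append((key, value))
    return exact if exact else wildcard   # empty result means no rules apply

# Example usage, reusing the record format of the parsing sketch above:
# records = [("user-agent", "ExampleBot"), ("disallow", "/foo"),
#            ("user-agent", "*"), ("disallow", "/bar")]
# find_applicable_rules(records, "examplebot") -> [("disallow", "/foo")]
]]></sourcecode>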
        </section>
        <section anchor="the-allow-and-disallow-lines" numbered="true" toc="default">
          <name>The "Allow" and "Disallow" Lines</name>
          <t>
            These lines indicate whether accessing a URI that matches the corresponding path
            is allowed or disallowed.
          </t>
          <t>
            To evaluate if access to a URI is allowed, a crawler <bcp14>MUST</bcp14> match the
            paths in "allow" and "disallow" rules against the URI. The matching
            <bcp14>SHOULD</bcp14> be case sensitive. The matching <bcp14>MUST</bcp14> start
            with the first octet of the path. The most specific match found
            <bcp14>MUST</bcp14> be used. The most specific match is the match that has the
            most octets. Duplicate rules in a group <bcp14>MAY</bcp14> be deduplicated. If an
            "allow" rule and a "disallow" rule are equivalent, then the "allow" rule
            <bcp14>SHOULD</bcp14> be used. If no match is found amongst the rules in a group
            for a matching user-agent or there are no rules in the group, the URI is allowed.
            The /robots.txt URI is implicitly allowed.
          </t>
          <t>
            Octets in the URI and robots.txt paths outside the range of the ASCII coded
            character set, and those in the reserved range defined by
            <xref target="RFC3986" format="default"/>, <bcp14>MUST</bcp14> be percent-encoded
            as defined by <xref target="RFC3986" format="default"/> prior to comparison.
          </t>
          <t>
            If a percent-encoded ASCII octet is encountered in the URI, it
            <bcp14>MUST</bcp14> be unencoded prior to comparison, unless it is a reserved
            character in the URI as defined by <xref target="RFC3986" format="default"/> or
            the character is outside the unreserved character range. The match evaluates
            positively if and only if the end of the path from the rule is reached before a
            difference in octets is encountered.
          </t>
          <t>
            For example:
          </t>
          <figure anchor="fig-4">
            <name>Examples of matching percent-encoded URI components</name>
            <artwork name="" type="" align="center" alt=""><![CDATA[
+==================+=======================+=======================+
| Path             | Encoded Path          | Path to Match         |
+==================+=======================+=======================+
| /foo/bar?baz=quz | /foo/bar?baz=quz      | /foo/bar?baz=quz      |
+------------------+-----------------------+-----------------------+
| /foo/bar?baz=    | /foo/bar?baz=         | /foo/bar?baz=         |
| https://foo.bar  | https%3A%2F%2Ffoo.bar | https%3A%2F%2Ffoo.bar |
+------------------+-----------------------+-----------------------+
| /foo/bar/        | /foo/bar/%E3%83%84    | /foo/bar/%E3%83%84    |
| U+E38384         |                       |                       |
+------------------+-----------------------+-----------------------+
| /foo/            | /foo/bar/%E3%83%84    | /foo/bar/%E3%83%84    |
| bar/%E3%83%84    |                       |                       |
+------------------+-----------------------+-----------------------+
| /foo/            | /foo/bar/%62%61%7A    | /foo/bar/baz          |
| bar/%62%61%7A    |                       |                       |
+------------------+-----------------------+-----------------------+
]]></artwork>
          </figure>
          <t>
            The crawler <bcp14>SHOULD</bcp14> ignore "disallow" and "allow" rules that are not
            in any group (for example, any rule that precedes the first user-agent line).
          </t>
          <t>
            Implementors <bcp14>MAY</bcp14> bridge encoding mismatches if they detect that the
            robots.txt file is not UTF-8 encoded.
          </t>
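          <t>
            The following non-normative Python sketch (the function name is_allowed and the
            rule format are assumptions of this illustration) applies the evaluation rules
            above to an already percent-normalized URI path: the most specific (longest)
            matching rule wins, an "allow" rule wins over an equivalent "disallow" rule, and a
            URI with no matching rule is allowed.
          </t>
          <sourcecode name="" type="python"><![CDATA[
# Non-normative illustration of the most-specific-match evaluation.
# is_allowed and the (key, path_pattern) rule format are assumptions of
# this sketch; patterns here are plain prefixes without "*" or "$".
from typing import List, Tuple

def is_allowed(rules: List[Tuple[str, str]], uri_path: str) -> bool:
    """Evaluate allow/disallow rules against a percent-normalized URI path."""
    if uri_path == "/robots.txt":
        return True                      # the robots.txt URI is implicitly allowed
    best_length = -1
    best_verdict = True                  # no matching rule means the URI is allowed
    for key, pattern in rules:
        if not pattern or not uri_path.startswith(pattern):
            continue                     # empty pattern or no match from the first octet
        allowed = (key == "allow")
        # Longer (more specific) matches win; on a tie, "allow" wins.
        if len(pattern) > best_length or (len(pattern) == best_length and allowed):
            best_length = len(pattern)
            best_verdict = allowed
    return best_verdict

# Example usage:
# rules = [("allow", "/example/page/"),
#          ("disallow", "/example/page/disallowed.gif")]
# is_allowed(rules, "/example/page/disallowed.gif") -> False
# is_allowed(rules, "/example/page/other.gif")      -> True
]]></sourcecode>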
        </section>
        <section anchor="special-characters" numbered="true" toc="default">
          <name>Special Characters</name>
          <t>
            Crawlers <bcp14>MUST</bcp14> support the following special characters:
          </t>
          <figure anchor="fig-5">
            <name>List of special characters in robots.txt files</name>
            <artwork name="" type="" align="center" alt=""><![CDATA[
+===========+===================+=============================+
| Character | Description       | Example                     |
+===========+===================+=============================+
| #         | Designates a line | allow: / # comment in line  |
|           | comment.          |                             |
|           |                   | # comment on its own line   |
+-----------+-------------------+-----------------------------+
| $         | Designates the    | allow: /this/path/exactly$  |
|           | end of the match  |                             |
|           | pattern.          |                             |
+-----------+-------------------+-----------------------------+
| *         | Designates 0 or   | allow: /this/*/exactly      |
|           | more instances of |                             |
|           | any character.    |                             |
+-----------+-------------------+-----------------------------+
]]></artwork>
          </figure>
          <t>
            If crawlers match special characters verbatim in the URI, crawlers
            <bcp14>SHOULD</bcp14> use "%" encoding. For example:
          </t>
          <figure anchor="fig-6">
            <name>Example of percent-encoding</name>
            <artwork name="" type="" align="center" alt=""><![CDATA[
+============================+====================================+
| Percent-encoded Pattern    | URI                                |
+============================+====================================+
| /path/file-with-a-%2A.html | https://www.example.com/path/      |
|                            | file-with-a-*.html                 |
+----------------------------+------------------------------------+
| /path/foo-%24              | https://www.example.com/path/foo-$ |
+----------------------------+------------------------------------+
]]></artwork>
          </figure>
        </section>
        <section anchor="other-records" numbered="true" toc="default">
          <name>Other Records</name>
          <t>
            Crawlers <bcp14>MAY</bcp14> interpret other records that are not part of the
            robots.txt protocol -- for example, "Sitemaps"
            <xref target="SITEMAPS" format="default"/>. Crawlers <bcp14>MAY</bcp14> be lenient
            when interpreting other records. For example, crawlers may accept common
            misspellings of the record.
          </t>
          <t>
            Parsing of other records <bcp14>MUST NOT</bcp14> interfere with the parsing of
            explicitly defined records in <xref target="specification" format="default"/>. For
            example, a "Sitemaps" record <bcp14>MUST NOT</bcp14> terminate a group.
          </t>
        </section>
      </section>
      <section anchor="access-method" numbered="true" toc="default">
        <name>Access Method</name>
        <t>
          The rules <bcp14>MUST</bcp14> be accessible in a file named "/robots.txt" (all
          lowercase) in the top-level path of the service. The file <bcp14>MUST</bcp14> be
          UTF-8 encoded (as defined in <xref target="RFC3629" format="default"/>) and Internet
          Media Type "text/plain" (as defined in <xref target="RFC2046" format="default"/>).
        </t>
        <t>
          As per <xref target="RFC3986" format="default"/>, the URI of the robots.txt file is:
        </t>
        <t>
          "scheme:[//authority]/robots.txt"
        </t>
        <t>
          For example, in the context of HTTP or FTP, the URI is:
        </t>
        <artwork name="" type="" align="left" alt=""><![CDATA[
 https://www.example.com/robots.txt

 ftp://ftp.example.com/robots.txt
]]></artwork>
        <section anchor="access-results" numbered="true" toc="default">
          <name>Access Results</name>
          <section anchor="successful-access" numbered="true" toc="default">
            <name>Successful Access</name>
            <t>
              If the crawler successfully downloads the robots.txt file, the crawler
              <bcp14>MUST</bcp14> follow the parseable rules.
            </t>
          </section>
          <section anchor="redirects" numbered="true" toc="default">
            <name>Redirects</name>
            <t>
              It's possible that a server responds to a robots.txt fetch request with a
              redirect, such as HTTP 301 or HTTP 302 in the case of HTTP. The crawlers
              <bcp14>SHOULD</bcp14> follow at least five consecutive redirects, even across
              authorities (for example, hosts in the case of HTTP).
            </t>
            <t>
              If a robots.txt file is reached within five consecutive redirects, the
              robots.txt file <bcp14>MUST</bcp14> be fetched, parsed, and its rules followed
              in the context of the initial authority.
            </t>
            <t>
              If there are more than five consecutive redirects, crawlers <bcp14>MAY</bcp14>
              assume that the robots.txt file is unavailable.
            </t>
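            <t>
              The following non-normative Python sketch (the function name fetch_robotstxt is
              an assumption of this illustration, and the third-party "requests" library is
              used purely for brevity) fetches a robots.txt URI while following up to five
              consecutive redirects by hand, treating a longer chain as unavailable.
            </t>
            <sourcecode name="" type="python"><![CDATA[
# Non-normative illustration of fetching robots.txt with a bounded
# redirect chain. fetch_robotstxt is a name assumed for this sketch,
# and the third-party "requests" library is used only for brevity.
from typing import Optional, Tuple
from urllib.parse import urljoin

import requests

MAX_REDIRECTS = 5  # follow at least five consecutive redirects

def fetch_robotstxt(url: str) -> Optional[Tuple[int, str]]:
    """Return (status code, body) of the final response, or None if the
    redirect chain is too long and robots.txt is treated as unavailable."""
    for _ in range(MAX_REDIRECTS + 1):
        response = requests.get(url, allow_redirects=False, timeout=10)
        if 300 <= response.status_code < 400 and "Location" in response.headers:
            # Follow the redirect, even across authorities (hosts).
            url = urljoin(url, response.headers["Location"])
            continue
        return response.status_code, response.text
    return None  # more than five consecutive redirects

# Example usage (the caller maps the returned status code to the
# "Unavailable" and "Unreachable" handling in the following sections):
# result = fetch_robotstxt("https://www.example.com/robots.txt")
]]></sourcecode>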
          </section>
          <section anchor="unavailable-status" numbered="true" toc="default">
            <name>"Unavailable" Status</name>
            <t>
              "Unavailable" means the crawler tries to fetch the robots.txt file and the
              server responds with status codes indicating that the resource in question is
              unavailable. For example, in the context of HTTP, such status codes are in the
              400-499 range.
            </t>
            <t>
              If a server status code indicates that the robots.txt file is unavailable to
              the crawler, then the crawler <bcp14>MAY</bcp14> access any resources on the
              server.
            </t>
          </section>
          <section anchor="unreachable-status" numbered="true" toc="default">
            <name>"Unreachable" Status</name>
            <t>
              If the robots.txt file is unreachable due to server or network errors, this
              means the robots.txt file is undefined and the crawler <bcp14>MUST</bcp14>
              assume complete disallow. For example, in the context of HTTP, server errors
              are identified by status codes in the 500-599 range.
            </t>
            <t>
              If the robots.txt file is undefined for a reasonably long period of time (for
              example, 30 days), crawlers <bcp14>MAY</bcp14> assume that the robots.txt file
              is unavailable as defined in
              <xref target="unavailable-status" format="default"/> or continue to use a
              cached copy.
            </t>
          </section>
          <section anchor="parsing-errors" numbered="true" toc="default">
            <name>Parsing Errors</name>
            <t>
              Crawlers <bcp14>MUST</bcp14> try to parse each line of the robots.txt file.
              Crawlers <bcp14>MUST</bcp14> use the parseable rules.
            </t>
          </section>
        </section>
      </section>
      <section anchor="caching" numbered="true" toc="default">
        <name>Caching</name>
        <t>
          Crawlers <bcp14>MAY</bcp14> cache the fetched robots.txt file's contents. Crawlers
          <bcp14>MAY</bcp14> use standard cache control as defined in
          <xref target="RFC9111" format="default"/>. Crawlers <bcp14>SHOULD NOT</bcp14> use
          the cached version for more than 24 hours, unless the robots.txt file is
          unreachable.
        </t>
      </section>
      <section anchor="limits" numbered="true" toc="default">
        <name>Limits</name>
        <t>
          Crawlers <bcp14>SHOULD</bcp14> impose a parsing limit to protect their systems; see
          <xref target="security" format="default"/>. The parsing limit <bcp14>MUST</bcp14> be
          at least 500 kibibytes <xref target="KiB" format="default"/>.
        </t>
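        <t>
          The following non-normative Python sketch (the constant and function names are
          assumptions of this illustration) caps the number of bytes handed to the parser at
          the 500 kibibyte minimum required above, tolerating an encoding mismatch or a
          multi-byte sequence cut off at the truncation point.
        </t>
        <sourcecode name="" type="python"><![CDATA[
# Non-normative illustration of applying a parsing limit.
# PARSING_LIMIT_BYTES and truncate_robotstxt are names assumed for this
# sketch only; a crawler may choose a larger limit.
PARSING_LIMIT_BYTES = 500 * 1024  # at least 500 kibibytes are processed

def truncate_robotstxt(payload: bytes) -> str:
    """Decode at most the first PARSING_LIMIT_BYTES of a fetched robots.txt."""
    limited = payload[:PARSING_LIMIT_BYTES]
    # errors="replace" tolerates a multi-byte sequence cut off at the
    # truncation point or a file that is not valid UTF-8.
    return limited.decode("utf-8", errors="replace")

# Example usage:
# text = truncate_robotstxt(b"user-agent: *\ndisallow: /private/\n")
]]></sourcecode>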
      </section>
    </section>
    <section anchor="security" numbered="true" toc="default">
      <name>Security Considerations</name>
      <t>
        The Robots Exclusion Protocol is not a substitute for valid content security
        measures. Listing paths in the robots.txt file exposes them publicly and thus makes
        the paths discoverable. To control access to the URI paths in a robots.txt file,
        users of the protocol should employ a valid security measure relevant to the
        application layer on which the robots.txt file is served -- for example, in the case
        of HTTP, HTTP Authentication as defined in <xref target="RFC9110" format="default"/>.
      </t>
      <t>
        To protect against attacks against their system, implementors of robots.txt parsing
        and matching logic should take the following considerations into account:
      </t>
      <dl spacing="normal">
        <dt>Memory management:</dt>
        <dd>
          <xref target="limits" format="default"/> defines the lower limit of bytes that must
          be processed, which inherently also protects the parser from out-of-memory
          scenarios.
        </dd>
        <dt>Invalid characters:</dt>
        <dd>
          <xref target="formal-syntax" format="default"/> defines a set of characters that
          parsers and matchers can expect in robots.txt files. Out-of-bound characters should
          be rejected as invalid, which limits the available attack vectors that attempt to
          compromise the system.
        </dd>
        <dt>Untrusted content:</dt>
        <dd>
          Implementors should treat the content of a robots.txt file as untrusted content, as
          defined by the specification of the application layer used. For example, in the
          context of HTTP, implementors should follow the Security Considerations section of
          <xref target="RFC9110" format="default"/>.
        </dd>
      </dl>
    </section>
    <section anchor="IANA" numbered="true" toc="default">
      <name>IANA Considerations</name>
      <t>
        This document has no IANA actions.
      </t>
    </section>
    <section anchor="examples" numbered="true" toc="default">
      <name>Examples</name>
      <section anchor="simple-example" numbered="true" toc="default">
        <name>Simple Example</name>
        <t>
          The following example shows:
        </t>
        <dl spacing="normal">
          <dt>*:</dt>
          <dd>
            A group that's relevant to all user agents that don't have an explicitly defined
            matching group. It allows access to the URLs with the /publications/ path prefix,
            and it restricts access to the URLs with the /example/ path prefix and to all
            URLs with a .gif suffix. The "*" character designates any character, including
            the otherwise-required forward slash; see
            <xref target="formal-syntax" format="default"/>.
          </dd>
          <dt>foobot:</dt>
          <dd>
            A regular case. A single user agent followed by rules. The crawler only has
            access to two URL path prefixes on the site -- /example/page.html and
            /example/allowed.gif. The rules of the group are missing the optional space
            character, which is acceptable as defined in
            <xref target="formal-syntax" format="default"/>.
          </dd>
          <dt>barbot and bazbot:</dt>
          <dd>
            A group that's relevant for more than one user agent. The crawlers are not
            allowed to access the URLs with the /example/page.html path prefix but otherwise
            have unrestricted access to the rest of the URLs on the site.
          </dd>
          <dt>quxbot:</dt>
          <dd>
            An empty group at the end of the file.
            The crawler has unrestricted access to the URLs on the site.
          </dd>
        </dl>
        <artwork name="" type="" align="left" alt=""><![CDATA[
 User-Agent: *
 Disallow: *.gif$
 Disallow: /example/
 Allow: /publications/

 User-Agent: foobot
 Disallow:/
 Allow:/example/page.html
 Allow:/example/allowed.gif

 User-Agent: barbot
 User-Agent: bazbot
 Disallow: /example/page.html

 User-Agent: quxbot

 EOF
]]></artwork>
      </section>
      <section anchor="longest-match" numbered="true" toc="default">
        <name>Longest Match</name>
        <t>
          The following example shows that in the case of two rules, the longest one is used
          for matching. In the following case, /example/page/disallowed.gif
          <bcp14>MUST</bcp14> be used for the URI example.com/example/page/disallowed.gif.
        </t>
        <artwork name="" type="" align="left" alt=""><![CDATA[
 User-Agent: foobot
 Allow: /example/page/
 Disallow: /example/page/disallowed.gif
]]></artwork>
      </section>
    </section>
  </middle>
  <back>
    <references>
      <name>References</name>
      <references>
        <name>Normative References</name>
        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.2046.xml"/>
        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.2119.xml"/>
        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.3629.xml"/>
        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.3986.xml"/>
        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.5234.xml"/>
        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.8174.xml"/>
        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.8288.xml"/>
        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.9110.xml"/>
        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.9111.xml"/>
      </references>
      <references>
        <name>Informative References</name>
        <reference anchor="ROBOTSTXT" target="https://www.robotstxt.org/">
          <front>
            <title>The Web Robots Pages (including /robots.txt)</title>
            <author>
              <organization/>
            </author>
            <date>2007</date>
          </front>
        </reference>
        <reference anchor="SITEMAPS" target="https://www.sitemaps.org/index.html">
          <front>
            <title>What are Sitemaps? (Sitemap protocol)</title>
            <author>
              <organization/>
            </author>
            <date>April 2020</date>
          </front>
        </reference>
        <reference anchor="KiB" target="https://simple.wikipedia.org/wiki/Kibibyte">
          <front>
            <title>Kibibyte</title>
            <author>
              <organization/>
            </author>
            <date day="17" month="September" year="2020"/>
          </front>
          <refcontent>Simple English Wikipedia, the free encyclopedia</refcontent>
        </reference>
      </references>
    </references>
  </back>
</rfc>