Workgroup: Network Working Group
Internet-Draft: draft-illyes-cbcp-latest
Published: 10 April 2025
Intended Status: Informational
Expires: 10 October 2025
Author: G. Illyes, Independent

Crawler best practices

Abstract

TODO Abstract

Discussion Venues

This note is to be removed before publishing as an RFC.

Source for this draft and an issue tracker can be found at https://github.com/garyillyes/cbcp.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on 10 October 2025.

Table of Contents

1.  Introduction
2.  Good Practices
  2.1.  Crawlers must support the Robots Exclusion Protocol
  2.2.  Crawlers must be easily identifiable through their user agent string
  2.3.  Crawlers must not interfere with the normal operation of a site
  2.4.  Crawlers must expose the IP ranges they use for crawling
  2.5.  Crawlers must explain how the crawled data is used
3.  Conventions and Definitions
4.  Security Considerations
5.  IANA Considerations
6.  Normative References
Acknowledgments
Author's Address

1. Introduction

Having an expected behavior for automatic clients (i.e., crawlers, bots), knowing how their behavior can be influenced, and knowing how to identify them and opt out of their crawling is helpful for all parties involved. To help website owners, we propose inviting crawler operators to conform to similar crawling policies and to create together a central website where website owners can look up well-behaved crawlers.

Note that while self-declared research crawlers (including privacy and malware discovery crawlers) and contractual crawlers are welcome to adopt these practices, due to the nature of their relationship with the sites they crawl, they may exempt themselves from any of the Crawler Code of Conduct policies by providing a rationale.

2. Good Practices

The following practices are employed by the vast majority of large-scale crawlers on the internet:

  1. Crawlers must support the Robots Exclusion Protocol.

  2. Crawlers must be easily identifiable through their user agent string.

  3. Crawlers must not interfere with the normal operation of a site.

  4. Crawlers must expose the IP ranges they use for crawling in a standardized format.

  5. Crawlers must expose a page where they explain how the crawled data is used.

2.1. Crawlers must support the Robots Exclusion Protocol

All well-behaved crawlers must support the REP as defined in [RFC9309] to allow site owners to opt out of crawling.
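For example, a site owner could opt out of crawling by a hypothetical crawler that identifies itself as "ExampleBot" with the following robots.txt rules:

User-agent: ExampleBot
Disallow: /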

2.2. Crawlers must be easily identifiable through their user agent string

As stipulated in [RFC9309] (Robots Exclusion Protocol; REP), the User-Agent HTTP request header should identify the crawler clearly, typically by including a URL that hosts the crawler's description. For example, "User-Agent: Mozilla/5.0 (compatible; ExampleBot/0.1; +https://www.example.com/bot.html)". This is already a widely supported mechanism among crawler operators.

To be compliant, crawler operators must specify identifiers unique to their crawlers within the user-agent string, matched case-insensitively; for example, "contains 'googlebot' and 'https://url/...'".
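As a minimal sketch of such a check on the site-owner side, written in Python, a request's user-agent can be matched case-insensitively against the crawler's identifier; the "examplebot" token below is a hypothetical placeholder, not a registered identifier:

# Sketch: case-insensitive check for a crawler's product token in a
# User-Agent string. "examplebot" is a hypothetical identifier.
def is_known_crawler(user_agent: str, token: str = "examplebot") -> bool:
    return token in user_agent.lower()

ua = "Mozilla/5.0 (compatible; ExampleBot/0.1; +https://www.example.com/bot.html)"
print(is_known_crawler(ua))  # prints: True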

2.3. Crawlers must not interfere with the normal operation of a site

Depending on a site's setup (computing resources, software efficiency) and its size, crawling may slow down the site or take it offline altogether. Crawler operators must ensure that their crawlers are equipped with back-off logic that relies on at least the standard signals defined by [RFC9110], such as the 503 (Service Unavailable) status code and the Retry-After header field, and preferably also on additional heuristics such as the relative response time of the server.
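A minimal sketch of such back-off logic, assuming a Python crawler built on the "requests" library; the retry limit, initial delay, and doubling strategy are illustrative assumptions, and the commonly used 429 (Too Many Requests) status code is handled alongside the RFC9110 signals:

import time
import requests

def fetch_with_backoff(url, max_retries=5):
    """Fetch a URL, backing off when the server signals overload."""
    delay = 1.0  # initial back-off in seconds (illustrative)
    for _ in range(max_retries):
        response = requests.get(
            url,
            headers={"User-Agent": "ExampleBot/0.1 (+https://www.example.com/bot.html)"},
        )
        if response.status_code not in (429, 503):
            return response
        # Honor Retry-After when the server provides it in seconds;
        # otherwise double the delay on each attempt.
        retry_after = response.headers.get("Retry-After", "")
        wait = float(retry_after) if retry_after.isdigit() else delay
        time.sleep(wait)
        delay *= 2
    return None  # give up; the site is signaling sustained overload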

2.4. Crawlers must expose the IP ranges they use for crawling

To complement the REP, crawler operators should expose the IP ranges they have allocated for crawling in a standardized, machine-readable format, and keep the list reasonably up to date (i.e., it should not be older than 7 days).

The object containing the IP addresses must be linked from the page describing the crawler, and it must also be referenced in the metadata of the page for machine readability. For example:

<link rel="help" href="https://example.com/crawlerips.json" />
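This document does not prescribe a schema for the object; as an assumption for illustration, it could be a JSON list of prefixes (the addresses below are documentation ranges):

{
  "creationTime": "2025-04-10T00:00:00Z",
  "prefixes": [
    { "ipv4Prefix": "192.0.2.0/24" },
    { "ipv6Prefix": "2001:db8::/32" }
  ]
}

A site owner could then verify that a client claiming to be the crawler connects from a published range, sketched here in Python with the standard ipaddress module:

import ipaddress

def ip_in_ranges(ip, prefixes):
    """Check whether an address falls within any published crawler prefix."""
    addr = ipaddress.ip_address(ip)
    for entry in prefixes:
        for prefix in entry.values():
            # Mismatched IP versions simply compare as "not contained".
            if addr in ipaddress.ip_network(prefix):
                return True
    return False

prefixes = [{"ipv4Prefix": "192.0.2.0/24"}, {"ipv6Prefix": "2001:db8::/32"}]
print(ip_in_ranges("192.0.2.17", prefixes))  # prints: True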

2.5. Crawlers must explain how the crawled data is used

Similar to Section 2.2 (Crawlers must be easily identifiable through their user agent string), crawlers must explain how the data they crawl will be used. In practice this is generally done through the documentation page referenced in the user-agent of the crawler.

3. Conventions and Definitions

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.

4. Security Considerations

TODO Security

5. IANA Considerations

This document has no IANA actions.

6. Normative References

[RFC2119]
Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997, <https://www.rfc-editor.org/rfc/rfc2119>.
[RFC8174]
Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, May 2017, <https://www.rfc-editor.org/rfc/rfc8174>.
[RFC9110]
Fielding, R., Ed., Nottingham, M., Ed., and J. Reschke, Ed., "HTTP Semantics", STD 97, RFC 9110, DOI 10.17487/RFC9110, June 2022, <https://www.rfc-editor.org/rfc/rfc9110>.
[RFC9309]
Koster, M., Illyes, G., Zeller, H., and L. Sassman, "Robots Exclusion Protocol", RFC 9309, DOI 10.17487/RFC9309, September 2022, <https://www.rfc-editor.org/rfc/rfc9309>.

Acknowledgments

TODO acknowledge.

Author's Address

Gary Illyes
Independent