robots.txt

A robots.txt file is a standard way for website owners to tell web crawlers and bots which parts of their website should or should not be crawled. It is a plain text file placed in the root directory of a website, and it acts as a guide for search engine crawlers (like Googlebot) and other automated agents.

Purpose of robots.txt

  • Control Web Crawling: It gives instructions on which parts of the website should be crawled or ignored.
  • Manage Server Resources: By restricting access to certain areas of the site, you can reduce the load on your server caused by unnecessary crawling.
  • Keep Crawlers Away from Sensitive or Irrelevant Areas: For example, you might use it to discourage crawling of admin pages, staging areas, or duplicate content.

How Does robots.txt Work?

  1. When a crawler (like a search engine bot) visits your site, it first checks for the existence of a robots.txt file.
  2. The crawler reads this file and follows the instructions provided within it.
  3. If no robots.txt file is available, the crawler assumes it has permission to access all publicly available pages on the site. (A minimal crawler-side version of this check is sketched below.)
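
The steps above can be reproduced with Python's built-in urllib.robotparser module. This is only a minimal sketch; https://example.com/robots.txt is a placeholder URL and MyCrawler is a made-up user-agent name.

import urllib.robotparser

# Step 1: point the parser at the site's robots.txt and download it
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# Steps 2 and 3: ask whether a given user agent may fetch a given URL;
# if the file does not exist (404), the parser treats every URL as allowed
print(rp.can_fetch("MyCrawler", "https://example.com/private/page.html"))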

Syntax of robots.txt

The file consists of directives that specify user agents (specific crawlers) and the areas they are allowed or disallowed to access.

Basic Example:

User-agent: *
Disallow: /private/

  • User-agent: * indicates the rule applies to all crawlers.
  • Disallow: /private/ blocks crawlers from accessing any URL whose path starts with /private/ (a quick way to verify this is sketched below).
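
If you want to check how such rules behave without deploying anything, urllib.robotparser can also evaluate them in memory. The paths used here are made up for illustration.

import urllib.robotparser

rules = [
    "User-agent: *",
    "Disallow: /private/",
]

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)
print(rp.can_fetch("ExampleBot", "/private/data.html"))  # False: path starts with /private/
print(rp.can_fetch("ExampleBot", "/public/index.html"))  # True: no rule matches this path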

Allowing Access:

User-agent: *
Allow: /public/

This explicitly allows crawlers to access URLs starting with /public/. On its own, an Allow rule changes little, because crawling is permitted by default; Allow is most useful for carving out an exception to a broader Disallow rule, as in the pattern below.
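
For example, a site might block a whole directory but leave one subdirectory open. The directory names here are invented; crawlers that support Allow (such as Googlebot) generally let the more specific, longer rule win.

User-agent: *
Allow: /private/help/
Disallow: /private/

With this file, a URL such as /private/help/faq.html can still be crawled, while the rest of /private/ stays blocked.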

Blocking a Specific Crawler:

User-agent: Googlebot
Disallow: /

This blocks only Googlebot from accessing the entire site.

Combining Rules for Different Crawlers:

User-agent: Googlebot
Disallow: /private/

User-agent: Bingbot
Allow: /

  • Googlebot cannot access /private/ (but can crawl the rest of the site).
  • Bingbot can access everything (a quick way to verify both rules is sketched below).
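
Per-crawler groups like these can also be checked with urllib.robotparser by passing different user-agent strings to can_fetch. Again, the URL path is made up.

import urllib.robotparser

rules = [
    "User-agent: Googlebot",
    "Disallow: /private/",
    "",
    "User-agent: Bingbot",
    "Allow: /",
]

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)
print(rp.can_fetch("Googlebot", "/private/report.html"))  # False
print(rp.can_fetch("Bingbot", "/private/report.html"))    # True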

Limitations of robots.txt

  • Compliance is Voluntary: Not all crawlers respect robots.txt. Malicious bots often ignore it and may still scrape restricted areas.
  • Does Not Prevent Indexing: If a URL is linked from elsewhere, search engines can still index it (usually as a bare URL with no description) even if it is disallowed in robots.txt. To keep a page out of search results, use a noindex robots meta tag or the X-Robots-Tag HTTP header instead, as shown below.
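
For reference, these are the two standard ways to request noindex; the meta tag goes in the page's <head>, and the header would be set in your web server or application configuration:

<meta name="robots" content="noindex">

X-Robots-Tag: noindex

Note that a crawler can only see these directives if it is allowed to fetch the page, so a URL that should disappear from search results must not also be blocked in robots.txt.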

Where to Place robots.txt?

The file must be located in the root directory of your website (e.g., https://example.com/robots.txt). This is the only location crawlers check; a robots.txt file placed in a subdirectory is ignored.

Checking Your robots.txt

You can view the robots.txt file of any site by adding /robots.txt after the domain name, for example https://www.google.com/robots.txt. It can also be fetched with a few lines of Python, as below.
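
This sketch downloads and prints a live robots.txt file using only the standard library; Google's file is used here simply because it is known to exist.

import urllib.request

# Fetch the robots.txt file and print its contents
with urllib.request.urlopen("https://www.google.com/robots.txt") as resp:
    print(resp.read().decode("utf-8"))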

Best Practices

  • Keep your robots.txt file simple and accurate.
  • Avoid blocking resources (e.g., JavaScript, CSS) necessary for proper rendering of your site.
  • Regularly test your file using tools such as the robots.txt report in Google Search Console (the successor to the older Robots.txt Tester).

By properly using the robots.txt file, you can effectively manage how crawlers interact with your site, optimizing both performance and search engine visibility.
