4.3f Metadata, search engines and their robots

Guideline

Meta tags are collected from Web pages by visiting robots: applications that automatically crawl the Internet to index Web pages. These robots are often called spiders, and they submit the meta tags to a Web index that can be accessed by a search engine. Using the "robots" property of the meta element, one can give instructions to a visiting robot as to how it should crawl the Web site. These instructions are advisory only: the robot may or may not respect them.

The format is very simple:

<meta name="ROBOTS" content="value" />

where "value" is replaced by one or more instructions (keywords) that give directions to the robot. Multiple instructions are separated by commas, and the keywords are not case-sensitive.

The most common set of instructions would be:

<meta name="robots" content="index,follow" />

As this is the default behaviour for robots, you do not have to include this tag.

Be careful not to specify conflicting or repeated directives, such as:

<meta name="robots" content="INDEX,NOINDEX" />
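
A check for such mistakes can be sketched in a few lines of Python. This is only an illustration: the function name check_robots_content is hypothetical, and it assumes that the only contradictory pairs of interest are index/noindex and follow/nofollow.

```python
def check_robots_content(content):
    """Return a list of problems found in a robots meta content value."""
    # Keywords are comma-separated and not case-sensitive.
    directives = [d.strip().lower() for d in content.split(",") if d.strip()]

    problems = []
    seen = set()
    for directive in directives:
        if directive in seen:
            problems.append(f"repeated directive: {directive}")
        seen.add(directive)

    # Assumption: only these two pairs are treated as contradictory.
    for positive in ("index", "follow"):
        if positive in seen and "no" + positive in seen:
            problems.append(
                f"conflicting directives: {positive} and no{positive}"
            )
    return problems


print(check_robots_content("INDEX,NOINDEX"))  # ['conflicting directives: index and noindex']
print(check_robots_content("index,follow"))   # []
```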

In addition to server-wide robot control using the file "robots.txt", it is possible to specify, page by page, that a page should not be indexed by search engine robots, or that the links it contains should not be followed. The robots meta tag, placed in the HTML "head" section of a page, can specify either or both of these actions.

Most spider robots will recognize this tag and follow the rules for each page. Included below are the most commonly used values:

  • NOINDEX - instructs a search engine not to include a page in its index (search results) but follow the links contained in that page;
  • NOFOLLOW - instructs a search engine to index the page but not to follow the links contained in that page;
  • NOINDEX, NOFOLLOW - use this for pages that should not be indexed and whose links should not be followed. If you place this in every page, the site will not be indexed;
  • NOARCHIVE - instructs a search engine not to provide archived copies of a page to users (of that search engine);
  • NOIMAGEINDEX - instructs a search engine not to index images found on the page (honoured by only some search engines); and
  • NONE - instructs a search engine not to do anything with a page; equivalent to NOINDEX, NOFOLLOW (cannot be combined with any other instruction).
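
To illustrate how a robot might read these values, the following Python sketch extracts the robots meta tag from a page and works out whether the page may be indexed and its links followed. It uses only the standard library; the names RobotsMetaParser and robots_policy are hypothetical, and real robots may interpret the values differently.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the keywords of every <meta name="robots"> tag on a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None when the attribute has no value.
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for word in attr.get("content", "").split(","):
                if word.strip():
                    self.directives.add(word.strip().lower())


def robots_policy(html):
    """Return (may_index, may_follow) for the given HTML text."""
    parser = RobotsMetaParser()
    parser.feed(html)
    found = parser.directives
    # With no tag present the default applies: index and follow.
    may_index = not ({"noindex", "none"} & found)
    may_follow = not ({"nofollow", "none"} & found)
    return may_index, may_follow


page = '<head><meta name="ROBOTS" content="NOINDEX, follow" /></head>'
print(robots_policy(page))  # (False, True)
```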

Robots.txt files

A robots.txt file is a plain text file that is uploaded to the root directory of a Web site. By defining a few rules, it can instruct robots not to crawl or index certain files and directories. Each unique domain can have only one robots.txt file, and clients must contact the WebGuide if any changes are required.
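
As a rough illustration of how such rules work, the Python sketch below feeds a sample robots.txt to the standard library's urllib.robotparser module and asks whether a robot may fetch two addresses. The rules, the robot name "ExampleBot" and the www.example.org addresses are invented for this example and do not describe any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: all robots are asked to stay out of one
# directory and away from one individual file.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /drafts/report.html
"""


def build_parser(robots_txt):
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = build_parser(SAMPLE_ROBOTS_TXT)
print(rules.can_fetch("ExampleBot", "https://www.example.org/private/page.html"))  # False
print(rules.can_fetch("ExampleBot", "https://www.example.org/index.html"))         # True
```

As with the robots meta tag, these rules are requests: a robot that chooses to ignore the file is not technically prevented from crawling the listed paths.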