A robots.txt file allows you to control how web crawlers, such as those run by Google and Bing, access parts of your website. The file itself sits in the root folder of your website and adheres to the Robots Exclusion Protocol. This protocol allows you to control access to your website by URL or by the type of crawler.
Not all crawlers/spiders follow this protocol to the letter, and some, such as spambots and malware, ignore it completely.
As an example, suppose Google is indexing your website and reaches the URL www.yoursite.com/news/. Just before loading this page, the spider/crawler looks for www.yoursite.com/robots.txt and finds your robots file. The file may look like the following:
User-agent: *
Disallow: /
Placed inside a robots.txt file, the rules above instruct all crawlers not to crawl any pages on the website. To do the opposite and allow all spiders to crawl all pages on your website, your robots.txt file would look like the following:
User-agent: *
Allow: /
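To see how a well-behaved crawler might interpret the two files above, here is a minimal sketch using Python's standard urllib.robotparser module. The helper name is_allowed and the crawler name ExampleBot are assumptions made for illustration; real crawlers use their own parsers, and their behavior may differ in the details.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url.

    Hypothetical helper for illustration; it parses the rules from a string
    instead of downloading them from a live site.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The two robots.txt files shown above.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"
ALLOW_ALL = "User-agent: *\nAllow: /\n"

page = "http://www.yoursite.com/news/"

# "ExampleBot" is a made-up crawler name; "*" rules apply to any crawler.
print(is_allowed(BLOCK_ALL, "ExampleBot", page))  # False
print(is_allowed(ALLOW_ALL, "ExampleBot", page))  # True
```

The check is purely advisory: it tells a cooperative crawler what it should do, and nothing prevents a crawler that ignores the protocol from requesting the page anyway.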
Additional examples of robots.txt files can be found below; the first applies to a folder and the second to a single page.
User-agent: *
Allow: /folder/
User-agent: *
Allow: /news.html
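The path-level rules above can be checked the same way. The sketch below again uses Python's urllib.robotparser; the names build_parser, can_crawl and ExampleBot are assumptions for illustration. Note that an Allow line on its own does not restrict anything, because paths with no matching rule are crawlable by default. The second rule set, which pairs the Allow line with a Disallow line, is a hypothetical variation that is not part of the examples above and is included only to show the contrast.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt rules supplied as a string (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def can_crawl(parser: RobotFileParser, path: str) -> bool:
    """Check a path on the example site for a made-up crawler name."""
    return parser.can_fetch("ExampleBot", "http://www.yoursite.com" + path)


# First example above: a single Allow rule for a folder.
folder_only = build_parser("User-agent: *\nAllow: /folder/\n")

# Hypothetical variation (an assumption, not from the examples above):
# allow the folder, then block everything else.
folder_then_block = build_parser(
    "User-agent: *\nAllow: /folder/\nDisallow: /\n"
)

print(can_crawl(folder_only, "/folder/page.html"))        # True
print(can_crawl(folder_only, "/news.html"))               # True, nothing is disallowed
print(can_crawl(folder_then_block, "/folder/page.html"))  # True
print(can_crawl(folder_then_block, "/news.html"))         # False
```

In the hypothetical variation the Allow line is placed before the Disallow line on purpose. Python's parser applies the first rule that matches, while some crawlers pick the most specific match instead, and this ordering gives the same result under either approach.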