Directives are simple instructions for bots trying to access to your domain, reading these directives bots know when and how to visit the pages contained in your website.
Each directive has the following format:
<directive> ':' [<whitespace>] <data> [<whitespace>] [<comment>] <end-of-line>
where <data>
depend upon the directive and items in "[" and "]" are optional.
Unless otherwise noted, each directive can appear more than once in a given
rule set.
Interfaces
Internally there are some interfaces to use:
DirectiveInterface
General directives interface.
This interface must be implemented to allow the directive to be considered by the parser.
PathDirectiveInterface
Path based directives, currently only Allow
and Disallow
.
StartOfGroupInterface
Interface for directives starting a group. Generally these directives are also group members.
Currently it's implemented only by the UserAgent
directive.
GroupMemberInterface
Directives to be grouped.
NonGroupInterface
Directives indipendent from a group, like Sitemap
or Host
.
These are considered part of the robots.txt but not part of a single ruleset.
Supported Directives
Currently supported directives are:
User-Agent
This directive defines the robot the following rules applies to (e.g. Googlebot).
Allow
This directive (if included) and the Disallow
directive are to be processed in
the order they appear in the rule set.
This is to simplify the processing, avoid ambiguity and allow more control over
what is and isn't allowed.
If a URL is not covered by any allow or disallow rules, then the URL is to be allowed.
Disallow
This directive and the Allow: directive (if included) are to be processed in the order they appear in the rule set. This is to simplify the processing, avoid ambiguity and allow more control over what is and isn't allowed.
If a URL is not covered by any allow or disallow rules, then the URL is to be allowed.
The directives above shouldn't be used directly, use them from the Rulesets
:
if ($rulesets->isUserAgentAllowed('my-bot', '/path')) {
// access allowed
} else {
// access denied
}
Sitemap
This directive is independent from the User-Agent, so it doesn't matter where it is placed in the file.
A Sitemap index file can be included the location of just that file.
More than one Sitemap
could be specified in the robots.txt file.
To obtain all the sitemaps, just call:
foreach ($rulesets->getSitemaps() as $sitemap) {
// $sitemap instanceof \Bisarca\RobotsTxt\Directive\Sitemap
echo $sitemap->getValue(); // http://www.example.com/sitemap.xml
}
Host
The Host
directive is used to define the main domain the bot should follow.
This directive does not guarantee that the specified main domain will be selected.
For every robots.txt file, only one Host directive is processed. If several directives are indicated in the file, the robot will use the first one.
if ($rulesets->hasHost()) {
$host = $rulesets->getHost(); // instanceof \Bisarca\RobotsTxt\Directive\Host
}
Comment
These are comments that the robot is encouraged to send back to the author/user
of the robot. All Comment
's in a rule set are to be sent back (at least,
that's the intention). This can be used to explain the robot policy of a site
(say, that one government site that hates robots).
To get the comments from a Ruleset
there's the method getComments
that
returns a Generator
:
foreach ($ruleset->getComments() as $comment) {
echo $comment; // will write "Command: lorem ipsum"
}
To obtain all the comments contained in the robots.txt you just need to iterate
over all the Ruleset
s:
foreach ($rulesets as $ruleset) {
foreach ($ruleset->getComments() as $comment) {
echo $comment; // will write "Command: lorem ipsum"
}
}