The bisarca/robots-txt library models and manages robots.txt files and their directives.
It follows the common conventions (a de facto standard) describing how robots should interact with a domain's pages: a robots.txt file lists a set of rules robots should follow, such as which paths may be visited, which robots may visit a page, and when a page may be visited.
It's based on documents available online.
Usage
The goal of this library is to make interacting with robots.txt rules easier while preserving the original content.
It's based on a Parser, one or more sets of directives (each generally started by a User-Agent directive), and a collection of these sets, called Rulesets, which represents the entire robots.txt.
use Bisarca\RobotsTxt\Parser;
$parser = new Parser();
$content = file_get_contents('http://www.example.com/robots.txt');
$rulesets = $parser->parse($content); // instanceof \Bisarca\RobotsTxt\Rulesets
// is my-bot allowed to access /path?
$allowed = $rulesets->isUserAgentAllowed('my-bot', '/path');
// is my-bot allowed to access the site root (/)?
$allowed = $rulesets->isUserAgentAllowed('my-bot');
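For intuition, here is a minimal, library-free sketch of the prefix-matching rule commonly used to evaluate Allow/Disallow directives (the longest matching prefix wins; no match means the path is allowed). The `isPathAllowed` helper is hypothetical and is not the library's implementation:

```php
<?php
// Hypothetical helper illustrating the common robots.txt evaluation rule:
// the longest matching Allow/Disallow prefix wins; with no match, access
// is allowed by default. NOT the bisarca/robots-txt implementation.
function isPathAllowed(array $rules, string $path): bool
{
    $bestLength = -1;
    $allowed = true; // no matching rule: allowed by default

    foreach ($rules as [$type, $prefix]) {
        if ($prefix === '' || strpos($path, $prefix) === 0) {
            $length = strlen($prefix);
            if ($length > $bestLength) {
                // a longer (more specific) matching prefix takes precedence
                $bestLength = $length;
                $allowed = ($type === 'allow');
            }
        }
    }

    return $allowed;
}

$rules = [
    ['disallow', '/private'],
    ['allow', '/private/public-page'],
];

var_dump(isPathAllowed($rules, '/private/secret'));      // bool(false)
var_dump(isPathAllowed($rules, '/private/public-page')); // bool(true)
var_dump(isPathAllowed($rules, '/other'));               // bool(true)
```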
This Rulesets instance contains zero or more Ruleset instances, each of which represents a set of Directive objects.
Finally, the Rulesets can be built back into a clean version of the originally parsed robots.txt using the Builder:
use Bisarca\RobotsTxt\Builder;
// $rulesets instanceof \Bisarca\RobotsTxt\Rulesets
$builder = new Builder();
$content = $builder->build($rulesets);
file_put_contents('/path/to/your/public/robots.txt', $content);
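Rebuilding a clean robots.txt from grouped rules is essentially serialization. As an illustration only, here is a hypothetical `buildRobotsTxt` function over plain arrays (not the library's Builder class):

```php
<?php
// Hypothetical sketch of serializing grouped rulesets back into robots.txt
// text; NOT the bisarca/robots-txt Builder, just an illustration.
function buildRobotsTxt(array $rulesets): string
{
    $blocks = [];

    foreach ($rulesets as $ruleset) {
        $lines = ['User-agent: ' . $ruleset['user-agent']];
        foreach ($ruleset['directives'] as [$field, $value]) {
            // normalize the field name, e.g. "disallow" -> "Disallow"
            $lines[] = ucfirst($field) . ': ' . $value;
        }
        $blocks[] = implode("\n", $lines);
    }

    // records are separated by a blank line
    return implode("\n\n", $blocks) . "\n";
}

$rulesets = [
    ['user-agent' => '*', 'directives' => [['disallow', '/private']]],
];
echo buildRobotsTxt($rulesets);
// User-agent: *
// Disallow: /private
```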