Crawler
implements Countable, IteratorAggregate
Crawler eases navigation of a list of \DOMNode objects.
Table of Contents
Interfaces
- Countable
- IteratorAggregate
Properties
- $uri : string|null
- $baseHref : string|null
- The base href value.
- $defaultNamespacePrefix : string
- The default namespace prefix to be used with XPath and CSS expressions.
- $document : DOMDocument|null
- $html5Parser : HTML5|null
- $isHtml : bool
- Whether the Crawler contains HTML or XML content (used when converting CSS to XPath).
- $namespaces : array<string, string>
- A map of manually registered namespaces.
- $nodes : array<string|int, DOMNode>
Methods
- __construct() : mixed
- add() : mixed
- Adds a node to the current list of nodes.
- addContent() : mixed
- Adds HTML/XML content.
- addDocument() : mixed
- Adds a \DOMDocument to the list of nodes.
- addHtmlContent() : mixed
- Adds HTML content to the list of nodes.
- addNode() : mixed
- Adds a \DOMNode instance to the list of nodes.
- addNodeList() : mixed
- Adds a \DOMNodeList to the list of nodes.
- addNodes() : mixed
- Adds an array of \DOMNode instances to the list of nodes.
- addXmlContent() : mixed
- Adds XML content to the list of nodes.
- attr() : string|null
- Returns the attribute value of the first node of the list.
- children() : static
- Returns the child nodes of the current selection.
- clear() : mixed
- Removes all the nodes.
- closest() : self|null
- Returns the first parent (heading toward the document root) of the element that matches the provided selector.
- count() : int
- each() : array<string|int, mixed>
- Calls an anonymous function on each node of the list.
- eq() : static
- Returns a node given its position in the node list.
- evaluate() : array<string|int, mixed>|Crawler
- Evaluates an XPath expression.
- extract() : array<string|int, mixed>
- Extracts information from the list of nodes.
- filter() : static
- Filters the list of nodes with a CSS selector.
- filterXPath() : static
- Filters the list of nodes with an XPath expression.
- first() : static
- Returns the first node of the current selection.
- form() : Form
- Returns a Form object for the first node in the list.
- getBaseHref() : string|null
- Returns base href.
- getIterator() : ArrayIterator|array<string|int, DOMNode>
- getNode() : DOMNode|null
- getUri() : string|null
- Returns the current URI.
- html() : string
- Returns the first node of the list as HTML.
- image() : Image
- Returns an Image object for the first node in the list.
- images() : array<string|int, Image>
- Returns an array of Image objects for the nodes in the list.
- last() : static
- Returns the last node of the current selection.
- link() : Link
- Returns a Link object for the first node in the list.
- links() : array<string|int, Link>
- Returns an array of Link objects for the nodes in the list.
- matches() : bool
- nextAll() : static
- Returns the next sibling nodes of the current selection.
- nodeName() : string
- Returns the node name of the first node of the list.
- outerHtml() : string
- parents() : static
- Returns the parent nodes of the current selection.
- previousAll() : static
- Returns the previous sibling nodes of the current selection.
- reduce() : static
- Reduces the list of nodes by calling an anonymous function.
- registerNamespace() : mixed
- selectButton() : static
- Selects a button by name or alt value for images.
- selectImage() : static
- Selects images by alt value.
- selectLink() : static
- Selects links by name or alt value for clickable images.
- setDefaultNamespacePrefix() : mixed
- Overrides the default namespace prefix to be used with XPath and CSS expressions.
- siblings() : static
- Returns the sibling nodes of the current selection.
- slice() : static
- Slices the list of nodes by $offset and $length.
- text() : string
- Returns the text of the first node of the list.
- xpathLiteral() : string
- Converts a string for use in XPath expressions.
- sibling() : array<string|int, mixed>
- canParseHtml5String() : bool
- convertToHtmlEntities() : string
- Converts charset to HTML-entities to ensure valid parsing.
- createCssSelectorConverter() : CssSelectorConverter
- createDOMXPath() : DOMXPath
- createSubCrawler() : static
- Creates a crawler for some subnodes.
- discoverNamespace() : string|null
- filterRelativeXPath() : static
- Filters the list of nodes with an XPath expression.
- findNamespacePrefixes() : array<string|int, mixed>
- isValidHtml5Heading() : bool
- parseHtml5() : DOMDocument
- parseHtmlString() : DOMDocument
- Parses a string into a DOMDocument object, using the HTML5 parser if the content is HTML5 and the library is available.
- parseXhtml() : DOMDocument
- relativize() : string
- Makes the XPath relative to the current context.
Properties
$uri
protected
string|null
$uri
$baseHref
The base href value.
private
string|null
$baseHref
$defaultNamespacePrefix
The default namespace prefix to be used with XPath and CSS expressions.
private
string
$defaultNamespacePrefix
= 'default'
$document
private
DOMDocument|null
$document
$html5Parser
private
HTML5|null
$html5Parser
$isHtml
Whether the Crawler contains HTML or XML content (used when converting CSS to XPath).
private
bool
$isHtml
= true
$namespaces
A map of manually registered namespaces.
private
array<string, string>
$namespaces
= []
$nodes
private
array<string|int, DOMNode>
$nodes
= []
Methods
__construct()
public
__construct([DOMNodeList|DOMNode|array<string|int, DOMNode>|string|null $node = null ][, string $uri = null ][, string $baseHref = null ]) : mixed
Parameters
- $node : DOMNodeList|DOMNode|array<string|int, DOMNode>|string|null = null
-
A Node to use as the base for the crawling
- $uri : string = null
- $baseHref : string = null
add()
Adds a node to the current list of nodes.
public
add(DOMNodeList|DOMNode|array<string|int, DOMNode>|string|null $node) : mixed
This method uses the appropriate specialized add*() method based on the type of the argument.
Parameters
- $node : DOMNodeList|DOMNode|array<string|int, DOMNode>|string|null
-
A node
addContent()
Adds HTML/XML content.
public
addContent(string $content[, string|null $type = null ]) : mixed
If the charset is not set via the content type, it is assumed to be UTF-8, or ISO-8859-1 as a fallback, which is the default charset defined by the HTTP 1.1 specification.
Parameters
- $content : string
-
A string to parse as HTML/XML
- $type : string|null = null
-
The content type of the string
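As a rough sketch of how the content type steers parsing: the example below assumes symfony/dom-crawler was installed with Composer and is autoloaded from vendor/autoload.php, and the sample markup is made up for illustration.

```php
<?php
// Assumption: symfony/dom-crawler was installed with Composer.
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

$markup = '<root><item>1</item></root>'; // made-up sample content

// An XML content type sends the string to the XML parser,
// so the document element is kept as written.
$asXml = new Crawler();
$asXml->addContent($markup, 'application/xml');

// An HTML content type sends it to the HTML parser,
// which wraps the fragment in an <html> document.
$asHtml = new Crawler();
$asHtml->addContent($markup, 'text/html; charset=UTF-8');

// A type that is neither HTML nor XML is ignored: no nodes are added.
$asText = new Crawler();
$asText->addContent($markup, 'text/plain');

echo $asXml->nodeName(), "\n";
echo $asHtml->nodeName(), "\n";
echo count($asText), "\n";
```

Passing the charset inside the content type, as in the second call, avoids relying on the UTF-8 / ISO-8859-1 fallback described above.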
addDocument()
Adds a \DOMDocument to the list of nodes.
public
addDocument(DOMDocument $dom) : mixed
Parameters
- $dom : DOMDocument
-
A \DOMDocument instance
addHtmlContent()
Adds HTML content to the list of nodes.
public
addHtmlContent(string $content[, string $charset = 'UTF-8' ]) : mixed
The libxml errors are disabled when the content is parsed.
If you want to get parsing errors, enable internal errors via libxml_use_internal_errors(true), then get the errors via libxml_get_errors(). Be sure to clear the errors with libxml_clear_errors() afterward.
Parameters
- $content : string
-
The HTML content
- $charset : string = 'UTF-8'
-
The charset
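The error-collection steps described above might look like the following sketch. It assumes symfony/dom-crawler is autoloaded from vendor/autoload.php; the malformed markup is invented, and whether libxml reports anything for it depends on the installed libxml version.

```php
<?php
// Assumption: symfony/dom-crawler was installed with Composer.
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Collect libxml errors in an internal buffer instead of discarding them.
$previous = libxml_use_internal_errors(true);

$crawler = new Crawler();
// Made-up markup with an unclosed <b> and an unknown <foo> element.
$crawler->addHtmlContent('<html><body><p>Unclosed <b>bold</p><foo>x</foo></body></html>');

// Read whatever the parser reported (possibly nothing).
$errors = libxml_get_errors();
foreach ($errors as $error) {
    printf("line %d: %s\n", $error->line, trim($error->message));
}

// Empty the buffer and restore the previous error-handling mode.
libxml_clear_errors();
libxml_use_internal_errors($previous);
```

Restoring the previous mode at the end keeps the change from leaking into unrelated code that also uses libxml.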
addNode()
Adds a \DOMNode instance to the list of nodes.
public
addNode(DOMNode $node) : mixed
Parameters
- $node : DOMNode
-
A \DOMNode instance
addNodeList()
Adds a \DOMNodeList to the list of nodes.
public
addNodeList(DOMNodeList $nodes) : mixed
Parameters
- $nodes : DOMNodeList
-
A \DOMNodeList instance
addNodes()
Adds an array of \DOMNode instances to the list of nodes.
public
addNodes(array<string|int, DOMNode> $nodes) : mixed
Parameters
- $nodes : array<string|int, DOMNode>
-
An array of \DOMNode instances
addXmlContent()
Adds XML content to the list of nodes.
public
addXmlContent(string $content[, string $charset = 'UTF-8' ][, int $options = LIBXML_NONET ]) : mixed
The libxml errors are disabled when the content is parsed.
If you want to get parsing errors, enable internal errors via libxml_use_internal_errors(true), then get the errors via libxml_get_errors(). Be sure to clear the errors with libxml_clear_errors() afterward.
Parameters
- $content : string
-
The XML content
- $charset : string = 'UTF-8'
-
The charset
- $options : int = LIBXML_NONET
-
Bitwise OR of the libxml option constants. LIBXML_PARSEHUGE is dangerous; see http://symfony.com/blog/security-release-symfony-2-0-17-released
attr()
Returns the attribute value of the first node of the list.
public
attr(string $attribute) : string|null
Parameters
- $attribute : string
-
The attribute name
Return values
string|null —The attribute value or null if the attribute does not exist
children()
Returns the child nodes of the current selection.
public
children() : static
Return values
static
clear()
Removes all the nodes.
public
clear() : mixed
closest()
Returns the first parent (heading toward the document root) of the element that matches the provided selector.
public
closest(string $selector) : self|null
Parameters
- $selector : string
Return values
self|null
count()
public
count() : int
Return values
int
each()
Calls an anonymous function on each node of the list.
public
each(Closure $closure) : array<string|int, mixed>
The anonymous function receives the position and the node wrapped in a Crawler instance as arguments.
Example:
$crawler->filter('h1')->each(function ($node, $i) {
return $node->text();
});
Parameters
- $closure : Closure
-
An anonymous function
Return values
array<string|int, mixed> —An array of values returned by the anonymous function
eq()
Returns a node given its position in the node list.
public
eq(int $position) : static
Parameters
- $position : int
-
The position
Return values
static
evaluate()
Evaluates an XPath expression.
public
evaluate(string $xpath) : array<string|int, mixed>|Crawler
Since an XPath expression might evaluate to either a simple type or a \DOMNodeList, this method will return either an array of simple types or a new Crawler instance.
Parameters
- $xpath : string
-
An XPath expression
Return values
array<string|int, mixed>|Crawler —An array of evaluation results or a new Crawler instance
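To illustrate the two possible return types, here is a small sketch. It assumes symfony/dom-crawler is autoloaded from vendor/autoload.php, and the list markup is made up.

```php
<?php
// Assumption: symfony/dom-crawler was installed with Composer.
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Made-up sample document; the crawler holds its <html> element.
$crawler = new Crawler('<html><body><ul><li>a</li><li>b</li><li>c</li></ul></body></html>');

// A node-set expression evaluates to a \DOMNodeList,
// so the result is wrapped in a new Crawler.
$items = $crawler->evaluate('//li');

// A scalar expression evaluates to a simple type,
// so the result is an array with one entry per node in the selection.
$counts = $crawler->evaluate('count(//li)');

var_dump($items instanceof Crawler);
var_dump($counts);
```

Because the return type depends on the expression, callers that accept arbitrary XPath may want to check the result with instanceof before using it.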
extract()
Extracts information from the list of nodes.
public
extract(array<string|int, mixed> $attributes) : array<string|int, mixed>
You can extract attributes and/or the node value (_text).
Example:
$crawler->filter('h1 a')->extract(['_text', 'href']);
Parameters
- $attributes : array<string|int, mixed>
-
An array of attributes
Return values
array<string|int, mixed> —An array of extracted values
filter()
Filters the list of nodes with a CSS selector.
public
filter(string $selector) : static
This method only works if you have installed the CssSelector Symfony Component.
Parameters
- $selector : string
-
A CSS selector
Return values
static
filterXPath()
Filters the list of nodes with an XPath expression.
public
filterXPath(string $xpath) : static
The XPath expression is evaluated in the context of the crawler, which is considered as a fake parent of the elements inside it. This means that a child selector "div" or "./div" will match only the div elements of the current crawler, not their children.
Parameters
- $xpath : string
-
An XPath expression
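The "fake parent" behaviour can be seen in a short sketch like the one below. It assumes symfony/dom-crawler is autoloaded from vendor/autoload.php; the nested div markup and its id values are invented for the example.

```php
<?php
// Assumption: symfony/dom-crawler was installed with Composer.
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Made-up markup: one div nested inside another.
$html = '<html><body><div id="outer"><div id="inner">x</div></div></body></html>';
$crawler = new Crawler($html);

// The selection now contains only the outer div.
$outer = $crawler->filterXPath('//div[@id="outer"]');

// "div" is evaluated as a child of the fake parent, so it matches
// the outer div itself rather than the div nested inside it.
$self = $outer->filterXPath('div');

// One more step is needed to reach the nested div.
$nested = $outer->filterXPath('div/div');

echo $self->attr('id'), "\n";
echo $nested->attr('id'), "\n";
```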
Return values
static
first()
Returns the first node of the current selection.
public
first() : static
Return values
static
form()
Returns a Form object for the first node in the list.
public
form([array<string|int, mixed> $values = null ][, string $method = null ]) : Form
Parameters
- $values : array<string|int, mixed> = null
-
An array of values for the form fields
- $method : string = null
-
The method for the form
Return values
Form —A Form instance
getBaseHref()
Returns base href.
public
getBaseHref() : string|null
Return values
string|null
getIterator()
public
getIterator() : ArrayIterator|array<string|int, DOMNode>
Return values
ArrayIterator|array<string|int, DOMNode>
getNode()
public
getNode(int $position) : DOMNode|null
Parameters
- $position : int
Return values
DOMNode|null
getUri()
Returns the current URI.
public
getUri() : string|null
Return values
string|null
html()
Returns the first node of the list as HTML.
public
html() : string
Return values
string —The node html
image()
Returns an Image object for the first node in the list.
public
image() : Image
Return values
Image —An Image instance
images()
Returns an array of Image objects for the nodes in the list.
public
images() : array<string|int, Image>
Return values
array<string|int, Image> —An array of Image instances
last()
Returns the last node of the current selection.
public
last() : static
Return values
static
link()
Returns a Link object for the first node in the list.
public
link([string $method = 'get' ]) : Link
Parameters
- $method : string = 'get'
-
The method for the link (get by default)
Return values
Link —A Link instance
links()
Returns an array of Link objects for the nodes in the list.
public
links() : array<string|int, Link>
Return values
array<string|int, Link> —An array of Link instances
matches()
public
matches(string $selector) : bool
Parameters
- $selector : string
Return values
bool
nextAll()
Returns the next sibling nodes of the current selection.
public
nextAll() : static
Return values
static
nodeName()
Returns the node name of the first node of the list.
public
nodeName() : string
Return values
string —The node name
outerHtml()
public
outerHtml() : string
Return values
string
parents()
Returns the parent nodes of the current selection.
public
parents() : static
Return values
static
previousAll()
Returns the previous sibling nodes of the current selection.
public
previousAll() : static
Return values
static
reduce()
Reduces the list of nodes by calling an anonymous function.
public
reduce(Closure $closure) : static
To remove a node from the list, the anonymous function must return false.
Parameters
- $closure : Closure
-
An anonymous function
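A minimal sketch of the filtering rule, assuming symfony/dom-crawler is autoloaded from vendor/autoload.php; the paragraph markup is made up, and the closure is assumed to receive the same (node, position) arguments as in each().

```php
<?php
// Assumption: symfony/dom-crawler was installed with Composer.
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Made-up sample document with four paragraphs.
$crawler = new Crawler('<html><body><p>one</p><p>two</p><p>three</p><p>four</p></body></html>');
$paragraphs = $crawler->filterXPath('//p');

// Returning false drops the node, so only positions 0 and 2 are kept.
$kept = $paragraphs->reduce(function (Crawler $node, $i) {
    return 0 === $i % 2;
});

echo count($kept), "\n";
```

The original selection is left as it was; reduce() returns a new Crawler holding the surviving nodes.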
Return values
static
registerNamespace()
public
registerNamespace(string $prefix, string $namespace) : mixed
Parameters
- $prefix : string
- $namespace : string
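As a hedged illustration of registering a prefix by hand: the sketch assumes symfony/dom-crawler is autoloaded from vendor/autoload.php, and both the feed markup and the namespace URI http://example.com/media are invented.

```php
<?php
// Assumption: symfony/dom-crawler was installed with Composer.
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Made-up XML document that declares the namespace under the prefix "m".
$xml = '<?xml version="1.0"?>'
    . '<feed xmlns:m="http://example.com/media"><m:title>Hello</m:title></feed>';
$crawler = new Crawler($xml);

// Bind our own prefix "media" to the same namespace URI.
$crawler->registerNamespace('media', 'http://example.com/media');

// The manually registered prefix can now be used in XPath expressions.
$title = $crawler->filterXPath('//media:title');

echo $title->getNode(0)->textContent, "\n";
```

The prefix used in the query does not have to match the one in the document; only the namespace URI has to agree.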
selectButton()
Selects a button by name or alt value for images.
public
selectButton(string $value) : static
Parameters
- $value : string
-
The button text
Return values
static
selectImage()
Selects images by alt value.
public
selectImage(string $value) : static
Parameters
- $value : string
-
The image alt
Return values
static —A new instance of Crawler with the filtered list of nodes
selectLink()
Selects links by name or alt value for clickable images.
public
selectLink(string $value) : static
Parameters
- $value : string
-
The link text
Return values
static
setDefaultNamespacePrefix()
Overrides the default namespace prefix to be used with XPath and CSS expressions.
public
setDefaultNamespacePrefix(string $prefix) : mixed
Parameters
- $prefix : string
siblings()
Returns the sibling nodes of the current selection.
public
siblings() : static
Return values
static
slice()
Slices the list of nodes by $offset and $length.
public
slice([int $offset = 0 ][, int $length = null ]) : static
Parameters
- $offset : int = 0
- $length : int = null
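A short sketch of how $offset and $length select part of a node list, assuming symfony/dom-crawler is autoloaded from vendor/autoload.php; the list markup is made up.

```php
<?php
// Assumption: symfony/dom-crawler was installed with Composer.
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Made-up sample document with four list items.
$crawler = new Crawler('<html><body><ul><li>a</li><li>b</li><li>c</li><li>d</li></ul></body></html>');
$items = $crawler->filterXPath('//li');

// Skip the first node, then keep the next two.
$middle = $items->slice(1, 2);

// Without a length, everything from the offset onward is kept.
$tail = $items->slice(2);

echo count($middle), ' ', count($tail), "\n";
```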
Return values
static
text()
Returns the text of the first node of the list.
public
text() : string
Pass true as the second argument to normalize whitespace.
Return values
string —The node value
xpathLiteral()
Converts a string for use in XPath expressions.
public
static xpathLiteral(string $s) : string
Escaped characters are: quotes (") and apostrophes (').
Examples:
echo Crawler::xpathLiteral('foo " bar'); // prints 'foo " bar'
echo Crawler::xpathLiteral("foo ' bar"); // prints "foo ' bar"
echo Crawler::xpathLiteral('a\'b"c'); // prints concat('a', "'", 'b"c')
Parameters
- $s : string
-
String to be escaped
Return values
string —Converted string
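One place the converted string may be useful is when a value containing both kinds of quotes has to be embedded in an XPath expression. The sketch below assumes symfony/dom-crawler is autoloaded from vendor/autoload.php; the link markup and label are made up.

```php
<?php
// Assumption: symfony/dom-crawler was installed with Composer.
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Made-up document; the first link text contains both ' and ".
$crawler = new Crawler(
    '<html><body><a href="/a">It\'s "here"</a><a href="/b">Other</a></body></html>'
);

// A label that could not be wrapped safely in either quote style.
$label = 'It\'s "here"';

// xpathLiteral() turns the label into a valid XPath string expression,
// which can then be concatenated into the query.
$xpath = '//a[text()=' . Crawler::xpathLiteral($label) . ']';
$link = $crawler->filterXPath($xpath);

echo $link->attr('href'), "\n";
```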
sibling()
protected
sibling(DOMElement $node[, string $siblingDir = 'nextSibling' ]) : array<string|int, mixed>
Parameters
- $node : DOMElement
- $siblingDir : string = 'nextSibling'
Return values
array<string|int, mixed>
canParseHtml5String()
private
canParseHtml5String(string $content) : bool
Parameters
- $content : string
Return values
bool
convertToHtmlEntities()
Converts charset to HTML-entities to ensure valid parsing.
private
convertToHtmlEntities(string $htmlContent[, string $charset = 'UTF-8' ]) : string
Parameters
- $htmlContent : string
- $charset : string = 'UTF-8'
Return values
string
createCssSelectorConverter()
private
createCssSelectorConverter() : CssSelectorConverter
Return values
CssSelectorConverter
createDOMXPath()
private
createDOMXPath(DOMDocument $document[, array<string|int, mixed> $prefixes = [] ]) : DOMXPath
Parameters
- $document : DOMDocument
- $prefixes : array<string|int, mixed> = []
Return values
DOMXPath
createSubCrawler()
Creates a crawler for some subnodes.
private
createSubCrawler(DOMNodeList|DOMNode|array<string|int, DOMNode>|string|null $nodes) : static
Parameters
- $nodes : DOMNodeList|DOMNode|array<string|int, DOMNode>|string|null
Return values
static
discoverNamespace()
private
discoverNamespace(DOMXPath $domxpath, string $prefix) : string|null
Parameters
- $domxpath : DOMXPath
- $prefix : string
Return values
string|null
filterRelativeXPath()
Filters the list of nodes with an XPath expression.
private
filterRelativeXPath(string $xpath) : static
The XPath expression should already be processed to apply it in the context of each node.
Parameters
- $xpath : string
Return values
static
findNamespacePrefixes()
private
findNamespacePrefixes(string $xpath) : array<string|int, mixed>
Parameters
- $xpath : string
Return values
array<string|int, mixed>
isValidHtml5Heading()
private
isValidHtml5Heading(string $heading) : bool
Parameters
- $heading : string
Return values
bool
parseHtml5()
private
parseHtml5(string $htmlContent[, string $charset = 'UTF-8' ]) : DOMDocument
Parameters
- $htmlContent : string
- $charset : string = 'UTF-8'
Return values
DOMDocument
parseHtmlString()
Parses a string into a DOMDocument object, using the HTML5 parser if the content is HTML5 and the library is available.
private
parseHtmlString(string $content, string $charset) : DOMDocument
Use libxml parser otherwise.
Parameters
- $content : string
- $charset : string
Return values
DOMDocument
parseXhtml()
private
parseXhtml(string $htmlContent[, string $charset = 'UTF-8' ]) : DOMDocument
Parameters
- $htmlContent : string
- $charset : string = 'UTF-8'
Return values
DOMDocument
relativize()
Makes the XPath relative to the current context.
private
relativize(string $xpath) : string
The returned XPath will match elements matching the XPath inside the current crawler when running in the context of a node of the crawler.
Parameters
- $xpath : string