HumHub Documentation (unofficial)

Crawler

in package Application

implements Countable, IteratorAggregate

Crawler eases navigation of a list of \DOMNode objects.

Interfaces

Countable
IteratorAggregate

Properties

$uri : string|null
$baseHref : string|null: The base href value.
$defaultNamespacePrefix : string: The default namespace prefix to be used with XPath and CSS expressions.
$document : DOMDocument|null
$html5Parser : HTML5|null
$isHtml : bool: Whether the Crawler contains HTML or XML content (used when converting CSS to XPath).
$namespaces : array<string, string>: A map of manually registered namespaces.
$nodes : array<string|int, DOMNode>

Methods

__construct() : mixed
add() : mixed: Adds a node to the current list of nodes.
addContent() : mixed: Adds HTML/XML content.
addDocument() : mixed: Adds a \DOMDocument to the list of nodes.
addHtmlContent() : mixed: Adds HTML content to the list of nodes.
addNode() : mixed: Adds a \DOMNode instance to the list of nodes.
addNodeList() : mixed: Adds a \DOMNodeList to the list of nodes.
addNodes() : mixed: Adds an array of \DOMNode instances to the list of nodes.
addXmlContent() : mixed: Adds XML content to the list of nodes.
attr() : string|null: Returns the attribute value of the first node of the list.
children() : static: Returns the child nodes of the current selection.
clear() : mixed: Removes all the nodes.
closest() : self|null: Returns the first parent (heading toward the document root) of the element that matches the provided selector.
count() : int
each() : array<string|int, mixed>: Calls an anonymous function on each node of the list.
eq() : static: Returns a node given its position in the node list.
evaluate() : array<string|int, mixed>|Crawler: Evaluates an XPath expression.
extract() : array<string|int, mixed>: Extracts information from the list of nodes.
filter() : static: Filters the list of nodes with a CSS selector.
filterXPath() : static: Filters the list of nodes with an XPath expression.
first() : static: Returns the first node of the current selection.
form() : Form: Returns a Form object for the first node in the list.
getBaseHref() : string|null: Returns the base href.
getIterator() : ArrayIterator|array<string|int, DOMNode>
getNode() : DOMNode|null
getUri() : string|null: Returns the current URI.
html() : string: Returns the first node of the list as HTML.
image() : Image: Returns an Image object for the first node in the list.
images() : array<string|int, Image>: Returns an array of Image objects for the nodes in the list.
last() : static: Returns the last node of the current selection.
link() : Link: Returns a Link object for the first node in the list.
links() : array<string|int, Link>: Returns an array of Link objects for the nodes in the list.
matches() : bool
nextAll() : static: Returns the next sibling nodes of the current selection.
nodeName() : string: Returns the node name of the first node of the list.
outerHtml() : string
parents() : static: Returns the parent nodes of the current selection.
previousAll() : static: Returns the previous sibling nodes of the current selection.
reduce() : static: Reduces the list of nodes by calling an anonymous function.
registerNamespace() : mixed
selectButton() : static: Selects a button by name or alt value for images.
selectImage() : static: Selects images by alt value.
selectLink() : static: Selects links by name or alt value for clickable images.
setDefaultNamespacePrefix() : mixed: Overloads a default namespace prefix to be used with XPath and CSS expressions.
siblings() : static: Returns the sibling nodes of the current selection.
slice() : static: Slices the list of nodes by $offset and $length.
text() : string: Returns the text of the first node of the list.
xpathLiteral() : string: Converts a string into a literal that is safe to use in XPath expressions.
sibling() : array<string|int, mixed>
canParseHtml5String() : bool
convertToHtmlEntities() : string: Converts charset to HTML-entities to ensure valid parsing.
createCssSelectorConverter() : CssSelectorConverter
createDOMXPath() : DOMXPath
createSubCrawler() : static: Creates a crawler for some subnodes.
discoverNamespace() : string|null
filterRelativeXPath() : static: Filters the list of nodes with an XPath expression.
findNamespacePrefixes() : array<string|int, mixed>
isValidHtml5Heading() : bool
parseHtml5() : DOMDocument
parseHtmlString() : DOMDocument: Parses a string into a DOMDocument object using the HTML5 parser if the content is HTML5 and the library is available.
parseXhtml() : DOMDocument
relativize() : string: Makes the XPath relative to the current context.

$uri


    protected string|null $uri

$baseHref

The base href value.


    private string|null $baseHref

$defaultNamespacePrefix

The default namespace prefix to be used with XPath and CSS expressions.


    private string $defaultNamespacePrefix = 'default'

$document


    private DOMDocument|null $document

$html5Parser


    private HTML5|null $html5Parser

$isHtml

Whether the Crawler contains HTML or XML content (used when converting CSS to XPath).


    private bool $isHtml = true

$namespaces

A map of manually registered namespaces.


    private array<string, string> $namespaces = []

$nodes


    private array<string|int, DOMNode> $nodes = []

__construct()


    public
                    __construct([DOMNodeList|DOMNode|array<string|int, DOMNode>|string|null $node = null ][, string $uri = null ][, string $baseHref = null ]) : mixed

Parameters

$node : DOMNodeList|DOMNode|array<string|int, DOMNode>|string|null = null: A Node to use as the base for the crawling
$uri : string = null
$baseHref : string = null
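
A minimal sketch of constructing a crawler and using the two interfaces it implements. It assumes that this page documents Symfony's DomCrawler component (`Symfony\Component\DomCrawler\Crawler`) and that the package is installed through Composer; neither the namespace nor the autoload path is stated on this page, and the sample HTML and URI are made up for illustration.

```php
<?php
// Assumption: symfony/dom-crawler is installed via Composer and this page
// documents Symfony\Component\DomCrawler\Crawler.
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

$html = '<html><body><p>First</p><p>Second</p></body></html>';

// $uri is optional here; it matters when links or forms must be resolved.
$crawler = new Crawler($html, 'http://example.com/');

// Countable: a string passed to the constructor is parsed and its root
// element becomes the single node of the list.
$rootCount = count($crawler);

// IteratorAggregate: iterating yields the underlying DOMNode objects.
$names = [];
foreach ($crawler as $domNode) {
    $names[] = $domNode->nodeName;
}

echo $rootCount, ' ', implode(',', $names), PHP_EOL;
```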

add()

Adds a node to the current list of nodes.


    public
                    add(DOMNodeList|DOMNode|array<string|int, DOMNode>|string|null $node) : mixed

This method uses the appropriate specialized add*() method based on the type of the argument.

Parameters

$node : DOMNodeList|DOMNode|array<string|int, DOMNode>|string|null: A node

addContent()

Adds HTML/XML content.


    public
                    addContent(string $content[, string|null $type = null ]) : mixed

If the charset is not set via the content type, it is assumed to be UTF-8, or ISO-8859-1 as a fallback, which is the default charset defined by the HTTP 1.1 specification.

Parameters

$content : string: A string to parse as HTML/XML
$type : string|null = null: The content type of the string
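
The following sketch shows how the content type may steer parsing. It assumes the Symfony DomCrawler package is available through Composer; the sample strings are illustrative only. The first call passes an XML content type, the second declares a non-UTF-8 charset in the content type.

```php
<?php
// Assumption: symfony/dom-crawler is installed via Composer.
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// An XML content type hands the string to the XML parser.
$xmlCrawler = new Crawler();
$xmlCrawler->addContent('<root><item>a</item><item>b</item></root>', 'application/xml');
$itemCount = $xmlCrawler->filterXPath('//item')->count();

// "\xE9" is the single ISO-8859-1 byte for "e with acute accent".
// The charset named in the content type tells the crawler how to decode it.
$htmlCrawler = new Crawler();
$htmlCrawler->addContent('<p>caf' . "\xE9" . '</p>', 'text/html; charset=ISO-8859-1');
$text = $htmlCrawler->filterXPath('//p')->text();

echo $itemCount, ' ', $text, PHP_EOL;
```

DOM text is exposed as UTF-8, so `$text` should come back as a UTF-8 string regardless of the input charset.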

addDocument()

Adds a \DOMDocument to the list of nodes.


    public
                    addDocument(DOMDocument $dom) : mixed

Parameters

$dom : DOMDocument: A \DOMDocument instance

addHtmlContent()

Adds HTML content to the list of nodes.


    public
                    addHtmlContent(string $content[, string $charset = 'UTF-8' ]) : mixed

The libxml errors are disabled when the content is parsed.

If you want to get parsing errors, be sure to enable internal errors via libxml_use_internal_errors(true) and then, get the errors via libxml_get_errors(). Be sure to clear errors with libxml_clear_errors() afterward.

Parameters

$content : string: The HTML content
$charset : string = 'UTF-8': The charset
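
The error-handling advice above can be sketched as follows, assuming the Symfony DomCrawler package is installed through Composer. The `libxml_*` functions are PHP built-ins; the malformed markup is invented for illustration, and which messages libxml reports (if any) depends on the installed libxml version and on whether the content is routed through the HTML5 parser instead.

```php
<?php
// Assumption: symfony/dom-crawler is installed via Composer.
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Collect libxml errors internally instead of discarding them,
// remembering the previous setting so it can be restored.
$previous = libxml_use_internal_errors(true);
libxml_clear_errors();

$crawler = new Crawler();
$crawler->addHtmlContent('<p>Unclosed <b>tag</p><unknown-tag></unknown-tag>');

// Copy out whatever the parser reported.
$messages = [];
foreach (libxml_get_errors() as $error) {
    $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
}

// Clear the buffer and restore the caller's setting.
libxml_clear_errors();
libxml_use_internal_errors($previous);

print_r($messages);
```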

addNode()

Adds a \DOMNode instance to the list of nodes.


    public
                    addNode(DOMNode $node) : mixed

Parameters

$node : DOMNode: A \DOMNode instance

addNodeList()

Adds a \DOMNodeList to the list of nodes.


    public
                    addNodeList(DOMNodeList $nodes) : mixed

Parameters

$nodes : DOMNodeList: A \DOMNodeList instance

addNodes()

Adds an array of \DOMNode instances to the list of nodes.


    public
                    addNodes(array<string|int, DOMNode> $nodes) : mixed

Parameters

$nodes : array<string|int, DOMNode>: An array of \DOMNode instances

addXmlContent()

Adds XML content to the list of nodes.


    public
                    addXmlContent(string $content[, string $charset = 'UTF-8' ][, int $options = LIBXML_NONET ]) : mixed

The libxml errors are disabled when the content is parsed.

Parameters

$content : string: The XML content
$charset : string = 'UTF-8': The charset
$options : int = LIBXML_NONET: Bitwise OR of the libxml option constants. LIBXML_PARSEHUGE is dangerous; see http://symfony.com/blog/security-release-symfony-2-0-17-released

attr()

Returns the attribute value of the first node of the list.


    public
                    attr(string $attribute) : string|null

Parameters

$attribute : string: The attribute name

Return values

string|null —

The attribute value or null if the attribute does not exist
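
A short sketch of both outcomes, assuming the Symfony DomCrawler package is installed through Composer; the markup is illustrative. `filterXPath()` is used for selection so that the example does not depend on the CssSelector component.

```php
<?php
// Assumption: symfony/dom-crawler is installed via Composer.
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

$crawler = new Crawler(
    '<html><body><a href="/docs" title="Docs">Docs</a><a href="/blog">Blog</a></body></html>'
);
$links = $crawler->filterXPath('//a');

// attr() only looks at the first node of the list.
$href = $links->attr('href');

// The second link has no title attribute, so null is returned.
$missing = $links->eq(1)->attr('title');

var_dump($href, $missing);
```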

children()

Returns the child nodes of the current selection.


    public
                    children() : static

Return values

static

clear()

Removes all the nodes.


    public
                    clear() : mixed

closest()

Returns the first parent (heading toward the document root) of the element that matches the provided selector.


    public
                    closest(string $selector) : self|null

Parameters

$selector : string

Return values

self|null
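
A sketch of walking up the tree with `closest()`, assuming both the Symfony DomCrawler and CssSelector packages are installed through Composer (the selector is a CSS selector, so the CssSelector component is presumably required, as it is for `filter()`). The markup and class names are made up.

```php
<?php
// Assumption: symfony/dom-crawler and symfony/css-selector are installed.
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

$html = '<html><body><div class="card"><ul><li><span id="x">Item</span></li></ul></div></body></html>';
$crawler = new Crawler($html);

// Start from the innermost element.
$span = $crawler->filterXPath('//span[@id="x"]');

// Walks toward the document root until a node matches the selector.
$card = $span->closest('div.card');

// Nothing on the way to the root is a table, so null is returned.
$none = $span->closest('table');

var_dump($card !== null ? $card->nodeName() : null, $none);
```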

count()


    public
                    count() : int

#[ReturnTypeWillChange]

Return values

int

each()

Calls an anonymous function on each node of the list.


    public
                    each(Closure $closure) : array<string|int, mixed>

The anonymous function receives the position and the node wrapped in a Crawler instance as arguments.

Example:

$crawler->filter('h1')->each(function ($node, $i) {
    return $node->text();
});

Parameters

$closure : Closure: An anonymous function

Return values

array<string|int, mixed> —

An array of values returned by the anonymous function

eq()

Returns a node given its position in the node list.


    public
                    eq(int $position) : static

Parameters

$position : int: The position

Return values

static

evaluate()

Evaluates an XPath expression.


    public
                    evaluate(string $xpath) : array<string|int, mixed>|Crawler

Since an XPath expression might evaluate to either a simple type or a \DOMNodeList, this method will return either an array of simple types or a new Crawler instance.

Parameters

$xpath : string: An XPath expression

Return values

array<string|int, mixed>|Crawler —

An array of evaluation results or a new Crawler instance
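
The two return shapes can be illustrated with a small sketch, assuming the Symfony DomCrawler package is installed through Composer; the markup is illustrative. A scalar expression produces one array entry per node in the crawler, while a node-set expression produces a new Crawler.

```php
<?php
// Assumption: symfony/dom-crawler is installed via Composer.
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

$crawler = new Crawler('<html><body><ul><li>a</li><li>b</li><li>c</li></ul></body></html>');

// count() is a scalar XPath function: the result is an array holding
// one numeric value for the single root node of this crawler.
$counts = $crawler->evaluate('count(//li)');

// A node-set expression comes back wrapped in a new Crawler instance.
$items = $crawler->evaluate('//li');

var_dump((int) $counts[0], $items instanceof Crawler, count($items));
```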

extract()

Extracts information from the list of nodes.


    public
                    extract(array<string|int, mixed> $attributes) : array<string|int, mixed>

You can extract attributes and/or the node value (_text).

Example:

$crawler->filter('h1 a')->extract(['_text', 'href']);

Parameters

$attributes : array<string|int, mixed>: An array of attributes

Return values

array<string|int, mixed> —

An array of extracted values
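
A sketch of the shape of the returned array, assuming the Symfony DomCrawler package is installed through Composer; the markup is illustrative and XPath is used for selection to avoid the CssSelector dependency. With several requested attributes each node yields a nested array; with a single attribute the result is flat.

```php
<?php
// Assumption: symfony/dom-crawler is installed via Composer.
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

$crawler = new Crawler(
    '<html><body><h1><a href="/one">One</a></h1><h1><a href="/two">Two</a></h1></body></html>'
);
$anchors = $crawler->filterXPath('//h1/a');

// Two attributes: one [text, href] pair per node.
$pairs = $anchors->extract(['_text', 'href']);

// One attribute: a flat list of values.
$hrefs = $anchors->extract(['href']);

print_r($pairs);
print_r($hrefs);
```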

filter()

Filters the list of nodes with a CSS selector.


    public
                    filter(string $selector) : static

This method only works if you have installed the CssSelector Symfony Component.

Parameters

$selector : string: A CSS selector

Return values

static

filterXPath()

Filters the list of nodes with an XPath expression.


    public
                    filterXPath(string $xpath) : static

The XPath expression is evaluated in the context of the crawler, which is considered as a fake parent of the elements inside it. This means that a child selector "div" or "./div" will match only the div elements of the current crawler, not their children.

Parameters

$xpath : string: An XPath expression

Return values

static
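
The "fake parent" rule can be made concrete with a sketch, assuming the Symfony DomCrawler package is installed through Composer; the nested markup and ids are invented for illustration.

```php
<?php
// Assumption: symfony/dom-crawler is installed via Composer.
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

$html = '<html><body><div id="outer"><div id="inner"></div></div></body></html>';
$crawler = new Crawler($html);

// A crawler that holds exactly one node: the outer div.
$outer = $crawler->filterXPath('//div[@id="outer"]');

// "div" is a child of the fake parent, i.e. the crawler's own nodes.
$own = $outer->filterXPath('div');

// "//div" also descends into the nodes, so the inner div is included.
$all = $outer->filterXPath('//div');

// "div/div" steps from the crawler's own div down to its child div.
$child = $outer->filterXPath('div/div');

echo $own->count(), ' ', $all->count(), ' ', $child->attr('id'), PHP_EOL;
```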

first()

Returns the first node of the current selection.


    public
                    first() : static

Return values

static

form()

Returns a Form object for the first node in the list.


    public
                    form([array<string|int, mixed> $values = null ][, string $method = null ]) : Form

Parameters

$values : array<string|int, mixed> = null: An array of values for the form fields
$method : string = null: The method for the form

Return values

Form —

A Form instance
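
A sketch of obtaining and pre-filling a form, assuming the Symfony DomCrawler package is installed through Composer. The markup, field name, and URI are hypothetical; `getMethod()`, `getUri()`, and `getValues()` belong to the component's Form class, which is not documented on this page. A base URI is passed to the constructor so that the relative `action` can be resolved.

```php
<?php
// Assumption: symfony/dom-crawler is installed via Composer.
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

$html = '<html><body><form action="/search" method="post">'
      . '<input type="text" name="q" value="">'
      . '<button type="submit">Send</button>'
      . '</form></body></html>';

// The URI lets the form resolve its relative action.
$crawler = new Crawler($html, 'http://example.com/');

// Select the submit button by its text, then build the Form from it,
// overriding the value of the "q" field.
$form = $crawler->selectButton('Send')->form(['q' => 'humhub']);

$method = $form->getMethod();
$uri = $form->getUri();
$values = $form->getValues();

echo $method, ' ', $uri, PHP_EOL;
print_r($values);
```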

getBaseHref()

Returns the base href.


    public
                    getBaseHref() : string|null

Return values

string|null

getIterator()


    public
                    getIterator() : ArrayIterator|array<string|int, DOMNode>

#[ReturnTypeWillChange]

Return values

ArrayIterator|array<string|int, DOMNode>

getNode()


    public
                    getNode(int $position) : DOMNode|null

Parameters

$position : int

Return values

DOMNode|null

getUri()

Returns the current URI.


    public
                    getUri() : string|null

Return values

string|null

html()

Returns the first node of the list as HTML.


    public
                    html() : string

Return values

string —

The node html

image()

Returns an Image object for the first node in the list.


    public
                    image() : Image

Return values

Image —

An Image instance

images()

Returns an array of Image objects for the nodes in the list.


    public
                    images() : array<string|int, Image>

Return values

array<string|int, Image> —

An array of Image instances

last()

Returns the last node of the current selection.


    public
                    last() : static

Return values

static

link()

Returns a Link object for the first node in the list.


    public
                    link([string $method = 'get' ]) : Link

Parameters

$method : string = 'get': The method for the link (get by default)

Return values

Link —

A Link instance
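
A sketch of resolving a link against the crawler's URI, assuming the Symfony DomCrawler package is installed through Composer. The markup and URI are illustrative; `getUri()` and `getMethod()` belong to the component's Link class, which is not documented on this page.

```php
<?php
// Assumption: symfony/dom-crawler is installed via Composer.
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

$html = '<html><body><a href="/guide">Guide</a></body></html>';

// The crawler's URI is the base against which hrefs are resolved.
$crawler = new Crawler($html, 'http://example.com/docs/index.html');

// selectLink() matches on the link text; link() wraps the first match.
$link = $crawler->selectLink('Guide')->link();

$uri = $link->getUri();
$method = $link->getMethod();

echo $method, ' ', $uri, PHP_EOL;
```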

links()

Returns an array of Link objects for the nodes in the list.


    public
                    links() : array<string|int, Link>

Return values

array<string|int, Link> —

An array of Link instances

matches()


    public
                    matches(string $selector) : bool

Parameters

$selector : string

Return values

bool

nextAll()

Returns the next sibling nodes of the current selection.


    public
                    nextAll() : static

Return values

static

nodeName()

Returns the node name of the first node of the list.


    public
                    nodeName() : string

Return values

string —

The node name

outerHtml()


    public
                    outerHtml() : string

Return values

string

parents()

Returns the parent nodes of the current selection.


    public
                    parents() : static

Return values

static

previousAll()

Returns the previous sibling nodes of the current selection.


    public
                    previousAll() : static

Return values

static

reduce()

Reduces the list of nodes by calling an anonymous function.


    public
                    reduce(Closure $closure) : static

To remove a node from the list, the anonymous function must return false.

Parameters

$closure : Closure: An anonymous function

Return values

static
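
A sketch of filtering with a callback, assuming the Symfony DomCrawler package is installed through Composer; the markup and the `done` class are made up. As with `each()`, the closure is assumed to receive each node wrapped in a Crawler plus its position.

```php
<?php
// Assumption: symfony/dom-crawler is installed via Composer.
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

$crawler = new Crawler(
    '<html><body><ul><li class="done">a</li><li>b</li><li class="done">c</li></ul></body></html>'
);

// Keep only the items that are not marked as done.
$open = $crawler->filterXPath('//li')->reduce(function (Crawler $node, $i) {
    // Returning false drops the node; any other value keeps it.
    return $node->attr('class') !== 'done';
});

echo $open->count(), ' ', $open->text(), PHP_EOL;
```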

registerNamespace()


    public
                    registerNamespace(string $prefix, string $namespace) : mixed

Parameters

$prefix : string
$namespace : string
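
This method has no description on this page. The sketch below shows one plausible use, assuming the Symfony DomCrawler package is installed through Composer: binding a prefix of your own choosing to a namespace URI so that it can be used in XPath expressions. The XML, the prefix `media`, and the namespace URI are hypothetical.

```php
<?php
// Assumption: symfony/dom-crawler is installed via Composer.
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// The document itself uses the prefix "m" for the namespace.
$xml = '<catalog xmlns:m="http://example.com/ns/media">'
     . '<m:item>One</m:item><m:item>Two</m:item>'
     . '</catalog>';

$crawler = new Crawler();
$crawler->addXmlContent($xml);

// Bind a different prefix to the same namespace URI for use in XPath.
$crawler->registerNamespace('media', 'http://example.com/ns/media');

$titles = $crawler->filterXPath('//media:item')->each(function (Crawler $node, $i) {
    return $node->text();
});

print_r($titles);
```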

selectButton()

Selects a button by name or alt value for images.


    public
                    selectButton(string $value) : static

Parameters

$value : string: The button text

Return values

static

selectImage()

Selects images by alt value.


    public
                    selectImage(string $value) : static

Parameters

$value : string: The image alt

Return values

static —

A new instance of Crawler with the filtered list of nodes

selectLink()

Selects links by name or alt value for clickable images.


    public
                    selectLink(string $value) : static

Parameters

$value : string: The link text

Return values

static

setDefaultNamespacePrefix()

Overloads a default namespace prefix to be used with XPath and CSS expressions.


    public
                    setDefaultNamespacePrefix(string $prefix) : mixed

Parameters

$prefix : string

siblings()

Returns the sibling nodes of the current selection.


    public
                    siblings() : static

Return values

static

slice()

Slices the list of nodes by $offset and $length.


    public
                    slice([int $offset = 0 ][, int $length = null ]) : static

Parameters

$offset : int = 0
$length : int = null

Return values

static
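
A sketch of slicing a node list, assuming the Symfony DomCrawler package is installed through Composer; the markup is illustrative. The offsets are assumed to behave like those of PHP's `array_slice()`, with `$length` omitted meaning "to the end".

```php
<?php
// Assumption: symfony/dom-crawler is installed via Composer.
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

$crawler = new Crawler(
    '<html><body><ul><li>a</li><li>b</li><li>c</li><li>d</li></ul></body></html>'
);
$items = $crawler->filterXPath('//li');

// Helper closure that reads the text of each node in a crawler.
$texts = function (Crawler $list) {
    return $list->each(function (Crawler $node, $i) {
        return $node->text();
    });
};

// Two nodes starting at position 1 (positions are zero-based).
$middle = $texts($items->slice(1, 2));

// Everything from position 2 to the end.
$tail = $texts($items->slice(2));

print_r($middle);
print_r($tail);
```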

text()

Returns the text of the first node of the list.


    public
                    text() : string

Pass true as the second argument to normalize whitespace.

Return values

string —

The node value

xpathLiteral()

Converts a string into a literal that is safe to use in XPath expressions.


    public
            static        xpathLiteral(string $s) : string

The escaped characters are the double quote (") and the apostrophe (').

Examples:

echo Crawler::xpathLiteral('foo " bar'); //prints 'foo " bar'

echo Crawler::xpathLiteral("foo ' bar"); //prints "foo ' bar"

echo Crawler::xpathLiteral('a\'b"c'); //prints concat('a', "'", 'b"c')

Parameters

$s : string: String to be escaped

Return values

string —

Converted string
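
A sketch of using the literal inside a larger expression, assuming the Symfony DomCrawler package is installed through Composer; the markup and the search value are invented. Building the XPath with `xpathLiteral()` avoids breaking the expression when the value contains an apostrophe.

```php
<?php
// Assumption: symfony/dom-crawler is installed via Composer.
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

$crawler = new Crawler("<html><body><ul><li>O'Brien</li><li>Smith</li></ul></body></html>");

// A value containing an apostrophe would break a hand-quoted expression.
$name = "O'Brien";

// xpathLiteral() picks a quoting style that is valid for this value.
$literal = Crawler::xpathLiteral($name);
$xpath = sprintf('//li[text() = %s]', $literal);

$matches = $crawler->filterXPath($xpath)->count();

echo $xpath, ' ', $matches, PHP_EOL;
```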

sibling()


    protected
                    sibling(DOMElement $node[, string $siblingDir = 'nextSibling' ]) : array<string|int, mixed>

Parameters

$node : DOMElement
$siblingDir : string = 'nextSibling'

Return values

array<string|int, mixed>

canParseHtml5String()


    private
                    canParseHtml5String(string $content) : bool

Parameters

$content : string

Return values

bool

convertToHtmlEntities()

Converts charset to HTML-entities to ensure valid parsing.


    private
                    convertToHtmlEntities(string $htmlContent[, string $charset = 'UTF-8' ]) : string

Parameters

$htmlContent : string
$charset : string = 'UTF-8'

Return values

string

createCssSelectorConverter()


    private
                    createCssSelectorConverter() : CssSelectorConverter

Return values

CssSelectorConverter

createDOMXPath()


    private
                    createDOMXPath(DOMDocument $document[, array<string|int, mixed> $prefixes = [] ]) : DOMXPath

Parameters

$document : DOMDocument
$prefixes : array<string|int, mixed> = []

Return values

DOMXPath

createSubCrawler()

Creates a crawler for some subnodes.


    private
                    createSubCrawler(DOMNodeList|DOMNode|array<string|int, DOMNode>|string|null $nodes) : static

Parameters

$nodes : DOMNodeList|DOMNode|array<string|int, DOMNode>|string|null

Return values

static

discoverNamespace()


    private
                    discoverNamespace(DOMXPath $domxpath, string $prefix) : string|null

Parameters

$domxpath : DOMXPath
$prefix : string

Return values

string|null

filterRelativeXPath()

Filters the list of nodes with an XPath expression.


    private
                    filterRelativeXPath(string $xpath) : static

The XPath expression should already be processed to apply it in the context of each node.

Parameters

$xpath : string

Return values

static

findNamespacePrefixes()


    private
                    findNamespacePrefixes(string $xpath) : array<string|int, mixed>

Parameters

$xpath : string

Return values

array<string|int, mixed>

isValidHtml5Heading()


    private
                    isValidHtml5Heading(string $heading) : bool

Parameters

$heading : string

Return values

bool

parseHtml5()


    private
                    parseHtml5(string $htmlContent[, string $charset = 'UTF-8' ]) : DOMDocument

Parameters

$htmlContent : string
$charset : string = 'UTF-8'

Return values

DOMDocument

parseHtmlString()

Parses a string into a DOMDocument object using the HTML5 parser if the content is HTML5 and the library is available.


    private
                    parseHtmlString(string $content, string $charset) : DOMDocument

Otherwise, the libxml parser is used.

Parameters

$content : string
$charset : string

Return values

DOMDocument

parseXhtml()


    private
                    parseXhtml(string $htmlContent[, string $charset = 'UTF-8' ]) : DOMDocument

Parameters

$htmlContent : string
$charset : string = 'UTF-8'

Return values

DOMDocument

relativize()

Makes the XPath relative to the current context.


    private
                    relativize(string $xpath) : string

The returned XPath will match elements matching the XPath inside the current crawler when running in the context of a node of the crawler.

Parameters

$xpath : string

Return values

string
