HumHub Documentation (unofficial)

HTMLPurifier_Encoder
in package Application

A UTF-8 specific character encoder that handles cleaning and transforming.

Constants

ICONV_OK = 0: No bugs detected in iconv.
ICONV_TRUNCATES = 1: Iconv truncates output if converting from UTF-8 to another character set with //IGNORE, and a non-encodable character is found
ICONV_UNUSABLE = 2: Iconv does not support //IGNORE, making it unusable for transcoding purposes

Methods

cleanUTF8() : string: Cleans a UTF-8 string for well-formedness and SGML validity
convertFromUTF8() : string: Converts a string from UTF-8 based on configuration.
convertToASCIIDumbLossless() : string: Lossless (character-wise) conversion of HTML to ASCII
convertToUTF8() : string: Convert a string to UTF-8 based on configuration.
iconv() : string: iconv wrapper which mutes errors and works around bugs.
iconvAvailable() : bool
muteErrorHandler() : mixed: Error handler that mutes errors; an alternative to the shut-up operator (@).
testEncodingSupportsASCII() : array<string|int, mixed>: This expensive function tests whether or not a given character encoding supports ASCII. 7/8-bit encodings like Shift_JIS will fail this test, and require special processing. Variable width encodings shouldn't ever fail.
testIconvTruncateBug() : int: glibc iconv has a known bug where it doesn't handle the magic //IGNORE stanza correctly. In particular, rather than ignore characters, it will return an EILSEQ after consuming some number of characters, and expect you to restart iconv as if it were an E2BIG. Old versions of PHP did not respect the errno, and returned the fragment, so as a result you would see iconv mysteriously truncating output. We can work around this by manually chopping our input into segments of about 8000 characters, as long as PHP ignores the error code. If PHP starts paying attention to the error code, iconv becomes unusable.
unichr() : mixed: Translates a Unicode codepoint into its corresponding UTF-8 character.
unsafeIconv() : string: iconv wrapper which mutes errors, but doesn't work around bugs.
__construct() : mixed: Constructor throws fatal error if you attempt to instantiate class

ICONV_OK

No bugs detected in iconv.


    public mixed ICONV_OK = 0

ICONV_TRUNCATES

Iconv truncates output if converting from UTF-8 to another character set with //IGNORE, and a non-encodable character is found


    public mixed ICONV_TRUNCATES = 1

ICONV_UNUSABLE

Iconv does not support //IGNORE, making it unusable for transcoding purposes


    public mixed ICONV_UNUSABLE = 2

cleanUTF8()

Cleans a UTF-8 string for well-formedness and SGML validity


    public static cleanUTF8(string $str[, bool $force_php = false ]) : string

It parses the input as UTF-8 and returns a valid UTF-8 string with non-SGML codepoints excluded.

Specifically, it will permit: \x{9}\x{A}\x{D}\x{20}-\x{7E}\x{A0}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}

Source: https://www.w3.org/TR/REC-xml/#NT-Char

Arguably this function should be modernized to the HTML5 set of allowed characters (https://www.w3.org/TR/html5/syntax.html#preprocessing-the-input-stream), which simultaneously expands and restricts the set of allowed characters.
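The permitted ranges above can be illustrated with a single regular expression. The following is a minimal sketch, not the library's implementation: `stripNonSgmlCodepoints()` is a hypothetical helper, and it assumes the input is already well-formed UTF-8 (it throws otherwise, whereas cleanUTF8() is documented to return a valid string).

```php
<?php
/**
 * Hypothetical helper, not part of HTML Purifier: removes every codepoint
 * outside the permitted set from an already well-formed UTF-8 string.
 */
function stripNonSgmlCodepoints(string $str): string
{
    // Negated character class built from the ranges listed in the text.
    $pattern = '/[^\x{9}\x{A}\x{D}\x{20}-\x{7E}\x{A0}-\x{D7FF}'
             . '\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u';

    $result = preg_replace($pattern, '', $str);
    if ($result === null) {
        // With the /u modifier, PCRE rejects malformed UTF-8 subjects.
        throw new InvalidArgumentException('Input is not well-formed UTF-8');
    }
    return $result;
}
```

Under these assumptions, control characters such as NUL and DEL are dropped, while tabs, line breaks, and printable characters pass through unchanged.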

Parameters

$str : string: The string to clean
$force_php : bool = false

Return values

string

convertFromUTF8()

Converts a string from UTF-8 based on configuration.


    public static convertFromUTF8(string $str, HTMLPurifier_Config $config, HTMLPurifier_Context $context) : string

Parameters

$str : string: The string to convert
$config : HTMLPurifier_Config
$context : HTMLPurifier_Context

Return values

string

convertToASCIIDumbLossless()

Lossless (character-wise) conversion of HTML to ASCII


    public static convertToASCIIDumbLossless(string $str) : string

Parameters

$str : string: UTF-8 string to be converted to ASCII

Return values

string: ASCII-encoded string with non-ASCII characters entity-ized
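To make "entity-ized" concrete, here is a minimal sketch of a character-wise conversion. `entityizeNonAscii()` is a hypothetical stand-in rather than the library method, and the choice of decimal numeric entities is an assumption of this example. It also assumes well-formed UTF-8 input.

```php
<?php
/**
 * Hypothetical stand-in for illustration only: replaces every non-ASCII
 * character in a well-formed UTF-8 string with a decimal numeric entity.
 */
function entityizeNonAscii(string $str): string
{
    $result = preg_replace_callback(
        '/[^\x00-\x7F]/u',
        static function (array $m): string {
            // Bytes of one multi-byte character (2 to 4 bytes long).
            $bytes = array_values(unpack('C*', $m[0]));
            $count = count($bytes);
            // Lead byte: mask off the bits that only encode the length.
            $masks = [2 => 0x1F, 3 => 0x0F, 4 => 0x07];
            $code = $bytes[0] & $masks[$count];
            // Each continuation byte contributes its low six bits.
            for ($i = 1; $i < $count; $i++) {
                $code = ($code << 6) | ($bytes[$i] & 0x3F);
            }
            return '&#' . $code . ';';
        },
        $str
    );
    if ($result === null) {
        throw new InvalidArgumentException('Input is not well-formed UTF-8');
    }
    return $result;
}
```

ASCII characters are left untouched, so the output can be decoded back to the original text by any entity-aware consumer, which is what makes the conversion lossless.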

convertToUTF8()

Convert a string to UTF-8 based on configuration.


    public static convertToUTF8(string $str, HTMLPurifier_Config $config, HTMLPurifier_Context $context) : string

Parameters

$str : string: The string to convert
$config : HTMLPurifier_Config
$context : HTMLPurifier_Context

Return values

string

iconv()

iconv wrapper which mutes errors and works around bugs.


    public static iconv(string $in, string $out, string $text[, int $max_chunk_size = 8000 ]) : string

Parameters

$in : string: Input encoding
$out : string: Output encoding
$text : string: The text to convert
$max_chunk_size : int = 8000

Return values

string
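The $max_chunk_size parameter relates to the workaround of converting long input in segments. The sketch below shows one way such splitting could be done without cutting a multi-byte UTF-8 character in half. `splitUtf8IntoChunks()` is a hypothetical helper, and measuring chunks in bytes and requiring a minimum of 4 are assumptions of this example, not documented behaviour.

```php
<?php
/**
 * Hypothetical helper: splits UTF-8 text into chunks of at most
 * $maxChunkSize bytes, never breaking inside a multi-byte character.
 */
function splitUtf8IntoChunks(string $text, int $maxChunkSize = 8000): array
{
    if ($maxChunkSize < 4) {
        // A UTF-8 character can occupy up to four bytes.
        throw new InvalidArgumentException('Chunk size must be at least 4 bytes');
    }
    $chunks = [];
    $length = strlen($text);
    $offset = 0;
    while ($offset < $length) {
        $hardEnd = min($offset + $maxChunkSize, $length);
        $end = $hardEnd;
        // Back up while the next chunk would start on a continuation byte (10xxxxxx).
        while ($end < $length && $end > $offset && (ord($text[$end]) & 0xC0) === 0x80) {
            $end--;
        }
        if ($end === $offset) {
            // Malformed input (only continuation bytes): fall back to a hard split.
            $end = $hardEnd;
        }
        $chunks[] = substr($text, $offset, $end - $offset);
        $offset = $end;
    }
    return $chunks;
}
```

Each chunk could then be converted separately and the results concatenated, so that a truncation affects at most one segment.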

iconvAvailable()


    public static iconvAvailable() : bool

Return values

bool

muteErrorHandler()

Error handler that mutes errors; an alternative to the shut-up operator (@).


    public static muteErrorHandler() : mixed
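A handler-based mute, as opposed to prefixing a call with @, typically follows the pattern sketched below. `silentHandler()` and `callMuted()` are hypothetical names introduced for this example; only `set_error_handler()` and `restore_error_handler()` are standard PHP functions.

```php
<?php
/** Hypothetical no-op handler; returning true marks the error as handled. */
function silentHandler(int $errno, string $errstr): bool
{
    return true;
}

/**
 * Hypothetical wrapper: runs $fn with warnings and notices muted,
 * then restores whichever handler was active before.
 */
function callMuted(callable $fn)
{
    set_error_handler('silentHandler');
    try {
        return $fn();
    } finally {
        // Always undo the temporary handler, even if $fn throws.
        restore_error_handler();
    }
}
```

The muting is scoped to the wrapped call, so errors raised elsewhere still reach the application's normal handler.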

testEncodingSupportsASCII()

This expensive function tests whether or not a given character encoding supports ASCII. 7/8-bit encodings like Shift_JIS will fail this test, and require special processing. Variable width encodings shouldn't ever fail.


    public static testEncodingSupportsASCII(string $encoding[, bool $bypass = false ]) : array<string|int, mixed>

Parameters

$encoding : string: Encoding name to test, as per iconv format
$bypass : bool = false: Whether or not to bypass the precompiled arrays.

Return values

array<string|int, mixed>: Array of UTF-8 characters mapped to their corresponding ASCII, which can be used to "undo" any overzealous iconv action.

testIconvTruncateBug()

glibc iconv has a known bug where it doesn't handle the magic //IGNORE stanza correctly. In particular, rather than ignore characters, it will return an EILSEQ after consuming some number of characters, and expect you to restart iconv as if it were an E2BIG. Old versions of PHP did not respect the errno, and returned the fragment, so as a result you would see iconv mysteriously truncating output. We can work around this by manually chopping our input into segments of about 8000 characters, as long as PHP ignores the error code. If PHP starts paying attention to the error code, iconv becomes unusable.


    public static testIconvTruncateBug() : int

Return values

int: Error code indicating severity of bug.
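A probe along the lines described above might look like the following. This is a sketch under stated assumptions, not the library's code: `probeIconvIgnore()` and the `SKETCH_*` constants are hypothetical (they only mirror the values 0, 1, and 2 documented on this page), and the test input of one non-ASCII character followed by 9000 ASCII characters is an illustrative choice.

```php
<?php
// Hypothetical constants mirroring the documented values.
const SKETCH_ICONV_OK = 0;
const SKETCH_ICONV_TRUNCATES = 1;
const SKETCH_ICONV_UNUSABLE = 2;

/** Hypothetical probe: classifies how the local iconv treats //IGNORE. */
function probeIconvIgnore(): int
{
    if (!function_exists('iconv')) {
        // Assumption of this sketch: a missing extension counts as unusable.
        return SKETCH_ICONV_UNUSABLE;
    }
    // One character that ASCII cannot encode, then a long ASCII tail.
    $input = "\u{3B1}" . str_repeat('a', 9000);
    $output = @iconv('UTF-8', 'ASCII//IGNORE', $input);

    if ($output === false) {
        // The error code was honoured and nothing was returned.
        return SKETCH_ICONV_UNUSABLE;
    }
    // A shorter result means part of the ASCII tail was lost.
    return strlen($output) < 9000 ? SKETCH_ICONV_TRUNCATES : SKETCH_ICONV_OK;
}
```

The result depends on the platform's iconv implementation and PHP version, so the same code may report different values on different hosts.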

unichr()

Translates a Unicode codepoint into its corresponding UTF-8 character.


    public static unichr(mixed $code) : mixed

Parameters

$code : mixed
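The translation from a codepoint to UTF-8 bytes follows the standard UTF-8 encoding rules, sketched below. `codepointToUtf8()` is a hypothetical function written for this example; returning an empty string for surrogates and out-of-range values is an assumption, not documented behaviour of unichr().

```php
<?php
/** Hypothetical illustration of UTF-8 encoding for a single codepoint. */
function codepointToUtf8(int $code): string
{
    // Assumption: invalid codepoints (negative, too large, surrogates) yield ''.
    if ($code < 0 || $code > 0x10FFFF || ($code >= 0xD800 && $code <= 0xDFFF)) {
        return '';
    }
    if ($code < 0x80) {
        // One byte: 0xxxxxxx
        return chr($code);
    }
    if ($code < 0x800) {
        // Two bytes: 110xxxxx 10xxxxxx
        return chr(0xC0 | ($code >> 6))
             . chr(0x80 | ($code & 0x3F));
    }
    if ($code < 0x10000) {
        // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($code >> 12))
             . chr(0x80 | (($code >> 6) & 0x3F))
             . chr(0x80 | ($code & 0x3F));
    }
    // Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($code >> 18))
         . chr(0x80 | (($code >> 12) & 0x3F))
         . chr(0x80 | (($code >> 6) & 0x3F))
         . chr(0x80 | ($code & 0x3F));
}
```

The number of output bytes grows with the magnitude of the codepoint, which is why ASCII text is unchanged by UTF-8 encoding.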

unsafeIconv()

iconv wrapper which mutes errors, but doesn't work around bugs.


    public static unsafeIconv(string $in, string $out, string $text) : string

Parameters

$in : string: Input encoding
$out : string: Output encoding
$text : string: The text to convert

Return values

string

__construct()

The constructor throws a fatal error if you attempt to instantiate the class.


    private __construct() : mixed
