HumHub Documentation (unofficial)

HTMLPurifier_Encoder

A UTF-8 specific character encoder that handles cleaning and transforming.

Tags
note

All functions in this class should be static.

Table of Contents

Constants

ICONV_OK  = 0
No bugs detected in iconv.
ICONV_TRUNCATES  = 1
Iconv truncates output when converting from UTF-8 to another character set with //IGNORE and a non-encodable character is found.
ICONV_UNUSABLE  = 2
Iconv does not support //IGNORE, making it unusable for transcoding purposes.

Methods

cleanUTF8()  : string
Cleans a UTF-8 string for well-formedness and SGML validity
convertFromUTF8()  : string
Converts a string from UTF-8 based on configuration.
convertToASCIIDumbLossless()  : string
Lossless (character-wise) conversion of HTML to ASCII
convertToUTF8()  : string
Convert a string to UTF-8 based on configuration.
iconv()  : string
iconv wrapper which mutes errors and works around bugs.
iconvAvailable()  : bool
muteErrorHandler()  : mixed
Error-handler that mutes errors, alternative to shut-up operator.
testEncodingSupportsASCII()  : array<string|int, mixed>
This expensive function tests whether or not a given character encoding supports ASCII. 7/8-bit encodings like Shift_JIS will fail this test, and require special processing. Variable width encodings shouldn't ever fail.
testIconvTruncateBug()  : int
glibc iconv has a known bug where it doesn't handle the magic //IGNORE stanza correctly. In particular, rather than ignore characters, it will return an EILSEQ after consuming some number of characters, and expect you to restart iconv as if it were an E2BIG. Old versions of PHP did not respect the errno, and returned the fragment, so as a result you would see iconv mysteriously truncating output. We can work around this by manually chopping our input into segments of about 8000 characters, as long as PHP ignores the error code. If PHP starts paying attention to the error code, iconv becomes unusable.
unichr()  : mixed
Translates a Unicode codepoint into its corresponding UTF-8 character.
unsafeIconv()  : string
iconv wrapper which mutes errors, but doesn't work around bugs.
__construct()  : mixed
Constructor throws fatal error if you attempt to instantiate class

Constants

ICONV_OK

No bugs detected in iconv.

public mixed ICONV_OK = 0

ICONV_TRUNCATES

Iconv truncates output when converting from UTF-8 to another character set with //IGNORE and a non-encodable character is found.

public mixed ICONV_TRUNCATES = 1

ICONV_UNUSABLE

Iconv does not support //IGNORE, making it unusable for transcoding purposes.

public mixed ICONV_UNUSABLE = 2

Methods

cleanUTF8()

Cleans a UTF-8 string for well-formedness and SGML validity

public static cleanUTF8(string $str[, bool $force_php = false ]) : string

It will parse according to UTF-8 and return a valid UTF-8 string, with non-SGML codepoints excluded.

Specifically, it will permit: \x{9}\x{A}\x{D}\x{20}-\x{7E}\x{A0}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}
Source: https://www.w3.org/TR/REC-xml/#NT-Char

Arguably this function should be modernized to the HTML5 set of allowed characters (https://www.w3.org/TR/html5/syntax.html#preprocessing-the-input-stream), which simultaneously expands and restricts the set of allowed characters.

Parameters
$str : string

The string to clean

$force_php : bool = false
Tags
note

Just for reference, the non-SGML code points are 0 to 31 and 127 to 159, inclusive. However, we allow code points 9, 10 and 13, which are the tab, line feed and carriage return respectively. Code points 128 and above map to multibyte UTF-8 representations.

note

Fallback code adapted from utf8ToUnicode by Henri Sivonen (hsivonen@iki.fi) at http://iki.fi/hsivonen/php-utf8/ under the LGPL license. Notes on what changed are inside, but in general, the original code transformed UTF-8 text into an array of integer Unicode codepoints. Understandably, transforming that back to a string would be somewhat expensive, so the function was modified to operate directly on the string. However, this discourages code reuse, and the logic enumerated here would be useful for any function that needs to be able to understand UTF-8 characters. As of right now, only smart lossless character encoding converters would need that, and I'm probably not going to implement them.

Return values
string
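
As a rough illustration of the permitted character set listed above, the following sketch strips every code point outside those ranges with a single PCRE pattern. It is not the library's implementation: the function name sketchStripNonSgml is hypothetical, and unlike cleanUTF8() the sketch does not repair malformed UTF-8. It returns null in that case, because PCRE rejects invalid input when the /u modifier is set.

```php
<?php
/**
 * Hypothetical helper (not part of HTMLPurifier_Encoder): removes every
 * code point outside the permitted set quoted in the description above.
 * Returns null when $str is not well-formed UTF-8.
 */
function sketchStripNonSgml(string $str): ?string
{
    // Negated character class built from the permitted ranges.
    $pattern = '/[^\x{9}\x{A}\x{D}\x{20}-\x{7E}\x{A0}-\x{D7FF}'
             . '\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u';

    // With /u, preg_replace() returns null if the subject is malformed UTF-8.
    return preg_replace($pattern, '', $str);
}
```

Tab, line feed and carriage return survive, while other control characters such as NUL or DEL (127) are dropped.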

convertToASCIIDumbLossless()

Lossless (character-wise) conversion of HTML to ASCII

public static convertToASCIIDumbLossless(string $str) : string
Parameters
$str : string

UTF-8 string to be converted to ASCII

Tags
warning

Adapted from MediaWiki, claiming fair use: this is a common algorithm. If you disagree with this license fudgery, implement it yourself.

note

Uses decimal numeric entities since they are best supported.

note

This is a DUMB function: it has no concept of keeping character entities that the projected character encoding can allow. We could possibly implement a smart version but that would require it to also know which Unicode codepoints the charset supported (not an easy task).

note

Somewhat similar to cleanUTF8(), but it assumes that $str is well-formed UTF-8.

Return values
string

ASCII-encoded string with non-ASCII characters entity-ized
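
To make the "dumb lossless" idea concrete, here is a minimal sketch that replaces each non-ASCII character with its decimal numeric entity. It is an independent illustration, not the library's code: the name sketchToAsciiEntities is hypothetical, and, like the documented method, it assumes well-formed UTF-8 input (malformed input is returned unchanged in this sketch).

```php
<?php
/**
 * Hypothetical helper (not part of HTMLPurifier_Encoder): turns every
 * non-ASCII character of a well-formed UTF-8 string into "&#NNN;".
 */
function sketchToAsciiEntities(string $str): string
{
    $out = preg_replace_callback('/[^\x00-\x7F]/u', function (array $m): string {
        // Bytes of one multibyte UTF-8 character (2 to 4 bytes).
        $bytes = array_values(unpack('C*', $m[0]));
        $n = count($bytes);

        // Payload bits of the lead byte: 5 bits (2-byte), 4 bits (3-byte), 3 bits (4-byte).
        $code = $bytes[0] & (0xFF >> ($n + 1));

        // Each continuation byte contributes its low 6 bits.
        for ($i = 1; $i < $n; $i++) {
            $code = ($code << 6) | ($bytes[$i] & 0x3F);
        }
        return '&#' . $code . ';';
    }, $str);

    // preg_replace_callback() yields null on malformed UTF-8; leave input as-is then.
    return $out ?? $str;
}
```

ASCII characters pass through untouched, so existing markup and entities in the input are not altered.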

iconv()

iconv wrapper which mutes errors and works around bugs.

public static iconv(string $in, string $out, string $text[, int $max_chunk_size = 8000 ]) : string
Parameters
$in : string

Input encoding

$out : string

Output encoding

$text : string

The text to convert

$max_chunk_size : int = 8000
Return values
string

iconvAvailable()

public static iconvAvailable() : bool
Return values
bool

muteErrorHandler()

Error-handler that mutes errors, alternative to shut-up operator.

public static muteErrorHandler() : mixed

testEncodingSupportsASCII()

This expensive function tests whether or not a given character encoding supports ASCII. 7/8-bit encodings like Shift_JIS will fail this test, and require special processing. Variable width encodings shouldn't ever fail.

public static testEncodingSupportsASCII(string $encoding[, bool $bypass = false ]) : array<string|int, mixed>
Parameters
$encoding : string

Encoding name to test, as per iconv format

$bypass : bool = false

Whether or not to bypass the precompiled arrays.

Return values
array<string|int, mixed>

Array mapping UTF-8 characters to their corresponding ASCII, which can be used to "undo" any overzealous iconv action.

testIconvTruncateBug()

glibc iconv has a known bug where it doesn't handle the magic //IGNORE stanza correctly. In particular, rather than ignore characters, it will return an EILSEQ after consuming some number of characters, and expect you to restart iconv as if it were an E2BIG. Old versions of PHP did not respect the errno, and returned the fragment, so as a result you would see iconv mysteriously truncating output. We can work around this by manually chopping our input into segments of about 8000 characters, as long as PHP ignores the error code. If PHP starts paying attention to the error code, iconv becomes unusable.

public static testIconvTruncateBug() : int
Return values
int

Error code indicating severity of bug.
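
The chunking workaround described above can be sketched as follows. This is an assumption-laden illustration rather than the library's actual code: the name sketchChunkUtf8 is hypothetical, the limit is measured in bytes rather than characters, and the sketch only shows how input might be split so that no chunk ends in the middle of a multibyte UTF-8 sequence.

```php
<?php
/**
 * Hypothetical helper (not part of HTMLPurifier_Encoder): splits a UTF-8
 * string into chunks of at most $maxBytes bytes without cutting a
 * multibyte character in half.
 */
function sketchChunkUtf8(string $text, int $maxBytes = 8000): array
{
    if ($maxBytes < 4) {
        // A UTF-8 character can be up to 4 bytes long.
        throw new InvalidArgumentException('$maxBytes must be at least 4');
    }

    $chunks = [];
    $len = strlen($text);
    $pos = 0;

    while ($pos < $len) {
        $end = min($pos + $maxBytes, $len);

        // If the byte at the cut point is a continuation byte (10xxxxxx),
        // move the cut left until it sits on a character boundary.
        while ($end < $len && $end > $pos && (ord($text[$end]) & 0xC0) === 0x80) {
            $end--;
        }

        // Malformed input made only of continuation bytes: cut anyway to guarantee progress.
        if ($end === $pos) {
            $end = min($pos + $maxBytes, $len);
        }

        $chunks[] = substr($text, $pos, $end - $pos);
        $pos = $end;
    }

    return $chunks;
}
```

Each chunk could then be converted separately and the results concatenated, which is the general shape of the workaround; how the real iconv() wrapper does this is not shown on this page.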

unichr()

Translates a Unicode codepoint into its corresponding UTF-8 character.

public static unichr(mixed $code) : mixed
Parameters
$code : mixed
Tags
note

Based on Feyd's function at http://forums.devnetwork.net/viewtopic.php?p=191404#191404, which is in public domain.

note

While we're going to do code point parsing anyway, a good optimization would be to refuse to translate code points that are non-SGML characters. However, this could lead to duplication.

note

This is very similar to the unichr function in maintenance/generate-entity-file.php (although this is superior, due to its sanity checks).
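
For readers unfamiliar with the encoding rules, the translation from a code point to UTF-8 bytes can be sketched as below. This is a generic illustration of standard UTF-8 encoding, not the library's source: the name sketchUnichr is hypothetical, and returning an empty string for surrogates and out-of-range values is a choice made for this sketch.

```php
<?php
/**
 * Hypothetical helper (not part of HTMLPurifier_Encoder): encodes a
 * Unicode code point as a UTF-8 byte string.
 */
function sketchUnichr(int $code): string
{
    // Sanity checks: reject negatives, values above U+10FFFF, and surrogates.
    if ($code < 0 || $code > 0x10FFFF || ($code >= 0xD800 && $code <= 0xDFFF)) {
        return '';
    }
    if ($code < 0x80) {
        // 1 byte: 0xxxxxxx
        return chr($code);
    }
    if ($code < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        return chr(0xC0 | ($code >> 6))
             . chr(0x80 | ($code & 0x3F));
    }
    if ($code < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($code >> 12))
             . chr(0x80 | (($code >> 6) & 0x3F))
             . chr(0x80 | ($code & 0x3F));
    }
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($code >> 18))
         . chr(0x80 | (($code >> 12) & 0x3F))
         . chr(0x80 | (($code >> 6) & 0x3F))
         . chr(0x80 | ($code & 0x3F));
}
```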

unsafeIconv()

iconv wrapper which mutes errors, but doesn't work around bugs.

public static unsafeIconv(string $in, string $out, string $text) : string
Parameters
$in : string

Input encoding

$out : string

Output encoding

$text : string

The text to convert

Return values
string

__construct()

Constructor throws fatal error if you attempt to instantiate class

private __construct() : mixed

        