HTMLPurifier_Encoder
A UTF-8 specific character encoder that handles cleaning and transforming.
Table of Contents
Constants
- ICONV_OK = 0
- No bugs detected in iconv.
- ICONV_TRUNCATES = 1
- Iconv truncates output if converting from UTF-8 to another character set with //IGNORE, and a non-encodable character is found
- ICONV_UNUSABLE = 2
- Iconv does not support //IGNORE, making it unusable for transcoding purposes
Methods
- cleanUTF8() : string
- Cleans a UTF-8 string for well-formedness and SGML validity
- convertFromUTF8() : string
- Converts a string from UTF-8 based on configuration.
- convertToASCIIDumbLossless() : string
- Lossless (character-wise) conversion of HTML to ASCII
- convertToUTF8() : string
- Convert a string to UTF-8 based on configuration.
- iconv() : string
- iconv wrapper which mutes errors and works around bugs.
- iconvAvailable() : bool
- muteErrorHandler() : mixed
- Error-handler that mutes errors, alternative to shut-up operator.
- testEncodingSupportsASCII() : array<string|int, mixed>
- This expensive function tests whether or not a given character encoding supports ASCII. 7/8-bit encodings like Shift_JIS will fail this test, and require special processing. Variable width encodings shouldn't ever fail.
- testIconvTruncateBug() : int
- glibc iconv has a known bug where it doesn't handle the magic //IGNORE stanza correctly. In particular, rather than ignore characters, it will return an EILSEQ after consuming some number of characters, and expect you to restart iconv as if it were an E2BIG. Old versions of PHP did not respect the errno, and returned the fragment, so as a result you would see iconv mysteriously truncating output. We can work around this by manually chopping our input into segments of about 8000 characters, as long as PHP ignores the error code. If PHP starts paying attention to the error code, iconv becomes unusable.
- unichr() : mixed
- Translates a Unicode codepoint into its corresponding UTF-8 character.
- unsafeIconv() : string
- iconv wrapper which mutes errors, but doesn't work around bugs.
- __construct() : mixed
- Constructor throws fatal error if you attempt to instantiate class
Constants
ICONV_OK
No bugs detected in iconv.
public mixed ICONV_OK = 0
ICONV_TRUNCATES
Iconv truncates output if converting from UTF-8 to another character set with //IGNORE, and a non-encodable character is found
public mixed ICONV_TRUNCATES = 1
ICONV_UNUSABLE
Iconv does not support //IGNORE, making it unusable for transcoding purposes
public mixed ICONV_UNUSABLE = 2
Methods
cleanUTF8()
Cleans a UTF-8 string for well-formedness and SGML validity
public static cleanUTF8(string $str[, bool $force_php = false ]) : string
It parses the input as UTF-8 and returns a valid UTF-8 string with non-SGML codepoints excluded.
Specifically, it permits: \x{9}\x{A}\x{D}\x{20}-\x{7E}\x{A0}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}
Source: https://www.w3.org/TR/REC-xml/#NT-Char
Arguably this function should be modernized to the HTML5 set of allowed characters, which simultaneously expands and restricts the set of allowed characters: https://www.w3.org/TR/html5/syntax.html#preprocessing-the-input-stream
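The permitted ranges above can be expressed as a single negated character class. The following is a minimal sketch under stated assumptions, not the library's implementation: sketchStripNonSgml() is a hypothetical helper, and it simply returns an empty string for malformed UTF-8 rather than repairing it.

```php
<?php
// Illustrative sketch only; sketchStripNonSgml() is a hypothetical helper
// and is not part of HTML Purifier.
function sketchStripNonSgml(string $str): string
{
    // Negated character class: matches any codepoint outside the permitted set.
    $pattern = '/[^\x{9}\x{A}\x{D}\x{20}-\x{7E}\x{A0}-\x{D7FF}'
             . '\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u';

    $result = preg_replace($pattern, '', $str);

    // With the /u modifier, preg_replace() returns null for malformed UTF-8.
    // Assumption for this sketch: treat such input as empty instead of repairing it.
    return $result === null ? '' : $result;
}
```

For instance, passing "a\x00b\x7Fc" to this helper drops the NUL and DEL bytes, while tabs, line feeds and carriage returns are kept.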
Parameters
- $str : string
- The string to clean
- $force_php : bool = false
Return values
string
convertFromUTF8()
Converts a string from UTF-8 based on configuration.
public static convertFromUTF8(string $str, HTMLPurifier_Config $config, HTMLPurifier_Context $context) : string
Parameters
- $str : string
- The string to convert
- $config : HTMLPurifier_Config
- $context : HTMLPurifier_Context
Return values
string
convertToASCIIDumbLossless()
Lossless (character-wise) conversion of HTML to ASCII
public static convertToASCIIDumbLossless(string $str) : string
Parameters
- $str : string
- UTF-8 string to be converted to ASCII
Return values
string: ASCII-encoded string with non-ASCII characters entity-ized
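To illustrate what "entity-ized" means here, the sketch below replaces each non-ASCII character in a UTF-8 string with a numeric character reference. It is an assumption-laden illustration rather than the library's code: sketchUtf8ToAsciiEntities() is a hypothetical helper, the decimal "&#NNN;" form is assumed, and malformed UTF-8 yields an empty string.

```php
<?php
// Illustrative sketch only; sketchUtf8ToAsciiEntities() is a hypothetical helper.
function sketchUtf8ToAsciiEntities(string $str): string
{
    $result = preg_replace_callback(
        '/[^\x00-\x7F]/u', // one whole non-ASCII character per match
        function (array $m): string {
            $bytes = array_values(unpack('C*', $m[0]));
            $n = count($bytes);
            // Drop the length-marker bits of the lead byte
            // (110xxxxx, 1110xxxx or 11110xxx).
            $code = $bytes[0] & (0xFF >> ($n + 1));
            // Each continuation byte (10xxxxxx) contributes six payload bits.
            for ($i = 1; $i < $n; $i++) {
                $code = ($code << 6) | ($bytes[$i] & 0x3F);
            }
            return '&#' . $code . ';';
        },
        $str
    );

    // Assumption for this sketch: malformed UTF-8 (null result) becomes ''.
    return $result === null ? '' : $result;
}
```

Because every non-ASCII character maps to exactly one reference, the original text can be recovered by decoding the references, which is the sense in which the conversion is lossless.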
convertToUTF8()
Convert a string to UTF-8 based on configuration.
public static convertToUTF8(string $str, HTMLPurifier_Config $config, HTMLPurifier_Context $context) : string
Parameters
- $str : string
- The string to convert
- $config : HTMLPurifier_Config
- $context : HTMLPurifier_Context
Return values
string
iconv()
iconv wrapper which mutes errors and works around bugs.
public static iconv(string $in, string $out, string $text[, int $max_chunk_size = 8000 ]) : string
Parameters
- $in : string
- Input encoding
- $out : string
- Output encoding
- $text : string
- The text to convert
- $max_chunk_size : int = 8000
Return values
string
iconvAvailable()
public static iconvAvailable() : bool
Return values
bool
muteErrorHandler()
Error-handler that mutes errors, alternative to shut-up operator.
public static muteErrorHandler() : mixed
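The general pattern of muting errors with a handler instead of the shut-up operator (@) can be sketched as follows. This is a generic illustration and not the library's code: sketchCallMuted() is a hypothetical helper that installs a no-op handler, runs a callable, and always restores the previous handler.

```php
<?php
// Illustrative sketch only; sketchCallMuted() is a hypothetical helper.
function sketchCallMuted(callable $fn)
{
    // Returning true tells PHP the error was handled, so nothing is reported.
    set_error_handler(function (int $errno, string $errstr): bool {
        return true;
    });

    try {
        return $fn();
    } finally {
        // Restore the previous handler even if $fn throws.
        restore_error_handler();
    }
}
```

A call such as sketchCallMuted(function () { return iconv('UTF-8', 'ISO-8859-1', $text); }) would run the conversion with notices and warnings silenced, assuming the iconv extension is loaded and $text is captured with a use clause.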
testEncodingSupportsASCII()
This expensive function tests whether or not a given character encoding supports ASCII. 7/8-bit encodings like Shift_JIS will fail this test, and require special processing. Variable width encodings shouldn't ever fail.
public static testEncodingSupportsASCII(string $encoding[, bool $bypass = false ]) : array<string|int, mixed>
Parameters
- $encoding : string
- Encoding name to test, as per iconv format
- $bypass : bool = false
- Whether or not to bypass the precompiled arrays.
Return values
array<string|int, mixed>: Array mapping UTF-8 characters to their corresponding ASCII, which can be used to "undo" any overzealous iconv action.
testIconvTruncateBug()
glibc iconv has a known bug where it doesn't handle the magic //IGNORE stanza correctly. In particular, rather than ignore characters, it will return an EILSEQ after consuming some number of characters, and expect you to restart iconv as if it were an E2BIG. Old versions of PHP did not respect the errno, and returned the fragment, so as a result you would see iconv mysteriously truncating output. We can work around this by manually chopping our input into segments of about 8000 characters, as long as PHP ignores the error code. If PHP starts paying attention to the error code, iconv becomes unusable.
public static testIconvTruncateBug() : int
Return values
int: Error code indicating severity of bug.
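The chunking workaround described above depends on cutting the input at character boundaries, so that no multibyte UTF-8 sequence is split between two segments. The sketch below shows one way to do that. It is not the library's implementation: sketchSplitUtf8() is a hypothetical helper, and it measures chunk size in bytes, which is an assumption.

```php
<?php
// Illustrative sketch only; sketchSplitUtf8() is a hypothetical helper.
function sketchSplitUtf8(string $text, int $maxChunkSize = 8000): array
{
    if ($maxChunkSize < 4) {
        // A UTF-8 character can occupy up to four bytes.
        throw new InvalidArgumentException('Chunk size must be at least 4 bytes');
    }

    $chunks = [];
    $len = strlen($text);
    $start = 0;

    while ($start < $len) {
        $end = min($start + $maxChunkSize, $len);
        // If the cut would land on a continuation byte (10xxxxxx), back up
        // until it lands on the first byte of a character.
        while ($end < $len && $end > $start + 1 && (ord($text[$end]) & 0xC0) === 0x80) {
            $end--;
        }
        $chunks[] = substr($text, $start, $end - $start);
        $start = $end;
    }

    return $chunks;
}
```

Each chunk could then be converted separately and the results concatenated, which keeps every individual conversion short enough to avoid the truncation described above.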
unichr()
Translates a Unicode codepoint into its corresponding UTF-8 character.
public static unichr(mixed $code) : mixed
Parameters
- $code : mixed
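As a rough illustration of the codepoint-to-UTF-8 translation, the sketch below applies the standard UTF-8 encoding rules. It is not the library's code: sketchCodepointToUtf8() is a hypothetical helper, and returning an empty string for out-of-range values and surrogates is an assumption of this sketch.

```php
<?php
// Illustrative sketch only; sketchCodepointToUtf8() is a hypothetical helper.
function sketchCodepointToUtf8(int $code): string
{
    // Assumption: invalid codepoints (negative, above U+10FFFF, or
    // surrogates U+D800..U+DFFF) produce an empty string.
    if ($code < 0 || $code > 0x10FFFF || ($code >= 0xD800 && $code <= 0xDFFF)) {
        return '';
    }
    if ($code < 0x80) {
        // 1 byte: 0xxxxxxx
        return chr($code);
    }
    if ($code < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        return chr(0xC0 | ($code >> 6))
             . chr(0x80 | ($code & 0x3F));
    }
    if ($code < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($code >> 12))
             . chr(0x80 | (($code >> 6) & 0x3F))
             . chr(0x80 | ($code & 0x3F));
    }
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($code >> 18))
         . chr(0x80 | (($code >> 12) & 0x3F))
         . chr(0x80 | (($code >> 6) & 0x3F))
         . chr(0x80 | ($code & 0x3F));
}
```

For example, codepoint 0x20AC (the euro sign) encodes to the three bytes E2 82 AC.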
unsafeIconv()
iconv wrapper which mutes errors, but doesn't work around bugs.
public static unsafeIconv(string $in, string $out, string $text) : string
Parameters
- $in : string
- Input encoding
- $out : string
- Output encoding
- $text : string
- The text to convert
Return values
string
__construct()
Constructor throws fatal error if you attempt to instantiate class
private __construct() : mixed