~bohwaz/blog/

With real chunks of 2.0!

Garbage2xhtml - A lightweight XHTML cleaning library for PHP5

Plenty of software can clean HTML content, protect against XSS exploits and make the code comply with standards. Among all of it I did not find what I'm usually looking for: something clean, lightweight and built on PHP5 objects.

  • strip_tags is way too simple: it doesn't check for malicious attributes in tags, it doesn't check that the HTML is well-formed, etc.
  • Tidy is a great piece of software for transforming a complete mess of tags into something valid, but it doesn't have any security features.
  • HTMLPurifier, although it looks very interesting, is heavy (2MB just for the code!), has thousands of includes and is far too complicated for this task.
  • htmlLawed is a horrible piece of code which seems designed for PHP 3. It doesn't even use objects.
  • There are some other pieces of software, but I haven't found anything interesting among them.

So, some years ago I wrote a library called garbage2xhtml, designed to be PHP4-compatible (because web hosting providers are always slow to update their PHP version). Last year I wrote a new version, designed from scratch for the PHP5 object model, with a new approach to HTML parsing. The old library parsed the string character by character, with a lot of strpos and substr calls. This old-school approach was quite annoying to work with and didn't support nested tags. So in the rewrite I used a new approach, based on preg_split and building a complete DOM tree from the HTML string (inspired by this blog post). This way the processing is much faster and easier, and we know very quickly whether a tag isn't opened or closed properly.
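
To give an idea of the preg_split approach, here is a minimal sketch. It is not the library's actual code: the function name find_unbalanced_tags is made up for this example, and it uses a simple stack instead of a full DOM tree. It also assumes that attribute values never contain a ">" character.

```php
<?php
// Hypothetical helper for illustration, not part of garbage2xhtml's API.
// Splits an HTML string into tag and text tokens, then walks the tokens
// with a stack to report tags that are not opened or closed properly.
function find_unbalanced_tags($html)
{
    // PREG_SPLIT_DELIM_CAPTURE keeps the matched tags in the token list.
    $tokens = preg_split('/(<[^>]*>)/', $html, -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

    $stack = array();    // names of the tags currently open
    $problems = array(); // human-readable findings

    foreach ($tokens as $token) {
        // $m[1] is "/" for a closing tag, $m[3] is "/" for a self-closing one.
        if (!preg_match('/^<(\/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*?(\/?)>$/', $token, $m)) {
            continue; // plain text, or something that is not a tag
        }

        $name = strtolower($m[2]);

        if ($m[1] === '/') {
            if (!in_array($name, $stack, true)) {
                $problems[] = "</$name> closed but never opened";
                continue;
            }
            // Pop until the matching opener; everything above it was left open.
            while (($open = array_pop($stack)) !== $name) {
                $problems[] = "<$open> never closed";
            }
        } elseif ($m[3] !== '/') {
            $stack[] = $name; // opening tag (self-closing tags are ignored)
        }
    }

    // Whatever remains on the stack was never closed.
    foreach (array_reverse($stack) as $open) {
        $problems[] = "<$open> never closed";
    }

    return $problems;
}
```

Called on '<p>Hello <em>world</p><br /></div>', this sketch reports that <em> was never closed and that </div> was closed without being opened. A real parser would keep the text tokens and attach them to tree nodes instead of skipping them.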

The result is a new garbage2xhtml library which supports the following features:

  • Includes its own HTML parser, which does not require PHP DOM features and can be used independently if you wish.
  • No external dependency, only 25K / 831 lines of code
  • Sanitizes HTML strings by removing tags which are not properly opened or closed
  • Allows only tags that you specified, other tags are removed or escaped
  • Allows only attributes that you specified, others are removed
  • Outputs indented XHTML code that you can actually read
  • Protects against most XSS attacks: url and href attributes are cleaned and escaped, and only specified protocols are allowed
  • Works with every encoding supported by htmlspecialchars
  • Auto-transforms line breaks, like nl2br, but better! Double breaks will form new paragraphs, while single breaks will only form a <br /> tag
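
The line break feature in the last item can be pictured with a small sketch. This is only an illustration under simplifying assumptions, not the library's own code: the function name breaks_to_paragraphs is invented here, and the input is treated as plain text (so it is escaped), whereas the library works on a parsed tree.

```php
<?php
// Hypothetical helper for illustration, not part of garbage2xhtml's API.
// Double line breaks start a new paragraph, single ones become <br />.
function breaks_to_paragraphs($text)
{
    // Normalise Windows and old Mac line endings to "\n".
    $text = str_replace(array("\r\n", "\r"), "\n", trim($text));

    if ($text === '') {
        return '';
    }

    // Two or more consecutive line breaks separate paragraphs.
    $paragraphs = preg_split('/\n{2,}/', $text);

    $out = array();
    foreach ($paragraphs as $p) {
        // Escape first (plain text assumption), then convert single breaks.
        $p = htmlspecialchars($p, ENT_QUOTES, 'UTF-8');
        $out[] = '<p>' . str_replace("\n", "<br />\n", $p) . '</p>';
    }

    return implode("\n", $out);
}
```

With this sketch, "Hello\nworld\n\nSecond" gives one paragraph containing a <br /> followed by a second paragraph, where nl2br alone would only have produced a run of <br /> tags.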

Basically it's a very efficient and lightweight HTML sanitizer, which makes it easier for users to submit HTML text. For example, if you supply:

<blockquote>
Someone said one day...
</blockquote>

Garbage2xhtml will output:

<blockquote>
  <p>
     Someone said one day...
  </p>
</blockquote>

The <p> tags are added to make the code standards-compliant. But be sure to understand that garbage2xhtml is a dumb library: it works with configured tags, it's not Tidy, and all it knows about HTML is the tags you configure it to use. Here it is configured so that the contents of <blockquote> must be enclosed in <p> tags. G2X is not a miracle solution: it won't turn any HTML garbage into something that validates every time, it will just try its best to output XHTML which won't break your page validation. For example, it won't know that <div> can't be nested in <p> tags. If you really want valid XHTML code, my best advice is to combine G2X with Tidy.

As it's a 'dumb' library, it won't check for malicious code in <script> or <style> tags: it's not a CSS or ECMAScript parser. So you are advised not to allow those on your website, nor the onevent attributes (onclick, onload, etc.) and style attributes, which can carry malicious code or break your website design.
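
To illustrate the whitelist idea behind this advice, here is a minimal sketch. It is not the library's actual code: the function filter_attributes and its three parameters are made up for this example, and it assumes the attribute values have already been entity-decoded by the parser.

```php
<?php
// Hypothetical helper for illustration, not part of garbage2xhtml's API.
// Keeps only whitelisted attributes and rejects URLs with unlisted protocols.
function filter_attributes(array $attributes, array $allowed, array $protocols)
{
    $clean = array();

    foreach ($attributes as $name => $value) {
        $name = strtolower($name);

        // Anything not explicitly allowed is dropped: this removes onclick,
        // onload, style, etc. without having to list them one by one.
        if (!in_array($name, $allowed, true)) {
            continue;
        }

        if ($name === 'href' || $name === 'src') {
            // Remove whitespace and control characters, which are sometimes
            // used to disguise a scheme (e.g. "java<TAB>script:").
            $value = preg_replace('/[\x00-\x20]+/', '', $value);

            // If the value starts with "scheme:", the scheme must be listed.
            if (preg_match('/^([a-z][a-z0-9+.\-]*):/i', $value, $m)
                && !in_array(strtolower($m[1]), $protocols, true)) {
                continue; // disallowed protocol: drop the whole attribute
            }
        }

        $clean[$name] = htmlspecialchars($value, ENT_QUOTES, 'UTF-8');
    }

    return $clean;
}
```

For example, with 'href' and 'title' allowed and only http, https and mailto as protocols, a javascript: link and an onclick attribute are both dropped, while the title is kept with its quotes escaped. Relative URLs have no scheme and pass through.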

It's a great solution to let users enter HTML tags without worrying about unclosed tags which can break your design, XSS attacks or invalid code. In almost all cases (I can't guarantee that I tested all possible HTML garbage on earth), the worst that can happen is that the tags will be escaped and displayed "as is".

To date, this library is used on multiple websites, with tens of thousands of pieces of HTML content filtered by it, and I have had no problem with it. Every XSS attack I could think of failed to pass the filter, but maybe you will find one that does, and I'll be glad to hear about it.

You can try Garbage2xhtml output on this demo page and download the code.

BohwaZ

if(isset($GLOBALS['C'])){$reC = $GLOBALS['C'];} 
$GLOBALS['C'] = $C; 
$S = is_array($S) ? $S : hl_spec($S); 
if(isset($GLOBALS['S'])){$reS = $GLOBALS['S'];} 
$GLOBALS['S'] = $S;
(...)
// main 
$t = preg_replace_callback('`<(?:(?:\s|$)|(?:[^>]*(?:>|$)))|>`m', 'hl_tag', $t); 
$t = $C['balance'] ? hl_bal($t, $C['keep_bad'], $C['parent']) : $t; 
$t = (($C['cdata'] or $C['comment']) && strpos($t, "\x01") !== false) 
? str_replace(array("\x01", "\x02", "\x03", "\x04", "\x05"), array('', '', '&', '<', '>'), $t) : $t; 
$t = $C['tidy'] ? hl_tidy($t, $C['tidy'], $C['parent']) : $t;

Is there really a need to explain why this code is horrible to read?

Anon

None of us are developing the htmllawed code, and if the code is styled in some unorthodox manner, so what? It works fine and delivers.

BohwaZ

If it suits you, then why bother? It doesn't suit me because of its code, but maybe it suits you and that's fine too, I don't see any problem here.

Michael Steyn

Throughout the years of coding I've learnt to be tidy about my html.
Every time I am busy on somebody else's code, if it's not written properly I take some time to tidy it up, and then I feel much better about continuing to work on it.
Now that I have found your post, I am very intrigued to use your garbage2xhtml to see how it works and what I can gain from it. I will try it today and post the response later...
But sounds great though.