
Cleanup dirty HTML from a WYSIWYG editor

Author:
denis
Posted:
May 29, 2009
Language:
Python
Version:
1.0
Score:
1 (after 1 ratings)

My admin allows editing of some HTML fields using TinyMCE, so I end up with horrible code that contains lots of nested <p>, <div>, and <span> tags, plus style properties that destroy my layout and consistency.

This template tag, based on lxml, tries to remove as many unneeded tags and style properties as possible. The style properties that get stripped can be customized by adapting the regex to your needs.
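As a rough illustration of adapting the regex, the sketch below runs the snippet's pattern against a sample style string and compares it with a wider variant. The name `extended_regex`, the extra `background` and `color` alternatives, and the sample `style` value are assumptions made for this example, not part of the original snippet.

```python
import re

# The snippet's pattern: matches font*, padding*, margin* and line-height declarations.
css_cleanup_regex = re.compile(r'((font|padding|margin)(-[^:]+)?|line-height):\s*[^;]+;')

# Hypothetical wider variant that also strips background* and color.
# The lookbehind keeps "color" from matching inside names like "border-color".
extended_regex = re.compile(
    r'(?<![\w-])((font|padding|margin|background)(-[^:]+)?|line-height|color):\s*[^;]+;'
)

style = 'font-size: 12px; color: red; margin-left: 4px; text-align: center;'

# Only font-size and margin-left are removed; color and text-align survive.
default_result = css_cleanup_regex.sub('', style)

# color is removed as well; only text-align survives.
extended_result = extended_regex.sub('', style)

print(default_result.split())
print(extended_result.strip())
```

Note that both patterns require a trailing semicolon, so a final declaration written without one (for example `font-size: 12px`) is left untouched.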

from django.template import Library
from lxml import html, etree
import re

register = Library()

# Matches font*, padding*, margin* and line-height declarations in a style attribute.
css_cleanup_regex = re.compile(r'((font|padding|margin)(-[^:]+)?|line-height):\s*[^;]+;')

def _cleanup_elements(elem):
    """
    Removes empty elements from HTML (i.e. those without text inside).
    Note that this also drops text-less elements such as <img> or <br>.
    If the tag has a 'style' attribute, we remove the css attributes we don't want.
    """
    if elem.text_content().strip() == '':
        elem.drop_tree()
    else:
        if 'style' in elem.attrib:
            elem.attrib['style'] = css_cleanup_regex.sub('', elem.attrib['style'])
        # Iterate over a copy, since drop_tree() mutates the list of children.
        for sub in list(elem):
            _cleanup_elements(sub)

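@register.simple_tag
def cleanup_html(string):
    """
    Makes generated HTML (i.e. output from the WYSIWYG editor) look almost decent.
    """
    try:
        elem = html.fromstring(string)
        _cleanup_elements(elem)
        # encoding='unicode' makes tostring() return str rather than bytes.
        html_string = html.tostring(elem, encoding='unicode')
        # Strip trailing whitespace and drop blank lines.
        lines = [line.rstrip() for line in html_string.splitlines()]
        return '\n'.join(line for line in lines if line)
    except (etree.ParserError, etree.XMLSyntaxError):
        # Empty or unparseable input is returned unchanged.
        return string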
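
To see the effect outside of a Django template, the following minimal sketch applies the same two steps (dropping text-less elements and stripping unwanted style properties) to a small sample. The names `STYLE_RE` and `prune` and the `dirty` input are hypothetical and exist only for this example; it assumes lxml is installed.

```python
import re
from lxml import html

# Same pattern as the snippet's css_cleanup_regex.
STYLE_RE = re.compile(r'((font|padding|margin)(-[^:]+)?|line-height):\s*[^;]+;')

def prune(elem):
    """Drop children that contain no text; clean the style of the remaining ones."""
    # Iterate over a copy because drop_tree() removes children while we loop.
    for child in list(elem):
        if child.text_content().strip() == '':
            child.drop_tree()
        else:
            if 'style' in child.attrib:
                child.attrib['style'] = STYLE_RE.sub('', child.attrib['style'])
            prune(child)

# Hypothetical editor output: one styled paragraph and one empty nested paragraph.
dirty = '<div><p style="font-size: 12px; color: red;">Hello</p><p><span> </span></p></div>'

root = html.fromstring(dirty)
prune(root)
cleaned = html.tostring(root, encoding='unicode')
print(cleaned)
```

In this sample the empty second paragraph disappears together with its span, and the first paragraph keeps its `color` declaration while losing `font-size`.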


Comments

andybak (on May 31, 2009):

Why not just use TinyMCE's 'valid_elements' option to control which tags it allows?
