djangosnippets: Clean text pasted from word into RTE

Author:: rodrigoc
Posted:: June 7, 2009
Language:: Python
Version:: 1.0
Score:: 1 (after 1 ratings)

Utility function I am currently using to clean up text pasted from Word into a TinyMCE-enabled text field. The its argument sets how many extra cleaning passes are run.

import re

def clean_word(txt, its):
    for i in "font div span img hr table td tr".split():
        r=re.compile(r'</?%s[^>]*>' % i)
        txt = r.sub('',txt)
    for i in [
        r'<!--.*?<![^>]*>',
        r'<.--\[if [^>]*>.*?<.\[endif]-->',
        r'<style>.*?</style>',
        r'<(\w:[^>]*?)>.*</\1>',
        r'class=".*?"',
        r'<.--.*?-->',
        r'&lt;!--.*?--&gt;',
        #r'<p[^>]*>&nbsp;</p[^>]*>',
        #r'<p[^>]*>\s*</p[^>]*>',
        r"""align=["'][^"']*["']""",
        r"""style=["'][^"']*["']""",
        r'{mso-[^}]*}',
        r'<[^>]*>((&nbsp;)|\s*)</[^>]*>',
        ]:
        r=re.compile(i, re.DOTALL)
        txt = r.sub('',txt)
    if its>0:
        return clean_word(txt, its-1)
    r = re.compile(r'(<br\s?/?>\s*){1,9999}')
    txt = r.sub("</p><p>",txt)
    return txt

Comments

ewalstad (on June 8, 2009):

The regexes are recompiled every time the function is called. They don't change between calls to clean_word, so why not move them to the module level and compile them only once there?
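
A minimal sketch of that suggestion, not the author's code: the same patterns are compiled once at import time and reused on every call. The names _TAG_RES, _JUNK_RES and _BR_RUN_RE are hypothetical, the recursion is replaced by an equivalent loop, and the commented-out patterns and the duplicate "font" entry are left out.

```python
import re

# Tags whose opening and closing forms are stripped; their contents are kept.
_TAG_RES = [
    re.compile(r'</?%s[^>]*>' % tag)
    for tag in "font div span img hr table td tr".split()
]

# Word-specific debris, compiled once when the module is imported.
_JUNK_RES = [
    re.compile(pattern, re.DOTALL)
    for pattern in (
        r'<!--.*?<![^>]*>',
        r'<.--\[if [^>]*>.*?<.\[endif]-->',
        r'<style>.*?</style>',
        r'<(\w:[^>]*?)>.*</\1>',
        r'class=".*?"',
        r'<.--.*?-->',
        r'&lt;!--.*?--&gt;',
        r"""align=["'][^"']*["']""",
        r"""style=["'][^"']*["']""",
        r'{mso-[^}]*}',
        r'<[^>]*>((&nbsp;)|\s*)</[^>]*>',
    )
]

# A run of <br> tags becomes a paragraph break.
_BR_RUN_RE = re.compile(r'(<br\s?/?>\s*){1,9999}')


def clean_word(txt, its):
    # The original recursed its times after the first pass, so the
    # patterns are applied its + 1 times (once if its is zero or negative).
    for _ in range(max(its, 0) + 1):
        for r in _TAG_RES + _JUNK_RES:
            txt = r.sub('', txt)
    return _BR_RUN_RE.sub("</p><p>", txt)


sample = '<p class="MsoNormal"><span style="color:red">Hello</span></p>'
print(clean_word(sample, 1))  # prints: <p >Hello</p>
```

Python's re module also keeps an internal cache of recently compiled patterns, so the saving may be small in practice; the main gain is that the pattern lists are defined in one place.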
