
Clean text pasted from Word into RTE

Author:
rodrigoc
Posted:
June 7, 2009
Language:
Python
Version:
1.0
Score:
1 (after 1 ratings)

Utility function I am currently using to clean up text pasted from Word into a TinyMCE-enabled text field.

import re

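def clean_word(txt, its):
    """Strip Word markup from txt; the cleanup runs its + 1 times in total."""
    # Remove these tags (opening and closing) but keep whatever they wrap.
    for i in "font div span img hr table td tr".split():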
        r=re.compile(r'</?%s[^>]*>' % i)
        txt = r.sub('',txt)
    for i in [
        r'<!--.*?<![^>]*>',
        r'<.--\[if [^>]*>.*?<.\[endif]-->',
        r'<style>.*?</style>',
        r'<(\w:[^>]*?)>.*</\1>',
        r'class=".*?"',
        r'<.--.*?-->',
        r'&lt;!--.*?--&gt;',
        #r'<p[^>]*>&nbsp;</p[^>]*>',
        #r'<p[^>]*>\s*</p[^>]*>',
        r"""align=["'][^"']*["']""",
        r"""style=["'][^"']*["']""",
        r'{mso-[^}]*}',
        r'<[^>]*>((&nbsp;)|\s*)</[^>]*>',
        ]:
        r=re.compile(i, re.DOTALL)
        txt = r.sub('',txt)
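    # Run the whole cleanup again while passes remain; one pass can expose new matches.
    if its>0:
        return clean_word(txt, its-1)
    # Last pass only: turn each run of <br> tags into a paragraph break.
    r = re.compile(r'(<br\s?/?>\s*){1,9999}')
    txt = r.sub("</p><p>",txt)
    return txt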
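
The `its` argument makes the function rerun itself, so `clean_word(txt, 0)` does one pass and `clean_word(txt, 1)` does two. The likely reason is that removing one piece of markup can leave behind something that only then matches a pattern. The sketch below isolates the empty-element pattern from the list above to show the effect; `EMPTY_ELEMENT` and `strip_empty` are hypothetical names used only for this illustration.

```python
import re

# One of the snippet's patterns: an element whose only content is
# whitespace or a single &nbsp;.
EMPTY_ELEMENT = re.compile(r'<[^>]*>((&nbsp;)|\s*)</[^>]*>', re.DOTALL)


def strip_empty(html, passes):
    """Apply the empty-element pattern `passes` times (illustrative helper)."""
    for _ in range(passes):
        html = EMPTY_ELEMENT.sub('', html)
    return html


nested = '<p><b></b></p>'
print(repr(strip_empty(nested, 1)))  # '<p></p>': only the inner <b></b> is removed
print(repr(strip_empty(nested, 2)))  # '': the now-empty <p></p> goes on the second pass
```

Each extra level of nested empty elements needs one more pass, so the right value for `its` depends on how deeply Word nested its markup.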

Comments

ewalstad (on June 8, 2009):

The regexes are compiled every time the function is called. They don't change between calls to clean_word, so why not move them to the module level and compile them only once there?
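
A minimal sketch of that suggestion, assuming the same patterns as the snippet. The names `_TAGS`, `_JUNK` and `_BREAKS`, the loop in place of recursion, and the default `its=0` are choices made for this illustration, not part of the original.

```python
import re

# Compiled once at import time rather than on every call.
_TAGS = [re.compile(r'</?%s[^>]*>' % tag)
         for tag in "font div span img hr table td tr".split()]

_JUNK = [re.compile(pattern, re.DOTALL) for pattern in (
    r'<!--.*?<![^>]*>',
    r'<.--\[if [^>]*>.*?<.\[endif]-->',
    r'<style>.*?</style>',
    r'<(\w:[^>]*?)>.*</\1>',
    r'class=".*?"',
    r'<.--.*?-->',
    r'&lt;!--.*?--&gt;',
    r"""align=["'][^"']*["']""",
    r"""style=["'][^"']*["']""",
    r'{mso-[^}]*}',
    r'<[^>]*>((&nbsp;)|\s*)</[^>]*>',
)]

_BREAKS = re.compile(r'(<br\s?/?>\s*){1,9999}')


def clean_word(txt, its=0):
    """Run the cleanup its + 1 times, then convert <br> runs to paragraph breaks."""
    # A loop replaces the recursion; a negative its still gets one pass,
    # as in the original.
    for _ in range(max(its, 0) + 1):
        for regex in _TAGS + _JUNK:
            txt = regex.sub('', txt)
    return _BREAKS.sub('</p><p>', txt)


sample = '<p><span style="color:red">Hi</span><br/><br/>there</p>'
print(clean_word(sample, 1))  # <p>Hi</p><p>there</p>
```

Python's `re` module already keeps a cache of recently compiled patterns, so the speedup may be small in practice; the main benefit is that the patterns are defined in one place.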
