Thursday, December 22, 2005

Cleaning HTML generated by MS Word

During my last project, I used an HTML editor to let users submit content. I chose FreeTextBox (http://www.freetextbox.com).

I wanted to let users paste text copied from Word. However, I wanted to filter the HTML I received so that only the carriage returns were kept.

After some research, I decided to use regular expressions, and I found a very good code sample on a weblog that was doing something similar (HTML cleaning for CMS posts: http://tim.mackey.ie/CleanWordHTMLUsingRegularExpressions.aspx).

I worked out some regular expressions to get exactly what I needed for my project and here is the final method.

// Requires: using System.Text.RegularExpressions;

/// <summary>
/// Cleans the abstract text if copied from Word
/// </summary>
/// <param name="abstractText">Text of the abstract</param>
/// <returns>Cleaned text</returns>
private string CleanWordHtmlText(string abstractText)
{
    // Replace Word's empty paragraph markers with a custom string to preserve carriage returns
    abstractText = abstractText.Replace("<o:p></o:p>", "$##$BR$##$");

    // Drop potentially dangerous HTML tags
    abstractText = Regex.Replace(abstractText, @"</?(?i:script|embed|object|frameset|frame|iframe|meta|link|style)(.|\n)*?>", "");

    // Drop XML namespace references
    abstractText = Regex.Replace(abstractText, @"<?.(?i:xml)(.|\n)*?>", "");

    // Drop Word HTML tags.
    // The names are matched as prefixes, so short ones such as b, p and h
    // also remove longer tags that start with them (<br>, <pre>, <h1>, ...).
    abstractText = Regex.Replace(abstractText, @"</?(?i:font|del|ins|p|span|div|b|em|h|strong|st1|a|[ovwxp]:\w+)(.|\n)*?>", "");

    // Drop table HTML tags
    abstractText = Regex.Replace(abstractText, @"</?(?i:table|tbody|tr|th|td)(.|\n)*?>", "");

    // Replace the custom string with standard HTML line breaks
    abstractText = abstractText.Replace("$##$BR$##$", "<br/>");

    return abstractText;
}
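
To see the placeholder trick in isolation, here is a minimal sketch of the same idea as a small console program. The class name WordCleanupDemo, the method StripWordMarkup and the sample input are made up for this illustration, and the pattern is a shortened version of the ones above, not the full cleaning method.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordCleanupDemo
{
    // Hypothetical, trimmed-down variant of the cleaning method, for illustration only.
    public static string StripWordMarkup(string html)
    {
        // Protect Word's empty paragraph markers before any tags are stripped.
        html = html.Replace("<o:p></o:p>", "$##$BR$##$");

        // Remove span, font and Office-namespaced tags but keep their inner text.
        // (.|\n)*? lazily consumes attributes, even when a tag spans several lines.
        html = Regex.Replace(html, @"</?(?i:span|font|[ovwxp]:\w+)(.|\n)*?>", "");

        // Turn the placeholders back into HTML line breaks.
        return html.Replace("$##$BR$##$", "<br/>");
    }

    public static void Main()
    {
        // Assumed sample of what a paste from Word might look like.
        string pasted =
            "<span style=\"font-family:Arial\">First line<o:p></o:p></span>" +
            "<FONT\n size=2>Second line</FONT>";

        Console.WriteLine(StripWordMarkup(pasted));
        // prints: First line<br/>Second line
    }
}
```

The placeholder has to be swapped in first: the [ovwxp]:\w+ alternative would otherwise strip the <o:p> tags and the line breaks would be lost with them.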