Thursday, December 22, 2005

Cleaning HTML generated by MS Word

During my last project, I was using an HTML editor to let users submit content. I used FreeTextBox (http://www.freetextbox.com).

I wanted to let users paste their text from Word. However, I wanted to filter the HTML I received so that only the carriage returns were kept.

After some research, I decided to use regular expressions, and I found a very good code sample on a weblog that did something similar (HTML cleaning for CMS posts: http://tim.mackey.ie/CleanWordHTMLUsingRegularExpressions.aspx).

I worked out some regular expressions to get exactly what I needed for my project and here is the final method.

// Requires: using System.Text.RegularExpressions;

/// <summary>
/// Cleans the abstract text if copied from Word
/// </summary>
/// <param name="abstractText">Text of the abstract</param>
/// <returns>Cleaned text</returns>
private string CleanWordHtmlText(string abstractText)
{
    // Replace Word's empty paragraph markers with a custom string to preserve carriage returns
    abstractText = abstractText.Replace("<o:p></o:p>", "$##$BR$##$");

    // Drop potentially dangerous HTML tags
    abstractText = Regex.Replace(abstractText, @"</?(?i:script|embed|object|frameset|frame|iframe|meta|link|style)(.|\n)*?>", "");

    // Drop XML namespace references
    abstractText = Regex.Replace(abstractText, @"<?.(?i:xml)(.|\n)*?>", "");

    // Drop Word HTML tags
    abstractText = Regex.Replace(abstractText, @"</?(?i:font|del|ins|p|span|div|b|em|h|strong|st1|a|[ovwxp]:\w+)(.|\n)*?>", "");

    // Drop table HTML tags
    abstractText = Regex.Replace(abstractText, @"</?(?i:table|tbody|tr|th|td)(.|\n)*?>", "");

    // Replace the custom string with standard HTML line breaks
    abstractText = abstractText.Replace("$##$BR$##$", "<br/>");

    return abstractText;
}
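
To try the idea in isolation, here is a minimal, self-contained sketch of the same three steps: protect the empty Word paragraphs with a placeholder, strip the unwanted tags, then turn the placeholder into line breaks. It is not the method from my project. The class name WordHtmlCleanerDemo, the shortened tag list and the sample input are made up for illustration, and the pattern is a stricter variant (an assumption on my part) that uses \b and [^>]* so that a short name such as "p" does not also remove <pre>.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordHtmlCleanerDemo
{
    // Placeholder for line breaks; assumed never to occur in real input.
    private const string BreakMarker = "$##$BR$##$";

    public static string Clean(string html)
    {
        // 1. Protect Word's empty paragraph markers before stripping tags.
        html = html.Replace("<o:p></o:p>", BreakMarker);

        // 2. Strip a small, illustrative set of opening and closing tags.
        //    \b stops "p" from matching "<pre>"; [^>]* stays inside one tag.
        html = Regex.Replace(
            html,
            @"</?(?:span|font|p|[ovwxp]:\w+)\b[^>]*>",
            "",
            RegexOptions.IgnoreCase);

        // 3. Turn the protected markers into real HTML line breaks.
        return html.Replace(BreakMarker, "<br/>");
    }

    public static void Main()
    {
        // Hypothetical fragment in the style of HTML pasted from Word.
        string input =
            "<p class=MsoNormal><span style='color:red'>Hello</span><o:p></o:p></p>" +
            "<p class=MsoNormal>World<o:p></o:p></p>";

        Console.WriteLine(Clean(input)); // prints: Hello<br/>World<br/>
    }
}
```

The order matters: if the tags were stripped first, the <o:p></o:p> pairs would be removed with everything else and the line breaks would be lost.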

Thursday, September 01, 2005

Yet another blogger

I have been thinking about creating a blog for a while now, but I was not sure what to write... There are lots of blogs out there, and many of them are just boring (you may think the same of mine). I did not want to blog about myself. Blogging about technologies? Why not, but there are already so many...


Finally, I decided to post technical tips to share my experience and some of my knowledge, either to help others find solutions or just to build my own knowledge base.


Don't hesitate to leave a comment or to contact me if you have any questions.