I wanted to let users copy their text from Word. However, I wanted to "filter" the HTML I was getting so that only the carriage returns were kept.
After some research, I decided to use regular expressions, and I found a very good code sample on a weblog that was doing something similar (HTML cleaning for CMS posts: http://tim.mackey.ie/CleanWordHTMLUsingRegularExpressions.aspx).
I worked out some regular expressions to get exactly what I needed for my project, and here is the final method.
// Requires: using System.Text.RegularExpressions;
/// <summary>
/// Cleans the abstract text if copied from Word
/// </summary>
/// <param name="abstractText">Text of the abstract</param>
/// <returns>Cleaned Text</returns>
private string CleanWordHtmlText(string abstractText)
{
    //Replace Word's empty <o:p></o:p> paragraph markers with a custom string to preserve carriage returns
    abstractText = abstractText.Replace("<o:p></o:p>", "$##$BR$##$");
    //Drop potentially dangerous HTML tags
    abstractText = Regex.Replace(abstractText, @"</?(?i:script|embed|object|frameset|frame|iframe|meta|link|style)(.|\n)*?>", "");
    //Drop XML namespace references
    abstractText = Regex.Replace(abstractText, @"<?.(?i:xml)(.|\n)*?>", "");
    //Drop Word HTML tags
    abstractText = Regex.Replace(abstractText, @"</?(?i:font|del|ins|p|span|div|b|em|h|strong|st1|a|[ovwxp]:\w+)(.|\n)*?>", "");
    //Drop table HTML tags
    abstractText = Regex.Replace(abstractText, @"</?(?i:table|tbody|tr|th|td)(.|\n)*?>", "");
    //Replace custom string with standard HTML carriage returns
    abstractText = abstractText.Replace("$##$BR$##$", "<br/>");
    return abstractText;
}
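To see what the method does with a typical paste, here is a minimal, self-contained sketch. It assumes the tag names in each pattern are alternatives separated by `|`. The class name `WordHtmlDemo`, the `TagPatterns` array, and the sample input are made up for illustration and are not part of my project.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper class, only here so the sketch compiles on its own.
public static class WordHtmlDemo
{
    // Placeholder that stands in for a line break while tags are stripped.
    private const string BreakMarker = "$##$BR$##$";

    // Same four patterns as the method above, assuming "|" between tag names.
    private static readonly string[] TagPatterns =
    {
        @"</?(?i:script|embed|object|frameset|frame|iframe|meta|link|style)(.|\n)*?>",
        @"<?.(?i:xml)(.|\n)*?>",
        @"</?(?i:font|del|ins|p|span|div|b|em|h|strong|st1|a|[ovwxp]:\w+)(.|\n)*?>",
        @"</?(?i:table|tbody|tr|th|td)(.|\n)*?>",
    };

    public static string CleanWordHtmlText(string abstractText)
    {
        // Protect Word's empty paragraph markers before any tag is removed.
        abstractText = abstractText.Replace("<o:p></o:p>", BreakMarker);

        // Strip every tag matched by the patterns, in the original order.
        foreach (string pattern in TagPatterns)
        {
            abstractText = Regex.Replace(abstractText, pattern, "");
        }

        // Turn the surviving placeholders into HTML line breaks.
        return abstractText.Replace(BreakMarker, "<br/>");
    }

    public static void Main()
    {
        // Made-up fragment in the style of what Word puts on the clipboard.
        string pasted =
            "<p class=MsoNormal><span style='font-size:12pt'>First line<o:p></o:p></span></p>" +
            "<p class=MsoNormal><b>Second</b> line<o:p></o:p></p>";

        Console.WriteLine(CleanWordHtmlText(pasted));
        // prints: First line<br/>Second line<br/>
    }
}
```

Two things to keep in mind. The patterns remove tags only, so any text between removed tags (the body of a script element, for example) stays in the output. Also, the tag names are not followed by a word boundary, so `b` matches `<br>` and `p` matches `<pre>` as well; that is fine for my case, since only the placeholder line breaks are supposed to survive.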