Remove certain code and text with regular expressions
Hi
I have an HTML file that was generated by MS Word, and I'm removing all of the unnecessary code. One of the things I want to remove is any span tag that contains the text "mso-hide" (without the quotes).
If my program runs into this code:
<span
style='color:windowtext;display:none;mso-hide:screen;text-decoration:none;
text-underline:none'>3</span>
then I want to remove the whole span tag, including its contents (in this case, 3). The contents of this span tag will always be a one- to three-digit number.
I've tried many patterns without success. Any ideas out there?
Thanks
(Moderator: Code can be placed via the code snippet button { } in the editor.)
[1035 byte] By [sensfan] at [2008-1-4]
Assuming you use the .NET Regex class, this should match your span tags:
(?si)<span[^>]*?mso-hide[^>]*?>\d{1,3}</span>
Then replace each match with an empty string.
Tested OK in Expresso v2.1.
// using System.Text.RegularExpressions;
/// <summary>
/// Regular expression built for C# on: Thu, May 24, 2007, 07:08:31 PM
/// Using Expresso Version: 2.1.2150, http://www.ultrapico.com
///
/// A description of the regular expression:
///
/// Change options within the enclosing group [si]
/// Turn ON Ignore Case option
/// Turn ON Single Line option
/// <span[^>]*?mso-hide[^>]*?>\d{1,3}</span>
///     <span
///     Any character other than >, any number of repetitions, as few as possible
///     mso-hide
///     Any character other than >, any number of repetitions, as few as possible
///     >
///     Any digit, between 1 and 3 repetitions
///     </span>
///
///
/// </summary>
public static Regex regex = new Regex(
@"(?si)<span[^>]*?mso-hide[^>]*?>\d{1,3}</span>");
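The "replace the match with an empty string" step is not shown in the generated code, so here is a minimal sketch of it using Regex.Replace. The class name MsoHideCleaner and the method name RemoveHiddenSpans are made up for illustration and are not part of the original reply. The pattern is the same one as above. Since it contains no "." the Single Line option should make no difference here, because [^>] already matches line breaks.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class; the name is an assumption, not from the thread.
public static class MsoHideCleaner
{
    // Same pattern as in the reply: a span whose opening tag contains
    // "mso-hide" and whose content is a one- to three-digit number.
    private static readonly Regex HiddenSpan = new Regex(
        @"(?si)<span[^>]*?mso-hide[^>]*?>\d{1,3}</span>");

    // Returns the input with every matching span (tag and digits) removed.
    public static string RemoveHiddenSpans(string html)
    {
        return HiddenSpan.Replace(html, string.Empty);
    }
}
```

Passing the contents of the Word-generated file to MsoHideCleaner.RemoveHiddenSpans should return the same HTML without the hidden spans. Spans without "mso-hide" in the opening tag, or with non-numeric content, are left untouched.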