To regex - or not to regex?

Hi,

I have yet another regex query - this time regarding the retrieval of a URL from HTML. I have a string called HTML, with a number of links in it. The format for these links is something like:

<a href="/index/welcome.php?action=one&amp;section=27">one</a>

Now, I only want the link which is called one (in the link text - so between the > and the </a>), and from that link I want everything between the " marks. Can anyone help me do this? Thanks

[679 byte] By [Martinp23] at [2008-2-17]
# 1
Hi,
something like this should do it:
\<a\s+href\s*=\s*\"(.*)\"\s*\>(.*)\<\/a\>

Hope it helps.

n0n4m3 at 2007-8-31 > top of Msdn Tech,.NET Development,Regular Expressions...
# 2

Hi

That didn't work I'm afraid - it outputs the whole HTML string. I'll give a better example of what I want:

I may have a link like this: <a href="/index/page.php?title=Welcome234" title="Edit section: Welcome">edit</a>

There may be hundreds of links like this in the HTML string, but what is different in each of them is the link itself and the link title. The program gets user input (a string called "find") for the title to find: the user inputs the part of the title after the "Edit section: " (note the space), and the regex should output the link for that title. How can I do this?

Thanks

Martinp23 at 2007-8-31
# 3

You could use regex, however it is expensive. If you are going to be doing this kind of thing throughout your application then yes, use Regex; otherwise you could use the String.Substring method:

// Everything from the first '>' onward (Substring throws if IndexOf returns -1).
string theString = theHtmlString.Substring(theHtmlString.IndexOf(">"));

Untested, but it should be thereabouts.
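
For the "Edit section: " case described in post #2, the Substring idea could be sketched roughly as below. This is only a sketch under assumptions: the names LinkFinder and FindHref are made up for illustration, attributes are assumed to be double-quoted, and the href attribute is assumed to come before the title attribute in the same tag, as in the example link.

```csharp
using System;

static class LinkFinder
{
    // Hypothetical helper: returns the href of the link whose title is
    // "Edit section: <find>", or null when no such link is found.
    public static string FindHref(string html, string find)
    {
        // Locate the exact title attribute the user asked for.
        string titleAttr = "title=\"Edit section: " + find + "\"";
        int titlePos = html.IndexOf(titleAttr, StringComparison.Ordinal);
        if (titlePos < 0) return null;

        // Search backwards from the title for the nearest href attribute
        // (assumption: it belongs to the same <a> tag).
        const string hrefStart = "href=\"";
        int hrefPos = html.LastIndexOf(hrefStart, titlePos, StringComparison.Ordinal);
        if (hrefPos < 0) return null;

        // The link is everything up to the closing quote.
        int valueStart = hrefPos + hrefStart.Length;
        int valueEnd = html.IndexOf('"', valueStart);
        if (valueEnd < 0) return null;

        return html.Substring(valueStart, valueEnd - valueStart);
    }
}
```

Called as LinkFinder.FindHref(HTML, find), this returns the text between the quotes of the href, still HTML-encoded (so any &amp; in the link stays as &amp;).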

ahmedilyas at 2007-8-31
# 4
From your posts, Martin, I think you're overusing regex. I HATE regex; it's so unnecessarily hard when Substring, TrimStart and TrimEnd could all work. Just trim the "a href" and "/a" and you're done. Or, as ahmedilyas said, use Substring.

One day a few weeks back, I needed to do a similar thing, and spent more than an hour trying to figure out how to get regex to work. Then, as I was doing some other stuff, I saw Substring, used that instead, and it worked perfectly fine.

GeekSquad at 2007-8-31
# 5

Well, regex is used for pattern searching, as you may know, and is the proper way of doing things in the long run, but it is expensive (can you blame it?), with its special meanings, pattern keywords, etc. It is good but hard; this is where you need to practice and read about it ;-)

ahmedilyas at 2007-8-31
# 6

This pattern should work.

\<a[^>]*\>([^<]*)\<\/a\>

It is defined as:

<a
Any character not in ">" zero or more times
>
Capture
Any character not in "<" zero or more times
End Capture
</a>
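
To get the link itself rather than the link text, the same idea can be extended with a second capture group around the href value. The following is a rough sketch, not a tested solution: LinkRegex and FindHrefByText are hypothetical names, and the pattern assumes double-quoted attributes and link text with no nested tags. Because it uses negated character classes such as [^"]* and [^>]* instead of .*, a match cannot run past the end of one tag into the rest of the HTML.

```csharp
using System;
using System.Text.RegularExpressions;

static class LinkRegex
{
    // Hypothetical helper: returns the href of the first link whose
    // visible text equals linkText, or null when there is none.
    public static string FindHrefByText(string html, string linkText)
    {
        // Group 1 = href value, group 2 = text between > and </a>.
        const string pattern =
            "<a\\s[^>]*?href\\s*=\\s*\"([^\"]*)\"[^>]*>([^<]*)</a>";

        foreach (Match m in Regex.Matches(html, pattern, RegexOptions.IgnoreCase))
        {
            if (m.Groups[2].Value.Trim() == linkText)
                return m.Groups[1].Value;
        }
        return null;
    }
}
```

With the first example from the question, LinkRegex.FindHrefByText(HTML, "one") would be expected to pick out the href of the link whose text is "one". To match on the title attribute instead, the user's input should go through Regex.Escape before being concatenated into a pattern.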

JamesCurran at 2007-8-31
# 7

Hi

Thanks for all your help - I've used substring this time though - because I understand it!

Martinp23 at 2007-8-31
