Explain Match Behavior
I'm having a problem understanding how Regex.Match interprets an extremely trivial match problem.
If I'm searching the string "ban" with the pattern b*, the matched string is b as one would expect. But if I change the pattern to a*, there is no match.
Now I would have thought that since .NET Regex is supposed to do a greedy match it would match the first 'a' in ban. The only way I can explain the behavior is that the match engine sees the first 'b' and says "this is a match because the pattern says match zero or more a's". Since zero is a match it returns zero matches.
But if my interpretation is correct, why does the pattern ba* return ba as the match and not just b?
Hello NedHamilton,
Thank you for your post. The Regex engine returns the 'first match' instead of the 'longest match'. In .NET, a regular expression pattern is evaluated starting at the first character of the input string. If no match is found there, the expression is re-evaluated starting at the next input character. This process continues until either a match is found or the end of the input string is reached.
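The scanning process described above can be sketched roughly as follows. This is only an illustration in Python, not how the .NET engine is actually implemented; the helper name first_match is made up for this example, and Python's re module is used as a stand-in because it follows the same "try each start position in order" rule.

```python
import re

def first_match(pattern, text):
    """Hypothetical helper: try the pattern at position 0, then 1, and so on.

    Returns (start_index, matched_text) for the first position where the
    pattern succeeds, or None if no position works.
    """
    compiled = re.compile(pattern)
    # len(text) + 1 because an empty match at the very end is also possible.
    for start in range(len(text) + 1):
        # match() only succeeds if the match begins exactly at `start`.
        m = compiled.match(text, start)
        if m is not None:
            return start, m.group()
    return None

print(first_match("a+", "ban"))  # (1, 'a'): fails at 'b', succeeds at the next character
print(first_match("z", "ban"))   # None: end of input reached without a match
```

Note that the loop stops at the first position that succeeds; it never compares candidates from later positions.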
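The greedy '*' operator consumes as much as possible at the current location in the input string. If a match is found, Regex returns that match, even if the '*' operator did not consume any characters as part of the match. That is why a* against "ban" succeeds at the 'b' with an empty (zero-length) match instead of moving on to the 'a'. With ba*, the 'b' has to match first; a* is then applied at the position right after it and greedily consumes the 'a', so the result is ba.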
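Here is a small sketch of the patterns from the question run against "ban". It uses Python's re.search as a stand-in, on the assumption that it behaves the same way as Regex.Match for these simple patterns; in .NET the corresponding values would come from Match.Index and Match.Value.

```python
import re

text = "ban"
results = {}

# a+ is included only for contrast: it requires at least one 'a'.
for pattern in ("b*", "a*", "ba*", "a+"):
    m = re.search(pattern, text)
    results[pattern] = (m.start(), m.group())
    print(f"{pattern!r}: index={m.start()} value={m.group()!r}")

# 'b*':  index=0 value='b'
# 'a*':  index=0 value=''    (successful, but zero-length)
# 'ba*': index=0 value='ba'
# 'a+':  index=1 value='a'
```

If the intent is "find the a's in the string", a+ is probably the pattern you want, since it cannot succeed with zero characters.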
In most cases, returning the first match instead of the longest match is the desired behavior. Consider the case where you are parsing a markup language:
regular expression: “<a>[\w ]*</a>”
input string: “blah blah <a></a> blah blah <a>Hello World</a>”
In this case the regular expression returns the shorter, first match: “<a></a>”. It won't skip ahead to find the longer match “<a>Hello World</a>”.
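The markup case can be tried out with a sketch like the one below, again in Python as a stand-in for .NET. The pattern here uses [\w ]* so that the space in "Hello World" is allowed (\w on its own does not match a space). Picking the longest match afterwards is just one possible approach for illustration; in .NET, Regex.Matches plays the role that finditer plays here.

```python
import re

pattern = re.compile(r"<a>[\w ]*</a>")
text = "blah blah <a></a> blah blah <a>Hello World</a>"

# search() stops at the first position where the pattern succeeds.
first = pattern.search(text)
print(first.start(), first.group())  # 10 <a></a>

# To see later matches, enumerate all of them and choose explicitly.
all_matches = [m.group() for m in pattern.finditer(text)]
print(all_matches)                   # ['<a></a>', '<a>Hello World</a>']

longest = max(all_matches, key=len)
print(longest)                       # <a>Hello World</a>
```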
Thanks,
Josh Free
Base Class Library Development