Why Would This Regex Return An Error?

January 31, 2023 Post a Comment

Why does the following evaluate to true? if(preg_match_all('%.*?.*?.*?%ims', $contents, $x)===FALSE) {...} $contents, is retrieved using

Solution 1:

I'm adding this second answer in response to the new information you added since I posted the first one. My goal there was to help you restore your system to its previous state, when the regexes were working. I tend to agree with the commenter on that page I linked to, who said the default setting was overly conservative. So I stand by that answer, but I don't want anyone to think they can solve all their regex problems by throwing more memory at them.

Now that I've seen your real-world regexes, I have to say you have another problem. I tested that third regex against the page you linked to in RegexBuddy and these are the results I got:

(?ims)<tr.*?>.*?<b>.*?first science area of study.*?</b>.*?</tr>.*?<tr.*?>.*?<td.*?>.*?<b>(.*?) \((.*?)\).*?</b>(.*?credits.*?)</td>.*?<td.*?>(.*?<a .*?)</td>.*?</tr>

          course name       start      end    steps
Match #1  (Comp. Sci.)        10       275    31271
Match #2  (Bio & Chem)       276       341     6986
Match #3  (Enviro)           342       379     5944
Match #4  (Genetics)         386       416     4463
Match #5  (Chem)             417       455     5074
Match #6  (Math)             495       546    15610
Match #7  (Phys & Astro)     547       593     8617
Match #8  (no match)        gave up after 1,000,000 steps

You've probably heard many people say that non-greedy regexes always return the shortest possible match, so why did this one return a first match that's 200 lines longer than any of the others? You may have heard that they're more efficient because they don't backtrack as much, so why did this one take over 30,000 steps to complete the first match, and why did it effectively lock up on the final attempt when no match was possible?

First off, there's no such thing as a greedy or non-greedy regex. Only individual quantifiers can be described that way. A regex in which every quantifier is greedy won't necessarily return the longest possible match, and the name "non-greedy regex" is even less accurate. Greedy or non-greedy, the regex engine always starts trying to match at the earliest opportunity, and it doesn't give up on a starting position until every possible path from it has been explored.

Non-greedy quantifiers are only a convenience; there's nothing magical about them. It's still up to you, the regex author, to guide the regex engine to a correct and efficient match. Your regex may returning the correct results, but it's wasting a hell of a lot of effort in the process. It consumes a lot of characters that it doesn't need to at the beginning, it thrashes about endlessly examining the same characters over and over, and it takes way too long to figure out when the path it's on can't lead to a match.

Now check out this regex:

(?is)<tr[^<]*(?:<(?!/tr>|b>)[^<]*)*<b>\s*first science area of study\s*</b>.*?</tr>.*?<tr.*?>.*?<td.*?>.*?<b>(.*?) \((.*?)\).*?</b>(.*?credits.*?)</td>.*?<td.*?>(.*?<a .*?)</td>.*?</tr>

          course name       start      end    steps
Match #1  (Comp. Sci.)       209       275     9891
Match #2  (Bio & Chem)       276       341     5389
Match #3  (Enviro)           342       379     5833
Match #4  (Genetics)         386       416     4222
Match #5  (Chem)             417       455     4961
Match #6  (Math)             495       546     9899
Match #7  (Phys & Astro)     547       593     8506
Match #8  (no match)        reported failure in 139 steps

After the first </b>, everything is as you wrote it. The effect of my changes is that it doesn't start matching in earnest until it finds the <TR> element that contains the first <B> tag we're interested in:

<tr[^<]*(?:<(?!/tr>|b>)[^<]*)*<b>\s*first science area of study\s*</b>

This part spends most of its time greedily consuming characters with [^<]*, which is significantly faster, character for character, than a non-greedy .*?. But far more important is that it takes practically no time to figure out when no more matches are possible. If there's a Golden Rule of regex performance it's this: when a match attempt is going to fail, it should fail as quickly as possible.

Solution 2:

You might check your pcre.backtrack_limit setting. It would have to be ridiculously low to prevent that regex from matching that input, but you did say you'd been messing around with the setup...

You can try testing it by changing the regex. When I tested it in RegexBuddy, your regex matched that input in 1216 steps. When I changed it to this:

'%<tr.*?>.*?<b>.*?</b>[^<]*(?:<(?!/?tr\b)[^<]*)*</tr>%ims'

...it only took 441 steps.

Html5 School

Why Would This Regex Return An Error?

Solution 1:

Solution 2:

Post a Comment for "Why Would This Regex Return An Error?"