[BUG] Help. What are limits of RegEx?

English support forum

MarkFilipak
Member
Posts: 164
Joined: 2008-09-28, 01:00 UTC
Location: Mansfield, Ohio

Post by *MarkFilipak »

UPDATE:
This: \x00\x00 RegEx (2) succeeds.
This: \x00\x00\x01 RegEx (2) fails.

What is going on with RegEx?

MORE UPDATE:
It looks like RegEx is totally broken.
====

How can I get this regex to work?
\x00\x00\x01\xB5(\x11.|\x12.|\x13.|\x14[^\x82]|\x15.|\x16.|\x17.|\x18.|\x19.|\x1A.|\x1B.|\x1C.|\x1D.|\x1E.|\x1F.)

This: \x00\x00\x01\xB5 RegEx (2) fails.
This: \x00\x00\x01\xB5 Hex succeeds.
This: \x00\x00\x01\xB5. Hex succeeds but selects only 4 bytes (the '.' is ignored).
This: \x00\x00\x01\xB5(\x11.|\x12.|\x13.|\x14[^\x82]|\x15.|\x16.|\x17.|\x18.|\x19.|\x1A.|\x1B.|\x1C.|\x1D.|\x1E.|\x1F.) RegEx (2) fails.
This: \x00\x00\x01\xB5(\x11.|\x12.|\x13.|\x14[^\x82]|\x15.|\x16.|\x17.|\x18.|\x19.|\x1A.|\x1B.|\x1C.|\x1D.|\x1E.|\x1F.) Hex fails.

What am I doing wrong?

Thanks so much,
Mark.

Of course you want to know what this is about, eh? I'm searching the binary contents of DVDs (VOB files) looking for particular 'sequence_extension' metadata.

I have yet to see a DVD that has this: 0x00 00 01 B5 1? ??, where '? ??' is other than '4 82' (i.e. MP@ML plus !progressive_sequence plus 4:2:0). (Note '<<== all DVDs?' in the table, below.) The above regex performs such a search and reports 0x00 00 01 B5 1 plus something ('? ??'), but not '4 82'.

The pattern is the 'sequence_extension' header ID metadata followed by 'profile_and_level_indication' -- the combinations are shown in the table, below -- which is followed by 'progressive_sequence' followed by 'chroma_format'.

0x00 00 01 B5 11 2 High@HighP
0x00 00 01 B5 11 4 High@High
0x00 00 01 B5 11 6 High@High1440
0x00 00 01 B5 11 8 High@Main
0x00 00 01 B5 11 A High@Low
0x00 00 01 B5 12 2 SpatiallyScalable@HighP
0x00 00 01 B5 12 4 SpatiallyScalable@High
0x00 00 01 B5 12 6 SpatiallyScalable@High1440
0x00 00 01 B5 12 8 SpatiallyScalable@Main
0x00 00 01 B5 12 A SpatiallyScalable@Low
0x00 00 01 B5 13 2 SNRScalable@HighP
0x00 00 01 B5 13 4 SNRScalable@High
0x00 00 01 B5 13 6 SNRScalable@High1440
0x00 00 01 B5 13 8 SNRScalable@Main
0x00 00 01 B5 13 A SNRScalable@Low
0x00 00 01 B5 14 2 Main@HighP
0x00 00 01 B5 14 4 Main@High
0x00 00 01 B5 14 6 Main@High1440
0x00 00 01 B5 14 8 Main@Main <<== all DVDs?
0x00 00 01 B5 14 A Main@Low
0x00 00 01 B5 15 2 Simple@HighP
0x00 00 01 B5 15 4 Simple@High
0x00 00 01 B5 15 6 Simple@High1440
0x00 00 01 B5 15 8 Simple@Main
0x00 00 01 B5 15 A Simple@Low
0x00 00 01 B5 18 E Multi-view@Low
0x00 00 01 B5 18 D Multi-view@Main
0x00 00 01 B5 18 B Multi-view@High1440
0x00 00 01 B5 18 A Multi-view@High
0x00 00 01 B5 18 5 4:2:2@Main
0x00 00 01 B5 18 2 4:2:2@High
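
To make the nibble layout in the listing concrete, here is a minimal Python sketch that decodes the two bytes following the 00 00 01 B5 start code. The function name and the lookup tables are illustrative assumptions (the tables are transcribed from the listing, the chroma_format codes are the usual MPEG-2 ones); it is not taken from any DVD tool.

```python
# Lookup tables based on the listing above; anything missing decodes as "?".
PROFILES = {1: "High", 2: "SpatiallyScalable", 3: "SNRScalable", 4: "Main", 5: "Simple"}
LEVELS = {0x2: "HighP", 0x4: "High", 0x6: "High1440", 0x8: "Main", 0xA: "Low"}
# chroma_format codes (0 is reserved).
CHROMA = {1: "4:2:0", 2: "4:2:2", 3: "4:4:4"}

START_CODE = b"\x00\x00\x01\xB5"


def decode_sequence_extension(data, pos=0):
    """Decode profile/level, progressive_sequence and chroma_format.

    Returns None if data[pos:] does not start with a sequence_extension.
    """
    if len(data) < pos + 6 or data[pos:pos + 4] != START_CODE:
        return None
    b4, b5 = data[pos + 4], data[pos + 5]
    if b4 >> 4 != 0x1:  # extension ID 1 marks a sequence_extension
        return None
    # profile_and_level_indication straddles the two bytes.
    indication = ((b4 & 0x0F) << 4) | (b5 >> 4)
    if indication & 0x80:  # escape bit set: rows such as '18 E' in the listing
        profile_level = "escape 0x%02X" % indication
    else:
        profile_level = "%s@%s" % (
            PROFILES.get(indication >> 4, "?"),
            LEVELS.get(indication & 0x0F, "?"),
        )
    return {
        "profile_level": profile_level,
        "progressive_sequence": (b5 >> 3) & 1,
        "chroma_format": CHROMA.get((b5 >> 1) & 3, "reserved"),
    }


print(decode_sequence_extension(bytes.fromhex("000001B51482")))
```

For the '1 4 82' case this reports Main@Main, progressive_sequence 0 and 4:2:0, which is the combination described above.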
Last edited by MarkFilipak on 2020-10-12, 16:40 UTC, edited 1 time in total.
Hi Christian! Delighted customer since 1999. License #37627

Re: Help. What are limits of RegEx?

Post by *MarkFilipak »

Thanks for the bad news. Well, that makes RegEx a sham, doesn't it?
gdpr deleted 6
Power Member
Posts: 872
Joined: 2013-09-04, 14:07 UTC

Re: Help. What are limits of RegEx?

Post by *gdpr deleted 6 »

MarkFilipak wrote: 2020-10-12, 16:23 UTC
Thanks for the bad news. Well, that makes RegEx a sham, doesn't it?
No, it's not a sham. Why would it be a sham? A limitation makes TC's regex limited, but not a sham.
Less angry hyperbole, please...

Re: Help. What are limits of RegEx?

Post by *MarkFilipak »

Actually, what you wrote is incorrect. It's not that RegEx can't find '\x00'. It does.

The problem is that when it finds '\x00', it selects the '\x00' plus the next byte (as though it had searched for '\x00.'). The search index is now off by 1 byte, so the remainder of the search string fails (even though the target does exist).
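
For comparison, this is the behaviour one would expect from an engine that treats its input as raw bytes. Python's `re` module is used here purely as a reference point; the sample data is made up, and the sketch says nothing about how TC's regex engine is implemented internally.

```python
import re

# Made-up sample: one leading byte, then a sequence_extension header.
data = b"\x10\x00\x00\x01\xB5\x14\x82"

# A lone \x00 should consume exactly one byte...
single = re.search(rb"\x00", data)
print(single.span())

# ...so that the rest of the pattern still lines up with the data.
prefix = re.search(rb"\x00\x00\x01", data)
print(prefix.span())
```

If a match for `\x00` spans two bytes instead of one, every following element of the pattern is compared against the wrong byte, which is the off-by-one effect described above.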

This is just a bug.

Re: Help. What are limits of RegEx?

Post by *gdpr deleted 6 »

MarkFilipak wrote: 2020-10-12, 16:39 UTC Actually, what you wrote is incorrect. It's not that RegEx can't find '\x00'. It does.
TC does not work reliably with \x00, as you found out. As I mentioned in the post I linked to, Ghisler has already said in the past that \x00 and \x0000 don't really work reliably. At this point it is moot to argue whether regex patterns with \x00 or \x0000 match something incorrectly or do not match at all, because it boils down to the same thing: patterns with \x00 or \x0000 don't really work, unfortunately. :(

Re: Help. What are limits of RegEx?

Post by *MarkFilipak »

elgonzo wrote: 2020-10-12, 16:44 UTC
MarkFilipak wrote: 2020-10-12, 16:39 UTC Actually, what you wrote is incorrect. It's not that RegEx can't find '\x00'. It does.
TC does not work reliably with \x00, as you found out. As I mentioned in the post I linked to, Ghisler has already said in the past that \x00 and \x0000 don't really work reliably. At this point it is moot to argue whether regex patterns with \x00 or \x0000 match something incorrectly or do not match at all, because it boils down to the same thing: patterns with \x00 or \x0000 don't really work, unfortunately. :(
You should not dismiss a bug as an undocumented design 'feature'. It's a bug. It should be fixed.

Re: [BUG] Help. What are limits of RegEx?

Post by *gdpr deleted 6 »

My comment about it being a limitation was a response to you exclaiming "Sham", and the only thing it is intended to be dismissive of is this exclamation of "Sham". It's not something I pulled out of my nose either; look at what Ghisler wrote here some time ago: https://www.ghisler.ch/board/viewtopic.php?p=224760#p224760

By the way, if you look around in the forum, you'll notice that several other users have stumbled over the \x00 / \x0000 issue in the past. It would indeed be nice and much better if this were fixed (in the sense that patterns with \x00 and \x0000 function properly); I am not disagreeing with you in this regard. But short of this becoming reality, TC or its help file should spell out this limitation and not let users run into and troubleshoot the issue again and again and again...

Re: [BUG] Help. What are limits of RegEx?

Post by *MarkFilipak »

elgonzo wrote: 2020-10-12, 16:58 UTC My comment about it being a limitation was a response to you exclaiming "Sham", and the only thing it is intended to be dismissive of is this exclamation of "Sham". It's not something I pulled out of my nose either; look at what Ghisler wrote here some time ago: https://www.ghisler.ch/board/viewtopic.php?p=224760#p224760
Thank you for that... kind of you.

I posted there.
tuska
Power Member
Posts: 3760
Joined: 2007-05-21, 12:17 UTC

Re: [BUG] Help. What are limits of RegEx?

Post by *tuska »

I cannot really offer any support on this issue(!), but I notice that probably nobody has yet tried
using the regex library of 'Everything' in a Total Commander search query.
Please see: Search queries in TC using 'Everything' - point 3 RegEx - Regular Expressions
'Everything' uses 'Perl Compatible Regular Expressions (PCRE)'.

Here it is stated that there was a successful query, e.g. with Notepad++.
Notepad++ regular expressions use the Boost regular expression library v1.70,
which is based on PCRE (Perl Compatible Regular Expression) syntax, only departing from it in very minor ways.


Windows 10 Pro (x64) Version 2004 (OS build 19041.546)
TC 9.51 x64/x86 | 'Everything'-Version 1.4.1.993 (x64)
☑ 'Everything' | Search queries: TC <=> 'Everything'

Re: [BUG] Help. What are limits of RegEx?

Post by *MarkFilipak »

tuska wrote: 2020-10-12, 21:16 UTC I cannot really offer any support on this issue(!), but I notice that probably nobody has yet tried
using the regex library of 'Everything' in a Total Commander search query.
Please see: Search queries in TC using 'Everything' - point 3 RegEx - Regular Expressions
'Everything' uses 'Perl Compatible Regular Expressions (PCRE)'.
I believe that's the regexp with which I'm familiar. I don't know what Everything is.
Here it is stated that there was a successful query, e.g. with Notepad++.
I didn't understand what was being discussed there. I didn't know what this: "Problem is that he need to search files that contains 00 bytes only (entire file filled with 00 bytes), but not files that contain at least one 00 byte", meant.
Notepad++ regular expressions use the Boost regular expression library v1.70,
which is based on PCRE (Perl Compatible Regular Expression) syntax, only departing from it in very minor ways.
I think that's what I've used. I don't use POSIX, ever.

I think that the problem with regular expression processing is that it's character/line oriented, so it is crippled by an architecture that can't properly address the entire range [\x00-\xFF]. I've researched a Linux method you might want to comment on:

Code:

# Implements this regexp: \x00\x00\x01\xB5(\x11.|\x12.|\x13.|\x14[^\x82]|\x15.|\x16.|\x17.|\x18.|\x19.|\x1A.|\x1B.|\x1C.|\x1D.|\x1E.|\x1F.)
# Stage 1 (xxd -p -u): converts the bytes of FILENAME into uppercase hex nibbles: 0 1 2 3 4 5 6 7 8 9 A B C D E F.
# Stage 2 (tr -d '\n'): deletes the '\n's that xxd inserts -- turns the lines of nibbles into one huge string.
# Stage 3 (grep -E -c): finds '000001B51???' where '???' is not '482' -- prints either '0' (not found) or '1' (found).
xxd -p -u FILENAME | tr -d '\n' | grep -E -c '000001B51([0-35-9A-F]|4([0-79A-F]|8[0134-9A-F]))'
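
One caveat with searching a nibble string: nothing ties a hit to a byte boundary, so '000001B51' can also be found starting in the middle of a byte. A rough Python sketch of the same idea that discards such hits might look like this. The function name is hypothetical, and the whole file is held in memory as hex text (twice its size), so it is only meant for experimenting on small samples.

```python
import re

# Same nibble pattern as the grep command above, wrapped in a lookahead so that
# overlapping candidates are all examined.
NIBBLE_PATTERN = re.compile(
    r"(?=(000001B51(?:[0-35-9A-F]|4(?:[0-79A-F]|8[0134-9A-F]))))"
)


def find_non_482(data):
    """Return byte offsets of '000001B51???' hits where '???' is not '482'."""
    hex_text = data.hex().upper()
    return [
        m.start() // 2
        for m in NIBBLE_PATTERN.finditer(hex_text)
        if m.start() % 2 == 0  # keep only hits that begin on a byte boundary
    ]


print(find_non_482(bytes.fromhex("FF000001B51582")))  # aligned hit at byte 1
print(find_non_482(bytes.fromhex("F000001B5158")))    # only a misaligned hit
```

The second sample contains the nibble sequence only at an odd nibble position, so the filter drops it, whereas a plain text search over the hex dump would count it.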

Re: [BUG] Help. What are limits of RegEx?

Post by *tuska »

MarkFilipak wrote: 2020-10-13, 02:00 UTC I don't know what Everything is.
Search queries in TC using 'Everything' wrote: As of Total Commander 9.0, the tool 'Everything' can be integrated into a search query with its own search parameters.
In my signature above there are links to this tool and a documentation to help.

Based on your RegEx examples above (works/does not work), I just wanted to give the hint
that a RegEx query would also be possible from within TC [TRegExpr] by integrating the tool 'Everything', which uses PCRE.

If my assumption concerning the solution with Notepad++ was wrong, I am very sorry.
As already mentioned above, I cannot give you professional support due to insufficient knowledge (e.g. regarding RegEx queries).

Regards,
Karl
nsp
Power Member
Posts: 1804
Joined: 2005-12-04, 08:39 UTC
Location: Lyon (FRANCE)

Re: [BUG] Help. What are limits of RegEx?

Post by *nsp »

Regexp is not meant for binary search or for signature matching. It is a character-matching library, and in most cases a single-line one. Most implementations match against a string representation with a dedicated charset/string encoding.

And in most cases, depending on the charset, Unicode, etc., \x00 does not match \x00!

This does not solve your issue, but it at least explains why using the hex expression for a string-based search is a misuse.
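
A small Python sketch of the encoding effect, assuming the engine decodes the file into characters before matching. The sample bytes and the two encodings are chosen only for illustration.

```python
import re

raw = b"\x00\x00\x01\xB5\x14\x82"

as_latin1 = raw.decode("latin-1")    # one character per byte
as_utf16 = raw.decode("utf-16-le")   # one character per pair of bytes

pattern = r"\x00\x00\x01\xB5"

found_latin1 = re.search(pattern, as_latin1) is not None
found_utf16 = re.search(pattern, as_utf16) is not None
code_points = [hex(ord(ch)) for ch in as_utf16]

print(found_latin1)
print(found_utf16)
print(code_points)
```

With a one-byte-per-character decoding the pattern lines up with the data. Once the same bytes are read as UTF-16, they collapse into three unrelated code points and the character pattern no longer has anything to match.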

Converting the binary file to a text file with a hex representation could solve your issue, since the regex would then be looking for plain numbers. With large VOB files, it is probably better to extract only the metadata first and then convert!

You could also check whether MediaInfo can give you some of the information you need from the VOB file.

Re: [BUG] Help. What are limits of RegEx?

Post by *MarkFilipak »

nsp wrote: 2020-10-13, 12:55 UTC Regexp is not meant for binary search or for signature matching. It is a character-matching library, and in most cases a single-line one. Most implementations match against a string representation with a dedicated charset/string encoding.
I think that grep is for character search and that regexp is a tool for matching patterns that include any combination of bits. I think that grep is hobbled by an original design that was unnecessarily narrow and case-specific. Respectfully, I think you may be conflating grep with the underlying regexp and attributing the shortcomings of grep to regexp, for example, limiting patterns to lines.
And in most cases, depending on the charset, Unicode, etc., \x00 does not match \x00!
I don't understand '\x00 does not match \x00' -- a negative tautology? \x00 is \x00. Are you saying that regexp cannot handle such a pattern? Since that is clearly untrue, I'm unsure what you mean, but no matter.
This does not solve your issue, but it at least explains why using the hex expression for a string-based search is a misuse.
I fail to understand your assertion that a hex-based pattern is a misuse of regexp. Clearly regexp includes hex.
Converting the binary file to a text file with a hex representation could solve your issue, since the regex would then be looking for plain numbers. With large VOB files, it is probably better to extract only the metadata first and then convert!
Can you suggest how I can extract metadata without regexp to do the search? I see no alternative.
You could also check whether MediaInfo can give you some of the information you need from the VOB file.
Do you mean the MediaInfo application? I need to parse everything in an MPEG stream, and MediaInfo's resolution is far too coarse. What I'm looking for is the answer to whether, in practice, DVDs always use the MP@ML profile & level and always have 4:2:0 samples. It seems to me that I need to search a large number of VOBs to answer such questions. Even if it answered such questions, MediaInfo doesn't support either queries or bulk search.
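
A bulk search of that kind could be sketched in Python roughly as follows. The function names are hypothetical. The pattern is a byte-oriented restatement of '00 00 01 B5 1? ??' with '4 82' excluded (it includes 0x10, as the description does), and DOTALL is set so that '.' also matches the byte 0x0A. Each file is memory-mapped rather than read in full, since VOB files are large.

```python
import mmap
import re
from pathlib import Path

# sequence_extension start code, then any '1? ??' except '14 82'.
PATTERN = re.compile(
    rb"\x00\x00\x01\xB5(?:[\x10-\x13\x15-\x1F].|\x14[^\x82])", re.DOTALL
)


def scan_vob(path):
    """Return byte offsets of sequence_extension headers other than '1 4 82'."""
    path = Path(path)
    if path.stat().st_size == 0:
        return []  # an empty file cannot be memory-mapped
    with open(path, "rb") as fh:
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return [m.start() for m in PATTERN.finditer(mm)]


def scan_tree(root):
    """Scan every *.vob file (any letter case) under root; keep files with hits."""
    hits = {}
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() == ".vob":
            offsets = scan_vob(path)
            if offsets:
                hits[str(path)] = offsets
    return hits
```

Calling `scan_tree` on a folder of ripped discs would return a dictionary that stays empty as long as every header found is '1 4 82'.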

Thanks for your comments.
milo1012
Power Member
Posts: 1158
Joined: 2012-02-02, 19:23 UTC

Re: [BUG] Help. What are limits of RegEx?

Post by *milo1012 »

MarkFilipak wrote: 2020-10-12, 15:34 UTC Of course you want to know what this is about, eh? I'm searching the binary contents of DVDs (VOB files) looking for particular 'sequence_extension' metadata.

I have yet to see a DVD that has this: 0x00 00 01 B5 1? ??, where '? ??' is other than '4 82' (i.e. MP@ML plus !progressive_sequence plus 4:2:0). (Note '<<== all DVDs?' in the table, below.) The above regex performs such a search and reports 0x00 00 01 B5 1 plus something ('? ??'), but not '4 82'.

The pattern is the 'sequence_extension' header ID metadata followed by 'profile_and_level_indication' -- the combinations are shown in the table, below -- which is followed by 'progressive_sequence' followed by 'chroma_format'.
Probably a bit off-topic, but why don't you use more or less "dedicated" tools for this? E.g. VobEdit can show/interpret the sequence extension quite clearly. Just open the first VOB file of the (main DVD) video stream and navigate to the first video pack / I-frame pack.
AFAIR DGIndex can show the basic stream information (profile@level) as well, as can probably a lot of other tools.

And back on topic: TC's limitations when it comes to splitting file content on newlines were the main reason why I started the PCREsearch plugin. Your RegEx will probably work with it.
TC plugins: PCREsearch and RegXtract