Total Commander Forum - Public Discussion and Support

Problems searching for capital Russian letters in UTF-16BE

 
Flint
Power Member


Joined: 27 Oct 2003
Posts: 2867
Location: Moscow, Russia

Posted: Sun Jun 24, 2012 10:56 am    Post subject: Problems searching for capital Russian letters in UTF-16BE

1. Unpack the two text files from the archive:
Code:
MIME-Version: 1.0
Content-Type: application/octet-stream; name="test.rar"
Content-Transfer-Encoding: base64
Content-Disposition: attachment; filename="test.rar"

UmFyIRoHADvQcwgADQAAAAAAAADJhHSAkCsAFwAAAAoAAAACoB78iB2m2EAdNQYAIAAAAGJlLnR4
dACwtvtcAREL7QfOiUDwBT+TdsDwZnrFeJRZhvBAN3SQkCsAAwAAAAoAAAAClqhRUyCm2EAdNQYA
IAAAAGxlLnR4dACwjIZHZ8OQxD17AEAHAA==

2. These files contain the Russian word "Тест" in UTF-16LE and UTF-16BE encoding, respectively. Open Find Files in TC, specify searching for text Тест, check the "Unicode" option, press "Start search".
3. Only the le.txt file is found.

If you search for ест (that is, the part of the word without the first capital letter), both le.txt and be.txt will be found.
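
For anyone who prefers not to decode the archive, equivalent test files can probably be recreated with a few lines of Python. This is only a sketch: it assumes each file is nothing but a BOM followed by the word "Тест", and `write_test_files` is a hypothetical helper name, not something from the original report.

```python
import tempfile
from pathlib import Path

WORD = "Тест"

def write_test_files(directory: Path) -> dict:
    """Write le.txt and be.txt (BOM + word) and return their raw bytes by name."""
    contents = {
        # Assumption: BOM followed directly by the word, no trailing newline.
        "le.txt": b"\xff\xfe" + WORD.encode("utf-16-le"),
        "be.txt": b"\xfe\xff" + WORD.encode("utf-16-be"),
    }
    for name, data in contents.items():
        (directory / name).write_bytes(data)
    return contents

if __name__ == "__main__":
    out_dir = Path(tempfile.mkdtemp())
    for name, data in write_test_files(out_dir).items():
        # Show the raw bytes so the two byte orders can be compared by eye.
        print(out_dir / name, data.hex(" "))
```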
_________________
Flint's Homepage: Full TC Russification Package, VirtualDisk, NTFS Links, NoClose Replacer, other stuff!
 
Using TC 8.01 / Win7 x64 SP1
MaxX
Senior Member


Joined: 23 Mar 2012
Posts: 311

Posted: Sun Jun 24, 2012 11:50 am

2Flint
Did TC ever work with unicode 16BE before?
Flint
Power Member


Joined: 27 Oct 2003
Posts: 2867
Location: Moscow, Russia

Posted: Sun Jun 24, 2012 12:42 pm

MaxX wrote:
Did TC ever work with unicode 16BE before?

At least Lister supports it fine. As for the Find Files function, I'm not sure. But even if it wasn't supposed to work with UTF-16BE files, the fact is that it already does (as long as no capital letters are present), so improving it to work with capital letters would only be logical.

BTW, I tested with English text, and the situation is the same. So there is no need to test specifically with Cyrillic text. (I was just following the original description of the bug from one of the Russian forums, and didn't think to check it on non-Russian content.)
_________________
ghisler(Author)
Site Admin


Joined: 04 Feb 2003
Posts: 24621
Location: Switzerland

Posted: Mon Jun 25, 2012 7:48 am

I will check it - Unicode BE should work when there is a BOM, and there is definitely one in the file.
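
A BOM check of the kind mentioned here could look like the following Python sketch. It is an illustration only, not Total Commander's code; `detect_utf16_bom` is a hypothetical helper name, and only the `codecs.BOM_UTF16_LE` and `codecs.BOM_UTF16_BE` constants come from the standard library.

```python
import codecs

def detect_utf16_bom(data: bytes):
    """Return (codec_name, bom_length), or (None, 0) if no UTF-16 BOM is present."""
    # Note: a UTF-32LE BOM (FF FE 00 00) also starts with FF FE; this sketch ignores that case.
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return "utf-16-le", len(codecs.BOM_UTF16_LE)
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return "utf-16-be", len(codecs.BOM_UTF16_BE)
    return None, 0

# Usage: decode a BOM-prefixed big-endian buffer with the detected codec.
be_file = codecs.BOM_UTF16_BE + "Тест".encode("utf-16-be")
codec, skip = detect_utf16_bom(be_file)
print(codec, be_file[skip:].decode(codec))
```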
_________________
Author of Total Commander
http://www.ghisler.com
white
Power Member


Joined: 19 Nov 2003
Posts: 1307
Location: Netherlands

Posted: Mon Jun 25, 2012 11:15 am    Post subject: Re: Problems searching for capital Russian letters in UTF-16

Flint wrote:
3. Only the le.txt file is found.

Confirmed. Furthermore, when I enable "Case sensitive" both files are found.
_________________
#16626 Personal licence
Flint
Power Member


Joined: 27 Oct 2003
Posts: 2867
Location: Moscow, Russia

Posted: Mon Jun 25, 2012 11:56 am

white wrote:
Furthermore, when I enable "Case sensitive" both files are found.

That's right. Sorry, I forgot to mention it.
_________________
MVV
Power Member


Joined: 03 Aug 2008
Posts: 4548
Location: Russian Federation

Posted: Mon Jun 25, 2012 12:10 pm

UTF-16BE stores the two bytes of each 16-bit code unit in reversed order compared with UTF-16LE (its BOM is byte-reversed too: FE FF for UTF-16BE versus FF FE for UTF-16LE).
It is an accident that part of the word was found, and only because of how the bytes happen to line up:

The word "Тест" in UTF-16BE:
Quote:
04 22 [04 35 04 41 04 42]


The word "Тест" in UTF-16LE:
Quote:
22 [04 35 04 41 04 42] 04


The bracketed part is common to both byte arrays (it is exactly "ест" in UTF-16BE), so it was found only because that sequence appears in the UTF-16BE file: it is a collision (the first bytes of these letters in BE are 04, and the last bytes in LE are 04 too). The whole word can't be found because its byte array doesn't appear there. Also, if the characters had different high-order bytes, TC wouldn't find the text at all. Any Unicode range has such collisions.
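
The byte overlap described in this post can be checked with a short Python sketch. It only verifies the byte arithmetic for the word "Тест"; it says nothing about how Total Commander actually searches, and the variable names are made up for the illustration.

```python
word_be = "Тест".encode("utf-16-be")   # 04 22 04 35 04 41 04 42
word_le = "Тест".encode("utf-16-le")   # 22 04 35 04 41 04 42 04
tail_be = "ест".encode("utf-16-be")    # 04 35 04 41 04 42

print(word_be.hex(" "))
print(word_le.hex(" "))

# The BE encoding of "ест" occurs inside the LE encoding of "Тест",
# shifted by one byte (an odd offset, so not on a code-unit boundary).
print(word_le.find(tail_be))   # 1

# The BE encoding of the whole word does not occur in the LE bytes.
print(word_le.find(word_be))   # -1
```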

Currently TC doesn't support UTF-16BE in the Find Files dialog (although a simple byte-order reversal might allow searching in BE, there is currently no corresponding checkbox).
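
The "simple byte order reverse" idea could be sketched in Python roughly as follows. This is a hypothetical illustration, not a description of TC: `swap_byte_pairs` and `find_in_utf16be` are invented names, and the alignment check is an extra assumption added to avoid the kind of byte collision discussed above.

```python
def swap_byte_pairs(data: bytes) -> bytes:
    """Swap the two bytes of every 16-bit unit (a trailing odd byte is dropped)."""
    even = len(data) - (len(data) % 2)
    swapped = bytearray(even)
    swapped[0::2] = data[1:even:2]
    swapped[1::2] = data[0:even:2]
    return bytes(swapped)

def find_in_utf16be(file_bytes: bytes, text: str) -> int:
    """Byte offset of `text` in big-endian data, or -1 if it is not found."""
    as_le = swap_byte_pairs(file_bytes)    # now laid out like UTF-16LE
    pattern = text.encode("utf-16-le")
    pos = as_le.find(pattern)
    # Accept only matches on a 16-bit boundary; odd offsets are byte collisions.
    while pos != -1 and pos % 2:
        pos = as_le.find(pattern, pos + 1)
    return pos

be_file = b"\xfe\xff" + "Тест".encode("utf-16-be")
print(find_in_utf16be(be_file, "Тест"))   # 2 (right after the 2-byte BOM)
print(find_in_utf16be(be_file, "ест"))    # 4
```

The search here is case-sensitive on purpose; case folding is a separate step that would have to operate on correctly ordered code units.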

However it is interesting that Lister supports UTF-16BE too.
_________________
VirtualPanel plugin: Temporary panel for TC (forum)
TOTALCMD.NET: TCFS2, NTLinks, CopyTree, AskParam, ConPaste, Sudo…
Flint
Power Member


Joined: 27 Oct 2003
Posts: 2867
Location: Moscow, Russia

Posted: Mon Jun 25, 2012 1:15 pm

MVV wrote:
it is a collision

I thought so too, so I tested with a different pair of files, both containing the English word " Test ", with a space before and after the word. In this case the whole word "Test" (without spaces) would fit into that collision of identical byte sequences. But TC still refuses to find the second file.

Or even simpler: just replace the "Тест" with "ТЕСТ" in both files. Collision remains, but TC fails to find the UTF-16BE file if you search for "ест" or "ЕСТ".
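
The claim about " Test " fitting into the collision can be verified at the byte level with a small Python sketch (again only byte arithmetic under the stated file contents, not a model of TC's search):

```python
padded_be = " Test ".encode("utf-16-be")  # 00 20 00 54 00 65 00 73 00 74 00 20
word_le = "Test".encode("utf-16-le")      # 54 00 65 00 73 00 74 00

print(padded_be.hex(" "))
print(word_le.hex(" "))

# The LE pattern for "Test" occurs in the BE data at an odd byte offset,
# so a plain byte search with the LE pattern would match this BE content.
print(padded_be.find(word_le))   # 3
```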
_________________
MVV
Power Member


Joined: 03 Aug 2008
Posts: 4548
Location: Russian Federation

Posted: Mon Jun 25, 2012 1:41 pm

Yeah, it seems that something is really wrong here. It is strange that TC finds both files with the 'match case' option and only one file without it.

BTW, for files without a BOM, TC finds both of them using the pattern "ТЕС" (the first three letters). So it seems that TC tries to read the BOM and then does something crazy.
_________________
white
Power Member


Joined: 19 Nov 2003
Posts: 1307
Location: Netherlands

Posted: Mon Jun 25, 2012 2:20 pm

MVV wrote:
It is an accident that a part of a word was found..

No, it wasn't. Try shortening the string in the test file to "е" and you will see that it does work (search for "е"). However, case-insensitive Unicode search fails on upper-case characters. Try, for example, a test file containing "Е".
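
For comparison, here is what the expected results look like in a reference search that decodes the file before comparing. This Python sketch is not TC's algorithm, and the thread does not say what the actual cause of the bug was; `contains_text` is a hypothetical helper, and it assumes the data starts with a BOM.

```python
def contains_text(file_bytes: bytes, needle: str, case_sensitive: bool = False) -> bool:
    """Search BOM-prefixed UTF-16 data for `needle`, decoding before comparing."""
    # Python's "utf-16" codec reads the BOM, picks the byte order, and strips it.
    text = file_bytes.decode("utf-16")
    if case_sensitive:
        return needle in text
    # Case folding happens on decoded characters, so byte order no longer matters.
    return needle.casefold() in text.casefold()

be_upper = b"\xfe\xff" + "Е".encode("utf-16-be")   # test file containing capital "Е"
print(contains_text(be_upper, "е"))                       # True
print(contains_text(be_upper, "е", case_sensitive=True))  # False
```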
_________________
Flint
Power Member


Joined: 27 Oct 2003
Posts: 2867
Location: Moscow, Russia

Posted: Fri Jun 29, 2012 6:52 am

Cannot reproduce the problem anymore in rc2, so seems to be fixed.
_________________
ghisler(Author)
Site Admin


Joined: 04 Feb 2003
Posts: 24621
Location: Switzerland

Posted: Fri Jun 29, 2012 9:07 am

Thanks! Btw, the problem wasn't related to Russian, it happened with "Test" in English too.
_________________
Flint
Power Member


Joined: 27 Oct 2003
Posts: 2867
Location: Moscow, Russia

Posted: Fri Jun 29, 2012 9:12 am

ghisler(Author) wrote:
Btw, the problem wasn't related to Russian, it happened with "Test" in English too.

Yes, I noticed that later too…
_________________
white
Power Member


Joined: 19 Nov 2003
Posts: 1307
Location: Netherlands

Posted: Fri Jun 29, 2012 9:55 am

Yes. Seems to be fixed in 8.01 rc2. (Tested 32-bit)
_________________
All times are GMT - 6 Hours

Impressum: This site is maintained by Ghisler Software GmbH
