| Author |
Message |
Flint Power Member


Joined: 27 Oct 2003 Posts: 2867 Location: Moscow, Russia
|
Posted: Sun Jun 24, 2012 10:56 am Post subject: Problems searching for capital Russian letters in UTF-16BE |
|
|
1. Unpack the two text files from the archive:
| Code: | MIME-Version: 1.0
Content-Type: application/octet-stream; name="test.rar"
Content-Transfer-Encoding: base64
Content-Disposition: attachment; filename="test.rar"
UmFyIRoHADvQcwgADQAAAAAAAADJhHSAkCsAFwAAAAoAAAACoB78iB2m2EAdNQYAIAAAAGJlLnR4
dACwtvtcAREL7QfOiUDwBT+TdsDwZnrFeJRZhvBAN3SQkCsAAwAAAAoAAAAClqhRUyCm2EAdNQYA
IAAAAGxlLnR4dACwjIZHZ8OQxD17AEAHAA== |
2. These files contain the Russian word "Тест" in UTF-16LE and UTF-16BE encoding, respectively. Open Find Files in TC, specify searching for text Тест, check the "Unicode" option, press "Start search".
3. Only the le.txt file is found.
If you search for ест (that is, the part of the word without the first capital letter), both le.txt and be.txt will be found. _________________ Flint's Homepage: Full TC Russification Package, VirtualDisk, NTFS Links, NoClose Replacer, other stuff!
Using TC 8.01 / Win7 x64 SP1 |
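For anyone who prefers not to unpack the archive, roughly equivalent test files can be generated with a short script. This is only a sketch: it assumes each file is nothing more than a byte order mark followed by the word "Тест" (the archived files were not compared byte for byte), and `write_test_files` is a name made up for illustration.

```python
import codecs
import tempfile
from pathlib import Path

WORD = "Тест"


def write_test_files(directory):
    """Write le.txt and be.txt (BOM followed by WORD) and return both paths."""
    directory = Path(directory)
    le_path = directory / "le.txt"
    be_path = directory / "be.txt"
    # The explicit "-le"/"-be" codecs do not emit a BOM, so it is prepended by hand.
    le_path.write_bytes(codecs.BOM_UTF16_LE + WORD.encode("utf-16-le"))
    be_path.write_bytes(codecs.BOM_UTF16_BE + WORD.encode("utf-16-be"))
    return le_path, be_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for path in write_test_files(tmp):
            # Show each file name with a hex dump of its contents.
            print(path.name, path.read_bytes().hex(" "))
```

Point Find Files at the directory holding the two files and repeat steps 2 and 3.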
 |
MaxX Senior Member


Joined: 23 Mar 2012 Posts: 311
|
Posted: Sun Jun 24, 2012 11:50 am Post subject: |
|
|
2Flint
Did TC ever work with unicode 16BE before? |
 |
Flint Power Member


Joined: 27 Oct 2003 Posts: 2867 Location: Moscow, Russia
|
Posted: Sun Jun 24, 2012 12:42 pm Post subject: |
|
|
| MaxX wrote: | | Did TC ever work with unicode 16BE before? |
At least Lister supports it fine. As for the Find Files function, I'm not sure. But even if it wasn't supposed to work with UTF-16BE files, the fact is, it already does (unless no capital letters are present), so improving it to work with capital letters would only sound logical.
BTW, I tested with English text, and the situation is the same, so there is no need to test specifically with Cyrillic text. (I was just following the original bug description from one of the Russian forums and didn't think to check it on non-Russian content.) |
 |
ghisler(Author) Site Admin


Joined: 04 Feb 2003 Posts: 24621 Location: Switzerland
|
Posted: Mon Jun 25, 2012 7:48 am Post subject: |
|
|
I will check it - Unicode BE should work when there is a BOM, and there is definitely one in the file. _________________ Author of Total Commander
http://www.ghisler.com |
 |
white Power Member


Joined: 19 Nov 2003 Posts: 1307 Location: Netherlands
|
Posted: Mon Jun 25, 2012 11:15 am Post subject: Re: Problems searching for capital Russian letters in UTF-16 |
|
|
| Flint wrote: | 3. Only the le.txt file is found.
|
Confirmed. Furthermore, when I enable "Case sensitive" both files are found. _________________ #16626 Personal licence |
 |
Flint Power Member


Joined: 27 Oct 2003 Posts: 2867 Location: Moscow, Russia
|
Posted: Mon Jun 25, 2012 11:56 am Post subject: |
|
|
| white wrote: | | Furthermore, when I enable "Case sensitive" both files are found. |
That's right. Sorry, I forgot to mention it. |
 |
MVV Power Member


Joined: 03 Aug 2008 Posts: 4548 Location: Russian Federation
|
Posted: Mon Jun 25, 2012 12:10 pm Post subject: |
|
|
UTF-16BE stores the two bytes of each 16-bit word in reversed order compared with UTF-16LE (its BOM is byte-reversed as well: FE FF for UTF-16BE versus FF FE for UTF-16LE).
It is an accident that a part of the word was found, and only because of how the bytes interleave:
The word "Тест" in UTF-16BE (BOM omitted): 04 22 04 35 04 41 04 42
The word "Тест" in UTF-16LE (BOM omitted): 22 04 35 04 41 04 42 04
The two byte arrays share the run 04 35 04 41 04 42 (exactly "ест" in UTF-16BE), so it was found only because that sequence appears in the UTF-16BE file. It is a collision (the first bytes of these letters in BE are 04, and the last bytes in LE are 04 too). The whole word can't be found because its byte array doesn't appear there. Also, when the characters have different high-order bytes, TC won't find the text at all. Any Unicode range has such collisions.
Currently TC doesn't support UTF-16BE in the Find Files dialog (a simple byte-order reversal might allow searching in BE, but at the moment there is no corresponding checkbox).
However, it is interesting that Lister does support UTF-16BE. _________________ VirtualPanel plugin: Temporary panel for TC (forum)
TOTALCMD.NET: TCFS2, NTLinks, CopyTree, AskParam, ConPaste, Sudo… |
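The byte layout described in this post can be checked with a few lines of Python. This is a sketch for inspecting the encodings only, not a model of how TC searches; the variable names are illustrative, and the BOM is left out of both byte strings.

```python
WORD = "Тест"
TAIL = "ест"

be = WORD.encode("utf-16-be")          # bytes of the word, big-endian, no BOM
le = WORD.encode("utf-16-le")          # bytes of the word, little-endian, no BOM
shared = TAIL.encode("utf-16-be")      # "ест" in UTF-16BE
needle_le = TAIL.encode("utf-16-le")   # "ест" as a little-endian search pattern

print("BE:", be.hex(" "))
print("LE:", le.hex(" "))
# The shared run sits at an even offset in the BE bytes and an odd one in the LE bytes.
print("shared run:", shared.hex(" "),
      "| BE offset", be.find(shared), "| LE offset", le.find(shared))
# Does the little-endian pattern for "ест" occur in the big-endian bytes of the bare word?
print("LE needle in BE bytes:", needle_le in be)
```

For the bare word the last check comes out false, because the pattern would need one more 04 byte after the final 42. Whether a raw byte search hits therefore depends on what follows the word in the file.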
 |
Flint Power Member


Joined: 27 Oct 2003 Posts: 2867 Location: Moscow, Russia
|
Posted: Mon Jun 25, 2012 1:15 pm Post subject: |
|
|
| MVV wrote: | | it is a collision |
I thought so too, so I tested with a different pair of files, both containing the English word " Test ", with a space before and after the word. In this case the whole word "Test" (without spaces) fits into that collision of identical byte sequences. But TC still refuses to find the second file.
Or even simpler: just replace "Тест" with "ТЕСТ" in both files. The collision remains, but TC fails to find the UTF-16BE file if you search for "ест" or "ЕСТ". |
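The " Test " case can be reproduced at the byte level. The sketch below compares a raw byte search with one restricted to even offsets, i.e. 16-bit code unit boundaries. That TC compares code units rather than raw bytes is only one possible reading of the result above, not something established in this thread, and `find_code_units` is a hypothetical helper.

```python
def find_code_units(haystack, needle):
    """Return the first match at an even byte offset, or -1 if there is none."""
    start = 0
    while True:
        pos = haystack.find(needle, start)
        if pos < 0 or pos % 2 == 0:
            return pos
        # Odd offset: the match straddles code unit boundaries, keep looking.
        start = pos + 1


be_file = " Test ".encode("utf-16-be")   # file contents without the BOM
needle_le = "Test".encode("utf-16-le")   # search pattern as little-endian bytes

raw = be_file.find(needle_le)            # plain byte search
aligned = find_code_units(be_file, needle_le)
print("raw offset:", raw, "| aligned offset:", aligned)
```

Here the raw search hits at an odd offset, while the aligned search finds nothing. So a search that respects code unit boundaries would not see this collision at all.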
 |
MVV Power Member


Joined: 03 Aug 2008 Posts: 4548 Location: Russian Federation
|
 |
white Power Member


Joined: 19 Nov 2003 Posts: 1307 Location: Netherlands
|
Posted: Mon Jun 25, 2012 2:20 pm Post subject: |
|
|
| MVV wrote: | It is an accident that a part of a word was found..
|
No, it wasn't. Try shortening the string in the test file to "е" and you will see that it does work (search for "е"). However, case-insensitive Unicode search fails on uppercase characters. Try, for example, a test file containing "Е". |
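For comparison, here is a minimal sketch of a BOM-aware, case-insensitive search that treats both byte orders the same way: pick the byte order from the BOM, decode to text, and only then fold case. This is not TC's implementation, which is not shown anywhere in this thread; `decode_utf16` and `contains_text` are hypothetical names, and falling back to little-endian when no BOM is present is an assumption.

```python
import codecs


def decode_utf16(data):
    """Decode UTF-16 bytes, choosing the byte order from the BOM."""
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    # Assumption: without a BOM the data is treated as little-endian.
    return data.decode("utf-16-le")


def contains_text(data, text, case_sensitive=False):
    """Report whether text occurs in the decoded data."""
    content = decode_utf16(data)
    if case_sensitive:
        return text in content
    # Fold case after decoding, so the byte order cannot affect the comparison.
    return text.casefold() in content.casefold()


le_file = codecs.BOM_UTF16_LE + "Тест".encode("utf-16-le")
be_file = codecs.BOM_UTF16_BE + "Тест".encode("utf-16-be")
for needle in ("Тест", "тест", "ЕСТ"):
    print(needle, contains_text(le_file, needle), contains_text(be_file, needle))
```

Because the case folding happens on decoded text, uppercase and lowercase needles behave identically for le.txt and be.txt.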
 |
Flint Power Member


Joined: 27 Oct 2003 Posts: 2867 Location: Moscow, Russia
|
Posted: Fri Jun 29, 2012 6:52 am Post subject: |
|
|
I cannot reproduce the problem in rc2 anymore, so it seems to be fixed. |
 |
ghisler(Author) Site Admin


Joined: 04 Feb 2003 Posts: 24621 Location: Switzerland
|
Posted: Fri Jun 29, 2012 9:07 am Post subject: |
|
|
Thanks! Btw, the problem wasn't related to Russian; it happened with "Test" in English too. |
 |
Flint Power Member


Joined: 27 Oct 2003 Posts: 2867 Location: Moscow, Russia
|
Posted: Fri Jun 29, 2012 9:12 am Post subject: |
|
|
| ghisler(Author) wrote: | | Btw, the problem wasn't related to Russian, it happened with "Test" in English too. |
Yes, I noticed that later too… |
 |
white Power Member


Joined: 19 Nov 2003 Posts: 1307 Location: Netherlands
|
Posted: Fri Jun 29, 2012 9:55 am Post subject: |
|
|
Yes. It seems to be fixed in 8.01 rc2. (Tested 32-bit.) |
 |
|