Fixed - Search text in Find Files / Lister: bug with non-ANSI characters

The behaviour described in the bug report is either by design, or would be far too complex/time-consuming to be changed

Moderators: white, Hacker, petermad, Stefan2

Slavic
Senior Member
Posts: 290
Joined: 2006-02-26, 15:41 UTC
Location: Montenegro

Fixed - Search text in Find Files / Lister: bug with non-ANSI characters

Post by *Slavic »

(Apparently this is not a new bug: it also occurs in 9.51, slightly differently. This report is about the current 10.00b2.)

Symptoms
In the Find Files dialog we can enter any characters to search for, including characters outside the ANSI charset (8-bit encoding). That is not a problem if we search for a Unicode UTF-16 or UTF-8 string in our files (texts). However, we can also select the ANSI (Windows) search, intentionally or not, since the option is remembered in the UI. If our characters cannot be converted from Unicode to ANSI, Find Files may report wrong occurrences, seemingly at random. An attempt to find these characters in Lister may fail or give a wrong result.

Reproduction
Windows: Western codepage based on Latin alphabet (en-US, en-GB, de-DE etc)
Test file: HISTORY.TXT
Test texts: αβγδ абвг (the first 4 characters of the Greek and Cyrillic alphabets; they cannot be converted to ANSI CP 437, 850 or 1252)
Find files: TC directory, Search for: *.txt, Find text: the 4 characters above or fewer, [v] ANSI charset
Lister: it automatically takes our text as the search pattern, so just use F3 or Ctrl+F; if the file was not found by Find Files, open it in Lister manually

Results
αβγδ - FF nothing found, Lister shows 2 cases of ???? (there are two occurrences of 4 question marks in this file)
αβ - FF nothing found, Lister shows all occurrences of ??
α - FF found HISTORY.TXT, Lister shows all occurrences of ?
абвг - FF found HISTORY.TXT, Lister nothing found
аб - FF found HISTORY.TXT, Lister nothing found
а - FF found HISTORY.TXT, Lister nothing found

Expected results
all cases - FF nothing found, Lister nothing found

Possible reason
For a correct search, the pattern (the characters we enter) and the text must be in the same encoding. If the encodings differ, one has to be converted to the other. While I don't know exactly how TC works here, I suppose that our pattern is converted with the default OS method, Unicode to ANSI, because we enter Unicode characters in the Find text field. If characters cannot be converted because they are not present in the ANSI table, the conversion function should either return an error or substitute "?" for the missing characters, as Windows often does. Apparently the latter happens in TC, and FF or Lister then use these question marks as the pattern. This works slightly differently for Greek and Cyrillic symbols.
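
The suspected mechanism can be reproduced outside TC. The following is a minimal sketch in Python, not TC's actual code: it assumes the pattern is converted with "?" substitution (Python's errors="replace" stands in for the Windows default-character behaviour) and that the file is searched as raw bytes. The sample text and the name lossy_find are made up for illustration.

```python
def lossy_find(pattern: str, data: bytes, codepage: str) -> int:
    """Search data for pattern after a lossy Unicode-to-ANSI conversion.

    Characters missing from the codepage become b"?", which mimics the
    suspected behaviour. Returns the byte offset of the first match, or -1.
    """
    needle = pattern.encode(codepage, errors="replace")
    return data.find(needle)


# Hypothetical file content in cp1252 that happens to contain "????".
sample = "What is this???? No idea.".encode("cp1252")

# The Greek pattern degrades to four question marks...
greek_needle = "αβγδ".encode("cp1252", errors="replace")
# ...and therefore "matches" the literal question marks in the text.
offset = lossy_find("αβγδ", sample, "cp1252")
print(greek_needle, offset)
```

With a strict conversion instead of a lossy one, the Greek pattern would be rejected before any search takes place, which is the expected result described above.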

How to fix
The best solution would be to inform the user that the text entered in the FF or Lister dialog cannot be properly converted to ANSI, so either the text should be changed or the option [v] ANSI charset should be unchecked, because the result will be incorrect. It's also possible not to report non-convertible characters, but in that case the search with this option must be cancelled internally. In either case, the conversion function should be able to recognise non-convertible characters and return an error flag; according to its description, Win32 WideCharToMultiByte() should be able to do this.
Last edited by Slavic on 2021-04-30, 15:47 UTC, edited 1 time in total.
Desktop: Windows 11 Pro 23H2, TC 11.03(RC). Mobile: Pixel 5a, Android 14, TC 3.42b5
ghisler(Author)
Site Admin
Posts: 48012
Joined: 2003-02-04, 09:46 UTC
Location: Switzerland

Re: Search text in Find Files / Lister: bug with non-ANSI characters

Post by *ghisler(Author) »

I will consider it, but I'm sure to get complaints from users who have both Ansi and Unicode checked when searching. :(
Author of Total Commander
https://www.ghisler.com
Usher
Power Member
Posts: 1675
Joined: 2011-03-11, 10:11 UTC

Re: Search text in Find Files / Lister: bug with non-ANSI characters

Post by *Usher »

2ghisler(Author)
A warning would be better, something like this:
"This text contains characters from different codepages. ANSI search for this text won't work properly."

2Slavic
Could you try the following method, please:
1. Search for "αβγδ" with "ANSI charset" checked.
2. Search in results for "абвг".
Andrzej P. Wozniak
Polish subforum moderator
Slavic
Senior Member
Posts: 290
Joined: 2006-02-26, 15:41 UTC
Location: Montenegro

Re: Search text in Find Files / Lister: bug with non-ANSI characters

Post by *Slavic »

Usher wrote: 2021-03-25, 18:29 UTC 1. Search for "αβγδ" with "ANSI charset" checked.
2. Search in results for "абвг".
Sorry, search in this order does not work (αβγδ is not found in Find Files, so nothing to search in Lister).
But inverse operation does work: абвг with file mask *.* is found in number of files, including HISTORY.TXT, and then we can open it in Lister from the list. In Lister, switch to Options > Text only. While a search for абвг gives no result, if we enter αβγδ in the Lister search dialog, it shows two occurrences of ???? as above.

The only difference between 10.00b2 and 10.00b3 is that Lister has to be switched to Text only mode, because Beta 3 automatically detects HISTORY.TXT as UTF-8, in which case the wrong search result does not occur. (I have written a separate report about the wrong detection of the text encoding as UTF-8.)
Desktop: Windows 11 Pro 23H2, TC 11.03(RC). Mobile: Pixel 5a, Android 14, TC 3.42b5
Usher
Power Member
Posts: 1675
Joined: 2011-03-11, 10:11 UTC

Re: Search text in Find Files / Lister: bug with non-ANSI characters

Post by *Usher »

Slavic wrote: 2021-03-25, 20:30 UTC
Usher wrote: 2021-03-25, 18:29 UTC 1. Search for "αβγδ" with "ANSI charset" checked.
2. Search in results for "абвг".
Sorry, search in this order does not work (αβγδ is not found in Find Files, so nothing to search in Lister).
No sorry. It's proper result for "αβγδ".
Slavic wrote: 2021-03-25, 20:30 UTC But inverse operation does work: абвг with file mask *.* is found in number of files, including HISTORY.TXT,
And this is still wrong.
Andrzej P. Wozniak
Polish subforum moderator
Slavic
Senior Member
Posts: 290
Joined: 2006-02-26, 15:41 UTC
Location: Montenegro

Re: Search text in Find Files / Lister: bug with non-ANSI characters

Post by *Slavic »

Usher wrote: 2021-03-25, 22:34 UTC No sorry. It's proper result for "αβγδ".
(...)
And this is still wrong.
Sorry again, but can you notice that these results have already been reported in my first post? So, your remarks added nothing new to them. Could you please try to do something essentially helpful? Any of:
  • Attempt to reproduce the results on other PC config / OS / Language and country settings / other TC options etc
  • Confirm that the results are the same (bug happens), or cannot be reproduced (all works as intended)
  • Extend the test case, add some new data, compare the new results with existing ones
  • Make another assumption about the reason of problem or suggest possible way(s) how to fix the bug
Desktop: Windows 11 Pro 23H2, TC 11.03(RC). Mobile: Pixel 5a, Android 14, TC 3.42b5
Usher
Power Member
Posts: 1675
Joined: 2011-03-11, 10:11 UTC

Re: Search text in Find Files / Lister: bug with non-ANSI characters

Post by *Usher »

2Slavic
I only wanted to suggest a different approach from yours, and it partially failed. However, it may still be a step in the right direction.

Maybe you misunderstood my laconic comments. You don't need to be sorry, you need to be more patient, please.
I trust your findings, and I can confirm some of them for TC in XP (hence my suggestion), but I don't have enough time to run more tests, describe the results and make other suggestions in a single message.
Do you think everybody works the same way you do?
Andrzej P. Wozniak
Polish subforum moderator
Usher
Power Member
Posts: 1675
Joined: 2011-03-11, 10:11 UTC

Re: Search text in Find Files / Lister: bug with non-ANSI characters

Post by *Usher »

Well, it seems to be a case similar to the one in another topic which you started:
viewtopic.php?f=35&t=74205#p399092

Lister's behaviour depends on the UnwrapWidth value. If you set UnwrapWidth >=1738, then Lister will find "????" as a Russian text in history.txt.
You don't need to test Greek any more, or combine Greek with Russian, or select many charsets. Just test a search for Russian text in ANSI mode.

I have the windows-1250 (Polish) codepage set in Windows; the language in TC seems to be irrelevant. I ran the tests in my current TC instance and in TC with fresh.ini.
  • I searched for "абвг" in 2 texts: history.txt and a Russian text in ANSI (windows-1251) containing "????".
    Search always finds both files, and Lister always finds "????" in the Russian txt, but for history.txt it depends on the UnwrapWidth value.
  • Search for "αβγδ" gives more or less proper results:
    1. Finds nothing for search in non-Cyrillic ANSI.
    2. Finds proper string in wincmd.ini (UTF-16) and fresh.ini (UTF-16) for search in UTF-16.
    3. Finds proper string (though strange-looking) in fresh.ini.bak (backup of the initial fresh.ini before conversion from windows-1250 to UTF-16) for search in UTF-8.
    UnwrapWidth for Lister in this case is irrelevant, codepage in Lister set to Autodetect.
    1L. Nothing to view.
    2L. Finds proper string in UTF-16 files.
    3L. Finds nothing in Autodetect mode (ANSI), finds proper string when switched to UTF-8.
As you can see, it's a problem with autodetection for Russian codepage (windows-1251). It seems to be very old. It means that some autodetection heuristics are wrong and should be changed.

There is a similar problem with autodetection of the Polish codepages: ANSI (windows-1250) and OEM/DOS (Latin CP-852). Autodetect almost always gives swapped results: text written in DOS is displayed as Windows text and vice versa. It's annoying. You can test it yourself with the attached files; the names are self-explanatory.

Code:

MIME-Version: 1.0
Content-Type: application/octet-stream; name="ogonki.cp852.txt"
Content-Transfer-Encoding: base64
Content-Disposition: attachment; filename="ogonki.cp852.txt"

RE9TIExhdGluMiBDUDg1Mg0KjSCoII8gICC9IKQglyDjIOAgnQ0KqyCpIIYgICC+IKUgmCDkIKIg
iA0K

Code:

MIME-Version: 1.0
Content-Type: application/octet-stream; name="ogonki.cp1250.txt"
Content-Transfer-Encoding: base64
Content-Disposition: attachment; filename="ogonki.cp1250.txt"

V2luZG93cy0xMjUwIA0KjyDKIMYgICCvIKUgjCDRINMgow0KnyDqIOYgICC/ILkgnCDxIPMgsw0K

The second and third lines of each file contain only Polish characters, which in both cases should be displayed as follows:

Code:

Ź Ę Ć   Ż Ą Ś Ń Ó Ł
ź ę ć   ż ą ś ń ó ł
I added spaces in source files to avoid antispam forum protection.
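
The two attachments can also be checked without saving them. The following Python sketch is only an illustration and has nothing to do with how Lister detects codepages: it decodes the base64 payloads quoted above once with the intended codepage and once with the swapped one. The function and variable names are arbitrary, and errors="replace" is used because some CP852 bytes have no meaning in windows-1250.

```python
import base64

# Base64 payloads copied from the two attachments above.
CP852_B64 = (
    "RE9TIExhdGluMiBDUDg1Mg0KjSCoII8gICC9IKQglyDjIOAgnQ0KqyCpIIYgICC+IKUgmCDkIKIg"
    "iA0K"
)
CP1250_B64 = (
    "V2luZG93cy0xMjUwIA0KjyDKIMYgICCvIKUgjCDRINMgow0KnyDqIOYgICC/ILkgnCDxIPMgsw0K"
)


def decode_lines(payload_b64: str, codepage: str) -> list:
    """Decode a base64 payload and interpret the bytes in the given codepage."""
    raw = base64.b64decode(payload_b64)
    return raw.decode(codepage, errors="replace").splitlines()


dos_ok = decode_lines(CP852_B64, "cp852")        # DOS file read as DOS
win_ok = decode_lines(CP1250_B64, "cp1250")      # Windows file read as Windows
dos_swapped = decode_lines(CP852_B64, "cp1250")  # DOS file read as Windows
win_swapped = decode_lines(CP1250_B64, "cp852")  # Windows file read as DOS

for label, lines in (("cp852 as cp852", dos_ok), ("cp1250 as cp1250", win_ok),
                     ("cp852 as cp1250", dos_swapped), ("cp1250 as cp852", win_swapped)):
    print(label)
    for line in lines:
        print("   ", line)
```

Only the first two variants should show the Polish letters listed above; the swapped variants show what a wrong autodetection result looks like.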
Andrzej P. Wozniak
Polish subforum moderator
Slavic
Senior Member
Posts: 290
Joined: 2006-02-26, 15:41 UTC
Location: Montenegro

Re: Search text in Find Files / Lister: bug with non-ANSI characters

Post by *Slavic »

Find Files is fixed
I confirm that in TC Beta 4 the problem with non-convertible characters in Find Files is solved. The added dialog informs the user that the search string includes Unicode characters which cannot be converted to ANSI or ASCII. Thanks!

Lister is not fixed
But the problem in Lister (not exactly the same, but very similar) is not solved yet. I looked at it more closely and performed some tests with Win32 (I understand the Win32 API and the basics of C to some extent, but sorry, I am not at all proficient in Pascal). As above, two strings of Greek and Cyrillic characters were used for the tests.

Symptoms
Unlike Find Files, where the conversion from Unicode to ANSI uses the current system codepage (in Windows 8.1: Control Panel > Language > Region > Administrative > Change system locale...), Lister uses the specific codepage chosen by the user in the Encoding menu; the last selected encoding is saved in wincmd.ini > [Lister] > Codepage=. When we perform a string search in Find Files, the OS codepage and the Lister codepage may or may not be the same and, as a result, a string with non-convertible characters may be found in FF and not found in Lister, or vice versa.

Tests
For example, take our test file HISTORY.TXT and enter "αβγδ" in Find Files. FF now shows the warning, as intended. Then open this file in Lister and enter the same string in the Ctrl+F search (it should already be there), with Text only and ANSI set in Options. If we select the Greek (1253) codepage in Encoding, the string is not found, as expected. The same happens for codepages 1250 and 1252. But if we change the codepage to Cyrillic (1251) or Baltic (1257), Lister finds the substring "????" in two locations.

If we enter in Find Files the string "äö.do.not.remove", which exists in HISTORY.TXT, FF shows this file, at least with the Western Europe (1252) OS codepage. Lister finds the string too with Western (1252), Eastern (1250) or Baltic (1257). But if we select Cyrillic (1251) or Greek (1253), the string is not found. If we shorten the string to "äö", then for 1252, 1250 and 1257 the results are the same: one real occurrence. For 1251 and 1253 it shows two occurrences, "aO" and "AO", but for Hebrew (1255) or Arabic (1256) it finds all "??" substrings.

This wrong search is not easy to fix. It mostly happens by design, because the user is allowed to change the codepage in Lister. That option is very helpful and should be left as is. Still, some improvements can be made.

Reason
Why does Lister show occurrences of question marks? It's a result of the Win32 API replacing non-convertible characters in the string with '?'.

More precisely, the conversion function WideCharToMultiByte has two relevant parameters: lpDefaultChar and lpUsedDefaultChar. lpDefaultChar points to the character substituted when a character in the string cannot be converted; it may be NULL (and must be NULL for UTF-8), in which case the system uses its default replacement character, which is normally '?'. lpUsedDefaultChar points to a flag that the function sets; the pointer itself may be NULL if the caller does not need the flag. The flag is set to TRUE if the string contains characters that cannot be converted to the selected codepage, and to FALSE if all is OK. I tested it with L"αβγδ" as lpWideCharStr and CodePage set to 1251 and 1253. The flag was TRUE in the Cyrillic 1251 case, with question marks in the returned lpMultiByteStr, and FALSE in the Greek 1253 case, where the returned ANSI characters were the Greek ones, as expected.
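
The same pair of outputs (converted bytes plus a "default character was used" flag) can be imitated in Python for a quick cross-check. This is only an analogue under stated assumptions, not the Win32 call: the helper name to_ansi is hypothetical, and Python's codecs do not perform the Windows "best fit" mapping, so a substitution such as "ä" to "a" will not show up here.

```python
def to_ansi(text: str, codepage: str):
    """Return (converted_bytes, used_default).

    used_default plays the role of the flag behind lpUsedDefaultChar:
    True when at least one character had to be replaced by b"?".
    """
    try:
        return text.encode(codepage), False
    except UnicodeEncodeError:
        return text.encode(codepage, errors="replace"), True


# Compare the Greek test string across several Windows codepages.
table = {cp: to_ansi("αβγδ", cp)
         for cp in ("cp1250", "cp1251", "cp1252", "cp1253", "cp1257")}
for cp, (converted, used_default) in table.items():
    print(cp, converted, used_default)
```

Only the Greek codepage converts the string without setting the flag, which agrees with the 1251 and 1253 results reported above.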

Possible solution
Lister could check the lpUsedDefaultChar flag and show a warning similar to the one in Find Files (or a slightly more specific one, e.g. that the search cannot be correct for the codepage selected in Encoding, so that the user sees this and can change the CP to a more appropriate one). This would prevent the situation where non-convertible characters are "converted" to question marks and these marks are then found instead of the real characters. If the selected codepage is suitable and the conversion succeeds, the search should work as it does now.
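
A rough sketch of such a guarded search, again in Python and purely illustrative: it assumes the viewer searches raw bytes in the codepage selected in Encoding, and the names PatternNotConvertible and find_in_codepage are invented for this example. A strict conversion takes the place of checking the flag.

```python
class PatternNotConvertible(Exception):
    """Raised when the search pattern cannot be represented in the codepage."""


def find_in_codepage(pattern: str, data: bytes, codepage: str) -> int:
    """Search data for pattern, refusing patterns that would be mangled.

    Returns the byte offset of the first match, or -1 if there is none.
    """
    try:
        needle = pattern.encode(codepage)  # strict: no "?" substitution
    except UnicodeEncodeError as exc:
        bad = pattern[exc.start:exc.end]
        raise PatternNotConvertible(
            f"{bad!r} cannot be converted to {codepage}; "
            "select a different encoding"
        ) from exc
    return data.find(needle)


greek_text = "xxαβγδ".encode("cp1253")
print(find_in_codepage("αβγδ", greek_text, "cp1253"))  # real match in Greek text

try:
    find_in_codepage("αβγδ", b"abc????", "cp1251")
except PatternNotConvertible as warning:
    print("warning:", warning)  # shown instead of matching the question marks
```

The caller decides what to do with the exception: show a dialog, as Find Files now does, or silently report "not found".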
Desktop: Windows 11 Pro 23H2, TC 11.03(RC). Mobile: Pixel 5a, Android 14, TC 3.42b5
ghisler(Author)
Site Admin
Posts: 48012
Joined: 2003-02-04, 09:46 UTC
Location: Switzerland

Re: Search text in Find Files / Lister: bug with non-ANSI characters

Post by *ghisler(Author) »

I prefer not to show the warning in Lister. Why? You will see immediately that it found something different. And if it didn't find anything, it wouldn't have found anything anyway, because the characters are impossible to find. In the main search function you do not see that the wrong string was found, so it's a good idea to warn the user there.
Author of Total Commander
https://www.ghisler.com
Slavic
Senior Member
Posts: 290
Joined: 2006-02-26, 15:41 UTC
Location: Montenegro

Re: Search text in Find Files / Lister: bug with non-ANSI characters

Post by *Slavic »

I understand this logic and may agree, but the problem is that the user should understand that search results in Lister depend on the selected encoding. Would it be reasonable to add this to the help file? Otherwise a user may decide that it's a bug in Lister, when in fact it's by design.

An example: we can select the Cyrillic (1251) encoding in Lister and forget about it. When we view Western ANSI texts, for example in English, the encoding doesn't matter. And when we look at Russian texts, all is well too. Then we decide to find the string "äö", which exists somewhere in the text files that came with TC. We enter the Unicode characters in Find Files and it gives us HISTORY.TXT, but Lister (F3) shows incorrect substrings and doesn't show the correct result. Had we selected a suitable encoding like Western Europe (1252), the problem would not have happened.
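
The effect of a forgotten encoding can be shown with a few lines of Python. This is a simplified model, not Lister's code: it assumes the file bytes are Western (1252) and that the viewer both displays the text and converts the search pattern using the selected codepage. Python's codecs have no Windows "best fit" mapping, so the "aO"/"AO" matches mentioned earlier are not reproduced.

```python
# The marker string as it would be stored in a Western (cp1252) text file.
stored = "äö.do.not.remove".encode("cp1252")

results = {}
for cp in ("cp1252", "cp1251"):
    shown = stored.decode(cp)            # what the viewer would display
    try:
        hit = stored.find("äö".encode(cp))  # offset of the pattern, if encodable
    except UnicodeEncodeError:
        hit = None                       # pattern does not exist in this codepage
    results[cp] = (shown, hit)
    print(cp, shown, hit)
```

With the Cyrillic codepage the same bytes are displayed as Cyrillic letters and the pattern cannot even be encoded, so the real occurrence can never be found.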

Suggestion: could we force Lister to use the same encoding as Find Files, only for ANSI/ASCII search and only when Lister is called from the FF result window? For UTF-8 and other Unicode searches the codepage doesn't matter, so nothing needs to change. Yes, it's not a perfect solution and may have some pros and cons. But at least a user who forgot to set the correct codepage in the Encoding menu will see correct results immediately.
Desktop: Windows 11 Pro 23H2, TC 11.03(RC). Mobile: Pixel 5a, Android 14, TC 3.42b5
ghisler(Author)
Site Admin
Posts: 48012
Joined: 2003-02-04, 09:46 UTC
Location: Switzerland

Re: Search text in Find Files / Lister: bug with non-ANSI characters

Post by *ghisler(Author) »

By default, Lister uses the same Ansi encoding as the system (and search), but it can be changed via the Encoding menu.
Author of Total Commander
https://www.ghisler.com
Slavic
Senior Member
Posts: 290
Joined: 2006-02-26, 15:41 UTC
Location: Montenegro

Re: Search text in Find Files / Lister: bug with non-ANSI characters

Post by *Slavic »

ghisler(Author) wrote: 2021-04-09, 09:28 UTC By default, Lister uses the same Ansi encoding as the system (and search), but it can be changed via the Encoding menu.
This is exactly what I tried to illustrate above. A user may set an arbitrary encoding in Lister, different from the OS default; it is then stored in the configuration file and kept there indefinitely, and the user may forget about it (or not, if the user is already educated). So the outcome of a search will depend on the user's proficiency.

Well, let this bug report be closed as resolved. Anyway, it was very helpful, both the FF fix and this discussion.
Desktop: Windows 11 Pro 23H2, TC 11.03(RC). Mobile: Pixel 5a, Android 14, TC 3.42b5