the search returns unmatched characters

ter · Post by *ter » 2024-07-22, 16:45 UTC

1. search for "MZ└" (4D 5A C0) in a text file in Lister
2. it triggers on text that contains "mzL" (6D 7A 4C)

MIME-Version: 1.0
Content-Type: application/octet-stream; name="mzl.txt"
Content-Transfer-Encoding: base64
Content-Disposition: attachment; filename="mzl.txt"

TVrAIHNlYXJjaCBtZQ0KLg0KDQouDQptekwgbm8=
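
For anyone who wants to inspect the attachment without saving it, here is a minimal Python sketch that decodes the base64 body above and shows how its first three bytes render under a few code pages. The list of code pages is only an illustrative choice, and the codec names (`cp850`, `cp866`, `cp1250`, `cp1251`) are Python's, not anything taken from Total Commander.

```python
import base64

# Base64 body of the "mzl.txt" attachment from the post above.
ATTACHMENT_B64 = "TVrAIHNlYXJjaCBtZQ0KLg0KDQouDQptekwgbm8="

data = base64.b64decode(ATTACHMENT_B64)

# The first three bytes are the ones being searched for (4D 5A C0).
print(data[:3].hex())

# How byte 0xC0 is displayed depends on the code page used for viewing.
for code_page in ("cp850", "cp866", "cp1250", "cp1251"):
    print(code_page, repr(data[:3].decode(code_page)))
```

The two DOS code pages show the third byte as "└", while cp1251 shows it as the Cyrillic "А".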

yefkov · Post by *yefkov » 2024-07-22, 19:07 UTC

it triggers on text that contains "mzL" (6D 7A 4C)

I couldn't reproduce (TC 11.03). Only the text "MZ└" was found.
UPD: You forgot to mention that you are using a "wrong" code page. "└" can be assumed to be from cp850. So I searched using this code page. If I switch to cp1250 I get the result you describe.
This behavior is probably caused by the conversion of the search text from Unicode (TC search text box) to cp1250.
I don't know what conversion function TC uses, but WideCharToMultiByte, for example, has the following hint:

This flag prevents the function from mapping characters to characters that appear similar but have very different semantics. In some cases, the semantic change can be extreme. For example, the symbol for "∞" (infinity) maps to 8 (eight) in some code pages.

https://learn.microsoft.com/en-us/windows/win32/api/stringapiset/nf-stringapiset-widechartomultibyte
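
The hint quoted above describes the WC_NO_BEST_FIT_CHARS flag. As a rough illustration of a strict conversion (not of what TC does internally), the following Python sketch checks whether a search string can be represented in a given code page at all. Python's `str.encode` never applies best-fit mapping, so it only approximates the strict behaviour; the helper name `representable` is made up for this example.

```python
def representable(text: str, code_page: str) -> bool:
    """Return True if every character of text exists in code_page."""
    try:
        text.encode(code_page, errors="strict")
        return True
    except UnicodeEncodeError:
        return False


needle = "MZ\u2514"  # "MZ└"

# Code pages mentioned in this thread, using Python's codec names.
for code_page in ("cp850", "cp866", "cp1250", "cp1251", "cp1253", "cp1254"):
    print(code_page, representable(needle, code_page))
```

A check like this reports "└" as representable only in the DOS code pages, which is the situation where a best-fit conversion would otherwise substitute a different character.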

Post by *Hacker » 2024-07-22, 19:32 UTC

Not confirmed, either.

Roman

ter · Post by *ter » 2024-07-22, 21:00 UTC

My Windows code page is Cyrillic, and "└" can be seen in the OEM/DOS codepage (hotkey S).

Post by *petermad » 2024-07-23, 01:39 UTC

I assume this is about searching in Lister, not Searching for text in files with the Find Files dialog.

It seems to happen when Lister is in ANSI mode (hotkey A) and set to use one of the following codepages: 1250, 1251, 1253 and 1254. Hence it also happens if "Encoding" is set to "As configured for current Font" when the system codepage for the font is one of these.

I think yefkov is right about https://learn.microsoft.com/en-us/windows/win32/api/stringapiset/nf-stringapiset-widechartomultibyte giving the explanation

Post by *ghisler(Author) » 2024-07-24, 09:21 UTC

Not a bug. This happens when "└" isn't part of the current code page and you search in ANSI text. TC converts the search string from Unicode to ANSI with WideCharToMultiByte, and this function has the strange habit of converting characters to similar-looking characters, e.g. "└" to "L", if there is no equivalent in the current code page.
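
The effect described here can be reproduced in a small Python simulation. This is only a sketch under stated assumptions: the `BEST_FIT` table is hypothetical and contains just the single "└" to "L" substitution reported in this thread, and the case-insensitive byte search stands in for Lister's search, whose real implementation is not shown anywhere in the thread.

```python
# Hypothetical best-fit table: only the one mapping reported in this thread.
BEST_FIT = {"\u2514": "L"}  # "└" -> "L"


def encode_best_fit(text: str, code_page: str) -> bytes:
    """Encode text, substituting a look-alike when a character is missing."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(code_page)
        except UnicodeEncodeError:
            out += BEST_FIT.get(ch, "?").encode(code_page)
    return bytes(out)


def find_ignore_case(haystack: bytes, needle: bytes) -> int:
    """Case-insensitive byte search (ASCII letters only); -1 if not found."""
    return haystack.lower().find(needle.lower())


# Content of the test file from the first post.
text = b"MZ\xc0 search me\r\n.\r\n\r\n.\r\nmzL no"

needle = encode_best_fit("MZ\u2514", "cp1251")  # "└" is not in cp1251
pos = find_ignore_case(text, needle)
print(needle, pos, text[pos:pos + 3])
```

With cp1251 the converted needle no longer contains byte C0, so the search lands on "mzL". Passing "cp866" instead keeps the byte C0 and the match is at the start of the file.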

ter · Post by *ter » 2024-07-24, 12:48 UTC

Why does Find text in files warn about invalid characters in the encoding, but search in Lister does not?

I'm not searching in ANSI text, but in ASCII text.

Why does it find mzL but not MZ└ in the ASCII/DOS codepage? Which one is the "current" codepage? If it's displayed, then it's current.

AntonyD · Post by *AntonyD » 2024-07-24, 13:52 UTC

As I see it: if I set Lister's render option to `ANSI (Window charset)` and set the "Encoding" codepage to ASCII/DOS (866), then
I can find exactly this `MZ└`, and of course the phrase is displayed correctly on the screen.
But if I set the codepage to "As configured for current Font" or `ANSI local code page`, I cannot find this string.
Instead, the search finds "mzL". The rendering of the phrase is obviously broken as well: `MZА` is displayed.
So far, this is understood well enough and accepted as a fact.

If I set Lister's render option to `ASCII (DOS charset)`, the phrase is displayed correctly on the screen.
BUT! The much-needed "Encoding" menu item is then disabled, and WHICH codepage is chosen for the visual
interpretation of this text, I cannot know for sure. I can only guess that it should of course be 866, but who knows?
Given that assumption, I would still expect a repeated search to succeed.
BUT! The search now does NOT find `MZ└`; instead it finds "mzL", despite the fact
that the screen IS displaying the searched phrase `MZ└` correctly... Strange logic....

Post by *petermad » 2024-07-24, 14:52 UTC

and WHICH codepage is chosen for the visual
interpretation of this text, I cannot know for sure. I can only guess that it should of course be 866

I tested it a little

If, in the "Find Files" dialog, you select only ASCII charset (DOS), then codepage 850 (OEM) seems to be used for searching text in files, at least in Windows with Danish locale (codepage 865 is the Danish/Norwegian DOS codepage)...

AntonyD · Post by *AntonyD » 2024-07-25, 08:41 UTC

seems to be used for

This is the trouble: we seem to understand and seem to be right, but we are only guessing WHAT is happening at this moment and HOW.
It would seem that the most logical thing in this case is to display the factors on which
the rendering and search processes are based.
That is, simply do not disable the Encoding menu item. For this option, only "As configured for current Font"
needs to remain in this menu as a sub-item. It seems to me that this is how it works.

Post by *petermad » 2024-07-25, 13:26 UTC

2AntonyD
I tested this way:
I made a txt file with these characters: µ°Õ. In ASCII mode (S) they are displayed as Á░ı in Lister.
If I view the file in ANSI mode (A) in Lister, the characters are displayed as Á░ı only when I choose codepage DOS-LATIN1 (850).
And in "Find Files", with only "ASCII charset (DOS)" enabled, the file is found only when I use Á░ı in the "Find text" field, not if I use µ°Õ.

To me, that indicates that the ASCII mode in Lister is using codepage 850, and that the "ASCII charset (DOS)" option in "Find Files" also uses codepage 850.
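
The same deduction can be checked outside TC with a short Python sketch. It assumes the test file was saved in Windows-1252, which the post does not state; under that assumption the three bytes are B5 B0 D5, and decoding them with different DOS code pages shows which one produces Á░ı.

```python
# Assumption: the test file was written in Windows-1252 (not stated in the post).
raw = "µ°Õ".encode("cp1252")
print(raw.hex())

# Decode the same bytes with several DOS code pages (Python codec names).
for code_page in ("cp850", "cp865", "cp437"):
    print(code_page, raw.decode(code_page))

# True only for the code page that matches what Lister displayed.
matches_lister = {cp: raw.decode(cp) == "Á░ı" for cp in ("cp850", "cp865", "cp437")}
print(matches_lister)
```

Only cp850 reproduces the characters seen in Lister; cp865, the Danish/Norwegian DOS code page, gives different glyphs for the first and last byte.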

AntonyD · Post by *AntonyD » 2024-07-25, 15:04 UTC

2petermad
I fully agree that you, we, or anyone else can DEDUCE this information in the same logical way you described in your post,
by doing the simple things you listed...
BUT the essence of the problem is not whether all these people know WHAT steps to perform (and you seem to have described them here
in the form of a help paragraph), but that when using Lister there should be no situation where it is necessary to GUESS
on the basis of what data (codepages, fonts; what else affects this process, by the way?) the search is carried out.
For that matter, there is still no option in the search dialog to choose `in which code page` the user provides the data to search for,
so that the bytes of the search string from the input field can be interpreted correctly.

Post by *ghisler(Author) » 2024-08-08, 08:10 UTC

Moderator message

Moved to "will not be changed"

ter · Post by *ter » 2024-08-08, 12:18 UTC

Why will it not be changed? In other editors, it does work.
Why not use WC_NO_BEST_FIT_CHARS? No explanation.
Why not warn when the codepage differs? No explanation.

JOUBE · Post by *JOUBE » 2024-08-08, 14:37 UTC

ter wrote: 2024-08-08, 12:18 UTC
in other editors, it does work.

So, simply use these instead.

HTH

Joube
