BOM wipes search history

Zerryk · Post by *Zerryk » 2015-05-18, 19:21 UTC

I once searched for text files encoded in UTF-8, meaning any file containing a BOM. It wiped the search history.

How to reproduce:

Put the BOM into the clipboard:

create a text file in Notepad
Save as - UTF-8 encoding
open in Lister, press A 1 for ANSI text view
copy the first three characters into the clipboard
(they look like this: ď»ż, but do not copy them from this forum)

then:

Alt-F7 to open "Find files" dialog
look into the "Find text" combobox, some history is there
paste BOM into the "Find text" combobox
Start search
close the dialog
open that dialog again
look for history in the combo... nothing is there.

It looks like the same approach can wipe the file name search history, the command line history, etc.
My system is Vista 32 bit, 1250 codepage for ANSI.
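For reference, the three characters in the steps above are just the UTF-8 BOM bytes (EF BB BF) displayed in the local ANSI code page, so they look different on different systems. A minimal Python sketch, with a made-up helper name `bom_as_ansi`, that shows which characters the same bytes map to in code pages 1250, 1251 and 1252:

```python
import codecs

# The UTF-8 BOM is the byte sequence EF BB BF.
BOM = codecs.BOM_UTF8


def bom_as_ansi(codepage: str) -> str:
    """Decode the three BOM bytes as if they were text in an ANSI code page."""
    return BOM.decode(codepage)


for cp in ("cp1250", "cp1251", "cp1252"):
    # Print the code points instead of the glyphs so the output is plain ASCII.
    points = " ".join(f"U+{ord(ch):04X}" for ch in bom_as_ansi(cp))
    print(cp, points)
```

The bytes are identical in all three cases; only the characters shown for them differ.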

MVV · Post by *MVV » 2015-05-18, 20:22 UTC

Please also try with wincmd.ini saved in UTF-16 LE encoding (standard Windows Unicode).

Zerryk · Post by *Zerryk » 2015-05-18, 21:11 UTC

Wow, it helped! Now the history is remembered. I hope it won't break something else.

It seems that in a non-UTF-16 config file, TC stores Unicode strings prefixed with a BOM. So maybe a suggestion/bugfix: if nothing follows the BOM on a config line, don't get confused.

Or the config file should be automatically saved as UTF-16.

MVV · Post by *MVV » 2015-05-19, 07:43 UTC

Yes, it is correct that in a non-Unicode file TC uses BOMs to mark Unicode lines.

However, if all history is lost when you search for a BOM, that is a bug, and I can confirm this behaviour.

I checked it: the lines are physically there in the INI, but TC doesn't show them. Perhaps TC sees an empty (UTF-8 encoded) line and stops scanning for additional entries even though they exist.

milo1012 · Post by *milo1012 » 2015-05-19, 16:31 UTC

Zerryk wrote:I once searched for text files encoded in UTF-8, meaning any file containing a BOM.

I hope you're aware of the fact that not all UTF-8 files contain the BOM?
It is neither recommended nor required to use it for today's text parsers.

Anyway, you should enable 'Hex' and 'Case sensitive' and search directly for 'EFBBBF'.
It is better to do that for all special Unicode symbols, like RTL/LTR markers or Unicode newlines.
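Outside TC, the same byte-level search can be sketched in a few lines of Python. The function name `bom_offsets` is hypothetical; it simply reports where the sequence EF BB BF occurs in a byte string, for example the contents of a file read in binary mode.

```python
import codecs

BOM = codecs.BOM_UTF8  # EF BB BF


def bom_offsets(data: bytes) -> list[int]:
    """Return every byte offset at which the sequence EF BB BF occurs."""
    offsets = []
    pos = data.find(BOM)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(BOM, pos + 1)
    return offsets


def has_leading_bom(data: bytes) -> bool:
    """True only if the data starts with the UTF-8 BOM."""
    return data.startswith(BOM)
```

An offset of 0 means the file starts with a BOM; a file without any offsets can still be perfectly valid UTF-8.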

MVV wrote:Perhaps TC sees an empty (UTF-8 encoded) line and stops scanning for additional entries even though they exist.

Maybe.
Just try entering something like:

Code: Select all

ABC<BOM-chars>

and TC will save and reload them correctly.
If you try

Code: Select all

<BOM-chars>ABC

TC will reload it as 'ABC', skipping the BOM.

So why does TC scan for UTF-8 BOM, but at the start of the key only, anyway?
From what I can see, it neither saves nor should expect UTF-8 text anywhere in the INI.
But even if there is a place: using the BOM as a marker is generally bad for UTF-8, contrary to UTF-16.
You should always scan the sequence for valid encoding. It's not complicated, just some lines of code to check the sequence.
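A minimal sketch of such a check in Python, assuming a hypothetical helper name `is_valid_utf8` and relying on the standard strict decoder rather than hand-written sequence checks:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes form a well-formed UTF-8 sequence."""
    try:
        # Strict decoding rejects stray continuation bytes, truncated
        # sequences, overlong forms and encoded surrogates.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True
```

Note the limits of this approach: plain ASCII is always valid UTF-8, and a short ANSI string can happen to be valid UTF-8 by coincidence, so the check is a heuristic and not proof of the encoding.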

Maybe it's time to finally switch to a new config format, or do the INI load/save on its own.
I'm counting the days until MS will finally drop GetPrivateProfileString et al. from the API.

Update:

MVV wrote:I checked it, lines in INI are physically there

Yes, but try entering the BOM-chars a 2nd time manually and start the search, close the dialog:
all history is lost!

MVV · Post by *MVV » 2015-05-19, 19:40 UTC

So why does TC scan for UTF-8 BOM, but at the start of the key only, anyway?

I've answered this question just before your post. When TC needs to store a Unicode string in an ANSI configuration file, it prefixes that string with a BOM. This allows having only particular strings in Unicode while the rest of the file is in ANSI.

I'm counting the days until MS will finally drop GetPrivateProfileString et al. from the API.

What do you have against these functions? They are very useful for tiny programs! And they work just fine with UTF-16 files; you only need to create the file with a UTF-16 BOM beforehand.
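A minimal sketch of creating such a file, assuming Python; the helper names and the example section are hypothetical. It only produces the FF FE byte order mark followed by UTF-16 LE text and does not call the Windows profile API itself.

```python
import codecs
from pathlib import Path


def utf16_ini_bytes(text: str = "") -> bytes:
    """Return INI content as UTF-16 LE bytes, starting with the FF FE BOM."""
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")


def create_utf16_ini(path: str, text: str = "") -> None:
    """Create (or overwrite) an INI file in UTF-16 LE with a leading BOM."""
    Path(path).write_bytes(utf16_ini_bytes(text))


# Hypothetical content; Windows INI files conventionally use CRLF line ends.
example = utf16_ini_bytes("[Example]\r\nkey=value\r\n")
print(example[:2].hex(" "))
```

Creating the empty file with just the BOM is enough for the idea described here; the content argument is optional.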

Yes, but try entering the BOM-chars a 2nd time manually and start the search, close the dialog:
all history is lost!

I can understand this: TC rewrites all history entries and removes invalid ones.

milo1012 · Post by *milo1012 » 2015-05-19, 20:17 UTC

MVV wrote:I've answered this question just before your post. When TC needs to store a Unicode string in an ANSI configuration file, it prefixes that string with a BOM. This allows having only particular strings in Unicode while the rest of the file is in ANSI.

No.
You're talking about the UTF-16 BOM, where the behavior is completely clear (to store Unicode strings in an ANSI file).

But I'm asking why TC treats a UTF-8 BOM specially, which is just not applicable and makes no sense there,
because you'd either handle UTF-16 (a.k.a. 'Unicode') or ANSI strings in the INI file, but not UTF-8.
Also, TC doesn't treat the characters after the UTF-8 BOM as UTF-8; they are still interpreted as ANSI when saving.

MVV wrote:What do you have against these functions? They are very useful for tiny programs! And they work just fine with UTF-16 files...

I didn't say I'm against them, but they were already considered deprecated in the 90s, when MS urged developers to move all settings to the registry.
Also they force you to program in a C-Style, and have severe limitations (64k section size).
That's why I prefer things like SimpleIni, which uses C++ containers, plus it can use multi-line keys, UTF-8, and has no size limits.

MVV · Post by *MVV » 2015-05-20, 07:06 UTC

No.

I'm talking exactly about the UTF-8 BOM, which TC uses as a prefix for Unicode strings (with international characters) that are stored as UTF-8 when the configuration file is ANSI rather than UTF-16. It is not possible to use UTF-16 for a particular string in an ANSI file, but it is possible to use UTF-8 there. To Windows it looks like ANSI, so it can be stored in an ANSI file, but TC knows the string is UTF-8 because of the BOM. It is a special TC feature.
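The scheme described here can be sketched roughly as follows. This is an illustration in Python under stated assumptions, not TC's actual code: the function names are hypothetical, CP 1252 is assumed as the ANSI code page, and it is assumed that a string is stored as plain ANSI whenever the code page can represent it.

```python
import codecs

BOM = codecs.BOM_UTF8  # EF BB BF, used here as an "this value is UTF-8" marker
ANSI = "cp1252"        # assumed system ANSI code page


def encode_value(text: str, ansi: str = ANSI) -> bytes:
    """Store text as plain ANSI when possible, otherwise as BOM + UTF-8."""
    try:
        return text.encode(ansi)
    except UnicodeEncodeError:
        return BOM + text.encode("utf-8")


def decode_value(raw: bytes, ansi: str = ANSI) -> str:
    """Treat a leading BOM as the UTF-8 marker; everything else is ANSI."""
    if raw.startswith(BOM):
        return raw[len(BOM):].decode("utf-8")
    return raw.decode(ansi)


# A Cyrillic string cannot be stored in CP 1252, so it gets the marker.
raw = encode_value("привет")
print(raw.hex(" "))
```

Under this sketch a value of `<BOM-chars>ABC` comes back as `ABC`, which matches the behaviour reported earlier in the thread.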

I didn't say I'm against them, but they were already considered deprecated in the 90s, when MS urged developers to move all settings to the registry.

The registry is a pain for portability. I prefer INI files because they are simple and I don't need to re-configure every program when I reinstall Windows.

Also they force you to program in a C-Style, and have severe limitations (64k section size).

A fairly simple wrapper function allows retrieving strings as objects, so that isn't a reason for C-Style. And INIs (just like the registry) are not for huge data; they are for configuration parameters. Large binary data must be stored in separate files (e.g. in databases).

milo1012 · Post by *milo1012 » 2015-05-20, 08:27 UTC

Yeah, we have a little clash of terms here.
I thought you were talking about the ini file resaved as Unicode.
My bad.

In any case, TC shouldn't skip remaining entries if the first one is "empty".

<Offtopic>

MVV wrote:The registry is a pain for portability. I prefer INI files because they are simple and I don't need to re-configure every program when I reinstall Windows.

The point was that the functions have been outdated for at least 15 years,
not that Ini files are bad in general.
We'd probably have seen some updated functions with loosened limitations if MS still had
any interest in config file handling.

MVV wrote:A fairly simple wrapper function allows retrieving strings as objects, so that isn't a reason for C-Style.

The API itself is C style, so you need buffer allocations and end up with unnecessary (multiple) data copies.
In portable C++ with custom ini parsers you could just open a file stream and let the CRT handle everything.

MVV wrote:INIs (just like the registry) are not for huge data; they are for configuration parameters.
Large binary data must be stored in separate files (e.g. in databases).

I didn't mention binary data.
Just look at how many keys some MS programs (Office...) store in the registry
when installed. It regularly exceeds 1 MiB.
Ini files are just an (old) serialization format, like XML/JSON/YAML today, but far more "human readable".
But a config can be complex too, having thousands of entries, easily exceeding that 64k per section.
TC uses that format, and therefore inherits all its limitations.
</Offtopic>

MVV · Post by *MVV » 2015-08-19, 13:45 UTC

TC 8.52rc1 shows the BOM bytes in the text field when the search dialog is reopened, but unfortunately the search history is empty, i.e. previous items are still lost, and after restarting TC both the history and the text field are empty.

Quick test: search for the text 111, then for 222, then for 333, 444, п»ї, 777, 888. Now restart TC. Only 777 and 888 are in the history; the BOM and everything below it in the list are lost.

Post by *ghisler(Author) » 2015-08-20, 09:38 UTC

Strange, works fine for me! I tried with
TC 32-bit and 64-bit
Ansi and Unicode wincmd.ini

Maybe we are not talking about the same BOM? I mean the one at the start of UTF-8 files, looking like this here:
ï»¿

Or in hex: EF BB BF

Maybe you opened search in TC 8.51a when there was a BOM in the history? Then the history will get damaged again when closing the search, also for newer TC versions.

MVV · Post by *MVV » 2015-08-20, 10:09 UTC

We are talking about the same BOM; my text was simply in my local encoding (1251).

What I'm doing is searching for the BOM as text ï»¿ (bytes EF BB BF), but the next time I open the search dialog, its search text history is truncated.

In my example I expected to see the following search history:

Code: Select all

But I saw only the items before the BOM text:

Code: Select all

888
777

And this is the bug this topic is about. I think it happens because TC's INI reader treats the BOM string as an empty string and stops scanning for entries. So adding an additional BOM in order to store the BOM in UTF-8 would help here (the INI reader would cut only one BOM).
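The suspected mechanism can be simulated with a small Python sketch. Everything in it is an assumption made for illustration: the function names, the CP 1252 fallback, and in particular the idea that the reader stops at the first entry that decodes to an empty string. It is not TC's real reader.

```python
import codecs

BOM = codecs.BOM_UTF8  # EF BB BF


def decode_value(raw: bytes, ansi: str = "cp1252") -> str:
    """Strip a leading BOM marker and decode as UTF-8, else decode as ANSI."""
    if raw.startswith(BOM):
        return raw[len(BOM):].decode("utf-8")
    return raw.decode(ansi)


def load_history(raw_entries: list[bytes]) -> list[str]:
    """Read entries newest first, stopping at the first empty one (assumed)."""
    history = []
    for raw in raw_entries:
        value = decode_value(raw)
        if not value:
            break  # an "empty" entry is taken as the end of the list
        history.append(value)
    return history


# Raw values as they might sit in the INI after the quick test, newest first.
stored = [b"888", b"777", BOM, b"444", b"333", b"222", b"111"]
print(load_history(stored))
```

With these assumptions the bare BOM entry decodes to an empty string, so only 888 and 777 are returned even though the older entries are still present in the stored data.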

Post by *ghisler(Author) » 2015-08-20, 10:13 UTC

Hmm, this doesn't happen here any more. Maybe you still have the problem due to a different local encoding?

MVV · Post by *MVV » 2015-08-20, 10:17 UTC

I search for the BOM text in my local encoding, so the bytes should be the same as yours. However, there may be problems if you compare Unicode characters, because the characters of my BOM are different.

When you pass some text to the INI save function, it prepends a BOM. That is where it should compare the BOM with the passed string; the ANSI bytes of the BOM will be the same for all encodings.

milo1012 · Post by *milo1012 » 2015-08-20, 10:34 UTC

For me, a BOM search doesn't break the history any more in 8.52 RC1 with the default ANSI ini file (searching 'ï»¿' in CP 1252).
In my tests the history stayed safe in all circumstances.

Update:
I can confirm that for a system with CP 1251 the same old behavior is still present.

In my CP 1252 system I can see that for the BOM entry the following Bytes are stored:

Code: Select all

ï»¿Ã¯Â»Â¿
EF BB BF C3 AF C2 BB C2 BF

This would mean: the ANSI BOM characters themselves are converted to Unicode (stored as UTF-8) and, of course, prefixed with a BOM.
Doing the reverse when reading the key should be no problem: you'd get the same chars out of it (assuming the system ANSI CP didn't change, of course).

But for CP 1251 the entry is still the BOM only (not sure about the other CPs).
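The CP 1252 byte sequence quoted above can be reproduced with a short Python sketch. The helper names are hypothetical and the storage rule (BOM marker followed by the text as UTF-8) is inferred from the observed bytes, not taken from TC's source.

```python
import codecs

BOM = codecs.BOM_UTF8  # EF BB BF


def store_text(text: str) -> bytes:
    """Assumed storage form: BOM marker followed by the text as UTF-8."""
    return BOM + text.encode("utf-8")


def load_text(raw: bytes) -> str:
    """Strip exactly one leading BOM marker and decode the rest as UTF-8."""
    if not raw.startswith(BOM):
        raise ValueError("entry is not BOM-prefixed")
    return raw[len(BOM):].decode("utf-8")


# The three characters as typed into the search box on a CP 1252 system.
typed = BOM.decode("cp1252")
stored = store_text(typed)
print(stored.hex(" ").upper())  # EF BB BF C3 AF C2 BB C2 BF
```

Reading the value back and re-encoding it in CP 1252 yields the original bytes EF BB BF, so the entry is no longer empty after the marker is stripped.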
