Lister: "UTF-8" at begin of plain text is counted as UTF-8 encoding

The behaviour described in the bug report is either by design, or would be far too complex/time-consuming to be changed

Moderators: white, Hacker, petermad, Stefan2

Slavic
Senior Member
Posts: 290
Joined: 2006-02-26, 15:41 UTC
Location: Montenegro

Lister: "UTF-8" at begin of plain text is counted as UTF-8 encoding

Post by *Slavic »

When you open the HISTORY.TXT of the latest TC 10.00b3 in Lister, Lister shows it as UTF-8 encoded, but in fact it is not. The problem is that the text has the "UTF-8" substring too close to the beginning. This looks like a side effect of the heuristic algorithm that decides whether the text is plain text or UTF-8. To see an example, create TEST.TXT with this content:

"This text does not use UTF-8 encoding, it's simple ASCII."

Save it in Notepad as ANSI and open it in Lister with Autodetect enabled in the configuration (the default setting). Check the Options menu: it shows UTF-8, even though this example is plain 7-bit ASCII.
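
For anyone who prefers to build the test file without Notepad, here is a minimal sketch. It assumes that "ANSI" means code page 1252 (a Western Windows setup) and that the surrounding quote marks are not part of the file; the helper names `write_test_file` and `is_seven_bit` are made up for illustration.

```python
from pathlib import Path
import tempfile

# The sentence from the post, without the surrounding quote marks (assumption).
SAMPLE = "This text does not use UTF-8 encoding, it's simple ASCII."


def write_test_file(path: Path) -> bytes:
    """Write the sample as Notepad's ANSI option would, assuming cp1252."""
    data = SAMPLE.encode("cp1252")
    path.write_bytes(data)
    return data


def is_seven_bit(data: bytes) -> bool:
    """True if no byte in the data is above 127."""
    return all(byte < 128 for byte in data)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        data = write_test_file(Path(tmp) / "TEST.TXT")
        # The file is pure 7-bit text, yet it contains the marker substring.
        print(is_seven_bit(data), b"UTF-8" in data)
```

Because every byte is below 128, the file is byte-for-byte identical whether it is saved as ANSI or as UTF-8 without a byte order mark, so only the substring can be what triggers the UTF-8 display.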

Reason: apparently Lister tries several methods to determine the encoding. Besides the two "standard" ones (a UTF-8 byte order mark, and valid UTF-8 byte sequences somewhere in the text), it also checks for the "UTF-8" substring, which is widely used in HTML files, for example:

content="text/html; charset=utf-8"

While this is a good way to determine the encoding of HTML code, the presence of the "UTF-8" substring near the beginning of a simple text doesn't automatically make it UTF-8 in the usual sense. I would suggest using this heuristic very carefully when the file is simple text rather than HTML or XML.
Desktop: Windows 11 Pro 23H2, TC 11.03(RC). Mobile: Pixel 5a, Android 14, TC 3.42b5
petermad
Power Member
Posts: 14739
Joined: 2003-02-05, 20:24 UTC
Location: Denmark

Re: Lister: "UTF-8" at begin of plain text is counted as UTF-8 encoding

Post by *petermad »

This is an old issue - see: viewtopic.php?p=353267#p353267
License #524 (1994)
Danish Total Commander Translator
TC 11.03 32+64bit on Win XP 32bit & Win 7, 8.1 & 10 (22H2) 64bit, 'Everything' 1.5.0.1371a
TC 3.50b4 on Android 6 & 13
Try: TC Extended Menus | TC Languagebar | TC Dark Help | PHSM-Calendar

Re: Lister: "UTF-8" at begin of plain text is counted as UTF-8 encoding

Post by *Slavic »

petermad wrote: 2021-03-25, 20:50 UTC This is an old issue - see: viewtopic.php?p=353267#p353267
This is not exactly the same case. Even if the cause is the same or similar, it was explained differently in that discussion:
Total Commander looks in the first read block (128 kBytes) of each file whether it
- contains valid UTF-8 codes
or
- contains valid UTF-8 encoding headers
As you can see, nothing in my test can be counted as a valid UTF-8 encoding header, or even as a fragment of one. The only fragment is "UTF-8", and apparently that is counted as sufficient. But a header should include some distinctive features, like angle brackets. So I assume that the detection of UTF-8 encoding is oversimplified in some cases, intentionally or not.
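
To illustrate what a check for "distinctive features" could look like, here is a hedged sketch that accepts only two explicit declaration forms: an XML declaration and an HTML meta tag. The patterns and the name `has_utf8_declaration` are hypothetical and say nothing about how Total Commander is implemented.

```python
import re

# Hypothetical patterns: explicit declarations rather than the bare word "UTF-8".
_DECLARATIONS = [
    # e.g. <?xml version="1.0" encoding="UTF-8"?>
    re.compile(rb"<\?xml[^>]*encoding\s*=\s*[\"']utf-8[\"']", re.IGNORECASE),
    # e.g. <meta charset="utf-8"> or content="text/html; charset=utf-8"
    re.compile(rb"<meta[^>]*charset\s*=\s*[\"']?utf-8", re.IGNORECASE),
]


def has_utf8_declaration(head: bytes) -> bool:
    """True if the start of the file carries one of the explicit declarations."""
    return any(pattern.search(head) for pattern in _DECLARATIONS)


if __name__ == "__main__":
    plain = b"This text does not use UTF-8 encoding, it's simple ASCII."
    html = b'<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
    print(has_utf8_declaration(plain), has_utf8_declaration(html))
```

The trade-off is that a stricter check like this misses every declaration style it does not list, so it can only ever be an approximation.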
ghisler(Author)
Site Admin
Posts: 48021
Joined: 2003-02-04, 09:46 UTC
Location: Switzerland

Re: Lister: "UTF-8" at begin of plain text is counted as UTF-8 encoding

Post by *ghisler(Author) »

This is indeed intentional when the file does NOT contain any character codes above 127. There are many, many ways to indicate the character encoding in HTML, XML etc. files, so I just look for the string "UTF-8". It doesn't do any harm as long as the file doesn't contain any non-UTF-8 characters further down. If anyone has a better idea, just let me know. Removing this feature would cause much more harm.
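
The "no harm" point can be checked with a small sketch: text made only of codes up to 127 decodes to the same characters as UTF-8 and as ANSI. Code page 1252 is assumed for ANSI here, and `decodes_identically` is a made-up helper name.

```python
def decodes_identically(data: bytes) -> bool:
    """True if UTF-8 and ANSI (cp1252 assumed) decoding yield the same text."""
    try:
        as_utf8 = data.decode("utf-8")
    except UnicodeDecodeError:
        # Bytes above 127 that are not valid UTF-8: the two views differ.
        return False
    # errors="replace" covers the few byte values that cp1252 leaves undefined.
    return as_utf8 == data.decode("cp1252", errors="replace")


if __name__ == "__main__":
    ascii_only = b"This text does not use UTF-8 encoding, it's simple ASCII."
    with_umlauts = "äö".encode("cp1252")
    print(decodes_identically(ascii_only), decodes_identically(with_umlauts))
```

So for a pure 7-bit file the label in the Options menu changes, but the displayed text does not.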
Author of Total Commander
https://www.ghisler.com

Re: Lister: "UTF-8" at begin of plain text is counted as UTF-8 encoding

Post by *petermad »

I just thought I would test it again with the latest history.txt file from TC 10 b3 - and to my surprise, when I tested it in my work-TC it showed up in "Text only" mode.

Then I tested it in my 32-bit TC and it shows up as "UTF-8".
Now I tested it in 64-bit TC with a clean .ini file and it shows up as "UTF-8" too.

After some sleuthing I found out that setting the [Lister] parameter UnwrapWidth to 1738 or higher makes TC show the file in "Text only" mode.

I wonder how that setting influences the view mode (or is it a hidden feature ? :wink: )

Re: Lister: "UTF-8" at begin of plain text is counted as UTF-8 encoding

Post by *ghisler(Author) »

Strange indeed, I will check it in the debugger.

Re: Lister: "UTF-8" at begin of plain text is counted as UTF-8 encoding

Post by *Slavic »

petermad wrote: 2021-03-26, 12:52 UTC After some sleuthing I found out that setting the [Lister] parameter UnwrapWidth to 1738 or higher makes TC show the file in "Text only" mode.
Interesting result, I was able to reproduce it as well. The magic of the value 1738 (0x6CA) isn't clear, however.
Unfortunately, it doesn't affect my test example, which shows UTF-8 anyway.

As for HISTORY.TXT, I dug deeper and found the culprit :wink:
Open it in Notepad (or another editor) and find the line
17.08.16 Fixed: Write SetEncoding=äö.do.not.remove to wincmd.ini [Configuration] on load, to prevent notepad from opening/saving the file as UTF-8 when it contains UTF-8 encoded parts (32/64)
then replace äö with ao (without umlauts), save, and compare the result. These two characters are 8-bit while all the other text is 7-bit, and that is the reason.
The logic is exactly as Christian said above: if all the text is 7-bit, the "UTF-8" substring is taken as an indicator of UTF-8 encoding.
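
A short sketch of why those two characters matter, assuming HISTORY.TXT is stored in code page 1252: the umlauts become single bytes above 127 that do not form a valid UTF-8 sequence. The helper names are invented for this example.

```python
def high_byte_offsets(data: bytes) -> list[int]:
    """Zero-based offsets of all bytes above 127."""
    return [index for index, byte in enumerate(data) if byte > 127]


def is_valid_utf8(data: bytes) -> bool:
    """True if the bytes decode as UTF-8 without errors."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Fragment of the HISTORY.TXT line, encoded as ANSI (cp1252 is an assumption).
line_ansi = "SetEncoding=äö.do.not.remove".encode("cp1252")
# The same fragment after replacing the umlauts with plain letters.
line_ascii = "SetEncoding=ao.do.not.remove".encode("cp1252")

if __name__ == "__main__":
    print(high_byte_offsets(line_ansi), is_valid_utf8(line_ansi))
    print(high_byte_offsets(line_ascii), is_valid_utf8(line_ascii))
```

In cp1252, ä and ö are the bytes 0xE4 and 0xF6. In UTF-8, 0xE4 would start a three-byte sequence and must be followed by continuation bytes, which 0xF6 is not, so the pair cannot be read as UTF-8.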

Re: Lister: "UTF-8" at begin of plain text is counted as UTF-8 encoding

Post by *Slavic »

I performed another test to find out where exactly the "UTF-8" substring has to be to affect the encoding. It must occur no farther than 254 bytes from the beginning. If it is located at the 255th position or farther, it is not counted and the text is detected as "Text only". This explains the difference between HISTORY.TXT in beta2 and beta3. I am still thinking about how the detection could be improved, but don't have a definite solution.
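
A test like this can be repeated with generated probe files. The sketch below builds 7-bit text with the marker at a chosen position; the post does not say whether positions are counted from zero or from one, so the helper (a hypothetical `make_probe`) takes a zero-based offset, and files on both sides of the boundary have to be opened in Lister by hand.

```python
from pathlib import Path
import tempfile

MARKER = b"UTF-8"


def make_probe(offset: int, total: int = 400) -> bytes:
    """7-bit filler text with the marker starting at a zero-based byte offset."""
    if offset < 0 or offset + len(MARKER) > total:
        raise ValueError("marker does not fit into the probe")
    # The filler has no line breaks; real files may behave differently (untested).
    return b"x" * offset + MARKER + b"x" * (total - offset - len(MARKER))


def write_probes(folder: Path, offsets: range) -> list[Path]:
    """Write one probe file per offset and return the created paths."""
    paths = []
    for offset in offsets:
        path = folder / f"probe_{offset:03d}.txt"
        path.write_bytes(make_probe(offset))
        paths.append(path)
    return paths


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for path in write_probes(Path(tmp), range(250, 261)):
            print(path.name)
```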

Re: Lister: "UTF-8" at begin of plain text is counted as UTF-8 encoding

Post by *ghisler(Author) »

This happens because when TC sees "UTF-8" within the first 259 characters, it searches the entire buffer for non-UTF-8 characters above 127. If it finds any, it knows that the file isn't UTF-8. When you set a larger UnwrapWidth value, TC uses a larger text buffer so it can show the entire page without re-reading the buffer. That's why a larger buffer is searched in this case.
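
The behaviour described here can be modelled roughly as follows. This is only a sketch of the marker branch, not Total Commander's code: the buffer sizes, the case-insensitive match, the exact handling of the 259-character boundary, and the names `detect_mode` and `has_invalid_utf8` are all assumptions, and detection by byte order mark or by valid multi-byte sequences is left out.

```python
import codecs

MARKER = b"UTF-8"
MARKER_WINDOW = 259  # figure from the post; exact boundary handling is assumed


def has_invalid_utf8(buf: bytes) -> bool:
    """True if buf holds bytes above 127 that do not form valid UTF-8."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        # final=False tolerates a multi-byte sequence cut off at the buffer end.
        decoder.decode(buf, final=False)
    except UnicodeDecodeError:
        return True
    return False


def detect_mode(data: bytes, buffer_size: int) -> str:
    """Model of the marker branch only; other detection paths are omitted."""
    buf = data[:buffer_size]
    # Case-insensitive matching is an assumption (HTML often writes "utf-8").
    if MARKER in buf[:MARKER_WINDOW].upper():
        return "Text only" if has_invalid_utf8(buf) else "UTF-8"
    return "Text only"


if __name__ == "__main__":
    # Marker near the start, two ANSI umlaut bytes roughly 5000 bytes in.
    data = b"History file, not UTF-8 encoded.\n" + b"x" * 5000 + b"\xe4\xf6 tail\n"
    print(detect_mode(data, 4096), detect_mode(data, 8192))
```

With the smaller hypothetical buffer the umlaut bytes are never scanned, so the marker wins; with the larger one they are found and the file falls back to "Text only", which mirrors the effect reported for UnwrapWidth.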