Lister: "UTF-8" at begin of plain text is counted as UTF-8 encoding

Post by *Slavic » 2021-03-25, 20:06 UTC

When you open HISTORY.TXT from the latest TC 10.00b3 in Lister, Lister shows it as UTF-8 encoded, but in fact it is not. The problem is that this text has the "UTF-8" substring too close to the beginning. This looks like a side effect of the heuristic algorithm that decides whether a file is plain text or UTF-8. To see an example, create TEST.TXT:

"This text does not use UTF-8 encoding, it's simple ASCII."

Save it in Notepad as ANSI and open it in Lister with Autodetect enabled in the configuration (the default setting). Check the Options menu: it shows UTF-8, even though this example is plain 7-bit ASCII.
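For anyone who prefers to script the test file instead of using Notepad, here is a minimal Python sketch. It assumes that Notepad's "ANSI" corresponds to code page 1252 (this depends on the Windows locale) and that the quotes are part of the file. The function names are made up for illustration.

```python
TEST_LINE = "\"This text does not use UTF-8 encoding, it's simple ASCII.\""

def build_test_bytes(text=TEST_LINE):
    # Assumption: "ANSI" in Notepad means code page 1252 on this system.
    return text.encode("cp1252")

def is_seven_bit(data):
    # True when no byte is above 127, i.e. the data is plain 7-bit ASCII.
    return all(b < 128 for b in data)

if __name__ == "__main__":
    data = build_test_bytes()
    with open("TEST.TXT", "wb") as f:
        f.write(data)
    print("7-bit only:", is_seven_bit(data))
    print('contains "UTF-8":', b"UTF-8" in data)
```

The file is pure 7-bit ASCII, so the only thing in it that could hint at UTF-8 is the substring itself.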

Reason: apparently Lister tries to determine the encoding using several methods. Besides the two "standard" ones (a UTF-8 byte order mark and actual UTF-8 byte sequences somewhere in the text), it also checks for the "UTF-8" substring, which is widely used in HTML files, for example:

content="text/html; charset=utf-8"

While this helps to determine the encoding of HTML code, the presence of the "UTF-8" substring near the beginning of a simple text does not automatically make it UTF-8. I would suggest using this heuristic very carefully when the file is simple text rather than HTML or XML.

Post by *petermad » 2021-03-25, 20:50 UTC

This is an old issue - see: viewtopic.php?p=353267#p353267

Post by *Slavic » 2021-03-26, 07:29 UTC

petermad wrote: ↑2021-03-25, 20:50 UTC This is an old issue - see: viewtopic.php?p=353267#p353267

This is not exactly the same case. Even if the cause is the same or similar, it was explained differently in that discussion:

Total Commander looks in the first read block (128 kBytes) of each file whether it
- contains valid UTF-8 codes
or
- contains valid UTF-8 encoding headers

As you can see, nothing in my test can be counted as a valid UTF-8 encoding header, or even a fragment of one. The only fragment is "UTF-8", and apparently that alone is treated as sufficient. But a header should include some distinctive features, such as angle brackets. So I assume that the detection of UTF-8 encoding is oversimplified in some cases, intentionally or not.
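To illustrate what a stricter check could look like, here is a small Python sketch. It is only an idea, not how Total Commander works: the pattern and the function name are hypothetical, and it accepts "UTF-8" only when it is the value of a charset= or encoding= declaration.

```python
import re

# Hypothetical pattern: "UTF-8" only counts when it is the value of a
# charset= or encoding= declaration, as in an HTML meta tag or XML prolog.
DECLARATION = re.compile(
    rb"(?:charset|encoding)\s*=\s*[\"']?\s*utf-8", re.IGNORECASE
)

def has_utf8_declaration(head):
    # head: the first bytes of the file, as a bytes object.
    return DECLARATION.search(head) is not None

if __name__ == "__main__":
    print(has_utf8_declaration(b'content="text/html; charset=utf-8"'))
    print(has_utf8_declaration(b'<?xml version="1.0" encoding="UTF-8"?>'))
    print(has_utf8_declaration(
        b"This text does not use UTF-8 encoding, it's simple ASCII."))
```

With this pattern the HTML and XML samples match and my test sentence does not. It would certainly miss some of the ways an encoding can be declared, so it trades false positives for false negatives.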

Post by *ghisler(Author) » 2021-03-26, 08:35 UTC

This is indeed intentional when the file does NOT contain any character codes above 127. There are many many ways to indicate the character encoding in html, xml etc. files, so I just look for UTF-8. It doesn't do any harm as long as the file doesn't contain any non-UTF-8 characters further down. If anyone has a better idea, just let me know. Removing this feature would cause much more harm.

Post by *petermad » 2021-03-26, 12:52 UTC

I just thought I would test it again with the latest history.txt file from TC 10 b3 - and to my surprise, when I tested it in my work-TC it showed up in "Text only" mode.

Then I tested it in my 32-bit TC and it shows up as "UTF-8".
Now I tested it in 64-bit TC with a clean .ini file and it shows up as "UTF-8".

After some sleuthing I found out that setting the [Lister] parameter UnwrapWidth to 1738 or higher makes TC show the file in "Text only" mode.

I wonder how that setting influences the view mode (or is it a hidden feature?)

Post by *ghisler(Author) » 2021-03-26, 14:31 UTC

Strange indeed, I will check it in the debugger.

Post by *Slavic » 2021-03-26, 16:18 UTC

petermad wrote: ↑2021-03-26, 12:52 UTC After some sleuthing I found out that setting the [Lister] parameter UnwrapWidth to 1738 or higher makes TC show the file in "Text only" mode.

Interesting result, I was able to reproduce it as well. The magic of the value 1738 (0x6CA) isn't clear, however.
Unfortunately, it doesn't affect my test example, which shows UTF-8 anyway.

As for HISTORY.TXT, I dug deeper and found the culprit.

Open it in Notepad (or another editor) and find the line:

17.08.16 Fixed: Write SetEncoding=äö.do.not.remove to wincmd.ini [Configuration] on load, to prevent notepad from opening/saving the file as UTF-8 when it contains UTF-8 encoded parts (32/64)

Then replace äö with ao (without umlauts), save, and compare the result. These two characters are 8-bit, while all the other text is 7-bit, and that is the reason.
The logic is exactly as Christian said above: if all the text is 7-bit, then the "UTF-8" substring is counted as an indicator of UTF-8 encoding.
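The difference between the two variants of that line can be checked with a few lines of Python. This is only a sketch: it assumes HISTORY.TXT is stored in code page 1252, and the helper names are made up.

```python
# The umlauts are written as escapes so the snippet does not depend on
# the encoding of this source file: \u00e4 is a-umlaut, \u00f6 is o-umlaut.
WITH_UMLAUTS = "Write SetEncoding=\u00e4\u00f6.do.not.remove to wincmd.ini"
WITHOUT_UMLAUTS = "Write SetEncoding=ao.do.not.remove to wincmd.ini"

def high_bytes(data):
    # All bytes above 127, i.e. everything that is not 7-bit ASCII.
    return [b for b in data if b > 127]

def is_valid_utf8(data):
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

if __name__ == "__main__":
    for text in (WITH_UMLAUTS, WITHOUT_UMLAUTS):
        raw = text.encode("cp1252")  # assumption: the file is ANSI / cp1252
        print([hex(b) for b in high_bytes(raw)], is_valid_utf8(raw))
```

In code page 1252 the two umlauts are the single bytes 0xE4 and 0xF6. Together they do not form a valid UTF-8 sequence, so the line with umlauts rules UTF-8 out, while the line with "ao" is plain ASCII and leaves the "UTF-8" substring as the deciding factor.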

Post by *Slavic » 2021-03-26, 20:12 UTC

I performed another test to find out where exactly the "UTF-8" substring must be to affect the encoding. It must occur no farther than 254 bytes from the beginning. If it is located at the 255th position or later, it is not counted and the text is detected as "Text only". This explains the difference between HISTORY.TXT in beta2 and beta3. I am still thinking about how the detection could be improved, but I don't have a definite solution.

Post by *ghisler(Author) » 2021-03-28, 10:40 UTC

This happens because when TC sees UTF-8 within the first 259 characters, it searches the entire buffer for non-utf8 characters above 127. If it finds any, it knows that the file isn't UTF-8. When you set a larger UnwrapWidth value, TC uses a larger text buffer so it can show the entire page without re-reading the buffer. That's why a larger buffer will be searched in this case.
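The rule described above can be sketched in Python roughly as follows. This is not Total Commander's code: the function names are hypothetical, the case-insensitive match is an assumption, and the buffer sizes in the example are placeholders chosen only to show the effect of a larger buffer.

```python
import codecs

def has_invalid_utf8(buffer):
    # Strict incremental decoding with final=False: invalid bytes raise an
    # error, but a multi-byte sequence cut off at the buffer end does not.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    try:
        decoder.decode(buffer, final=False)
        return False
    except UnicodeDecodeError:
        return True

def marker_says_utf8(data, buffer_size, marker_window=259):
    # Step 1: look for the "UTF-8" marker near the start of the file.
    # Assumption: the match is case-insensitive.
    if b"UTF-8" not in data[:marker_window].upper():
        return False
    # Step 2: scan only the part of the file that fits into the buffer.
    return not has_invalid_utf8(data[:buffer_size])

if __name__ == "__main__":
    # Marker at the start, two cp1252 umlaut bytes about 1000 bytes later.
    sample = b"UTF-8 notes\n" + b"x" * 1000 + b"\xe4\xf6\n"
    print(marker_says_utf8(sample, buffer_size=500))   # umlauts not scanned
    print(marker_says_utf8(sample, buffer_size=2000))  # umlauts scanned
```

With the small buffer the umlaut bytes are never seen, so the marker wins and the result is UTF-8. With the larger buffer they are found and the file is no longer treated as UTF-8, which matches what was observed when UnwrapWidth was increased.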
