HISTORY.TXT detected as utf-8 by compare tool

white
Power Member
Posts: 6022
Joined: 2003-11-19, 08:16 UTC
Location: Netherlands

Post by *white »

As of TC 8.0 beta 18, the file HISTORY.TXT is detected as a utf-8 file by the compare tool. Why is HISTORY.TXT now considered to be utf-8, but not when it was smaller?

The critical size seems to lie at 390393 bytes. If I remove text at the top until the file starts with "d settings (and F2) while a comparison is in progress (32/64)", the file is 390393 bytes in size and no longer detected as utf-8.

If I then add "e" at the beginning (file size 390394 bytes), the file is detected as utf-8. Moreover, it is also detected as utf-8 by the internal lister, which isn't the case for the original HISTORY.TXT file.

The following lines seem to be related:
04.11.10 Added: Compare by Content: Auto-detect UTF-8-encoded xml files by tag <?xml version="1.0" encoding="UTF-8" ...
04.11.10 Added: Compare by Content: Auto-detect UTF-8-encoded html files by tag <meta http-equiv="Content-type" charset=UTF-8" ...
If I change the text "UTF-8" into "UTF-7" ;-) in these lines in the original HISTORY.TXT file, the file is no longer considered to be utf-8. But if I do the same in the 390394 bytes file, it doesn't work.

Is utf-8 detection working as designed? How does it work exactly?
ghisler(Author)
Site Admin
Posts: 50934
Joined: 2003-02-04, 09:46 UTC
Location: Switzerland

Post by *ghisler(Author) »

No, it's not a bug, it happens because the history.txt
1. Contains a valid header which lets TC think that it's UTF-8
2. Does NOT contain any accented characters which are not UTF-8.
Author of Total Commander
https://www.ghisler.com
white
Power Member
Posts: 6022
Joined: 2003-11-19, 08:16 UTC
Location: Netherlands

Post by *white »

ghisler(Author) wrote:No, it's not a bug, it happens because the history.txt
1. Contains a valid header which lets TC think that it's UTF-8
2. Does NOT contain any accented characters which are not UTF-8.
As explained above, that's NOT the case. When the file size is at most 390393 bytes (remove text from the beginning, not from the end), the file is considered to be ANSI. If any ASCII characters are then added at the beginning, the file is considered to be utf-8. Why?
ghisler(Author)
Site Admin
Posts: 50934
Joined: 2003-02-04, 09:46 UTC
Location: Switzerland

Post by *ghisler(Author) »

TC compares the files in 128kByte chunks. The Unicode detection is only done in the first 128k block. I guess that the line with the UTF-8 header gets near the end of that block when you add more text at the beginning. Sorry, I have no plans to change that behaviour.
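The behaviour described here can be illustrated with a minimal Python sketch. This is not Total Commander's actual code: the function names, the header patterns, and the exact block size of 128 * 1024 bytes are assumptions made for illustration, based only on the posts and the two HISTORY.TXT entries quoted above.

```python
import codecs
import re

# "128kByte" per the post; the exact byte count is an assumption.
BLOCK_SIZE = 128 * 1024

# Hypothetical header patterns, modelled on the two quoted HISTORY.TXT entries.
HEADER_RE = re.compile(rb'encoding="utf-8"|charset=utf-8', re.IGNORECASE)


def block_is_valid_utf8(block: bytes) -> bool:
    """Return True if the block contains no invalid UTF-8 sequence.

    final=False tolerates a multi-byte character that is cut off
    at the very end of the block.
    """
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        decoder.decode(block, final=False)
    except UnicodeDecodeError:
        return False
    return True


def detect_utf8(data: bytes, block_size: int = BLOCK_SIZE) -> bool:
    """Treat data as UTF-8 only if the FIRST block has a UTF-8 header
    and contains no invalid UTF-8 sequences. Later blocks are ignored."""
    first_block = data[:block_size]
    return bool(HEADER_RE.search(first_block)) and block_is_valid_utf8(first_block)
```

Under these assumptions, anything beyond the first block (a header tag or an ANSI accent) has no influence on the result, which matches the size-dependent behaviour reported in this thread.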
white
Power Member
Posts: 6022
Joined: 2003-11-19, 08:16 UTC
Location: Netherlands

Post by *white »

ghisler(Author) wrote:TC compares the files in 128kByte chunks. The Unicode detection is only done in the first 128k block. I guess that the line with the UTF-8 header gets near the end of that block when you add more text at the beginning. Sorry, I have no plans to change that behaviour.
It is the following text which gets near the end of the block:
16.03.10 Added: With the help from a German user, changed all German texts and the Help from Swiss to German spellings (ß character)
So it is the ß (sharp s) character: once it moves past the first 128k block, it no longer causes the file to be interpreted as ANSI. Thanks for the explanation.
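The boundary effect can be reproduced with synthetic data. The sketch below is only an illustration, not Total Commander's code: it isolates the validity check on the first block (ignoring the header test), and the names `first_block_is_valid_utf8` and `make_file` as well as the 128 * 1024 byte block size are assumptions.

```python
# "128k" block per the thread; the exact byte count is an assumption.
BLOCK_SIZE = 128 * 1024


def first_block_is_valid_utf8(data: bytes, block_size: int = BLOCK_SIZE) -> bool:
    """Check only the first block for invalid UTF-8 sequences.

    Simplification: a strict decode also rejects a valid multi-byte
    character that happens to be split by the block boundary.
    """
    try:
        data[:block_size].decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def make_file(prefix_len: int) -> bytes:
    """ASCII filler, then one ANSI-encoded sharp s (byte 0xDF), then ASCII."""
    return b"x" * prefix_len + b"\xdf" + b" character)\n"


# The 0xDF byte is the last byte of the first block: the check sees it.
inside = make_file(BLOCK_SIZE - 1)
# One more byte in front: 0xDF is now the first byte of the second block.
outside = make_file(BLOCK_SIZE)

print(first_block_is_valid_utf8(inside))   # False
print(first_block_is_valid_utf8(outside))  # True
```

A single extra byte at the start of the file is enough to flip the result, just as adding "e" did in the experiment described in the first post.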
ghisler(Author)
Site Admin
Posts: 50934
Joined: 2003-02-04, 09:46 UTC
Location: Switzerland

Post by *ghisler(Author) »

Ah, yes it is! The ß would be an invalid UTF-8 sequence, so TC knows that the file isn't UTF-8...
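Why an ANSI ß is an invalid UTF-8 sequence can be shown in a few lines of Python. The sketch assumes that "ANSI" here means Windows code page 1252, where ß is the single byte 0xDF; the helper name `is_valid_utf8` is made up for this example.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


text = "Swiss to German spellings (ß character)"

# Assumption: ANSI = Windows-1252, where ß is the single byte 0xDF.
ansi_bytes = text.encode("cp1252")
# In UTF-8, ß is the two-byte sequence 0xC3 0x9F.
utf8_bytes = text.encode("utf-8")

# 0xDF is 0b11011111, a UTF-8 lead byte that announces a two-byte sequence.
# It must be followed by a continuation byte (0b10xxxxxx), but the next
# byte here is a space (0x20), so the sequence is invalid.
print(is_valid_utf8(ansi_bytes))  # False
print(is_valid_utf8(utf8_bytes))  # True
```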
white
Power Member
Posts: 6022
Joined: 2003-11-19, 08:16 UTC
Location: Netherlands

Post by *white »

2ghisler(Author)

I noticed you added "(except when containing non-UTF8 chars like ß)" to the following line in HISTORY.TXT:
04.11.10 Added: Compare by Content: Auto-detect UTF-8-encoded html files by tag <meta http-equiv="Content-type" charset=UTF-8" ...
This causes the HISTORY.TXT file to be considered as ANSI by the compare tool. However, the added remark is not correct: the code to check for non-UTF8 chars was added later, in Total Commander 7.56a release candidate 1.
14.12.10 Fixed: Compare by content: Do not show files as UTF-8 if they contain the UTF-8 HTML header but also invalid UTF-8 characters (e.g. plain text accents)
ghisler(Author)
Site Admin
Posts: 50934
Joined: 2003-02-04, 09:46 UTC
Location: Switzerland

Post by *ghisler(Author) »

However it is not correct, the code to check for non-UTF8 chars was added later in Total Commander 7.56a release candidate 1.
Actually it IS correct, please re-read the description of the fix! ß is NOT a UTF-8 character, so the file isn't shown as UTF-8 although it does have an HTML header.
white
Power Member
Posts: 6022
Joined: 2003-11-19, 08:16 UTC
Location: Netherlands

Post by *white »

ghisler(Author) wrote:
However it is not correct, the code to check for non-UTF8 chars was added later in Total Commander 7.56a release candidate 1.
Actually it IS correct, please re-read the description of the fix! ß is NOT a UTF-8 character, so the file isn't shown as UTF-8 although it does have an HTML header.
Yes, I know. What I meant was that it is not correct to say in HISTORY.TXT you added the check for non-UTF-8 characters on 04.11.10.
HISTORY.TXT wrote:04.11.10 Added: Compare by Content: Auto-detect UTF-8-encoded html files by tag <meta http-equiv="Content-type" charset=UTF-8" ... (except when containing non-UTF8 chars like ß)
The check wasn't built in until Total Commander 7.56a release candidate 1. I tested it.
HISTORY.TXT wrote:14.12.10 Fixed: Compare by content: Do not show files as UTF-8 if they contain the UTF-8 HTML header but also invalid UTF-8 characters (e.g. plain text accents)
ghisler(Author)
Site Admin
Posts: 50934
Joined: 2003-02-04, 09:46 UTC
Location: Switzerland

Post by *ghisler(Author) »

I see what you mean - it was meant as a correction/clarification of the fix for people who read it later. I may move the ß up to the 14.12.10 entry.
white
Power Member
Posts: 6022
Joined: 2003-11-19, 08:16 UTC
Location: Netherlands

Post by *white »

white wrote:Yes, I know. What I meant was that it is not correct to say in HISTORY.TXT you added the check for non-UTF-8 characters on 04.11.10.
HISTORY.TXT wrote:04.11.10 Added: Compare by Content: Auto-detect UTF-8-encoded html files by tag <meta http-equiv="Content-type" charset=UTF-8" ... (except when containing non-UTF8 chars like ß)
The check wasn't built in until Total Commander 7.56a release candidate 1. I tested it.
HISTORY.TXT wrote:14.12.10 Fixed: Compare by content: Do not show files as UTF-8 if they contain the UTF-8 HTML header but also invalid UTF-8 characters (e.g. plain text accents)
ghisler(Author) wrote:I see what you mean - it was meant as a correction/clarification of the fix for people who read it later. I may move the ß up to the 14.12.10 entry.
Not fixed yet.