Encoding of custom column language files and pluginst.inf - Unicode?

Post by Dalai » 2019-05-24, 20:20 UTC

Hi there.

When editing plugin custom column language files (e.g. adding new strings) and pluginst.inf (e.g. updating the version number), I have trouble finding the "correct" encoding. To avoid that hassle I thought it should be fine if these files were encoded in Unicode (UTF-16). So I searched through my TC "plugin collection" and found only three plugins that have the mentioned files in Unicode: IsDotNET, TrID_Identifier and Registry.

Also, I made some tests with my own plugins and found that even TC 6.50 is able to read the Unicode pluginst.inf and TC 7.50 seems to show the strings for custom columns correctly.

Questions:

1. Are there any drawbacks to expect when using UTF-16 for these files?
2. Assuming there aren't any drawbacks, is there any reason not to use Unicode?
3. What's the best or most reliable approach to convert these files to Unicode, without breaking anything? Simply opening them in Notepad++ and converting to UCS-2 LE doesn't seem to work properly for either of them.
4. Is there any way to "validate" the conversion/result?

Regards
Dalai

Post by ghisler(Author) » 2019-05-27, 13:33 UTC

There is indeed a downside: You cannot use translated strings when your system has a different codepage. For example, a Russian user with TC set to Russian but the system locale set to English would not get Cyrillic strings this way.

Why?
TC loads the strings with the ANSI registry function GetPrivateProfileString. This will convert from Unicode to ANSI with the _current_ code page. Then it would use the string to translate, and finally call MultiByteToWideChar using the encoding of the current language file (e.g. wcmd_rus.lng).

That's why the language files should use the same encoding as the lng file for that language.
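The conversion chain described above can be imitated outside TC. The following Python sketch is only an approximation under stated assumptions: Python's `cp1252` and `cp1251` codecs stand in for the system codepage and the lng file codepage, and `errors="replace"` stands in for the `?` substitution that happens when a character has no equivalent in the target codepage. The function names and the sample string are made up for illustration.

```python
def read_unicode_file_via_ansi(text: str, system_cp: str, lng_cp: str) -> str:
    """Imitate: UTF-16 file -> ANSI read (system codepage) -> decode with lng codepage."""
    # Step 1: the ANSI function converts the Unicode text to the *system* codepage.
    # Characters that do not exist in that codepage are replaced by '?'.
    ansi_bytes = text.encode(system_cp, errors="replace")
    # Step 2: the resulting bytes are interpreted with the lng file's codepage.
    return ansi_bytes.decode(lng_cp, errors="replace")


def read_ansi_file(raw: bytes, lng_cp: str) -> str:
    """Imitate: ANSI file -> bytes passed through unchanged -> decode with lng codepage."""
    return raw.decode(lng_cp, errors="replace")


russian = "Размер"  # hypothetical column title

# Unicode file, English system locale (cp1252), Russian lng file (cp1251):
lost = read_unicode_file_via_ansi(russian, "cp1252", "cp1251")

# ANSI file stored in cp1251, same system and lng file:
kept = read_ansi_file(russian.encode("cp1251"), "cp1251")

print(lost)  # only question marks are left
print(kept)  # the Cyrillic text survives
```

In the first case the Cyrillic letters are already gone after step 1, so no later conversion can bring them back; in the second case the bytes are never touched by the system codepage.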

The easiest way to do this is to open the Unicode language file and the target ini file with "Compare by content", and then set the encoding in the font settings (which only affects the ANSI file but not the Unicode file). Then use copy+paste to copy the translation into the ini file.

Post by MVV » 2019-05-27, 14:31 UTC

Why doesn't TC use GetPrivateProfileStringW function for reading INI strings? In case of Unicode INI it will read pure Unicode properly regardless of codepages (hey, what is a codepage, eh?).
Of course it will convert ANSI INI strings into Unicode using system codepage so re-converting may be required...

Perhaps you could detect whether the INI is in UTF-16 during plugin load and then use the proper algorithm... In the case of INI files, it is enough to look for a BOM at the beginning of the file and, if there is none, check whether the second byte of the file is zero (this will work in most cases because valid INI files may only start with a bracket or a semicolon).
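A minimal Python sketch of that heuristic, assuming only the first few bytes of the file are inspected. The function name is hypothetical, and this is not how TC itself detects anything; it only illustrates the idea.

```python
import codecs


def looks_like_utf16le_ini(head: bytes) -> bool:
    """Guess whether an INI file is UTF-16LE from its first bytes."""
    # Case 1: the file starts with the UTF-16LE byte order mark (FF FE).
    if head.startswith(codecs.BOM_UTF16_LE):
        return True
    # Case 2: no BOM. A valid INI file starts with '[' or ';', and in
    # UTF-16LE each of these ASCII characters is followed by a zero byte.
    return len(head) >= 2 and head[1] == 0


print(looks_like_utf16le_ini("[lang]".encode("utf-16-le")))  # UTF-16LE without BOM
print(looks_like_utf16le_ini(b"[lang]\r\n"))                 # ANSI file
```

The check is deliberately cheap: it would misjudge a file that starts with a non-ASCII character or with leading junk, which is why it only works "in most cases".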

BTW reading all INI files with GetPrivateProfileStringW would make many users happy because it wouldn't need to use UTF-8 encoded strings which don't allow searching.

Post by Dalai » 2019-05-27, 18:09 UTC

ghisler(Author) wrote: ↑2019-05-27, 13:33 UTC
There is indeed a downside: You cannot use translated strings when your system has a different codepage. For example, a Russian user with TC set to Russian but the system locale set to English would not get Cyrillic strings this way.

Any guess as to how many users this applies to? Yes, it's more of a rhetorical/theoretical question because neither of us probably knows.

That's why the language files should use the same encoding as the lng file for that language.

Yeah, well, that's the issue here. Example: pluginst.inf contains translations to several languages, Chinese among them. Notepad++ automatically detects GB2312 for this file when opening it. Assume I add some strings. Aren't these saved with this specific character set/encoding? Also, I send the file to the translators, and they add new translations or correct existing ones.

Rephrased: What happens to the strings already in the file when it's opened and saved with a different character set on a different system? Is it possible for them to get broken? My thought was to avoid that by using Unicode files.

The easiest way to do this is to open the Unicode language file and the target ini file with "Compare by content", and then set the encoding in the font settings (which only affects the ANSI file but not the Unicode file). Then use copy+paste to copy the translation into the ini file.

That doesn't work for all encodings. I have Cyrillic available, but not Chinese, so there's no way for me to copy'n'paste these translations. But I tried a different approach: Open the ANSI file in Notepad++, set the language to the proper codepage I want to copy the translations of (one at a time), and paste the contents into a new Unicode file. This seems to have worked quite well, although I can't be sure because I neither know enough about encoding nor the languages ...
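The per-language conversion described here can also be scripted, which removes the dependency on installed fonts and scripts. Below is a rough Python sketch, assuming you already know which codepage each line was written in; the sample lines, their codepages and the function names are made up for illustration. It also includes a simple round-trip check as a partial answer to the "validate" question.

```python
import codecs


def convert_line(raw: bytes, codepage: str) -> str:
    """Decode one ANSI line and verify that it re-encodes to the same bytes."""
    text = raw.decode(codepage)  # strict: raises UnicodeDecodeError on invalid bytes
    if text.encode(codepage) != raw:
        raise ValueError(f"round trip failed for codepage {codepage}")
    return text


def to_utf16le_with_bom(lines) -> bytes:
    """Join the decoded lines and encode them as UTF-16LE with a BOM."""
    return codecs.BOM_UTF16_LE + "\r\n".join(lines).encode("utf-16-le")


# Hypothetical input: raw bytes of each line plus the codepage it was saved in.
ansi_lines = [
    ("Size".encode("cp1252"), "cp1252"),
    ("Размер".encode("cp1251"), "cp1251"),
    ("大小".encode("gb2312"), "gb2312"),
]

unicode_lines = [convert_line(raw, cp) for raw, cp in ansi_lines]
result = to_utf16le_with_bom(unicode_lines)
print(result[:2])  # the BOM bytes
```

The round-trip check only catches bytes that are invalid in the assumed codepage. It cannot tell whether the assumed codepage was the right one, so someone who reads the language still has to look at the result.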

So, if I understand you right, your recommendation is to leave it as is?

Regards
Dalai

Post by MVV » 2019-05-28, 05:37 UTC

Yeah, well, that's the issue here. Example: pluginst.inf contains translations to several languages, Chinese among them. Notepad++ automatically detects GB2312 for this file when opening it. Assume I add some strings. Aren't these saved with this specific character set/encoding? Also, I send the file to the translators, and they add new translations or correct existing ones.

You should keep in mind that such multi-language files require you to switch the encoding multiple times when editing multiple languages; your editor will display only some of the lines properly, and it will convert characters into bytes according to the selected encoding.

Rephrased: What happens to the strings already in the file when it's opened and saved with a different character set on a different system? Is it possible for them to get broken? My thought was to avoid that by using Unicode files.

If you edit only some lines, the other lines remain unchanged, provided your editor doesn't damage them while saving because of characters that are invalid in the selected encoding.
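Whether untouched lines survive depends on whether their bytes are valid in the encoding the editor has selected. A small Python sketch of that effect, assuming an editor that decodes the whole file on open and re-encodes it on save, replacing anything it cannot map; real editors may behave differently, and the sample strings are made up.

```python
def open_and_save(raw: bytes, editor_encoding: str) -> bytes:
    """Imitate an editor that decodes on open and re-encodes on save."""
    text = raw.decode(editor_encoding, errors="replace")
    return text.encode(editor_encoding, errors="replace")


cyrillic_line = "Имя".encode("cp1251")   # three bytes, the last one is 0xFF
chinese_line = "大小".encode("gb2312")

# File opened with a single-byte Western codepage: every byte maps to some
# character, so the Cyrillic line looks wrong on screen but is saved unchanged.
same = open_and_save(cyrillic_line, "cp1252")

# File opened as GB2312: the trailing 0xFF is not a valid sequence there,
# so the editor replaces it and the Cyrillic line is damaged on save.
damaged = open_and_save(cyrillic_line, "gb2312")

print(same == cyrillic_line)
print(damaged == cyrillic_line)
print(open_and_save(chinese_line, "gb2312") == chinese_line)
```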

That doesn't work for all encodings. I have Cyrillic available, but not Chinese, so there's no way for me to copy'n'paste these translations.

We don't need to check for all encodings: INI files can only be ANSI (local Windows codepage) or Unicode (UTF-16LE with or without BOM). A zero second byte (when the BOM is missing) means that the file is in Unicode, so it can safely be read as Unicode using GetPrivateProfileStringW. Otherwise it should be read using GetPrivateProfileStringA and converted into Unicode using different codepage numbers for strings in different languages, as TC currently does.

This seems to have worked quite well, although I can't be sure because I neither know enough about encoding nor the languages ...

I usually keep codepage numbers of plugin translations in comments, so I don't need to remember which codepage should be used for a translation; I only need to look up a language's codepage once, when a new language is added. But happily I have no WDX LNG multi-codepage files yet, so I didn't even notice that TC still reads Unicode INI files using the ANSI function and may lose Unicode characters.

Post by Dalai » 2019-05-28, 19:52 UTC

MVV wrote: ↑2019-05-28, 05:37 UTC
You should keep in mind that such multi-language files require you to switch the encoding multiple times when editing multiple languages; your editor will display only some of the lines properly, and it will convert characters into bytes according to the selected encoding.

Yes, I'm aware of that. That's what I did when copying the translations from the ANSI file to the new Unicode one. If they were Unicode files, this switching would be eliminated, for the plugin authors as well as the translators.

If you edit only some lines, the other lines remain unchanged, provided your editor doesn't damage them while saving because of characters that are invalid in the selected encoding.

That's exactly my point. I don't know which editor a translator uses, and even if I knew, I wouldn't know whether or not it could break something in the files. I don't see such a problem with Unicode files (as long as nobody uses ancient editors like I do sometimes).

We don't need to check for all encodings [...]

I meant something else in regards to the conversion from ANSI to Unicode Ghisler described. He specifically mentioned switching fonts and encoding (which means "script" in the font settings I guess). If I use e.g. Western script I can't copy the Chinese strings properly, can I?

I usually keep codepage numbers of plugin translations in comments, so I don't need to remember which codepage should be used for a translation

Ah, that's a really good idea! Will add such comments to the ANSI files.

But happily I have no WDX LNG multi-codepage files yet

Neither do I. I'm referring to WFX custom columns, but I guess it doesn't make any difference.

---

So, looks like it boils down to these questions: Why does TC use the ANSI functions to read Unicode files? Is this because of some backwards compatibility with legacy systems (Win9x) or something? Are the reasons still valid today?

Regards
Dalai

Post by ghisler(Author) » 2019-05-30, 09:23 UTC

Why does TC use the ANSI functions to read Unicode files?

This has historical reasons: When I started writing TC, there was still Windows 3.0. It didn't support Unicode at all. Later came Windows 95 with partial Unicode support. I was using language files with the specific encoding because that's what the users had on their system. Nowadays I could detect whether the lng file is using UTF-16 and then use the *W functions to load the strings - I already do this with the wincmd.ini and other settings files. However, it would break compatibility with older TC versions.

So just use an editor which allows you to change encoding, like my "Compare by content" function. It will allow you to copy+paste with the desired encoding. Just don't touch the other languages while editing with a different encoding.

Post by Usher » 2019-05-30, 13:08 UTC

2ghisler(Author)
Maybe add a multilanguage Unicode pluginstW.inf and keep pluginst.inf with English text only?
64-bit plugins should be fully Unicode, and new 32-bit plugins probably work only for systems with decent Unicode support (Windows 98+ or Windows XP+).

Post by Dalai » 2019-05-31, 20:11 UTC

Usher wrote: ↑2019-05-30, 13:08 UTC
Maybe add a multilanguage Unicode pluginstW.inf and keep pluginst.inf with English text only?

Huh, interesting proposal. It's a good idea. Unfortunately, it doesn't change anything about the language files for custom columns...

Regards
Dalai

Post by Usher » 2019-05-31, 20:26 UTC

2Dalai
It can be done the same way…

Post by Dalai » 2019-06-01, 22:32 UTC

True. Silly me.

Regards
Dalai

Post by Usher » 2019-06-07, 12:29 UTC

Final note:

There are already 88 language packs for Windows 10. Some of them are for languages which have NO ANSI codepage defined.

The only way to support them is to use Unicode.

Post by ghisler(Author) » 2019-06-11, 14:48 UTC

They just need to use the same encoding as the lng file, and you can set Unicode UTF-8 as the encoding by setting codepage=65001.
Of course this will not work on Windows 9x/ME, but they don't support these languages anyway.

Post by Dalai » 2019-06-11, 16:22 UTC

ghisler(Author) wrote: ↑2019-06-11, 14:48 UTC
[...] and you can set Unicode UTF-8 as the encoding by setting codepage=65001.

Uh, how is this supposed to work in the files we are talking about, where multiple languages are in a single file? This is not about TC language files (wcmd_*.lng) where the codepage specification is possible and only one language (and codepage) is present within a file.

Regards
Dalai

Post by ghisler(Author) » 2019-06-13, 19:25 UTC

Actually it _is_ about lng files: The plugin translation needs to use the same encoding as the lng file for that language. So if the lng file uses codepage=65001, then that section in the plugin language file needs to use UTF-8 encoding too.
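The rule can be pictured as "decode each language's bytes with whatever codepage that language's lng file declares". The Python sketch below is an illustration only: the function names are hypothetical, the codepage number is passed in by hand rather than read from a real lng file, and Python codecs stand in for the Windows conversion.

```python
def codec_for_codepage(codepage: int) -> str:
    """Map a Windows codepage number to a Python codec name."""
    # 65001 is the Windows codepage number for UTF-8.
    if codepage == 65001:
        return "utf-8"
    # Assumption: other codepages have a Python codec named cp<number>
    # (true for e.g. 1250-1258 and 936, but not for every number).
    return f"cp{codepage}"


def decode_plugin_string(raw: bytes, lng_codepage: int) -> str:
    """Decode a plugin translation using the codepage of the matching lng file."""
    return raw.decode(codec_for_codepage(lng_codepage), errors="replace")


title = "Размер"  # hypothetical translated string

# lng file declares codepage=65001, plugin section stored as UTF-8: matches.
print(decode_plugin_string(title.encode("utf-8"), 65001))

# lng file declares 1251, plugin section stored as cp1251: matches.
print(decode_plugin_string(title.encode("cp1251"), 1251))

# Mismatch: UTF-8 bytes decoded with codepage 1251 give garbage.
print(decode_plugin_string(title.encode("utf-8"), 1251))
```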
