Multi-rename tool, to replace non-english characters.

JimmyTheBroker · Post by *JimmyTheBroker » 2018-02-08, 09:39 UTC

Hi guys,

I have a few files with non-English characters, e.g.:
キャッ】39みゅトフードーじっく！【ドーじっオリ

I have no idea what it means, but they're throughout some of the filenames.
Any idea how I can remove them? (Using the Multi-rename tool, dealing with a few hundred files.)

thanks,
Jimmy

Post by *ghisler(Author) » 2018-02-08, 10:41 UTC

Currently you can't - there are too many for simple search+replace. A content plugin would be nice.

Stefan2 · Post by *Stefan2 » 2018-02-08, 10:51 UTC

I don't know if that really works with "non-english characters",
but you can search and replace many different signs at once with MRT.

Press the F1-key while in the MRT and read:

Example: Replace Umlauts+Accents:
Search for: ä|ö|ü|é|è|ê|à
Replace with: ae|oe|ue|e|e|e|a

Maybe that works for you too?
Search for: キ|ャ|ッ|】|3|9|...
Replace with: _|_|_|_|_|...
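For readers who want to see the pairwise idea outside of TC, here is a minimal Python sketch of a "search list / replace list" substitution. It only imitates the behaviour described in the help example above; the function name `replace_pairs` is made up for illustration, and raising an error on lists of unequal length is an assumption of this sketch, not a statement about how the MRT handles that case.

```python
def replace_pairs(name: str, search: str, replace: str) -> str:
    """Apply 'a|b|c' -> 'x|y|z' style replacements pairwise, left to right."""
    targets = search.split("|")
    substitutes = replace.split("|")
    if len(targets) != len(substitutes):
        # Assumption of this sketch: mismatched lists are treated as an error.
        raise ValueError("search and replace lists must have the same length")
    for target, substitute in zip(targets, substitutes):
        if target:  # skip empty entries, str.replace("") would match everywhere
            name = name.replace(target, substitute)
    return name


print(replace_pairs("Käse für später.txt", "ä|ö|ü|é|è|ê|à", "ae|oe|ue|e|e|e|a"))
```

The same function works for the Katakana case, but the search list still has to name every character individually, which is the drawback mentioned in this thread.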

- - -

Oh, too late, Mr. Ghisler already answered.
But maybe you can just run that search&replace a few times with different chars....

Post by *ghisler(Author) » 2018-02-08, 11:02 UTC

The characters are Japanese phonetic characters called Katakana. There are half width and full width forms. Here are the charts from Unicode.org:
https://www.unicode.org/charts/PDF/U30A0.pdf
https://www.unicode.org/charts/PDF/UFF00.pdf

There are 96 different character codes for full width Katakana alone (including those with " and ° modifiers). You could do Search/Replace for these, but it would take a while to create. A small content plugin would be better.

And then there is the second phonetic alphabet, Hiragana, with about the same number of characters.

These could both be converted to Latin quite easily. But the ideograms (Chinese characters) also used in Japanese can't - they are read differently depending on where they are used.
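As a rough illustration of the kana-to-Latin idea, the Python sketch below reads the syllable out of each character's official Unicode name (for example, キ is named "KATAKANA LETTER KI"). This is a crude shortcut and not a proper romanization: the Unicode names use spellings such as "si", "tu" and "hu", small kana are treated like full-size ones, and marks such as the long-vowel sign ー are left untouched. The function name `kana_to_latin` is hypothetical.

```python
import unicodedata


def kana_to_latin(text: str) -> str:
    """Replace Katakana/Hiragana letters with the syllable from their Unicode name."""
    out = []
    for ch in text:
        name = unicodedata.name(ch, "")
        if name.startswith(("KATAKANA LETTER ", "HIRAGANA LETTER ")):
            # The last word of the name is the syllable, e.g. "KI" or "YA".
            out.append(name.split()[-1].lower())
        else:
            # Everything else (ASCII, ideograms, punctuation) is kept as-is.
            out.append(ch)
    return "".join(out)


print(kana_to_latin("キ.txt"))
```

Ideograms pass through unchanged, in line with the point above that they cannot be converted mechanically.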

gdpr deleted 6 · Post by *gdpr deleted 6 » 2018-02-08, 11:53 UTC

Keep in mind that if you want to reduce your file names to only the ASCII characters, you might end up with the same (ASCII) name for different files in the same directory.

While you could manually take care of such an issue in the MRT, this could become burdensome if "hundreds" of files are involved in a MRT operation.

For so many files, it probably is better to write a simple, tailor-made script (Powershell/.NET or any other preferred scripting language of your choice) that does the renaming while automatically handling possible file name collisions the way you want.

In such a script, each file name would be stripped of any character outside the ASCII range (>= 0x80, i.e., non-English characters) or, if you prefer, outside the "extended ASCII" range (>= 0x100). Before renaming a file, the script would check whether the new name collides with an existing file and handle that situation according to your requirements.
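A minimal sketch of such a script in Python (chosen here only for illustration; the post suggests PowerShell/.NET or any other language). It builds a rename plan without touching the disk. The names `strip_non_ascii` and `plan_renames`, the " (1)" suffix style, the "unnamed" fallback for names that end up empty, and the case-insensitive collision check are all assumptions of this sketch.

```python
import os


def strip_non_ascii(text: str) -> str:
    """Drop every character with a code point >= 0x80."""
    return "".join(ch for ch in text if ord(ch) < 0x80)


def plan_renames(names):
    """Map each non-ASCII file name to a unique ASCII-only name."""
    # Names that are already pure ASCII stay put and are reserved up front.
    taken = {n.lower() for n in names if strip_non_ascii(n) == n}
    plan = {}
    for original in names:
        if strip_non_ascii(original) == original:
            continue
        stem, ext = os.path.splitext(original)
        stem = strip_non_ascii(stem).strip() or "unnamed"  # assumed fallback
        ext = strip_non_ascii(ext)
        candidate = stem + ext
        counter = 1
        # Compare case-insensitively, as Windows file systems usually do.
        while candidate.lower() in taken:
            candidate = f"{stem} ({counter}){ext}"
            counter += 1
        taken.add(candidate.lower())
        plan[original] = candidate
    return plan


print(plan_renames(["キa.txt", "a.txt", "ャa.txt", "ドー.txt"]))
```

Pass in the complete directory listing (for example from `os.listdir`) so that existing ASCII names are taken into account, review the plan, and only then apply it with `os.rename`.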

Just because you have Total Commander running does not mean you _have_ to use Total Commander to do _any_ task. Choose the right tool for the job. You have more than one hammer in your toolkit.

JimmyTheBroker · Post by *JimmyTheBroker » 2018-02-08, 13:09 UTC

I appreciate all the help.

I converted all the symbols I do want to keep to non-symbols and then got rid of the other characters using \W, see below.
(I did it over a few steps, but I think if I wanted to do it in one, I could have used the following.)

Search for: -|\W|PPP
Replace with: PPP||-

thanks guys,
Jimmy

JimmyTheBroker · Post by *JimmyTheBroker » 2018-02-08, 13:14 UTC

elgonzo wrote:Keep in mind that if you want to reduce your file names to only the ASCII characters, you might end up with the same (ASCII) name for different files in the same directory.

[...]

Good advice. I like it a lot, but thank goodness I didn't need to go that deep.

I did run into the multiple files with the same name, but MRT renames them with (1),(2) and then I was able to throw some other meta-data information in to distinguish them.

gdpr deleted 6 · Post by *gdpr deleted 6 » 2018-02-08, 14:03 UTC

JimmyTheBroker wrote:I did run into the multiple files with the same name, but MRT renames them with (1),(2) [...]

Hehe, yeah, you are right. This is one of the new features in TC 9.xx that completely flew under my radar...

milo1012 · Post by *milo1012 » 2018-02-08, 19:35 UTC

Why so complicated?
Just use RegEx and let TC find all characters in a certain range.
Example: keep ASCII, basic Latin and most European characters, but remove all "upper" code points:
Search for:

[\x{0250}-\x{FFFF}]

Replace with: _
[x]RegEx

(It seems that TC's RegEx engine doesn't allow code points > U+FFFF, so everything from U+10000 up to U+10FFFD isn't searchable with such an expression.)
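For comparison, the same range idea can be sketched with Python's `re` module. This is not TC's RegEx engine, so the syntax differs slightly (Python string escapes `\u0250` and `\U0010ffff` instead of `\x{...}`). The names `UPPER_BMP`, `ABOVE_LATIN` and `replace_upper` are invented for this example.

```python
import re

# Same range as in the post: U+0250 up to U+FFFF.
UPPER_BMP = re.compile("[\u0250-\uffff]")
# Python also accepts code points above U+FFFF in a character class.
ABOVE_LATIN = re.compile("[\u0250-\U0010ffff]")


def replace_upper(name: str, pattern=UPPER_BMP) -> str:
    """Replace every character matched by the range with an underscore."""
    return pattern.sub("_", name)


print(replace_upper("キャッ】39.txt"))
print(replace_upper("Käse.txt"))
```

Characters below U+0250, such as ä, fall outside the range and are kept, which matches the "keep most European characters" intent of the expression.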

gdpr deleted 6 · Post by *gdpr deleted 6 » 2018-02-08, 19:37 UTC

milo1012 wrote:Why so complicated?
Just use RegEx and let TC find all characters in a certain range.
[...]

Okay, you convinced me that I need a vacation...

JimmyTheBroker · Post by *JimmyTheBroker » 2018-02-08, 22:10 UTC

[\x{0250}-\x{FFFF}]

Oh awesome, thanks mate!

But I got another question about hexadecimal now.

milo1012 wrote:(it seems that TC's RegEx engine doesn't allow codepoints > U+FFFF - so everything up to U+10FFFD isn't searchable with such an expression)

I understand that hexadecimal's largest number possible with four digits is FFFF (or 65535 in decimal).

I'm assuming U+10FFFD means FFFFFFFFFD (9 F's and a D)
which would be 1099511627773 in base10.
That's over 1 trillion possible characters, which must be totally wrong.
I've googled a bit but can't seem to find the answer. Sorry for the basic question.

gdpr deleted 6 · Post by *gdpr deleted 6 » 2018-02-08, 22:42 UTC

JimmyTheBroker wrote:I'm assuming U+10FFFD means FFFFFFFFFD (9 F's and a D)
which would be 1099511627773 in base10.

No

10FFFD is just hexadecimal 0x10FFFD, which is 1114109 decimal.
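The conversions discussed here are easy to check in any language with hexadecimal support; a short Python check, added purely as an illustration:

```python
import sys

print(int("FFFF", 16))        # largest four-digit hex number
print(int("10FFFD", 16))      # the value behind U+10FFFD
print(int("FFFFFFFFFD", 16))  # the mistaken ten-digit reading
print(hex(1114109))           # and back to hexadecimal
print(hex(sys.maxunicode))    # highest code point Python's str supports
```

So "10" in 10FFFD is simply two more hex digits in front of FFFD, not an abbreviation for a run of F's.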

JimmyTheBroker · Post by *JimmyTheBroker » 2018-02-08, 22:53 UTC

I think I get it.

So the "U+" and "0x" are the same thing, indicating that it's a hex number.

When is U+ used and when is 0x (for which languages)?

thanks

gdpr deleted 6 · Post by *gdpr deleted 6 » 2018-02-08, 23:08 UTC

No. U+ specifically denotes a Unicode code point. The actual code point value is then given as hexadecimal number directly following U+.

"0x" is just a prefix commonly used to denote a hexadecimal number. By the way, "0x" is not the only way to denote a hex number, but it is by far the most common. Other ways used here and there to express a hexadecimal number like 0x1234 would be for example: 1234h, &H1234, $1234 (there are plenty more, but 3 examples are enough) - all denoting the same hexadecimal number 1234 (= 4660 decimal).

The programming language/software you are using (or documentation standard/style guide you are following) will dictate which notation you will have to use...
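To make the distinction concrete, here is a small Python sketch: `0x` is Python's own notation for a hexadecimal literal, while "U+" is not Python syntax at all and has to be formatted or parsed by hand. The helper names `code_point_label` and `from_label` are hypothetical.

```python
def code_point_label(ch: str) -> str:
    """Format a single character in U+XXXX notation."""
    return f"U+{ord(ch):04X}"


def from_label(label: str) -> str:
    """Turn a 'U+XXXX' label back into the character it denotes."""
    if not label.startswith("U+"):
        raise ValueError("expected a label of the form U+XXXX")
    return chr(int(label[2:], 16))


print(0x1234)                  # hex literal, same value as decimal 4660
print(code_point_label("キ"))  # code point of the first character in the example name
print(from_label("U+30AD"))
```

Other languages spell the literal differently (for example `&H1234` or `$1234`, as listed above), but the U+ form stays the same wherever Unicode code points are documented.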

JimmyTheBroker · Post by *JimmyTheBroker » 2018-02-08, 23:15 UTC

thanks!