Multi-rename tool, to replace non-English characters.
Moderators: white, Hacker, petermad, Stefan2
- JimmyTheBroker
- Member
- Posts: 179
- Joined: 2017-06-07, 05:22 UTC
Hi guys,
I have a few files with non-English characters, e.g.:
キャッ】39みゅトフードーじっく!【ドーじっオリ
I have no idea what it means, but they're throughout some of the filenames.
Any idea how I can remove them? (I'm using the Multi-Rename Tool and dealing with a few hundred files.)
Thanks,
Jimmy
I finally get notifications from emails again!!!
So happy!
- ghisler(Author)
- Site Admin
- Posts: 48083
- Joined: 2003-02-04, 09:46 UTC
- Location: Switzerland
- Contact:
Currently you can't - there are too many for simple search+replace. A content plugin would be nice.
Author of Total Commander
https://www.ghisler.com
I don't know if that really works with "non-English characters", but you can search and replace many different characters at once with the MRT.
Press the F1 key while in the MRT and read:
Example: Replace Umlauts+Accents:
Search for: ä|ö|ü|é|è|ê|à
Replace with: ae|oe|ue|e|e|e|a
Maybe that works for you too?
Search for: キ|ャ|ッ|】|3|9|...
Replace with: _|_|_|_|_|...
- - -
Oh, too late, Mr. Ghisler already answered.
But maybe you can just run that search & replace a few times with different characters...
- ghisler(Author)
- Site Admin
- Posts: 48083
- Joined: 2003-02-04, 09:46 UTC
- Location: Switzerland
- Contact:
The characters are Japanese phonetic characters called Katakana. There are half width and full width forms. Here are the charts from Unicode.org:
https://www.unicode.org/charts/PDF/U30A0.pdf
https://www.unicode.org/charts/PDF/UFF00.pdf
There are 96 different character codes for full width Katakana alone (including those with " and ° modifiers). You could do Search/Replace for these, but it would take a while to create. A small content plugin would be better.
And then there is the second phonetic alphabet, Hiragana, with about the same number of characters.
These could both be converted to Latin quite easily. But the ideograms (Chinese characters) also used in Japanese can't - they are read differently depending on where they are used.
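To see which of the scripts mentioned above a file name actually contains, you can look at the official Unicode character names. The following is only a minimal Python sketch, not something Total Commander provides; the helper name `script_of` and its labels are made up for illustration.

```python
import unicodedata

def script_of(ch: str) -> str:
    """Rough label derived from the official Unicode character name."""
    name = unicodedata.name(ch, "")
    for label in ("KATAKANA", "HIRAGANA", "CJK"):
        if label in name:
            return label
    return "ASCII" if ord(ch) < 0x80 else "OTHER"

def count_scripts(text: str) -> dict:
    """Count how many characters of each rough category occur in text."""
    counts = {}
    for ch in text:
        label = script_of(ch)
        counts[label] = counts.get(label, 0) + 1
    return counts

# The sample name from the first post
print(count_scripts("キャッ】39みゅトフードーじっく!【ドーじっオリ"))
```

Characters labelled KATAKANA or HIRAGANA are the ones that could be transliterated; anything labelled CJK is an ideogram whose reading depends on context.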
-
- Power Member
- Posts: 872
- Joined: 2013-09-04, 14:07 UTC
Keep in mind that if you want to reduce your file names to ASCII characters only, you might end up with the same (ASCII) name for different files in the same directory.
While you could manually take care of such an issue in the MRT, this could become burdensome if "hundreds" of files are involved in an MRT operation.
For so many files, it is probably better to write a simple, tailor-made script (PowerShell/.NET or any other scripting language you prefer) that does the renaming while automatically handling possible file name collisions the way you want.
In such a script, each file name would be stripped of any character outside the ASCII range (code points >= 0x80, i.e., non-English characters) or, if you prefer, outside the "extended ASCII" range (>= 0x100). Before renaming a file, the script would check whether the new name collides with an existing file and handle the situation according to your requirements.
Just because you have Total Commander running does not mean you _have_ to use Total Commander to do _any_ task. Choose the right tool for the job. You have more than one hammer in your toolkit.
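A script along the lines described above could look roughly like this. It is a sketch in Python rather than PowerShell, and the function names, the " (1)" suffix scheme and the "unnamed" fallback are assumptions chosen for illustration; adapt the collision handling to your own requirements.

```python
from pathlib import Path

def ascii_only(name: str, placeholder: str = "") -> str:
    """Drop (or replace) every character outside the ASCII range 0x00-0x7F."""
    return "".join(ch if ord(ch) < 0x80 else placeholder for ch in name)

def rename_to_ascii(folder: Path, dry_run: bool = True) -> list:
    """Return (old, new) name pairs; only touch the disk when dry_run is False."""
    entries = sorted(folder.iterdir())
    # Lower-cased names already in use (Windows file names are case-insensitive)
    taken = {p.name.lower() for p in entries}
    plan = []
    for path in entries:
        if not path.is_file():
            continue
        stem = ascii_only(path.stem).strip() or "unnamed"
        suffix = ascii_only(path.suffix)
        if suffix == ".":  # extension consisted only of non-ASCII characters
            suffix = ""
        candidate, n = stem + suffix, 1
        if candidate == path.name:
            continue  # nothing to change
        # Collision handling: append " (1)", " (2)", ... until the name is free
        while candidate.lower() in taken:
            candidate = f"{stem} ({n}){suffix}"
            n += 1
        taken.add(candidate.lower())
        plan.append((path.name, candidate))
        if not dry_run:
            path.rename(path.with_name(candidate))
    return plan

if __name__ == "__main__":
    # Dry run on the current directory: prints the plan, renames nothing
    for old, new in rename_to_ascii(Path("."), dry_run=True):
        print(f"{old!r} -> {new!r}")
```

Running it with `dry_run=True` first lets you review the planned names before anything is renamed.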
- JimmyTheBroker
- Member
- Posts: 179
- Joined: 2017-06-07, 05:22 UTC
I appreciate all the help.
I converted all the symbols I do want to keep to non-symbols and then got rid of the other characters using \W, see below.
(I did it over a few steps, but I think I could have done it in one with the following.)
Search for: -|\W|PPP
Replace with: PPP||-
thanks guys,
Jimmy
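For anyone who wants to try the same protect, strip, restore idea outside of TC, here is a rough Python equivalent. It assumes the search/replace pairs are applied one after another from left to right and that \W only treats A-Z, a-z, 0-9 and "_" as word characters; in Python that needs the `re.ASCII` flag, because otherwise Japanese letters count as word characters too. The function name is made up, and the placeholder PPP must not already occur in any file name.

```python
import re

# (search, replace) pairs applied left to right, mirroring
# "Search for: -|\W|PPP" and "Replace with: PPP||-"
STEPS = [("-", "PPP"), (r"\W", ""), ("PPP", "-")]

def protect_and_strip(stem: str) -> str:
    """Keep '-' by parking it as PPP, delete all other non-word characters."""
    for pattern, replacement in STEPS:
        # re.ASCII: \W matches everything except A-Z, a-z, 0-9 and "_"
        stem = re.sub(pattern, replacement, stem, flags=re.ASCII)
    return stem

print(protect_and_strip("キャッ-39みゅ track-01"))  # -39track-01
```

Note that \W also removes spaces and dots, so apply it to the name part only and not to the extension.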
- JimmyTheBroker
- Member
- Posts: 179
- Joined: 2017-06-07, 05:22 UTC
elgonzo wrote:
Keep in mind that if you want to reduce your file names to ASCII characters only, you might end up with the same (ASCII) name for different files in the same directory.
While you could manually take care of such an issue in the MRT, this could become burdensome if "hundreds" of files are involved in an MRT operation.
For so many files, it is probably better to write a simple, tailor-made script (PowerShell/.NET or any other scripting language you prefer) that does the renaming while automatically handling possible file name collisions the way you want.
In such a script, each file name would be stripped of any character outside the ASCII range (code points >= 0x80, i.e., non-English characters) or, if you prefer, outside the "extended ASCII" range (>= 0x100). Before renaming a file, the script would check whether the new name collides with an existing file and handle the situation according to your requirements.
Just because you have Total Commander running does not mean you _have_ to use Total Commander to do _any_ task. Choose the right tool for the job. You have more than one hammer in your toolkit.
Good advice. I like it a lot, but thank goodness I didn't need to go that deep.
I did run into multiple files with the same name, but the MRT renames them with (1), (2), and then I was able to throw in some other metadata to distinguish them.
-
- Power Member
- Posts: 872
- Joined: 2013-09-04, 14:07 UTC
Why so complicated?
Just use RegEx and let TC find all characters in a certain range.
Example: keep ASCII, basic Latin and most European characters, but remove all "upper" code points:
Search for: [\x{0250}-\x{FFFF}]
Replace with: _
[x] RegEx
(It seems that TC's RegEx engine doesn't allow code points > U+FFFF, so everything from there up to U+10FFFD isn't searchable with such an expression.)
Last edited by milo1012 on 2018-02-08, 19:37 UTC, edited 1 time in total.
TC plugins: PCREsearch and RegXtract
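If you want to check what that range catches before running it on real files, the same character class can be tried in Python. This is only a sketch: Python writes the escapes as \u0250 instead of \x{0250}, the names `UPPER_BMP` and `replace_upper` are made up, and its regex engine is not the one TC uses, so edge cases may differ.

```python
import re

# Same range as in the post: U+0250 up to U+FFFF
UPPER_BMP = re.compile("[\u0250-\uFFFF]")

def replace_upper(name: str, replacement: str = "_") -> str:
    """Replace every character in U+0250..U+FFFF with the replacement."""
    return UPPER_BMP.sub(replacement, name)

print(replace_upper("キャッ39 Müller.txt"))  # ___39 Müller.txt
```

The ü (U+00FC) lies below U+0250 and is kept, while each Katakana character becomes an underscore.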
-
- Power Member
- Posts: 872
- Joined: 2013-09-04, 14:07 UTC
milo1012 wrote:
Why so complicated?
Just use RegEx and let TC find all characters in a certain range.
Example: keep ASCII and the basic latin and most european characters, but remove all "upper" code points:
Search for: [\x{0250}-\x{FFFF}]
Replace with: _
[x]RegEx
Okay, you convinced me that I need a vacation...
- JimmyTheBroker
- Member
- Posts: 179
- Joined: 2017-06-07, 05:22 UTC
[\x{0250}-\x{FFFF}]
But I have another question, about hexadecimal this time.
milo1012 wrote:
(it seems that TC's RegEx engine doesn't allow codepoints > U+FFFF - so everything up to U+10FFFD isn't searchable with such an expression)
I understand that the largest hexadecimal number with four digits is FFFF (or 65535 in decimal).
I'm assuming U+10FFFD means FFFFFFFFFD (9 F's and a D), which would be 1099511627773 in base 10.
That's over 1 trillion possible characters, which must be totally wrong.
I've googled a bit but can't seem to find the answer. Sorry for the basic question.
-
- Power Member
- Posts: 872
- Joined: 2013-09-04, 14:07 UTC
- JimmyTheBroker
- Member
- Posts: 179
- Joined: 2017-06-07, 05:22 UTC
-
- Power Member
- Posts: 872
- Joined: 2013-09-04, 14:07 UTC
No. U+ specifically denotes a Unicode code point. The actual code point value is then given as a hexadecimal number directly following U+.
"0x" is just a prefix commonly used to denote a hexadecimal number. By the way, "0x" is not the only way to denote a hex number, but it is by far the most common. Other notations used here and there for a hexadecimal number like 0x1234 would be, for example, 1234h, &H1234 and $1234 (there are plenty more, but 3 examples are enough) - all denoting the same hexadecimal number 1234 (= 4660 decimal).
The programming language/software you are using (or the documentation standard/style guide you are following) will dictate which notation you have to use...
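The numbers are easy to check yourself. Applied to the earlier question, U+10FFFD is simply the six-digit hexadecimal number 10FFFD, not a run of F's. A small Python sketch (the helper name `parse_code_point` is made up for illustration):

```python
def parse_code_point(label: str) -> int:
    """Turn a label such as 'U+10FFFD' into its numeric value."""
    return int(label.removeprefix("U+"), 16)  # needs Python 3.9+

print(parse_code_point("U+FFFF"))        # 65535
print(parse_code_point("U+10FFFD"))      # 1114109
print(0x1234)                            # 4660
print(parse_code_point("U+10FFFF") + 1)  # 1114112
```

So the Unicode code space (U+0000 to U+10FFFF) has room for 1114112 code points, a little over a million and nowhere near a trillion.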
- JimmyTheBroker
- Member
- Posts: 179
- Joined: 2017-06-07, 05:22 UTC