Find duplicate files by common code in name

CopyCats · Post by *CopyCats » 2014-02-16, 07:37 UTC

Consider these video files, which all have different names, sizes and byte content, but which all show a recording of the same song.

Code:

 Blondie-Eat to the Beat-[ABCDEFGH].wmv
 Blondie-Eat to the Beat-[ABCDEFGH]_1.wmv
 Blondie-Eat to the Beat-[ABCDEFGH].mp4
 Blondie-Eat to the Beat-[IJKLMNOP].wmv
 Blondie-Eat to the Beat Live-[QRSTUVWX].flv

Each filename contains a code in brackets that identifies the recording.
So, in the example above, the first three files are duplicates of each other (in the sense that they all show the exact same recording, even though the file format or the audio/video quality of the files may differ). The code is always 8 characters long and surrounded by square brackets. There may be more brackets elsewhere in the filename.

I don't want to have more than one instance of each recording. So, I'd like to locate all files that have a duplicate code in their name.

I've been playing with TC 8.5rc's new option to use plugins in the 'Find Duplicate Files' function (and to use placeholders in the filename), but I can't think of a way to have it identify duplicates by this code. But I feel that I'm overlooking something and that it should still be possible..

Any ideas (or workarounds) would be much appreciated!

Regards,
-Daan-

Post by *white » 2014-02-16, 09:45 UTC

You can use regexp_wdx plugin or Script Content Plugin.

See:
http://ghisler.ch/board/viewtopic.php?p=194119#194119
http://ghisler.ch/board/viewtopic.php?p=252907#252907

For example using regexp_wdx plugin, search for duplicates using plugin field [=regexp.Result] and put in regexp.ini:

Code:

[Regexp]
Rule=getID

[getID]
Find="\[(.*)\]"
Change="$1"
Substitute=1
Others=0
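
To see what the Find pattern captures, here is a minimal Python sketch. It is not part of the plugin setup; it only assumes that the plugin's regex engine treats ".*" greedily, the way Python's `re` module does. The function name and the second and third file names are made up for illustration.

```python
import re

# Same pattern as Find= in regexp.ini, tried with Python's re module.
# Assumption: the plugin's engine is greedy for ".*", like Python's.
PATTERN = re.compile(r"\[(.*)\]")

def bracket_content(name):
    """Return the text between the first '[' and the last ']', or None."""
    m = PATTERN.search(name)
    return m.group(1) if m else None

# One bracket pair: the capture is the ID.
print(bracket_content("Blondie-Eat to the Beat-[ABCDEFGH].wmv"))
# Two bracket pairs (hypothetical name): the greedy ".*" spans both pairs.
print(bracket_content("Blondie-Eat to the Beat [HiDef]-[ABCDEFGH].wmv"))
# No brackets at all (hypothetical name): nothing is captured.
print(bracket_content("plain.wmv"))
```

With a single pair of brackets the capture is just the ID; with more than one pair the capture runs from the first "[" to the last "]", so such names would need a stricter pattern.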

CopyCats · Post by *CopyCats » 2014-02-16, 17:33 UTC

Many thanks -- Yes, that should work just fine!

CopyCats · Post by *CopyCats » 2014-02-17, 00:56 UTC

And it did work brilliantly with the regexp_wdx plug-in and your definitions, except for files that have unicode characters in their names (there are no such chars in the names' ID parts).

Would you happen to know a solution for that too?

Post by *white » 2014-02-17, 18:03 UTC

CopyCats wrote: And it did work brilliantly with the regexp_wdx plug-in and your definitions, except for files that have unicode characters in their names (there are no such chars in the names' ID parts).

Would you happen to know a solution for that too?

The mentioned plugins do not support Unicode.

Because of this, TC sends the plugin the "short file name" (the 8.3 name). If the file name is "Filename 必[ID].wmv", TC sends, for example, "filena~1.wmv" to the plugin, so the plugin does not find any square brackets in the file name.

Solution 1: Find a similar plugin that does support Unicode.

Solution 2: There is a workaround if the ID does not contain Unicode characters. You can use the Script Content Plugin to convert the short name into a long name and then find the ID.

script.ini

Code:

[Script]
Section=getID

[getID]
Script=getID.vbs
LongName=1

getID.vbs

Code:

' convert shortname (c:\dir\file~1.txt) to long name
set file = CreateObject("Scripting.FileSystemObject").GetFile(filename)
longname = CreateObject("Shell.Application").Namespace(file.ParentFolder.Path).ParseName(file.Name).Path

' get ID from file name, ID between square brackets: [ID]
set re = new RegExp
re.Pattern = "^.*\[([^\\]*)\][^\\]*$|.*"
content = re.Replace(longname,"$1")
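
For anyone who wants to try the pattern outside Total Commander, here is a rough Python sketch of the same replace step. It assumes Python's `re` module behaves like the VBScript RegExp object for these constructs; the `C:\video` and `C:\[misc]` paths are made up. In the pattern, `[^\\]` means "any character except a backslash", so the brackets have to be in the last path component, and the `|.*` alternative swallows the whole string when no ID is found, so the result is empty.

```python
import re

# Pattern copied from getID.vbs.
PATTERN = re.compile(r"^.*\[([^\\]*)\][^\\]*$|.*")

def get_id(longname):
    # count=1 mimics RegExp.Replace with Global left at its default (False).
    # r"\1" plays the role of "$1"; an unmatched group becomes "" (Python 3.5+).
    return PATTERN.sub(r"\1", longname, count=1)

# ID in the file name: the whole path is replaced by the ID.
print(get_id(r"C:\video\Blondie-Eat to the Beat-[ABCDEFGH].wmv"))
# No brackets: the fallback ".*" matches and the result is empty.
print(repr(get_id(r"C:\video\no brackets here.wmv")))
# Brackets only in a folder name: ignored, result is empty.
print(repr(get_id(r"C:\[misc]\plain.wmv")))
```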

CopyCats · Post by *CopyCats » 2014-02-19, 03:21 UTC

Works great - many thanks once more, White!!

The only thing that I'd still like to improve is limiting the finds to codes of exactly 8 characters long. Some filenames contain strings like [HiDef] or [3D], which cause false positives. I have a rudimentary understanding of regular expressions, but " ^.*\[([^\\]*)\][^\\]*$|.* " goes way over my head; it looks like ASCII art to me. I half-heartedly tried replacing the section between the outermost brackets with "????????", but that didn't work. Is there a different string that would work?

Post by *white » 2014-02-19, 11:13 UTC

CopyCats wrote: The only thing that I'd still like to improve is limiting the finds to codes of exactly 8 characters long.

Simply replace the "*" inside the parentheses (between the parts that match the opening and closing square brackets) with "{8}".

Code:

re.Pattern = "^.*\[([^\\]{8})\][^\\]*$|.*"
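
A quick way to check the effect of "{8}" is the Python sketch below. It assumes Python's `re` module matches this pattern the same way the VBScript RegExp object does; the file names with extra tags are hypothetical.

```python
import re

# The {8} variant of the pattern.
PATTERN = re.compile(r"^.*\[([^\\]{8})\][^\\]*$|.*")

def get_id(longname):
    # count=1 mirrors a non-global RegExp.Replace; r"\1" stands in for "$1".
    return PATTERN.sub(r"\1", longname, count=1)

# The 5-character tag [HiDef] is skipped, the 8-character code is returned.
print(get_id(r"C:\video\Blondie-Eat to the Beat-[ABCDEFGH] [HiDef].wmv"))
# Only a short tag present: no ID, so the result is empty.
print(repr(get_id(r"C:\video\Some clip [3D].wmv")))
```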

CopyCats wrote: I have a rudimentary understanding of regular expressions, but " ^.*\[([^\\]*)\][^\\]*$|.* " goes way over my head

OK. Perhaps the following version is clearer.

getID.vbs

Code:

' convert shortname (c:\dir\file~1.txt) to long name
set file = CreateObject("Scripting.FileSystemObject").GetFile(filename)
longname = CreateObject("Shell.Application").Namespace(file.ParentFolder.Path).ParseName(file.Name).Path

set re = new RegExp

'remove path
re.Pattern = "^.*\\"
fname = re.Replace(longname,"")

' get ID from file name, ID between square brackets and 8 characters long
re.Pattern = "^.*\[(.{8})\].*"
if re.Test(fname) then ID = re.Replace(fname,"$1")

content = ID
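
To show what the duplicate search then does with this field, here is a small Python sketch of the same two steps (remove the path, then take the 8 characters between brackets), followed by grouping on the resulting ID. It is only an illustration, not something TC runs: the helper names and the `C:\video` folder are made up, and Python's `re` is assumed to match these patterns like the VBScript RegExp object.

```python
import re
from collections import defaultdict

def get_id(longname):
    """Two-step extraction mirroring the script: drop the path, then find [8 chars]."""
    fname = re.sub(r"^.*\\", "", longname, count=1)  # remove path
    m = re.match(r"^.*\[(.{8})\].*", fname)
    return m.group(1) if m else ""

def duplicates_by_id(paths):
    """Group paths by ID and keep only the IDs that occur more than once."""
    groups = defaultdict(list)
    for p in paths:
        file_id = get_id(p)
        if file_id:  # files without an ID are left out
            groups[file_id].append(p)
    return {k: v for k, v in groups.items() if len(v) > 1}

# File names from the first post; the folder is hypothetical.
paths = [
    r"C:\video\Blondie-Eat to the Beat-[ABCDEFGH].wmv",
    r"C:\video\Blondie-Eat to the Beat-[ABCDEFGH]_1.wmv",
    r"C:\video\Blondie-Eat to the Beat-[ABCDEFGH].mp4",
    r"C:\video\Blondie-Eat to the Beat-[IJKLMNOP].wmv",
    r"C:\video\Blondie-Eat to the Beat Live-[QRSTUVWX].flv",
]
for file_id, files in duplicates_by_id(paths).items():
    print(file_id, len(files))
```

Only the three ABCDEFGH files end up in a group; the other two IDs occur once and are not reported.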

Explanation of "^.*\\":

Code:

^   Match beginning of string. (Not necessary in this case, but faster.)
.   Match any character
*   Zero or more of previous character (or sub expression). Try to match as many characters as possible.
\\  Match backslash character. Because \ is a special character, you have to write two.

Explanation of "^.*\[(.{8})\].*":

Code:

^   Match beginning of string. (Not necessary in this case, but faster.)
.   Match any character
*   Zero or more of previous character.
\[  Match opening bracket. Because [ is a special character, you have to write \[.
(   Begin sub expression
.   Match any character
{8} Exactly 8 of previous character (or sub expression).
)   End sub expression
\]  Match closing bracket. Because ] is a special character, you have to write \].
.   Match any character
*   Zero or more of previous character.

Explanation of "$1":

Code:

$1  Get the part of the string matching the first sub expression.

CopyCats · Post by *CopyCats » 2014-02-25, 13:50 UTC

It works perfectly, and the explanation helps me too. Many thanks again!!