Directory comparison using file size as "matching key&q

edg
Junior Member
Posts: 7
Joined: 2014-02-06, 11:56 UTC

Post by *edg »

Hi,
sometimes I need to compare directories (by file contents, ignoring dates) in which the files were named by different people, so before I can compare anything I first have to rename, file by file, all the files in one pane.

It would be very useful if Synchronize Dirs... had an additional option to set the comparison "matching key" to the file size (in bytes, obviously), ignoring file name and date.

In the same way, it would also be useful to have the file date/time as the comparison "matching key".

Best regards

P.S. Excuse me in advance for my bad English.
meisl
Member
Posts: 171
Joined: 2013-12-17, 15:30 UTC

Post by *meisl »

Hi edg,
your English isn't so bad, there are plenty of *much* worse examples :)
But still (please bear with me), I don't really understand what you are trying to do.

First of all, what do you mean by "matching key&q"? (I kind of understand "matching key", but what's "q"?)

Then, I figure that you do NOT want to use the file *names* but rather their *sizes*, as a first step, for determining which file to compare with which (while ignoring dates completely, of course).
Is that right?
If that is the case, then there might be a *logical* problem...

But maybe it's best if you give an example, like so:
- what are the contents of left and right pane *before* your manual renaming (names and sizes of files)?
- what are the contents of left and right pane *afterwards*?
- which of the files then would you like to have marked as different, and why?
Last edited by meisl on 2014-02-07, 00:23 UTC, edited 2 times in total.
Hacker
Moderator
Posts: 13144
Joined: 2003-02-06, 14:56 UTC
Location: Bratislava, Slovakia

Post by *Hacker »

meisl,
meisl wrote:what do you mean by "matching key&q"
&q is a truncated &quot;, which is the HTML code for the quotation mark ". It got cut off because the title exceeded the allowed length. So "matching key&q should actually have been "matching key".

Roman
Suppose you press Ctrl+F, select the FTP connection (with a saved password), but instead of clicking Connect you drop dead.
edg
Junior Member
Posts: 7
Joined: 2014-02-06, 11:56 UTC

Post by *edg »

Hi meisl,
first of all thanks for your reply.

Hacker has already answered you; by 'matching key' I mean the rule that will be used to compare the files.

A good example could be this one:

you and a colleague each copy a server folder onto your notebooks (or onto a USB drive) to work on it over the weekend.
When you both return to the office, you have two differently modified sets of the same original directory: some files were modified in set A and others (or some of the same ones) in set B. In addition, many files were renamed, because you or your colleague felt that some names did not describe the file contents well, but those files are otherwise left in their original form (completely unmodified except for the file name).
Now, when you merge and consolidate your work and your colleague's into the original server folder, how can you easily compare the same files that now have different names?

The same scenario occurs when you rename some photos and then, maybe on another machine, modify a different set of the same photos, for example to remove red eye.
In this case, too, it would be very useful to compare the different sets of photos by file size (because they now have different file names) and delete the identical photos from one of the two sets, leaving only the unmatched ones, so that you can manually merge a very limited number of files instead of the complete set.

Thanks for the support and have a nice day.
edg
meisl
Member
Posts: 171
Joined: 2013-12-17, 15:30 UTC

Post by *meisl »

Oh, so this is about duplicates I think.
[EDIT: plz don't be put off by this lengthy post, the most important is the following paragraph]

Precisely: you want the normal result of "Synchronize dirs" except any files that have a corresponding file "on the other side" with exactly the same contents - but happen to have a different name and/or date.
Right :?:

----

Note that "same size" does NOT imply that the contents are really the same (the contrary, "different size", does indeed imply different content, though).
(That's what I meant by "logical problem": consider two files "on the left" and, say, four "on the right", and all of the same size - so which should be expected to equal [by content] which? And which should be marked?)
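
To make the ambiguity concrete, here is a small Python sketch. It is not anything TC does internally; the file names and the helper `candidate_pairs` are made up for illustration. It pairs the entries of two panes purely by size and shows that two left files and four right files of equal size give eight candidate pairs, with nothing in the sizes alone to say which pairing is the right one.

```python
from collections import defaultdict

def candidate_pairs(left, right):
    """Pair every left entry with every right entry of the same size.

    left and right are lists of (name, size_in_bytes) tuples.
    """
    right_by_size = defaultdict(list)
    for name, size in right:
        right_by_size[size].append(name)
    return [(lname, rname)
            for lname, size in left
            for rname in right_by_size.get(size, [])]

# Hypothetical panes: all six files happen to be 1024 bytes long.
left = [("a.txt", 1024), ("b.txt", 1024)]
right = [("w.txt", 1024), ("x.txt", 1024), ("y.txt", 1024), ("z.txt", 1024)]
pairs = candidate_pairs(left, right)
print(len(pairs))  # 8 candidate pairs
```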

Also note that you need some strategy for merging the real duplicates, too. Meaning: if the name and/or date was changed on either (or both) sides, but not the contents, you still need to decide which of the names and/or dates you will use as the "final" one.
This can amount to a choice out of up to three:
1) the original name/date,
2) your new name/date,
3) your colleague's name/date.
But that's a different question.

EDIT: "same content", just like "same size", does NOT guarantee that at most one file on the left is associated with at most one on the right. The only thing ("criterion") that guarantees this is "name", no way around it.
But: having (and relying on) such a criterion is essential for "Synchronize dirs". Think about it.
Ask yourself:
a) which to compare with which, in the first place ("any to any" - not really...!)?
b) then, what to display as result? (there are many ways in which "any" can be regarded as different from "any other"...)
This, in fact, is why I'm suggesting to subtract duplicates from the result list, rather than introducing "same size" or even "same content" as the criterion for the initial association step (question a) above).
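
A rough Python sketch of this "subtract duplicates" idea, under some loud assumptions: each side is given as a mapping from file name to content (in-memory bytes stand in for real files), equal SHA-256 digests are treated as equal content, and `subtract_duplicates` is a hypothetical helper, not a TC function. Files are still associated by name first; afterwards, any unmatched file whose content also exists on the other side is dropped from the result.

```python
import hashlib

def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def subtract_duplicates(left, right):
    """Return (left_only, right_only) names, minus real duplicates.

    left and right map file name -> content bytes. A name that exists
    on one side only is reported unless the same content is present
    somewhere on the other side under a different name.
    """
    left_digests = {digest(d) for d in left.values()}
    right_digests = {digest(d) for d in right.values()}
    left_only = sorted(n for n, d in left.items()
                       if n not in right and digest(d) not in right_digests)
    right_only = sorted(n for n, d in right.items()
                        if n not in left and digest(d) not in left_digests)
    return left_only, right_only

# Invented example: one renamed but unchanged file, plus one unique file per side.
left = {"report.doc": b"same bytes", "notes.txt": b"only on the left"}
right = {"report_final.doc": b"same bytes", "todo.txt": b"only on the right"}
result = subtract_duplicates(left, right)
print(result)  # (['notes.txt'], ['todo.txt'])
```

The renamed pair never has to be "associated" with each other; it simply disappears from the list of differences.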
Last edited by meisl on 2014-02-07, 01:01 UTC, edited 1 time in total.
meisl
Member
Posts: 171
Joined: 2013-12-17, 15:30 UTC

Post by *meisl »

@Hacker: Thanks, didn't occur to me at all :)
edg
Junior Member
Posts: 7
Joined: 2014-02-06, 11:56 UTC

Post by *edg »

meisl wrote:Oh, so this is about duplicates I think.
[EDIT: plz don't be put off by this lengthy post, the most important is the following paragraph]

Precisely: you want the normal result of "Synchronize dirs" except any files that have a corresponding file "on the other side" with exactly the same contents - but happen to have a different name and/or date.
Right :?:

Hi meisl, and excuse me for the length of my previous post; I'm new on this board.

The result that I want is:

scenario - left pane with some files - right pane with a copy of the same files (in another folder, obviously), but some of them modified in content (so with a different date/time and file size) and some with only a different name (but exactly the same binary content).

Now, pressing Alt+C, Y calls the Synchronize Directories function.
There are some boolean (on/off) checkboxes there (Asymmetric, Subdirs, by content, Ignore date) and, under Show, duplicates and singles.
My dream would be to have two more checkboxes, 'by size' and 'by date', so that the files in the two panes are paired by these rules and the equal ones (once compared internally by content, to be absolutely sure) can be selected and deleted from the left or right pane.

Thanks for your clarifying explanation of what I had previously tried to say.
edg
meisl
Member
Posts: 171
Joined: 2013-12-17, 15:30 UTC

Post by *meisl »

I have to apologize, edg; I'm afraid I'm a bit lost.
You say
The result that I want is:
scenario - left pane with some files - right pane with a copy of the same files (in another folder, obviously), but some of them modified in content (so with a different date/time and file size) and some with only a different name (but exactly the same binary content).
- but in fact you do NOT really describe what that result should be. You know, the crucial points that define "expected result" would be definite answers to questions a) and b) from my previous post (under 2nd "EDIT").
Rather, what you've described so far is only the situation you start from. What's missing is what exactly you expect/wish to have marked as different there.

In the following you're repeating that you would like to have "by size" and "by date" added just like "by content". However, these would not make sense as such, I'm afraid. :!:
I tried hard to explain why, and in different ways.

So, how to proceed?
Maybe you could describe the result you'd expect, when that imagined checkbox "by size" (and, say only that) would be checked, and you click "Compare".
So: assuming left and right pane as you described - which should then be marked as "different", and on what grounds :?:
edg
Junior Member
Posts: 7
Joined: 2014-02-06, 11:56 UTC

Post by *edg »

meisl:
Note that "same size" does NOT imply that the contents are really the same (the contrary, "different size", does indeed imply different content, though).
(That's what I meant by "logical problem": consider two files "on the left" and, say, four "on the right", and all of the same size - so which should be expected to equal [by content] which? And which should be marked?)

Answer:
It's really very rare for two files in a directory to have the same size in bytes (not the approximate size in KB), just as it's very rare for two files to have the same date+time (provided, of course, that they are on the same file system type: FAT stores local time, while NTFS stores UTC time with a finer resolution).
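
As an illustration of the file system caveat, a minimal Python sketch of a tolerant timestamp comparison. It assumes both times have already been converted to seconds since the epoch (so the local-time versus UTC difference is out of scope here) and uses a 2-second default tolerance because FAT records modification times in 2-second steps. The function `same_mtime` and the sample values are invented.

```python
def same_mtime(a: float, b: float, tolerance: float = 2.0) -> bool:
    """Treat two modification times (seconds since the epoch) as equal
    if they differ by no more than `tolerance` seconds."""
    return abs(a - b) <= tolerance

ntfs_mtime = 1391688961.7312  # hypothetical NTFS timestamp with sub-second precision
fat_mtime = 1391688962.0      # the same moment as a FAT copy might record it
print(same_mtime(ntfs_mtime, fat_mtime))       # True
print(same_mtime(ntfs_mtime, fat_mtime, 0.0))  # False
```

With a tolerance, "same date+time" is no longer an exact key, so more unrelated files can collide on it.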

meisl:
Also note that you need some strategy for merging the real duplicates, too.

Answer:
My goal is to delete the real duplicates, first paired by their file size in bytes (or, at other times, by file date if the file system is the same) and then compared by content.

meisl:
Meaning: if the name and/or date was changed on either (or both) sides, but not the contents, you still need to decide which of the names and/or dates you will use as the "final" one.

Answer:
In this case I can select the files of interest among those compared by content and then choose to delete the ones in the right or the left pane.

meisl:
This can amount to a choice out of up to three:
1) the original name/date,
2) your new name/date,
3) your colleague's name/date.
But that's a different question.

Answer:
The most important thing is to not have, once the two folders have been merged, the same file twice, so it matters less whether the name I decide to keep is mine or the other one.

meisl:
EDIT: "same content", just like "same size" does NOT guarantee that you will have at most one on the left associated to at most one right. The only thing ("criterion") that guarantees this is "name", no way around it.

Answer:
But I don't need to associate all the files, only the ones that have the same byte size (or NTFS date+time). If you do a bubble sort using byte size as the key rather than file name, you can then try to pair the files in the two panes, and the ones that cannot be paired get an empty slot in the opposite pane.
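
A sketch of what such a size-keyed pairing could look like, in Python. It uses Python's built-in sort rather than a bubble sort, the function `rows_by_size` and the file names are invented, and this is only one possible reading of the proposal, not an existing TC feature.

```python
from collections import defaultdict
from itertools import zip_longest

def rows_by_size(left, right):
    """Build two-pane rows keyed on size instead of name.

    left and right are lists of (name, size) tuples. Returns a list of
    (left_name, size, right_name) rows; None marks an empty slot.
    """
    l_groups, r_groups = defaultdict(list), defaultdict(list)
    for name, size in left:
        l_groups[size].append(name)
    for name, size in right:
        r_groups[size].append(name)
    rows = []
    for size in sorted(set(l_groups) | set(r_groups)):
        # Within one size the pairing simply follows listing order.
        for lname, rname in zip_longest(l_groups.get(size, []), r_groups.get(size, [])):
            rows.append((lname, size, rname))
    return rows

left = [("IMG_001.jpg", 2048), ("holiday.jpg", 4096)]
right = [("beach.jpg", 2048), ("new.jpg", 512)]
for row in rows_by_size(left, right):
    print(row)
```

Note that when several files share a size, the pairing inside that group is arbitrary here, which is the situation the "logical problem" objection is about.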

meisl:
But: having (and relying on) such a criterion is essential for "Synchronize dirs". Think about it.
Ask yourself:
a) which to compare with which, in the first place ("any to any" - not really...!)?

Answer:
Absolutely not; I want to compare only files that have exactly the same byte size (or date+time).

meisl:
b) then, what to display as result? (there are many ways in which "any" can be regarded as different from "any other"...)
This, in fact, is why I'm suggesting to subtract duplicates from the result list, rather than introducing "same size" or even "same content" as the criterion for the initial association step (question a) above).

Answer:
I need the association step to display files with the same byte size (or date+time), so that they can be compared by content and finally deleted (from the right or left pane) if they have exactly the same binary content.
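
The requested behaviour (associate by size, then confirm by content) might be sketched in Python as follows. This assumes two flat folders containing only files; `confirmed_duplicates` is a hypothetical helper, and the demo files are created in temporary folders so nothing real is touched.

```python
import filecmp
import os
import tempfile

def confirmed_duplicates(left_dir, right_dir):
    """Return (left_name, right_name) pairs with equal size AND equal bytes."""
    result = []
    for lname in sorted(os.listdir(left_dir)):
        lpath = os.path.join(left_dir, lname)
        for rname in sorted(os.listdir(right_dir)):
            rpath = os.path.join(right_dir, rname)
            if os.path.getsize(lpath) != os.path.getsize(rpath):
                continue  # different size: content cannot match, skip the read
            if filecmp.cmp(lpath, rpath, shallow=False):
                result.append((lname, rname))
    return result

# Throwaway demo folders; all names and contents are invented.
with tempfile.TemporaryDirectory() as left, tempfile.TemporaryDirectory() as right:
    for folder, name, data in [(left, "a.txt", b"hello"), (left, "b.txt", b"world"),
                               (right, "renamed.txt", b"hello"), (right, "edited.txt", b"w0rld")]:
        with open(os.path.join(folder, name), "wb") as f:
            f.write(data)
    pairs = confirmed_duplicates(left, right)
    print(pairs)  # [('a.txt', 'renamed.txt')]
```

All four files are 5 bytes long, so size alone would pair everything with everything; only the byte-by-byte check singles out the renamed copy.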

As I said at the start, my English is not so good, and I want to apologize for the way I probably misuse some words.
meisl, many thanks for the valuable time you have spent on my questions.
meisl
Member
Posts: 171
Joined: 2013-12-17, 15:30 UTC

Post by *meisl »

Hey edg, that's really cool. Thanks for taking the time to go through all of my questions. It really tells me that you're serious about it.
So it's a matter of course for me to try to give substantial answers. Let me just ask for a little patience atm, as I won't have time for the next two days.

Cheers :)
edg
Junior Member
Posts: 7
Joined: 2014-02-06, 11:56 UTC

Post by *edg »

Hi meisl,
it's absolutely not a question of time, take all the time you need.
I like TC so much that the only thing I want is to make it as complete as possible.

Thanks for your support.
Bye
meisl
Member
Posts: 171
Joined: 2013-12-17, 15:30 UTC

Post by *meisl »

Hi edg, I think I got it now.
edg wrote:The most important thing is to not have, once the two folders have been merged, the same file twice [...]
I had kind of understood this before, and said "then subtract duplicates from the result list".
I should have added, and emphasized, that you do this with "Find Files", NOT with "Synchronize Dirs".
What you're trying to do is already there in TotalCommander, it just doesn't fit well in the "Synchronize Dirs" dialog.

Like so:

a) do your normal merging with "Synchronize Dirs", ignoring (real) duplicates for the moment. It is important to understand that "duplicates" in this dialog has a special meaning - different from what I (and you) called "real duplicates". In this dialog you should have both buttons, "duplicates" and "singles" depressed (=active).

b) after having synchronized - to, say, someFolder\ - close "Synchronize Dirs", navigate to someFolder\ and press Alt+F7 or choose Commands | Search...

c) Search for * on the "General" tab and on the "Advanced" tab check "Find Duplicates", uncheck "same name" and check "same size" and "same content"

d) click "Start Search", then "Feed to listbox". You will see a list of groups of files.

e) From each of these groups you want to delete all but one. Either you do it by hand or, with the new 8.50 version of TotalCommander, you'd use "Mark" | "Select Group", which lets you mark all but the first in each group, for example.
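
For readers who want to see the logic behind steps c) to e), here is a Python sketch of a "same size, same content" duplicate search that then picks all but the first file of each group. It is only an approximation of the idea, not how TC implements it: it assumes equal SHA-256 digests mean equal content, reads each file fully into memory, and the names `duplicate_groups` and `all_but_first` are made up. It only lists candidates and deletes nothing.

```python
import hashlib
import os
import tempfile
from collections import defaultdict

def duplicate_groups(root):
    """Group files under root that share both size and SHA-256 digest."""
    by_size = defaultdict(list)
    for folder, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(folder, name)
            by_size[os.path.getsize(path)].append(path)
    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        by_digest = defaultdict(list)
        for path in paths:
            with open(path, "rb") as f:
                by_digest[hashlib.sha256(f.read()).hexdigest()].append(path)
        groups.extend(sorted(g) for g in by_digest.values() if len(g) > 1)
    return sorted(groups)

def all_but_first(groups):
    """Pick the files to delete: everything except the first of each group."""
    return [path for group in groups for path in group[1:]]

# Throwaway demo folder with invented files.
with tempfile.TemporaryDirectory() as root:
    for name, data in [("mine.jpg", b"PHOTO"), ("yours.jpg", b"PHOTO"), ("other.jpg", b"OTHER")]:
        with open(os.path.join(root, name), "wb") as f:
            f.write(data)
    to_delete = [os.path.basename(p) for p in all_but_first(duplicate_groups(root))]
    print(to_delete)  # ['yours.jpg']
```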

Please try with test files/folders first, and keep in mind that the "Find Files" dialog does NOT remember your settings on the "Advanced" tab.

One last note: you may want to take a look at Version Control Systems, depending on how often you need to merge and from how many sources. They can be complicated, so if you've never worked with one you might be better off using TC, as described above.
edg
Junior Member
Posts: 7
Joined: 2014-02-06, 11:56 UTC

Post by *edg »

Hi meisl,
your answer is very accurate and the goal can be reached, but:

- merging the files that way, you can no longer group the different ones by source (author, remote FTP, etc.); in the end you only have everything mixed together in the same folder (it is often better to create a subfolder per author, date or some other peculiarity of a file group).

- merging with this method is very stressful for the disk (both 'by content' sources are read from the same physical HDD instead of being placed on two different physical media) and reduces performance enormously, increasing tremendously the time taken by the process of content comparison, because the disk heads continuously move from one source to the other, multiplying the seek time.

What I requested is a far better way to achieve this goal and also a less difficult and more intuitive method.

By the way many thanks for the support and for your answers.
edg
meisl
Member
Posts: 171
Joined: 2013-12-17, 15:30 UTC

Post by *meisl »

Hi edg,
I had omitted a few things with which you can refine the method.

- you can search in sub-folders, too. Simply select "all (unlimited depth)" for "Search in subdirectories:" on the "General" tab. So there's no need to put all files into one folder.

- additionally, you can search in more than one folder simultaneously. Simply append ";" (semicolon) and another path to "Search in:" on "General" tab. Example: "C:\foo\bar;E:\projects\baz"
edg wrote:merging with this method is very stressful for the disk (both 'by content' [...]) [...] increasing tremendously the time taken by the process of content comparison [...]
- you can do the duplicates elimination (steps b) - e) above) before or after the merging (step a)). If you want to do it before, you'd search in the two or more source folders simultaneously (see above). This way you can have sources on different physical media.

- basically, there's no way around real content comparison (if you want a correct result). However, TC is smart enough to do it only if the sizes are equal. You can even do comparison by size only, if you really want to. Simply check "same size" and uncheck "same content" on the "Advanced" tab. But again: I strongly discourage you from doing that!
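
A tiny Python sketch of why size-only matching is risky. The paths reuse the example folders from above, the contents are invented in-memory bytes, and `groups` is a hypothetical helper that only mimics the two modes, not TC's actual search.

```python
from collections import defaultdict

def groups(files, by_content=True):
    """Group files by size, and by content as well if by_content is set.

    files is a list of (path, content_bytes) tuples. Only groups with
    more than one member are returned.
    """
    buckets = defaultdict(list)
    for path, data in files:
        key = (len(data), data) if by_content else len(data)
        buckets[key].append(path)
    return sorted(g for g in buckets.values() if len(g) > 1)

files = [
    (r"C:\foo\bar\a.txt", b"hello"),
    (r"E:\projects\baz\b.txt", b"hello"),
    (r"E:\projects\baz\c.txt", b"w0rld"),  # same size, different content
]
print(groups(files, by_content=True))   # a.txt and b.txt only
print(groups(files, by_content=False))  # all three, c.txt is a false match
```

With size as the only criterion, c.txt would be offered for deletion as a "duplicate" although its content is unique.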
edg wrote:What I requested is [...] less difficult and more intuitive method.
Well, I disagree, for the reasons given in my first posts. The key point being: "duplicates" in the Sync Dirs dialog means something different than in "Find Files".
edg wrote:By the way many thanks for the support and for your answers.
You're welcome :)