[WCX] ZPAQ

Discuss and announce Total Commander plugins, addons and other useful tools here, both their usage and their development.

Moderators: white, Hacker, petermad, Stefan2

Hacker
Moderator
Posts: 13064
Joined: 2003-02-06, 14:56 UTC
Location: Bratislava, Slovakia

Post by *Hacker »

milo1012,
Look, I'm neither the Zpaq author nor do I intend to defend the format at all cost.
Of course, not at all. I was just trying to explain where my interest in ZPAQ comes from and why I am testing it the way I am. I am not trying to put it in a bad light, sorry if it came across that way. Basically I use a simple "last 14 days" backup for some stuff, and so far WinRAR has been the only archiver that supports multiple file versions out of the box, without the need to rename the source files or put them into subdirs etc. So I was wondering how ZPAQ works in this scenario, and so far it seems it does not work all that well for me. Perhaps I am a bit disappointed. I apologize if that came across the wrong way.
But I understand how it works technically, plus I know and understood the source code.
That's why I am asking here and am grateful for the answers.
In the end it's the same, because when adding any new version of a file to the archive it is deduplicated anyway, and so it's a benchmark for how good the dedupe feature works in general.
I am probably still misunderstanding how deduplication works. The idea I have of deduplication is that it looks for blocks of identical data and then replaces them with a pointer. This is to me quite similar to the idea of building the usual dictionary and replacing increasingly larger blocks of data with pointers as you find them. The difference is perhaps that the blocks for deduplication are at some set boundaries instead of "where you happen to find them along the way"(?). However, if deduplication works across several different files and file versions, is this not similar to solid archiving? What's the culprit of it not working with the DB files? The fixed offsets where it's trying to match the blocks?
Since when is Rar a journaling archiver?
OK, I might misunderstand journalling as well. :D To me it means keeping just the file changes between versions, so if version 1 of a file contained "ABCD" and version 2 contained "ABCDEFGH" the "journal" would store only something like "+EFGH". Same goes for metadata.
as soon as you turn off solid compression or have multi-part archives that you don't want to recompress,
the different versions are stored all on their own again; also no timestamps are kept for when you added the versions, and so on...
True.

Your insight is much appreciated. Hope I did not derail the thread too much.

Roman
Suppose you press Ctrl+F and select the FTP connection (with the saved password), but instead of clicking Connect, you drop dead.
milo1012
Power Member
Posts: 1158
Joined: 2012-02-02, 19:23 UTC

Post by *milo1012 »

Yes, "journaling" is something of a buzzword.
In the Zpaq author's concept it means that an archive is append-only,
to store multiple dated versions of files for incremental backup.
For that, the archive format transparently logs (in a "journal") every file version it ever came across,
with detailed metadata (date of the update, which files were included, and which versions belong to a specific update).
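To illustrate the idea (a purely hypothetical Python sketch of the concept, not Zpaq's actual on-disk structures; all names here are made up): a journal is just an append-only list of dated updates, and any past state can be replayed from it.

Code: Select all

import datetime

# Append-only journal: each entry records the date of the update and
# which file versions were added with it (deletion markers omitted).
journal = []

def add_update(files: dict) -> None:
    """Append one dated update; nothing already stored is ever touched."""
    journal.append((datetime.date.today(), dict(files)))

def state_at(update_number: int) -> dict:
    """Replay the journal to reconstruct the file set as of update N (1-based)."""
    state = {}
    for _, files in journal[:update_number]:
        state.update(files)  # newer versions override older ones
    return state

add_update({"main.c": "v1", "util.h": "v1"})
add_update({"main.c": "v2"})
print(state_at(1))  # {'main.c': 'v1', 'util.h': 'v1'}
print(state_at(2))  # {'main.c': 'v2', 'util.h': 'v1'}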

Rar, on the other hand, doesn't log any versions.
So when you have multiple file versions ('filename;n'), you can't tell which file belongs to which "archive version" (the update number),
or whether the user created such labeled files on purpose.
So it's just a blob of file versions, w/o further info.

Zpaq is good when you have e.g. source code versions, where more than one file changes at a time.
With Rar you can't tell if a specific header file belongs to a specific module file (you can only guess by the timestamp).
So when you want to get a specific "state" of your files, Rar doesn't help you much, but Zpaq does.
In that respect it's quite similar to SVN / CVS.
(well, more like CVS: if a file didn't change, it isn't stored again, but in exchange you can go back in the version tree
to get the next older available version; TC's branch view helps quite a lot with that)


Hacker wrote:The idea I have of deduplication is that it looks for blocks of identical data and then replaces them with a pointer. This is to me quite similar to the idea of building the usual dictionary and replacing increasingly larger blocks of data with pointers as you find them.
Yes, that's the concept.
However, there is no dictionary and no pointers; instead it computes a SHA-1 hash for every individual fragment and stores them in a global table.
So input files are divided into fragments and their SHA-1 hashes are computed, but unlike with the usual compression dictionaries,
these fragments can't have an arbitrary size and position (they have "content-dependent boundaries").
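A minimal Python sketch of that concept (purely illustrative: the rolling hash, the mask and all names below are made up, and Zpaq's real fragmentation differs in detail):

Code: Select all

import hashlib

MASK = 0x0FFF  # on average one boundary every ~4 KiB (made-up value)

def fragments(data):
    """Split data at content-dependent boundaries using a toy rolling hash."""
    h, start = 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + byte) & 0xFFFFFFFF
        if (h & MASK) == MASK:  # boundary depends on content, not on position
            yield data[start:i + 1]
            start = i + 1
    if start < len(data):
        yield data[start:]

def deduplicate(files):
    """Store each unique fragment once, keyed by its SHA-1 hash."""
    table = {}     # global fragment table: SHA-1 -> fragment data
    manifest = {}  # per file: ordered list of fragment hashes
    for name, data in files.items():
        refs = []
        for frag in fragments(data):
            digest = hashlib.sha1(frag).hexdigest()
            table.setdefault(digest, frag)  # a known fragment is not stored again
            refs.append(digest)
        manifest[name] = refs
    return table, manifest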

Hacker wrote:However, if deduplication works across several different files and file versions, is this not similar to solid archiving?
In principle yes, but you can't do anything useful with the actual data; you just know "hey, that block is already stored", so you just mark it in the table and continue with the next data.
In solid archiving you can also use "partial" blocks, which a file may share with the next/previous ones, so that you only need to store the actual difference,
like when only a few bytes in a 64k block change: the dedupe concept needs to compress the whole new block again,
while solid compression would only compress that difference.
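A quick way to see the difference (a toy experiment with zlib standing in for a solid compressor's window, not for Rar's or Zpaq's actual codecs; the block size is shrunk to 16k so it fits zlib's 32 KiB window):

Code: Select all

import os
import zlib

block_a = os.urandom(16 * 1024)  # 16 KiB of incompressible data
block_b = bytearray(block_a)
block_b[100:104] = b"XXXX"       # change just a few bytes
block_b = bytes(block_b)

# "Solid": one stream, the second block becomes a cheap back-reference.
solid = len(zlib.compress(block_a + block_b))
# "Dedup": block_b's SHA-1 differs, so it must be compressed on its own.
separate = len(zlib.compress(block_a)) + len(zlib.compress(block_b))
print("solid:", solid, "bytes vs. separate:", separate, "bytes")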

So solid compression works better in theory, but in practice it is limited by the dictionary size (you can't have an unlimited size, as you would run out of RAM sooner or later).
Other disadvantages:
For decent solid compression you need a good order of the input files, like sorting by file extension or analyzing the content (for deduplication the order of input doesn't really matter).
Additionally, solid compression has trouble decompressing individual files quickly (just try it in TC with solid 7-Zip archives).


Hacker wrote:What's the culprit of it not working with the DB files? The fixed offsets where it's trying to match the blocks?
As I said, manipulating a sqlite file doesn't just append data or change existing data; most likely all internal data offsets change, so it's quite likely that Zpaq won't find
any block that matches an already stored one (binary identical), even with a 2k fragment size.
Zpaq would expect the fragments to start at the same position, but can't find identical ones, as all blocks have moved either forwards or backwards.
Just do a binary compare (TC's CBC tool) of two different file versions, and you'll see what I mean.
Solid compression works much better in such situations, hence what you saw with Rar.
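You can simulate that shift effect (a hypothetical toy with made-up sizes, not sqlite's real page layout): insert a few bytes near the front of a file and compare fixed 2k blocks before and after.

Code: Select all

import hashlib
import os

FRAG = 2048

def block_hashes(data):
    return {hashlib.sha1(data[i:i + FRAG]).hexdigest()
            for i in range(0, len(data), FRAG)}

old = os.urandom(256 * 1024)               # 128 blocks of 2 KiB
new = old[:100] + b"INSERTED" + old[100:]  # everything after byte 100 shifts

shared = block_hashes(old) & block_hashes(new)
# with overwhelming probability: 0 of 128 blocks still match
print(len(shared), "of", len(block_hashes(old)), "blocks still match")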

Like I said, when you only expect such file versions, it's better to use Rar,
but for a mixed file set you can probably expect such differences to be only a small fraction of all changes,
and the dedupe savings on the remaining files would make up for it.
TC plugins: PCREsearch and RegXtract
Hacker
Moderator
Posts: 13064
Joined: 2003-02-06, 14:56 UTC
Location: Bratislava, Slovakia

Post by *Hacker »

milo1012,
With Rar you can't tell if a specific header file belongs to a specific module file (you can only guess by the timestamp).
Can I not simply consider all files ending in the same ;xx to be of the same version? Unless there is a file whose name ends in ;xx on purpose in the latest update, I should be fine, shouldn't I?
In solid archiving you can also use "partial" blocks, which a file may share with the next/previous ones, so that you only need to store the actual difference,
like when only a few bytes in a 64k block change: the dedupe concept needs to compress the whole new block again,
while solid compression would only compress that difference.
Ah, yes, true, I didn't take into consideration that dictionaries are built from ever-increasing blocks of source data, each larger one containing the previous smaller one. On the other hand, could one not in theory also include blocks of size/2 in the table? And size/4 etc.?

Thanks for the answers! :)

Roman
Suppose you press Ctrl+F and select the FTP connection (with the saved password), but instead of clicking Connect, you drop dead.
milo1012
Power Member
Posts: 1158
Joined: 2012-02-02, 19:23 UTC

Post by *milo1012 »

Hacker wrote:Can I not simply consider all files ending in the same ;xx to be of the same version?
Sure, it'd work if you really put both files into the archive every time you update it.
But what if you don't, or accidentally miss one file, or forget about how you did it after not using the archive for a few months?

Your header file may update once a week, while your main module file gets an update twice a week,
and so you'd end up with different numbering schemes, no longer related to each other.
You'd have to compare file timestamps manually and "guess" which file belongs to which module.
(not to mention when you intentionally roll back a header file to an old version, where the timestamp is suddenly way older)

Of course, for Zpaq you will also have to make sure that all relevant files are put in the archive every time you update it.

Hacker wrote:On the other hand, could one not in theory also include blocks of size/2 in the table? And size/4 etc.?
In theory probably, but in practice the computational expense would be way higher.
You need a nearly fixed fragment size and steady forward progress, not stepping back over and over again to do another compare.
It's the same as using a smaller fragment size in the first place, which isn't of much use, as you saw with those DB files.
TC plugins: PCREsearch and RegXtract
Hacker
Moderator
Posts: 13064
Joined: 2003-02-06, 14:56 UTC
Location: Bratislava, Slovakia

Post by *Hacker »

milo1012,
But what if you don't, or accidentally miss one file, or forget about how you did it after not using the archive for a few months?
Your header file may update once a week, while your main module file gets an update twice a week, and so you'd end up with different numbering schemes, no longer related to each other.
True. Does not apply to my use case, though, so I am OK.
Of course, for Zpaq you will also have to make sure that all relevant files are put in the archive every time you update it.
What's the difference compared to WinRAR? I mean, when I pack c:\dir\*.*, they do the same, don't they?
In theory probably, but in practice the computational expense would be way higher.
You need a nearly fixed fragment size and steady forward progress, not stepping back over and over again to do another compare.
It's the same as using a smaller fragment size in the first place, which isn't of much use, as you saw with those DB files.
I see, thanks.

Roman
Suppose you press Ctrl+F and select the FTP connection (with the saved password), but instead of clicking Connect, you drop dead.
milo1012
Power Member
Posts: 1158
Joined: 2012-02-02, 19:23 UTC

Post by *milo1012 »

Hacker wrote:What's the difference compared to WinRAR? I mean, when I pack c:\dir\*.*, they do the same, don't they?
If everything in c:\dir\ is relevant to your backup, then yes, no difference.
I just wanted to express that you should take care how you update that archive (no matter whether Rar or Zpaq),
because when doing it manually every time, you could easily miss a file at some point.
So some automatic process (batch, AutoIt, etc.) might be advisable, for example:
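A one-line batch file run by the Windows Task Scheduler would already do (the archive name and paths here are just placeholders):

Code: Select all

rem backup.cmd - add today's state of c:\dir to the journaling archive
zpaq.exe a d:\backup\dir.zpaq c:\dir\*.*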
TC plugins: PCREsearch and RegXtract
Hacker
Moderator
Posts: 13064
Joined: 2003-02-06, 14:56 UTC
Location: Bratislava, Slovakia

Post by *Hacker »

milo1012,
I understand. Thanks a lot! ;)

Roman
Suppose you press Ctrl+F and select the FTP connection (with the saved password), but instead of clicking Connect, you drop dead.
milo1012
Power Member
Posts: 1158
Joined: 2012-02-02, 19:23 UTC

Post by *milo1012 »

New Version 1.1b!
  • fixed: creating/updating or opening an archive with path length > 259 didn't work
  • fixed: config dialog size for font > 8 was wrong
  • Chinese Simplified translation update (by 'wwj402')
Check the first post for the new file.
TC plugins: PCREsearch and RegXtract
milo1012
Power Member
Posts: 1158
Joined: 2012-02-02, 19:23 UTC

Post by *milo1012 »

Addendum to the previously mentioned compression of the Windows directory:

Win 7 x64 dir with ~1 year of updates: 24,368 MB
  • Zpaq x64 Level 1 - 5,509 MB ~15.5 (!) minutes (3 threads)
  • Zpaq x64 Level 2 - 4,626 MB ~27.3 minutes (3 threads)
  • Zpaq x64 Level 3 - 4,072 MB ~34.1 minutes (3 threads)
  • Zpaq x64 Level 4 - 3,737 MB ~83.1 minutes (3 threads)
  • Zpaq x64 Level 5 - 3,397 MB ~276.3 (!) minutes (3 threads)
  • 7-Zip x64 standard compression - 4,125 MB ~71.5 minutes (2 threads)
  • 7-Zip x64 maximum compression - 3,956 MB ~127.3 minutes (2 threads)
  • Rar 4.x standard compression solid - 5,778 MB ~47.4 minutes
  • Rar 4.x maximum compression solid - 5,605 MB ~50.3 minutes

    Non-solid for comparison:
  • 7-Zip x64 maximum compression non-solid - 7,849 MB
  • Rar 4.x standard compression non-solid - 8,345 MB
TC plugins: PCREsearch and RegXtract
Hacker
Moderator
Posts: 13064
Joined: 2003-02-06, 14:56 UTC
Location: Bratislava, Slovakia

Post by *Hacker »

milo1012,
Nice ;) I gave up after seeing it would take many hours to test all packers. Thanks.

Roman
Suppose you press Ctrl+F and select the FTP connection (with the saved password), but instead of clicking Connect, you drop dead.
krasusczak
Senior Member
Posts: 282
Joined: 2011-09-23, 10:35 UTC

Post by *krasusczak »

The behaviour of the plugin when I have "Show all archive versions" = off is problematic, I think.

This is my test scenario; I back up 5 different files into one zpaq file:
Round 1:
File 1
File 2
Round 2:
File 3
Round 3:
File 4
File 5
Round 4:
File 3 (update)
Round 5:
File 1 (update)
File 2 (update)
Round 6:
File 3 (update)
File 4 (update)

And with the option "off" I see only the last files 3 & 4. I understand that this option probably shows the last packed entry (with these 2 files), but unfortunately for many people this will be problematic (especially when you set this option off by default): they will not find the "lost" files 1, 2 and 5. It's good if they change this option, but most probably they will not.

Shouldn't this option list the newest files from all rounds and show them as one set?
milo1012
Power Member
Posts: 1158
Joined: 2012-02-02, 19:23 UTC

Post by *milo1012 »

krasusczak wrote:The behaviour of the plugin when I have "Show all archive versions" = off is problematic, I think.
I understand what you mean, but this is not due to the plug-in, but because of the Zpaq concept.
Reason: the whole concept is derived from backup purposes.
When you update the archive with the same (more or less) file set every time,
you will only get the newest "state" of the set in the default listing, with older files that are not present any more being marked as deleted.
So when I list such an archive with the standalone program:

Code: Select all

zpaq.exe l archive.zpaq
I get the very same listing.

The culprit is the missing -nodelete option:
zpaq manpage wrote:-nodelete

With add, do not mark files in the archive as deleted when the corresponding external file does not exist. This makes zpaq consistent with the behavior of most non-journaling archivers.
When you add files from a sub-dir to an archive

Code: Select all

zpaq.exe a archive.zpaq mydir\*.*
they are saved the usual way.
But when you now delete files in that dir and do another update,
those files will be marked as deleted too, and you won't see them,
no matter whether in the standalone program (w/o the -all option) or the plug-in (w/o the "Show all archive versions" option).
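With the -nodelete switch appended, those files would stay visible in the default listing (same placeholder names as above):

Code: Select all

zpaq.exe a archive.zpaq mydir\*.* -nodelete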

In the plug-in I NEVER check whether the external files actually still exist (i.e. -nodelete is disabled),
and so all files not included in the current update will be marked as deleted.

But it's a good point. I described it as:
"you can see only the "collected" files from all updates you made, with each file showing its newest version"
but that's in fact not true when you use the plug-in for updates.

I will add that -nodelete option to the plug-in, enabled by default(*), so that you can choose how you want to add updates.


* Or better disabled by default?
TC plugins: PCREsearch and RegXtract
batchman61
Junior Member
Posts: 43
Joined: 2003-02-07, 19:24 UTC
Location: Germany

Post by *batchman61 »

Hi,
I can't reproduce the findings of krasusczak's test scenario for adding and updating files regarding the plugin option "Show all archive versions".

What I see:
ON : version folders, containing folders and files saved with that version only (use branch view CTRL+B to see all files of all versions)
OFF : current version of all folders and files (can't see why this default could be an issue)

The standalone program lists all folders and files of a version, not only those saved with that version (zpaq list archive -until version).

Can anyone confirm, please?

As the test scenario doesn't delete any file, the behaviour shouldn't depend on the -nodelete option.
If the plugin implements neither a "file existence check" nor the -nodelete option so far, defaulting the new option to ON would be consistent with the current behaviour.

Thanks
krasusczak
Senior Member
Posts: 282
Joined: 2011-09-23, 10:35 UTC

Post by *krasusczak »

I think that enabled by default will be the better solution, but I need to see a live version :P

Thanks!


P.S. Does Zpaq support deleting files? I know that this is meant for backup etc., but I just started to wonder: if someday by mistake I copy in a completely wrong file, will I need to start everything from the beginning because I can't delete this wrong file from the archive?
milo1012
Power Member
Posts: 1158
Joined: 2012-02-02, 19:23 UTC

Post by *milo1012 »

batchman61 wrote:Hi,
I can't reproduce the findings of krasusczak's test scenario for adding and updating files regarding the plugin option "Show all archive versions".
...
OFF : current version of all folders and files (can't see why this default could be an issue)
The newest version of the files: yes,
but files not included in the last update: no.
I doubt that you see a different behavior, or do you?
(judging from the code, it just can't happen otherwise)

For the normal view ("Show all archive versions" disabled) it makes a difference whether you add files with -nodelete or not.

Just use the plug-in and add fileA to a new archive.
Now add fileB to the same archive.
Result: you will only see fileB.
With -nodelete (internally) enabled while adding fileB: you'd see both, fileA and fileB,
and this is exactly what krasusczak would have expected to see.

The standalone program works slightly differently, depending on whether you use a prefix path or not.
In the above example

Code: Select all

zpaq.exe a archive.zpaq mydir\*.*
would mark files as deleted too: if mydir contains only fileA for the initial archive update, and only fileB for the second update, you'd only see fileB after that.
But using it for standalone files

Code: Select all

zpaq.exe a archive.zpaq fileA
zpaq.exe a archive.zpaq fileB
will keep both files, and you would see both in the default view.
For the plug-in I can't make such a differentiation, and therefore all files not included in the current update will be marked as deleted.
batchman61 wrote:As the test scenario doesn't delete any file, the behaviour shouldn't depend on the -nodelete option.
It doesn't, of course. It will just list what is found in the last update.
And, depending on whether you added the last update with the -nodelete option,
it will either still list files not included in the last update (the "deleted" ones), or not.

To repeat the above statement from the manpage:
"With add, do not mark files in the archive as deleted when the corresponding external file does not exist. This makes zpaq consistent with the behavior of most non-journaling archivers."
batchman61 wrote:If the plugin implements neither a "file existence check" nor the -nodelete option so far, defaulting the new option to ON would be consistent with the current behaviour.
I think I will leave it disabled by default, so as not to break the current behavior.


krasusczak wrote:Does Zpaq support deleting files?...
Deleting individual files is not supported by the format at all.
But one thing that comes close to deletion is to "roll back" an archive to an older version, which will delete all updates after the update number you specify.
This only works with the zpaq standalone program:
zpaq homepage wrote:When updating, -until will truncate the archive at that point before appending. So if you backed up some files you didn't mean to, then you can truncate the last update and repeat:
zpaq add backup c:\ -not c:\tmp -until -1
TC plugins: PCREsearch and RegXtract