Atomic Operation
Atomic operations are complex operations, which either fail altogether, or complete successfully, providing a strong guarantee that there is no intermediate, incorrect result.
godi is read-only when verifying data, so atomicity doesn't apply.
When sealing, it writes a single file per source, and it will only do so if it managed to read all source files successfully.
In sealed-copy mode, it will potentially write hundreds of thousands of files to multiple destinations. If one of these fails to write, it will remove all the files it has written so far underneath the affected destination, but keep writing to unaffected destinations. Have a look at this feature in action.
This feature implies that it has to remember all files written so far, and tests showed that it requires about 250MB of RAM for one million files. For thousands of files, the memory consumption will stay well below 10MB though.
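The per-destination rollback described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not godi's actual implementation: the class name `DestinationLog` and its methods are hypothetical, and each destination simply keeps an in-memory list of the paths it has written.

```python
import os


class DestinationLog:
    """Remembers every file written below one destination (hypothetical helper)."""

    def __init__(self, root):
        self.root = root
        self.written = []    # grows with every file, which is the memory cost
        self.failed = False

    def write(self, relative_path, data):
        if self.failed:
            return  # a rolled-back destination receives no further writes
        target = os.path.join(self.root, relative_path)
        # Record the path first so that a partially written file is removed too.
        self.written.append(target)
        try:
            os.makedirs(os.path.dirname(target), exist_ok=True)
            with open(target, "wb") as handle:
                handle.write(data)
        except OSError:
            self.rollback()

    def rollback(self):
        """Remove everything written so far and mark the destination as failed."""
        self.failed = True
        for path in reversed(self.written):
            try:
                os.remove(path)
            except OSError:
                pass  # best effort: the file may never have been created
        self.written.clear()
```

Because every destination owns its own log, a failure in one of them leaves the others untouched, which matches the behaviour described in the text.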
Input File-Filters
During seal and sealed-copy operations, godi traverses directories to find files for reading. Which files it picks depends on the input file filter, specified using the --file-exclude-patterns flag.
By default, it will exclude files which are known to change a lot, like .DS_Store on OS X, but you may also choose to ignore hidden files, symbolic links, godi seal files, as well as files matching a glob pattern.
This is an example of --file-exclude-patterns in action. Note the increasing number of skipped files when the filter is in use.
The verify operation will always verify all files mentioned in the seal; filters do not apply.
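To illustrate how such an input filter behaves during traversal, here is a small Python sketch. It assumes shell-style glob patterns matched against the file name only; the exact pattern syntax godi accepts is not specified here, and the function names and keyword options are hypothetical.

```python
import fnmatch
import os


def is_excluded(path, patterns, skip_hidden=False, skip_symlinks=False):
    """Return True if the file at `path` should be skipped (hypothetical helper)."""
    name = os.path.basename(path)
    if skip_hidden and name.startswith("."):
        return True
    if skip_symlinks and os.path.islink(path):
        return True
    return any(fnmatch.fnmatch(name, pattern) for pattern in patterns)


def find_input_files(root, patterns, **options):
    """Walk `root` and yield every file that passes the filter."""
    for directory, _subdirs, files in os.walk(root):
        for name in sorted(files):
            path = os.path.join(directory, name)
            if not is_excluded(path, patterns, **options):
                yield path
```

Calling `find_input_files(root, [".DS_Store", "*.tmp"])` would skip both the Finder metadata file and any file ending in `.tmp`, while everything else is passed on for reading.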
Seal Formats
A seal is a file that stores signatures of data files, each identifying the contents of the file. If a single bit within that data file changes, the signature will be a different one. In information technology, such a signature is called a hash. godi computes not one, but two of these, called MD5 and SHA1.
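Computing both hashes while reading a file only once can be sketched in Python with the standard `hashlib` module. This is an illustration of the idea, not godi's code; the function name `hash_file` and the chunk size are assumptions.

```python
import hashlib


def hash_file(path, chunk_size=1 << 20):
    """Return (md5_hex, sha1_hex) for a file, reading it only once."""
    md5 = hashlib.md5()
    sha1 = hashlib.sha1()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
            sha1.update(chunk)
    return md5.hexdigest(), sha1.hexdigest()
```

Flipping a single bit in the file changes both returned digests, which is what allows a seal to detect modified data.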
Currently there are two seal file formats which can be written and verified.
- gob
  - A compressed binary format which can be streamed when writing and verifying. This is highly relevant when huge directory trees are sealed or verified, both in terms of memory and disk-space consumption. The gob format takes up 40MB for 700k files, using up to 200MB of RAM in the process, whereas the same process in MHL format used 450MB of RAM and produced a seal file 190MB in size. Verifying the gob file starts right away, whereas it takes 16s until the mhl file verification begins.
  - godi's default format.
  - Tamper-proof thanks to its signature (read more further down).
  - Uses the gobz file extension.
- mhl
  - A human-readable XML based format as introduced by the media hash list (mhl) program.
  - Even though seal files created by the mhl tool won't have a signature, and therefore aren't tamper-proof, those created by godi will have one.
  - mhl can read mhl seal files created by godi, and vice versa.
  - godi will not embed information about the creator of the seal, as it believes that meta-data should be provided by the user of the program, and should be sealed like any other file.
  - Uses the mhl file extension.
All seal files generated by godi will carry a signature to assure that changes to any information stored in the file will be detected. This is helpful to detect silent corruption of the file as well as intentional adjustments.
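How godi derives its signature is not described in this section. Purely to illustrate how a signature over the seal contents makes changes detectable, the following Python sketch assumes a plain SHA1 digest over JSON-serialized entries. The JSON layout and all function names are hypothetical stand-ins and do not reflect the gob or mhl formats.

```python
import hashlib
import json


def sign(entries):
    """Return a digest covering every entry of a seal (assumed scheme)."""
    payload = json.dumps(entries, sort_keys=True).encode("utf-8")
    return hashlib.sha1(payload).hexdigest()


def write_seal(entries):
    """Serialize the entries together with their signature."""
    return json.dumps({"entries": entries, "signature": sign(entries)})


def verify_seal(text):
    """Return True if the stored signature still matches the entries."""
    seal = json.loads(text)
    return seal["signature"] == sign(seal["entries"])
```

Any change to a stored hash or file name after writing makes `verify_seal` return False, because the recomputed digest no longer matches the stored one.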
Performance Considerations
To follow this section, it helps to understand how godi processes data. Without getting into too much detail, data is first read from storage, then hashed, and possibly written in sealed-copy mode.
All data is handled in parallel: as soon as a chunk of data is read, it is fed into two hashers and into all outputs at the same time.
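The read-once, fan-out-to-many flow can be sketched in Python with a thread pool. This is a simplified illustration assuming one source file and a fixed chunk size; the function name `sealed_copy` and its parameters are hypothetical, and godi's real pipeline is not shown in this document.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor, wait


def sealed_copy(source, destinations, chunk_size=1 << 20):
    """Copy `source` to every destination and return (md5_hex, sha1_hex)."""
    md5, sha1 = hashlib.md5(), hashlib.sha1()
    outputs = [open(path, "wb") for path in destinations]
    try:
        with open(source, "rb") as reader, ThreadPoolExecutor() as pool:
            for chunk in iter(lambda: reader.read(chunk_size), b""):
                # Each chunk is read once, then handed to all consumers at once.
                tasks = [pool.submit(md5.update, chunk), pool.submit(sha1.update, chunk)]
                tasks += [pool.submit(out.write, chunk) for out in outputs]
                wait(tasks)  # keep chunks in order: finish this one before the next
                for task in tasks:
                    task.result()  # re-raise any error from a consumer
    finally:
        for out in outputs:
            out.close()
    return md5.hexdigest(), sha1.hexdigest()
```

Waiting for all consumers after every chunk also shows why the slowest participant sets the pace: the next read cannot be dispatched until the slowest writer or hasher has finished the current chunk.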
As the Hasher part can easily deliver 450MB/s per core, you can imagine that the bottleneck will occur during disk-based input or output operations. For example, reading from an SSD with a cold filesystem cache will rarely deliver more than 500MB/s, and writing to an SSD would not be much faster either.
Nonetheless, depending on the type of storage, you might benefit from multiple simultaneous reads, and/or multiple simultaneous writes, which may drastically increase the perceived performance. The number of simultaneous reads and writes is set up per device you are reading from or writing to.
It is vital to test for good values for --streams-per-input-device (-spid) and --streams-per-output-device (-spod) to get optimal performance for your respective hardware. By default, there may be as many hashers as you have cores, and this rarely needs a change unless godi is competing with other programs for the CPU.
Have a look at this video, showing how a fast input device will need 6 cores for processing all the data. It maximized the CPU usage, as the device could feed data even faster.
As a summary, there are a few rules to remember:
- seal and verify operations are as slow as the slowest input device.
  - Fast devices are effectively slowed down by slow devices.
- sealed-copy operations are as slow as the slowest input device or the slowest output device.
  - If a slow reader doesn't deliver, writers have to wait.
  - Fast readers have to wait for slow writers.
- seal, verify and sealed-copy operations are only as fast as your CPU allows.
  - If hashing is the bottleneck while cores sit idle, feed more cores by increasing the --streams-per-input-device value.
  - Each input stream feeds two cores to create an MD5 and a SHA1 hash in parallel. An additional input device stream will feed 2 additional cores, until either the input bandwidth is too low or all cores are maxed out.
As each input stream is fed by exactly one file, you need to have enough files to keep the streams busy. For example, if you have only one big file, only about 2 cores can work on it, no matter how many input streams are set up.
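The rule of thumb from the list and the paragraph above can be written down as a small calculation. This Python sketch is only a back-of-the-envelope estimate under the stated assumptions (one file per stream, two hashers per stream); the function name is hypothetical.

```python
def busy_hasher_cores(input_streams, file_count, total_cores, hashers_per_stream=2):
    """Estimate how many cores can hash at once (rule of thumb from the text)."""
    active_streams = min(input_streams, file_count)  # one file per stream
    return min(active_streams * hashers_per_stream, total_cores)


print(busy_hasher_cores(input_streams=4, file_count=1, total_cores=8))    # 2
print(busy_hasher_cores(input_streams=3, file_count=100, total_cores=8))  # 6
```

The first call shows the single-big-file case: extra streams stay idle, so only two cores are used. The second shows three streams keeping six cores busy as long as enough files are available.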
Error Handling
godi will handle every error it encounters and report it to the user in any case. On error, it will abort the entire operation only if no chance of successful completion remains.
If a file could not be read from a source, the entire operation will abort, as from this point on neither seal nor sealed-copy operations can complete successfully. However, a verify operation will continue in order to provide as much information to you as possible.
If a file could not be written during a sealed-copy operation, the respective destination will be marked as faulty and rolled back. Nonetheless, godi will continue to write to all remaining destinations to generate as many duplicates as possible.
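The error policy described in this section can be summarized as a small decision function. This Python sketch only restates the rules from the text; the function name and the string labels are hypothetical, and combinations the text does not cover raise an error instead of guessing.

```python
def next_action(operation, failure):
    """Map an I/O failure to the reaction described in the text.

    operation: "seal", "sealed-copy" or "verify"
    failure:   "read" or "write"
    """
    if failure == "read":
        # A verify run keeps going to report as much as possible.
        return "continue" if operation == "verify" else "abort"
    if failure == "write" and operation == "sealed-copy":
        # Only the affected destination is undone; the others keep receiving data.
        return "rollback-destination"
    raise ValueError(f"unexpected combination: {operation!r}, {failure!r}")
```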
Limitations
Windows
- Multi-device optimizations don't currently apply on Windows.
- When ctrl+C is pressed in git-bash to interrupt the program, godi will attempt to stop, but appears to be killed before it can finish cleanup. This seriously hampers atomic operation, so using the cmd.exe prompt is advised. Might be related to this issue in some way.
General
- Sealed copy ignores permission bits on directories, and will generally create them with 0777. It does, however, respect and maintain the mode of copied files.
- godi is very careful about memory consumption, yet atomicity comes at the cost of keeping a list of files already copied for undo purposes. That list grows over time, and consumed ~200MB for 765895 files. It might be worth providing a flag to turn undo off. This limitation is already tracked.
- Hidden directories are currently not excluded. However, a --folder-exclude-patterns flag could put directory filtering under the user's control. This feature is already planned for a future release.
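The permission behaviour from the first general limitation above (directories created with 0777, file modes preserved) can be illustrated with a short Python sketch. It assumes a POSIX system and is not godi's implementation; the function name `copy_with_mode` is hypothetical.

```python
import os
import shutil


def copy_with_mode(source, destination):
    """Copy one file, keeping its permission bits but not those of its directory."""
    # Directories are requested with 0777; the process umask may still narrow that.
    os.makedirs(os.path.dirname(destination), mode=0o777, exist_ok=True)
    shutil.copyfile(source, destination)   # copies the contents only
    shutil.copymode(source, destination)   # then carries over the file's mode bits
```

The source directory's own permission bits are never consulted here, which mirrors the limitation: only the mode of the copied file survives the copy.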