ChiFS

Logo pending™

File Hash

This document lists a few random thoughts and considerations in the selection of a suitable hash function for ChiFS. This hash function will be used for the identification of unique files and for (incremental) verification of downloaded file contents.

Proposed candidate

BLAKE2b in tree hashing mode with the following parameters:

Parameter Value
Digest byte length 32 (256 bit hash)
Key none
Salt default (all-NULL)
Personalization default (all-NULL)
Fanout 2 (binary tree)
Maximal depth Unlimited
Leaf maximal byte length 4096
Inner hash byte length 32

The chosen fanout, depth and leaf byte length are taken from the generic tree mode described in the BLAKE2 paper. A 256-bit output hash is chosen as an acceptable compromise between security, efficiency and convenience. The inner hash byte length is the same as the digest byte length - using larger values will increase the storage requirements for intermediate hash nodes, for little security gain.

(Root and intermediate) hashes will be encoded in the base58 variant used by Bitcoin addresses. Like in Bitcoin addresses, base58-encoded hashes should be prefixed with as many leading '1's as there are leading zero bytes in the binary hash. This hash is referred to as b2t in the rest of the protocol.

Existing solutions and considerations

What follows are a few notes that helped in the decision process for the chosen hash function.

Hash list vs. Merkle tree

The main advantages of a Merkle tree is interoperability: When using a Merkle tree with a predefined leaf block size, the root hash of a file will be the same regardless of the chosen block size of the intermediate hash nodes that are being kept around for verification.

Hash lists, as used by BitTorrent, do not have this property: The same file could hash to different Torrent hashes if the clients choose different chunk sizes.

Tiger Tree Hash (TTH)

TTH is combination of this construction of Merkle Hash Trees, and the tiger hash function. Freetiger is the fastest implementation I'm aware of.

I think TTH is currently still the most widely used hash tree construction to identify files, but it has its problems.

Security: The Tiger hash function is not used much outside of its application in TTH and P2P networks. Tiger does not have as much published cryptanalysis as some of its alternatives. Some weaknesses have been found over the years.

Performance: There are faster hash functions with similar or better security guarantees. The used tree hashing mode also suffers a small performance hit due to the 1-byte 00 prefix in data blocks, which prevents aligned mmap-based file hashing.

The ADC project has shown interest in switching to a different hash. They've been hoping for SHA-3 to standardize a hash tree construction, but that hasn't happened so far.

BLAKE2

The BLAKE2 paper describes a tree hashing mode. BLAKE2 itself is slowly gaining widespread use, but this tree hashing mode has not been standardized as part of RFC-7693. I'm worried that not all BLAKE2 implementations will support it (or at least, will not have tested it as thoroughly).

I did find one user of BLAKE2's tree hash mode, s3git, but they use unlimited fanout mode. This has the same interoperability concerns as a regular hash list and is therefore not a great candidate for ChiFS.

KangarooTwelve

KangarooTwelve is a later development of Keccak (SHA-3) and includes a tree hashing mode. I've not looked at it long enough to see if this is suitable for ChiFS. There don't seem to be many implementations.

What does IPFS use?

This is, I think, defined as part of UnixFS and is being revamped in a UnixFS-2, for which the specification doesn't seem to be complete yet. I don't really know how they identify files.

What about Dat?

The Dat project makes use of Merkle trees. Overall ideas are described in DEP-0002, but I'm missing a description of how file contents are mapped to leaf nodes. Judging from the whitepaper, it seems that file contents are split into chunks and stored into a log structure. A file is described by its position and length in the log file, rather than by a root hash that is unique to that file, so this is not something that we can use.

It is interesting to note that, although Dat uses the BLAKE2 hash function, they do not make use of its tree hashing feature, but instead use a slight variation on the 1-byte prefix approach used in TTH.

Representation

Multihash seems useful, but I'm not sure we need anything that fancy. If our file hashes aren't going to be compatible with existing systems anyway, I have a slight preference for plain base32 or base58.