Merkle Trees (or hash trees) are data structures which represent multiple individual pieces of data in a way which offers incredibly efficient proofs that a single piece of data exists in the structure without revealing any of the other pieces of data, and only requiring the verifier of the proof to know a fixed-size fingerprint of the tree (which is the length of the output of whatever hash function is being used to create the tree).
If the term ‘Merkle Tree’ brings out suppressed memories of drawing pyramids of meticulously-labeled connected boxes on Post-it notes, you’re probably familiar with a diagram like this:
To quickly cover terminology, a ‘leaf’ is one of the bottom boxes, which contain some representation of source data, and a ‘node’ is any other box. Leaves are also technically nodes—but who cares.
Each node contains some representation of its children, which is generated by some function that can take the children, perform some math, and spit out a repeatable, irreversible output for each pair of children. As it turns out, hashing functions do exactly this, with some added bonuses of fixed-length output and (hopefully) enough collision resistance to make even self-driving cars jealous.
Say we have some hash function defined as H(message), where H takes one argument (often called a ‘message’), and spits out one output. The ‘+’ sign is the concatenation operator, however almost any method of combining children would also be valid.
It’s important to note that the top box in the above diagram wouldn’t calculate H(H(AB + CD) + H(EF + GH)), instead it would calculate H(child0 + child1), as H(AB + CD) has been stored as child0 and H(EF + GH) has been stored as child1 of the layer below. For any traditional Merkle Tree with n elements, generation of the full tree requires n – 1 hashes (ignoring the original hashes of the data before the actual tree is constructed, which is another n hashes). The workd required to generate a Merkle Tree scales linearly with the number of leaves.
Depending on storage/computation tradeoffs, the entire tree can be stored for later use, or only the bottom elements can be stored. In some instances, saving and transmitting only the very top hash (the Merkle Root) is required.
Verification involves tracing a path back up the tree—if your original leaf when followed up the tree produces the same Merkle Root, the leaf is identical to the original leaf used in building the tree. However, if the leaf has been altered, the Merkle Root won’t match the original.
Layers are generally numbered starting from zero:
Now that we’ve defined a Merkle Tree, we can actually use one! It’s June, and Jane wishes to prove that she has 8 files (numbered 0 through 7) at this time, without revealing the contents of the files. She builds a Merkle Tree, putting each file as a leaf on layer 0. She computes 15 hashes (2(8) – 1), and generates her final Merkle Root, which sits at (4, 0). She then tells this Merkle Root to the world (or anyone that wants to listen in her crypto IRC channel). Maybe these documents are market predictions that she won’t reveal until they expire. Maybe she’s organizing a lottery of some sort. Perhaps she is publishing an identity she can use to sign future communications (more on that in the next post, Lamport Signatures can’t be squeezed into two paragraphs).
Three months pass, and she wants to prove that file number 4 was part of her original Merkle Tree. She then reveals (1, 5), (2, 3), and (3, 0), and anyone who cares to check can follow her file back up the tree and verify that it was the original file—if the Merkle Root of the new tree match the original Merkle Root, the file must have been the original file (unless the hashing function produces a collision. Use a good hashing function!).
Anyone interested in verifying her file doesn’t need to (and can’t!) recalculate the entire tree. However, they can calculate the ‘authentication path’ she provides. Due to the one-way nature of good hashing functions, Jane can’t create a fake authentication path.
As someone who wishes to verify Jane’s claim that file 4 was a file that existed at or before the time she published the original Merkle Root, the verifier follows file 4 back up the tree:
Hash file 4
Concatenate result of 1.) with element (1, 5)
Hash result of 2.)
Concatenate result of 3.) with element (2, 3)
Hash result of 4.)
Concatenate result of 5.) with element (3, 0)
Hash result of 6.)
If the result of 7.) matches the original Merkle Root Jane published in June, file 4 must have existed when she published the original Merkle Root! Otherwise, she either screwed up or tried to lie and thought nobody would actually validate her proof.
Great, how are they useful? Aside from an interesting optimization exercise, Merkle Trees have found their way into databases, file systems, cryptocurrencies, and more. They provide a quick way to represent (or fingerprint/summarize) large sets of data in a way where portions of the data can be selectively proven as members. Generally, a hash function is applied to a file, and can be applied at a later point in time. If the digest (or output) doesn’t change, the file hasn’t changed. In a similar fashion, Merkle Trees can make sure a collection of files haven’t changed—and can be used to discover which files, if any, have been modified.
Similar behavior can be created from a variety of other cryptography methods (for example, RSA accumulators), which have their own tradeoffs. In cryptocurrencies like Bitcoin, the leaves are actually the transactions contained within the Bitcoin block, and the Merkle Root is used in the mining process (to ensure that, when given a block, nobody inserted or modified a transaction).
Additionally, Merkle Trees serve as the backbone for the Merkle Signature Scheme, which I’ll cover extensively on their own page. Dragging you through an explanation of Lamport Signatures and Merkle Tree Signatures in this post is probably prohibited somewhere in the Geneva Conventions.