How does deduplication work?
Tarsnap deduplicates your data by:
- Splitting it into chunks ("chunkification").
- Only uploading chunks that haven't been uploaded before.
Each archive is expressed as a list of chunks, which grants us a great deal of flexibility and efficiency.
Simplified example: wordification
To illustrate how chunkification and chunk-based archives function, let's look at a simpler example: Suppose that we only want to store text files. Instead of splitting our data into a fixed number of characters, we'll split them into words ("wordification").
In particular, we will store a few of revisions of the file
pets.txt
, wherein we try to list our favourite type of
pets.
First version: archive A
Since this is the Internet, we start off with pets.txt
containing:
cute small kittens
How can we use "wordification" to help us archive this very important file?
The first archive wasn't very exciting. The server knows that archive A consists of the words 1, 2, and 3 (those words being "cute", "small", and "kittens"). It would be easy to restore our valuable file if anything happened to it, but there wasn't any deduplicated data yet.
Second version: archive B
After a bit of thought, we change our pets.txt
to:
fuzzy small bunnies
and create a second archive.
This archive is more interesting. The computer knows that it already uploaded the word "small", so it can reuse that from the old word list. As a result, archive B is expressed as words 4, 2, and 5.
Later versions: archives C and D, and deleting A
We take three steps here:
-
Revise
pets.txt
to:small fuzzy bunnies
and create archive C. - Delete archive A.
-
Revise
pets.txt
to:small fuzzy kittens
and create archive D.
To focus on how the archives and server data changes, we'll look at all of the revisions in a table instead of diagrams for each archive.
Local computer | Transfer | Server | |||
---|---|---|---|---|---|
(Add / Delete) | Words | Archives | |||
New archive A:
cute small kittens |
+ words 1 2 3
+ archive A |
1: cute
2: small 3: kittens |
A: 1 2 3 | ||
New archive B:
fuzzy small bunnies |
+ words 4 5
+ archive B |
1: cute
2: small 3: kittens 4: fuzzy 5: bunnies |
A: 1 2 3
B: 4 2 5 |
||
New archive C:
small fuzzy bunnies |
+ archive C |
1: cute
2: small 3: kittens 4: fuzzy 5: bunnies |
A: 1 2 3
B: 4 2 5 C: 2 4 5 |
||
Delete archive A |
- words 1 3
- archive A |
2: small
4: fuzzy 5: bunnies |
B: 4 2 5
C: 2 4 5 |
||
New archive D:
small fuzzy kittens |
+ word 6
+ archive D |
2: small
4: fuzzy 5: bunnies 6: kittens |
B: 4 2 5
C: 2 4 5 D: 2 4 6 |
When we added archive C, we didn't need to upload any new words at all. Clear win for our wordification-based deduplication!
Deleting archive A prompted the server to remove any words that we no longer needed. That reduced our storage costs, however...
... when we added D, we incurred an "extra" transfer: Re-uploading the word "kittens" (now called word 6 instead of word 3). We could have avoided that transfer by creating archive D before deleting archive A.
Extending the "wordification" example
The simple "wordification" example only contained a single file, but it's a useful starting point to understand how Tarsnap archives work. Let's imagine how we could extend the example to cover a more realistic archive system:
-
Multiple files:
Instead of an archive merely containing a list of words, the archive
could contain a list of filenames, and each filename would then
be associated with a list of words. For example:
E:
Note that our list of words is shared and reused between all three files!pets.txt
(2 4 6)uvic.txt
(4 5)internet.txt
(3 3 3) - Files metadata: Information such as user/group ownership, permissions, and the modification date can also be associated with each filename.
- Directories: Directories can be stored as a special case of files.
- Local caching: The word list was sent from the server to the local computer every time a new archive was being created. That could make backups quite slow! In a real-world system, the list of words (or a hash of the chunked data) should be kept on the local system.
-
Encrypted data:
The simplified example stores server data unencrypted — if a
hacker broke in, they could find out that we considered
bunnies as pets instead of
kittens! To protect against that, the
data should be encrypted with a private key on the local computer.
For example, we should change:
2: fuzzy → 2: bzFxlArO5is=
-
Encrypted metadata:
However, encrypted data isn't enough. If an attacker saw that
pets.txt
had archive B: 4 2 5 and archive D: 2 4 6, they could still figure out that kittens were not our only choice for pets! To fix this, we need to encrypt the labels of words, as well as the archive lists. So we should replace:2: bzFxlArO5is= → 3l2jG1Wwu1k=: bzFxlArO5is=
Naturally, the archive metadata should also be encrypted:D: 2 4 6 → fnoMOFFK6Mk=: UB2vxoD8qbg=
Note that even the number of words in D is hidden —instead of storing the three word numbers as (2 4 6 → 3l2jG1Wwu1k= 22iBrXTZfNU= 6WsyTW1UI5k=), we encrypt the entire list to reduce the amount of information that an attacker could gather. -
Non-text files:
Most of our data is not in easily-handled text files, of course!
How can we handle binary data?
First, let's remind ourselves of two benefits of "wordification":- Variable-sized chunks: Words have different numbers of characters, so splitting data into fixed-size chunks (say, every 4 characters) would not be ideal.
- Context-dependent splitting: We don't try to guess where we should split our data before we've looked at the data. Words are split in spaces — in other words, the location of each split depends on the context (data).