2019-04-26

Just realised yesterday's post hadn't gone live due to an unterminated string. Pelican supports dying on errors, and Netlify can send an email when a deploy fails, so that should make it more obvious when a deploy goes wrong. Of course, the right answer is to use git hooks, but those are easy to forget. Why not provide a backstop? It's easy enough.

There's also the option to die on warnings, but that doesn't work: building the site complains that there are multiple items (two separate index.rst files) with the same language. It seems like this is an interaction between how pagehier builds out the page and what Pelican is expecting. At first, I had the fatal warnings turned on and spent about 10 minutes trying to figure out how to work around it... then realised it's not a high priority problem, and that what kept pages from being published were actual errors. Death by a thousand papercuts. If only there were time to fix all the little bugs; is it any wonder computers are right clusterfucks of complexity?

I've been trying to come up with an idea for something I can start building; since finding Quiver, I'm less motivated to work on GraphBook right now, particularly since I'm going to have the same sort of problem: how to keep my notebook synced. Because I linked my personal account at work, I get the development builds of Dropbox, which can be a little power hungry. I do get selective sync, though. What I want is something like Syncthing, except I don't want to use the relays and I don't want a complicated config; I want something more targeted.

What does that look like? Well, I think there are really two kinds of syncing that I want. I'll motivate this with two examples:

  1. My papers archive is a dump of all my papers; I add to this relatively frequently, but not terribly frequently. It's the kind of thing that I can afford to load up on my machine when I'm restoring from backups.
  2. This repo is under active development on the bus. If something happens to my laptop (what if I drop it while getting off the bus), then I want to be able to have my stuff synced to the cloud.

The first scenario is pretty well covered by Time Machine on macOS and restic on Linux, both of which handle full and incremental backups with encryption reasonably well. What about the "hot" files: files I'm actively using or working on? Those I want to update more often than nightly or weekly. It's the second scenario that I want to focus on, essentially keeping a cache of regularly used files available. It should be easy to modify the cache policy, which I imagine is something like

I don't really know much about systems programming with Erlang yet, so this might be best as a C program. If we consider the case where I drop my laptop while on the bus, it's important to note that I'm usually offline there. So it'd be helpful to have a secondary, local cache that I could quickly restore from. I have a small thumbdrive that barely protrudes [1] that could hold this cache, opportunistically uploading as needed.

It'd be tedious to mark every file that needs to be added to the cache, so I should be able to mark directories; when new files are created, they should be added. It makes sense to mark a maximum size for a cache.

So we have the first piece: a local daemon that watches for file changes and maintains a cache. Now, if it's just running a kqueue watch (on macOS) or an inotify watch (on Linux) and writing changes to the external storage immediately, there's a good chance this will use a lot of extra battery power, which is not good. So the watcher should only mark files as dirty, and a separate process or thread in the daemon would periodically check those dirty files and write them out.
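
A minimal sketch of the dirty-marking idea, in Python rather than C just to keep it short. The class name, the 30 second default interval, and the injectable clock are all assumptions for illustration; the real watcher would call mark() from its kqueue or inotify callback.

```python
import time


class DirtyTracker:
    """Collects paths reported by a file watcher and releases them in batches."""

    def __init__(self, flush_interval=30.0, clock=time.monotonic):
        self.flush_interval = flush_interval  # seconds between flushes (assumed value)
        self._clock = clock
        self._dirty = set()
        self._last_flush = clock()

    def mark(self, path):
        # Called from the watcher callback; cheap, no I/O.
        self._dirty.add(path)

    def due(self):
        # True once the interval has elapsed and there is something to write.
        elapsed = self._clock() - self._last_flush
        return bool(self._dirty) and elapsed >= self.flush_interval

    def drain(self):
        # Hand back the batch of dirty paths and reset the timer.
        batch, self._dirty = sorted(self._dirty), set()
        self._last_flush = self._clock()
        return batch


if __name__ == "__main__":
    tracker = DirtyTracker(flush_interval=0.0)
    for path in ["notes/a.rst", "notes/a.rst", "src/main.c"]:
        tracker.mark(path)  # repeated saves of one file collapse into one entry
    if tracker.due():
        print(tracker.drain())
```

Because the dirty set only stores paths, ten saves of the same file between flushes cost one write to the external storage instead of ten.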

At some point, we'll have to sync with the network and update the cache remotely. This implies two more pieces: an update protocol and a backend storage system.

I think it makes sense for this whole thing to be block based [2]; a sensible block size will have to be chosen. The files I'm dealing with are likely to be on the small side. To get some numbers, I've started a small package for running experiments on files. On my MacBook Air it tells me:

(0) <bragi:kyle> $ avgsize ~/code ~/src ~/Documents
/Users/kyle/code: elapsed=153.265749ms
        avg=72371 total=132946451 count=1837
        min=2 max=52551242
/Users/kyle/src: elapsed=4.614689369s
        avg=33857 total=2440619085 count=72086
        min=1 max=331297980
/Users/kyle/Documents: elapsed=2.52040076s
        avg=647819 total=22575212182 count=34848
        min=2 max=526411771

Meanwhile, on my Thinkpad, I get:

(0) <hephaestus:kyle> $ avgsize ~/code ~/Documents ~/Dropbox ~/src
/home/kyle/code: elapsed=906.866814ms
        avg=115604 total=3079359049 count=26637
        min=1 max=148258528
/home/kyle/Documents: elapsed=8.120137ms
        avg=2849516 total=558505145 count=196
        min=14 max=77898853
/home/kyle/Dropbox: elapsed=332.078966ms
        avg=1995956 total=24783791655 count=12417
        min=1 max=526411771
/home/kyle/src: elapsed=2.836887154s
        avg=48769 total=11301984434 count=231742
        min=1 max=914064882
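
For reference, here is a rough Python sketch of the kind of statistics shown above. This is not the avgsize package itself; the function name and the choice to skip symlinks and special files are assumptions.

```python
import os


def size_stats(root):
    """Walk root and return count, total, avg, min and max of regular file sizes."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue  # skip symlinks, sockets and other special files
            sizes.append(os.path.getsize(path))
    if not sizes:
        return {"count": 0, "total": 0, "avg": 0, "min": 0, "max": 0}
    total = sum(sizes)
    return {
        "count": len(sizes),
        "total": total,
        "avg": total // len(sizes),  # integer average, like the figures above
        "min": min(sizes),
        "max": max(sizes),
    }
```

Calling size_stats(os.path.expanduser("~/code")) would give the same five fields as one entry in the output above, minus the timing.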

Documents and Dropbox are directories I'd track with that first scenario, i.e. backups. Although it might be nice to store recently read papers and documents (for example, I'm reading Programming Erlang or I might be trying to implement a paper), I think that's the kind of thing I could skip on a first pass [3].

The two other directories mostly contain source code and writing. This also includes compiled files and whatnot; src has some third-party Rust projects I've been experimenting with. There are a lot of build artifacts in here, so that is probably throwing the metrics off, but it'd be a lot of work for this blog post to filter those out.

This brings up another point: it's probably useful to be able to specify a cache-ignore policy, which is a list of filters for files to ignore, much like .gitignore. If I switch over to my Linux machine, I probably don't want a bunch of macOS build files lying around, not to mention the wasted bandwidth and storage of those files [4].
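
A small sketch of what such a filter could look like. The rules here are a deliberately simplified stand-in for real .gitignore semantics, and the pattern list is hypothetical.

```python
import fnmatch


def is_ignored(path, patterns):
    """Return True if a relative, '/'-separated path matches any ignore pattern.

    Simplified rules (an assumption, not real .gitignore behaviour): a pattern
    ending in '/' ignores a directory with that name anywhere in the path; any
    other pattern is matched against the file name with fnmatch.
    """
    parts = path.split("/")
    for pattern in patterns:
        if pattern.endswith("/"):
            if pattern[:-1] in parts[:-1]:
                return True
        elif fnmatch.fnmatchcase(parts[-1], pattern):
            return True
    return False


PATTERNS = ["target/", "*.o", ".DS_Store"]  # hypothetical cache-ignore entries

if __name__ == "__main__":
    for candidate in ["src/foo/target/debug/foo", "code/kv/kv.c", "code/kv/kv.o"]:
        print(candidate, is_ignored(candidate, PATTERNS))
```

The watcher would run this check before marking a file dirty, so ignored build output never reaches the cache in the first place.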

Back to the subject of block sizes, let's play with some numbers.

Whole blocks per average-sized file, by block size (bytes):

Avg size (B)      64     128     512    1024    2048    4096
       33857     529     264      66      33      16       8
       48769     762     381      95      47      23      11
       72371    1130     565     141      70      35      17
      115604    1806     903     225     112      56      28
      647819   10122    5061    1265     632     316     158
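
The figures in this table are plain integer divisions of the average size by the block size. A short Python sketch reproduces them; the second helper is an addition of mine showing the ceiling variant, since a trailing partial block still needs a block of its own.

```python
AVERAGE_SIZES = [33857, 48769, 72371, 115604, 647819]  # bytes, from avgsize
BLOCK_SIZES = [64, 128, 512, 1024, 2048, 4096]  # bytes


def whole_blocks(size, block_size):
    # Floor division, as in the table: only completely filled blocks count.
    return size // block_size


def blocks_needed(size, block_size):
    # Ceiling division: a trailing partial block still occupies a block.
    return -(-size // block_size)


if __name__ == "__main__":
    for size in AVERAGE_SIZES:
        print(size, *(whole_blocks(size, b) for b in BLOCK_SIZES))
```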

I just arbitrarily picked some powers of two here; even given the possible outliers, it seems like 1024 or 2048 byte blocks would be a good compromise for this. Assuming every file is unique, the code and src directories on my Mac total 2573565536 bytes (about 2.4G), and on my Thinkpad they total 14381343483 bytes (about 13.4G). Let's look at how many blocks we'd get with 1024, 2048, and 4096 byte blocks.

Why do these numbers matter? Because we have to store metadata about all these blocks; it'll probably need to be in-memory and on disk. How much space does a header take up? Let's look at a simplistic block header:

This struct takes up 80 bytes [5]. We can add this into the table; the parenthesized number is the size of the block metadata.
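
As a rough illustration of how a header lands on 80 bytes, here is one hypothetical layout modelled with Python's ctypes, which follows the platform's C alignment rules. The field names and types are assumptions chosen to be consistent with the 80 byte figure and with footnote 5, not necessarily the struct measured above; the sizes in the comments assume a typical 64-bit platform.

```python
import ctypes


class BlockHeader(ctypes.Structure):
    # Hypothetical layout: field names and types are assumptions.
    _fields_ = [
        ("offset", ctypes.c_uint64),         # position of the block in its file
        ("size", ctypes.c_uint32),           # bytes used in this block
        ("sha256_hex", ctypes.c_char * 65),  # 64 hex digits plus a NUL terminator
    ]


class CompactBlockHeader(ctypes.Structure):
    # The tuning suggested in footnote 5: 16-bit size and a raw 32-byte digest.
    _fields_ = [
        ("offset", ctypes.c_uint64),
        ("size", ctypes.c_uint16),
        ("sha256", ctypes.c_ubyte * 32),
    ]


def unpadded_size(struct_type):
    # Sum of the field sizes, ignoring any alignment padding.
    return sum(ctypes.sizeof(field_type) for _name, field_type in struct_type._fields_)


if __name__ == "__main__":
    for struct_type in (BlockHeader, CompactBlockHeader):
        # Prints the name, the unpadded size, and the padded size in bytes.
        print(struct_type.__name__, unpadded_size(struct_type), ctypes.sizeof(struct_type))
```

The padding comes from the 64-bit field: the struct's total size is rounded up to a multiple of that field's alignment.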

Total size (B)    1024-byte blocks      2048-byte blocks     4096-byte blocks
2573565536        2513247 (191.7M)      1256623 (95.9M)      628311 (47.9M)
14381343483       14044280 (1071.5M)    7022140 (535.7M)     3511070 (267.9M)
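
The arithmetic behind this table, as a short Python sketch: the block count is the total size divided by the block size, and the metadata size is the block count times the 80 byte header, shown in units of 1024 * 1024 bytes. The function and constant names are mine.

```python
HEADER_BYTES = 80  # size of the block header discussed above
MIB = 1024 * 1024

TOTALS = {"MacBook Air": 2573565536, "Thinkpad": 14381343483}  # code + src, bytes


def metadata_cost(total_bytes, block_size, header_bytes=HEADER_BYTES):
    """Return (block count, metadata size in MiB) for a given block size."""
    blocks = total_bytes // block_size  # whole blocks, as in the table
    return blocks, blocks * header_bytes / MIB


if __name__ == "__main__":
    for name, total in TOTALS.items():
        for block_size in (1024, 2048, 4096):
            blocks, mib = metadata_cost(total, block_size)
            print(f"{name}: {block_size} byte blocks -> {blocks} blocks, {mib:.1f}M metadata")
```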

With 1024 byte blocks, the block metadata is about 8% of the actual data storage (80 / 1024). With 2048 bytes, that goes to about 4%, and with 4096 bytes, it's about 2%. These numbers assume no other overhead, so let's add 1% overhead and see how much space we'd have left over on a drive:

Drive size    1024 @ 9%    2048 @ 5%    4096 @ 3%
32G           29.1G        30.4G        31.0G
64G           58.2G        60.8G        62.1G
128G          116.5G       121.6G       124.2G
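
The same calculation as a small Python sketch, assuming the 80 byte header and rounding the metadata share to a whole percent the way the text does. The function name is mine.

```python
HEADER_BYTES = 80
EXTRA_OVERHEAD = 0.01  # the 1% allowance from the text


def usable_gib(drive_gib, block_size):
    """Space left for file data after metadata and the extra overhead allowance."""
    # Round the metadata share to a whole percent, as the text does (8%, 4%, 2%).
    metadata_share = round(HEADER_BYTES / block_size, 2)
    return drive_gib * (1 - metadata_share - EXTRA_OVERHEAD)


if __name__ == "__main__":
    for drive in (32, 64, 128):
        row = [f"{usable_gib(drive, b):.1f}G" for b in (1024, 2048, 4096)]
        print(f"{drive}G", *row)
```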

So based on all this, I think starting with 2048 affords a decent compromise.

Another feature of this system is that it doesn't need to be versioned, per se. That is, the cache just keeps the latest version. Usually the data flows from my machine to the storage system, except when I need to restore.
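
To make "just keeps the latest version" concrete, here is a toy in-memory sketch in Python. It assumes 2048 byte blocks keyed by their SHA-256 digest, which is one possible design rather than a settled one; the class and method names are made up for illustration.

```python
import hashlib

BLOCK_SIZE = 2048  # the starting block size chosen above


def split_blocks(data, block_size=BLOCK_SIZE):
    """Split bytes into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


class LatestOnlyCache:
    """Keeps one block list per path; a new write replaces the old one."""

    def __init__(self):
        self.blocks = {}  # SHA-256 hex digest -> block bytes
        self.files = {}   # path -> list of digests for the latest version

    def put(self, path, data):
        digests = []
        for block in split_blocks(data):
            digest = hashlib.sha256(block).hexdigest()
            self.blocks[digest] = block
            digests.append(digest)
        self.files[path] = digests  # overwrite: no history is kept
        self._drop_unreferenced()

    def get(self, path):
        # Restore path: reassemble the file from its latest block list.
        return b"".join(self.blocks[d] for d in self.files[path])

    def _drop_unreferenced(self):
        # Blocks that only belonged to an older version are discarded.
        live = {d for digests in self.files.values() for d in digests}
        self.blocks = {d: b for d, b in self.blocks.items() if d in live}
```

Writing a path a second time replaces its block list, and any blocks that only the old version used are dropped, so storage tracks the current state of the files rather than their history.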

Okay, this is long enough and I might have a few things to play around with and try to implement. The first versions will be crude and hacky, but that's okay.

[1] Though it does protrude just enough that it could be broken, which is a failure mode to be considered.
[2] This is based mostly on the fact that everyone else seems to do it. I'll revisit this assumption later.
[3] Later, a more granular cache policy that covers the maximum number of files to keep or a max size for a particular directory would work for this.
[4] Willfully ignoring the fact that there might be cases where I have to drop in a static library.
[5] Some of this could be tuned; I bet the block size could be 16 bits instead, and the SHA256 bit could be the raw bytes, not the string. In that case, we'd be down to 42 bytes, though 48 with padding for alignment if I remember my struct packing right.
