Just realised yesterday's post hadn't gone live due to an unterminated string. Pelican supports dying on errors, and Netlify can send an email when a deploy fails, so enabling both should make it more obvious when a deploy goes wrong. Of course, the right answer is to use git hooks, but those are easy to forget to set up. Why not provide a backstop? It's easy enough.
There's also the option to die on warnings, but that doesn't work: building the site complains that there are multiple items (two separate index.rst files) with the same language. It seems to be an interaction between how pagehier builds out the pages and what Pelican is expecting. At first I had fatal warnings turned on and spent about ten minutes trying to figure out a workaround... then realised it's not a high-priority problem, and the pages that weren't being published were actual errors. Death by a thousand papercuts; if only there were time to fix all the little bugs. Is it any wonder computers are right clusterfucks of complexity?
I've been trying to come up with an idea for something I can start building. Since finding Quiver, I'm less motivated to work on GraphBook right now, particularly since I'm going to have the same sort of problem: how to keep my notebook synced. Because I linked my personal account at work, I get the development builds of Dropbox, which can be a bit power-hungry. I do get selective sync, though. What I want is something like syncthing, except I don't want to use the relays and I don't want a complicated config; I want something more targeted.
What does that look like? Well, I think there are really two kinds of syncing that I want. I'll motivate this with two examples:
- My papers archive is a dump of all my papers; I add to it fairly regularly, but not constantly. It's the kind of thing I can afford to load onto my machine when I'm restoring from backups.
- This repo is under active development on the bus. If something happens to my laptop (say I drop it while getting off the bus), I want my work to already be synced to the cloud.
The first scenario is pretty well covered by Time Machine on macOS and restic on Linux, both of which handle full and incremental encrypted backups reasonably well. What about the "hot" files, the ones I'm actively using or working on? Those I want to update more often than nightly or weekly. It's the second scenario I want to focus on: essentially keeping a cache of regularly used files available. It should be easy to modify the cache policy.
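I imagine the policy as a small config file. This sketch is purely illustrative; the keys, values, and paths are all invented for the example:

```
# hypothetical cache policy file
max-cache-size = 4G       # hard cap on the local cache
sync-interval  = 5m       # how often dirty files get flushed
watch ~/code/site         # directories to keep hot
watch ~/Documents/notes
```

The point is just that the policy should be a few lines of declarative config, not something I have to recompile or click through.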
I don't really know much about systems programming with Erlang yet, so this might be best as a C program. If we consider the case where I drop my laptop on the bus, it's important to note that I'm usually offline there. So it'd be helpful to have a secondary, local cache that I could use to quickly restore the cache. I have a small thumbdrive that barely protrudes that could cache data for this purpose, opportunistically uploading as needed.
It'd be tedious to mark every file that needs to be added to the cache, so I should be able to mark directories; when new files are created in them, they should be added automatically. It also makes sense to set a maximum size for the cache.
So we have the first piece: a local daemon that watches for file changes and maintains a cache. If it just runs a kqueue or inotify watch and writes changes to the external storage immediately, there's a good chance it will use a lot of extra battery power. So the watcher should mark files as dirty, and a separate process or thread in the daemon would periodically sweep those dirty files.
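The dirty-marking half of that can be sketched without committing to a watcher API yet. In this sketch (a rough outline, not the real daemon) the kqueue/inotify side is abstracted away to a method call, and the flusher just drains the set:

```go
package main

import (
	"fmt"
	"sync"
)

// Cache batches file-change notifications: the watcher only marks
// paths dirty, and a separate goroutine flushes them periodically,
// so we don't hit external storage on every single write event.
type Cache struct {
	mu    sync.Mutex
	dirty map[string]bool
}

func NewCache() *Cache { return &Cache{dirty: make(map[string]bool)} }

// MarkDirty is what the kqueue/inotify watcher would call per event.
// Repeated events on the same file collapse into a single entry.
func (c *Cache) MarkDirty(path string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.dirty[path] = true
}

// Flush drains the dirty set and returns the paths to sync. In the
// daemon this would run off a timer rather than being called directly.
func (c *Cache) Flush() []string {
	c.mu.Lock()
	defer c.mu.Unlock()
	paths := make([]string, 0, len(c.dirty))
	for p := range c.dirty {
		paths = append(paths, p)
	}
	c.dirty = make(map[string]bool)
	return paths
}

func main() {
	c := NewCache()
	c.MarkDirty("notes/todo.md")
	c.MarkDirty("notes/todo.md") // duplicate event, coalesced
	c.MarkDirty("src/main.go")
	fmt.Println(len(c.Flush()), "files to sync") // 2 files to sync
}
```

The win here is coalescing: a file saved fifty times between flushes costs one upload, not fifty.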
At some point, we'll have to sync with the network and update the cache remotely. This implies two more pieces: an update protocol and a backend storage system.
I think it makes sense for this whole thing to be block-based; a sensible block size will have to be chosen. The files I'm dealing with are likely to be on the smaller side. To get some data, I've started a small package for running experiments on files. On my MacBook Air it tells me:
    (0) <bragi:kyle> $ avgsize ~/code ~/src ~/Documents
    /Users/kyle/code: elapsed=153.265749ms avg=72371 total=132946451 count=1837 min=2 max=52551242
    /Users/kyle/src: elapsed=4.614689369s avg=33857 total=2440619085 count=72086 min=1 max=331297980
    /Users/kyle/Documents: elapsed=2.52040076s avg=647819 total=22575212182 count=34848 min=2 max=526411771
Meanwhile, on my Thinkpad, I get:
    (0) <hephaestus:kyle> $ avgsize ~/code ~/Documents ~/Dropbox ~/src
    /home/kyle/code: elapsed=906.866814ms avg=115604 total=3079359049 count=26637 min=1 max=148258528
    /home/kyle/Documents: elapsed=8.120137ms avg=2849516 total=558505145 count=196 min=14 max=77898853
    /home/kyle/Dropbox: elapsed=332.078966ms avg=1995956 total=24783791655 count=12417 min=1 max=526411771
    /home/kyle/src: elapsed=2.836887154s avg=48769 total=11301984434 count=231742 min=1 max=914064882
Documents and Dropbox are directories I'd track with the first scenario, i.e. backups. Although it might be nice to store recently read papers and documents (for example, I'm reading Programming Erlang, or I might be trying to implement a paper), I think that's the kind of thing I could skip on a first pass.
The two other directories mostly contain source code and writing. They also include compiled files and whatnot; src has some third-party Rust projects I've been experimenting with. There are a lot of build artifacts in there, which is probably throwing the metrics off, but filtering them out would be a lot of work for a blog post.
This brings up another point: it's probably useful to be able to specify a cache-ignore policy, a list of filters for files to ignore, much like .gitignore. If I switch over to my Linux machine, I probably don't want a bunch of macOS build files lying around, not to mention the bandwidth and storage wasted on those files.
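Something in the spirit of .gitignore would do. The patterns here are just examples of what I'd want to filter, not a fixed format:

```
# hypothetical .cacheignore
*.o
*.dSYM/
target/
node_modules/
.DS_Store
```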
Back to the subject of block sizes, let's play with some numbers.
I just arbitrarily picked some powers of two; even given the possible outliers, it seems like 1024- or 2048-byte blocks would be a good compromise. Assuming every file is unique, the code and src directories on my Mac total 2573565536 bytes (about 2.4G), and on my ThinkPad it's about 13.4G. Let's look at how many blocks we'd get with 1024-, 2048-, and 4096-byte blocks.
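The block counts are just integer division (ignoring partial trailing blocks, which is fine for a back-of-the-envelope estimate). A quick sketch to generate them:

```go
package main

import "fmt"

func main() {
	// Total bytes for code+src: MacBook Air first, then ThinkPad.
	for _, total := range []int64{2573565536, 14381343483} {
		for _, bs := range []int64{1024, 2048, 4096} {
			// Integer division: any partial trailing block is
			// dropped in this rough estimate.
			fmt.Printf("total=%d blocksize=%d blocks=%d\n",
				total, bs, total/bs)
		}
	}
}
```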
Why do these numbers matter? Because we have to store metadata about all these blocks, and it'll probably need to live both in memory and on disk. How much space does a header take up? Let's look at a simplistic block header:
This struct takes up 80 bytes. We can add that to the table; the parenthesized number is the size of the block metadata.
| Total size (bytes) | 1024-byte blocks | 2048-byte blocks | 4096-byte blocks |
|---|---|---|---|
| 2573565536 | 2513247 (191.8M) | 1256623 (95.9M) | 628311 (47.9M) |
| 14381343483 | 14044280 (1071.5M) | 7022140 (535.7M) | 3511070 (267.9M) |
With 1024-byte blocks, the block metadata is 8% of the actual data storage. With 2048-byte blocks, that drops to 4%. Finally, with 4096-byte blocks, it's 2%. These numbers assume no other overhead, so let's add 1% and see how much space we'd have left over:
| Drive size | 1024 @ 9% | 2048 @ 5% | 4096 @ 3% |
|---|---|---|---|
So based on all this, I think starting with 2048-byte blocks affords a decent compromise.
Another feature of this system is that it doesn't need to be versioned, per se: the cache just keeps the latest version. Data usually flows from my machine to the storage system, except when I need to restore.
Okay, this is long enough, and I now have a few things to play around with and try to implement. The first versions will be crude and hacky, but that's okay.
Footnotes:

1. Though it does protrude just enough that it could be broken, which is a failure mode to be considered.
2. This is based mostly on the fact that everyone else seems to do it. I'll revisit this assumption later.
3. Later, a more granular cache policy that covers the maximum number of files to keep, or a max size for a particular directory, would work for this.
4. Willfully ignoring the fact that there might be cases where I have to drop in a static library.
5. Some of this could be tuned; I bet the block size could be 16 bits instead, and the SHA-256 could be the raw bytes, not the string. In that case, we'd be down to 42 bytes, though 48 with padding for alignment if I remember my struct packing right.