Konubinix' opinionated web of thoughts

How I Organize My Static Files With IPFS and Org-Roam

Fleeting

my problem

Traditionally, when we store a file, there is a central question that we ask ourselves: where do I put the file?

I seldom find a satisfying answer to this question using the classical folder hierarchy. Also, when I need the file later, I generally have a hard time finding it, because my mindset has changed in the meantime.

Imagine I want to keep some file, like my emoji. In that traditional state of mind, I would put the data on some hard drive, in a location that made sense at the time. Something like “personal/identity/emoji.jpg”.

zettelkasten to the rescue

I’ve always wanted to remove the burden of finding a path. I was quite convinced by tools like git or ipfs and their content-addressable state of mind. But I did not have a satisfying way to index those files.

Then, I discovered org-roam that provides a way to curate a second brain, a conversation partner that helps me get back old thoughts when I need them.

Why stop there and only retrieve old thoughts? Why not also use my conversation partner to dig out files I have put into it?

my new workflow

Then, my workflow is:

  1. realize I need to store a file,
  2. decide what this file means, i.e. what concept it makes concrete,
  3. ipfs add the file (see the sketch after this list),
  4. provide this hash to my conversation partner, like I would any other zettel,
  5. forget about it.
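
For instance, steps 3 and 4 boil down to something like the following sketch. This is not my actual tooling: the helper names are made up, and the ipfs:// link format is only one possible way of writing the hash into a note.

```python
import subprocess


def ipfs_add(path):
    """Add a file to the local ipfs node and return its CID."""
    # -Q makes `ipfs add` print only the resulting CID
    result = subprocess.run(
        ["ipfs", "add", "-Q", path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def org_link(cid, description):
    """Format a link I could paste into an org-roam note."""
    return f"[[ipfs://{cid}][{description}]]"


print(org_link(ipfs_add("emoji.jpg"), "my emoji"))
```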

If I need to get the file back, I trust my conversation partner to show me the appropriate note when the time comes. That means that when I want the file back, I don’t have to think in terms of a file hierarchy: I can just look for whatever comes to mind and let the network of thoughts naturally lead me to my file.

For instance, if I want to get back to my emoji, I may ask myself “where is the image I use in my CV?”. In that case, I just go to my CV and quickly find my emoji.

how to store those files

To do so, the system needs to be able to store files in addition to simply storing notes.

the stack

To deal with this, I created a private cluster of ipfs nodes1. Because ipfs pinning sucks, and because all I need is a central index of CIDs, I also have a postgresql database to store those CIDs.
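
This note does not show the actual schema; as a rough sketch, two tables are enough: one for the CIDs themselves and one recording which node holds which CID (the table and column names below are invented):

```python
import psycopg2

# Hypothetical schema: the index of CIDs plus one row per (CID, node) allocation.
SCHEMA = """
CREATE TABLE IF NOT EXISTS cid (
    cid TEXT PRIMARY KEY
);
CREATE TABLE IF NOT EXISTS allocation (
    cid  TEXT REFERENCES cid (cid),
    node TEXT NOT NULL,
    PRIMARY KEY (cid, node)
);
"""

with psycopg2.connect("dbname=ipfs_index") as conn:
    with conn.cursor() as cur:
        cur.execute(SCHEMA)
```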

I also have an ipfs companion program per ipfs node that runs `ipfs get`, ensuring that the node has the content it is supposed to have. There is also an ipfs controller whose role is to find CIDs not yet allocated and ask the ipfs companions to ensure the nodes have downloaded them.

To let the ipfs companions and the controller communicate and store the temporary state of which files to fetch, I also have an instance of redis running.

Finally, my laptop also runs an ipfs node connected to the cluster, allowing me to ipfs add locally.

the workflow of the ipfs companions

Each ipfs companion keeps a redis value up to date with its node’s remaining disk space.
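
I am only sketching the idea here (the redis keys, node name and repo path are invented): each companion publishes its free space and fetches whatever CIDs the controller queued for it.

```python
import shutil
import subprocess
import tempfile

import redis

NODE = "node-1"              # hypothetical name of this companion's node
IPFS_REPO = "/var/lib/ipfs"  # hypothetical location of the node's repository

r = redis.Redis()


def report_free_space():
    """Keep a redis value up to date with the remaining disk space."""
    r.set(f"companion:{NODE}:free", shutil.disk_usage(IPFS_REPO).free)


def fetch_pending_cids():
    """Download every CID the controller queued for this node."""
    while (cid := r.lpop(f"companion:{NODE}:to_get")) is not None:
        cid = cid.decode()
        # `ipfs get` ensures the blocks end up in this node's blockstore
        with tempfile.TemporaryDirectory() as tmp:
            subprocess.run(["ipfs", "get", cid, "-o", f"{tmp}/{cid}"], check=True)


report_free_space()
fetch_pending_cids()
```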

At regular intervals, the ipfs controller finds in postgresql which CIDs have not been replicated yet, decides which companion should store each CID based on the remaining space, then orders the companions via redis to get the appropriate files, and finally updates the postgresql database to indicate which companion owns which CID.
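
A hedged sketch of that loop, reusing the invented tables and redis keys from above, could look like this (one pass adds at most one new copy per CID; running it at regular intervals converges to the wanted replication):

```python
import psycopg2
import redis

REPLICAS = 2                                  # at least two nodes per CID
COMPANIONS = ["node-1", "node-2", "node-3"]   # hypothetical node names

r = redis.Redis()


def free_space(node):
    value = r.get(f"companion:{node}:free")
    return int(value) if value is not None else 0


with psycopg2.connect("dbname=ipfs_index") as conn, conn.cursor() as cur:
    # find the CIDs that do not have enough copies yet
    cur.execute("""
        SELECT c.cid
        FROM cid c LEFT JOIN allocation a ON a.cid = c.cid
        GROUP BY c.cid
        HAVING count(a.node) < %s
    """, (REPLICAS,))
    for (cid,) in cur.fetchall():
        cur.execute("SELECT node FROM allocation WHERE cid = %s", (cid,))
        already = {node for (node,) in cur.fetchall()}
        # pick the companion with the most remaining space among the others
        target = max((n for n in COMPANIONS if n not in already), key=free_space)
        # order the companion to fetch the file, then record the allocation
        r.rpush(f"companion:{target}:to_get", cid)
        cur.execute("INSERT INTO allocation (cid, node) VALUES (%s, %s)", (cid, target))
```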

Its code can be seen here. It is not made to be usable by anyone but me (yet?), but I published it anyway because people kept asking to get a look at it.

the tooling

Also, I made a script (sketched below) that:

  1. extracts all the CIDs from places of interest (like my zettelkasten),
  2. adds missing CIDs to postgresql,
  3. removes extra CIDs from postgresql.
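
Roughly, the idea boils down to the following sketch (the CID pattern and the notes directory are simplifications, and the tables are the invented ones from above):

```python
import re
from pathlib import Path

import psycopg2

# Rough CID pattern: base58 v0 hashes (Qm...) and base32 v1 CIDs (baf...)
CID_RE = re.compile(r"\b(Qm[1-9A-HJ-NP-Za-km-z]{44}|baf[a-z2-7]{30,})\b")


def cids_in(directory):
    """Extract all the CIDs referenced in a directory of org notes."""
    found = set()
    for note in Path(directory).expanduser().rglob("*.org"):
        found |= set(CID_RE.findall(note.read_text(errors="ignore")))
    return found


wanted = cids_in("~/roam")  # hypothetical location of my zettelkasten

with psycopg2.connect("dbname=ipfs_index") as conn, conn.cursor() as cur:
    cur.execute("SELECT cid FROM cid")
    known = {cid for (cid,) in cur.fetchall()}
    for cid in wanted - known:   # add the missing CIDs
        cur.execute("INSERT INTO cid (cid) VALUES (%s)", (cid,))
    for cid in known - wanted:   # remove the extra CIDs
        cur.execute("DELETE FROM allocation WHERE cid = %s", (cid,))
        cur.execute("DELETE FROM cid WHERE cid = %s", (cid,))
```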

At first, the CIDs are only added to the ipfs node that runs on my laptop. Then, when the postgresql database is updated, the ipfs controller duplicates them across the other ipfs nodes.

From time to time, I also have a script that consults the ipfs controller and the ipfs companions to check whether the sync is done.

The sync script is run a lot during the day, for instance every time I publish a new version of my braindump or my blog2.

lifecycle of the data and backup

The ipfs controller makes sure that at least two nodes store each CID, making the system resilient to the crash of a single disk.

Garbage collecting old files is a side effect of me practicing manual, progressive chaos monkey. From time to time I replace a hard drive with an empty one. Because the controller allocated each CID to at least two nodes, the system realizes that a bunch of files need to be duplicated again and does the work of asking another node to keep a new copy.

Now, the last part of the system is about backups. I have an external drive that receives a weekly copy of all the CIDs stored in postgresql. It puts them in multi-layered directories3. Then, using an encrypted borg repository, I create a remote backup of my backup on the drive of a friend who lives in another country.
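
The layered layout of footnote 3 translates directly to a path computation like this one (the backup root and the example CID are just illustrations):

```python
from pathlib import Path


def backup_path(root, cid):
    """Spread the CIDs over nested directories: ${hash:0:5}/${hash:5:3}/..."""
    return Path(root) / cid[0:5] / cid[5:8] / cid[8:11] / cid[11:14] / cid[14:]


print(backup_path("/mnt/backup", "QmYwAPJzv5CZsnA625s3Xf2nemtYgPpHdWEz79ojWnPbdG"))
# /mnt/backup/QmYwA/PJz/v5C/Zsn/A625s3Xf2nemtYgPpHdWEz79ojWnPbdG
```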

Therefore, losing a file means that:

  • either my computer crashed before the sync happened,
  • or the same thing4 happened and, in addition, at least two nodes crashed during the week following the addition of the file, without me noticing (or too quickly for me to have time to deal with the crashes),
  • or the same thing happened and, in addition, my external backup and my friend’s encrypted copy failed too quickly for me to be able to fix them.

Well, I feel pretty good about the persistence of my data.

conclusion

Now, with all that in place, here is the complete workflow to add a file:

  1. I need to save a file,
  2. I ipfs add it,
  3. I put the CID in some sensible note,
  4. then the system deals with replicating the file, so that I don’t have to bother thinking about it again.

And the workflow to retrieve a file:

  1. I ask the system,
  2. I get the note,
  3. I open the file.

Notice that the part of the flow where I actually use cognitive energy is very small (EAST). In the end, I just add a file where it naturally belongs and let the system deal with the rest.

I have made the system so that it leans on some habits I must keep, like running the sync script. This semi-automated system is my personal way of dealing with automation.

  1. I could have created any system that provides content-addressable addresses. Ipfs shines in that it works well on a raspberry pi and is meant to work as a cluster, so that I can easily add a node to gain space and redundancy. ↩︎

  2. Actually, not only does it sync my CIDs with postgresql, but it also waits for the CIDs to be replicated before actually publishing, so that I can power off my laptop without the risk of having a note referring to some unavailable file whose only copy would be on my powered-off laptop. ↩︎

  3. ${hash:0:5}/${hash:5:3}/${hash:8:3}/${hash:11:3}/${hash:14} ↩︎
  4. I run ipfs repo gc on my laptop only after a successful backup. ↩︎