Friday, July 31, 2009

Gitification of Tomboy notes

I was previously using Dropbox to share the Tomboy synchronisation directory between the machines I use, yet I sometimes forgot to run a Tomboy synchronisation before leaving home or work, and felt a bit miserable afterwards, as I now depend on Tomboy for many aspects of my duties. However, as I never forget Git synchronisation, the idea came to me that I should use Git to synchronize Tomboy notes, just as I do for most of my other things.

Tomboy is a wonderful and very useful tool. Yet its internal file and directory formats are under-documented, and some guesswork is needed here and there when I want to handle my Tomboy notes through various scripts. There is a D-Bus interface that I could use, and this is how I originally started my tboy tool. But I found out I did not master that interface so well; so tboy was sometimes using D-Bus, and other times avoiding it. As tboy progressively grew to accommodate my various needs, I finally decided it was easier to standardize it on direct reading of directories and files, with the practical result that I almost never use D-Bus by now. Another small advantage is that tboy may work even when Tomboy is not running.

The guesswork may have strange consequences. I had the persistent impression that the Tomboy synchronisation directories were designed to preserve the evolution history of successive synchronisation calls (each being called a revision in Tomboy terminology), yet I had some difficulty deciphering every detail of this. Sandy Armstrong, the current maintainer of Tomboy, is especially friendly and approachable, so I dared to ask him for some help. He explained to me that maintaining the synchronisation history has never been an intent in Tomboy, and that if I ever find two versions of the same note, then I'm uncovering a Tomboy bug. Surprised by this statement, I made a more thorough examination of my whole Tomboy synchronisation directory, and found a lot of duplication. So there is a bug in Tomboy sync in which the cleanup does not work properly; I'm not fully sure, but the cleanup apparently works only in a few cases, for notes being handled in adjacent revisions. All the clutter which results holds a lot of recoverable history. So that particular bug was quite productive in my case ☺.

Each revision has an associated number, counting chronologically from zero. The synchronisation directory has one subdirectory per revision, named after the revision number. To be precise, a revision directory is two levels down from the top synchronisation directory, as directories are grouped one hundred at a time (the grouping directory merely uses the hundreds digits of the revision number). Each revision directory holds a manifest.xml file telling which notes still existed at that revision level and, for each note, at which revision it was last modified. The revision directory also holds the full contents of those notes which were introduced or modified at that revision level.
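
In Python terms, locating a revision directory amounts to a little helper like this one (a sketch of my understanding of the layout; the function name is mine):

  import os

  def revision_directory(sync_root, revision):
      # Revision 187 lives under <sync_root>/1/187/, where "1" comes
      # from dropping the last two digits of the revision number.
      return os.path.join(sync_root, str(revision // 100), str(revision))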

The gitification works in two passes. The first pass establishes, for each note, all revisions which hold the full contents of that note. It also attributes a timestamp to each revision (the manifest.xml modification time seems a good estimator). The second pass checks, for each revision and referenced note, whether we still have its contents at the last-modified revision. Most of the time, because of the cleanup bug, we do. If we do not, then we pick a copy at the closest higher revision where the full text of the note exists. If there is no such higher revision, then most likely the note has been deleted for good, but it still exists in the ~/.tomboy/Backup/ directory, so we pick that backed-up note instead.
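
Here is a sketch of those two passes. It rests on guesses: that notes appear within each revision directory as files named after the note id with a .note suffix, and that manifest.xml entries carry id and rev attributes; the real formats may differ in details.

  import os
  import xml.etree.ElementTree as ElementTree

  def scan_synchronisation(sync_root, backup_directory):
      # First pass: for each note, list the revisions holding its full
      # contents, and estimate a timestamp per revision from the
      # modification time of its manifest.xml.
      contents_revisions = {}
      timestamps = {}
      revision = 0
      while True:
          directory = os.path.join(
              sync_root, str(revision // 100), str(revision))
          if not os.path.isdir(directory):
              break
          timestamps[revision] = os.path.getmtime(
              os.path.join(directory, 'manifest.xml'))
          for base in os.listdir(directory):
              if base.endswith('.note'):
                  contents_revisions.setdefault(
                      base[:-len('.note')], []).append(revision)
          revision += 1
      # Second pass: for each revision, resolve each referenced note
      # to a file actually holding its contents.
      states = {}
      for revision in sorted(timestamps):
          directory = os.path.join(
              sync_root, str(revision // 100), str(revision))
          state = states[revision] = {}
          tree = ElementTree.parse(os.path.join(directory, 'manifest.xml'))
          for note in tree.iter('note'):
              note_id = note.get('id')
              wanted = int(note.get('rev'))
              candidates = [r for r in contents_revisions.get(note_id, [])
                            if r >= wanted]
              if candidates:
                  # Usually wanted itself; else the closest higher revision.
                  where = os.path.join(
                      sync_root, str(candidates[0] // 100),
                      str(candidates[0]), note_id + '.note')
              else:
                  # Deleted for good: fall back on the Tomboy backups.
                  where = os.path.join(backup_directory, note_id + '.note')
              state[note_id] = where
      return states, timestamps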

Now that we have the set of existing notes at each revision level, it becomes a trivial matter to restore each previous state, one revision at a time, and to generate Git commands for staging and committing that state. After the execution of the transformation script, I manually copied back the few other administrative directories from the original ~/.tomboy/, that is: addins/, addin-db-001/, sync_temp/ and Backup/. With Tomboy stopped, the final step was to replace ~/.tomboy/ with the new one.
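
The replay step could take a shape like the following sketch, where states and timestamps come from the two passes above; the GIT_AUTHOR_DATE and GIT_COMMITTER_DATE environment variables are standard Git features for backdating commits:

  import os
  import shutil
  import subprocess
  import time

  def replay_into_git(states, timestamps, work_tree):
      # work_tree is presumed to be an existing, empty directory.
      subprocess.check_call(['git', 'init'], cwd=work_tree)
      for revision in sorted(states):
          # Make the working tree mirror the state at that revision.
          for base in os.listdir(work_tree):
              if base.endswith('.note'):
                  os.remove(os.path.join(work_tree, base))
          for note_id, source in states[revision].items():
              shutil.copy(source, os.path.join(work_tree, note_id + '.note'))
          # Backdate the commit to the estimated revision timestamp.
          date = time.strftime(
              '%Y-%m-%d %H:%M:%S', time.localtime(timestamps[revision]))
          environ = dict(
              os.environ, GIT_AUTHOR_DATE=date, GIT_COMMITTER_DATE=date)
          subprocess.check_call(['git', 'add', '--all'], cwd=work_tree)
          # Plain call, not check_call: a commit fails harmlessly
          # whenever the revision turns out to be empty.
          subprocess.call(
              ['git', 'commit', '-m', 'Revision %d' % revision],
              cwd=work_tree, env=environ)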

When I gitified my Tomboy notes as described above, 375 notes existed after the 187th revision, and I was ideally expecting 187 Git commits. Whenever the Tomboy sync cleanup code worked correctly, some historical information was lost, to the point that some commits were discarded as being empty. I ended up with 176 commits, which is a rather satisfying result. The space savings are interesting as well. The Tomboy synchronisation directory was taking 22M; the resulting Git pack uses 1.5M, while the checkout itself uses 1.8M.

While Tomboy built-in synchronisation works live, proper care is needed to stop Tomboy before Git synchronisation, and to start it again afterwards. If one plays it straight and never takes chances about this, there is no reason ever to have a synchronisation problem. As I already use many scripts and tools for moving files between machines while I wander between home and work, it was easy to slip a few more commands into those scripts to make sure Tomboy does not run when it should not be running.
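
For instance, a wrapper along these lines could do (a sketch only: the function name is made up, and whether a plain pkill stops Tomboy gracefully enough is an assumption of mine):

  import subprocess

  def git_synchronise(notes_directory):
      # Make sure Tomboy is not running while Git rewrites its files.
      # pkill returns 1 when no process matched, which is fine here.
      subprocess.call(['pkill', '-x', 'tomboy'])
      subprocess.check_call(['git', 'pull'], cwd=notes_directory)
      subprocess.check_call(['git', 'push'], cwd=notes_directory)
      # Restart Tomboy, detached, once the tree is settled again.
      subprocess.Popen(['tomboy'])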

As the cleanup bug will likely be corrected in Tomboy some day, the whole trickery above, however fun it may be, might not work for long ☺. However, I'll continue using Git to store history, do synchronisation, and even knowingly let conflicts get in, given the powerful machinery I now have for resolving them.

Friday, July 10, 2009

Away from os.path.walk

Quite a number of times over the years, I have needed to explore directory hierarchies in my programming. When I switched to Python (it was version 1.5.2 then), os.path.walk was the documented way to do it, and I surely used it a lot. Yet the functional argument was sometimes awkward to fit to the exact processing I needed for each particular directory walk. The impedance mismatch between the functional argument and the processing at hand has been painful more than once (the From Cfengine to Python case comes to memory as I write, I do not know why this one in particular).

Walking big directory trees has never been inordinately slow in Python, as it is I/O bound overall. This I/O may well be dominated by stat calls, checking whether a particular directory entry is a sub-directory or another type of file. In Unix, each directory is pointed to by its parent directory, has a . entry pointing at itself, and is pointed to by the .. entries of all its immediate sub-directories; these all contribute to the link count of that directory (the root directory has no parent, however). GNU find uses this fact to optimize out stat calls: it saves the link count and monitors the number of sub-directories seen so far; as soon as enough sub-directories have been seen to explain the link count, it safely assumes that the remaining entries hold no further directories, without any need to stat them. Of course, stat may be needed for a lot of other checks within find, in which case the above optimization is useless. But in the most frequent use, by far, the optimization applies well. I discussed the thing with Guido and implemented it for os.path.walk, but finally never officially submitted the patch, as for some reason I never figured out, the patch did not yield the spectacular improvement I expected ☺.
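
For the curious, the trick goes along these lines in Python (a sketch of the idea only, assuming a traditional Unix filesystem where a directory's link count is two plus its number of sub-directories; the real find code is certainly more careful):

  import os

  def subdirectories(directory):
      # A directory's link count is one for its name in the parent,
      # one for its own '.' entry, and one '..' entry per child
      # directory.  Once that many children have been seen, the
      # remaining entries may be assumed to be non-directories,
      # sparing one stat call for each of them.
      remaining = os.stat(directory).st_nlink - 2
      for base in os.listdir(directory):
          if remaining == 0:
              break
          name = os.path.join(directory, base)
          if os.path.isdir(name) and not os.path.islink(name):
              remaining -= 1
              yield name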

The later os.walk, though undoubtedly nicer than os.path.walk, was still a bit insufficient for my needs. It offers more control over some aspects of the walking, while losing others. I rather see os.walk as part of the initial effort to push iterators into the Python library and into all Python programmers' minds. So I finally gave up completely on the provided wrappers, and instead started rewriting the walking part explicitly within each application. It takes only a few lines of code:


  import os

  stack = [top_directory]
  while stack:
      directory = stack.pop()
      for base in os.listdir(directory):
          name = os.path.join(directory, base)
          if os.path.isdir(name):
              if not os.path.islink(name):
                  stack.append(name)
          else:
              pass  # File processing part


(Note: the check for a symbolic link avoids an endless loop when a link points back to a directory higher in the hierarchy. Thanks to Al Danial for pointing out this potential problem.)

These few lines may be torn and bent at will, depending on the particular need (pre-order, post-order, anything in between or elsewhere, you name it), much more easily than if implicit recursion were used: the recursion stack is explicit here. The advantage is especially clear when writing generators: if the file processing part contains any yield statement, implicit recursion gets cumbersome and expensive, having to chain nested yield loops just to deliver results at the outer level.
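
To make the point concrete, here is the same walk folded into a generator (walk_files is my name for it); a single yield does the job, with no nested yield loops:

  import os

  def walk_files(top_directory):
      # The explicit stack replaces the recursion: files are yielded
      # as they are met, directories are pushed for a later turn.
      stack = [top_directory]
      while stack:
          directory = stack.pop()
          for base in os.listdir(directory):
              name = os.path.join(directory, base)
              if os.path.isdir(name):
                  if not os.path.islink(name):
                      stack.append(name)
              else:
                  yield name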

Now, let's presume the file processing part delivers diagnostics of some sort, prefixed by the directory name. Humans like output to be sorted, because the eyes are usually the search tool. Besides, output is more entertaining when sorted: one gets a better feeling of progression than with a jumble of all-mixed-up directory names, which psychologically looks as if it will never finish.

One solution is to accumulate all output first, sort it, and only then display it. While conceptually simple, this is not perfect. It is usually not a real problem that it spoils the memory savings of using a generator, but it surely defeats the elegance of using one. We also lose the entertainment of a more progressive, parallel output.
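
In code, given the walk_files generator above and some hypothetical diagnose function, the accumulate-first approach is merely:

  for line in sorted(diagnose(name) for name in walk_files(top_directory)):
      print(line)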

There is only one way to keep the advantages, and that is to walk directories directly in sorted order. This means that appending names at the end of a stack, and popping from the end of the stack, is too naive. On the other hand, since the above loop is often rewritten, it should nevertheless stay simple. Here is one easy way to do so:


  import os

  stack = [top_directory]
  while stack:
      directory = stack.pop(0)
      for base in sorted(os.listdir(directory)):
          name = os.path.join(directory, base)
          if os.path.isdir(name):
              if not os.path.islink(name):
                  stack.append(name)
          else:
              pass  # File processing part


There are only two modifications: we pop from the start of the stack instead of from its end, and we sort the bases separately for each directory. Doing so, we append in a kind of sorted order. But this still has a few disadvantages: the whole stack gets shifted at each removal, the sorted function produces copies, and the sort is not exactly what one wants: it is a bit like being restarted at each directory level, exploring breadth-first.

I attempted several remedies to these drawbacks over the years. To give only one example, I once saved directories in an intermediate list, reverse-sorting it before appending it to the stack, so I could still pop from the end of the stack, do smaller sorts, sometimes spare sorts altogether depending on the precise output I knew I was to get, and God knows what. But deep down, this is all annoying optimization complexity.
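
For the record, that reverse-sorting variant might look like this sketch:

  import os

  stack = [top_directory]
  while stack:
      directory = stack.pop()
      subdirectories = []
      for base in os.listdir(directory):
          name = os.path.join(directory, base)
          if os.path.isdir(name):
              if not os.path.islink(name):
                  subdirectories.append(name)
          else:
              pass  # File processing part
      # Reverse-sort only the sub-directories, so that popping from
      # the end of the stack still delivers them in sorted order.
      stack.extend(sorted(subdirectories, reverse=True))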

Here is the last trick I recently found in this series, and yet another application of priority queues (the documentation of which amusingly recycles another article of mine! ☺). As I still feel it now, it is surprisingly elegant. Enough so, at least, that I share it with my readers:


  import os
  from heapq import heappop, heappush

  stack = [top_directory]
  while stack:
      directory = heappop(stack)
      for base in os.listdir(directory):
          name = os.path.join(directory, base)
          if os.path.isdir(name):
              if not os.path.islink(name):
                  heappush(stack, name)
          else:
              pass  # File processing part


It seems to work! Each generated name is produced by suffixing the last directory extracted from the priority queue; being lexicographically after it, the name can be inserted back into the heap within the same run (in sorting terminology). So this is a nice, simple compromise between both snippets above. Instead of shifting the whole stack at each removal, we rearrange only a logarithmic number of elements. Instead of doing a whole sort at each directory level, we spread a single heap sort over all directories. A real nice effect is that the traversal comes out as a lexicographical depth-first one. The optimization complexity wholly vanishes. Moreover, the code is not essentially longer nor different from the simplest one which comes to mind. As a result, I find it quite easy to remember.
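
A quick way to convince oneself is to run the loop over a small throw-away tree (the names below are arbitrary): directories come out depth-first in lexicographical order, while files within a single directory keep their os.listdir order.

  import os
  import tempfile
  from heapq import heappop, heappush

  # Build a tiny disposable tree, then observe the traversal order:
  # alpha/a.txt, then alpha/one/b.txt, then beta/two/c.txt.
  top_directory = tempfile.mkdtemp()
  for relative in 'alpha/one', 'beta/two':
      os.makedirs(os.path.join(top_directory, relative))
  for relative in 'alpha/a.txt', 'alpha/one/b.txt', 'beta/two/c.txt':
      open(os.path.join(top_directory, relative), 'w').close()

  stack = [top_directory]
  while stack:
      directory = heappop(stack)
      for base in os.listdir(directory):
          name = os.path.join(directory, base)
          if os.path.isdir(name):
              if not os.path.islink(name):
                  heappush(stack, name)
          else:
              print(name)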

Afterthoughts


(These come from a recent conversation with Guido on the above.)

I did not time the above loops, but I'm pretty sure the improvement is more on the side of conceptual elegance. sort() is very fast, to the point that one may easily abuse it without paying a penalty, unless the list is really big; and as I perceive it, sorted() is not so different from list() efficiency-wise. Unless sort is used in the bottleneck of a computation, there is no compelling reason not to use it liberally. I would have a hard time advocating that, for mere efficiency considerations, heapq is really the best approach.

In practice, if we keep a stack of unprocessed directories, that stack will be rather small on average, so popping the first element is likely faster than heap-popping it, even if it implies N operations rather than log(N). When N is small, log(N) is not very different from N, and mere shifting does not involve comparisons.

I said that there is a single heap sort over all directories, rather than one sort per directory, as an argument that we save processing. This is debatable. Each per-directory sort handles a small number of entries, and calling sort K times on N / K entries each time is roughly the same effort as calling sort once on N entries (K such sorts cost about N log(N / K) comparisons overall, against N log N for a single sort). However, the heap sort only sees directories, while the per-directory sorts handle both files and directories.

For this discussion, I was interested in sorting directories, not files. The precise case I had when I wrote this was to produce one .gitignore file per project out of all .cvsignore files within that project, for many projects held in a single big directory hierarchy. So there was not much to diagnose about non-directories. If files ought to be sorted as well, per-directory sorts are still required, and then the elegance benefits of heapq fade out a bit.