Introduction to Sumatra – Automated tracking of scientific computations

I gave a seminar on how I use Sumatra as an “automated lab-notebook” for my computational research. These slides are a tutorial on Sumatra and detail how to get started:


Matplotlib figures with Helvetica labels – helvet vs. tgheros

I recently spent a frustrating amount of time figuring out a good way to have labels and text in the Helvetica font in Matplotlib. Here’s what I found:

The figure


is generated by the following code:

import pylab as pl
import numpy as np

from matplotlib import rc

rc('text', usetex=True)
pl.rcParams['text.latex.preamble'] = [
    r'\usepackage{tgheros}',    # helvetica font
    r'\usepackage{sansmath}',   # math-font matching helvetica
    r'\sansmath',               # actually tell tex to use it!
    r'\usepackage{siunitx}',    # micro symbols
    r'\sisetup{detect-all}',    # force siunitx to use the fonts
]

pl.figure(1, figsize=(6,4))
ax = pl.axes([0.1, 0.1, 0.8, 0.7])
t = np.arange(0.0, 1.0+0.01, 0.01)
s = np.cos(2*2*np.pi*t)+2
pl.plot(t, s)

pl.xlabel(r'time $\lambda$')
pl.ylabel(r'\textit{voltage (mV)}',fontsize=16)
pl.title(r'\TeX\ Number 1234567890 anisotropic ' + 
         r'$\displaystyle\sum_{n=1}^\infty' +
         r'\frac{-e^{i\pi}}{2^n}$' + 
         r' and \SI{3}{\micro\metre}',fontsize=16, color='r')


To achieve a consistent Helvetica font in the figure, Matplotlib’s LaTeX rendering of labels and text is used. This setup is adapted from a StackExchange answer by Paul H.
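The preamble above is passed as a list of strings. A minimal sketch for newer Matplotlib releases, which to my knowledge expect text.latex.preamble to be a single string, joins the same lines first. The variable names are made up for this example, and the Matplotlib import is guarded only so that the snippet also runs where Matplotlib is not installed.

```python
# Same LaTeX packages as in the listing above, joined into one string.
preamble_lines = [
    r'\usepackage{tgheros}',    # helvetica font
    r'\usepackage{sansmath}',   # math font matching helvetica
    r'\sansmath',               # actually tell tex to use it
    r'\usepackage{siunitx}',    # micro symbols
    r'\sisetup{detect-all}',    # force siunitx to use the fonts
]
latex_preamble = '\n'.join(preamble_lines)

try:
    import matplotlib
    matplotlib.rcParams['text.usetex'] = True
    matplotlib.rcParams['text.latex.preamble'] = latex_preamble
except ImportError:
    # Matplotlib is not available; the preamble string is still built.
    pass
```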

Most importantly, I use the tgheros instead of the helvet package. You can see the difference in the images below:


The bottom rendering, using helvet, has some baseline alignment problems: the “i” is positively floating in the air, and the jump between “t” and “r” is particularly noticeable.

The sansmath package provides the math font that correctly matches Helvetica, as pointed out here.

Finally, siunitx is used for the correct display of measurements in SI units. Here, for example, 3 micrometres are rendered in the title. Don’t make the mistake of putting the siunitx expression in a math environment, or the fonts won’t work!
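To illustrate the last point, here are two label strings. The names title_ok and title_broken and the word "distance" are made up for this example. The first keeps \SI in text mode, as in the title above; the second wraps it in $...$, which is the case to avoid.

```python
# \SI in text mode, as used in the figure title.
title_ok = r'distance \SI{3}{\micro\metre}'

# \SI wrapped in a math environment: the case the text warns against.
title_broken = r'distance $\SI{3}{\micro\metre}$'
```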

Python: File operations & Data parsing lecture

In the latest edition of the Introduction to Scientific Programming in Python lecture at my institution, I held a tutorial on “File operations & Data parsing”. The lecture slides are below:

You can find the LaTeX source code for the slides on GitHub.

Setting up Emacs with Compiz

I recently spent some time setting up GNU Emacs 24.3.1 properly in my Ubuntu 12.04/14.04 environments with the Compiz window manager. Here I describe how to

  1. start Emacs in a specified position and size, and
  2. set focus (activate) Emacs in the current workspace

for the Compiz window manager. Note that if you’re using another window manager (find out by typing wmctrl -m in a terminal), most of the tips in this post will be useless to you.

Start Emacs in a specified position and size

Personally, I strictly keep Emacs on the right side of my screen and try to adjust the width to 80 columns. Emacs size can be controlled on start-up by passing the right set of parameters. In my setup I use

emacs24 -fh

to start in full height; the window is 80 columns wide by default.

To start in a specified position, I used the CompizConfig Settings Manager. It is easily installed via

sudo apt-get install compiz compizconfig-settings-manager

The manager lets you set start-up window positions for classes of applications. In Place Windows, select the tab Fixed Window Placement. When adding a new fixed position for a class, note the grab functionality to easily extract the window class name.



Activate Emacs in the current workspace

The command line tool wmctrl gives you (almost) everything you need to set the focus on your Emacs window. If you only ever have a single Emacs window open, all you need is

wmctrl -x -a Emacs

However, I like to have multiple Emacs instances across my workspaces. The solution to activate the Emacs window in the current workspace was given on AskUbuntu by user55822. Here’s the script adapted to activate Emacs:

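#!/bin/bash
# Adapted from user55822 on AskUbuntu

# Window class to activate, as listed by `wmctrl -xl`.
# Assumption: this value is a guess, check the class name on your system.
NAME="emacs24.Emacs24"

# Size of the root window, i.e. of the currently visible viewport
SCREEN_W=$(xwininfo -root | sed -n 's/^  Width: \(.*\)$/\1/p')
SCREEN_H=$(xwininfo -root | sed -n 's/^  Height: \(.*\)$/\1/p')

# Activate every window of class NAME whose top-left corner is on screen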
wmctrl -xlG | awk -v W="$SCREEN_W" -v H="$SCREEN_H" \
       -v NAME="$NAME" \
       '$7==NAME && $3>=0 && $3<W && $4>=0 && $4<H {print $1}' \
       | while read WINS; do wmctrl -ia "$WINS"; done

exit 0
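The filter inside the awk call is the core of the script, so here is a sketch of the same logic in Python. It only parses text in the column layout of `wmctrl -xlG` (id, desktop, x, y, width, height, class, host, title) and does not call wmctrl itself. The function name, the window class and the sample lines are hypothetical.

```python
def windows_on_current_viewport(wmctrl_output, name, screen_w, screen_h):
    """Return ids of windows of class `name` whose top-left corner is on screen.

    Mirrors the awk filter: field 7 is the window class, fields 3 and 4
    are the x and y offsets of the window.
    """
    ids = []
    for line in wmctrl_output.splitlines():
        fields = line.split()
        if len(fields) < 7:
            continue  # skip empty or malformed lines
        win_id, wm_class = fields[0], fields[6]
        x, y = int(fields[2]), int(fields[3])
        if wm_class == name and 0 <= x < screen_w and 0 <= y < screen_h:
            ids.append(win_id)
    return ids


# Hypothetical sample: two Emacs windows, one of them on another viewport.
sample = """\
0x03e00004  0 1200 24   720  1056 emacs24.Emacs24  myhost notes.org
0x04000004  0 3120 24   720  1056 emacs24.Emacs24  myhost paper.tex
0x04200007  0 0    24   1200 1056 gnome-terminal.Gnome-terminal  myhost Terminal
"""

print(windows_on_current_viewport(sample, "emacs24.Emacs24", 1920, 1080))
# prints ['0x03e00004']
```

Windows on other Compiz viewports report offsets outside the screen size, which is why the range check on x and y is enough to single out the window in the current workspace.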

Finally, set keyboard shortcuts to your custom commands/script in the Keyboard settings:


GSoC Update 3 – Data-centric view in Sumatra

This is the fourth entry in a series of posts on my participation in this year’s Google Summer of Code program working on the reproducible research tool Sumatra with mentor Andrew Davison under the mentoring organization INCF.

In my proposal to participate in this year’s Google Summer of Code program I suggested implementing a data-centric view for the reproducible research tool Sumatra. After first establishing the necessary framework for strong process-to-data associations, my aim was to update Sumatra’s web interface to feature an equally powerful and informative data-centric view alongside the current record-centric pages. Users should not only be able to switch conveniently between the views, but also use them in conjunction, finding information in whichever view is currently most useful for their purposes.

Sumatra’s web interface is built on the Django framework. Three models stand out. Project holds information on the current computational project. Details on an executed computation within a project are stored in the Record model, for example the time and duration of the simulation, the parameters used and the version of the code. Information on the input and output data of the computation is stored in the DataKey model; for each of the input and output files a DataKey is created, capturing file meta-information such as file size and content type. Record and DataKey were formerly connected through two ManyToManyFields, input_data and output_data, both defined in the Record class. Through the input_data field, for example,

class Record(models.Model):


    input_data = models.ManyToManyField(DataKey, 

DataKey and Record are connected: querying input_data on a Record smt_record returns the DataKeys that served as its input, while the reverse relationship returns the Records to which a DataKey smt_data provided input.

To strengthen the data-centric view in Sumatra, a subtle yet important change was applied to this logic. Changing the output_data field from a many-to-many relationship to a one-to-many relationship gives a distinct advantage: a DataKey is now related to at most one Record, the one from which it was created. Through this relationship it is easy to query for the creation time of a DataKey:

creation_time = smt_data.output_from_record.timestamp

The one-to-many relationship is implemented with a ForeignKey field, which is now defined in the DataKey model:

class DataKey(models.Model):


    output_from_record = models.ForeignKey('Record', 
          related_name = 'output_data', null = True)
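What this change means at the database level can be sketched with Python's built-in sqlite3 module. This is a simplified stand-in and not Sumatra's actual schema: the table and column names, labels, paths and timestamps are all made up for illustration. Inputs stay many-to-many through a join table, while each data key carries at most one reference to the record that produced it.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE record (id INTEGER PRIMARY KEY, label TEXT, timestamp TEXT);
-- one-to-many: each data key points to at most one producing record
CREATE TABLE datakey (
    id INTEGER PRIMARY KEY,
    path TEXT,
    output_from_record_id INTEGER REFERENCES record(id)  -- may be NULL
);
-- many-to-many: a record can read many data keys and vice versa
CREATE TABLE record_input_data (
    record_id INTEGER REFERENCES record(id),
    datakey_id INTEGER REFERENCES datakey(id)
);
""")

con.execute("INSERT INTO record VALUES (1, 'simulate', '2014-07-01 10:00:00')")
con.execute("INSERT INTO record VALUES (2, 'analyse', '2014-07-02 09:30:00')")
con.execute("INSERT INTO datakey VALUES (1, 'spikes.dat', 1)")      # output of record 1
con.execute("INSERT INTO datakey VALUES (2, 'params.json', NULL)")  # no known producer
con.execute("INSERT INTO record_input_data VALUES (2, 1)")          # record 2 reads spikes.dat

# Creation time of a data key: follow the single foreign key to its record.
creation_time = con.execute(
    "SELECT r.timestamp FROM datakey d JOIN record r "
    "ON r.id = d.output_from_record_id WHERE d.path = ?", ("spikes.dat",)
).fetchone()[0]

# Records that used a data key as input: go through the join table.
used_by = [row[0] for row in con.execute(
    "SELECT r.label FROM record r "
    "JOIN record_input_data ri ON ri.record_id = r.id "
    "JOIN datakey d ON d.id = ri.datakey_id WHERE d.path = ?", ("spikes.dat",)
)]

print(creation_time, used_by)
# prints 2014-07-01 10:00:00 ['analyse']
```

Because the reference to the producing record is a single nullable column, there is no ambiguity about which record's timestamp counts as the creation time.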

Using this updated framework together with the DataTables integration described in the previous post, I was able to create balanced record- and data-centric views in Sumatra. This includes the following four main pages of interest:

  • The listings of data and records,
  • and the detailed views of single data and record entries.

With this new interface, users can easily switch between record- and data-centric views, ultimately allowing them to better track the provenance of their computations with Sumatra.

The implementation of what is described above can be found in the datatable_dev bookmark of my Sumatra fork. Below is a gallery of the new interface:


GSoC Update 2 – DataTable Challenges

This is the third entry in a series of posts on my participation in this year’s Google Summer of Code program working on the reproducible research tool Sumatra with mentor Andrew Davison under the mentoring organization INCF.

Following up on the last post, I have now fully integrated the DataTables plug-in into Sumatra, not only on the data page but for the record and data listings as well. Here are some code snippets that solve some of the challenges that integrating DataTables into Sumatra posed:
Sorting: In the record and data listing we use various ways to display information. For example, filesizes are formatted as “2 MB, 45 KB, …”, while durations are given as “15s, 2h 32min 10s, …”. Despite the formatting, it is highly desirable for the user to sort entries according to this information numerically. The solution in DataTables is to employ the HTML5 data attributes. In the code:

<td class='dataTable_td' id='size-t'

Here {{data|eval_metadata:'size'}} gives the filesize as an absolute numerical value, while the table itself displays the formatted filesize using Django’s built-in filesizeformat filter.
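Why the raw value is needed can be shown in a few lines of Python. The helper format_size below is a rough, hypothetical stand-in for a human-readable size filter and not Django's filesizeformat. Sorting on the displayed strings gives a different order than sorting on the underlying byte counts.

```python
def format_size(num_bytes):
    """Rough human-readable size, for illustration only."""
    for unit in ("bytes", "KB", "MB", "GB"):
        if num_bytes < 1024 or unit == "GB":
            return "%d %s" % (num_bytes, unit)
        num_bytes //= 1024


sizes = [2 * 1024 * 1024, 45 * 1024, 900]
rows = [(size, format_size(size)) for size in sizes]

# Sorting by the text shown in the table compares strings character by character.
by_display = [text for _, text in sorted(rows, key=lambda row: row[1])]
# Sorting by the raw byte count gives the order a user expects.
by_value = [text for _, text in sorted(rows, key=lambda row: row[0])]

print(by_display)  # prints ['2 MB', '45 KB', '900 bytes']
print(by_value)    # prints ['900 bytes', '45 KB', '2 MB']
```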
Word-wrapping: As Sumatra’s record and data listings have a lot of information to display, horizontal space is scarce due to the relatively high number of columns. Table values include long labels and system paths. By default, the standard browsers wrap words (allowing columns to remain narrow) only at spaces (“ ”) and dashes (“-”). Neither underscore (“_”) nor slash (“/”) allows word-wrapping, so columns containing paths grow exceedingly wide. I did not find a way to enable wrapping at these characters globally, but was able to mark break opportunities by applying a custom template filter, which uses the <wbr> tag, to the paths and labels to be displayed:

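from django.utils.safestring import mark_safe

def ubreak(text):
    # Allow the browser to break lines after "_" and "/"
    text_out = text.replace("_", "_<wbr>").replace("/", "/<wbr>")
    return mark_safe(text_out)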
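The effect of the filter is easy to check outside Django. The following sketch applies the same replacements to a made-up path. Unlike the filter above, it escapes the text first with html.escape; that step is an addition of this example, meant as a precaution for untrusted labels, and the name ubreak_plain is hypothetical.

```python
from html import escape


def ubreak_plain(text):
    """Escape text, then allow line breaks after "_" and "/"."""
    safe = escape(text)
    return safe.replace("_", "_<wbr>").replace("/", "/<wbr>")


print(ubreak_plain("/home/user/my_data/run_01.dat"))
# prints /<wbr>home/<wbr>user/<wbr>my_<wbr>data/<wbr>run_<wbr>01.dat
```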

Dynamic DataTable: Finally, DataTables provides a fantastic API to dynamically manipulate the table display. In my current version I’m using column().visible() to dynamically show/hide columns, and further API calls to control the number of entries shown on one page of the table and to turn pages ('previous'/'next') via the arrow keys.

GSoC Update 1 – Developing the Data page

This is the second entry in a series of posts on my participation in this year’s Google Summer of Code program working on the reproducible research tool Sumatra with mentor Andrew Davison under the mentoring organization INCF.

The goal of my Google Summer of Code project is to develop a data-centric approach to provenance capture and display in Sumatra. A first milestone in this endeavour is to create a rich data page, showing provenance and meta-information of a data file used in a recorded computational process. This involves two main challenges: 1) being able to query for the desired information about the data file and 2) displaying that information in a concise and robust manner.

The work in my project so far tackles both points. In a first commit and subsequent pull request (#31) I was able to query for a data file’s “Associated Records”, i.e. provenance information about the processes in which the data item was created and in which it is used as input. Next I updated the display using the jQuery UI Accordions, as they are already used in the record page.

Eventually, Sumatra should feature two equally powerful modes of displaying provenance information: The existing record-centric view and a new, data-centric view developed in this project. For this, an easily accessible listing in the web interface of all data objects involved in the current project is essential. A first idea is to use the existing listing of records and adapt it to show the data items. This display, however, is itself problematic. In the discussion of Issue #167, it is suggested to move away from a <div> listing to a <table> environment.

Researching this issue, I found the jQuery plug-in DataTables. The latest commit (833ead4, bookmark datatable_dev) of my Sumatra fork shows a prototype of how the plug-in can be used to list the record information in the web interface. Also integrating the DataTable view in the data page, I’m happy with the progress of the page so far, as seen from the original to the current status in the gallery below.