GSoC Update 3 – Data-centric view in Sumatra

This is the fourth entry in a series of posts on my participation in this year’s Google Summer of Code program working on the reproducible research tool Sumatra with mentor Andrew Davison under the mentoring organization INCF.

In my proposal for this year’s Google Summer of Code program I suggested implementing a data-centric view in the reproducible research tool Sumatra. After first establishing the necessary framework for strong process-to-data associations, my aim was to update Sumatra’s web interface to feature an equally powerful and informative data-centric view alongside the current record-centric pages. Users should be able not only to switch conveniently between the views, but also to use them in conjunction, finding information in whichever view is currently most useful for their purposes.

Sumatra’s web interface is built on the Django framework. Three models stand out. Project holds information on the current computational project. Details of an executed computation within a project are stored in the Record model, for example the time and duration of the simulation, the parameters used and the version of the code. Information on the input and output data of the computation is stored in the DataKey model; for each input and output file a DataKey is created, capturing file meta-information such as file size and content type. Record and DataKey were formerly connected through two ManyToManyFields, input_data and output_data, both defined in the Record class. Through the input_data field, for example,

class Record(models.Model):

    #...

    input_data = models.ManyToManyField(DataKey, 
                     related_name="input_to_records")

DataKey and Record are connected: While

smt_record.input_data.all()

returns the DataKeys, which served as input for the Record smt_record, through the reverse relationship,

smt_data.input_to_records.all()

returns the Records to which the DataKey smt_data provided input.
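To make the forward and reverse lookups more concrete, here is a minimal sketch of the kind of link table that sits behind a many-to-many relationship, written with Python's built-in sqlite3 module instead of Django. The table names, column names and sample rows are assumptions made for illustration; they are not the schema Django generates for Sumatra.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE record (id INTEGER PRIMARY KEY, label TEXT);
    CREATE TABLE datakey (id INTEGER PRIMARY KEY, path TEXT);
    -- Link table: one row per (record, input file) pair.
    CREATE TABLE record_input_data (
        record_id INTEGER REFERENCES record(id),
        datakey_id INTEGER REFERENCES datakey(id)
    );
    INSERT INTO record VALUES (1, 'run_a'), (2, 'run_b');
    INSERT INTO datakey VALUES (1, 'raw.dat'), (2, 'params.dat');
    INSERT INTO record_input_data VALUES (1, 1), (1, 2), (2, 1);
""")


def inputs_of(record_label):
    """Forward direction: the input files of one record."""
    rows = conn.execute(
        "SELECT d.path FROM datakey d "
        "JOIN record_input_data l ON l.datakey_id = d.id "
        "JOIN record r ON r.id = l.record_id "
        "WHERE r.label = ? ORDER BY d.path", (record_label,))
    return [path for (path,) in rows]


def records_using(path):
    """Reverse direction: the records that took this file as input."""
    rows = conn.execute(
        "SELECT r.label FROM record r "
        "JOIN record_input_data l ON l.record_id = r.id "
        "JOIN datakey d ON d.id = l.datakey_id "
        "WHERE d.path = ? ORDER BY r.label", (path,))
    return [label for (label,) in rows]
```

Here inputs_of plays the role of smt_record.input_data.all() and records_using that of smt_data.input_to_records.all(); both read the same link table from opposite ends.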

To strengthen the data-centric view in Sumatra, I applied a subtle yet important change to this logic. Changing the output_data field from a many-to-many relationship to a one-to-many relationship gives a distinct advantage: at most one Record can now be the record from which a given DataKey was created. Through this relationship it is easy to query for the creation time of a DataKey:

creation_time = smt_data.output_from_record.timestamp

The one-to-many relationship is implemented with a ForeignKey field, now defined in the DataKey model:

class DataKey(models.Model):

    #...

    output_from_record = models.ForeignKey('Record',
          related_name='output_data', null=True)
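Since the field is declared with null=True, a DataKey that was never produced by a recorded computation (an externally supplied input file, say) has no creating Record, and accessing .timestamp on it directly would fail. The following is a minimal sketch of a guarded lookup using plain Python dataclasses as stand-ins for the Django models; the creation_time helper and the sample values are hypothetical and not part of Sumatra.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Record:
    label: str
    timestamp: datetime


@dataclass
class DataKey:
    path: str
    # At most one creating record, mirroring ForeignKey(..., null=True).
    output_from_record: Optional[Record] = None


def creation_time(data):
    """Return the creating record's timestamp, or None if there is no such record."""
    record = data.output_from_record
    return record.timestamp if record is not None else None


# Illustrative sample data only.
run = Record("run_a", datetime(2014, 7, 1, 12, 0))
generated = DataKey("out.dat", output_from_record=run)
external = DataKey("raw.dat")
```

With these stand-ins, creation_time(generated) gives the timestamp of run, while creation_time(external) gives None instead of raising an AttributeError.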

 
Building on this updated framework and on the DataTables integration described in the previous post, I was able to create balanced record-centric and data-centric views in Sumatra. These comprise four main pages of interest:

  • The listings of data and records,
  • and the detailed views of single data and record entries.

With this new interface, users can easily switch between the record-centric and data-centric views, ultimately allowing them to better track the provenance of their computations with Sumatra.

The implementation of what is described above can be found in the datatable_dev bookmark of my Sumatra fork. Below is a gallery of the new interface:

 


GSoC Update 2 – DataTable Challenges

This is the third entry in a series of posts on my participation in this year’s Google Summer of Code program working on the reproducible research tool Sumatra with mentor Andrew Davison under the mentoring organization INCF.

Following up on the last post, I have now fully integrated the DataTables plug-in into Sumatra, not only on the data page but in the record and data listings as well. Here are some code snippets that solve some of the challenges that integrating DataTables into Sumatra posed:
 
Sorting: In the record and data listings we display information in various formats. For example, file sizes are formatted as “2 MB, 45 KB, …”, while durations are given as “15s, 2h 32min 10s, …”. Despite the formatting, users should be able to sort entries numerically by this information. The solution in DataTables is to use HTML5 data attributes. In the code:

<td class='dataTable_td' id='size-t'
    data-sort="{{data|eval_metadata:'size'}}">
  {{data|eval_metadata:'size'|filesizeformat}}
</td>

Here {{data|eval_metadata:'size'}} gives the filesize as an absolute numerical value, while the table itself displays the formatted filesize using Django’s built-in filesizeformat filter.
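To see why a separate sort value is needed, here is a small Python sketch comparing a sort on the formatted text with a sort on the raw byte count. The human_size function is a rough, hypothetical stand-in written for this example; it is not Django's filesizeformat, and the sample sizes are made up.

```python
def human_size(num_bytes):
    """Rough stand-in for a file size formatter (not Django's filesizeformat)."""
    size = float(num_bytes)
    for unit in ("bytes", "KB", "MB", "GB"):
        if size < 1024 or unit == "GB":
            return f"{size:.1f} {unit}"
        size /= 1024


# Each row carries the raw value (what data-sort holds) and the text shown in the cell.
sizes = [2 * 1024 * 1024, 45 * 1024, 900]
rows = [{"raw": s, "shown": human_size(s)} for s in sizes]

# Sorting on the displayed text compares characters, not magnitudes.
by_text = [r["raw"] for r in sorted(rows, key=lambda r: r["shown"])]
# Sorting on the raw value gives the order a user expects.
by_raw = [r["raw"] for r in sorted(rows, key=lambda r: r["raw"])]
```

Sorted by text, the 2 MB file comes before the 900 byte file because "2" precedes "9"; sorted by the raw value, the order is smallest to largest.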
 
Word-wrapping: Sumatra’s record and data listings have a lot of information to display, and with the relatively high number of columns, horizontal space is scarce. Table values include long labels and system paths. By default, browsers wrap text (which allows columns to remain narrow) only at spaces (“ ”) and hyphens (“-”). Neither the underscore (“_”) nor the slash (“/”) permits wrapping, so columns containing paths grow exceedingly wide. I did not find a way to enable wrapping at these characters globally, but was able to mark break opportunities by applying a custom template filter to the paths and labels to be displayed, using the <wbr> tag:

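from django import template
from django.template.defaultfilters import stringfilter
from django.utils.safestring import mark_safe

register = template.Library()

@register.filter
@stringfilter
def ubreak(text):
    # Allow line breaks after underscores and slashes.
    text_out = text.replace("_", "_<wbr>").replace("/", "/<wbr>")
    return mark_safe(text_out)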
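One point to keep in mind with this approach: mark_safe tells Django to skip auto-escaping, so any HTML special characters already present in a label or path would reach the page unescaped. Below is a minimal, dependency-free sketch of the same idea that escapes the text first. The name ubreak_plain is hypothetical, and html.escape from the standard library stands in for Django's own escaping here.

```python
from html import escape


def ubreak_plain(text):
    """Escape HTML special characters, then allow breaks after "_" and "/"."""
    escaped = escape(text)
    return escaped.replace("_", "_<wbr>").replace("/", "/<wbr>")


# A typical path gains a break opportunity after every "_" and "/".
example = ubreak_plain("data/run_01/out.dat")
```

Escaping before inserting the <wbr> tags matters: doing it the other way round would escape the inserted tags as well.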

 
Dynamic DataTable: Finally, DataTables provides a fantastic API for dynamically manipulating the table display. In my current version I’m using column().visible() to dynamically show/hide columns, table.page.len() to control the number of entries shown on one page of the table and table.page('previous'/'next') to enable page turning via the arrow keys.


GSoC Update 1 – Developing the Data page

This is the second entry in a series of posts on my participation in this year’s Google Summer of Code program working on the reproducible research tool Sumatra with mentor Andrew Davison under the mentoring organization INCF.

The goal of my Google Summer of Code project is to develop a data-centric approach to provenance capture and display in Sumatra. A first milestone in this endeavour is to create a rich data page showing provenance and meta-information for a data file used in a recorded computational process. This involves two main challenges: 1) being able to query for the desired information about the data file and 2) displaying that information in a concise and robust manner.

The work in my project so far tackles both points. In a first commit and subsequent pull request (#31) I was able to query for a data file’s “Associated Records”, i.e. provenance information about the processes in which the data item was created and in which it is used as input. Next I updated the display using the jQuery UI Accordions, as they are already used in the record page.

Eventually, Sumatra should feature two equally powerful modes of displaying provenance information: The existing record-centric view and a new, data-centric view developed in this project. For this, an easily accessible listing in the web interface of all data objects involved in the current project is essential. A first idea is to use the existing listing of records and adapt it to show the data items. This display, however, is itself problematic. In the discussion of Issue #167, it is suggested to move away from a <div> listing to a <table> environment.

Researching this issue, I found the jQuery plug-in DataTables. The latest commit (833ead4, bookmark datatable_dev) of my Sumatra fork shows a prototype of how the plug-in can be used to list the record information in the web interface. I have also integrated the DataTables view into the data page, and I’m happy with the progress of the page so far, as seen from the original to the current status in the gallery below.

 

 


GSoC Update 0 – Starting Summer of Code

This is the first entry in a series of posts on my participation in this year’s Google Summer of Code program working on the reproducible research tool Sumatra with mentor Andrew Davison under the mentoring organization INCF.

Summer of Code! In their annual program, taking place for the 10th time since 2005, Google supports students to work with a mentor on a free and open-source project over the summer. My proposal “Data-centric provenance capture with Sumatra” was accepted in March and I’m happy to post a first update to my work on the project here. Sumatra is a tool promoting reproducible research in computational sciences – “a lab notebook for computational projects”. Stumbling upon the software while looking up best practices in computational research, I have come to highly appreciate what the tool can do.

But of course, it can always be better! This is why I wrote to Andrew Davison, the maintainer of Sumatra and now mentor of my GSoC project, about a potential Summer of Code participation back in January. I suggested a stronger connection in Sumatra’s architecture and display between process records and the data generated in these processes.

In my work I’m using Sumatra extensively and had, even before thinking about Summer of Code, written some bash scripts to achieve better data-to-process associations purely through the data and process labels. As the program is not called Summer of Text, let’s look at some code! This is an example of a custom bash script I’m using. It first gets a label string, possibly depending on parameters, and then uses it as the Sumatra label as well as the file label for the plots generated in plot_data.py:

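#!/bin/bash

# Data files to plot, passed as arguments to this script.
inputfiles="$@"

# Label string, possibly depending on parameters.
labelstr=$(python comp/figure_label.py)

# $inputfiles is left unquoted on purpose so that it splits into one argument per file.
smt run --executable=python \
        --main=comp/plot_data.py $inputfiles "$labelstr" \
        --reason="Test graphic" \
        --tag=graphic \
        --label="$labelstr" \
        comp/params/plot_data_params_template.py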

The paths to the data to plot are simply passed as arguments when calling the bash script. You can find a full repository with example usage here. After a good week of coding on the project, I now have a working prototype that displays the associated records of data in the web interface, and I have opened a first pull request. Once it has been reviewed by the maintainer, I hope to use it as a base to expand upon in the coming weeks!
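For readers who prefer Python over bash, the call made by the wrapper script could be assembled as an argument list along the following lines. This is only a sketch: build_smt_command is a hypothetical helper, the sample file names and label are invented, and the options and their order are simply copied from the script above.

```python
def build_smt_command(input_files, label):
    """Return the argument list for the `smt run` call made by the bash script."""
    return [
        "smt", "run",
        "--executable=python",
        "--main=comp/plot_data.py", *input_files, label,
        "--reason=Test graphic",  # one list item, so the space needs no quoting
        "--tag=graphic",
        f"--label={label}",
        "comp/params/plot_data_params_template.py",
    ]


# Hypothetical input files and label, for illustration only.
cmd = build_smt_command(["data/a.dat", "data/b.dat"], "fig_01")
# To actually run it (requires Sumatra): subprocess.run(cmd, check=True)
```

Passing a list to subprocess.run avoids shell word splitting altogether, which is the main pitfall in the bash version.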


More with the next update!