GSoC Update 3 – Data-centric view in Sumatra
Posted: August 20, 2014
This is the fourth entry in a series of posts on my participation in this year’s Google Summer of Code program working on the reproducible research tool Sumatra with mentor Andrew Davison under the mentoring organization INCF.
In my proposal to participate in this year’s Google Summer of Code program I proposed adding a data-centric view to the reproducible research tool Sumatra. After first establishing the necessary framework for strong process-to-data associations, my aim was to update Sumatra’s web interface to feature an equally powerful and informative data-centric view alongside the current record-centric pages. Users should not only be able to switch conveniently between the views, but also use them in conjunction, finding information in whichever view is most useful for their purposes.
Sumatra’s web interface is built on the Django framework. Three models stand out. Project holds information on the current computational project. Details of an executed computation within a project are stored in the Record model, for example the time and duration of the simulation, the parameters used, and the version of the code. Information on the input and output data of the computation is stored in the DataKey model; for each input and output file a DataKey is created, capturing file metadata such as file size and content type. Record and DataKey were formerly connected through two many-to-many fields, input_data and output_data, both defined in the Record class. Through the input_data field, for example,
    class Record(models.Model):
        # ...
        input_data = models.ManyToManyField(DataKey, related_name="input_to_records")
DataKey and Record are connected: the input_data field of the Record smt_record gives the DataKeys that served as its input, while the reverse relationship, input_to_records, gives the Records to which the DataKey smt_data provided input.
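At the database level, a many-to-many field such as input_data corresponds to a join table that can be traversed in both directions. The following is only a minimal sketch of that idea using Python's built-in sqlite3 module instead of Django's ORM; the table names, column names, helper functions, and sample rows are all hypothetical and do not reflect Sumatra's actual schema.

```python
import sqlite3

# In-memory database with made-up tables; not Sumatra's real schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE record (id INTEGER PRIMARY KEY, label TEXT);
    CREATE TABLE datakey (id INTEGER PRIMARY KEY, path TEXT);
    -- Join table standing in for the many-to-many input_data field.
    CREATE TABLE record_input_data (
        record_id INTEGER REFERENCES record(id),
        datakey_id INTEGER REFERENCES datakey(id)
    );
    INSERT INTO record VALUES (1, 'run_a'), (2, 'run_b');
    INSERT INTO datakey VALUES (1, 'params.csv'), (2, 'mesh.dat');
    INSERT INTO record_input_data VALUES (1, 1), (1, 2), (2, 1);
""")


def inputs_of(record_label):
    """Forward direction: data keys that served as input to a record."""
    rows = conn.execute(
        "SELECT d.path FROM datakey d "
        "JOIN record_input_data j ON j.datakey_id = d.id "
        "JOIN record r ON r.id = j.record_id "
        "WHERE r.label = ? ORDER BY d.path",
        (record_label,),
    )
    return [path for (path,) in rows]


def input_to_records(path):
    """Reverse direction: records that a data key provided input to."""
    rows = conn.execute(
        "SELECT r.label FROM record r "
        "JOIN record_input_data j ON j.record_id = r.id "
        "JOIN datakey d ON d.id = j.datakey_id "
        "WHERE d.path = ? ORDER BY r.label",
        (path,),
    )
    return [label for (label,) in rows]


print(inputs_of("run_a"))              # data keys used as input by run_a
print(input_to_records("params.csv"))  # records that consumed params.csv
```

Because the association lives in a separate table, one data key can be linked to any number of records and vice versa, which is appropriate for inputs that are shared between computations.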
To strengthen the data-centric view in Sumatra, I made a subtle yet important change to this logic. Changing the output_data field from a many-to-many relationship to a one-to-many relationship gives a distinct advantage: each DataKey is now related to at most one Record, namely the one from which it was created. This relationship makes it easy to query the creation time of a DataKey:
creation_time = smt_data.output_from_record.timestamp
The one-to-many relationship is implemented with a ForeignKey field, now defined in the DataKey model:
    class DataKey(models.Model):
        # ...
        output_from_record = models.ForeignKey('Record', related_name='output_data', null=True)
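To illustrate why a foreign key makes the creation-time lookup straightforward, here is a rough sketch of the one-to-many layout, again written against Python's built-in sqlite3 module rather than Django. The tables, columns, helper functions, and sample values are assumptions made up for this example; only the names output_from_record and output_data are borrowed from the model above.

```python
import sqlite3

# In-memory database with made-up tables; not Sumatra's real schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE record (id INTEGER PRIMARY KEY, label TEXT, timestamp TEXT);
    -- The nullable foreign key column lives on the data key, so each
    -- data key points to at most one producing record (cf. null=True).
    CREATE TABLE datakey (
        id INTEGER PRIMARY KEY,
        path TEXT,
        output_from_record_id INTEGER REFERENCES record(id)
    );
    INSERT INTO record VALUES (1, 'run_a', '2014-08-01 10:00:00');
    INSERT INTO datakey VALUES
        (1, 'result.h5', 1),
        (2, 'raw_input.csv', NULL),
        (3, 'log.txt', 1);
""")


def creation_time(path):
    """Timestamp of the record that produced a data key, or None."""
    row = conn.execute(
        "SELECT r.timestamp FROM datakey d "
        "LEFT JOIN record r ON r.id = d.output_from_record_id "
        "WHERE d.path = ?",
        (path,),
    ).fetchone()
    return row[0] if row else None


def output_data(record_label):
    """Reverse direction: all data keys produced by one record."""
    rows = conn.execute(
        "SELECT d.path FROM datakey d "
        "JOIN record r ON r.id = d.output_from_record_id "
        "WHERE r.label = ? ORDER BY d.path",
        (record_label,),
    )
    return [path for (path,) in rows]


print(creation_time("result.h5"))      # timestamp of the producing record
print(creation_time("raw_input.csv"))  # None: no producing record stored
print(output_data("run_a"))            # data keys created by run_a
```

Since every data key row carries a single reference, there is no ambiguity about which record created it, whereas a join table would allow several candidate records for the same file.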
Using this updated framework together with the DataTable integration referenced in the previous post, I was able to create balanced record-centric and data-centric views in Sumatra. These comprise the following four main pages of interest:
- The listings of data and records,
- and the detailed views of single data and record entries.
With this new interface, users can easily switch between record-centric and data-centric views, ultimately allowing them to better track the provenance of their computations with Sumatra.
The implementation of what is described above can be found in the datatable_dev bookmark of my Sumatra fork. Below is a gallery of the new interface: