Scalable Software Services for Life Science

Standards for Data Exchange and Management

About

We aim to develop and promote standards for both the storage and the exchange of the ever-increasing amount of simulation data in life science. This will include XML representations of input/output file formats for simulation data, standardized compressed file formats for simulation data, and high-level APIs.

Our current activities are:

  • Analysing requirements on data storage and exchange formats in life science
  • Formulating file format standards for complex molecular data and job descriptions
  • Developing open application programming interfaces (APIs) for molecular modelling
  • Setting up automatic procedures for the preparation and analysis of MD simulations
  • Developing a database structure for storing trajectories

Work in progress

  • Data Storage System
    • Molecular dynamics trajectories constitute one of the most challenging problems in terms of data storage and transmission in bioinformatics. The Scalalife project is working towards a database infrastructure for biomolecular simulations. To facilitate searches on stored trajectories, and the design and storage of analysis data, a second-level database is being discussed and proposed in the Competence Center.

  • Standard File Format for Molecular Dynamics
    • The definition of data file formats for Molecular Dynamics trajectories will help us to rationalize simulation results and secondary data analysis. The Competence Center is our channel for connecting the community, developers and the Scalalife project. There you will find the current status of the work, from preliminary discussions to future XML schema specifications.

  • UMM File Format
    • The UMM (Unified Molecular Modeling) file format specifies descriptions of molecular systems and associated simulation procedures. The format uses a simplified XML-like syntax that aims to be human-readable. The file is processed by a converter script that generates a standards-compliant XML file, which is used as the actual input. The latter is more "computer-readable" and can be easily processed by standard XML libraries.
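As a rough illustration of the last point, the sketch below reads a small XML document with Python's standard xml.etree.ElementTree module. The element and attribute names (simulation, system, step, iterations, time_ps) and the file name 1abc.pdb are hypothetical, chosen only for this example; they are not taken from the UMM schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: every element and attribute name below is invented
# for illustration and is NOT taken from the UMM schema.
EXAMPLE = """<simulation name="demo">
  <system source="1abc.pdb"/>
  <step type="minimization" iterations="500"/>
  <step type="equilibration" time_ps="100"/>
</simulation>"""


def list_steps(xml_text):
    """Return (type, parameters) for every <step> element in document order."""
    root = ET.fromstring(xml_text)
    steps = []
    for step in root.iter("step"):
        # Everything except the step type is treated as a parameter.
        params = {k: v for k, v in step.attrib.items() if k != "type"}
        steps.append((step.get("type"), params))
    return steps


for step_type, params in list_steps(EXAMPLE):
    print(step_type, params)
```

Any standards-compliant XML file can be traversed in the same way, whatever its actual schema turns out to be.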

  • TNG File Format for binary trajectory data

    • The TNG (Trajectory Next Generation) file format is a container-type format that supports storage of different types of payload and different levels of compression, including temporal compression. It is intended to be used as the default standard for storage of molecular simulation data.
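To give an idea of what temporal compression means, the sketch below quantizes coordinates to a fixed precision, stores only the frame-to-frame differences, and passes the result to a general-purpose compressor. This is a generic illustration and not the algorithm used by TNG; the function names, the 32-bit integer layout and the default precision are assumptions made for the example.

```python
import struct
import zlib


def pack_frames(frames, precision=0.001):
    """Quantize coordinates, keep frame-to-frame differences, then deflate."""
    quantized = [[round(x / precision) for x in frame] for frame in frames]
    deltas = list(quantized[0])  # first frame is stored as-is
    for prev, cur in zip(quantized, quantized[1:]):
        deltas.extend(c - p for p, c in zip(prev, cur))
    raw = struct.pack(f"<{len(deltas)}i", *deltas)  # little-endian int32
    return zlib.compress(raw)


def unpack_frames(blob, n_values, precision=0.001):
    """Invert pack_frames; n_values is the number of coordinates per frame."""
    raw = zlib.decompress(blob)
    deltas = struct.unpack(f"<{len(raw) // 4}i", raw)
    frames, prev = [], [0] * n_values
    for start in range(0, len(deltas), n_values):
        # Add the stored differences to the previous (quantized) frame.
        cur = [p + d for p, d in zip(prev, deltas[start:start + n_values])]
        frames.append([q * precision for q in cur])
        prev = cur
    return frames


frames = [[0.0, 1.0, 2.0], [0.001, 1.002, 1.999], [0.003, 1.004, 1.998]]
blob = pack_frames(frames)
restored = unpack_frames(blob, n_values=3)
```

Because atoms move little between consecutive frames, the differences are small integers that compress far better than the raw coordinates. The price is a loss of accuracy bounded by half the chosen precision.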

  • MDWeb: The Automatic Input Generator
    • The web portal MDWeb provides a friendly environment to set up new systems, run test simulations and perform analysis within a guided interface. MDWeb currently supports the ScalaLife Molecular Dynamics applications GROMACS and AMBER, and can also prepare and launch MD runs with NAMD. It provides users with a personal workspace where intermediate data, trajectories and analysis results can be stored. Registration is free but necessary to maintain a permanent workspace. The primary entry is a structure (uploaded or obtained from PDB) for setup, or a trajectory for analysis. The input structure or trajectory acts as the root of a tree to which new sets of data are added according to the operations performed. Results of trajectory analysis are presented as alphanumerical values, 2D plots or Jmol-based 3D visualizations, as appropriate.

  • NAFlex: A web interface for the study of Nucleic Acids Flexibility
    • NAFlex offers a variety of methods to explore nucleic acid flexibility, from a colour-less worm-like chain model to a base-pair resolution elastic model of flexibility and even atomistic molecular dynamics (MD) simulation. Within the MD framework, NAFlex uses the MDWeb platform to perform the complete set-up of the simulation (structural validation and correction, solvation, minimization, thermalization, pre-equilibration, and equilibration) following well-tested procedures. Simulations can be prepared to launch GROMACS, NAMD or AMBER calculations with any of the commonly used force fields and solvent models.

  • UMM-MoDEL: Structure of the Database with Input/Output interface
    • The UMM-MoDEL environment is a more user-friendly version of the MoDEL database infrastructure, with an input/output end-user interface, and is compatible with the Scalalife UMM XML format.

  • PCASUITE: A tool to compress Molecular Dynamics trajectories using Principal Components Analysis
    • One of the main obstacles to wider use of Molecular Dynamics is its potentially large trajectory files. State-of-the-art simulations on the high nanosecond time scale can easily span several GB, especially for large systems. Traditional general-purpose compression algorithms like LZW have been used to reduce the required space, but they usually do not perform well with this kind of data. However, trajectory data is not random. It follows patterns with well-defined meaning that can be exploited for data compression. In particular, higher-frequency movements can be discarded without affecting the overall dynamics of the system. Principal Component Analysis is one of the most widely used techniques to capture the essential movements of a macromolecular system. It implies a change of coordinate space, where reference eigenvectors are chosen according to the amount of system variance they explain. The aim is to select the minimum number of reference coordinates that explain a given amount of system variance. The technique lets the user select the degree of fidelity to the original trajectory. If all eigenvectors are kept, there is no change in the accuracy of the trajectory. Removing the eigenvectors with the lowest explained variance, however, has little effect on the overall behavior of the system but a remarkable effect on the size of the data.
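The idea can be sketched in a few lines of Python. The code below is a minimal illustration of PCA-based compression and not the PCASUITE implementation: it uses only the standard library, finds the leading eigenvectors by simple power iteration with deflation (a production tool would rely on an optimized linear algebra library), and all function names are assumptions made for the example. Each frame is a flat list of coordinates.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def leading_eigenvectors(matrix, k, iterations=200, seed=0):
    """Up to k leading eigenvectors of a symmetric matrix (power iteration)."""
    d = len(matrix)
    work = [row[:] for row in matrix]  # deflated copy, input stays untouched
    rng = random.Random(seed)
    vectors = []
    for _ in range(k):
        v = [rng.uniform(-1.0, 1.0) for _ in range(d)]
        for _ in range(iterations):
            w = [dot(row, v) for row in work]
            norm = math.sqrt(dot(w, w))
            if norm < 1e-12:
                return vectors  # no variance left to explain
            v = [x / norm for x in w]
        eigenvalue = dot(v, [dot(row, v) for row in work])
        vectors.append(v)
        for i in range(d):  # deflation: remove the component just found
            for j in range(d):
                work[i][j] -= eigenvalue * v[i] * v[j]
    return vectors


def compress(frames, k):
    """Return the mean frame, k eigenvectors and per-frame projections."""
    n, d = len(frames), len(frames[0])
    mean = [sum(f[j] for f in frames) / n for j in range(d)]
    centered = [[f[j] - mean[j] for j in range(d)] for f in frames]
    cov = [[sum(c[i] * c[j] for c in centered) / n for j in range(d)]
           for i in range(d)]
    vectors = leading_eigenvectors(cov, k)
    coeffs = [[dot(c, v) for v in vectors] for c in centered]
    return mean, vectors, coeffs


def decompress(mean, vectors, coeffs):
    """Rebuild approximate frames from the stored projections."""
    return [[m + sum(a * v[j] for a, v in zip(frame_coeffs, vectors))
             for j, m in enumerate(mean)]
            for frame_coeffs in coeffs]


# Toy trajectory: 6 frames of 4 coordinates that all move along one direction.
frames = [[t * 1.0, t * 2.0, 0.5, -t * 1.0] for t in range(6)]
mean, vectors, coeffs = compress(frames, k=1)
restored = decompress(mean, vectors, coeffs)
print(max(abs(a - b) for f, r in zip(frames, restored) for a, b in zip(f, r)))
```

For n frames of d coordinates, the compressed form stores d + k·d + n·k numbers instead of n·d, so the saving grows with the length of the trajectory. In this toy case a single eigenvector captures all of the motion; for real trajectories k controls the trade-off between file size and fidelity.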

  • COPERNICUS: a framework for ensemble and distributed computing

    • Copernicus is a platform that enables large-scale computing to be defined as a workflow. The platform takes care of breaking the work down into tasks and distributing them to the available compute resources, all in a secure and fault-tolerant manner. Its overlay P2P network can utilize a wide variety of heterogeneous compute resources such as desktops, clusters and cloud compute instances, and automatic resource awareness makes sure to use the best resources for the defined job.

  • SNAP-parallel: Parallel scalable Implementation of the SNAP software package on a hybrid cloud

    • In this task, a framework has been developed that allows the software package SNAP to be executed in a scalable way. The software is a prototypical implementation of a weakly coupled, robust, self-healing and architecture-agnostic framework for scalable software services. It is implemented in R, a free and open-source language, together with the NoSQL database Redis, the interface package doRedis and the foreach library in R. The separation of job management and processing, as implemented in the package doRedis, allows different machines to be used for processing a large task.
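The separation of job management and processing can be illustrated with a small work-queue sketch. It is written in Python rather than R, and an in-process queue with threads stands in for the Redis server and the remote worker machines; it shows the general pattern only and is not the SNAP-parallel or doRedis code. The function name run_jobs and its parameters are assumptions made for the example.

```python
import queue
import threading


def run_jobs(tasks, func, n_workers=3):
    """Put tasks on a shared queue; independent workers pull and process them."""
    todo = queue.Queue()
    results = {}
    lock = threading.Lock()

    # Job management: enqueue every task with an id so order can be restored.
    for task_id, payload in enumerate(tasks):
        todo.put((task_id, payload))

    # Processing: each worker only knows the queue, not the other workers.
    def worker():
        while True:
            try:
                task_id, payload = todo.get_nowait()
            except queue.Empty:
                return  # no work left
            value = func(payload)
            with lock:
                results[task_id] = value

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [results[i] for i in range(len(tasks))]


squares = run_jobs([1, 2, 3, 4], lambda x: x * x)
```

Because workers only pull from the queue, more of them can be added or removed without changing the code that submits the tasks, which is what makes this kind of design weakly coupled.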

  • Virtual Screening based on MD-Ensemble Docking

    • Software interface for Virtual Screening experiments based on MD-Ensemble docking. The interface will allow non-experts to prepare computing-intensive experiments to screen large collections of hundreds of thousands of chemical compounds. The interface will accept rigid structures of MD simulations, selecting the most representative subsets for docking experiments. Thanks to Scalalife’s data standards, it is possible to integrate MDWeb to generate MD simulations from rigid receptor structures. This interface will also access the MoDEL database through UMM.