Completed Project: DASPOS: Data and Software Preservation for Open Science

PIs: Michael Hildreth, Robert Gardner, Douglas Thain, Mark Neubauer, Jaroslaw Nabrzyski

The DASPOS project (Data and Software Preservation for Open Science) was a multi-disciplinary effort among six universities and the Fermi and Brookhaven national laboratories, aimed at understanding the problems of knowledge preservation in data-intensive sciences such as experimental particle physics, astrophysics, and genomics. The team included physicists, digital librarians, computer scientists, and experts from other fields of research. Stated simply, the intellectual goals of the project were the following: (1) determine what must be saved in order to preserve the different aspects of a complicated, multi-step data analysis for reproducibility and re-use; (2) determine how to save these elements so that they can be archived, searched, and re-used; and (3) demonstrate a prototype preservation system that accomplishes this. The research followed several different paths. One focused on how to capture all of the information necessary to re-run a given process, including the operating system, the input data, all of the external database connections, and so on. Several solutions were explored, including some based on tracing the system calls of the process to discover all of its dependencies, and several based on Linux containers. The relative performance of the different techniques was assessed, with the Linux container approach (embodied, for example, by Docker containers) given a slight edge due to its ease of use and available infrastructure. A second aspect of the research was to understand how to describe what is being done in a computational step of an analysis. Such a description is necessary for the material to be searched for and retrieved from an archive, and for another person to understand what was done and re-use some of its elements.
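The dependency-capture idea behind the system-call tracing approach can be illustrated with a simplified sketch. Tools such as Parrot interpose at the kernel boundary; the toy version below (all names invented for illustration) interposes on Python's `open` inside one process: every file the "analysis step" reads is logged, and the log becomes the step's dependency manifest.

```python
# Hypothetical, simplified sketch of dependency capture: log every file an
# analysis step opens, then emit the set of paths as its dependency manifest.
# Real tools like Parrot do this at the system-call level, for any executable.
import builtins
import os
import tempfile

# Create a sample "input dataset" for the step to read.
fd, data_path = tempfile.mkstemp(suffix=".dat")
os.write(fd, b"3.14\n")
os.close(fd)

accessed = set()
_real_open = builtins.open

def tracing_open(file, *args, **kwargs):
    accessed.add(os.fspath(file))       # record the dependency
    return _real_open(file, *args, **kwargs)

builtins.open = tracing_open
try:
    with open(data_path) as f:          # the "analysis step"
        value = float(f.read())
finally:
    builtins.open = _real_open          # restore the untraced open

manifest = sorted(accessed)             # the step's dependency manifest
print(value, manifest)
```

A container-based capture sidesteps this discovery problem by freezing the entire operating system and software stack in an image, which is one reason that approach was easier to use in practice.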
The studies performed on this aspect of the project resulted in several new metadata vocabularies, including one that describes a “computational step” in a complex analysis and one that describes a “detector final state” in High Energy Physics (HEP), the first such description to be recorded. In collaboration with the IT and SIS groups at CERN, we have been involved in building the CERN Analysis Preservation Portal (CAP) and the REANA analysis platform, both of which incorporate DASPOS research and represent the achievement of the original goals of the proposal. CAP allows individual researchers to store a wealth of pertinent information about their analyses, some of it collected automatically from their LHC experiment. Executables, scripts, and data can also be stored. In particular, individual computational steps can be described and captured, currently using container technology; the metadata description used to archive the information is based on the DASPOS work. The REANA analysis back-end is able to re-assemble complete analysis workflows from the archived information and re-instantiate them using workflow engines implemented by the DASPOS and CERN teams. The required infrastructure is quite generic and includes many commodity elements that can orchestrate container-based applications on distributed high-throughput computing systems. We have demonstrated the functionality of this system using sample analyses from the LHCb, ATLAS, and CMS experiments at the LHC: analyses preserved in the CAP portal can be re-run inside the REANA infrastructure and produce results identical to the original processing.
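The re-instantiation step above can be sketched as a minimal serial workflow engine: read an archived description of an analysis and re-run its steps in order. The step schema here is invented for illustration and is not the CAP/REANA format; in REANA each step would also name a container image in which its command is executed, whereas this self-contained sketch runs the commands directly.

```python
# Hypothetical sketch of a serial workflow engine in the spirit of the REANA
# back-end: re-run archived analysis steps in order and collect their outputs.
# The workflow schema is invented for this example.
import subprocess
import sys

archived_workflow = {
    "name": "toy-analysis",
    "steps": [
        # A real engine would launch each command inside the container image
        # archived with the step; here we run plain subprocesses instead.
        {"name": "generate", "command": [sys.executable, "-c", "print(6*7)"]},
        {"name": "summarize", "command": [sys.executable, "-c", "print('done')"]},
    ],
}

def run_workflow(spec):
    """Re-instantiate each archived step and collect its standard output."""
    results = {}
    for step in spec["steps"]:
        proc = subprocess.run(step["command"], capture_output=True,
                              text=True, check=True)
        results[step["name"]] = proc.stdout.strip()
    return results

outputs = run_workflow(archived_workflow)
print(outputs)
```

Reproducibility then amounts to the property demonstrated with the LHC sample analyses: re-running the archived steps yields outputs identical to the original processing.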

Related Publications

  1. Reproducibility in Scientific Computing
    Peter Ivie and Douglas Thain
    ACM Computing Surveys, 2018
    doi: 10.1145/3186266
  2. PRUNE: A Preserving Run Environment for Reproducible Computing
    Peter Ivie and Douglas Thain
    In IEEE Conference on e-Science, 2016
    doi: 10.1109/eScience.2016.7870886
  3. Conducting Reproducible Research with Umbrella: Tracking, Creating, and Preserving Execution Environments
    Haiyan Meng, Douglas Thain, Alexander Vyushkov, Matthias Wolf, and Anna Woodard
    In IEEE Conference on e-Science, 2016
    doi: 10.1109/eScience.2016.7870889
  4. A First Look at Reproducibility and Non-Determinism in CMS Codes and ROOT Data
    Peter Ivie, Charles (Chao) Zheng, and Douglas Thain
    2016
  5. An Analysis of Reproducibility and Non-Determinism in HEP Software and ROOT Data
    Peter Ivie, Charles (Chao) Zheng, and Douglas Thain
    In International Conference on Computing in High Energy and Nuclear Physics, 2016
    doi: 10.1088/1742-6596/898/10/102007
  6. A Case Study in Preserving a High Energy Physics Application with Parrot
    Haiyan Meng, Matthias Wolf, Peter Ivie, Anna Woodard, Michael Hildreth, and Douglas Thain
    In Journal of Physics: Conference Series (CHEP 2015), 2015
    doi: 10.1088/1742-6596/664/3/032022
  7. Techniques for Preserving Scientific Software Executions: Preserve the Mess or Encourage Cleanliness?
    Douglas Thain, Peter Ivie, and Haiyan Meng
    In 12th International Conference on Digital Preservation (iPres), 2015
    doi: 10.7274/R0CZ353M