Completed Project: DASPOS: Data and Software Preservation for Open Science
PIs: Michael Hildreth, Robert Gardner, Douglas Thain, Mark Neubauer, Jaroslaw Nabrzyski
The DASPOS project (Data and Software Preservation for Open Science) was a multi-disciplinary effort among six universities and the Fermi and Brookhaven national laboratories. Its aim was to understand the problems associated with knowledge preservation in data-heavy sciences such as experimental particle physics, astrophysics, and genomics. The team included physicists, digital librarians, computer scientists, and experts from other fields of research. Stated simply, the intellectual goals of the project were to (1) determine what needs to be saved in order to preserve the different aspects of a complicated, multi-step data analysis for reproducibility and re-use, (2) determine how to save these elements so that they can be archived, searched, and re-used, and (3) demonstrate a prototype preservation system that could do this.
The research followed several different paths. One aspect focused on how to capture all of the information needed to re-run a given process, including the operating system, the input data, and all of the external database connections. Several solutions were explored, including ones based on tracing the system calls of the process to find all of the necessary dependencies, and several based on Linux containers. The relative performance of the different techniques was assessed, with the Linux container approach (embodied, for example, by Docker containers) given a slight edge due to ease of use and available infrastructure.
A second aspect of the research was to understand how to describe what is done in a computational step of an analysis. Such a description is necessary for the material to be searched for and retrieved from an archive, and for another person to understand what was done and re-use some elements.
The studies performed on this aspect of the project resulted in several new metadata vocabularies, including one that describes a “computational step” in a complex analysis and one that describes a “detector final state” in High Energy Physics (HEP), the first such description to be recorded.
In collaboration with the IT and SIS groups at CERN, we have been involved in building the CERN Analysis Preservation Portal (CAP) and the REANA analysis platform, both of which incorporate DASPOS research and represent the achievement of the original goals of the proposal. The CAP will allow individual researchers to store a wealth of pertinent information about their analysis, some of it collected automatically from their LHC experiment. Executables, scripts, and data can also be stored. In particular, individual computational steps can be described and captured, currently using container technology. The metadata description used to archive the information is based on the DASPOS work. The REANA analysis back-end is able to re-assemble complete analysis workflows based on the archived information and re-instantiate them using workflow engines implemented by the DASPOS and CERN teams. The infrastructure required is quite generic and includes many commodity elements that can orchestrate container-based applications on distributed high-throughput computing systems. We have demonstrated the functionality of this system using sample analyses from the LHCb, ATLAS, and CMS experiments at the LHC. The analyses preserved in CAP can be re-run inside the REANA infrastructure and produce results identical to the original processing.
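As a rough illustration of the system-call tracing idea mentioned in the summary above, the following Python sketch extracts the files a process touched from an strace-style log. It is a minimal sketch under stated assumptions, not the method used by any DASPOS tool: the sample trace, the file paths, the function name `collect_dependencies`, and the short list of recognized calls are all hypothetical and chosen only for illustration.

```python
import re

# Matches strace-style lines such as:
#   openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
# Captures the call name, the first quoted argument (the path),
# and the integer return value. Only a few calls are recognized (assumption).
SYSCALL_RE = re.compile(
    r'^(?:\d+\s+)?(?P<call>open|openat|execve|stat|access)\('
    r'[^"]*"(?P<path>[^"]*)".*\)\s*=\s*(?P<ret>-?\d+)'
)


def collect_dependencies(trace_lines):
    """Return the sorted list of paths that were accessed successfully."""
    paths = set()
    for line in trace_lines:
        match = SYSCALL_RE.match(line.strip())
        if match is None:
            continue  # not one of the calls we track
        if int(match.group("ret")) < 0:
            continue  # failed call: the file is not a real dependency
        paths.add(match.group("path"))
    return sorted(paths)


# Hypothetical trace, written by hand for this example.
SAMPLE_TRACE = """\
execve("/usr/bin/python3", ["python3", "analysis.py"], 0x7ffd /* 20 vars */) = 0
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/opt/missing.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/data/input.root", O_RDONLY) = 4
write(1, "done", 4) = 4
""".splitlines()

if __name__ == "__main__":
    for path in collect_dependencies(SAMPLE_TRACE):
        print(path)
```

A list of paths like this could then be copied into an archive or a container image. A real tool would also need to handle relative paths, child processes, network connections, and many more system calls.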
@article{repro-survey,author={Ivie, Peter and Thain, Douglas},title={{Reproducibility in Scientific Computing}},journal={{ACM Computing Surveys}},volume={51},number={3},year={2018},note={{doi: 10.1145/3186266}},cclpaperid={952},keywords={daspos},}
PRUNE: A Preserving Run Environment for Reproducible Computing
@inproceedings{prune-escience-2016,author={Ivie, Peter and Thain, Douglas},title={{PRUNE: A Preserving Run Environment for Reproducible Computing}},booktitle={{IEEE Conference on e-Science}},year={2016},note={{doi: 10.1109/eScience.2016.7870886}},cclpaperid={930},keywords={workqueue, prune, daspos},}
Conducting Reproducible Research with Umbrella: Tracking, Creating, and Preserving Execution Environments
Haiyan Meng, Douglas Thain, Alexander Vyushkov, Matthias Wolf, and Anna Woodard
@inproceedings{umbrella-escience-2016,author={Meng, Haiyan and Thain, Douglas and Vyushkov, Alexander and Wolf, Matthias and Woodard, Anna},title={{Conducting Reproducible Research with Umbrella: Tracking, Creating, and Preserving Execution Environments}},booktitle={{IEEE Conference on e-Science}},year={2016},note={{doi: 10.1109/eScience.2016.7870889}},cclpaperid={931},keywords={parrot, umbrella, daspos},}
A First Look at Reproducibility and Non-Determinism in CMS Codes and ROOT Data
Peter Ivie, Charles (Chao) Zheng, and Douglas Thain
@techreport{repro-tr-2016,author={Ivie, Peter and Zheng, Charles (Chao) and Thain, Douglas},title={{A First Look at Reproducibility and Non-Determinism in CMS Codes and ROOT Data}},institution={{University of Notre Dame, Computer Science and Engineering Department}},number={2016-01},year={2016},cclpaperid={933},keywords={daspos},}
An Analysis of Reproducibility and Non-Determinism in HEP Software and ROOT Data
Peter Ivie, Charles (Chao) Zheng, and Douglas Thain
In International Conference on Computing in High Energy and Nuclear Physics, 2016
@inproceedings{PAPER936,author={Ivie, Peter and Zheng, Charles (Chao) and Thain, Douglas},title={{An Analysis of Reproducibility and Non-Determinism in HEP Software and ROOT Data}},booktitle={{International Conference on Computing in High Energy and Nuclear Physics}},year={2016},note={{doi: 10.1088/1742-6596/898/10/102007}},cclpaperid={936},keywords={daspos},}
A Case Study in Preserving a High Energy Physics Application with Parrot
Haiyan Meng, Matthias Wolf, Peter Ivie, Anna Woodard, Michael Hildreth, and Douglas Thain
In Journal of Physics: Conference Series (CHEP 2015), 2015
@inproceedings{tauroast-chep-2015,author={Meng, Haiyan and Wolf, Matthias and Ivie, Peter and Woodard, Anna and Hildreth, Michael and Thain, Douglas},title={{A Case Study in Preserving a High Energy Physics Application with Parrot}},booktitle={{Journal of Physics: Conference Series (CHEP 2015)}},year={2015},note={{doi: 10.1088/1742-6596/664/3/032022}},cclpaperid={925},keywords={parrot, daspos},}
Techniques for Preserving Scientific Software Executions: Preserve the Mess or Encourage Cleanliness?
Douglas Thain, Peter Ivie, and Haiyan Meng
In 12th International Conference on Digital Preservation (iPres), 2015
@inproceedings{techniques-ipres-2015,author={Thain, Douglas and Ivie, Peter and Meng, Haiyan},title={{Techniques for Preserving Scientific Software Executions: Preserve the Mess or Encourage Cleanliness?}},booktitle={{12th International Conference on Digital Preservation (iPres)}},year={2015},note={{doi: 10.7274/R0CZ353M}},cclpaperid={921},keywords={parrot, prune, umbrella, daspos},}