Data Intensive Abstractions for High End Biometric Applications
PIs: Douglas Thain and Patrick Flynn. This work is supported by the National Science Foundation under grant CNS-06-21434.
Biometric research requires the execution of very large data intensive batch workloads. To evaluate new matching algorithms, researchers wish to compare thousands of images to each other by brute force. When this sort of workload is submitted to a conventional batch system in the usual way, it induces massive amounts of network and I/O traffic, which results in very poor throughput. How can we execute such large workloads effectively?
To attack this problem, we are introducing data intensive abstractions that allow the user to easily give the system more information about the structure of a workload, so that it can partition the data and execute the workload effectively. An abstraction explicitly specifies the data to be processed, the code that will process it, and the relationship between the two. One example of an abstraction is All-Pairs:
All-Pairs( set S, function F ):
For all Si and Sj in set S, compute: F( Si, Sj )
A computing system with an All-Pairs interface can easily find a more efficient implementation than a demand-paged filesystem. The input data can be staged to the computation nodes by a spanning tree, and the partitioning of work units into jobs can be done according to the performance properties of the system. In this project, we are designing a variety of similar data intensive abstractions that allow for the easy and efficient execution of large scientific workloads.
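As a rough illustration of why the abstraction helps, the following minimal Python sketch runs All-Pairs on a single machine and splits the result matrix into square tiles, each standing in for one batch job. It is not the project's implementation: the names `partition_blocks`, `all_pairs`, and the `block` parameter are hypothetical, and a real system would choose the tile size from measured performance properties and dispatch each tile to a node that already holds the staged input data.

```python
from __future__ import annotations

from itertools import product
from typing import Callable, Iterator, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def partition_blocks(n: int, block: int) -> Iterator[tuple[range, range]]:
    """Yield (rows, cols) tiles that together cover an n x n result matrix.

    The tile size `block` is a hypothetical tuning knob for this sketch.
    """
    for i in range(0, n, block):
        for j in range(0, n, block):
            yield range(i, min(i + block, n)), range(j, min(j + block, n))


def all_pairs(
    items: Sequence[T],
    compare: Callable[[T, T], R],
    block: int = 2,
) -> list[list[R]]:
    """Compute compare(items[i], items[j]) for every i, j, one tile at a time."""
    n = len(items)
    result: list[list] = [[None] * n for _ in range(n)]
    for rows, cols in partition_blocks(n, block):
        # Each tile stands in for one job; here it simply runs in-process.
        for i, j in product(rows, cols):
            result[i][j] = compare(items[i], items[j])
    return result


if __name__ == "__main__":
    # Toy comparison function in place of a biometric matcher.
    scores = all_pairs([1, 2, 3], lambda a, b: a * b, block=2)
    print(scores)
```

Because the caller states the whole set and the function up front, the system (here, just the loop over tiles) is free to decide how the work is grouped, rather than discovering the access pattern one file read at a time.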
Related Publications
Designing Self-Tuning Split-Map-Merge Applications for High Cost-Efficiency in the Cloud
@article{tuning-tcc-2015,author={Rajan, Dinesh and Thain, Douglas},title={{Designing Self-Tuning Split-Map-Merge Applications for High Cost-Efficiency in the Cloud}},journal={{IEEE Transactions on Cloud Computing}},volume={5},number={2},pages={303-316},year={2017},note={{doi: 10.1109/TCC.2015.2415780}},cclpaperid={909},keywords={makeflow, workqueue, hecura},}
A Compiler Toolchain For Data Intensive Scientific Workflows
@thesis{pbui-dissertation.pdf,author={Bui, Peter},title={{A Compiler Toolchain For Data Intensive Scientific Workflows}},editor={Thesis, Ph.D.},booktitle={{University of Notre Dame}},year={2012},cclpaperid={889},keywords={makeflow, hecura}}
Makeflow: A Portable Abstraction for Data Intensive Computing on Clusters, Clouds, and Grids
Michael Albrecht, Patrick Donnelly, Peter Bui, and Douglas Thain
In Workshop on Scalable Workflow Enactment Engines and Technologies (SWEET) at ACM SIGMOD, 2012
@inproceedings{makeflow-sweet12,author={Albrecht, Michael and Donnelly, Patrick and Bui, Peter and Thain, Douglas},title={{Makeflow: A Portable Abstraction for Data Intensive Computing on Clusters, Clouds, and Grids}},booktitle={{Workshop on Scalable Workflow Enactment Engines and Technologies (SWEET) at ACM SIGMOD}},year={2012},note={{doi: 10.1145/2443416.2443417}},cclpaperid={104},keywords={makeflow, hecura}}
Harnessing Parallelism in Multicore Clusters with the All-Pairs, Wavefront, and Makeflow Abstractions
Li Yu, Christopher Moretti, Andrew Thrasher, Scott Emrich, Kenneth Judd, and Douglas Thain
@article{abstr-jcc,author={Yu, Li and Moretti, Christopher and Thrasher, Andrew and Emrich, Scott and Judd, Kenneth and Thain, Douglas},title={{Harnessing Parallelism in Multicore Clusters with the All-Pairs, Wavefront, and Makeflow Abstractions}},journal={{Journal of Cluster Computing}},volume={13},number={3},pages={243-256},year={2010},note={{doi: 10.1007/s10586-010-0134-7}},cclpaperid={83},keywords={makeflow, workqueue, allpairs, wavefront, hecura},}
Abstractions for Cloud Computing with Condor
Douglas Thain and Christopher Moretti
In Cloud Computing and Software Services: Theory and Techniques, 2010
@incollection{abstr-cloudbook,author={Thain, Douglas and Moretti, Christopher},title={{Abstractions for Cloud Computing with Condor}},editor={Ahson, Syed and Ilyas, Mohammad},booktitle={{Cloud Computing and Software Services: Theory and Techniques}},pages={153-171},publisher={CRC Press},year={2010},note={{isbn: 9781439803158}},cclpaperid={78},keywords={workqueue, wavefront, hecura},}
ROARS: A Scalable Repository for Data Intensive Scientific Computing
Hoang Bui, Peter Bui, Patrick Flynn, and Douglas Thain
In The Third International Workshop on Data Intensive Distributed Computing at ACM HPDC 2010, 2010
@inproceedings{roars-didc10,author={Bui, Hoang and Bui, Peter and Flynn, Patrick and Thain, Douglas},title={{ROARS: A Scalable Repository for Data Intensive Scientific Computing}},booktitle={{The Third International Workshop on Data Intensive Distributed Computing at ACM HPDC 2010}},year={2010},note={{doi: 10.1145/1851476.1851587}},cclpaperid={85},keywords={chirp, filesystems, career, hecura, gridfs},}
Weaver: Integrating Distributed Computing Abstractions into Scientific Workflows using Python
Peter Bui, Li Yu, and Douglas Thain
In Challenges of Large Applications in Distributed Environments at ACM HPDC 2010, 2010
@inproceedings{weaver-clade10,author={Bui, Peter and Yu, Li and Thain, Douglas},title={{Weaver: Integrating Distributed Computing Abstractions into Scientific Workflows using Python}},booktitle={{Challenges of Large Applications in Distributed Environments at ACM HPDC 2010}},year={2010},note={{doi: 10.1145/1851476.1851570}},cclpaperid={86},keywords={workqueue, hecura},}
Abstractions for Scientific Computing on Campus Grids
@thesis{moretti-dissertation,author={Moretti, Christopher},title={{Abstractions for Scientific Computing on Campus Grids}},editor={Thesis, Ph.D.},booktitle={{University of Notre Dame}},year={2010},cclpaperid={88},keywords={hecura},}
All-Pairs: An Abstraction for Data Intensive Computing on Campus Grids
Christopher Moretti, Hoang Bui, Karen Hollingsworth, Brandon Rich, Patrick Flynn, and Douglas Thain
IEEE Transactions on Parallel and Distributed Systems, 2010
@article{allpairs-tpds,author={Moretti, Christopher and Bui, Hoang and Hollingsworth, Karen and Rich, Brandon and Flynn, Patrick and Thain, Douglas},title={{All-Pairs: An Abstraction for Data Intensive Computing on Campus Grids}},journal={{IEEE Transactions on Parallel and Distributed Systems}},volume={21},number={1},pages={33-46},year={2010},note={{doi: 10.1109/TPDS.2009.49}},cclpaperid={12},keywords={allpairs, hecura},}
Exploiting Locality with QThreads for Portable Parallel Performance
@thesis{wheeler-thesis,author={Wheeler, Kyle},title={{Exploiting Locality with QThreads for Portable Parallel Performance}},editor={Thesis, Ph.D.},booktitle={{University of Notre Dame}},year={2009},cclpaperid={81},keywords={allpairs, wavefront, hecura},}
Harnessing Parallelism in Multicore Clusters with the All-Pairs and Wavefront Abstractions
Li Yu, Christopher Moretti, Scott Emrich, Kenneth Judd, and Douglas Thain
In IEEE High Performance Distributed Computing, 2009
@inproceedings{abstr-hpdc09,author={Yu, Li and Moretti, Christopher and Emrich, Scott and Judd, Kenneth and Thain, Douglas},title={{Harnessing Parallelism in Multicore Clusters with the All-Pairs and Wavefront Abstractions}},booktitle={{IEEE High Performance Distributed Computing}},pages={1-10},year={2009},note={{doi: 10.1145/1551609.1551613}},cclpaperid={5},keywords={workqueue, allpairs, wavefront, hecura},}
Scaling Up Classifiers to Cloud Computers
Christopher Moretti, Karsten Steinhaeuser, Douglas Thain, and Nitesh V. Chawla
In IEEE International Conference on Data Mining (ICDM), 2008
@inproceedings{classify-icdm08,author={Moretti, Christopher and Steinhaeuser, Karsten and Thain, Douglas and Chawla, Nitesh V.},title={{Scaling Up Classifiers to Cloud Computers}},booktitle={{IEEE International Conference on Data Mining (ICDM)}},pages={472-481},year={2008},note={{doi: 10.1109/ICDM.2008.99}},cclpaperid={25},keywords={hecura},}
Poster: DataLab: Transactional Data Parallel Computing on an Active Storage Cloud
Brandon Rich and Douglas Thain
In IEEE/ACM High Performance Distributed Computing, 2008
@inproceedings{datalab-hpdc08,author={Rich, Brandon and Thain, Douglas},title={{Poster: DataLab: Transactional Data Parallel Computing on an Active Storage Cloud}},booktitle={{IEEE/ACM High Performance Distributed Computing}},pages={233-234},year={2008},note={{doi: 10.1145/1383422.1383461}},cclpaperid={27},keywords={chirp, hecura},}
All-Pairs: An Abstraction for Data Intensive Cloud Computing
Christopher Moretti, Jared Bulosan, Douglas Thain, and Patrick Flynn
In IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2008
@inproceedings{allpairs-ipdps08,author={Moretti, Christopher and Bulosan, Jared and Thain, Douglas and Flynn, Patrick},title={{All-Pairs: An Abstraction for Data Intensive Cloud Computing}},booktitle={{IEEE International Parallel and Distributed Processing Symposium (IPDPS)}},pages={1-11},year={2008},note={{doi: 10.1109/IPDPS.2008.4536311}},cclpaperid={28},keywords={allpairs, hecura},}
Poster: All-Pairs: An Abstraction for Data Intensive Computing
Christopher Moretti, Jared Bulosan, Douglas Thain, and Patrick J. Flynn
@inproceedings{allpairs-grid07,author={Moretti, Christopher and Bulosan, Jared and Thain, Douglas and Flynn, Patrick J.},title={{Poster: All-Pairs: An Abstraction for Data Intensive Computing}},booktitle={{IEEE/ACM Grid Computing}},year={2007},cclpaperid={63},keywords={allpairs, hecura},}
Challenges in Executing Data Intensive Biometric Workloads on a Desktop Grid
Christopher Moretti, Timothy Faltemier, Douglas Thain, and Patrick J. Flynn
In Workshop on Large Scale and Volatile Desktop Grids at IEEE IPDPS, 2007
@inproceedings{challenges-pcgrid07,author={Moretti, Christopher and Faltemier, Timothy and Thain, Douglas and Flynn, Patrick J.},title={{Challenges in Executing Data Intensive Biometric Workloads on a Desktop Grid}},booktitle={{Workshop on Large Scale and Volatile Desktop Grids at IEEE IPDPS}},pages={481-489},year={2007},note={{doi: 10.1109/IPDPS.2007.370671}},cclpaperid={34},keywords={hecura},}
Separating Abstractions from Resources in a Tactical Storage System
Douglas Thain, Sander Klous, Justin Wozniak, Paul Brenner, Aaron Striegel, and Jesus Izaguirre
@inproceedings{tactical-sc05,author={Thain, Douglas and Klous, Sander and Wozniak, Justin and Brenner, Paul and Striegel, Aaron and Izaguirre, Jesus},title={{Separating Abstractions from Resources in a Tactical Storage System}},booktitle={{IEEE/ACM Supercomputing}},pages={55-67},year={2005},note={{doi: 10.1109/SC.2005.64}},cclpaperid={52},keywords={parrot, chirp, allocfs, filesystems, career, hecura, gridfs},}
Explicit Control in a Batch Aware Distributed File System
John Bent, Douglas Thain, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, and Miron Livny
In USENIX Networked Systems Design and Implementation (NSDI), 2004
@inproceedings{badfs-nsdi-04,author={Bent, John and Thain, Douglas and Arpaci-Dusseau, Andrea and Arpaci-Dusseau, Remzi and Livny, Miron},title={{Explicit Control in a Batch Aware Distributed File System}},booktitle={{USENIX Networked Systems Design and Implementation (NSDI)}},pages={365-378},year={2004},cclpaperid={58},keywords={filesystems, career, hecura, gridfs},}