Data Intensive Abstractions for High End Biometric Applications

PIs: Douglas Thain and Patrick Flynn. This work is supported by the National Science Foundation under grant CNS-06-21434.

Biometric research requires the execution of very large data intensive batch workloads. To evaluate new matching algorithms, researchers wish to compare thousands of images to each other by brute force. When workloads of this sort are submitted to conventional batch systems in the usual way, they induce massive amounts of network and I/O traffic that result in very poor throughput. How can we execute such large workloads effectively?

To attack this problem, we are introducing data intensive abstractions that allow the user to easily give the system more information about the structure of a workload, so that it can partition the data and execute the workload effectively. The abstraction explicitly specifies the data to be processed, the code that will process it, and the relationship between the two. One example of an abstraction is All-Pairs:

All-Pairs( set S, function F ):
For all Si and Sj in set S, compute: F( Si, Sj )
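
As a point of reference, the definition above can be written as a short sequential sketch in Python. The names all_pairs and hamming are illustrative assumptions and are not taken from the project's software; the comparison function stands in for a real biometric matcher.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def all_pairs(s: Sequence[T], f: Callable[[T, T], R]) -> List[List[R]]:
    """Return a matrix m where m[i][j] = f(s[i], s[j]) for every i and j."""
    return [[f(a, b) for b in s] for a in s]


def hamming(a: str, b: str) -> int:
    """Toy comparison (hypothetical): number of positions where two strings differ."""
    return sum(x != y for x, y in zip(a, b))


# Example: compare three small bit-string "templates" to each other.
templates = ["0110", "0111", "1001"]
for row in all_pairs(templates, hamming):
    print(row)
```

This loop performs every comparison in one process; it only pins down what result the abstraction promises, not how a system should schedule the work.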

A computing system with an All-Pairs interface can easily find a more efficient implementation than a demand-paged filesystem. The input data can be staged to the computation nodes by a spanning tree, and the partitioning of work units into jobs can be done according to the performance properties of the system. In this project, we are designing a variety of similar data intensive abstractions that allow for the easy and efficient execution of large scientific workloads.
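
The partitioning idea can be sketched as follows, under stated assumptions: the comparison matrix is tiled into square blocks, and each block becomes one job that needs only the items for its own rows and columns. The function names and the fixed block size are hypothetical; a real system would choose the job size from measured performance properties, and the spanning-tree staging of input data is not shown.

```python
from typing import Callable, Dict, Iterator, Sequence, Tuple, TypeVar

T = TypeVar("T")
R = TypeVar("R")
Block = Tuple[range, range]


def partition_pairs(n: int, block: int) -> Iterator[Block]:
    """Yield (rows, cols) index ranges that tile the n x n comparison matrix."""
    if block <= 0:
        raise ValueError("block size must be positive")
    for i in range(0, n, block):
        for j in range(0, n, block):
            yield range(i, min(i + block, n)), range(j, min(j + block, n))


def run_job(s: Sequence[T], f: Callable[[T, T], R],
            rows: range, cols: range) -> Dict[Tuple[int, int], R]:
    """Compute one block; it touches only the items indexed by rows and cols."""
    return {(i, j): f(s[i], s[j]) for i in rows for j in cols}


def all_pairs_blocked(s: Sequence[T], f: Callable[[T, T], R],
                      block: int) -> Dict[Tuple[int, int], R]:
    """Run every block in turn and merge the results (jobs are independent)."""
    results: Dict[Tuple[int, int], R] = {}
    for rows, cols in partition_pairs(len(s), block):
        results.update(run_job(s, f, rows, cols))
    return results
```

Because the jobs share no state, each call to run_job could be dispatched to a different node; here they run one after another only to keep the sketch self-contained.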

Related Publications

  1. Designing Self-Tuning Split-Map-Merge Applications for High Cost-Efficiency in the Cloud
    Dinesh Rajan and Douglas Thain
    IEEE Transactions on Cloud Computing, 2017
    doi: 10.1109/TCC.2015.2415780
  2. A Compiler Toolchain For Data Intensive Scientific Workflows
    Peter Bui
    2012
  3. Makeflow: A Portable Abstraction for Data Intensive Computing on Clusters, Clouds, and Grids
    Michael Albrecht, Patrick Donnelly, Peter Bui, and Douglas Thain
    In Workshop on Scalable Workflow Enactment Engines and Technologies (SWEET) at ACM SIGMOD, 2012
    doi: 10.1145/2443416.2443417
  4. Harnessing Parallelism in Multicore Clusters with the All-Pairs, Wavefront, and Makeflow Abstractions
    Li Yu, Christopher Moretti, Andrew Thrasher, Scott Emrich, Kenneth Judd, and Douglas Thain
    Journal of Cluster Computing, 2010
    doi: 10.1007/s10586-010-0134-7
  5. Abstractions for Cloud Computing with Condor
    Douglas Thain and Christopher Moretti
    In Cloud Computing and Software Services: Theory and Techniques, 2010
    isbn: 9781439803158
  6. ROARS: A Scalable Repository for Data Intensive Scientific Computing
    Hoang Bui, Peter Bui, Patrick Flynn, and Douglas Thain
    In The Third International Workshop on Data Intensive Distributed Computing at ACM HPDC 2010, 2010
    doi: 10.1145/1851476.1851587
  7. Weaver: Integrating Distributed Computing Abstractions into Scientific Workflows using Python
    Peter Bui, Li Yu, and Douglas Thain
    In Challenges of Large Applications in Distributed Environments at ACM HPDC 2010, 2010
    doi: 10.1145/1851476.1851570
  8. Abstractions for Scientific Computing on Campus Grids
    Christopher Moretti
    2010
  9. All-Pairs: An Abstraction for Data Intensive Computing on Campus Grids
    Christopher Moretti, Hoang Bui, Karen Hollingsworth, Brandon Rich, Patrick Flynn, and Douglas Thain
    IEEE Transactions on Parallel and Distributed Systems, 2010
    doi: 10.1109/TPDS.2009.49
  10. Exploiting Locality with QThreads for Portable Parallel Performance
    Kyle Wheeler
    2009
  11. Harnessing Parallelism in Multicore Clusters with the All-Pairs and Wavefront Abstractions
    Li Yu, Christopher Moretti, Scott Emrich, Kenneth Judd, and Douglas Thain
    In IEEE High Performance Distributed Computing, 2009
    doi: 10.1145/1551609.1551613
  12. Scaling Up Classifiers to Cloud Computers
    Christopher Moretti, Karsten Steinhaeuser, Douglas Thain, and Nitesh V. Chawla
    In IEEE International Conference on Data Mining (ICDM), 2008
    doi: 10.1109/ICDM.2008.99
  13. Poster: DataLab: Transactional Data Parallel Computing on an Active Storage Cloud
    Brandon Rich and Douglas Thain
    In IEEE/ACM High Performance Distributed Computing, 2008
    doi: 10.1145/1383422.1383461
  14. All-Pairs: An Abstraction for Data Intensive Cloud Computing
    Christopher Moretti, Jared Bulosan, Douglas Thain, and Patrick Flynn
    In IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2008
    doi: 10.1109/IPDPS.2008.4536311
  15. Poster: All-Pairs: An Abstraction for Data Intensive Computing
    Christopher Moretti, Jared Bulosan, Douglas Thain, and Patrick J. Flynn
    In IEEE/ACM Grid Computing, 2007
  16. Challenges in Executing Data Intensive Biometric Workloads on a Desktop Grid
    Christopher Moretti, Timothy Faltemier, Douglas Thain, and Patrick J. Flynn
    In Workshop on Large Scale and Volatile Desktop Grids at IEEE IPDPS, 2007
    doi: 10.1109/IPDPS.2007.370671
  17. Separating Abstractions from Resources in a Tactical Storage System
    Douglas Thain, Sander Klous, Justin Wozniak, Paul Brenner, Aaron Striegel, and Jesus Izaguirre
    In IEEE/ACM Supercomputing, 2005
    doi: 10.1109/SC.2005.64
  18. Explicit Control in a Batch Aware Distributed File System
    John Bent, Douglas Thain, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, and Miron Livny
    In USENIX Networked Systems Design and Implementation (NSDI), 2004