Resource_monitor | The Cooperative Computing Lab

The resource_monitor is a tool to monitor the computational resources of complex, multi-process applications. This is an essential capability for executing large-scale applications reliably in clusters, clouds, and grids. It works on Linux, FreeBSD, and OSX, and can be used as a standalone tool or automatically with distributed systems like Makeflow, Work Queue, and TaskVine.

When invoked, the resource_monitor tracks all of the processes and threads created by the subject program, and monitors their individual resource and I/O behavior. It generates up to three report files: a summary file with the maximum value of each resource used, a time series that shows the resources used at given time intervals, and a list of files that were opened during execution, together with the count of read and write operations on each.
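The relationship between the time series and the summary can be sketched as follows. This is a minimal illustration, not the monitor's actual implementation or file format: the `summarize` function and the field names `wall_time`, `memory`, and `cores` are assumptions chosen for the example.

```python
from typing import Dict, List

# One sample of the time series: resource name -> measured value.
# The resource names used below are hypothetical, for illustration only.
Sample = Dict[str, float]


def summarize(samples: List[Sample]) -> Dict[str, float]:
    """Return the maximum observed value of each resource across all samples."""
    summary: Dict[str, float] = {}
    for sample in samples:
        for resource, value in sample.items():
            if resource not in summary or value > summary[resource]:
                summary[resource] = value
    return summary


# A made-up time series with three measurement intervals.
series = [
    {"wall_time": 1.0, "memory": 120.0, "cores": 0.9},
    {"wall_time": 2.0, "memory": 310.0, "cores": 1.8},
    {"wall_time": 3.0, "memory": 250.0, "cores": 1.2},
]

peak = summarize(series)
```

Here `peak` holds one maximum per resource, which is the kind of information the summary report carries.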

Additionally, the monitor can be used as a watchdog. Maximum resource limits can be specified, and if any resource exceeds its limit, the monitor terminates the task and reports which resource went over the limit.
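The watchdog logic can be sketched as a check of each measurement against the configured limits. This is a simplified model under stated assumptions, not the resource_monitor's code: the functions `check_limits` and `watch`, the report keys, and the resource names are all hypothetical, and a real watchdog would terminate a live process rather than stop iterating over recorded samples.

```python
from typing import Dict, List, Optional, Tuple

Usage = Dict[str, float]


def check_limits(usage: Usage, limits: Usage) -> Optional[Tuple[str, float, float]]:
    """Return (resource, measured, limit) for the first resource over its limit, else None."""
    # Sort the resource names so the result is deterministic when several limits are exceeded.
    for resource in sorted(limits):
        measured = usage.get(resource)
        if measured is not None and measured > limits[resource]:
            return (resource, measured, limits[resource])
    return None


def watch(samples: List[Usage], limits: Usage) -> Dict[str, object]:
    """Scan samples in order and stop at the first one that violates a limit."""
    for index, usage in enumerate(samples):
        violation = check_limits(usage, limits)
        if violation is not None:
            resource, measured, limit = violation
            # In a real watchdog, this is the point where the task would be terminated.
            return {
                "terminated": True,
                "sample": index,
                "limit_exceeded": resource,
                "measured": measured,
                "limit": limit,
            }
    return {"terminated": False}


# Hypothetical limits and measurements.
limits = {"memory": 300.0, "cores": 2.0}
samples = [
    {"memory": 120.0, "cores": 0.9},
    {"memory": 310.0, "cores": 1.8},
    {"memory": 250.0, "cores": 1.2},
]

report = watch(samples, limits)
```

In this example the second sample exceeds the memory limit, so the scan stops there and the report names `memory` as the resource that went over.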

The resource_monitor_visualizer creates a series of webpages summarizing the logs produced by the resource_monitor. It generates histograms for each resource and each group. For example, one histogram shows the distribution of CPU usage of a workflow with 5,000 tasks. To use the resource_monitor_visualizer, specify the location of the resource logs and the location for the output.
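The idea behind a per-resource histogram can be sketched as fixed-width binning of one value per task. This is an illustrative assumption about how such a histogram could be computed, not the visualizer's code: the `histogram` function, the bin width, and the sample values are invented for the example.

```python
from typing import Dict, List


def histogram(values: List[float], bin_width: float) -> Dict[float, int]:
    """Count how many values fall into each bin, keyed by the bin's lower edge."""
    counts: Dict[float, int] = {}
    for value in values:
        lower_edge = int(value // bin_width) * bin_width
        counts[lower_edge] = counts.get(lower_edge, 0) + 1
    # Return the bins in ascending order of their lower edge.
    return dict(sorted(counts.items()))


# Hypothetical peak core usage of five tasks.
cores_per_task = [0.4, 0.9, 1.1, 1.7, 3.2]
bins = histogram(cores_per_task, bin_width=1.0)
```

Each key is the lower edge of a bin and each value is the number of tasks whose usage fell in that bin. Repeating this per resource and per group of tasks gives one histogram for each combination.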

Related Publications

Not All Tasks Are Created Equal: Adaptive Resource Allocation for Heterogeneous Tasks in Dynamic Workflows

Thanh Son Phung, Logan Ward, Kyle Chard, and Douglas Thain

In WORKS Workshop on Workflows at Supercomputing, 2021

@inproceedings{tasks-works-2021,
  author = {Phung, Thanh Son and Ward, Logan and Chard, Kyle and Thain, Douglas},
  title = {{Not All Tasks Are Created Equal: Adaptive Resource Allocation for Heterogeneous Tasks in Dynamic Workflows}},
  booktitle = {{WORKS Workshop on Workflows at Supercomputing}},
  year = {2021},
  cclpaperid = {978},
  keywords = {workqueue, resource_monitor},
}

Lightweight Function Monitors for Fine-Grained Management in Large Scale Python Applications

Tim Shaffer, Zhuozhao Li, Ben Tovar, Yadu Babuji, TJ Dasso, Zoe Surma, Kyle Chard, Ian Foster, and Douglas Thain

In IEEE International Parallel and Distributed Processing Symposium, 2021

doi: 10.1109/IPDPS49936.2021.00088

@inproceedings{lfm-ipdps-2021,
  author = {Shaffer, Tim and Li, Zhuozhao and Tovar, Ben and Babuji, Yadu and Dasso, TJ and Surma, Zoe and Chard, Kyle and Foster, Ian and Thain, Douglas},
  title = {{Lightweight Function Monitors for Fine-Grained Management in Large Scale Python Applications}},
  booktitle = {{IEEE International Parallel and Distributed Processing Symposium}},
  year = {2021},
  note = {{doi: 10.1109/IPDPS49936.2021.00088}},
  cclpaperid = {968},
  keywords = {workqueue, resource_monitor},
}

Reduction of Workflow Resource Consumption Using a Density-based Clustering Model

Qimin Zhang, Ben Tovar, Nate Kremer-Herman, and Douglas Thain

In WORKS Workshop at Supercomputing, 2018

@inproceedings{clustering-works-2018,
  author = {Zhang, Qimin and Tovar, Ben and Kremer-Herman, Nate and Thain, Douglas},
  title = {{Reduction of Workflow Resource Consumption Using a Density-based Clustering Model}},
  booktitle = {{WORKS Workshop at Supercomputing}},
  year = {2018},
  cclpaperid = {956},
  keywords = {makeflow, resource_monitor},
}

A Job Sizing Strategy for High-Throughput Scientific Workflows

Benjamin Tovar, Rafael Ferreira da Silva, Gideon Juve, Ewa Deelman, William Allcock, Douglas Thain, and Miron Livny

IEEE Transactions on Parallel and Distributed Systems, 2018

doi: 10.1109/TPDS.2017.2762310

@article{tovar-tpds-2017,
  author = {Tovar, Benjamin and da Silva, Rafael Ferreira and Juve, Gideon and Deelman, Ewa and Allcock, William and Thain, Douglas and Livny, Miron},
  title = {{A Job Sizing Strategy for High-Throughput Scientific Workflows}},
  journal = {{IEEE Transactions on Parallel and Distributed Systems}},
  volume = {29},
  number = {2},
  pages = {240--253},
  year = {2018},
  note = {{doi: 10.1109/TPDS.2017.2762310}},
  cclpaperid = {941},
  keywords = {workqueue, resource_monitor},
}

Practical Resource Monitoring for Robust High Throughput Computing

Gideon Juve, Benjamin Tovar, Rafael Ferreira da Silva, Dariusz Krol, Douglas Thain, Ewa Deelman, William Allcock, and Miron Livny

In Workshop on Monitoring and Analysis for High Performance Computing Systems Plus Applications at IEEE Cluster Computing, 2015

@inproceedings{monitoring-hpcmaspa-2015,
  author = {Juve, Gideon and Tovar, Benjamin and da Silva, Rafael Ferreira and Krol, Dariusz and Thain, Douglas and Deelman, Ewa and Allcock, William and Livny, Miron},
  title = {{Practical Resource Monitoring for Robust High Throughput Computing}},
  booktitle = {{Workshop on Monitoring and Analysis for High Performance Computing Systems Plus Applications at IEEE Cluster Computing}},
  year = {2015},
  cclpaperid = {922},
  keywords = {resource_monitor},
}