Wrangling Massive Task Graphs with Dynamic Hierarchical Composition

On Thursday, October 30, research engineer Ben Tovar presented our recent work on accelerating the execution of High Energy Physics (HEP) workflows at the PyHEP 2025 Workshop, a hybrid workshop held at CERN. The presentation centered on an execution scheme called Dynamic Data Reduction (DDR) that runs on top of TaskVine.

Current HEP analysis tools, like Coffea, provide users with an easy way to express the overall workflow and leverage local vectorization on column-oriented data. However, this often requires expressing the entire computation graph statically from the start. This introduces several issues, such as graph generation overhead, which may take several hours longer than the actual computation, and the creation of computation units that do not fit the available resources.

With DDR, we take advantage of a structure inherent in many HEP applications: when processing multiple collision events, the accumulation (reduction) step is typically both associative and commutative. This means that it is unnecessary to predetermine which processed events are reduced together, and the choice can instead be guided by factors such as the availability and location of data. Further, the number of events processed together can respond dynamically to the resources available, and datasets can be processed independently.
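To make the role of associativity and commutativity concrete, here is a minimal Python sketch. It is not DDR's or TaskVine's API: the `process`, `accumulate`, and `reduce_in_any_order` functions, the integer "events", and the chunk boundaries are all hypothetical stand-ins chosen for illustration. It shows that when partial results are merged with an associative and commutative operation, the final result does not depend on how events were chunked or on which partials were merged together.

```python
import random
from collections import Counter
from functools import reduce


def process(events):
    """Hypothetical per-chunk processing: histogram events into bins of width 10."""
    return Counter(e // 10 for e in events)


def accumulate(a, b):
    """Merge two partial histograms (associative and commutative for these counts)."""
    return a + b


def reduce_in_any_order(partials, rng):
    """Merge two arbitrarily chosen partial results at a time until one remains."""
    pending = list(partials)
    while len(pending) > 1:
        rng.shuffle(pending)
        pending.append(accumulate(pending.pop(), pending.pop()))
    return pending[0]


rng = random.Random(0)
events = [rng.randrange(100) for _ in range(1000)]  # toy stand-in for collision events

# Uneven chunks, standing in for chunk sizes adapted to the resources available.
bounds = [0, 50, 300, 420, 1000]
partials = [process(events[lo:hi]) for lo, hi in zip(bounds, bounds[1:])]

sequential = reduce(accumulate, partials)      # fixed left-to-right order
shuffled = reduce_in_any_order(partials, rng)  # arbitrary pairing order

# Both strategies agree with processing everything in a single pass.
assert sequential == shuffled == process(events)
```

Because the pairing order is free, a scheduler could in principle merge whichever partial results happen to be available on the same node first, which is the kind of freedom described above.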

In the DDR application stack, TaskVine acts as the execution platform that distributes the computation to the cluster.

As an example, we ran Cortado, a HEP application that processes 419 datasets, 19,631 files, and 14 TB of data (totaling 12 billion events), in about 5.5 hours using over 1600 cores at any one time. During the run, some of these cores had to be replaced because of resource eviction.

For more information, please visit the DDR PyPI page at https://pypi.org/project/dynamic-data-reduction/
