Add tld grouping back to hive pipeline
This MR addresses the major pain points of the previous iterations of this batch pipeline. The first iteration submitted a single array task per tld, with every task using the same set of resource parameters. This led to longer waits for jobs to complete and wasted most of the allocated resources across many tasks. The second iteration grouped tlds by estimated resource size and submitted multiple jobs per hive pipeline, each with a different memory request for its tasks. This solved the resource issue but overloaded the scheduler when many of these pipelines were submitted at once.
Instead, I went back to grouping the tlds so that each array task operates on a similar amount of data. The group size and requested memory are set to the estimated memory required to process the largest tld, rounded up to the nearest power of 2, or to 16 GB, whichever is larger.
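A minimal sketch of that sizing rule, assuming a Python pipeline (the names `MEM_FACTOR`, `MIN_MEM_GB`, and `task_mem_gb` are illustrative, not the pipeline's actual identifiers):

```python
import math

MIN_MEM_GB = 16  # floor on the per-task memory request
MEM_FACTOR = 3   # default multiplier on the estimated in-memory size

def task_mem_gb(largest_tld_gb: float) -> int:
    """Per-task memory request (and group size) in GB."""
    scaled = largest_tld_gb * MEM_FACTOR
    # Round up to the nearest power of 2, then apply the 16 GB floor.
    rounded = 2 ** math.ceil(math.log2(max(scaled, 1)))
    return max(rounded, MIN_MEM_GB)
```

For the example below, `task_mem_gb(40)` gives 2^ceil(log2(120)) = 128 GB.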
Preliminary tests show that the current resource estimates work well. For instance, the dataset for `data-project` from 2025-04-21 has a max estimated memory requirement of 128 GB, based on `bhattlab` having an estimated in-memory size of 40 GB: that size is multiplied by the default memory factor (3) and rounded up to the nearest power of 2, giving 128 GB. This defines the group size, and all other tlds are grouped based on the cumulative sum of their sizes. This resulted in 3 groups: `bhattlab` was in a group by itself, the next 5 largest tlds were in the second group, and every other tld was in the last group.
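A sketch of the cumulative-sum grouping under the same assumptions (Python, illustrative names; whether the cumulative sum uses raw or factor-scaled sizes is my assumption here, shown as scaled):

```python
def group_tlds(tld_sizes_gb: dict[str, float], group_size_gb: int,
               mem_factor: int = 3) -> list[list[str]]:
    """Greedily pack tlds, largest first, into groups whose cumulative
    scaled size stays within the per-group memory budget."""
    groups: list[list[str]] = []
    current: list[str] = []
    running = 0.0
    for tld, size in sorted(tld_sizes_gb.items(), key=lambda kv: -kv[1]):
        scaled = size * mem_factor  # assumed: grouping uses scaled sizes
        if current and running + scaled > group_size_gb:
            groups.append(current)
            current, running = [], 0.0
        current.append(tld)
        running += scaled
    if current:
        groups.append(current)
    return groups
```

With `bhattlab` at 40 GB scaling to 120 GB of a 128 GB budget, no other tld fits alongside it, which matches the 3-group split described above.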
Slurm job efficiency reports showed approximately 70% memory efficiency for all 3 tasks, and every task finished successfully within 4 minutes.