Add calculation for data churn
To determine which projects are heavily active versus mostly dormant, we need a measure of data churn. Looking only at the latest access time doesn't tell us how often a project space is being used, only that at least one thing changed on a given date. It also misses cases where the only changes are file deletions, since deleted files aren't included in later logs. Churn from day to day (or log to log, as the case may be) encompasses the following cases:
- Number of files created
- Number of files deleted
- Number of files modified
The most straightforward and expedient way to calculate all of these at once is through a merge: check which files exist in the later log but not the earlier one (creations), which exist in the earlier log but not the later one (deletions), and which exist in both but have different modification times. The sum of these three counts defines churn.
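A minimal sketch of that comparison in pandas, assuming both snapshots are indexed by `path` and carry a modification-time column (called `modify` here; the actual column name in our logs may differ):

```python
import pandas as pd

def calculate_churn(earlier: pd.DataFrame, later: pd.DataFrame) -> dict:
    """Count created, deleted, and modified files between two log snapshots.

    Both frames are assumed to be indexed by path and to carry a `modify`
    (modification time) column; adjust the column name to the real schema.
    """
    merged = earlier[["modify"]].merge(
        later[["modify"]],
        how="outer",
        left_index=True,
        right_index=True,
        suffixes=("_early", "_late"),
        indicator=True,
    )

    created = (merged["_merge"] == "right_only").sum()   # only in the later log
    deleted = (merged["_merge"] == "left_only").sum()    # only in the earlier log
    modified = (
        (merged["_merge"] == "both")
        & (merged["modify_early"] != merged["modify_late"])
    ).sum()                                              # in both, mtime changed

    return {
        "created": int(created),
        "deleted": int(deleted),
        "modified": int(modified),
        "churn": int(created + deleted + modified),
    }
```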
Each dataset currently uses around 150 GB or more when loaded into memory. Loading two full datasets into memory at once and performing an expensive join is not feasible on the current Cheaha hardware (maybe with half of a B200 DGX). Paring down the datasets to only the required columns saves some memory, but the vast majority of the memory use is in the `path` column, since paths can be extremely long strings in some cases. Instead, joins should be performed on a per-`tld` basis to limit memory use, running in either a local Dask cluster job or a Slurm array job. Unfortunately, the flat, unordered nature of the current parquet datasets means iterating over every individual parquet file every time a different `tld` is loaded. This is very slow and cannot be improved through parallelization. Sorting the full dataset so that the partition and index of each `tld` are known requires a full read into memory and expensive shuffles. Instead, the parquet dataset should be converted to a hive structure partitioned by `tld` and policy run date. Additionally, `path` should be set as the index and sorted to improve merge performance. One hive should be created and each day's partitioned log added to it, building a full, mutable time series directly into the directory structure.
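As a rough sketch of the daily conversion step, assuming Dask is used and the policy run date lands in a partition column named `acq` (the column name, layout, and sorting strategy are assumptions, not settled decisions):

```python
import dask.dataframe as dd

def append_log_to_hive(flat_parquet_path: str, hive_root: str, acq_date: str) -> None:
    """Append one day's flat parquet log to the hive, partitioned by tld and date."""
    ddf = dd.read_parquet(flat_parquet_path)

    # Record the policy run date so it becomes a partition directory (tld=.../acq=...).
    ddf = ddf.assign(acq=acq_date)

    # Set path as the sorted index to improve later merge performance.
    # This triggers a shuffle, so it belongs on a Dask cluster, not a single worker.
    ddf = ddf.set_index("path")

    ddf.to_parquet(
        hive_root,
        partition_on=["tld", "acq"],
        append=True,       # add today's partitions to the existing hive
        write_index=True,
    )
```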
Once the hive has been created, calculating churn should be fairly straightforward: load data for one `tld` at two timepoints, merge on the `path` index, and determine which files do not match based on the rules listed above. Smaller `tld` comparisons can be run in a CPU-only job using pandas, while larger `tld` comparisons can be run either in a single-GPU job or on a Dask cluster. For reference, a ~10 million row pandas dataframe containing our GPFS log data uses ~2.5 GB of RAM; the same dataframe with a cuDF backend shrinks to ~1.8 GB of VRAM. (Source: `tld == 'xnat'` and `acq == '2024-11-14'`.)
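With that layout in place, a per-`tld` comparison in pandas might look like the following, reusing `calculate_churn` from the sketch above. The hive root and the earlier date are illustrative, and the read pattern assumes partition columns named `tld` and `acq`:

```python
import pandas as pd

# Hypothetical hive location under the per-fileset parent directory described below.
HIVE_ROOT = "/data/rc/gpfs-policy/data/project/hive"

def load_tld_snapshot(tld: str, acq: str) -> pd.DataFrame:
    """Load a single tld/date partition from the hive."""
    df = pd.read_parquet(
        HIVE_ROOT,
        filters=[("tld", "==", tld), ("acq", "==", acq)],
    )
    # If the hive was written with path as the index, this is a no-op;
    # otherwise set and sort it so the merge lines up on path.
    if df.index.name != "path":
        df = df.set_index("path").sort_index()
    return df

# Compare two consecutive policy runs for one tld (dates are illustrative).
earlier = load_tld_snapshot("xnat", "2024-11-13")
later = load_tld_snapshot("xnat", "2024-11-14")
print(calculate_churn(earlier, later))
```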
The results should be kept in a local SQLite database for convenience. A directory can be created in `/data/rc/gpfs-policy/data` that is a parent for both the hive and the database for each of `project`, `user`, and `scratch`. Each of these should also be backed up to LTS automatically.
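A small sketch of recording results in SQLite, where the database path and table schema are placeholders rather than decided names:

```python
import sqlite3

# Hypothetical database path under the per-fileset parent directory.
DB_PATH = "/data/rc/gpfs-policy/data/project/churn.db"

def record_churn(tld: str, log_dt: str, counts: dict) -> None:
    """Append one tld/date churn measurement to a local SQLite table."""
    with sqlite3.connect(DB_PATH) as conn:
        conn.execute(
            """
            CREATE TABLE IF NOT EXISTS churn (
                tld      TEXT,
                log_dt   TEXT,
                created  INTEGER,
                deleted  INTEGER,
                modified INTEGER,
                total    INTEGER
            )
            """
        )
        conn.execute(
            "INSERT INTO churn VALUES (?, ?, ?, ?, ?, ?)",
            (
                tld,
                log_dt,
                counts["created"],
                counts["deleted"],
                counts["modified"],
                counts["churn"],
            ),
        )
```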