Convert to polars for backend compute
## Overview
This is a complete overhaul of the compute backend for this package. Prior versions used a combination of pandas and Dask with CUDA capabilities to improve compute power and address pandas's performance issues with large datasets. This caused a number of usability and maintenance problems:
- There were 4 possible compute backends, each requiring slight algorithmic modifications, so essentially 4 different versions of the same function/method had to be written and maintained, one per backend.
- The CUDA and Dask packages are extremely heavy and caused very long load times even for simple actions. Workarounds that altered where packages were imported in the script made the code flow more confusing.
- Dask, while performant, is prone to crashes when creating and using its local cluster functionality.
- There was an over-reliance on A100 GPUs being available in order to complete processing in a timely manner.
- Installation on Cheaha is slightly cumbersome because most of the packages have to come from the conda repository rather than being installable via `pip`.
Switching the compute backend to Polars solves all of these problems. Polars is written in Rust, is highly performant, and natively parallelizes work wherever possible. Additionally, the most recent releases include an updated streaming engine that enables fast, out-of-core computation on datasets larger than memory using only CPUs. This reduces the resources required for processing, which should increase job/task throughput on the cluster. A minimal sketch of the streaming pattern is shown below.
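For illustration only, here is a rough sketch of the lazy/streaming pattern; the file path and column names are hypothetical, and on older Polars releases the final call was spelled `collect(streaming=True)` instead of `collect(engine="streaming")`:

```python
import polars as pl

# Build a lazy query; scan_parquet reads nothing into memory up front.
# The path and column names below are placeholders, not this package's data.
lf = pl.scan_parquet("gpfs_dataset.parquet")

query = (
    lf.filter(pl.col("size") > 0)
    .group_by("tld")
    .agg(pl.col("size").sum().alias("total_bytes"))
)

# The streaming engine executes the plan in chunks, so datasets larger
# than RAM can be processed on CPUs alone.
df = query.collect(engine="streaming")
```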
As part of this, breaking changes were made to a few of the primary functions. Many functions were reduced in scope, for example by removing automatic data I/O. This improves usability in the long run by making these functions more modular (see Scope Changes below).
## Changelog

### Major Changes
- Updated all parquet/dataframe logic to Polars
- Removed `cudf`, `dask_cuda`, `dask`, and `pandas` dependencies, among others
- Removed the `compute` module
- Removed the `process.factory` submodule
- Removed all GPU interfaces, including `pynvml` and `GPUtil`. GPU integration will be added back later when Cheaha's OS is updated so the newer versions of `cudf-polars` can be correctly compiled
### Minor Changes
- All primary `cli` submodule functions are now imported in `cli.__init__.py`
- All primary `policy` submodule functions are now imported in `policy.__init__.py`
- Deleted unused `cli.gpfs_preproc.py`
- Deleted unused `report/general-report.qmd`
- `hivize` no longer includes a preparation and grouping step (`prep_hivize`)
- `bytes_to_human_readable_size`, `create_size_bin_labels`, and `calculate_size_distribution` moved from `.policy.hive` to `.utils.{core,units}`
- `as_path` and `parse_scontrol` moved to `.utils.core`
- `as_datetime`, `as_timedelta`, `create_timedelta_breakpoints`, and `create_timedelta_labels` moved/added to `.utils.datetime`
- Added `prep_{age,size}_distribution` functions that return the correctly formatted breakpoints and labels used in `pl.Series.cut`. The `calculate_*` versions of those commands now call those functions, then apply `cut` and return the groups as a Series (see the sketch after this list)
- `aggregate_gpfs_dataset` can now group on file size in addition to file age
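As a rough illustration of the breakpoint/label pattern that `prep_{age,size}_distribution` produces for `pl.Series.cut` (the bin edges, labels, and column name below are hypothetical, not the package's actual values):

```python
import polars as pl

# Hypothetical size bins standing in for prep_size_distribution output:
# n breakpoints yield n + 1 labeled bins.
breaks = [4 * 1024**2, 1024**3, 1024**4]  # 4 MiB, 1 GiB, 1 TiB
labels = ["<4 MiB", "4 MiB to 1 GiB", "1 GiB to 1 TiB", ">1 TiB"]

sizes = pl.Series("size", [1_000, 500 * 1024**2, 2 * 1024**4])

# Series.cut assigns each value to its bin, returning a categorical Series
groups = sizes.cut(breaks, labels=labels)
```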
### Scope Changes
- `aggregate_gpfs_dataset` now takes in a dataframe/lazyframe and returns the aggregated dataframe. No data I/O is performed (see the stand-in sketch after this list).
- `create_db` is now more general purpose and takes a table definition as input. `create_churn_db` was added for convenience to specifically create the churn database/table.
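To illustrate the new I/O-free contract, here is a hedged stand-in for `aggregate_gpfs_dataset`; the body, column names, and file paths are assumptions, and only the frame-in, frame-out shape reflects the actual change:

```python
import polars as pl

def aggregate_gpfs_dataset(lf: pl.LazyFrame) -> pl.DataFrame:
    """Stand-in with the same contract as the real function: it accepts a
    dataframe/lazyframe, returns the aggregated frame, and does no I/O."""
    return (
        lf.group_by("tld")  # hypothetical grouping column
        .agg(pl.col("size").sum().alias("total_bytes"))
        .collect()
    )

# The caller now owns all I/O (paths here are placeholders)
lf = pl.scan_parquet("gpfs_log.parquet")
df = aggregate_gpfs_dataset(lf)
df.write_parquet("gpfs_log_aggregated.parquet")
```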
### Cleanup
- Removed notebooks which were no longer relevant
## Issues
- Fixes #55
- Fixes #44
- Fixes #43
- Closes #56
- Closes #31
- Closes #45