Convert to polars for backend compute
## Overview
This is a complete overhaul of the compute backend for this package. Prior versions used a combination of pandas and Dask with CUDA capabilities to improve compute power and address pandas's performance issues with large datasets. This caused a number of usability and maintenance problems:
- There were 4 possible compute backends, each requiring slight algorithmic modifications, so essentially 4 different versions of the same function/method had to be written and maintained, one per backend.
- The CUDA and Dask packages are extremely heavy and caused very long load times even for simple actions. Workarounds that altered where packages were imported in the script made the code flow more confusing.
- Dask, while performant, is prone to crashes when creating and using its local cluster functionality.
- There was an over-reliance on A100 GPUs being available in order to complete processing in a timely manner.
- Installation on Cheaha is slightly cumbersome because most of the packages have to come from the conda repository rather than being installable via `pip`.
Switching the compute backend to Polars solves all of these problems. Polars is written in Rust, is highly performant, and natively parallelizes work wherever possible. Additionally, the most recent releases include an updated streaming engine that enables fast, out-of-core computation on datasets larger than memory using only CPUs. This reduces the resources required for processing, which should increase job/task throughput on the cluster. A minimal sketch of the streaming pattern is shown below.
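For illustration only, here is a rough sketch of the lazy/streaming pattern; the file path and column names are hypothetical, and on older Polars releases the final call was spelled `collect(streaming=True)` instead of `collect(engine="streaming")`:

```python
import polars as pl

# Build a lazy query; scan_parquet reads nothing into memory up front.
# The path and column names below are placeholders, not this package's data.
lf = pl.scan_parquet("gpfs_dataset.parquet")

query = (
    lf.filter(pl.col("size") > 0)
    .group_by("tld")
    .agg(pl.col("size").sum().alias("total_bytes"))
)

# The streaming engine executes the plan in chunks, so datasets larger
# than RAM can be processed on CPUs alone.
df = query.collect(engine="streaming")
```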
As part of this, breaking changes were made to a few of the primary functions. Many functions were reduced in scope, for example by removing automatic data I/O. This improves usability in the long run by making these functions more modular (see Scope Changes below).
## Changelog

### Major Changes
- Updated all parquet/dataframe logic to Polars
- Removed `cudf`, `dask_cuda`, `dask`, and `pandas` dependencies, among others
- Removed the `compute` module
- Removed the `process.factory` submodule
- Removed all GPU interfaces, including `pynvml` and `GPUtil`. GPU integration will be added back later when Cheaha's OS is updated so the newer versions of `cudf-polars` can be correctly compiled
### Minor Changes
- All primary `cli` submodule functions are now imported in `cli.__init__.py`
- All primary `policy` submodule functions are now imported in `policy.__init__.py`
- Deleted unused `cli.gpfs_preproc.py`
- Deleted unused `report/general-report.qmd`
- `hivize` no longer includes a preparation and grouping step (`prep_hivize`)
- `bytes_to_human_readable_size`, `create_size_bin_labels`, and `calculate_size_distribution` moved from `.policy.hive` to `.utils.{core,units}`
- `as_path` and `parse_scontrol` moved to `.utils.core`
- `as_datetime`, `as_timedelta`, `create_timedelta_breakpoints`, and `create_timedelta_labels` moved/added to `.utils.datetime`
- Added `prep_{age,size}_distribution` functions that return the correctly formatted breakpoints and labels used in `pl.Series.cut`. The `calculate_*` versions of those commands now call those functions, then apply `cut` and return the groups as a Series (see the sketch after this list)
- `aggregate_gpfs_dataset` can now group on file size in addition to file age
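As a rough illustration of the breakpoint/label pattern that `prep_{age,size}_distribution` produces for `pl.Series.cut` (the bin edges, labels, and column name below are hypothetical, not the package's actual values):

```python
import polars as pl

# Hypothetical size bins standing in for prep_size_distribution output:
# n breakpoints yield n + 1 labeled bins.
breaks = [4 * 1024**2, 1024**3, 1024**4]  # 4 MiB, 1 GiB, 1 TiB
labels = ["<4 MiB", "4 MiB to 1 GiB", "1 GiB to 1 TiB", ">1 TiB"]

sizes = pl.Series("size", [1_000, 500 * 1024**2, 2 * 1024**4])

# Series.cut assigns each value to its bin, returning a categorical Series
groups = sizes.cut(breaks, labels=labels)
```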
### Scope Changes
- `aggregate_gpfs_dataset` now takes in a dataframe/lazyframe and returns the aggregated dataframe. No data I/O is performed (see the stand-in sketch after this list).
- `create_db` is now more general purpose and takes a table definition as input. `create_churn_db` was added for convenience to specifically create the churn database/table.
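To illustrate the new I/O-free contract, here is a hedged stand-in for `aggregate_gpfs_dataset`; the body, column names, and file paths are assumptions, and only the frame-in, frame-out shape reflects the actual change:

```python
import polars as pl

def aggregate_gpfs_dataset(lf: pl.LazyFrame) -> pl.DataFrame:
    """Stand-in with the same contract as the real function: it accepts a
    dataframe/lazyframe, returns the aggregated frame, and does no I/O."""
    return (
        lf.group_by("tld")  # hypothetical grouping column
        .agg(pl.col("size").sum().alias("total_bytes"))
        .collect()
    )

# The caller now owns all I/O (paths here are placeholders)
lf = pl.scan_parquet("gpfs_log.parquet")
df = aggregate_gpfs_dataset(lf)
df.write_parquet("gpfs_log_aggregated.parquet")
```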
### Cleanup
- Removed notebooks which were no longer relevant
## Issues
- Fixes #55
- Fixes #44
- Fixes #43
- Closes #56
- Closes #31
- Closes #45