Modify churn structure to account for storage affected
This MR modifies the previous churn algorithm and database to include the number of bytes affected by file deletion, creation, and modification, showing exactly how file volatility impacts storage. This required changing how churn is calculated: instead of the duplicate-marking method described in !42 (merged), policy dataframes are merged and their size, access, and modification data are compared. Additionally, two forms of storage change are calculated for modified files: the sum of the sizes of the new versions of the files, and the net change in storage between the old and new versions.
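As a rough illustration of the merge-based comparison, here is a minimal sketch using pandas. It is not the code in this MR: the function name `churn_summary`, the column names `path`, `size`, `modify`, and `access`, and the rules for classifying a file as modified or accessed are all assumptions made for the example.

```python
import pandas as pd


def churn_summary(old: pd.DataFrame, new: pd.DataFrame) -> dict:
    """Compare two policy snapshots keyed by path (hypothetical column names)."""
    # An outer merge keeps files present in either snapshot; the indicator
    # column records whether a row came from the old, new, or both snapshots.
    merged = old.merge(new, on="path", how="outer",
                       suffixes=("_old", "_new"), indicator=True)

    deleted = merged[merged["_merge"] == "left_only"]
    created = merged[merged["_merge"] == "right_only"]
    both = merged[merged["_merge"] == "both"]

    # Assumption: a file counts as modified when its modify time changed.
    modified = both[both["modify_old"] != both["modify_new"]]
    # Assumption: "accessed but not churned" means the access time changed
    # while the modify time stayed the same.
    accessed = both[(both["modify_old"] == both["modify_new"])
                    & (both["access_old"] != both["access_new"])]

    return {
        "created": len(created),
        "created_bytes": int(created["size_new"].sum()),
        "deleted": len(deleted),
        "deleted_bytes": int(deleted["size_old"].sum()),
        "modified": len(modified),
        # Total size of the new versions of modified files.
        "modified_bytes": int(modified["size_new"].sum()),
        # Net storage change between old and new versions.
        "modified_bytes_net": int((modified["size_new"] - modified["size_old"]).sum()),
        "accessed": len(accessed),
        "accessed_bytes": int(accessed["size_new"].sum()),
    }


# Toy snapshots: "a" is deleted, "b" is modified, "c" is only accessed,
# "d" is untouched, and "e" is created.
old = pd.DataFrame({
    "path": ["a", "b", "c", "d"],
    "size": [100, 200, 300, 400],
    "modify": [1, 1, 1, 1],
    "access": [1, 1, 1, 1],
})
new = pd.DataFrame({
    "path": ["b", "c", "d", "e"],
    "size": [250, 300, 400, 50],
    "modify": [2, 1, 1, 3],
    "access": [2, 5, 1, 3],
})

summary = churn_summary(old, new)
```

In this toy data, file `b` grows from 200 to 250 bytes, so it contributes 250 bytes to the total modified storage but only 50 bytes to the net change, which is the distinction between the two modified-storage values.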
Major Changes
- Churn algorithm now uses dataframe merges instead of concatenation and marking duplicates
- The following were added to the churn database table:
  - Fields for storage affected by file changes (total modified, net modified, total deleted, and total created)
  - Fields for files which were accessed but not churned and the corresponding sizes of those files
Minor Changes
- Added example plots using churned storage and files accessed to the `churn-analysis.ipynb` notebook