KeyError in `convert-to-parquet.py` when run on metadata from directories other than /data/user/ and /data/project/
The following error occurred when running on a directory other than /data/user/ and /data/project/.

    Traceback (most recent call last):
      File "/home/wwarr/repos/gpfs-policy/src/convert-to-parquet/convert-to-parquet.py", line 117, in <module>
        main()
      File "/home/wwarr/repos/gpfs-policy/src/convert-to-parquet/convert-to-parquet.py", line 110, in main
        df = pd.DataFrame.from_dict(dicts).sort_values('tld')
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/opt/conda/lib/python3.11/site-packages/pandas/core/frame.py", line 7189, in sort_values
        k = self._get_label_or_level_values(by[0], axis=axis)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/opt/conda/lib/python3.11/site-packages/pandas/core/generic.py", line 1911, in _get_label_or_level_values
        raise KeyError(key)
    KeyError: 'tld'

The root cause of the error is probably at or near line 34. We'll need a fallback in case `tld` is not a match for the regex there.
Ruff (linter) tells me that the type hint for `tld` is `str | Any`. At the very least, checking for `tld is None` and `not tld` would be good. Those would catch no match and zero-length matches, respectively.
It isn't clear what to do in that situation. Perhaps make the `tld` the full path? Or the empty string? It shouldn't be any other arbitrary value, however, because a lot of things are valid paths and we don't want ambiguity.
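
A minimal sketch of that check, for discussion only. The actual regex near line 34 isn't reproduced in this report, so `TLD_PATTERN` and the helper name `extract_tld` are hypothetical stand-ins. The fallback defaults to the empty string here, but the caller could pass the full path instead:

```python
import re

# Hypothetical pattern: the real regex near line 34 is not shown in this report.
# The `*` quantifier is used so that a zero-length capture is possible.
TLD_PATTERN = re.compile(r"/data/(?:user|project)/([^/]*)")


def extract_tld(path: str, fallback: str = "") -> str:
    """Return the captured tld, or `fallback` when there is no usable match."""
    match = TLD_PATTERN.match(path)
    tld = match.group(1) if match else None
    # `tld is None` covers "no match"; `not tld` covers a zero-length match.
    if tld is None or not tld:
        return fallback
    return tld


print(extract_tld("/data/user/alice/file.txt"))
print(repr(extract_tld("/scratch/bob/file.txt")))
print(extract_tld("/scratch/bob/file.txt", fallback="/scratch/bob/file.txt"))
```

With something like this in place, every record would carry a `tld` value, so the sort in `main()` should no longer depend on every path matching the pattern.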