Issues with Virtual Datasets (VDS) #89

Open · jbhatch opened this issue Jun 11, 2020 · 1 comment

jbhatch commented Jun 11, 2020

Several issues arise when using h5pyd or the HSDS CLI tools on an HDF5 virtual dataset (VDS) created with h5py. If a few HDF5 files are combined into a VDS and h5pyd is used to send the VDS to HSDS, a tiny (~KB-sized), unusable file is produced on the server. This file shows up in an hsls listing, but it cannot be retrieved back to an NFS mount with hsget. However, if hsload is used to send the VDS file to HSDS, all of the data that comprises the VDS is written to HSDS, effectively undoing the virtual aspect of the VDS.
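For reference, here is a minimal sketch of how a VDS along these lines might be built with h5py (the file names, dataset name, and shapes are hypothetical; the VDS API calls are standard h5py):

```python
import h5py

# Hypothetical source files, each holding a (100,) float64 dataset named "data"
sources = ["part0.h5", "part1.h5", "part2.h5"]

# Virtual layout spanning all three sources end to end
layout = h5py.VirtualLayout(shape=(300,), dtype="f8")
for i, fname in enumerate(sources):
    vsource = h5py.VirtualSource(fname, "data", shape=(100,))
    layout[i * 100:(i + 1) * 100] = vsource

# Write the VDS file; reading vds.h5 transparently pulls from the source files.
# Sending vds.h5 to HSDS with h5pyd or hsload is where the problems appear.
with h5py.File("vds.h5", "w", libver="latest") as f:
    f.create_virtual_dataset("data", layout, fillvalue=0)
```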

jreadey (Member) commented Jun 12, 2020

HSDS doesn't support VDS, and hsload (which reads HDF5 files with h5py) isn't VDS-aware. Still, I would have expected the ingest to work in the sense of setting up HSDS datasets that include all the data from the source files.

Can you post some sample VDS files? I could do some experimentation.

BTW, there's another approach to combining multiple files in HSDS: an HSDS dataset can be created that maps to chunks stored in one or more HDF5 files (as long as they share the same chunk shape, type, and compression options). You can read about how this works here: https://github.com/HDFGroup/hsds/blob/master/docs/design/single_object/SingleObject.md.

This approach is not as general as VDS, but it works well when the data you are pulling in aligns on chunk boundaries. We used this approach to aggregate data from 7850 HDF5 files into one (7850, 720, 1440) dataset. See:
https://github.com/HDFGroup/hdflab_examples/blob/master/NCEP3/ncep3_example.ipynb.

The hsload util isn't able to link to multiple files, so the chunk map needs to be set up manually. I can walk you through it if you are interested.
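As a starting point, here is a sketch of how the per-chunk byte offsets and sizes that a manual chunk map needs (per the design doc linked above) could be pulled from a source file using h5py's low-level chunk query API (requires h5py >= 3.0 and HDF5 >= 1.10.5; file and dataset names are hypothetical):

```python
import h5py

# Hypothetical source file containing a chunked dataset named "data"
with h5py.File("part0.h5", "r") as f:
    dset = f["data"]
    dsid = dset.id  # low-level dataset identifier

    # chunk_map: logical chunk offset -> (byte offset in file, size in bytes).
    # These are the values a chunk map entry for HSDS would reference.
    chunk_map = {}
    for i in range(dsid.get_num_chunks()):
        info = dsid.get_chunk_info(i)
        chunk_map[info.chunk_offset] = (info.byte_offset, info.size)

    print(f"{len(chunk_map)} chunks, chunk shape {dset.chunks}")
```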
