Test CheckpointFileTransfer from recipes PR #7

Open · wants to merge 42 commits into base: main

Conversation

@jbusecke commented Jun 3, 2024

Testing pangeo-forge/pangeo-forge-recipes#750.

I did the following here:

  • Used the PR branch of pangeo-forge-recipes
  • Deactivated the cache in the config (how can I conclusively see that the OpenURLWithFSSpec stage is not caching 'again'?)
  • Added the CheckpointFileTransfer stage with a new cache dir to confirm it is working (see the sketch after this list).
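
For context, a rough sketch of how the stage was wired into the recipe. The CheckpointFileTransfer call is left commented out because its exact signature comes from the PR under test (pangeo-forge/pangeo-forge-recipes#750); the cache path and surrounding stages mirror the pipeline names in the error logs below, and pattern stands in for the feedstock's FilePattern:

import apache_beam as beam
from pangeo_forge_recipes.transforms import OpenURLWithFSSpec, OpenWithXarray

# New cache dir for the transfer stage; initially passed as a plain URL string
# (which, as discussed below, turned out to be the source of the errors).
transfer_cache = "gs://leap-scratch/data-library/feedstocks/cache_concurrent"

recipe = (
    beam.Create(pattern.items())  # pattern: the feedstock's FilePattern
    # | CheckpointFileTransfer(transfer_cache)  # stage under test, per pangeo-forge-recipes#750
    | OpenURLWithFSSpec(cache=None)  # cache deactivated so this stage does not cache 'again'
    | OpenWithXarray()
    # ... ExpandTimeDimAndAddMetadata, StoreToZarr, ConsolidateDimensionCoordinates, ConsolidateMetadata
)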

Todo:

@jbusecke commented Jun 3, 2024

Getting some errors like this:

FileNotFoundError: [Errno 2] No such file or directory: '//leap-scratch/data-library/feedstocks/cache_concurrent/9530739710fbcf2b76dfc53b9015733e-https_huggingface.co_datasets_leap_climsim_low-res_resolve_main_train_0001-10_e3sm-mmf.mli.0001-10-31-45600.nc' [while running 'Create|CheckpointFileTransfer|OpenURLWithFSSpec|OpenWithXarray|ExpandTimeDimAndAddMetadata|StoreToZarr|InjectAttrs|ConsolidateDimensionCoordinates|ConsolidateMetadata|Copy/OpenWithXarray/Open with Xarray-ptransform-69']
Traceback (most recent call last):
  File "apache_beam/runners/common.py", line 1435, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 640, in apache_beam.runners.common.SimpleInvoker.invoke_process
  File "/tmp/49c62bd385aca3688d1a12714f2750fe8c1ff62820958899edc077a7a1b05cccc61vc7xo/lib/python3.10/site-packages/apache_beam/transforms/core.py", line 2046, in <lambda>
  File "/tmp/49c62bd385aca3688d1a12714f2750fe8c1ff62820958899edc077a7a1b05cccc61vc7xo/lib/python3.10/site-packages/pangeo_forge_recipes/transforms.py", line 321, in <lambda>
  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/pangeo_forge_recipes/openers.py", line 233, in open_with_xarray
    _copy_btw_filesystems(url_or_file_obj, target_opener)
  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/pangeo_forge_recipes/storage.py", line 32, in _copy_btw_filesystems
    with input_opener as source:
  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/fsspec/core.py", line 105, in __enter__
    f = self.fs.open(self.path, mode=mode)
  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/fsspec/spec.py", line 1298, in open
    f = self._open(
  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/fsspec/implementations/local.py", line 191, in _open
    return LocalFileOpener(path, mode, fs=self, **kwargs)
  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/fsspec/implementations/local.py", line 355, in __init__
    self._open()
  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/fsspec/implementations/local.py", line 360, in _open
    self.f = open(self.path, mode=self.mode)
FileNotFoundError: [Errno 2] No such file or directory: '//leap-scratch/data-library/feedstocks/cache_concurrent/9530739710fbcf2b76dfc53b9015733e-https_huggingface.co_datasets_leap_climsim_low-res_resolve_main_train_0001-10_e3sm-mmf.mli.0001-10-31-45600.nc'

I think this is because I only provide a URL string, not a CacheFSSpecTarget object, to the stage.
@moradology maybe we should not allow string input?
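
A minimal sketch of what such a guard could look like (hypothetical code, not the upstream implementation), so that passing a bare string fails loudly instead of silently falling back to the local filesystem:

from pangeo_forge_recipes.storage import CacheFSSpecTarget

def _validate_cache(cache):
    # Hypothetical guard: reject plain URL strings up front.
    if not isinstance(cache, CacheFSSpecTarget):
        raise TypeError(
            f"Expected a CacheFSSpecTarget, got {type(cache).__name__}; "
            "wrap the URL with CacheFSSpecTarget.from_url(...) or construct one explicitly."
        )
    return cache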

@moradology commented Jun 3, 2024

Not a bad idea. Still, I wonder what's going wrong. Later in the process (in the ParDo) a CacheFSSpecTarget is required (https://github.com/pangeo-forge/pangeo-forge-recipes/pull/750/files#diff-8bac120398898793cd4f9daf94551b1f3d3f1867bed8a68b14cceed49d6dc30fR152), but it should be created here in the outer transform: https://github.com/pangeo-forge/pangeo-forge-recipes/pull/750/files#diff-8bac120398898793cd4f9daf94551b1f3d3f1867bed8a68b14cceed49d6dc30fR205-R208

Perhaps it is relevant that the path opens with // rather than gs://? Maybe it is not creating the target appropriately. Actually, yes: a closer look at the trace shows that it is trying to use the local file system rather than Google Cloud Storage, which is clearly what's desired.
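
The protocol inference is easy to check directly with fsspec: if the gs:// prefix is lost, the path resolves to the local filesystem, which matches the trace above (a standalone illustration, not recipe code):

from fsspec.core import url_to_fs

fs, _ = url_to_fs("gs://leap-scratch/data-library/feedstocks/cache_concurrent")
print(type(fs).__name__)  # GCSFileSystem

fs, _ = url_to_fs("//leap-scratch/data-library/feedstocks/cache_concurrent")
print(type(fs).__name__)  # LocalFileSystem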

@jbusecke commented Jun 4, 2024

So weird that this is happening for only some elements! The failures seem reproducible (tied to specific filenames, not random), though: I ran the recipe again and it failed on many of the same files.
Will investigate further later today.

@jbusecke commented Jun 4, 2024

Oh shoot! Wrapping the URL in a CacheFSSpecTarget fixed it! Will bump the concurrency back up and test with the full dataset.

@jbusecke commented Jun 4, 2024

Note that I did not use CacheFSSpecTarget.from_url() but did this instead:

import gcsfs
from pangeo_forge_recipes.storage import CacheFSSpecTarget

cache_target = CacheFSSpecTarget(
    fs=gcsfs.GCSFileSystem(),
    root_path="gs://leap-scratch/data-library/feedstocks/cache_concurrent",
)
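
For reference, the classmethod form that was avoided here (and that the upstream PR may need a test case for) would presumably look like this:

cache_target = CacheFSSpecTarget.from_url(
    "gs://leap-scratch/data-library/feedstocks/cache_concurrent"
)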

@moradology

This finding is super relevant for the upstream PR. I'll see if I can't drum up a test case to reveal the unexpected behavior with .from_url (I'm guessing)

@jbusecke commented Jun 5, 2024

OK, so I was able to run a complete lowres-mli here, with the https-sync patch activated for both the caching and the OpenURLWithFSSpec stage, but I want the download to be faster.

Disabling the https-sync patch and setting the concurrency to 20 gives me a bunch of these:

Name (https) already in the registry and clobber is False [while running 'Create|CheckpointFileTransfer|OpenURLWithFSSpec|OpenWithXarray|ExpandTimeDimAndAddMetadata|StoreToZarr|InjectAttrs|ConsolidateDimensionCoordinates|ConsolidateMetadata|Copy/OpenURLWithFSSpec/MapWithConcurrencyLimit/open_url-ptransform-68']
Traceback (most recent call last):
  File "apache_beam/runners/common.py", line 1435, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 640, in apache_beam.runners.common.SimpleInvoker.invoke_process
  File "/tmp/e3f79e737a8ab2d7b8203218342b4dd2085573636ed42230ccb55a58d8a96f4ep4wjutk7/lib/python3.10/site-packages/apache_beam/transforms/core.py", line 2046, in <lambda>
  File "/tmp/e3f79e737a8ab2d7b8203218342b4dd2085573636ed42230ccb55a58d8a96f4ep4wjutk7/lib/python3.10/site-packages/pangeo_forge_recipes/transforms.py", line 123, in <lambda>
  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/pangeo_forge_recipes/openers.py", line 36, in open_url
    open_file = _get_opener(url, secrets, fsspec_sync_patch, **kw)
  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/pangeo_forge_recipes/storage.py", line 234, in _get_opener
    SyncHTTPFileSystem.overwrite_async_registration()
  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/httpfs_sync/core.py", line 403, in overwrite_async_registration
    register_implementation("https", cls)
  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/fsspec/registry.py", line 53, in register_implementation
    raise ValueError(
ValueError: Name (https) already in the registry and clobber is False

Wondering if this goes away if I reduce the concurrency.
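
For what it's worth, the ValueError comes from fsspec's live implementation registry: once the default async HTTPFileSystem is already registered for "https" in a worker, registering a different class without clobber=True raises exactly this error. A standalone illustration, assuming httpfs_sync is installed as in the trace:

import fsspec
from httpfs_sync.core import SyncHTTPFileSystem

fsspec.filesystem("https")  # instantiating puts the default async HTTPFileSystem in the live registry

# Re-registering a different class for the same protocol without clobber raises:
# ValueError: Name (https) already in the registry and clobber is False
fsspec.register_implementation("https", SyncHTTPFileSystem)

# Passing clobber=True overwrites the existing registration instead of raising.
fsspec.register_implementation("https", SyncHTTPFileSystem, clobber=True)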

@jbusecke commented Jun 5, 2024

This finding is super relevant for the upstream PR. I'll see if I can't drum up a test case to reveal the unexpected behavior with .from_url (I'm guessing)

@moradology should we track this in a separate issue? Just asking since I expect to close this PR soon.

@moradology

Issue up here: pangeo-forge/pangeo-forge-recipes#752

@jbusecke commented Jun 5, 2024

Yoinks, I am all of a sudden getting a lot of failed transfers (for the mlo dataset). Not entirely sure if I am getting rate-limited because I just downloaded 800 GB of data in short succession, or if one of the many alterations here broke something.

Have now submitted a job with reduced concurrency, and will wait until tomorrow to continue.

@jbusecke

I just tried to pin the actual commit hash in the requirements and increase the concurrency (all files are cached right now).
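
For reference, pinning the recipe dependency to a specific commit in the feedstock's requirements file looks like this (the hash is a placeholder, not the actual commit used):

git+https://github.com/pangeo-forge/pangeo-forge-recipes.git@<commit-hash>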
