Test CheckpointFileTransfer from recipes PR #7

Open · wants to merge 42 commits into base: main

Conversation

@jbusecke commented Jun 3, 2024

Testing pangeo-forge/pangeo-forge-recipes#750.

I did the following here:

  • Used the PR branch of pangeo-forge-recipes
  • Deactivated the cache in the config (how can I conclusively see that the OpenURLWithFSSpec stage is not caching 'again'?)
  • Added the CheckpointFileTransfer stage with a new cache dir to confirm it is working (see the sketch after this list).
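
For context, a rough sketch of how the stage was wired into the recipe. The CheckpointFileTransfer call is left commented out because its exact signature comes from the PR under test (pangeo-forge/pangeo-forge-recipes#750); the cache path and surrounding stages mirror the pipeline names in the error logs below, and pattern stands in for the feedstock's FilePattern:

import apache_beam as beam
from pangeo_forge_recipes.transforms import OpenURLWithFSSpec, OpenWithXarray

# New cache dir for the transfer stage; initially passed as a plain URL string
# (which, as discussed below, turned out to be the source of the errors).
transfer_cache = "gs://leap-scratch/data-library/feedstocks/cache_concurrent"

recipe = (
    beam.Create(pattern.items())  # pattern: the feedstock's FilePattern
    # | CheckpointFileTransfer(transfer_cache)  # stage under test, per pangeo-forge-recipes#750
    | OpenURLWithFSSpec(cache=None)  # cache deactivated so this stage does not cache 'again'
    | OpenWithXarray()
    # ... ExpandTimeDimAndAddMetadata, StoreToZarr, ConsolidateDimensionCoordinates, ConsolidateMetadata
)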

Todo:

@jbusecke commented Jun 3, 2024

Getting some errors like this:

FileNotFoundError: [Errno 2] No such file or directory: '//leap-scratch/data-library/feedstocks/cache_concurrent/9530739710fbcf2b76dfc53b9015733e-https_huggingface.co_datasets_leap_climsim_low-res_resolve_main_train_0001-10_e3sm-mmf.mli.0001-10-31-45600.nc' [while running 'Create|CheckpointFileTransfer|OpenURLWithFSSpec|OpenWithXarray|ExpandTimeDimAndAddMetadata|StoreToZarr|InjectAttrs|ConsolidateDimensionCoordinates|ConsolidateMetadata|Copy/OpenWithXarray/Open with Xarray-ptransform-69']
Traceback (most recent call last):
  File "apache_beam/runners/common.py", line 1435, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 640, in apache_beam.runners.common.SimpleInvoker.invoke_process
  File "/tmp/49c62bd385aca3688d1a12714f2750fe8c1ff62820958899edc077a7a1b05cccc61vc7xo/lib/python3.10/site-packages/apache_beam/transforms/core.py", line 2046, in <lambda>
  File "/tmp/49c62bd385aca3688d1a12714f2750fe8c1ff62820958899edc077a7a1b05cccc61vc7xo/lib/python3.10/site-packages/pangeo_forge_recipes/transforms.py", line 321, in <lambda>
  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/pangeo_forge_recipes/openers.py", line 233, in open_with_xarray
    _copy_btw_filesystems(url_or_file_obj, target_opener)
  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/pangeo_forge_recipes/storage.py", line 32, in _copy_btw_filesystems
    with input_opener as source:
  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/fsspec/core.py", line 105, in __enter__
    f = self.fs.open(self.path, mode=mode)
  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/fsspec/spec.py", line 1298, in open
    f = self._open(
  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/fsspec/implementations/local.py", line 191, in _open
    return LocalFileOpener(path, mode, fs=self, **kwargs)
  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/fsspec/implementations/local.py", line 355, in __init__
    self._open()
  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/fsspec/implementations/local.py", line 360, in _open
    self.f = open(self.path, mode=self.mode)
FileNotFoundError: [Errno 2] No such file or directory: '//leap-scratch/data-library/feedstocks/cache_concurrent/9530739710fbcf2b76dfc53b9015733e-https_huggingface.co_datasets_leap_climsim_low-res_resolve_main_train_0001-10_e3sm-mmf.mli.0001-10-31-45600.nc'

I think this is because I only provide a URL string, not a CacheFSSpecTarget object, to the stage.
@moradology maybe we should not allow string input?
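
A minimal sketch of what such a guard could look like (hypothetical code, not the upstream implementation), so that passing a bare string fails loudly instead of silently falling back to the local filesystem:

from pangeo_forge_recipes.storage import CacheFSSpecTarget

def _validate_cache(cache):
    # Hypothetical guard: reject plain URL strings up front.
    if not isinstance(cache, CacheFSSpecTarget):
        raise TypeError(
            f"Expected a CacheFSSpecTarget, got {type(cache).__name__}; "
            "wrap the URL with CacheFSSpecTarget.from_url(...) or construct one explicitly."
        )
    return cache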

@moradology commented Jun 3, 2024

Not a bad idea. Still, I wonder what's going wrong. Later in the process (in the ParDo) a CacheFSSpecTarget is required (https://github.com/pangeo-forge/pangeo-forge-recipes/pull/750/files#diff-8bac120398898793cd4f9daf94551b1f3d3f1867bed8a68b14cceed49d6dc30fR152), but it should be created here in the outer transform: https://github.com/pangeo-forge/pangeo-forge-recipes/pull/750/files#diff-8bac120398898793cd4f9daf94551b1f3d3f1867bed8a68b14cceed49d6dc30fR205-R208

Perhaps it is relevant that the path opens with // rather than gs://? Maybe it is not creating the target appropriately. Actually, yes: a closer look at the trace shows that it is trying to use the local file system rather than Google Cloud Storage, which is clearly what's desired.
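
The protocol inference is easy to check directly with fsspec: if the gs:// prefix is lost, the path resolves to the local filesystem, which matches the trace above (a standalone illustration, not recipe code):

from fsspec.core import url_to_fs

fs, _ = url_to_fs("gs://leap-scratch/data-library/feedstocks/cache_concurrent")
print(type(fs).__name__)  # GCSFileSystem

fs, _ = url_to_fs("//leap-scratch/data-library/feedstocks/cache_concurrent")
print(type(fs).__name__)  # LocalFileSystem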

@jbusecke commented Jun 4, 2024

So weird that this is happening for only some elements! The failures seem reproducible (tied to specific filenames, not random), though: I ran the recipe again and it failed on many of the same files.
Will investigate further later today.

@jbusecke commented Jun 4, 2024

Oh shoot! Wrapping the URL in a CacheFSSpecTarget fixed it! Will bump the concurrency back up and test with the full dataset.

@jbusecke commented Jun 4, 2024

Note that I did not use CacheFSSpecTarget.from_url() but did this instead:

import gcsfs
from pangeo_forge_recipes.storage import CacheFSSpecTarget

cache_target = CacheFSSpecTarget(
    fs=gcsfs.GCSFileSystem(),
    root_path="gs://leap-scratch/data-library/feedstocks/cache_concurrent",
)
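
For reference, the classmethod form that was avoided here (and that the upstream PR may need a test case for) would presumably look like this:

cache_target = CacheFSSpecTarget.from_url(
    "gs://leap-scratch/data-library/feedstocks/cache_concurrent"
)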

@moradology

This finding is super relevant for the upstream PR. I'll see if I can't drum up a test case to reveal the unexpected behavior with .from_url (I'm guessing)

@jbusecke commented Jun 5, 2024

OK, so I was able to run a complete lowres-mli here, with the https-sync patch activated for both the caching and the OpenURLWithFSSpec stage, but I want the download to be faster.

Disabling the https-sync patch and setting the concurrency to 20 gives me a bunch of these:

Name (https) already in the registry and clobber is False [while running 'Create|CheckpointFileTransfer|OpenURLWithFSSpec|OpenWithXarray|ExpandTimeDimAndAddMetadata|StoreToZarr|InjectAttrs|ConsolidateDimensionCoordinates|ConsolidateMetadata|Copy/OpenURLWithFSSpec/MapWithConcurrencyLimit/open_url-ptransform-68']
Traceback (most recent call last):
  File "apache_beam/runners/common.py", line 1435, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 640, in apache_beam.runners.common.SimpleInvoker.invoke_process
  File "/tmp/e3f79e737a8ab2d7b8203218342b4dd2085573636ed42230ccb55a58d8a96f4ep4wjutk7/lib/python3.10/site-packages/apache_beam/transforms/core.py", line 2046, in <lambda>
  File "/tmp/e3f79e737a8ab2d7b8203218342b4dd2085573636ed42230ccb55a58d8a96f4ep4wjutk7/lib/python3.10/site-packages/pangeo_forge_recipes/transforms.py", line 123, in <lambda>
  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/pangeo_forge_recipes/openers.py", line 36, in open_url
    open_file = _get_opener(url, secrets, fsspec_sync_patch, **kw)
  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/pangeo_forge_recipes/storage.py", line 234, in _get_opener
    SyncHTTPFileSystem.overwrite_async_registration()
  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/httpfs_sync/core.py", line 403, in overwrite_async_registration
    register_implementation("https", cls)
  File "/opt/apache/beam-venv/beam-venv-worker-sdk-0-0/lib/python3.10/site-packages/fsspec/registry.py", line 53, in register_implementation
    raise ValueError(
ValueError: Name (https) already in the registry and clobber is False

Wondering if this goes away if I reduce the concurrency.
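
For what it's worth, the ValueError comes from fsspec's live implementation registry: once the default async HTTPFileSystem is already registered for "https" in a worker, registering a different class without clobber=True raises exactly this error. A standalone illustration, assuming httpfs_sync is installed as in the trace:

import fsspec
from httpfs_sync.core import SyncHTTPFileSystem

fsspec.filesystem("https")  # instantiating puts the default async HTTPFileSystem in the live registry

# Re-registering a different class for the same protocol without clobber raises:
# ValueError: Name (https) already in the registry and clobber is False
fsspec.register_implementation("https", SyncHTTPFileSystem)

# Passing clobber=True overwrites the existing registration instead of raising.
fsspec.register_implementation("https", SyncHTTPFileSystem, clobber=True)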

@jbusecke commented Jun 5, 2024

This finding is super relevant for the upstream PR. I'll see if I can't drum up a test case to reveal the unexpected behavior with .from_url (I'm guessing)

@moradology should we track this in a separate issue? Just asking since I expect to close this PR soon.

@moradology

Issue up here: pangeo-forge/pangeo-forge-recipes#752

@jbusecke commented Jun 5, 2024

Yoinks, I am all of a sudden getting a lot of failed transfers (for the mlo dataset). Not entirely sure if I am getting rate-limited because I just downloaded 800 GB of data in short succession, or if one of the many alterations here broke something.

Have now submitted a job with reduced concurrency, and will wait until tomorrow to continue.

@jbusecke

I just tried to pin the actual commit hash in the requirements and increase the concurrency (all files are cached right now).
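
For reference, pinning the recipe dependency to a specific commit in the feedstock's requirements file looks like this (the hash is a placeholder, not the actual commit used):

git+https://github.com/pangeo-forge/pangeo-forge-recipes.git@<commit-hash>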
