Add Huggingface Integration #916

Open
wants to merge 45 commits into
base: master

Conversation

pranayasinghcsmpl
Contributor

Fixes #727

Proposed Changes

  • Added Hugging Face upload and download functionality as a subcommand.
  • Added the library name, version, and git hash to the Hugging Face tags for uploads.
  • Added functionality to save a copy of config.yaml during training.
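
As a rough illustration of the tagging change, the sketch below collects the version and git hash into a tag list. The helper name `build_hub_tags` and the exact tag format are assumptions made for this example; only the idea of tagging uploads with the version and git hash comes from the description above.

```python
from typing import List, Optional


def build_hub_tags(version: str, git_hash: Optional[str]) -> List[str]:
    """Collect the tags attached to an uploaded model (hypothetical helper)."""
    tags = [version]
    # The git hash may be unavailable (for example, when running from an
    # installed package rather than a checkout), so add it only when present.
    if git_hash:
        tags.append(git_hash)
    return tags
```

The resulting list can then be passed as `tags` alongside `library_name="GaNDLF"` when building the model card metadata.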

Checklist

  • CONTRIBUTING guide has been followed.
  • PR is based on the current GaNDLF master.
  • Non-breaking change (does not break existing functionality): provide as many details as possible for any breaking change.
  • Function/class source code documentation added/updated (ensure typing is used to provide type hints, including and not limited to using Optional if a variable has a pre-defined value).
  • Code has been blacked for style consistency and linting.
  • If applicable, version information has been updated in GANDLF/version.py.
  • If adding a git submodule, add to list of exceptions for black styling in pyproject.toml file.
  • Usage documentation has been updated, if appropriate.
  • Tests added or modified to cover the changes; if coverage is reduced, please give explanation.
  • If customized dependency installation is required (i.e., a separate pip install step is needed for PR to be functional), please ensure it is reflected in all the files that control the CI, namely: python-test.yml, and all docker files [1,2,3].

@pranayasinghcsmpl pranayasinghcsmpl requested a review from a team as a code owner August 14, 2024 12:03
Contributor

github-actions bot commented Aug 14, 2024

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

Collaborator

@sarthakpati sarthakpati left a comment


Before I start reviewing this in earnest, I would need at least the following 2 pieces of information to be added to the PR:

  1. Documentation: (it is absolutely fine to have a bullet point list of items that link to the main HF docs)
  2. Tests

I believe both of these were present in the previous PR.

setup.py (outdated, resolved)
@sarthakpati
Collaborator

Hi @Wauplin and @NielsRogge - this PR looks good from my end. Do you have any feedback?


@Wauplin Wauplin left a comment


Hey there 👋 Thanks for the ping!

I left a few comments from an outsider's point of view. I do think the CLI should be more opinionated (that is, have fewer options and decide things for the user); otherwise we pretty much end up with a CLI close to what huggingface-cli upload and huggingface-cli download already do.

GANDLF/cli/huggingface_hub_handler.py (outdated, resolved)
from pathlib import Path
from GANDLF.utils import get_git_hash

readme_template = """

Is this a simple copy of the model card template found here? If so, I would suggest either:

  • directly reusing the template from huggingface_hub (i.e. ModelCard.from_template(card_data) without the template_str),
  • or defining your own template, in which case you should include only the fields and descriptions relevant to your library (instead of leaving all fields empty).

Contributor Author

Thanks, @Wauplin, for making me aware of this. I will go through it and make the changes you mentioned.

Collaborator

Hi @Wauplin,

We had an internal discussion on the best way to present potential model uploaders with a specific set of required options for the model card. Thus far, we have landed on using a custom model card. The reason for keeping all the fields is to let a user provide more information than we require.

Here, we have put the string "REQUIRED_FOR_GANDLF" for the fields that are explicitly needed for the user to populate, and the rest have been left as present in the template.

In the code, we plan to add 2 checks:

  1. If "REQUIRED_FOR_GANDLF" is found, we present an error to the user saying that this field needs to be populated with appropriate information.
  2. The Repository key should always be https://github.com/mlcommons/GaNDLF.

Thoughts?
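
A minimal sketch of the two checks described above, under stated assumptions: the function name `validate_model_card` is hypothetical, and the card is assumed to carry the repository on a line of the form `- **Repository:** <url>`, which may not match the template actually used in this PR.

```python
import re

REQUIRED_PLACEHOLDER = "REQUIRED_FOR_GANDLF"
EXPECTED_REPOSITORY = "https://github.com/mlcommons/GaNDLF"


def validate_model_card(card_text: str) -> None:
    """Raise ValueError if the card is not ready for upload (hypothetical helper)."""
    # Check 1: no required field may still hold the placeholder string.
    for line_number, line in enumerate(card_text.splitlines(), start=1):
        if REQUIRED_PLACEHOLDER in line:
            raise ValueError(
                f"Line {line_number} still contains {REQUIRED_PLACEHOLDER}; "
                "please fill in this field before uploading."
            )

    # Check 2: the Repository entry must point at the GaNDLF repository.
    # Assumed line format: "- **Repository:** <url>".
    match = re.search(
        r"^\s*-?\s*\*\*Repository:\*\*\s*(\S+)", card_text, flags=re.MULTILINE
    )
    if match is None or match.group(1).rstrip("/") != EXPECTED_REPOSITORY:
        raise ValueError(f"The Repository field must be {EXPECTED_REPOSITORY}.")
```

Running both checks before the upload call means an incomplete card fails locally, before anything reaches the Hub.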


This seems a sensible idea to me yes!

Collaborator

Brilliant, thanks for the confirmation! We'll get on it right away. 😄

Contributor Author

@sarthakpati @Wauplin How can we test this file if we add the upload functionality? We only have entry point tests; do we have to specify a particular directory there?

Collaborator

Perhaps you can leverage one of the existing training tests to test the upload. I would recommend this one, since this would only upload a single model.

Ensure you put an appropriate description for it (such as Unit testing model or something) to make it clear for anyone viewing it. Is there a way to update an existing model, @Wauplin?

GANDLF/cli/huggingface_hub_handler.py (outdated, resolved)
tags += [git_hash]

card_data = ModelCardData(library_name="GaNDLF", tags=tags)
card = ModelCard.from_template(card_data, template_str=readme_template)

See comment above about template_str

Comment on lines +244 to +259
def download_from_hub(
    repo_id: str,
    revision: Union[str, None] = None,
    cache_dir: Union[str, None] = None,
    local_dir: Union[str, None] = None,
    force_download: bool = False,
    token: Union[str, None] = None,
):
    snapshot_download(
        repo_id=repo_id,
        revision=revision,
        cache_dir=cache_dir,
        local_dir=local_dir,
        force_download=force_download,
        token=token,
    )

I am not sure this alias is really needed. I would simply call snapshot_download in other places in the code.

Collaborator

Agree.


I still think the alias is not needed and that snapshot_download could be used by default

Contributor Author

@Wauplin This alias is used to align the Hugging Face download feature with the GaNDLF command-line design pattern. Changing it to call snapshot_download by default may conflict with the command-line arguments.
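
One possible middle ground, sketched under assumptions: the function that receives the CLI options forwards them straight to `snapshot_download`, with no separate handler module in between. The name `download_model` and the `downloader` parameter are hypothetical; the latter exists only so the function can be exercised without network access.

```python
from typing import Callable, Optional


def download_model(
    repo_id: str,
    revision: Optional[str] = None,
    cache_dir: Optional[str] = None,
    local_dir: Optional[str] = None,
    force_download: bool = False,
    token: Optional[str] = None,
    downloader: Optional[Callable[..., str]] = None,
) -> str:
    """Forward CLI options directly to snapshot_download (hypothetical entry point)."""
    if downloader is None:
        # Imported lazily so this module can be imported (and tested)
        # without huggingface_hub installed.
        from huggingface_hub import snapshot_download as downloader

    # snapshot_download returns the local path of the downloaded snapshot.
    return downloader(
        repo_id=repo_id,
        revision=revision,
        cache_dir=cache_dir,
        local_dir=local_dir,
        force_download=force_download,
        token=token,
    )
```

Returning the snapshot path, rather than discarding it, lets the caller report where the files were written.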

GANDLF/cli/huggingface_hub_handler.py (resolved)
GANDLF/cli/huggingface_hub_handler.py (resolved)
GANDLF/cli/huggingface_hub_handler.py (resolved)
@sarthakpati
Collaborator

@pranayasinghcsmpl some lint fixes (unused variables and whatnot) will be needed for this PR. Thanks for taking care of it!

Thanks for your comments and suggestions, @Wauplin!


codecov bot commented Sep 13, 2024

Codecov Report

Attention: Patch coverage is 97.05882% with 4 lines in your changes missing coverage. Please review.

Project coverage is 94.61%. Comparing base (e066e88) to head (5e8a97b).
Report is 15 commits behind head on master.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| testing/test_full.py | 94.23% | 3 Missing ⚠️ |
| GANDLF/entrypoints/hf_hub_integration.py | 96.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #916      +/-   ##
==========================================
+ Coverage   94.58%   94.61%   +0.03%     
==========================================
  Files         161      164       +3     
  Lines        9567     9701     +134     
==========================================
+ Hits         9049     9179     +130     
- Misses        518      522       +4     
Flag Coverage Δ
unittests 94.61% <97.05%> (+0.03%) ⬆️


@sarthakpati
Collaborator

Support ticket generated with Codacy to explore the coverage issue.

@sarthakpati
Collaborator

Codacy folks suggested not to use coverage reporter for anything coming in from other forks 🙄

Anyway, we should be good to go from my end. @Wauplin is this PR good to merge for you?


@Wauplin Wauplin left a comment


Integration looks good yes :) I left a few comments but nothing blocking on my side.
Thanks for the iterations!

GANDLF/cli/huggingface_hub_handler.py (outdated, resolved)
GANDLF/cli/huggingface_hub_handler.py (resolved)
GANDLF/cli/huggingface_hub_handler.py (outdated, resolved)
GANDLF/entrypoints/hf_hub_integration.py (outdated, resolved)
Comment on lines 88 to 93
@click.option(
    "--hf-template",
    "-hft",
    help="Path to the model card template; required when uploading a model",
    type=click.Path(exists=True, file_okay=True, dir_okay=False),
)

Maybe default it to hugging_face.md to reduce friction? Users are free to provide another template if they want but having one by default should reduce friction and help grow usage.

Contributor Author

So we have to provide a default path for the Hugging Face template, so that whenever a user wants to upload a model, that template is used by default?


That's what I suggest to do yes. There is already a hugging_face.md template in this PR so I was suggesting to reuse it.

Contributor Author

OK, thanks for your suggestion, @Wauplin. We will do that.
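
The default-template behaviour agreed on in this thread could be sketched as follows. The helper name `resolve_template` and the `template_dir` argument are assumptions made for illustration; where `hugging_face.md` actually lives in the package is not stated in this conversation.

```python
from pathlib import Path
from typing import Optional

DEFAULT_TEMPLATE_NAME = "hugging_face.md"


def resolve_template(user_path: Optional[str], template_dir: Path) -> Path:
    """Return the user's template if given, else the bundled default (hypothetical helper)."""
    # Fall back to the bundled template when no --hf-template value is supplied.
    template = Path(user_path) if user_path else template_dir / DEFAULT_TEMPLATE_NAME
    if not template.is_file():
        raise FileNotFoundError(f"Model card template not found: {template}")
    return template
```

With a fallback like this, `--hf-template` can become optional: users who omit it get the bundled template, and users who pass a path override it.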

setup.py (outdated, resolved)
Development

Successfully merging this pull request may close these issues.

Add Hugging Face Hub integration
3 participants