Investigate Deepspeed/HuggingFace slowness in finetuner #171

wbrown opened this issue Mar 30, 2023 · 3 comments
wbrown commented Mar 30, 2023

DeepSpeed and HuggingFace appear to be slowing training down significantly. We should investigate why; it may be the optimizer states.


wbrown commented Apr 27, 2023

@harubaru Where are we with this? Did the performance reporting you did yield any insights?

harubaru commented

The investigation I have done has mainly revolved around using different ZeRO stages and trying out different hyperparameters. Different optimizers could not be tested because the base Torch image for the trainer lacks a proper NCCL dependency. That aside, here are some of the things that would most definitely improve training speed:

  • Using a different optimizer. Currently, we use DeepSpeed's CPU AdamW optimizer, which means all of the optimizer states are stored in float32. That much precision is not really necessary, and if we can shrink the states via 8-bit quantization, we can definitely speed up each optimization step. We already support 8-bit AdamW in the Stable Diffusion finetuning example, but it would take a bit of effort to port it so it is usable with DeepSpeed (see the sketch after this list).
  • Increasing Gradient Accumulation Steps (aka GAS). Increasing GAS emulates training at a higher batch size. This does increase the time per step, but samples per second increase as well, so throughput goes up with higher GAS (see the perf table below).
  • Sacrificing offloading for performance. ZeRO stages such as stage 3 offload both the model parameters and the optimizer state (which already runs on CPU) to system RAM to save VRAM. If we can fit both the parameters and the optimizer states into VRAM, we would completely eliminate the communication bottleneck between the GPUs and system RAM. The only downside, of course, is that we would no longer be able to finetune large models affordably.
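
As a rough illustration of the first point, here is a minimal sketch of swapping DeepSpeed's CPU AdamW for a bitsandbytes 8-bit AdamW by passing a client optimizer to deepspeed.initialize. The model, config values, and hyperparameters below are placeholders, not the finetuner's actual settings:

```python
import bitsandbytes as bnb
import deepspeed
import torch

# Stand-in for the actual HuggingFace model used by the finetuner.
model = torch.nn.Linear(1024, 1024)

# 8-bit optimizer states instead of DeepSpeed's float32 CPU AdamW.
optimizer = bnb.optim.AdamW8bit(model.parameters(), lr=1e-5, weight_decay=0.1)

# Illustrative config; when a client optimizer is passed in, the DeepSpeed
# "optimizer" section is omitted from the config entirely.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {"stage": 2},
    "bf16": {"enabled": True},
}

engine, optimizer, _, _ = deepspeed.initialize(
    model=model, optimizer=optimizer, config=ds_config
)
```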

These are also the changeable factors (meaning: variables that can be adjusted through the workflow) that affect training speed; the config sketch below the perf table shows where each one is set:

  1. The ZeRO stage. Training performance is less affected by the ZeRO stage used than by the number of Gradient Accumulation Steps.
  2. The number of Gradient Accumulation Steps. Higher GAS means higher throughput.
  3. The batch size. A higher batch size also increases overall throughput.
| Run Name     | GAS Time (s) | OPT Time (s) | World Samples per Second | Rank Samples per Second | Total Time per Step (s) |
|--------------|--------------|--------------|--------------------------|-------------------------|-------------------------|
| zero_stage_3 | 58.096       | 25.079       | 0.4688                   | 0.4688                  | 83.892                  |
| zero_stage_2 | 51.642       | 25.455       | 0.5082                   | 0.5082                  | 78.514                  |
| zero_stage_1 | 51.863       | 27.691       | 0.5018                   | 0.5018                  | 79.112                  |
| gas-5*       | 68.412       | 23.677       | 0.4384                   | 0.2209                  | 91.512                  |
| gas-2*       | 17.238       | 23.498       | 0.3899                   | 0.1976                  | 40.524                  |

* The GAS timing runs use two GPUs instead of one. They also use a different version of the finetuner that is currently being tested in #128, so those tests will have to be rerun, but the recommendations above for improving training speed should hold regardless.
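
For reference, all three of those factors live in the DeepSpeed config passed to the trainer. A minimal sketch of where each knob is set, and of what dropping offloading looks like; the keys are standard DeepSpeed options, but the values are illustrative rather than our defaults:

```python
# Illustrative DeepSpeed config sketch; values are placeholders, not the
# finetuner's actual settings.
ds_config = {
    "train_micro_batch_size_per_gpu": 2,   # factor 3: per-GPU batch size
    "gradient_accumulation_steps": 5,      # factor 2: higher GAS -> higher throughput
    "zero_optimization": {
        "stage": 2,                        # factor 1: ZeRO stage
        # Leaving these offload sections out keeps parameters and optimizer
        # states in VRAM, trading memory headroom for less GPU <-> system RAM
        # traffic (offload_param is only meaningful at stage 3).
        # "offload_optimizer": {"device": "cpu"},
        # "offload_param": {"device": "cpu"},
    },
}
```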

For future work, we should definitely look into using a different optimizer, as CPU AdamW has a ridiculously high performance overhead. There are also other methods we could try, such as incorporating flash-attention and using fused kernels for the optimizers, which would decrease memory usage further. The former requires a lot of monkey patching, while the latter needs more investigation, since DeepSpeed does support fused Adam out of the box.
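
On the fused-kernel point, once the NCCL dependency is sorted out, trying DeepSpeed's shipped fused Adam would be a small change. A hedged sketch, assuming the optimizer states stay in VRAM (no CPU offload) and a CUDA-capable GPU is available; the model and hyperparameters are placeholders:

```python
import torch
from deepspeed.ops.adam import FusedAdam

# Stand-in model; in practice this would be the HuggingFace model being finetuned.
model = torch.nn.Linear(1024, 1024).cuda()

# GPU fused Adam with AdamW-style decoupled weight decay.
optimizer = FusedAdam(model.parameters(), lr=1e-5, adam_w_mode=True, weight_decay=0.1)
```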


wbrown commented Jul 17, 2023

Marking this as done, as the investigation is complete. Should we write a follow-up issue for using a different optimizer?
