Multi-machine Multi-GPU Distributed Training #68480

Open
xlg-go opened this issue Sep 26, 2024 · 3 comments

xlg-go commented Sep 26, 2024

Describe the Bug

PaddlePaddle/PaddleOCR#13912

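The setup runs in Docker containers on two machines, a master (134) and a slave (131). The containers are started with --ipc=host --network=host --gpus all. The network interface for NCCL communication is set explicitly on each machine, passwordless SSH is configured in both directions (SSH port 22), and the two machines can ping each other.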

export NCCL_DEBUG=INFO
export NCCL_IB_DISABLE=1
export NCCL_SOCKET_IFNAME=eno1        # on the master (192.168.8.134)
export NCCL_SOCKET_IFNAME=enp96s0f0   # on the slave (192.168.8.131)
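
The interface name passed to NCCL_SOCKET_IFNAME has to be the one that carries the node's address from --ips, and it differs between the two machines here. A minimal sketch of how that name could be looked up, assuming a Linux host with the `ip` tool and its one-line output format (`ip -o -4 addr show`); the helper names `iface_for_ip` and `local_iface` are hypothetical and not part of Paddle or NCCL:

```python
import subprocess


def iface_for_ip(ip_addr_output: str, ipv4: str):
    """Return the interface name that carries `ipv4`, or None if not found.

    `ip_addr_output` is assumed to be the text printed by
    `ip -o -4 addr show`, one address per line, laid out as
    "<index>: <ifname> inet <addr>/<prefix> ...".
    """
    for line in ip_addr_output.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[2] == "inet":
            if fields[3].split("/")[0] == ipv4:
                return fields[1]
    return None


def local_iface(ipv4: str):
    """Query the local machine; requires the `ip` command to be installed."""
    out = subprocess.run(
        ["ip", "-o", "-4", "addr", "show"],
        capture_output=True, text=True, check=True,
    ).stdout
    return iface_for_ip(out, ipv4)
```

Calling `local_iface("192.168.8.134")` on the master would be expected to return the name to export as NCCL_SOCKET_IFNAME there; the NCCL log lines later in this report show the interface NCCL actually bound on each node.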

Reference: https://paddlepaddle.github.io/PaddleOCR/ppocr/blog/distributed_training.html

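However, both machines hang once they reach NCCL INFO Using network Socket. On each machine the selected GPU only holds a small amount of memory and shows no utilization at all: 520 MB on the master and 410 MB on the slave.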
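
A successful ping only exercises ICMP, while the launcher's rendezvous and NCCL's socket transport both use TCP. A minimal sketch, assuming a hypothetical helper `can_connect`, for checking from one node whether a TCP port on the other node accepts connections. Port 6070 is the `start_port` printed in the launch log below; the ports NCCL opens for its own traffic are not fixed, so a passing check here does not rule out a firewall problem:

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

For example, while the job is hanging, `can_connect("192.168.8.134", 6070)` could be run on the slave. Checking the reverse direction (master to slave) needs something listening on the slave, for instance a temporary listener started on a port of your choice.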

Command run on the master:

python -m paddle.distributed.launch --ips="192.168.8.134,192.168.8.131"  --gpus 3 tools/train.py -c ./DIY/configs/rec/rec_svtrnet-hw_english_word.yml

Command run on the slave:

python -m paddle.distributed.launch --ips="192.168.8.134,192.168.8.131"  --gpus 3 tools/train.py -c ./DIY/configs/rec/rec_svtrnet-hw_english_word.yml

Additional Supplementary Information

🏃‍♂️ Environment

OS: Docker, Ubuntu 20.04
paddleocr: 0.1.0.dev0+d20240926
paddlepaddle-gpu: 3.0.0.dev20240925
cuda: 12.3
nccl: 2.19.3+cu12.3

🌰 Minimal Reproducible Example

  1. Master node (192.168.8.134)

(py310_ppocr) ai@hf-13f-gpu-134:/workspace3/code/paddle-ocr-contribute$ python -m paddle.distributed.launch --ips="192.168.8.134,192.168.8.131"  --gpus 3 tools/train.py -c ./DIY/configs/rec/rec_svtrnet-hw_english_word.yml
/home/ai/miniconda3/envs/py310_ppocr/lib/python3.10/site-packages/paddle/utils/cpp_extension/extension_utils.py:686: UserWarning: No ccache found. Please be aware that recompiling all source files may be required. You can download and install ccache from: https://github.com/ccache/ccache/blob/master/doc/INSTALL.md
  warnings.warn(warning_message)
LAUNCH INFO 2024-09-26 09:28:01,525 -----------  Configuration  ----------------------
LAUNCH INFO 2024-09-26 09:28:01,525 auto_cluster_config: 0
LAUNCH INFO 2024-09-26 09:28:01,525 auto_parallel_config: None
LAUNCH INFO 2024-09-26 09:28:01,525 auto_tuner_json: None
LAUNCH INFO 2024-09-26 09:28:01,525 devices: 3
LAUNCH INFO 2024-09-26 09:28:01,525 elastic_level: -1
LAUNCH INFO 2024-09-26 09:28:01,525 elastic_timeout: 30
LAUNCH INFO 2024-09-26 09:28:01,525 enable_gpu_log: True
LAUNCH INFO 2024-09-26 09:28:01,525 gloo_port: 6767
LAUNCH INFO 2024-09-26 09:28:01,525 host: None
LAUNCH INFO 2024-09-26 09:28:01,525 ips: 192.168.8.134,192.168.8.131
LAUNCH INFO 2024-09-26 09:28:01,525 job_id: default
LAUNCH INFO 2024-09-26 09:28:01,525 legacy: False
LAUNCH INFO 2024-09-26 09:28:01,525 log_dir: log
LAUNCH INFO 2024-09-26 09:28:01,525 log_level: INFO
LAUNCH INFO 2024-09-26 09:28:01,525 log_overwrite: False
LAUNCH INFO 2024-09-26 09:28:01,525 master: None
LAUNCH INFO 2024-09-26 09:28:01,525 max_restart: 3
LAUNCH INFO 2024-09-26 09:28:01,525 nnodes: 1
LAUNCH INFO 2024-09-26 09:28:01,525 nproc_per_node: None
LAUNCH INFO 2024-09-26 09:28:01,525 rank: -1
LAUNCH INFO 2024-09-26 09:28:01,525 run_mode: collective
LAUNCH INFO 2024-09-26 09:28:01,525 server_num: None
LAUNCH INFO 2024-09-26 09:28:01,525 servers: 
LAUNCH INFO 2024-09-26 09:28:01,525 sort_ip: False
LAUNCH INFO 2024-09-26 09:28:01,525 start_port: 6070
LAUNCH INFO 2024-09-26 09:28:01,525 trainer_num: None
LAUNCH INFO 2024-09-26 09:28:01,525 trainers: 
LAUNCH INFO 2024-09-26 09:28:01,525 training_script: tools/train.py
LAUNCH INFO 2024-09-26 09:28:01,525 training_script_args: ['-c', './DIY/configs/rec/rec_svtrnet-hw_english_word.yml']
LAUNCH INFO 2024-09-26 09:28:01,525 with_gloo: 1
LAUNCH INFO 2024-09-26 09:28:01,526 --------------------------------------------------
LAUNCH INFO 2024-09-26 09:28:01,526 Job: default, mode collective, replicas 1[1:1], elastic False
LAUNCH INFO 2024-09-26 09:28:01,527 Run Pod: pcelng, replicas 1, status ready
LAUNCH INFO 2024-09-26 09:28:01,545 Watching Pod: pcelng, replicas 1, status running
/home/ai/miniconda3/envs/py310_ppocr/lib/python3.10/site-packages/paddle/utils/cpp_extension/extension_utils.py:686: UserWarning: No ccache found. Please be aware that recompiling all source files may be required. You can download and install ccache from: https://github.com/ccache/ccache/blob/master/doc/INSTALL.md
  warnings.warn(warning_message)
[2024/09/26 09:28:03] ppocr INFO: Architecture : 
[2024/09/26 09:28:03] ppocr INFO:     Backbone : 
[2024/09/26 09:28:03] ppocr INFO:         depth : [3, 6, 3]
[2024/09/26 09:28:03] ppocr INFO:         embed_dim : [64, 128, 256]
[2024/09/26 09:28:03] ppocr INFO:         img_size : [32, 600]
[2024/09/26 09:28:03] ppocr INFO:         last_stage : True
[2024/09/26 09:28:03] ppocr INFO:         local_mixer : [[7, 11], [7, 11], [7, 11]]
[2024/09/26 09:28:03] ppocr INFO:         mixer : ['Local', 'Local', 'Local', 'Local', 'Local', 'Local', 'Global', 'Global', 'Global', 'Global', 'Global', 'Global']
[2024/09/26 09:28:03] ppocr INFO:         name : SVTRNet
[2024/09/26 09:28:03] ppocr INFO:         num_heads : [2, 4, 8]
[2024/09/26 09:28:03] ppocr INFO:         out_channels : 192
[2024/09/26 09:28:03] ppocr INFO:         out_char_num : 50
[2024/09/26 09:28:03] ppocr INFO:         patch_merging : Conv
[2024/09/26 09:28:03] ppocr INFO:         prenorm : False
[2024/09/26 09:28:03] ppocr INFO:     Head : 
[2024/09/26 09:28:03] ppocr INFO:         name : CTCHead
[2024/09/26 09:28:03] ppocr INFO:     Neck : 
[2024/09/26 09:28:03] ppocr INFO:         encoder_type : reshape
[2024/09/26 09:28:03] ppocr INFO:         name : SequenceEncoder
[2024/09/26 09:28:03] ppocr INFO:     Transform : 
[2024/09/26 09:28:03] ppocr INFO:         name : STN_ON
[2024/09/26 09:28:03] ppocr INFO:         num_control_points : 20
[2024/09/26 09:28:03] ppocr INFO:         stn_activation : none
[2024/09/26 09:28:03] ppocr INFO:         tps_inputsize : [32, 64]
[2024/09/26 09:28:03] ppocr INFO:         tps_margins : [0.05, 0.05]
[2024/09/26 09:28:03] ppocr INFO:         tps_outputsize : [32, 600]
[2024/09/26 09:28:03] ppocr INFO:     algorithm : SVTR
[2024/09/26 09:28:03] ppocr INFO:     model_type : rec
[2024/09/26 09:28:03] ppocr INFO: Eval : 
[2024/09/26 09:28:03] ppocr INFO:     dataset : 
[2024/09/26 09:28:03] ppocr INFO:         data_dir : ./datasets/hw_english_word_dictation-rec
[2024/09/26 09:28:03] ppocr INFO:         label_file_list : ['./datasets/hw_english_word_dictation-rec/train_add_val_test.txt']
[2024/09/26 09:28:03] ppocr INFO:         name : SimpleDataSet
[2024/09/26 09:28:03] ppocr INFO:         transforms : 
[2024/09/26 09:28:03] ppocr INFO:             DecodeImage : 
[2024/09/26 09:28:03] ppocr INFO:                 channel_first : False
[2024/09/26 09:28:03] ppocr INFO:                 img_mode : BGR
[2024/09/26 09:28:03] ppocr INFO:             CTCLabelEncode : None
[2024/09/26 09:28:03] ppocr INFO:             SVTRRecResizeImg : 
[2024/09/26 09:28:03] ppocr INFO:                 image_shape : [3, 32, 600]
[2024/09/26 09:28:03] ppocr INFO:                 padding : True
[2024/09/26 09:28:03] ppocr INFO:             KeepKeys : 
[2024/09/26 09:28:03] ppocr INFO:                 keep_keys : ['image', 'label', 'length']
[2024/09/26 09:28:03] ppocr INFO:     loader : 
[2024/09/26 09:28:03] ppocr INFO:         batch_size_per_card : 100
[2024/09/26 09:28:03] ppocr INFO:         drop_last : False
[2024/09/26 09:28:03] ppocr INFO:         num_workers : 12
[2024/09/26 09:28:03] ppocr INFO:         shuffle : False
[2024/09/26 09:28:03] ppocr INFO: Global : 
[2024/09/26 09:28:03] ppocr INFO:     cal_metric_during_train : True
[2024/09/26 09:28:03] ppocr INFO:     character_dict_path : ./DIY/character/hw_english_word.txt
[2024/09/26 09:28:03] ppocr INFO:     character_type : en
[2024/09/26 09:28:03] ppocr INFO:     checkpoints : None
[2024/09/26 09:28:03] ppocr INFO:     d2s_train_image_shape : [3, 32, 600]
[2024/09/26 09:28:03] ppocr INFO:     distributed : True
[2024/09/26 09:28:03] ppocr INFO:     epoch_num : 150
[2024/09/26 09:28:03] ppocr INFO:     eval_batch_step : [0, 9736]
[2024/09/26 09:28:03] ppocr INFO:     infer_img : ./datasets/hw_score_data/images_08_test/
[2024/09/26 09:28:03] ppocr INFO:     infer_mode : False
[2024/09/26 09:28:03] ppocr INFO:     log_smooth_window : 1
[2024/09/26 09:28:03] ppocr INFO:     max_text_length : 50
[2024/09/26 09:28:03] ppocr INFO:     pretrained_model : None
[2024/09/26 09:28:03] ppocr INFO:     print_batch_step : 1
[2024/09/26 09:28:03] ppocr INFO:     save_epoch_step : 1
[2024/09/26 09:28:03] ppocr INFO:     save_inference_dir : ./output/rec_svtrnet-hw_english_word/infer_model/
[2024/09/26 09:28:03] ppocr INFO:     save_model_dir : ./output/rec_svtrnet-hw_english_word/
[2024/09/26 09:28:03] ppocr INFO:     save_res_path : ./output/rec_svtrnet-hw_english_word/rec/predicts_rec_svtrnet-hw_english_word.txt
[2024/09/26 09:28:03] ppocr INFO:     use_gpu : True
[2024/09/26 09:28:03] ppocr INFO:     use_space_char : True
[2024/09/26 09:28:03] ppocr INFO:     use_visualdl : False
[2024/09/26 09:28:03] ppocr INFO: Loss : 
[2024/09/26 09:28:03] ppocr INFO:     name : CTCLoss
[2024/09/26 09:28:03] ppocr INFO: Metric : 
[2024/09/26 09:28:03] ppocr INFO:     main_indicator : acc
[2024/09/26 09:28:03] ppocr INFO:     name : RecMetric
[2024/09/26 09:28:03] ppocr INFO: Optimizer : 
[2024/09/26 09:28:03] ppocr INFO:     beta1 : 0.9
[2024/09/26 09:28:03] ppocr INFO:     beta2 : 0.99
[2024/09/26 09:28:03] ppocr INFO:     epsilon : 1e-08
[2024/09/26 09:28:03] ppocr INFO:     lr : 
[2024/09/26 09:28:03] ppocr INFO:         learning_rate : 0.0005
[2024/09/26 09:28:03] ppocr INFO:         name : Cosine
[2024/09/26 09:28:03] ppocr INFO:         warmup_epoch : 2
[2024/09/26 09:28:03] ppocr INFO:     name : AdamW
[2024/09/26 09:28:03] ppocr INFO:     no_weight_decay_name : norm pos_embed
[2024/09/26 09:28:03] ppocr INFO:     one_dim_param_no_weight_decay : True
[2024/09/26 09:28:03] ppocr INFO:     weight_decay : 0.05
[2024/09/26 09:28:03] ppocr INFO: PostProcess : 
[2024/09/26 09:28:03] ppocr INFO:     name : CTCLabelDecode
[2024/09/26 09:28:03] ppocr INFO: Train : 
[2024/09/26 09:28:03] ppocr INFO:     dataset : 
[2024/09/26 09:28:03] ppocr INFO:         data_dir : ./datasets/hw_english_word_dictation-rec
[2024/09/26 09:28:03] ppocr INFO:         label_file_list : ['./datasets/hw_english_word_dictation-rec/train_add_val_test.txt']
[2024/09/26 09:28:03] ppocr INFO:         name : SimpleDataSet
[2024/09/26 09:28:03] ppocr INFO:         ratio_list : [1]
[2024/09/26 09:28:03] ppocr INFO:         transforms : 
[2024/09/26 09:28:03] ppocr INFO:             DecodeImage : 
[2024/09/26 09:28:03] ppocr INFO:                 channel_first : False
[2024/09/26 09:28:03] ppocr INFO:                 img_mode : BGR
[2024/09/26 09:28:03] ppocr INFO:             CTCLabelEncode : None
[2024/09/26 09:28:03] ppocr INFO:             SVTRRecResizeImg : 
[2024/09/26 09:28:03] ppocr INFO:                 image_shape : [3, 32, 600]
[2024/09/26 09:28:03] ppocr INFO:                 padding : True
[2024/09/26 09:28:03] ppocr INFO:             KeepKeys : 
[2024/09/26 09:28:03] ppocr INFO:                 keep_keys : ['image', 'label', 'length']
[2024/09/26 09:28:03] ppocr INFO:     loader : 
[2024/09/26 09:28:03] ppocr INFO:         batch_size_per_card : 100
[2024/09/26 09:28:03] ppocr INFO:         drop_last : False
[2024/09/26 09:28:03] ppocr INFO:         num_workers : 12
[2024/09/26 09:28:03] ppocr INFO:         shuffle : True
[2024/09/26 09:28:03] ppocr INFO: profiler_options : None
[2024/09/26 09:28:03] ppocr INFO: train with paddle 3.0.0 and device Place(gpu:3)
======================= Modified FLAGS detected =======================
FLAGS(name='FLAGS_cudnn_dir', current_value='/home/ai/miniconda3/envs/py310_ppocr/lib/python3.10/site-packages/paddle/../nvidia/cudnn/lib', default_value='')
FLAGS(name='FLAGS_cublas_dir', current_value='/home/ai/miniconda3/envs/py310_ppocr/lib/python3.10/site-packages/paddle/../nvidia/cublas/lib', default_value='')
FLAGS(name='FLAGS_enable_pir_in_executor', current_value=True, default_value=False)
FLAGS(name='FLAGS_selected_gpus', current_value='3', default_value='')
FLAGS(name='FLAGS_cusolver_dir', current_value='/home/ai/miniconda3/envs/py310_ppocr/lib/python3.10/site-packages/paddle/../nvidia/cusolver/lib', default_value='')
FLAGS(name='FLAGS_nvidia_package_dir', current_value='/home/ai/miniconda3/envs/py310_ppocr/lib/python3.10/site-packages/paddle/../nvidia', default_value='')
FLAGS(name='FLAGS_nccl_dir', current_value='/home/ai/miniconda3/envs/py310_ppocr/lib/python3.10/site-packages/paddle/../nvidia/nccl/lib', default_value='')
FLAGS(name='FLAGS_cusparse_dir', current_value='/home/ai/miniconda3/envs/py310_ppocr/lib/python3.10/site-packages/paddle/../nvidia/cusparse/lib', default_value='')
FLAGS(name='FLAGS_cupti_dir', current_value='/home/ai/miniconda3/envs/py310_ppocr/lib/python3.10/site-packages/paddle/../nvidia/cuda_cupti/lib', default_value='')
FLAGS(name='FLAGS_curand_dir', current_value='/home/ai/miniconda3/envs/py310_ppocr/lib/python3.10/site-packages/paddle/../nvidia/curand/lib', default_value='')
=======================================================================
I0926 09:28:03.863519 17650 tcp_utils.cc:181] The server starts to listen on IP_ANY:6070
I0926 09:28:03.863718 17650 tcp_utils.cc:130] Successfully connected to 192.168.8.134:6070
I0926 09:28:11.758026 17650 process_group_nccl.cc:150] ProcessGroupNCCL pg_timeout_ 1800000
I0926 09:28:11.758095 17650 process_group_nccl.cc:151] ProcessGroupNCCL nccl_comm_init_option_ 0
[2024/09/26 09:28:12] ppocr INFO: Initialize indexs of datasets:['./datasets/hw_english_word_dictation-rec/train_add_val_test.txt']
[2024/09/26 09:28:12] ppocr INFO: Initialize indexs of datasets:['./datasets/hw_english_word_dictation-rec/train_add_val_test.txt']
W0926 09:28:12.926563 17650 gpu_resources.cc:119] Please NOTE: device: 3, GPU Compute Capability: 8.9, Driver API Version: 12.5, Runtime API Version: 12.3
W0926 09:28:12.928436 17650 gpu_resources.cc:164] device: 3, cuDNN Version: 9.0.
[2024/09/26 09:28:13] ppocr INFO: train dataloader has 1954 iters
[2024/09/26 09:28:13] ppocr INFO: valid dataloader has 3907 iters
[2024/09/26 09:28:13] ppocr INFO: train from scratch
hf-13f-gpu-134:17650:17650 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eno1
hf-13f-gpu-134:17650:17650 [3] NCCL INFO Bootstrap : Using eno1:192.168.8.134<0>
hf-13f-gpu-134:17650:17650 [3] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
hf-13f-gpu-134:17650:17650 [3] NCCL INFO cudaDriverVersion 12050
NCCL version 2.19.3+cuda12.3
hf-13f-gpu-134:17650:18005 [3] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
hf-13f-gpu-134:17650:18005 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eno1
hf-13f-gpu-134:17650:18005 [3] NCCL INFO NET/Socket : Using [0]eno1:192.168.8.134<0>
hf-13f-gpu-134:17650:18005 [3] NCCL INFO Using non-device net plugin version 0
hf-13f-gpu-134:17650:18005 [3] NCCL INFO Using network Socket
  2. Slave node (192.168.8.131)

(py310_ppocr) ai@hf-13f-gpu-131:/workspace2/xlg/code/paddle-ocr-contribute$ python -m paddle.distributed.launch --ips="192.168.8.134,192.168.8.131"  --gpus 3 tools/train.py -c ./DIY/configs/rec/rec_svtrnet-hw_english_word.yml
/home/ai/miniconda3/envs/py310_ppocr/lib/python3.10/site-packages/paddle/utils/cpp_extension/extension_utils.py:686: UserWarning: No ccache found. Please be aware that recompiling all source files may be required. You can download and install ccache from: https://github.com/ccache/ccache/blob/master/doc/INSTALL.md
  warnings.warn(warning_message)
LAUNCH INFO 2024-09-26 09:28:09,243 -----------  Configuration  ----------------------
LAUNCH INFO 2024-09-26 09:28:09,243 auto_cluster_config: 0
LAUNCH INFO 2024-09-26 09:28:09,243 auto_parallel_config: None
LAUNCH INFO 2024-09-26 09:28:09,243 auto_tuner_json: None
LAUNCH INFO 2024-09-26 09:28:09,243 devices: 3
LAUNCH INFO 2024-09-26 09:28:09,243 elastic_level: -1
LAUNCH INFO 2024-09-26 09:28:09,243 elastic_timeout: 30
LAUNCH INFO 2024-09-26 09:28:09,243 enable_gpu_log: True
LAUNCH INFO 2024-09-26 09:28:09,243 gloo_port: 6767
LAUNCH INFO 2024-09-26 09:28:09,243 host: None
LAUNCH INFO 2024-09-26 09:28:09,243 ips: 192.168.8.134,192.168.8.131
LAUNCH INFO 2024-09-26 09:28:09,243 job_id: default
LAUNCH INFO 2024-09-26 09:28:09,243 legacy: False
LAUNCH INFO 2024-09-26 09:28:09,243 log_dir: log
LAUNCH INFO 2024-09-26 09:28:09,243 log_level: INFO
LAUNCH INFO 2024-09-26 09:28:09,243 log_overwrite: False
LAUNCH INFO 2024-09-26 09:28:09,243 master: None
LAUNCH INFO 2024-09-26 09:28:09,243 max_restart: 3
LAUNCH INFO 2024-09-26 09:28:09,243 nnodes: 1
LAUNCH INFO 2024-09-26 09:28:09,243 nproc_per_node: None
LAUNCH INFO 2024-09-26 09:28:09,243 rank: -1
LAUNCH INFO 2024-09-26 09:28:09,243 run_mode: collective
LAUNCH INFO 2024-09-26 09:28:09,243 server_num: None
LAUNCH INFO 2024-09-26 09:28:09,243 servers: 
LAUNCH INFO 2024-09-26 09:28:09,243 sort_ip: False
LAUNCH INFO 2024-09-26 09:28:09,243 start_port: 6070
LAUNCH INFO 2024-09-26 09:28:09,243 trainer_num: None
LAUNCH INFO 2024-09-26 09:28:09,243 trainers: 
LAUNCH INFO 2024-09-26 09:28:09,243 training_script: tools/train.py
LAUNCH INFO 2024-09-26 09:28:09,243 training_script_args: ['-c', './DIY/configs/rec/rec_svtrnet-hw_english_word.yml']
LAUNCH INFO 2024-09-26 09:28:09,243 with_gloo: 1
LAUNCH INFO 2024-09-26 09:28:09,243 --------------------------------------------------
LAUNCH INFO 2024-09-26 09:28:09,244 Job: default, mode collective, replicas 1[1:1], elastic False
LAUNCH INFO 2024-09-26 09:28:09,244 Run Pod: nkpvvc, replicas 1, status ready
LAUNCH INFO 2024-09-26 09:28:09,265 Watching Pod: nkpvvc, replicas 1, status running
/home/ai/miniconda3/envs/py310_ppocr/lib/python3.10/site-packages/paddle/utils/cpp_extension/extension_utils.py:686: UserWarning: No ccache found. Please be aware that recompiling all source files may be required. You can download and install ccache from: https://github.com/ccache/ccache/blob/master/doc/INSTALL.md
  warnings.warn(warning_message)
[2024/09/26 09:28:11] ppocr INFO: Architecture : 
[2024/09/26 09:28:11] ppocr INFO:     Backbone : 
[2024/09/26 09:28:11] ppocr INFO:         depth : [3, 6, 3]
[2024/09/26 09:28:11] ppocr INFO:         embed_dim : [64, 128, 256]
[2024/09/26 09:28:11] ppocr INFO:         img_size : [32, 600]
[2024/09/26 09:28:11] ppocr INFO:         last_stage : True
[2024/09/26 09:28:11] ppocr INFO:         local_mixer : [[7, 11], [7, 11], [7, 11]]
[2024/09/26 09:28:11] ppocr INFO:         mixer : ['Local', 'Local', 'Local', 'Local', 'Local', 'Local', 'Global', 'Global', 'Global', 'Global', 'Global', 'Global']
[2024/09/26 09:28:11] ppocr INFO:         name : SVTRNet
[2024/09/26 09:28:11] ppocr INFO:         num_heads : [2, 4, 8]
[2024/09/26 09:28:11] ppocr INFO:         out_channels : 192
[2024/09/26 09:28:11] ppocr INFO:         out_char_num : 50
[2024/09/26 09:28:11] ppocr INFO:         patch_merging : Conv
[2024/09/26 09:28:11] ppocr INFO:         prenorm : False
[2024/09/26 09:28:11] ppocr INFO:     Head : 
[2024/09/26 09:28:11] ppocr INFO:         name : CTCHead
[2024/09/26 09:28:11] ppocr INFO:     Neck : 
[2024/09/26 09:28:11] ppocr INFO:         encoder_type : reshape
[2024/09/26 09:28:11] ppocr INFO:         name : SequenceEncoder
[2024/09/26 09:28:11] ppocr INFO:     Transform : 
[2024/09/26 09:28:11] ppocr INFO:         name : STN_ON
[2024/09/26 09:28:11] ppocr INFO:         num_control_points : 20
[2024/09/26 09:28:11] ppocr INFO:         stn_activation : none
[2024/09/26 09:28:11] ppocr INFO:         tps_inputsize : [32, 64]
[2024/09/26 09:28:11] ppocr INFO:         tps_margins : [0.05, 0.05]
[2024/09/26 09:28:11] ppocr INFO:         tps_outputsize : [32, 600]
[2024/09/26 09:28:11] ppocr INFO:     algorithm : SVTR
[2024/09/26 09:28:11] ppocr INFO:     model_type : rec
[2024/09/26 09:28:11] ppocr INFO: Eval : 
[2024/09/26 09:28:11] ppocr INFO:     dataset : 
[2024/09/26 09:28:11] ppocr INFO:         data_dir : ./datasets/hw_english_word_dictation-rec
[2024/09/26 09:28:11] ppocr INFO:         label_file_list : ['./datasets/hw_english_word_dictation-rec/train_add_val_test.txt']
[2024/09/26 09:28:11] ppocr INFO:         name : SimpleDataSet
[2024/09/26 09:28:11] ppocr INFO:         transforms : 
[2024/09/26 09:28:11] ppocr INFO:             DecodeImage : 
[2024/09/26 09:28:11] ppocr INFO:                 channel_first : False
[2024/09/26 09:28:11] ppocr INFO:                 img_mode : BGR
[2024/09/26 09:28:11] ppocr INFO:             CTCLabelEncode : None
[2024/09/26 09:28:11] ppocr INFO:             SVTRRecResizeImg : 
[2024/09/26 09:28:11] ppocr INFO:                 image_shape : [3, 32, 600]
[2024/09/26 09:28:11] ppocr INFO:                 padding : True
[2024/09/26 09:28:11] ppocr INFO:             KeepKeys : 
[2024/09/26 09:28:11] ppocr INFO:                 keep_keys : ['image', 'label', 'length']
[2024/09/26 09:28:11] ppocr INFO:     loader : 
[2024/09/26 09:28:11] ppocr INFO:         batch_size_per_card : 100
[2024/09/26 09:28:11] ppocr INFO:         drop_last : False
[2024/09/26 09:28:11] ppocr INFO:         num_workers : 12
[2024/09/26 09:28:11] ppocr INFO:         shuffle : False
[2024/09/26 09:28:11] ppocr INFO: Global : 
[2024/09/26 09:28:11] ppocr INFO:     cal_metric_during_train : True
[2024/09/26 09:28:11] ppocr INFO:     character_dict_path : ./DIY/character/hw_english_word.txt
[2024/09/26 09:28:11] ppocr INFO:     character_type : en
[2024/09/26 09:28:11] ppocr INFO:     checkpoints : None
[2024/09/26 09:28:11] ppocr INFO:     d2s_train_image_shape : [3, 32, 600]
[2024/09/26 09:28:11] ppocr INFO:     distributed : True
[2024/09/26 09:28:11] ppocr INFO:     epoch_num : 150
[2024/09/26 09:28:11] ppocr INFO:     eval_batch_step : [0, 9736]
[2024/09/26 09:28:11] ppocr INFO:     infer_img : ./datasets/hw_score_data/images_08_test/
[2024/09/26 09:28:11] ppocr INFO:     infer_mode : False
[2024/09/26 09:28:11] ppocr INFO:     log_smooth_window : 1
[2024/09/26 09:28:11] ppocr INFO:     max_text_length : 50
[2024/09/26 09:28:11] ppocr INFO:     pretrained_model : None
[2024/09/26 09:28:11] ppocr INFO:     print_batch_step : 1
[2024/09/26 09:28:11] ppocr INFO:     save_epoch_step : 1
[2024/09/26 09:28:11] ppocr INFO:     save_inference_dir : ./output/rec_svtrnet-hw_english_word/infer_model/
[2024/09/26 09:28:11] ppocr INFO:     save_model_dir : ./output/rec_svtrnet-hw_english_word/
[2024/09/26 09:28:11] ppocr INFO:     save_res_path : ./output/rec_svtrnet-hw_english_word/rec/predicts_rec_svtrnet-hw_english_word.txt
[2024/09/26 09:28:11] ppocr INFO:     use_gpu : True
[2024/09/26 09:28:11] ppocr INFO:     use_space_char : True
[2024/09/26 09:28:11] ppocr INFO:     use_visualdl : False
[2024/09/26 09:28:11] ppocr INFO: Loss : 
[2024/09/26 09:28:11] ppocr INFO:     name : CTCLoss
[2024/09/26 09:28:11] ppocr INFO: Metric : 
[2024/09/26 09:28:11] ppocr INFO:     main_indicator : acc
[2024/09/26 09:28:11] ppocr INFO:     name : RecMetric
[2024/09/26 09:28:11] ppocr INFO: Optimizer : 
[2024/09/26 09:28:11] ppocr INFO:     beta1 : 0.9
[2024/09/26 09:28:11] ppocr INFO:     beta2 : 0.99
[2024/09/26 09:28:11] ppocr INFO:     epsilon : 1e-08
[2024/09/26 09:28:11] ppocr INFO:     lr : 
[2024/09/26 09:28:11] ppocr INFO:         learning_rate : 0.0005
[2024/09/26 09:28:11] ppocr INFO:         name : Cosine
[2024/09/26 09:28:11] ppocr INFO:         warmup_epoch : 2
[2024/09/26 09:28:11] ppocr INFO:     name : AdamW
[2024/09/26 09:28:11] ppocr INFO:     no_weight_decay_name : norm pos_embed
[2024/09/26 09:28:11] ppocr INFO:     one_dim_param_no_weight_decay : True
[2024/09/26 09:28:11] ppocr INFO:     weight_decay : 0.05
[2024/09/26 09:28:11] ppocr INFO: PostProcess : 
[2024/09/26 09:28:11] ppocr INFO:     name : CTCLabelDecode
[2024/09/26 09:28:11] ppocr INFO: Train : 
[2024/09/26 09:28:11] ppocr INFO:     dataset : 
[2024/09/26 09:28:11] ppocr INFO:         data_dir : ./datasets/hw_english_word_dictation-rec
[2024/09/26 09:28:11] ppocr INFO:         label_file_list : ['./datasets/hw_english_word_dictation-rec/train_add_val_test.txt']
[2024/09/26 09:28:11] ppocr INFO:         name : SimpleDataSet
[2024/09/26 09:28:11] ppocr INFO:         ratio_list : [1]
[2024/09/26 09:28:11] ppocr INFO:         transforms : 
[2024/09/26 09:28:11] ppocr INFO:             DecodeImage : 
[2024/09/26 09:28:11] ppocr INFO:                 channel_first : False
[2024/09/26 09:28:11] ppocr INFO:                 img_mode : BGR
[2024/09/26 09:28:11] ppocr INFO:             CTCLabelEncode : None
[2024/09/26 09:28:11] ppocr INFO:             SVTRRecResizeImg : 
[2024/09/26 09:28:11] ppocr INFO:                 image_shape : [3, 32, 600]
[2024/09/26 09:28:11] ppocr INFO:                 padding : True
[2024/09/26 09:28:11] ppocr INFO:             KeepKeys : 
[2024/09/26 09:28:11] ppocr INFO:                 keep_keys : ['image', 'label', 'length']
[2024/09/26 09:28:11] ppocr INFO:     loader : 
[2024/09/26 09:28:11] ppocr INFO:         batch_size_per_card : 100
[2024/09/26 09:28:11] ppocr INFO:         drop_last : False
[2024/09/26 09:28:11] ppocr INFO:         num_workers : 12
[2024/09/26 09:28:11] ppocr INFO:         shuffle : True
[2024/09/26 09:28:11] ppocr INFO: profiler_options : None
[2024/09/26 09:28:11] ppocr INFO: train with paddle 3.0.0 and device Place(gpu:3)
======================= Modified FLAGS detected =======================
FLAGS(name='FLAGS_cusparse_dir', current_value='/home/ai/miniconda3/envs/py310_ppocr/lib/python3.10/site-packages/paddle/../nvidia/cusparse/lib', default_value='')
FLAGS(name='FLAGS_cublas_dir', current_value='/home/ai/miniconda3/envs/py310_ppocr/lib/python3.10/site-packages/paddle/../nvidia/cublas/lib', default_value='')
FLAGS(name='FLAGS_enable_pir_in_executor', current_value=True, default_value=False)
FLAGS(name='FLAGS_selected_gpus', current_value='3', default_value='')
FLAGS(name='FLAGS_nccl_dir', current_value='/home/ai/miniconda3/envs/py310_ppocr/lib/python3.10/site-packages/paddle/../nvidia/nccl/lib', default_value='')
FLAGS(name='FLAGS_curand_dir', current_value='/home/ai/miniconda3/envs/py310_ppocr/lib/python3.10/site-packages/paddle/../nvidia/curand/lib', default_value='')
FLAGS(name='FLAGS_cupti_dir', current_value='/home/ai/miniconda3/envs/py310_ppocr/lib/python3.10/site-packages/paddle/../nvidia/cuda_cupti/lib', default_value='')
FLAGS(name='FLAGS_nvidia_package_dir', current_value='/home/ai/miniconda3/envs/py310_ppocr/lib/python3.10/site-packages/paddle/../nvidia', default_value='')
FLAGS(name='FLAGS_cusolver_dir', current_value='/home/ai/miniconda3/envs/py310_ppocr/lib/python3.10/site-packages/paddle/../nvidia/cusolver/lib', default_value='')
FLAGS(name='FLAGS_cudnn_dir', current_value='/home/ai/miniconda3/envs/py310_ppocr/lib/python3.10/site-packages/paddle/../nvidia/cudnn/lib', default_value='')
=======================================================================
I0926 09:28:11.714093 14571 tcp_utils.cc:181] The server starts to listen on IP_ANY:6070
I0926 09:28:11.714478 14571 tcp_utils.cc:130] Successfully connected to 192.168.8.134:6070
I0926 09:28:11.715672 14571 process_group_nccl.cc:150] ProcessGroupNCCL pg_timeout_ 1800000
I0926 09:28:11.715679 14571 process_group_nccl.cc:151] ProcessGroupNCCL nccl_comm_init_option_ 0
[2024/09/26 09:28:11] ppocr INFO: Initialize indexs of datasets:['./datasets/hw_english_word_dictation-rec/train_add_val_test.txt']
[2024/09/26 09:28:12] ppocr INFO: Initialize indexs of datasets:['./datasets/hw_english_word_dictation-rec/train_add_val_test.txt']
W0926 09:28:12.378768 14571 gpu_resources.cc:119] Please NOTE: device: 3, GPU Compute Capability: 8.6, Driver API Version: 12.5, Runtime API Version: 12.3
W0926 09:28:12.379539 14571 gpu_resources.cc:164] device: 3, cuDNN Version: 9.0.
[2024/09/26 09:28:12] ppocr INFO: train dataloader has 1954 iters
[2024/09/26 09:28:12] ppocr INFO: valid dataloader has 3907 iters
[2024/09/26 09:28:12] ppocr INFO: train from scratch
hf-13f-gpu-131:14571:14571 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to enp96s0f0
hf-13f-gpu-131:14571:14571 [3] NCCL INFO Bootstrap : Using enp96s0f0:192.168.8.131<0>
hf-13f-gpu-131:14571:14571 [3] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
hf-13f-gpu-131:14571:14571 [3] NCCL INFO cudaDriverVersion 12050
NCCL version 2.19.3+cuda12.3
hf-13f-gpu-131:14571:14710 [3] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
hf-13f-gpu-131:14571:14710 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to enp96s0f0
hf-13f-gpu-131:14571:14710 [3] NCCL INFO NET/Socket : Using [0]enp96s0f0:192.168.8.131<0>
hf-13f-gpu-131:14571:14710 [3] NCCL INFO Using non-device net plugin version 0
hf-13f-gpu-131:14571:14710 [3] NCCL INFO Using network Socket
@lijialin03
Contributor

Hello, we are contacting the relevant colleagues to reproduce this. Thank you for the feedback!

@xlg-go
Author

xlg-go commented Sep 26, 2024

> Hello, we are contacting the relevant colleagues to reproduce this. Thank you for the feedback!

Thank you very much....

@tianhaodongbd
Contributor

Hi, could you try the following launch command:

python -m paddle.distributed.launch --ips="192.168.8.134,192.168.8.131" --host 192.168.8.134 --nnodes 2 --master 192.168.8.134:55555 --gpus 3 tools/train.py -c ./DIY/configs/rec/rec_svtrnet-hw_english_word.yml
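
The launch logs in the report print nnodes: 1 and master: None on both machines even though ips lists two addresses, whereas the command above sets --nnodes 2 and --master explicitly. A minimal sketch for pulling those values out of a saved launcher log so they can be compared before and after the change. The helper names `parse_launch_config` and `node_count_mismatch` are hypothetical, the regular expression assumes the LAUNCH INFO line layout shown in this issue, and the thread does not confirm that this mismatch is what causes the hang:

```python
import re

# Matches lines such as "LAUNCH INFO 2024-09-26 09:28:01,525 nnodes: 1".
_LAUNCH_KV = re.compile(r"^LAUNCH INFO \S+ \S+ (\w+):\s*(.*)$")


def parse_launch_config(log_text: str) -> dict:
    """Collect the key/value pairs printed in the launcher's Configuration block."""
    cfg = {}
    for line in log_text.splitlines():
        match = _LAUNCH_KV.match(line.strip())
        if match:
            cfg[match.group(1)] = match.group(2).strip()
    return cfg


def node_count_mismatch(cfg: dict) -> bool:
    """Return True when the number of addresses in `ips` differs from `nnodes`."""
    ips = [ip for ip in cfg.get("ips", "").split(",") if ip]
    # `nnodes` may be a range such as "2:3"; only the lower bound is compared.
    nnodes = int(cfg.get("nnodes", "1").split(":")[0])
    return len(ips) != nnodes
```

Run against the master log from the report, `parse_launch_config` would yield `nnodes` as `"1"` and `master` as `"None"`, and `node_count_mismatch` would return True. After switching to the suggested command, the same check can confirm that the launcher picked up the new values.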
