Distributed CPU training hangs when using gloo #68308

Open
welsonzhang opened this issue Sep 19, 2024 · 3 comments

@welsonzhang

Please ask your question

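I am running distributed multi-machine CPU training. With gloo enabled, startup hangs (without gloo it does not hang). What is gloo actually used for here?
The code is as follows: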

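# Imports and logger are assumed here; the original post omitted them.
import logging
import os

import paddle.distributed.fleet as fleet
import paddle.distributed.fleet.base.role_maker as role_maker

logger = logging.getLogger(__name__)


class Main(object):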
    def __init__(self, config):
        self.config = config
    
    def run(self):
        self.init_fleet()
        self.init_network()
        if fleet.is_server():
            self.run_server()
        elif fleet.is_worker():
            self.run_online_worker()
        logger.info("Run Success, Exit.")
    
    def init_fleet(self):
        #fleet.init()
        os.environ["PADDLE_WITH_GLOO"] = "1"
        role = role_maker.PaddleCloudRoleMaker()
        fleet.init(role)

Worker-side log:
server not ready, wait 3 sec to retry...
not ready endpoints:['10.60.174.62:43747']
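
The worker log above names an endpoint that never becomes ready. A minimal sketch for checking, from a worker machine, whether such an endpoint accepts TCP connections at all. `endpoint_is_open` is a hypothetical helper, not part of Paddle, and a successful connection only shows that something is listening on that port, not that it is the expected server.

```python
import socket


def endpoint_is_open(endpoint, timeout=3.0):
    """Return True if a TCP connection to 'host:port' succeeds within timeout."""
    host, port = endpoint.rsplit(":", 1)
    try:
        # create_connection raises OSError (refused, unreachable, timed out) on failure.
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Endpoint copied from the worker log in this thread.
    print(endpoint_is_open("10.60.174.62:43747"))
```

If this returns False on the worker while the server process is running, the server is probably listening on a different port than the one the worker was told about, or the port is blocked between the machines.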

Server-side log:
fl-ps > coordinator address is null!
Gloo init with HTTP: need_init_all: False, args: {'http.host': '10.60.174.62', 'http.port': '52503', 'store.prefix': '', 'start_http_server': True, 'http_server_d': <DictProxy object, typeid 'dict' at 0x7fda1adf2160>}
to start http_server
worker_key:_worker, size: {'_worker': 10}
start http_server: 52503, {'_worker': 10}

@welsonzhang

Launch command: /data/miniconda3/envs/py36/bin/python3.6 -m paddle.distributed.launch --server_num=1 --worker_num=10 --servers=10.62.88.103:4425 --workers=10.62.86.154:4426,10.62.70.134:4426,10.60.150.6:4426,10.62.64.49:4426,10.60.172.0:4426,10.60.147.70:4426,10.62.81.110:4426,10.62.98.74:4426,10.60.163.132:4426,10.62.77.233:4426 /usr/local/train.py

@welsonzhang

Version: 2.4

@welsonzhang

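Resolved. The key is to set the following variables manually so that they line up across nodes. Otherwise the values are picked at random, and the PS and the workers end up out of sync.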
export PADDLE_WITH_GLOO=1
export FLAGS_START_PORT=5678
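
One way to make sure every node gets the same two values is to build the environment in a small wrapper before invoking the launcher. This is only a sketch: `build_launch_env` is a hypothetical helper, and the thread only confirms that exporting the variables in the shell worked. Whether setting them inside the training script itself would be early enough is not confirmed here.

```python
import os


def build_launch_env(start_port=5678, base_env=None):
    """Return a copy of the environment with the two variables from this thread pinned."""
    env = dict(os.environ if base_env is None else base_env)
    env["PADDLE_WITH_GLOO"] = "1"              # enable gloo, as in the fix above
    env["FLAGS_START_PORT"] = str(start_port)  # fixed value instead of a random one
    return env


if __name__ == "__main__":
    env = build_launch_env()
    print(env["PADDLE_WITH_GLOO"], env["FLAGS_START_PORT"])
```

The returned dictionary can be passed as `env=` to `subprocess.run` together with the launch command from this thread, using the same `start_port` on the server and on every worker.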
