Distributed multi-machine CPU training hangs when Gloo is enabled (it does not hang when Gloo is disabled). What is Gloo used for here? The code is as follows:

    class Main(object):
        def __init__(self, config):
            self.config = config

        def run(self):
            self.init_fleet()
            self.init_network()
            if fleet.is_server():
                self.run_server()
            elif fleet.is_worker():
                self.run_online_worker()
            logger.info("Run Success, Exit.")

        def init_fleet(self):
            #fleet.init()
            os.environ["PADDLE_WITH_GLOO"] = "1"
            role = role_maker.PaddleCloudRoleMaker()
            fleet.init(role)

Worker log:

    server not ready, wait 3 sec to retry...
    not ready endpoints:['10.60.174.62:43747']

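The worker log above reports an endpoint that never becomes ready. One quick way to narrow this down is to test, from a worker machine, whether that host and port accept TCP connections at all. The following is a minimal sketch using only the Python standard library; `parse_endpoint` and `is_reachable` are hypothetical helper names, not part of Paddle, and a successful connection only shows that something is listening at that address, not that it is the expected server.

```python
import socket


def parse_endpoint(endpoint):
    """Split an 'ip:port' string (the format seen in the worker log) into (host, port)."""
    host, _, port = endpoint.rpartition(":")
    return host, int(port)


def is_reachable(endpoint, timeout=3.0):
    """Return True if a TCP connection to the endpoint can be opened within `timeout` seconds."""
    host, port = parse_endpoint(endpoint)
    try:
        # create_connection raises OSError on refusal, timeout, or an unreachable host.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, calling `is_reachable("10.60.174.62:43747")` on a worker node would show whether the endpoint from the log can be reached from that machine.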
Server log:

    fl-ps > coordinator address is null!
    Gloo init with HTTP: need_init_all: False, args: {'http.host': '10.60.174.62', 'http.port': '52503', 'store.prefix': '', 'start_http_server': True, 'http_server_d': <DictProxy object, typeid 'dict' at 0x7fda1adf2160>}
    to start http_server
    worker_key:_worker, size: {'_worker': 10}
    start http_server: 52503, {'_worker': 10}

Launch command:

    /data/miniconda3/envs/py36/bin/python3.6 -m paddle.distributed.launch \
        --server_num=1 --worker_num=10 \
        --servers=10.62.88.103:4425 \
        --workers=10.62.86.154:4426,10.62.70.134:4426,10.60.150.6:4426,10.62.64.49:4426,10.60.172.0:4426,10.60.147.70:4426,10.62.81.110:4426,10.62.98.74:4426,10.60.163.132:4426,10.62.77.233:4426 \
        /usr/local/train.py

Version: 2.4
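Resolved. The key is to set the variables below manually so that both sides line up. Otherwise a value is picked at random, and the PS and the workers end up misaligned.

    export PADDLE_WITH_GLOO=1
    export FLAGS_START_PORT=5678
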
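If the job is started from a Python wrapper rather than a shell, the same two variables can be pinned in the environment handed to the launcher process. This is only a sketch under assumptions: `pinned_launch_env` is a hypothetical helper, the default port 5678 is simply the value from the fix above, and it assumes the variables must be present before `paddle.distributed.launch` starts, as they are with `export`.

```python
import os


def pinned_launch_env(start_port=5678, base_env=None):
    """Return a copy of the environment with the two variables from the fix set explicitly."""
    # Copy so the caller's environment (or os.environ) is left unmodified.
    env = dict(os.environ if base_env is None else base_env)
    env["PADDLE_WITH_GLOO"] = "1"
    # Environment values must be strings.
    env["FLAGS_START_PORT"] = str(start_port)
    return env
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when invoking the launch command, using the same `start_port` on every machine so that no node falls back to a randomly chosen value.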