
Model is launched on GPU cards 2 and 3, but it keeps failing at runtime with errors that GPU 0 has no free resources #2405

Open
SDAIer opened this issue Oct 9, 2024 · 1 comment
SDAIer commented Oct 9, 2024

System Info

NVIDIA-SMI 535.183.06 Driver Version: 535.183.06 CUDA Version: 12.2
Linux

Running Xinference with Docker?

  - [x] docker
  - [ ] pip install
  - [ ] installation from source

Version info

0.15.2

The command used to start Xinference

docker run
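
The full command was not posted. For context, here is a minimal sketch of a launch that pins the container to physical GPUs 2 and 3; the image name, port, and command below are assumptions based on the standard Xinference Docker setup, not details taken from this issue:

```bash
# Hypothetical launch sketch: expose only physical GPUs 2 and 3 to the
# container. Inside it they are renumbered as logical devices 0 and 1,
# so nothing can accidentally land on physical GPU 0.
docker run -d \
  --gpus '"device=2,3"' \
  -p 9997:9997 \
  xprobe/xinference:latest \
  xinference-local -H 0.0.0.0
```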

Reproduction

[Four screenshots from the original issue omitted]

The full log is as follows:

[Tail of the prompt, translated from Chinese: a truncated excerpt of a semi-annual report ("... gains and losses arising from transactions and events relevant to a correct judgment of operating performance and profitability."), followed by a weighted-average ROE and EPS table:]

II. Return on net assets and earnings per share (weighted average)

| | ROE (%), H1 2024 | ROE (%), H1 2023 | Basic EPS, H1 2024 | Basic EPS, H1 2023 | Diluted EPS, H1 2024 | Diluted EPS, H1 2023 |
| --- | --- | --- | --- | --- | --- | --- |
| Net profit attributable to ordinary shareholders | 7.16% | 7.08% | 0.68 | 0.65 | 0.68 | 0.65 |
| Net profit attributable to ordinary shareholders after deducting non-recurring gains and losses | 7.02% | 7.01% | 0.67 | 0.64 | 0.67 | 0.64 |


<|im_end|>
<|im_start|>user
总结<|im_end|>
<|im_start|>assistant
, generate config: {'echo': False, 'max_tokens': 200, 'repetition_penalty': 1.1, 'stop': ['<|endoftext|>', '<|im_start|>', '<|im_end|>'], 'stop_token_ids': [151643, 151644, 151645], 'stream': True, 'stream_options': {'include_usage': False}, 'stream_interval': 2, 'temperature': 0.01, 'top_p': 0.95, 'top_k': 40, 'lora_name': None, 'request_id': None, 'model': 'qwen2.5-instruct'}
2024-10-08 20:51:53,189 xinference.core.model 5009 DEBUG [request d22a7aac-85f1-11ef-91a4-0242ac110004] Leave chat, elapsed time: 0 s
2024-10-08 20:51:53,190 xinference.core.model 5009 DEBUG After request chat, current serve request count: 0 for the model qwen2.5-instruct
2024-10-08 20:51:53,405 transformers.models.qwen2.modeling_qwen2 5009 WARNING We detected that you are passing past_key_values as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate Cache class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)
2024-10-08 20:51:58,223 xinference.core.model 5009 ERROR Model actor is out of memory, model id: qwen2.5-instruct
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/xinference/core/model.py", line 332, in _to_generator
for v in gen:
File "/usr/local/lib/python3.10/dist-packages/xinference/model/llm/utils.py", line 255, in _to_chat_completion_chunks
for i, chunk in enumerate(chunks):
File "/usr/local/lib/python3.10/dist-packages/xinference/model/llm/transformers/core.py", line 356, in generator_wrapper
for completion_chunk, completion_usage in generate_stream(
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 36, in generator_context
response = gen.send(None)
File "/usr/local/lib/python3.10/dist-packages/xinference/model/llm/transformers/utils.py", line 178, in generate_stream
out = model(torch.as_tensor([input_ids], device=device), use_cache=True)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 170, in new_forward
output = module._old_forward(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/qwen2/modeling_qwen2.py", line 1119, in forward
logits = logits.float()
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 18.43 GiB. GPU 0 has a total capacity of 23.50 GiB of which 9.36 GiB is free. Process 36705 has 14.13 GiB memory in use. Of the allocated memory 13.28 GiB is allocated by PyTorch, and 587.42 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2024-10-08 20:51:58,614 xinference.api.restful_api 1 ERROR Chat completion stream got an error: Remote server 0.0.0.0:37128 closed
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/xinference/api/restful_api.py", line 1899, in stream_results
async for item in iterator:
File "/usr/local/lib/python3.10/dist-packages/xoscar/api.py", line 340, in anext
return await self._actor_ref.xoscar_next(self._uid)
File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/context.py", line 230, in send
result = await self._wait(future, actor_ref.address, send_message) # type: ignore
File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/context.py", line 115, in _wait
return await future
File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/core.py", line 84, in _listen
raise ServerClosed(
xoscar.errors.ServerClosed: Remote server 0.0.0.0:37128 closed
2024-10-08 20:51:59,101 xinference.core.worker 140 WARNING Process 0.0.0.0:37128 is down.
2024-10-08 20:51:59,112 xinference.core.worker 140 INFO [request d5ba02a0-85f1-11ef-b572-0242ac110004] Enter terminate_model, args: <xinference.core.worker.WorkerActor object at 0x7f0105fdfdd0>,qwen2.5-instruct-1-0, kwargs: is_model_die=True
2024-10-08 20:51:59,113 xinference.core.worker 140 DEBUG Destroy model actor failed, model uid: qwen2.5-instruct-1-0, error: [Errno 111] Connection refused
2024-10-08 20:51:59,114 xinference.core.worker 140 DEBUG Remove sub pool failed, model uid: qwen2.5-instruct-1-0, error: '0.0.0.0:37128'
2024-10-08 20:51:59,114 xinference.core.worker 140 INFO [request d5ba02a0-85f1-11ef-b572-0242ac110004] Leave terminate_model, elapsed time: 0 s
2024-10-08 20:51:59,114 xinference.core.worker 140 WARNING Recreating model actor qwen2.5-instruct-1-0 ...
2024-10-08 20:51:59,115 xinference.core.worker 140 INFO [request d5ba6bbe-85f1-11ef-b572-0242ac110004] Enter launch_builtin_model, args: <xinference.core.worker.WorkerActor object at 0x7f0105fdfdd0>, kwargs: model_uid=qwen2.5-instruct-1-0,model_name=qwen2.5-instruct,model_size_in_billions=3,model_format=pytorch,quantization=none,model_engine=Transformers,model_type=LLM,n_gpu=2,peft_model_config=None,request_limits=None,gpu_idx=None,download_hub=None,model_path=None,max_model_len=30000
2024-10-08 20:51:59,117 xinference.core.worker 140 DEBUG GPU selected: [2, 3] for model qwen2.5-instruct-1-0
2024-10-08 20:52:05,452 xinference.model.llm.core 140 DEBUG Launching qwen2.5-instruct-1-0 with PytorchChatModel
2024-10-08 20:52:05,453 xinference.model.llm.llm_family 140 INFO Caching from Modelscope: qwen/Qwen2.5-3B-Instruct
2024-10-08 20:52:05,453 xinference.model.llm.llm_family 140 INFO Cache /root/.xinference/cache/qwen2_5-instruct-pytorch-3b exists
2024-10-08 20:52:05,596 transformers.tokenization_utils_base 5251 INFO loading file vocab.json
2024-10-08 20:52:05,597 transformers.tokenization_utils_base 5251 INFO loading file merges.txt
2024-10-08 20:52:05,597 transformers.tokenization_utils_base 5251 INFO loading file tokenizer.json
2024-10-08 20:52:05,597 transformers.tokenization_utils_base 5251 INFO loading file added_tokens.json
2024-10-08 20:52:05,597 transformers.tokenization_utils_base 5251 INFO loading file special_tokens_map.json
2024-10-08 20:52:05,597 transformers.tokenization_utils_base 5251 INFO loading file tokenizer_config.json
2024-10-08 20:52:05,800 transformers.tokenization_utils_base 5251 INFO Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-10-08 20:52:05,801 transformers.configuration_utils 5251 INFO loading configuration file /root/.xinference/cache/qwen2_5-instruct-pytorch-3b/config.json
2024-10-08 20:52:05,802 transformers.configuration_utils 5251 INFO Model config Qwen2Config {
"_name_or_path": "/root/.xinference/cache/qwen2_5-instruct-pytorch-3b",
"architectures": [
"Qwen2ForCausalLM"
],
"attention_dropout": 0.0,
"bos_token_id": 151643,
"eos_token_id": 151645,
"hidden_act": "silu",
"hidden_size": 2048,
"initializer_range": 0.02,
"intermediate_size": 11008,
"max_position_embeddings": 32768,
"max_window_layers": 70,
"model_type": "qwen2",
"num_attention_heads": 16,
"num_hidden_layers": 36,
"num_key_value_heads": 2,
"rms_norm_eps": 1e-06,
"rope_theta": 1000000.0,
"sliding_window": null,
"tie_word_embeddings": true,
"torch_dtype": "float16",
"transformers_version": "4.44.2",
"use_cache": true,
"use_sliding_window": false,
"vocab_size": 151936
}

2024-10-08 20:52:05,942 transformers.modeling_utils 5251 INFO loading weights file /root/.xinference/cache/qwen2_5-instruct-pytorch-3b/model.safetensors.index.json
2024-10-08 20:52:05,942 transformers.modeling_utils 5251 INFO Instantiating Qwen2ForCausalLM model under default dtype torch.float16.
2024-10-08 20:52:05,943 transformers.generation.configuration_utils 5251 INFO Generate config GenerationConfig {
"bos_token_id": 151643,
"eos_token_id": 151645
}

Loading checkpoint shards: 100%|█████████████████| 2/2 [00:02<00:00, 1.37s/it]
2024-10-08 20:52:09,374 transformers.modeling_utils 5251 INFO All model checkpoint weights were used when initializing Qwen2ForCausalLM.

2024-10-08 20:52:09,374 transformers.modeling_utils 5251 INFO All the weights of Qwen2ForCausalLM were initialized from the model checkpoint at /root/.xinference/cache/qwen2_5-instruct-pytorch-3b.
If your task is similar to the task the model of the checkpoint was trained on, you can already use Qwen2ForCausalLM for predictions without further training.
2024-10-08 20:52:09,378 transformers.generation.configuration_utils 5251 INFO loading configuration file /root/.xinference/cache/qwen2_5-instruct-pytorch-3b/generation_config.json
2024-10-08 20:52:09,378 transformers.generation.configuration_utils 5251 INFO Generate config GenerationConfig {
"bos_token_id": 151643,
"do_sample": true,
"eos_token_id": [
151645,
151643
],
"pad_token_id": 151643,
"repetition_penalty": 1.05,
"temperature": 0.7,
"top_k": 20,
"top_p": 0.8
}

2024-10-08 20:52:09,620 xinference.model.llm.transformers.core 5251 DEBUG Model Memory: 6775866368
2024-10-08 20:52:09,623 xinference.core.worker 140 INFO [request d5ba6bbe-85f1-11ef-b572-0242ac110004] Leave launch_builtin_model, elapsed time: 10 s
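
Two details in this log stand out. The worker reports "GPU selected: [2, 3]", yet the OOM is raised for "GPU 0", and the relaunch kwargs show gpu_idx=None alongside n_gpu=2. PyTorch numbers devices relative to CUDA_VISIBLE_DEVICES: if every host GPU is visible inside the container, "GPU 0" is physical card 0 and the selection did not take effect; if only cards 2 and 3 are visible, "GPU 0" is a logical index that actually means physical card 2. A hedged way to tell the two cases apart from inside the running container (the container name is a placeholder):

```bash
# If this lists every card on the host, the OOM on "GPU 0" refers to
# physical GPU 0, i.e. the model was not confined to GPUs 2 and 3.
docker exec <container> nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv

# Shows the device mask inside the container (unset means no mask).
docker exec <container> printenv CUDA_VISIBLE_DEVICES
```

Note also that the 18.43 GiB allocation fails in logits.float(): float32 logits scale with prompt length times the 151936-token vocabulary, so with max_model_len=30000 the failure is consistent with a genuinely oversized allocation rather than mere fragmentation, even though the OOM message suggests PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True.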

Expected behavior

The model should run on the selected GPUs (2 and 3) and use their resources normally.

@XprobeBot XprobeBot added the gpu label Oct 9, 2024
@XprobeBot XprobeBot added this to the v0.15 milestone Oct 9, 2024

turndown commented Oct 9, 2024

I'm hitting a similar problem: an sd3 model can be pinned to GPU 3 and stays there, but when I set GPU 3 for a flux model it ends up running on GPU 0 instead, which is strange.
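
For either report, a quick way to confirm where the weights actually land is to watch per-process GPU memory on the host while the model loads (this assumes nvidia-smi is available):

```bash
# Poll every 2 seconds. gpu_uuid identifies the physical card each process
# allocates on, independent of any logical device renumbering.
nvidia-smi --query-compute-apps=pid,process_name,used_memory,gpu_uuid --format=csv -l 2
```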
