CPU distributed sparse_embedding error #68421

Open
welsonzhang opened this issue Sep 24, 2024 · 5 comments

@welsonzhang

Please ask your question

The Paddle version is 2.4.

The exact error message:
F0924 19:29:27.496166 884 fleet.cc:680] Check failed: static_cast<int64_t>(output_len) == g_tensor->numel()
where:
I0924 19:37:21.016360 3043 fleet.cc:683] output_len:498, g_tensor->numel():512
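
For reference, the size of the mismatch can be read straight off the second log line. The snippet below is only a minimal sketch for extracting the two numbers; `LOG_LINES`, `PATTERN` and `parse_mismatch` are hypothetical names introduced here, not part of Paddle.

```python
import re

# The two glog lines quoted above, copied verbatim.
LOG_LINES = [
    "F0924 19:29:27.496166 884 fleet.cc:680] Check failed: "
    "static_cast<int64_t>(output_len) == g_tensor->numel()",
    "I0924 19:37:21.016360 3043 fleet.cc:683] output_len:498, g_tensor->numel():512",
]

# Matches only the line that prints both values.
PATTERN = re.compile(r"output_len:(\d+), g_tensor->numel\(\):(\d+)")


def parse_mismatch(lines):
    """Return (output_len, numel) from the first matching line, or None."""
    for line in lines:
        match = PATTERN.search(line)
        if match:
            return int(match.group(1)), int(match.group(2))
    return None


output_len, numel = parse_mismatch(LOG_LINES)
print(f"output_len={output_len}, numel={numel}, difference={numel - output_len}")
```

The gradient tensor holds 14 more elements than the length counted on the push side. The log alone does not say which of the two values is the wrong one.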

Is there a problem with Paddle's output shape inference?

The implementation:

    def create_feeds(self):
        TEXT2TF = self.params['TEXT2TF']
        # label; note: clk and show must not be used
        self.clk = paddle.static.data(name="click", shape=[None, 1], dtype="float32", lod_level=1)
        
        # single_slots
        self.single_slots = []
        self.single_slots_num = len(self.params['SLOTSA'])
        self.single_slots_dim = 1
        for i in range(self.single_slots_num):
            self.single_slots.append(paddle.static.data(name="single_slots" + str(i), shape=[None, self.single_slots_dim], dtype="int64", lod_level=1))
        # rv       
        self.rv_dim = sum([self.params['TEXT2TF'][slot_id][0] for slot_id in self.params['RV_A_SLOTS']])
        self.rv = paddle.static.data(name="rv", shape=[None, self.rv_dim], dtype="float32", lod_level=1)
        
        feed_list = []
        feed_list.append(self.clk)
        feed_list.extend(self.single_slots)
        #feed_list.append(self.single_slots)
        feed_list.append(self.rv)

        return feed_list

    def net(self, is_infer=False):
        embedding_size = self.params['EMBEDDING_SIZE']
        
        rv_final = paddle.maximum(paddle.minimum(self.rv, paddle.to_tensor(self.params['RV_BOUND'][1], dtype="float32")), paddle.to_tensor(self.params['RV_BOUND'][0], dtype="float32"))
        labels = [self.clk]
        
        self.show = paddle.ones_like(self.clk)
        entry = paddle.distributed.ShowClickEntry(self.show.name, self.clk.name)
        
        s_emb_array = []
        for slot in self.single_slots:
            s_emb_array.append( paddle.static.nn.sparse_embedding(
                    input=paddle.reshape(slot, [-1, 1]),
                    size=[1000000, embedding_size],
                    padding_idx=0,
                    entry=entry,
                    param_attr=paddle.ParamAttr(name="embedding")))
            
        s_emb_0 = paddle.concat(s_emb_array, axis =-1)
        self.s_emb = paddle.reshape(s_emb_0, [-1, self.single_slots_dim * self.single_slots_num * embedding_size])
        input = paddle.concat([self.s_emb, rv_final], axis =-1)
        
        lr = LRLayer(self.rv_dim + self.single_slots_dim * self.single_slots_num * embedding_size)
        y = lr.forward(input)
        q = paddle.nn.functional.sigmoid(y)
        
        clk_int64 = paddle.cast(self.clk, dtype='int64')
        self.auc, batch_auc, [batch_stat_pos, batch_stat_neg, self.stat_pos, self.stat_neg] = \
            paddle.static.auc(input=q, label=clk_int64, slide_steps=0)
        self.cost = paddle.sum(paddle.nn.functional.binary_cross_entropy_with_logits(logit=y, label=labels[0]))
        
        self.infer_input_var['rv'] = self.rv
        self.infer_input_var['s_emb'] = self.s_emb
        
        self.infer_output_var['clk'] = q

@wangguan1995

Could you provide the complete error message? I would like to know exactly where the error is raised.

@wangguan1995

Also, please add your device information, CUDA information, and the output printed by paddle.utils.run_check(). Thanks.

@welsonzhang
Author

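I am running in cpups (CPU parameter server) mode. The package installed with pip install paddlepaddle==2.4.0 runs fine, but the build I compiled myself fails. The full error output: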

F0925 12:27:50.095330 23095 fleet.cc:679] Check failed: static_cast<int64_t>(output_len) == g_tensor->numel() 
*** Check failure stack trace: ***
    @     0x7fe9157c2a3d  google::LogMessage::Fail()
    @     0x7fe9157c4dc5  google::LogMessage::SendToLog()
    @     0x7fe9157c252d  google::LogMessage::Flush()
    @     0x7fe9157c5939  google::LogMessageFatal::~LogMessageFatal()
    @     0x7fe915b80395  paddle::distributed::FleetWrapper::PushSparseFromTensorAsync()
    @     0x7fe91705bd45  paddle::operators::DistributedPushSparseKernel<>::Compute()
    @     0x7fe91705c5cf  _ZNSt17_Function_handlerIFvRKN6paddle9framework16ExecutionContextEEZNKS1_24OpKernelRegistrarFunctorIN3phi8CPUPlaceELb0ELm0EJNS0_9operators27DistributedPushSparseKernelINS7_10CPUContextEfEENSA_ISB_dEEEEclEPKcSG_iEUlS4_E_E9_M_invokeERKSt9_Any_dataS4_
    @     0x7fe918da11cf  paddle::framework::OperatorWithKernel::RunImpl()
    @     0x7fe918da26d3  paddle::framework::OperatorWithKernel::RunImpl()
    @     0x7fe918d9164c  paddle::framework::OperatorBase::Run()
    @     0x7fe918621b72  paddle::framework::details::ComputationOpHandle::RunImpl()
    @     0x7fe91878164f  paddle::framework::details::OpHandleBase::Run()
    @     0x7fe9184d170e  paddle::framework::details::ThreadedSSAGraphExecutor::RunOpSync()
    @     0x7fe9184cbc84  paddle::framework::details::ThreadedSSAGraphExecutor::RunTracedOps()
    @     0x7fe9184d3e2f  paddle::framework::details::ThreadedSSAGraphExecutor::RunImpl()
    @     0x7fe9184d0e6f  paddle::framework::details::ThreadedSSAGraphExecutor::Run()
    @     0x7fe9184c8fa6  paddle::framework::details::AsyncSSAGraphExecutor::Run()
    @     0x7fe9158eefcb  paddle::framework::ParallelExecutor::RunAndMerge()
    @     0x7fe91552e24d  _ZZN8pybind1112cpp_function10initializeIZN6paddle6pybind20BindParallelExecutorERNS_6moduleEEUlRNS2_9framework16ParallelExecutorERKSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaISF_EEbE83_NS_6objectEJS8_SJ_bEJNS_4nameENS_9is_methodENS_7siblingEEEEvOT_PFT0_DpT1_EDpRKT2_ENUlRNS_6detail13function_callEE1_4_FUNES12_
    @     0x7fe91523d322  pybind11::cpp_function::dispatcher()
    @     0x56288d6d4562  _PyCFunction_FastCallDict
    @     0x56288d700135  call_function
    @     0x56288d71d39f  _PyEval_EvalFrameDefault
    @     0x56288d6ad160  _PyEval_EvalCodeWithName
    @     0x56288d6ea41b  fast_function
    @     0x56288d7000f7  call_function
    @     0x56288d71ea1e  _PyEval_EvalFrameDefault
    @     0x56288d6ad160  _PyEval_EvalCodeWithName
    @     0x56288d6ea41b  fast_function
    @     0x56288d7000f7  call_function
    @     0x56288d71ea1e  _PyEval_EvalFrameDefault
    @     0x56288d6ad160  _PyEval_EvalCodeWithName

@welsonzhang
Author

The commands used to build Paddle:
git checkout -b 2.4 origin/release/2.4

cmake .. -DCMAKE_BUILD_TYPE=Release -DWITH_PYTHON=ON -DWITH_MKL=ON -DWITH_GPU=OFF -DWITH_INFERENCE_API_TEST=OFF -DWITH_FLUID_ONLY=ON -DPY_VERSION=3.6 -DWITH_DISTRIBUTE=ON -DWITH_PSLIB=OFF -DWITH_PSLIB_BRPC=OFF -DWITH_PSCORE=ON -WITH_AVX=ON -DWITH_TESTING=OFF -DWITH_MKL=ON

make -j 32

@welsonzhang
Author

Is there something wrong with my build command?
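
One thing that can be checked mechanically: CMake sets cache variables with `-D<var>=<value>`, so an argument such as `-WITH_AVX=ON` (no `D`) is not parsed as a definition. The sketch below scans the command quoted above for arguments of that shape and for variables defined more than once. `check_cmake_args` is a hypothetical helper written for this thread, and whether either finding is related to the `fleet.cc` check failure has not been established here.

```python
import re
import shlex
from collections import Counter

# The cmake command from the comment above, copied verbatim.
command = (
    "cmake .. -DCMAKE_BUILD_TYPE=Release -DWITH_PYTHON=ON -DWITH_MKL=ON "
    "-DWITH_GPU=OFF -DWITH_INFERENCE_API_TEST=OFF -DWITH_FLUID_ONLY=ON "
    "-DPY_VERSION=3.6 -DWITH_DISTRIBUTE=ON -DWITH_PSLIB=OFF "
    "-DWITH_PSLIB_BRPC=OFF -DWITH_PSCORE=ON -WITH_AVX=ON -DWITH_TESTING=OFF "
    "-DWITH_MKL=ON"
)

# Looks like VAR=value in upper case but is missing the -D prefix.
MISSING_D = re.compile(r"^-(?!D)[A-Z][A-Z0-9_]*=")


def check_cmake_args(cmd):
    """Return (suspicious_args, duplicated_variable_names) for a cmake command."""
    args = shlex.split(cmd)[1:]  # drop the leading "cmake"
    suspicious = [a for a in args if MISSING_D.match(a)]
    names = Counter(a[2:].split("=", 1)[0] for a in args if a.startswith("-D"))
    duplicated = sorted(name for name, count in names.items() if count > 1)
    return suspicious, duplicated


suspicious, duplicated = check_cmake_args(command)
print("missing -D prefix:", suspicious)
print("defined more than once:", duplicated)
```

For the command above this flags `-WITH_AVX=ON` and reports `WITH_MKL` as defined twice. The repeated `-DWITH_MKL=ON` uses the same value both times, so it is redundant rather than conflicting.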
