
Question about text condition embedding shape of musicgen-melody in training phase #482

Open
Lonian6 opened this issue Jul 29, 2024 · 0 comments

Lonian6 commented Jul 29, 2024

Hello, I have a question about training the musicgen-melody model.
The text condition controls the result by concatenating the text embedding in front of the input sequence.
I tried printing the return value of the ConditionFuser (model.lm.fuser) in the snippet below. The Length of the "input_" tensor, with shape (Batch, Length, 1536), seems to change depending on the maximum text-embedding length within a batch.

Is the Length variable during training as well? If not, what is the prefix length of the text embedding during training?

import torch
from audiocraft.models import MusicGen
from audiocraft.modules.conditioners import ClassifierFreeGuidanceDropout

model = MusicGen.get_pretrained('facebook/musicgen-melody')
model.set_generation_params(duration=8)

input_text = ['text_1', 'text_2', 'text_3']

attributes, prompt_tokens = model._prepare_tokens_and_attributes(input_text, None)
conditions = attributes
# Prepare the unconditional branch for classifier-free guidance (CFG).
null_conditions = ClassifierFreeGuidanceDropout(p=1.0)(conditions)
if conditions:
    conditions = conditions + null_conditions
    tokenized = model.lm.condition_provider.tokenize(conditions)
    cfg_conditions = model.lm.condition_provider(tokenized)

# Empty prompt: one row per text, then doubled along the batch dimension
# to match the CFG-doubled conditions above.
prompt = torch.zeros((len(input_text), 4, 0), dtype=torch.long, device=model.device)
prompt = torch.cat([prompt, prompt], dim=0)

# Sum the embeddings of the four codebooks, then fuse with the text conditions.
input_ = sum([model.lm.emb[k](prompt[:, k]) for k in range(4)])
input_, cross_attention_input = model.lm.fuser(input_, cfg_conditions)
print(input_.shape)
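For reference, the "prepend" fusing I am describing can be sketched as below. This is a minimal illustration, not audiocraft's actual implementation: the `prepend_fuse` helper and the example lengths are assumptions. Text embeddings are padded to the longest sequence in the batch before being concatenated in front of the token embeddings, which is why the fused Length varies from batch to batch.

```python
import torch

def prepend_fuse(text_emb: torch.Tensor, token_emb: torch.Tensor) -> torch.Tensor:
    """Hypothetical prefix fusing.

    text_emb:  (B, T_text, D), padded to the batch's longest text.
    token_emb: (B, T_tokens, D), embeddings of the codebook tokens.
    Returns (B, T_text + T_tokens, D).
    """
    return torch.cat([text_emb, token_emb], dim=1)

# Two batches whose longest tokenized text differs: the fused length differs too.
D = 1536
short_batch = prepend_fuse(torch.zeros(2, 5, D), torch.zeros(2, 50, D))
long_batch = prepend_fuse(torch.zeros(2, 12, D), torch.zeros(2, 50, D))
print(short_batch.shape)  # torch.Size([2, 55, 1536])
print(long_batch.shape)   # torch.Size([2, 62, 1536])
```

Under this reading, a fixed prefix length during training would require padding or truncating the text embeddings to a fixed number of tokens.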