Ambiguity in MusicGen architecture #468
If I may ask a similar question about the architecture: `prev_offset:offset` is always a slice of length 1. Is this because of what the authors claim in the original MusicGen paper, i.e. modelling the different codebooks as conditionally independent (which MusicGen-MMD later tries to improve on)? Or am I misreading the architecture? The only difference I can see between consecutive token generations is the change in positional embedding from one token to the next (and, obviously, the codebook indices/embeddings corresponding to the previous token). In other words, if, let's say, the 13th token is …

Any clarification about the generation process would be highly appreciated!
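For what it's worth, here is a toy sketch of how I currently understand the generation loop (this is my own simplified reconstruction, not the actual Audiocraft code; `toy_lm`, the table shapes, and the dimensions are all made up for illustration). It shows why `prev_offset:offset` would always span a single timestep: generation is strictly autoregressive over time, and all K codebooks for a timestep are predicted from the same hidden state, where only the positional embedding and the previous token's codebook embeddings change between steps.

```python
import numpy as np

K = 4          # number of codebooks (MusicGen uses 4)
T = 8          # timesteps to generate
vocab = 16     # toy vocabulary size per codebook
dim = 32       # toy model dimension
rng = np.random.default_rng(0)

# Toy embedding tables: one per codebook, plus a positional table.
code_emb = rng.normal(size=(K, vocab, dim))
pos_emb = rng.normal(size=(T, dim))
W = rng.normal(size=(dim, K * vocab))

def toy_lm(hidden):
    """Stand-in for the transformer: map the summed input embedding
    to one set of logits per codebook."""
    return (hidden @ W).reshape(K, vocab)

tokens = np.zeros((K, T), dtype=int)
for t in range(T):
    # Input at step t: the sum of the K codebook embeddings of the
    # PREVIOUS token plus the positional embedding for t -- the only
    # two things that differ between consecutive steps.
    if t == 0:
        prev = np.zeros(dim)
    else:
        prev = sum(code_emb[k, tokens[k, t - 1]] for k in range(K))
    hidden = prev + pos_emb[t]
    logits = toy_lm(hidden)            # shape (K, vocab)
    tokens[:, t] = logits.argmax(-1)   # all K codebooks decoded at once

print(tokens.shape)  # (4, 8): one timestep at a time, K codebooks per step
```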
I have 3 discrepancies between what is described in the paper and what I see in the code/blog posts:
1. The recent MusicGen-MMD publication includes a figure showing a concatenation operation between the audio embeddings and the output of the cross-attention. I cannot find this operation in the code for the LM.
2. There is no linear layer after the cross-attention block that I can see in the code.
3. The config for the small model calls for 24 layers, dim 1024, 16 heads, which, when initialized, gives ~420M parameters. Is the config incorrect?
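On point 3, here is my own back-of-the-envelope parameter count for the small config (24 layers, dim 1024), ignoring layer norms and biases and assuming a standard decoder block with self-attention, cross-attention, and a 4x-expansion FFN, plus 4 codebook embedding tables and output heads of cardinality 2048 (EnCodec's codebook size). This is an estimate, not an official breakdown, but it lands very close to the ~420M I observe:

```python
d = 1024          # model dimension
layers = 24       # transformer decoder layers
codebooks = 4     # parallel codebooks
card = 2048       # entries per codebook (assumed EnCodec cardinality)

self_attn = 4 * d * d          # Wq, Wk, Wv, Wo
cross_attn = 4 * d * d         # same projection shapes, over conditioning
ffn = 2 * d * (4 * d)          # up- and down-projection, 4x expansion
per_layer = self_attn + cross_attn + ffn

emb = codebooks * card * d     # input embedding tables
heads = codebooks * d * card   # per-codebook output projections

total = layers * per_layer + emb + heads
print(f"{total / 1e6:.1f}M")   # ~419.4M, i.e. roughly the ~420M observed
```

So the ~420M figure seems consistent with the stated config once cross-attention and the per-codebook embeddings/heads are counted.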
Thanks!