
Multimodal use of uvcgan2? #30

Open
PerrinAntonin opened this issue Apr 23, 2024 · 3 comments
Labels: question (Further information is requested)

@PerrinAntonin

Hello,

Congratulations on this very successful project! I wanted to ask you: do you think a multimodal use of uvcgan2 is possible, in order to use it like MUNIT, where different images can be generated from a single reference image? In MUNIT, you simply pick a different style vector for each new generation, but in UVCGAN it is the ViT that generates it, and I was wondering how to play with that.

Sincerely,
Antonin

usert5432 self-assigned this Apr 23, 2024
usert5432 added the question label Apr 23, 2024
@usert5432
Collaborator

Hello @PerrinAntonin,

Thank you for your interest in our work.

> I wanted to ask you: do you think a multimodal use of uvcgan2 is possible ...
> ... and I was wondering how to play with that.

I actually thought about this a bit. The short answer is that currently it is not possible.

I think, in principle, one can modify the generator architecture a bit to expose its style to the user. Then, one can implement a custom training setup following the MUNIT or DRIT examples. If done correctly, I believe everything will work and make UVCGAN multimodal. All the modifications are rather straightforward, but they will take some time to implement and debug, and, unfortunately, we do not have the resources to explore them at the moment.
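
To make the idea concrete, here is a minimal sketch of what "exposing the style" could look like, in the spirit of MUNIT/DRIT. This is not the actual uvcgan2 code: `content_encoder`, `decoder`, and `style_dim` are placeholders standing in for the existing pieces.

```python
import torch
from torch import nn

class StyleConditionedGenerator(nn.Module):
    """Illustrative generator whose style input is exposed to the caller.

    `content_encoder` and `decoder` are placeholders for the existing
    UNet/ViT components; this is a sketch, not the uvcgan2 architecture.
    """

    def __init__(self, content_encoder, decoder, style_dim=8):
        super().__init__()
        self.content_encoder = content_encoder
        self.decoder         = decoder
        self.style_dim       = style_dim

    def forward(self, image, style=None):
        content = self.content_encoder(image)

        if style is None:
            # Sample a random style vector, as MUNIT does during training.
            style = torch.randn(
                image.size(0), self.style_dim, device=image.device
            )

        return self.decoder(content, style)
```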

@PerrinAntonin
Author

Hi @usert5432

Thank you for your quick reply!
OK, I can see that it looks implementable, but it needs a bit of time. To avoid this problem, wouldn't it be possible to take the style vector produced for one image at the output of the ViT and to reinject that vector for the reconstruction of another image? But I have the impression that there isn't really a loss tied to this style token, so it will depend mainly on the image supplied the first time and won't work on another.

I also saw that you set different learning rates for the discriminator and the generator. If the generator's rate is smaller, is it because the generator learns too quickly compared to the discriminator?

@usert5432
Collaborator

> To avoid this problem, wouldn't it be possible to take the style vector produced for one image at the output of the ViT and to reinject that vector for the reconstruction of another image? But I have the impression that there isn't really a loss tied to this style token, so it will depend mainly on the image supplied the first time and won't work on another.

I cannot say definitively, since it is more of an empirical question, but my intuition matches yours. Currently, UVCGAN is not trained to work correctly with mismatching styles, so I would expect it to break if some unexpected style is substituted.
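
For anyone who wants to test this empirically, a style-swap experiment might look like the sketch below. The `encode`/`decode` interface is hypothetical; uvcgan2 does not currently expose one.

```python
import torch

@torch.no_grad()
def swap_style(gen, image_a, image_b):
    """Hypothetical style-swap: keep A's content, borrow B's style.

    `gen.encode`/`gen.decode` are assumed methods that split a generator
    into content/style extraction and reconstruction; they do not exist
    in uvcgan2 today.
    """
    content_a, _ = gen.encode(image_a)   # keep A's content
    _, style_b   = gen.encode(image_b)   # borrow B's style

    # Without a training loss tied to the style token, the output may
    # simply ignore, or break on, the mismatched style.
    return gen.decode(content_a, style_b)
```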

> I also saw that you set different learning rates for the discriminator and the generator. If the generator's rate is smaller, is it because the generator learns too quickly compared to the discriminator?

Yes, that is my working hypothesis, although I am not sure it is 100% correct.
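
For reference, setting separate learning rates amounts to building two optimizers, one per network (a two time-scale update rule, TTUR). The modules and values below are placeholders, not the actual uvcgan2 configuration.

```python
import torch
from torch import nn

# Placeholder networks; the real ones are the uvcgan2 generator and
# discriminator.
generator     = nn.Sequential(nn.Conv2d(3, 3, 3, padding=1))
discriminator = nn.Sequential(nn.Conv2d(3, 1, 3, padding=1))

# Smaller learning rate for the generator, larger for the discriminator.
opt_gen  = torch.optim.Adam(generator.parameters(),     lr=1.0e-4, betas=(0.5, 0.999))
opt_disc = torch.optim.Adam(discriminator.parameters(), lr=4.0e-4, betas=(0.5, 0.999))
```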
