In the Phenaki paper, they downsample the MiT dataset from 25fps to 6fps before video quantization.
So I'm wondering how to get the downsampled video in preprocessing, and whether the input video is downsampled during transformer training and video generation inference.
Even if you don't upload the training and dataloader code for video, I'd appreciate some advice from you, since you must have tried implementing it.
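For reference, a minimal sketch of what the frame-rate downsampling could look like in preprocessing, assuming the clip is already loaded as a frames tensor. `downsample_fps` is a hypothetical helper, and evenly-spaced index subsampling is just one common approach, not necessarily what the paper used:

```python
import torch

def downsample_fps(frames: torch.Tensor, src_fps: float = 25., target_fps: float = 6.) -> torch.Tensor:
    # frames: (time, channels, height, width)
    # pick evenly-spaced frame indices so the kept frames stay uniform in time
    num_frames = frames.shape[0]
    duration = num_frames / src_fps
    num_target = int(duration * target_fps)
    indices = torch.linspace(0, num_frames - 1, num_target).long()
    return frames[indices]
```

With 25fps to 6fps this keeps roughly every fourth frame; since 25/6 is not an integer, evenly-spaced indices avoid drift over long clips compared to a fixed stride.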
One more thing: I have implemented your c-vivit code for reconstruction. After I got feasible outputs, I got bad results at the very next checkpoint iteration, like below (the left one is GT and the right one is the output; I set the checkpoint interval to 3000).
Could I ask what is going wrong? Is it expected that early stopping is required for tokenizer training?
Thank you.
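In case early stopping does turn out to be necessary, a minimal sketch of best-checkpoint tracking might look like the following. Names such as `cvivit.best.pt` and `valid_recon_loss` are placeholders, not from this repo:

```python
import torch

best_loss = float('inf')

def maybe_save_best(model: torch.nn.Module, valid_recon_loss: float, path: str = './cvivit.best.pt'):
    # keep only the checkpoint with the lowest validation reconstruction loss,
    # so a later collapse (as in the screenshots above) does not overwrite a good tokenizer
    global best_loss
    if valid_recon_loss < best_loss:
        best_loss = valid_recon_loss
        torch.save(model.state_dict(), path)
```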
@9B8DY6 the transformer trains on the quantized representation from the cvivit, so the frame rate is the same. it is fine if it is downsampled temporally, as we've seen from numerous papers that temporal upsampling (interpolation) works just fine
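As an illustration of the temporal upsampling mentioned above, a sketch assuming a `(batch, channels, time, height, width)` video tensor; trilinear interpolation along the time axis is one simple option, separate from the learned DDPM upsampling discussed below:

```python
import torch
import torch.nn.functional as F

def upsample_temporally(video: torch.Tensor, factor: int = 4) -> torch.Tensor:
    # video: (batch, channels, time, height, width)
    # interpolate only along the time dimension, leaving spatial resolution untouched
    b, c, t, h, w = video.shape
    return F.interpolate(video, size=(t * factor, h, w), mode='trilinear', align_corners=False)

video = torch.randn(1, 3, 6, 64, 64)    # e.g. 1 second of video at 6fps
upsampled = upsample_temporally(video)  # -> (1, 3, 24, 64, 64), i.e. ~24fps
```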
yea, i'll get some training code down soon for phenaki, as there are a lot of details that are required for stable attention net training (as well as automating the entire adversarial training portion, which may be too complicated for the uninitiated)
@9B8DY6 in yesterday's demo they were doing upsampling with ddpm. we can do this too with imagen-pytorch, once i get the logic for temporal upsampling in place