Determine Feasibility of Koala-36M method for detector #441
Comments
The table of results looks promising and the approach is interesting. They haven't published their code yet (coming soon according to their homepage). Ideally, they would also release a pre-trained SVM model that could simply be imported and used.

### Detection Algorithm

The part of this that I don't understand is how they are using temporal information. Looking through the paper, here is my understanding of how the algorithm works:
What I don't understand is the temporal information. They write:
So, I get that they are looking backwards X number of frames, finding the standard deviation, and calculating the current frame's Z-score. However, are they calculating the Z-score for both of the above metrics? An average of the two? Some other metric? It isn't clear. Additionally, it doesn't seem like this temporal information is used by the SVM, since they explicitly say that the SVM only takes the two parameters calculated above. So, is something marked as a scene if the SVM classifies it as such or the frame's Z-score exceeds 3? I am not sure how to incorporate the temporal information here (see the sketch further down in this comment for one possible interpretation).

### SVM Training

This is an area that would be quite onerous to do on our own, so if they do release a model, it would be a tremendous help. When describing how their SVM is trained, they write (from 4.1):
I am assuming here that they are using their giant dataset for this. If I try to extract their data generation method from their very brief description, it would be something like this:
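Roughly, the idea would be to splice clips from single-scene videos together and label the spliced junctions as transitions. Here is a minimal sketch of that guess; the splicing procedure, labels, and the `extract_features` callback are my assumptions, not something the paper spells out:

```python
import random

def make_training_samples(single_scene_clips, extract_features):
    """Build (features, label) pairs for the SVM from single-scene clips.

    single_scene_clips: list of clips, each a list of frames known to
        contain no scene transition (e.g. segments from their dataset).
    extract_features: function(prev_frame, curr_frame) -> feature vector
        (e.g. the two similarity metrics described in the paper).
    """
    samples = []

    # Negative samples: consecutive frame pairs inside a single clip
    # can never be a transition.
    for clip in single_scene_clips:
        for prev, curr in zip(clip, clip[1:]):
            samples.append((extract_features(prev, curr), 0))

    # Positive samples: splice two unrelated clips together; the frame
    # pair that straddles the cut is, by construction, a transition.
    for _ in range(len(single_scene_clips)):
        a, b = random.sample(single_scene_clips, 2)
        samples.append((extract_features(a[-1], b[0]), 1))

    return samples
```

An SVM could then be fitted on these pairs with something like `sklearn.svm.SVC().fit(X, y)`, though which kernel or hyperparameters they used is anyone's guess.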
The same method could be used to generate test data as well. This would only be possible with a curated dataset like theirs that consists of single-scene videos. Their dataset appears to be a giant list of YouTube videos with timestamps denoting the start and stop points within each video, i.e. which part of the video is included in the dataset. If we wanted to train our own SVM on this dataset, reconstructing it from the YouTube URLs and timestamps would be a huge task. Additionally, they don't really give any insight into the SVM parameters. I am far from an ML expert, so having some additional information on the options used for the SVM would be helpful for replication.
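Here is the sketch mentioned above for the temporal part. It shows the interpretation I find most plausible: the Z-score is computed separately for each of the two features over a trailing window, and a frame is flagged as a transition if either the SVM says so or one of the Z-scores exceeds 3. To be clear, this is guesswork on my part, including the window size and the OR-combination:

```python
from collections import deque
import statistics

class TransitionDetector:
    """Sketch: combine per-feature sliding-window Z-scores with an SVM."""

    def __init__(self, svm, window_size=30, z_threshold=3.0):
        self.svm = svm  # pre-trained classifier with a scikit-learn style .predict()
        self.history = deque(maxlen=window_size)
        self.z_threshold = z_threshold

    def is_transition(self, features):
        """features: the two per-frame metrics for the current frame pair."""
        z_exceeded = False
        if len(self.history) >= 2:
            for i, value in enumerate(features):
                past = [f[i] for f in self.history]
                mean = statistics.mean(past)
                std = statistics.stdev(past)
                if std > 0 and abs(value - mean) / std > self.z_threshold:
                    z_exceeded = True

        svm_says_cut = self.svm.predict([features])[0] == 1
        self.history.append(features)
        return svm_says_cut or z_exceeded
```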
I'm curious what it would look like if we plotted the values for
This section is also very unclear to me, and you raise some good questions. Hopefully they will publish some more information soon.
I am pretty sure that … In doing some searching around about this, I also ran across CLIP. This is a pre-trained deep-learning transformer that can measure similarity between two images. I found a Stack Overflow answer that has a great explanation with some examples. This might be an alternative similarity metric to SSIM. It would also require new dependencies, though, and I have no idea what the computational efficiency would be.
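For reference, here is a quick sketch of what a CLIP-based similarity check could look like using the Hugging Face `transformers` package. The ViT-B/32 checkpoint is just the standard small one; whether this is anywhere near fast enough for frame-by-frame use is exactly the open question:

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# ViT-B/32 is the smallest commonly used CLIP checkpoint.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

def clip_similarity(frame_a, frame_b):
    """Cosine similarity between CLIP embeddings of two frames (PIL images
    or numpy arrays). Values near 1.0 mean the frames look semantically alike."""
    inputs = processor(images=[frame_a, frame_b], return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return float(feats[0] @ feats[1])
```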
Koala-36M proposes a significantly improved model for scene transition detection (paper: HTML or PDF).
See section 4.1, which uses an SVM classifier. The performance degradation is just over a 2x slowdown; however, the accuracy, precision, and recall show marked improvements across the board that likely warrant this change for the majority of users.
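For a rough sense of the per-frame work involved, here is a sketch of the feature extraction and classification step. The choice of SSIM plus a color-histogram comparison as the two features is only my reading of the discussion above, and the pre-trained SVM file is hypothetical, so treat this as an illustration rather than the paper's method:

```python
import cv2
import joblib
from skimage.metrics import structural_similarity

# Hypothetical pre-trained model file; nothing like this has been released yet.
svm = joblib.load("koala36m_svm.joblib")

def frame_features(prev_bgr, curr_bgr):
    """Two similarity features for a pair of adjacent frames (assumed, not confirmed)."""
    prev_gray = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    curr_gray = cv2.cvtColor(curr_bgr, cv2.COLOR_BGR2GRAY)
    ssim = structural_similarity(prev_gray, curr_gray)

    hist_prev = cv2.calcHist([prev_bgr], [0, 1, 2], None, [8, 8, 8], [0, 256] * 3)
    hist_curr = cv2.calcHist([curr_bgr], [0, 1, 2], None, [8, 8, 8], [0, 256] * 3)
    hist_corr = cv2.compareHist(hist_prev, hist_curr, cv2.HISTCMP_CORREL)

    return [ssim, hist_corr]

def is_cut(prev_bgr, curr_bgr):
    return svm.predict([frame_features(prev_bgr, curr_bgr)])[0] == 1
```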