
Expected FPS (or per-frame inference time) difference between video and streaming settings? #593

Open
weirenorweiren opened this issue Mar 8, 2025 · 1 comment


@weirenorweiren

I want to apply SAM2 in real time, ideally tracking 50 objects. The tracking robustness is fantastic for my application, even with the tiny model; however, I noticed that the inference time increases almost linearly with the number of objects. If the speed difference between the video and streaming settings is expected to be small, I might consider upgrading my hardware and tracking fewer objects for my proof of concept.

With that, I have the following questions:

  1. Regarding the speed measurements in https://github.com/facebookresearch/sam2?tab=readme-ov-file#sam-21-checkpoints, how many objects did you track?
  2. How much of a difference in FPS (or per-frame inference time) should be expected between the video and streaming settings?
  3. Were your speed measurements done with compilation mode enabled? If not, what is the expected improvement in FPS (or per-frame inference time) with compilation on?

I appreciate any comments and thanks in advance!
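For reference, below is a rough sketch of how the per-object scaling could be measured with the video predictor API. It is not the official benchmark setup; the checkpoint path, config name, frame directory, and placeholder click prompts are assumptions for illustration, and a CUDA device is assumed for the autocast/synchronize calls.

```python
import time

import numpy as np
import torch

from sam2.build_sam import build_sam2_video_predictor

checkpoint = "./checkpoints/sam2.1_hiera_tiny.pt"   # assumed local checkpoint path
model_cfg = "configs/sam2.1/sam2.1_hiera_t.yaml"    # assumed config name
video_dir = "./my_frames"                           # assumed directory of JPEG frames

predictor = build_sam2_video_predictor(model_cfg, checkpoint)


def time_propagation(num_objects):
    """Prompt `num_objects` placeholder objects on frame 0 and time propagation."""
    with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
        state = predictor.init_state(video_path=video_dir)
        for obj_id in range(num_objects):
            # Placeholder click; in practice use real per-object prompts.
            predictor.add_new_points_or_box(
                state,
                frame_idx=0,
                obj_id=obj_id,
                points=np.array([[100.0 + 5.0 * obj_id, 100.0]], dtype=np.float32),
                labels=np.array([1], dtype=np.int32),
            )
        per_frame_times = []
        t0 = time.perf_counter()
        for _frame_idx, _obj_ids, _masks in predictor.propagate_in_video(state):
            torch.cuda.synchronize()
            t1 = time.perf_counter()
            per_frame_times.append(t1 - t0)
            t0 = t1
        return per_frame_times


for n in (1, 10, 25, 50):
    times = time_propagation(n)
    mean_t = sum(times) / len(times)
    print(f"{n:3d} objects: {mean_t * 1e3:6.1f} ms/frame ({1.0 / mean_t:5.1f} FPS)")
```

Plotting the mean per-frame time against the object count would make the linear scaling (and the fixed per-frame overhead) explicit; the first timed frame includes some warm-up, so it may be worth discarding.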

@weirenorweiren (Author)

For Q2, please see below for further clarification.

I am referring to the difference in FPS between a video file input and a live streaming source. My lab camera has an upper limit of 20 FPS, and I'd love the model to go well beyond 20 FPS. If I remember correctly, I saw somewhere in the Issues that there are tricks for video input, such as batch processing, to speed things up. In my case, since frames arrive one at a time, batch processing does not seem applicable. So I am wondering how much speed would be lost without those tricks.

That reminds me to ask: did your speed measurements involve such tricks for video input? If not, and you happen to have measured the per-frame inference time by feeding video data frame by frame, I, along with others interested in real-time applications, would love to see the numbers, since they are an important reference for the model's potential on such tasks!
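As a back-of-the-envelope illustration (the timing constants below are made-up placeholders, not measurements): if the per-frame inference time behaves roughly like a fixed cost plus a per-object cost, a 20 FPS camera leaves a 50 ms budget per frame, which bounds how many objects can be tracked in real time.

```python
# Hypothetical numbers for illustration only, not measured values.
camera_fps = 20.0
frame_budget_s = 1.0 / camera_fps       # 50 ms available per frame

t_base_s = 0.015        # assumed fixed per-frame cost (image encoder, overhead)
t_per_obj_s = 0.0015    # assumed marginal cost per tracked object

max_objects = int((frame_budget_s - t_base_s) / t_per_obj_s)
print(f"Frame budget: {frame_budget_s * 1e3:.0f} ms")
print(f"Estimated max objects at {camera_fps:.0f} FPS: {max_objects}")
```

Any measured difference between the video and streaming settings would shift the fixed per-frame cost, which is exactly why the comparison matters for this use case.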
