Sequential Topic Segmentation / Session Chapters #9
Comments
This could also be rolled in with LVN topic clustering on timeline: live example.
And related is some very old work on "events over time": https://github.com/CouncilDataProject/seattle_v1/blob/master/projects/quick_analysis.ipynb
More ideas copied from Slack:
I have a rough start on this: create an embedding for each minutes item and for each sentence in the transcript, then run a moving-window distance comparison. For each minutes item, find the collection of sentences whose window minimizes the distance to that item. From there we can find the "strict boundaries" of the windows by looking for trigger words, e.g. "moving on to...", "next up...", etc. For word embeddings we have one of:
Starting prototype work here: https://github.com/JacksonMaxfield/cue-queue
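To make the moving-window comparison concrete, here is a minimal sketch (not the cue-queue prototype itself). It assumes a toy bag-of-words count vector as a stand-in for a real sentence embedding, and the names `embed`, `cosine_distance`, and `best_window` are hypothetical:

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a real sentence embedding: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_distance(a, b):
    """Cosine distance between two sparse count vectors (1.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def best_window(sentences, minutes_item, window_size=2):
    """Return (start_index, distance) of the sentence window closest to the minutes item."""
    target = embed(minutes_item)
    best = (0, float("inf"))
    for start in range(max(1, len(sentences) - window_size + 1)):
        window = " ".join(sentences[start:start + window_size])
        dist = cosine_distance(embed(window), target)
        if dist < best[1]:
            best = (start, dist)
    return best


sentences = [
    "Good morning everyone.",
    "Please call the roll.",
    "Next up is the transit levy budget.",
    "The levy funds bus service.",
    "Moving on to the parks ordinance.",
    "The parks ordinance adds new trails.",
]
start, dist = best_window(sentences, "Transit levy budget")
```

Swapping `embed` for a real embedding model should only require changing that one function and the distance computation, assuming the model returns dense vectors.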
Use Case
Please provide a use case to help us understand your request in context
YouTube has a "Video Chapters" feature that splits the timeline bar into chapters based off of timestamps found in the video description. Example:
Similarly, it would be incredibly useful to jump around a meeting video / transcript based off of the minutes items of the meeting.
Solution
Please describe your ideal solution
This is going to take a lot of work on the backend and a bit of work on the frontend.
We could be fancy and train a topic model or use some sort of seeded clustering, and we likely will at some point, but as a first-pass implementation it may be interesting to see how far the following gets us:
Look for common phrases: "Moving on to...", "Call the roll", "Attendance", etc. and apply breakpoints there.
Additionally, parse all the minutes item attachments (docs, presentations, etc.) for every minutes item of an event and store the list of words UNIQUE to each minutes item. Then search the transcript for those words. Find the breakpoints by taking a moving-window sum of the counts of each minutes item's unique words across the transcript.
I.e.
The moving-window word count would be able to see that at some point we switch from using words specific to minutes item 1 to using words specific to minutes item 2. If we can combine that with looking for the "section splitter sequences" ("moving on", "call the roll", etc.), I think it may be a good first-pass, fast and cheap chapter identifier.
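A rough sketch of the unique-word moving window combined with the splitter phrases might look like the following. Everything here is an assumption for illustration: the helper names (`unique_words`, `label_windows`, `trigger_indices`), the tiny sample texts, and the choice to label each window by its highest unique-word count.

```python
import re
from collections import Counter

# Splitter phrases taken from the issue text; extend as needed.
TRIGGER_PHRASES = ("moving on to", "next up", "call the roll")


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


def unique_words(item_texts):
    """Map each minutes item name to the words that appear in no other item."""
    vocab = {name: set(tokenize(text)) for name, text in item_texts.items()}
    doc_freq = Counter(word for words in vocab.values() for word in words)
    return {name: {w for w in words if doc_freq[w] == 1} for name, words in vocab.items()}


def label_windows(sentences, item_texts, window_size=2):
    """For each window start, return the item whose unique words occur most (None if no hits)."""
    uniques = unique_words(item_texts)
    labels = []
    for start in range(len(sentences)):
        tokens = tokenize(" ".join(sentences[start:start + window_size]))
        counts = {name: sum(t in words for t in tokens) for name, words in uniques.items()}
        best = max(counts, key=counts.get)
        labels.append(best if counts[best] > 0 else None)
    return labels


def trigger_indices(sentences):
    """Indices of sentences that contain a section splitter phrase."""
    return [i for i, s in enumerate(sentences) if any(p in s.lower() for p in TRIGGER_PHRASES)]


# Hypothetical attachment text per minutes item and a short transcript.
items = {
    "item 1": "Transit levy budget for bus service",
    "item 2": "Parks ordinance for new trails",
}
transcript = [
    "Please call the roll.",
    "The transit levy funds bus service.",
    "The levy budget is balanced.",
    "Moving on to the parks ordinance.",
    "The ordinance adds new trails.",
]
labels = label_windows(transcript, items, window_size=1)
triggers = trigger_indices(transcript)
```

A chapter boundary could then be placed where the window label changes, snapped to the nearest trigger index when one is close by. How close counts as "close" is a tuning decision this sketch leaves open.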
Then store chapter identifiers as annotations in the transcript for the frontend to parse.
Alternatives
Please describe any alternatives you've considered, even if you've dismissed them
Topic modeling? Clustering?
Additionally, whatever pipeline we create should be able to skip this step if chapter starts are provided by the user, since Seattle Channel event descriptions include them in most cases nowadays.
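As a sketch of that skip path, the pipeline could first try to read chapter starts out of the event description. The example below assumes YouTube-style lines of the form `H:MM:SS Title` or `M:SS Title`; the actual layout of Seattle Channel descriptions is not specified here, and `parse_chapters` is a hypothetical helper.

```python
import re

# Assumed format: an optional hours field, then minutes:seconds, then a title.
TIMESTAMP_LINE = re.compile(r"^\s*(?:(\d+):)?(\d{1,2}):(\d{2})\s+(.+?)\s*$")


def parse_chapters(description):
    """Return a list of (start_seconds, title) for lines that look like chapter timestamps."""
    chapters = []
    for line in description.splitlines():
        match = TIMESTAMP_LINE.match(line)
        if match:
            hours, minutes, seconds, title = match.groups()
            total = int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds)
            chapters.append((total, title))
    return chapters


description = """City Council meeting
0:00 Call to order
12:30 Public comment
1:05:10 Transit levy"""
chapters = parse_chapters(description)
```

If `parse_chapters` returns a non-empty list, the pipeline could use those starts directly and skip the automatic segmentation.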
Stakeholders
Please add any individual persons or teams that should be brought in for discussion on the project
Frontend to actually make the video chapters viewer.
Backend for both pipeline and transcript mutation.
Major Components
Please add any major components that need to be done for this project
Dependencies
Please add any other major or minor project dependencies here
Other Notes
Please add any extra notes here
My one concern is how to handle many-session events. We only store minutes items at the event level, not at the session level, so we will need to find a way to handle this gracefully.