Video models #890

dreadatour · 2025-02-03T18:36:22Z

An alternative approach to implementing video models, based on this comment. It looks much cleaner.

New `VideoFile` model

from typing import Iterator, Optional


class VideoFile(File):
    """
    A data model for handling video files.

    This model inherits from the `File` model and provides additional functionality
    for reading video files, extracting video frames, and splitting videos into
    fragments.
    """

    def get_info(self) -> "Video":
        """
        Retrieves metadata and information about the video file.

        Returns:
            Video: A Model containing video metadata such as duration,
                   resolution, frame rate, and codec details.
        """

    def get_frame(self, frame: int) -> "VideoFrame":
        """
        Returns a specific video frame by its frame number.

        Args:
            frame (int): The frame number to read.

        Returns:
            VideoFrame: Video frame model.
        """

    def get_frames(
        self,
        start: int = 0,
        end: Optional[int] = None,
        step: int = 1,
    ) -> "Iterator[VideoFrame]":
        """
        Returns video frames from the specified range in the video.

        Args:
            start (int): The starting frame number (default: 0).
            end (int, optional): The ending frame number (exclusive). If None,
                                 frames are read until the end of the video
                                 (default: None).
            step (int): The interval between frames to read (default: 1).

        Returns:
            Iterator[VideoFrame]: An iterator yielding video frames.

        Note:
            If `end` is not specified, the number of frames is read from the
            video file, which means the video file must be downloaded.
        """

    def get_fragment(self, start: float, end: float) -> "VideoFragment":
        """
        Returns a video fragment from the specified time range.

        Args:
            start (float): The start time of the fragment in seconds.
            end (float): The end time of the fragment in seconds.

        Returns:
            VideoFragment: A Model representing the video fragment.
        """

    def get_fragments(
        self,
        duration: float,
        start: float = 0,
        end: Optional[float] = None,
    ) -> "Iterator[VideoFragment]":
        """
        Splits the video into multiple fragments of a specified duration.

        Args:
            duration (float): The duration of each video fragment in seconds.
            start (float): The starting time in seconds (default: 0).
            end (float, optional): The ending time in seconds. If None, the entire
                                   remaining video is processed (default: None).

        Returns:
            Iterator[VideoFragment]: An iterator yielding video fragments.

        Note:
            If `end` is not specified, the video duration is read from the
            video file, which means the video file must be downloaded.
        """
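
The frame-range semantics documented for `get_frames` (an exclusive `end`, with `None` meaning "until the last frame") can be sketched with a small helper. This is an illustration only, not the code from this PR: `frame_numbers` and its `total_frames` parameter are hypothetical names, and `total_frames` stands in for the frame count that would be read from the video metadata.

```python
from typing import Iterator, Optional


def frame_numbers(
    total_frames: int,
    start: int = 0,
    end: Optional[int] = None,
    step: int = 1,
) -> Iterator[int]:
    """Yield the frame numbers selected by (start, end, step).

    Hypothetical helper: `total_frames` is assumed to come from the
    video metadata when `end` is not given.
    """
    if start < 0 or step < 1:
        raise ValueError("start must be >= 0 and step must be >= 1")
    if end is None:
        # No explicit end: fall back to the total number of frames.
        end = total_frames
    elif end < start:
        raise ValueError("end must not be less than start")
    # `end` is exclusive, as in Python's built-in range().
    yield from range(start, end, step)


every_third = list(frame_numbers(total_frames=10, step=3))
first_slice = list(frame_numbers(total_frames=10, start=2, end=5))
```

Because the fallback needs the total frame count, only the `end is None` path depends on information stored in the file itself, which matches the note about downloading.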

New `VideoFrame` model

A `VideoFrame` can be created without downloading the video file, since it is a "virtual" frame: the original `VideoFile` plus a frame number.

If a physical frame image is needed, call the `save` method, which uploads the frame image to storage and returns a new `ImageFile` model.

API:

class VideoFrame(DataModel):
    """
    A data model for representing a video frame.

    This model references a `VideoFile` and adds a `frame` attribute,
    which identifies a specific frame within that video file. It allows access
    to individual frames and provides functionality for reading and saving
    video frames as image files.

    Attributes:
        video (VideoFile): The video file containing the video frame.
        frame (int): The frame number referencing a specific frame in the video file.
    """

    video: VideoFile
    frame: int

    def get_np(self) -> "ndarray":
        """
        Returns a video frame from the video file as a NumPy array.

        Returns:
            ndarray: A NumPy array representing the video frame,
                     in the shape (height, width, channels).
        """

    def read_bytes(self, format: str = "jpg") -> bytes:
        """
        Returns a video frame from the video file as image bytes.

        Args:
            format (str): The desired image format (e.g., 'jpg', 'png').
                          Defaults to 'jpg'.

        Returns:
            bytes: The encoded video frame as image bytes.
        """

    def save(self, output: str, format: str = "jpg") -> "ImageFile":
        """
        Saves the current video frame as an image file.

        If `output` is a remote path, the image file will be uploaded to remote storage.

        Args:
            output (str): The destination path, which can be a local file path
                          or a remote URL.
            format (str): The image format (e.g., 'jpg', 'png'). Defaults to 'jpg'.

        Returns:
            ImageFile: A Model representing the saved image file.
        """
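
The "virtual" idea described above (a reference that stores only the source video and a frame number, and does the expensive work on demand) can be sketched as follows. Everything here is a hypothetical stand-in, not the `VideoFrame` implementation from this PR: `VirtualFrame`, `fake_decode`, and the example URI are invented for illustration, and the decoder is faked so the snippet runs without any video library.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class VirtualFrame:
    """Hypothetical frame reference: a source plus a frame number."""

    source: str  # path or URI of the video; not opened on construction
    frame: int
    decode: Callable[[str, int], bytes]  # hypothetical decoder, called on demand

    def read_bytes(self) -> bytes:
        # The potentially expensive decode happens only here.
        return self.decode(self.source, self.frame)


calls: List[Tuple[str, int]] = []


def fake_decode(source: str, frame: int) -> bytes:
    """Fake decoder that records each call instead of reading a video."""
    calls.append((source, frame))
    return f"{source}#{frame}".encode()


# Creating many references is cheap: nothing is decoded yet.
frames = [
    VirtualFrame("s3://bucket/video.mp4", n, fake_decode)
    for n in range(0, 100, 25)
]
calls_before_read = list(calls)

# Only the frame that is actually read triggers the decoder.
data = frames[2].read_bytes()
```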

New `VideoFragment` model

A `VideoFragment` can be created without downloading the video file, since it is a "virtual" fragment: the original video file plus start and end timestamps.

If a physical fragment video is needed, call the `save` method, which uploads the fragment video to storage and returns a new `VideoFile` model.

API:

class VideoFragment(DataModel):
    """
    A data model for representing a video fragment.

    This model references a `VideoFile` and adds `start`
    and `end` attributes, which delimit a specific fragment within that video file.
    It allows access to individual fragments and provides functionality for reading
    and saving video fragments as separate video files.

    Attributes:
        video (VideoFile): The video file containing the video fragment.
        start (float): The starting time of the video fragment in seconds.
        end (float): The ending time of the video fragment in seconds.
    """

    video: VideoFile
    start: float
    end: float

    def save(self, output: str, format: Optional[str] = None) -> "VideoFile":
        """
        Saves the video fragment as a new video file.

        If `output` is a remote path, the video file will be uploaded to remote storage.

        Args:
            output (str): The destination path, which can be a local file path
                          or a remote URL.
            format (str, optional): The output video format (e.g., 'mp4', 'avi').
                                    If None, the format is inferred from the
                                    file extension.

        Returns:
            VideoFile: A Model representing the saved video file.
        """
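
Splitting a video into fixed-length fragments, as `get_fragments` is documented to do, amounts to producing `(start, end)` pairs like the ones stored on `VideoFragment`. The sketch below is an assumption-laden illustration rather than the PR's code: `fragment_bounds` and `total_duration` are hypothetical names, and truncating the last fragment at `end` is an assumed behavior that the description does not confirm.

```python
from typing import Iterator, Optional, Tuple


def fragment_bounds(
    total_duration: float,
    duration: float,
    start: float = 0.0,
    end: Optional[float] = None,
) -> Iterator[Tuple[float, float]]:
    """Yield (start, end) times in seconds for consecutive fragments.

    Hypothetical helper: `total_duration` is assumed to come from the
    video metadata when `end` is not given.
    """
    if duration <= 0:
        raise ValueError("duration must be positive")
    if end is None:
        # No explicit end: process the remainder of the video.
        end = total_duration
    index = 0
    while True:
        # Multiply instead of accumulating to limit floating-point drift.
        frag_start = start + index * duration
        if frag_start >= end:
            break
        # Assumption: the final fragment is cut short at `end`.
        yield frag_start, min(frag_start + duration, end)
        index += 1


bounds = list(fragment_bounds(total_duration=10.0, duration=4.0))
```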

New `Video` model

Video file metadata.

class Video(DataModel):
    """
    A data model representing metadata for a video file.

    Attributes:
        width (int): The width of the video in pixels. Defaults to -1 if unknown.
        height (int): The height of the video in pixels. Defaults to -1 if unknown.
        fps (float): The frame rate of the video (frames per second).
                     Defaults to -1.0 if unknown.
        duration (float): The total duration of the video in seconds.
                          Defaults to -1.0 if unknown.
        frames (int): The total number of frames in the video.
                      Defaults to -1 if unknown.
        format (str): The format of the video file (e.g., 'mp4', 'avi').
                      Defaults to an empty string.
        codec (str): The codec used for encoding the video. Defaults to an empty string.
    """

    width: int = Field(default=-1)
    height: int = Field(default=-1)
    fps: float = Field(default=-1.0)
    duration: float = Field(default=-1.0)
    frames: int = Field(default=-1)
    format: str = Field(default="")
    codec: str = Field(default="")
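
Since unknown values are represented by sentinels (`-1`, `-1.0`, or an empty string) rather than `None`, callers may want to filter them out before using the metadata. A minimal sketch, assuming a plain dataclass in place of the real model: `VideoMeta` and `known_fields` are hypothetical names that only mirror the fields and defaults listed above.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class VideoMeta:
    """Hypothetical dataclass mirroring the fields and defaults of `Video`."""

    width: int = -1
    height: int = -1
    fps: float = -1.0
    duration: float = -1.0
    frames: int = -1
    format: str = ""
    codec: str = ""


def known_fields(meta: VideoMeta) -> Dict[str, Any]:
    """Return only the fields that differ from the 'unknown' sentinels."""
    return {
        name: value
        for name, value in vars(meta).items()
        if value not in (-1, "")  # -1 also matches -1.0
    }


partial = known_fields(VideoMeta(width=1920, height=1080, fps=30.0))
```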


codecov · 2025-02-03T18:43:43Z

Codecov Report

Attention: Patch coverage is 87.50000% with 19 lines in your changes missing coverage. Please review.

Project coverage is 87.72%. Comparing base (b95cc76) to head (94d363c).
Report is 1 commit behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/datachain/lib/video.py | 79.51% | 11 Missing and 6 partials ⚠️ |
| src/datachain/lib/file.py | 97.10% | 1 Missing and 1 partial ⚠️ |

Additional details and impacted files

@@           Coverage Diff            @@
##             main     #890    +/-   ##
========================================
  Coverage   87.72%   87.72%            
========================================
  Files         129      130     +1     
  Lines       11492    11641   +149     
  Branches     1554     1579    +25     
========================================
+ Hits        10081    10212   +131     
- Misses       1022     1033    +11     
- Partials      389      396     +7

| Flag | Coverage Δ | |
|---|---|---|
| datachain | `87.64% <87.50%> (+<0.01%)` | ⬆️ |


mattseddon · 2025-02-04T05:17:24Z

src/datachain/lib/file.py

+
+        return video_info(self)
+
+    def get_frame(self, frame: int) -> "VideoFrame":


Minor, but should these be to_ methods to match the DataChain class?


Looks reasonable 🤔 However, this is not a direct conversion ("to") but rather extracting part of a file into another file: "get frame from video" reads well to me, while "video to frame" looks odd. What do you think? I don't have a strong opinion on this 🤔

cloudflare-workers-and-pages · 2025-02-04T16:01:13Z

Deploying datachain-documentation with Cloudflare Pages

Latest commit:	`94d363c`
Status:	✅ Deploy successful!
Preview URL:	https://a1a29e21.datachain-documentation.pages.dev
Branch Preview URL:	https://video-models-2.datachain-documentation.pages.dev


mattseddon · 2025-02-04T21:47:40Z

tests/unit/lib/test_video.py

+
+
+@pytest.fixture(autouse=True)
+def video_file(catalog) -> VideoFile:


[C] Some of these are probably func tests as they are writing/reading to/from disk.


mattseddon · 2025-02-05T00:50:28Z

src/datachain/lib/file.py

+
+        return video_frame_bytes(self, format)
+
+    def save(self, output: str, format: str = "jpg") -> "ImageFile":


[C] Now that we have virtual models, we could add a video example in this repo.


mattseddon · 2025-02-05T01:16:22Z

src/datachain/lib/video.py

+    if len(video_streams) == 0:
+        raise FileError(file, "no video streams found in video file")
+
+    video_stream = video_streams[0]


[Q] Why take the first one? Is there only ever one?

Some video container formats (such as MPEG-4) support multiple video, audio, and subtitle streams. However, in practice, video files with multiple video streams are extremely rare. Most applications, including video players and streaming platforms, are not designed to handle multiple video streams because there is little practical use for them.

I have heard of a few specialized use cases, such as movies with different aspect ratios or sports event recordings featuring multiple camera angles. However, these are rare exceptions, and I have never encountered them in real-world scenarios. In most cases, it is much simpler to provide multiple separate video files and allow users to download or stream only the one they need.
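
The "take the first video stream" choice discussed above can be illustrated with a small, library-free sketch. It assumes stream metadata shaped like ffprobe's JSON output (a `streams` list whose entries carry a `codec_type`); whether the PR obtains its metadata this way is an assumption, and `first_video_stream` is a hypothetical name. A plain `ValueError` stands in for the `FileError` raised in the reviewed code.

```python
from typing import Any, Dict, List


def first_video_stream(probe: Dict[str, Any]) -> Dict[str, Any]:
    """Return the first stream whose codec_type is 'video'.

    Assumes `probe` follows the layout of ffprobe's JSON output.
    """
    streams: List[Dict[str, Any]] = probe.get("streams", [])
    video_streams = [s for s in streams if s.get("codec_type") == "video"]
    if not video_streams:
        # Stand-in for the FileError raised in the reviewed code.
        raise ValueError("no video streams found in video file")
    # Files with several video streams are rare, so the first one is used.
    return video_streams[0]


probe_result = {
    "streams": [
        {"index": 0, "codec_type": "audio", "codec_name": "aac"},
        {"index": 1, "codec_type": "video", "codec_name": "h264"},
        {"index": 2, "codec_type": "video", "codec_name": "mjpeg"},
    ]
}
selected = first_video_stream(probe_result)
```

If multi-stream files ever needed support, a stream index parameter could be added without changing the default behavior.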

dreadatour and others added 30 commits January 13, 2025 23:48

Add video models + functions

75877d1

Code review update

031b9df

[pre-commit.ci] auto fixes from pre-commit.com hooks

548bbd5

for more information, see https://pre-commit.ci

Code review update

b55149a

Code review update

2cd6d62

Small fixes due to work on usage examples

5892ab9

Examples fixes

f3dc66a

docs(merge): add examples with Func object (#811)

65529f3

fix(tqdm): import tqdm to support jupyter (#812)

b044082

progress: remove unused logging/tqdm lock (#817)

89ee2f0

file: raise error (#820)

67beb9f

README - mistral fix (#821)

60c5848

file: support exporting files as a symlink (#819)

d3b1619

ReferenceFileSystem: use fs.open instead of fs._open (#823)

bcd95b1

Fix list of tuples. Closes #827 (#828)

dbefa5f

Added full outer join (#822)

258454e

* added main logic for outer join
* fixing filters
* removing datasetquery tests and added more datachain unit tests

Added isnone() function (#801)

a1a47b2

Added `isnone()` function

tests: reduce pytorch functional tests' runtime (#834)

5b2f45b

improve runtime of diff unit tests (#831)

14caa08

move functional tests out of unit test suite (#832)

746fd73

* move tests using cloud_test_catalog into func directory
* move tests using tmpfile catalog
* move long running tests that read/write from disk

import Int into test_datachain_merge (fix tests broken on bad merge) (#…

0fe47dd

…837)

[pre-commit.ci] pre-commit autoupdate (#836)

1598c4c

updates: - [github.com/astral-sh/ruff-pre-commit: v0.9.1 → v0.9.2](astral-sh/ruff-pre-commit@v0.9.1...v0.9.2) Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

dreadatour and others added 7 commits January 30, 2025 00:02

Revert 'ensure_cached' test

abe39f5

[pre-commit.ci] auto fixes from pre-commit.com hooks

3b7b829

for more information, see https://pre-commit.ci

Fix tests

55f0478

Fix tests

99b9490

Update video models

c28cd66

Merge branch 'main' into video-models

4098e8b

Update video models

0f2e12c

dreadatour requested review from shcheklein, dmpetrov, mattseddon and a team February 3, 2025 18:36

dreadatour self-assigned this Feb 3, 2025

mattseddon reviewed Feb 4, 2025


dreadatour marked this pull request as draft February 4, 2025 05:32

dreadatour added 2 commits February 4, 2025 23:00

Update video models

e247723

Merge branch 'main' into video-models-2

fcad176

dreadatour mentioned this pull request Feb 4, 2025

Add video example iterative/datachain-examples#28

Open

mattseddon approved these changes Feb 5, 2025


dreadatour added 3 commits February 5, 2025 17:34

Update video models docstrings

ea65af4

Update video models

34b135b

More Yolo toolkit functions to separate PR

198955f

dreadatour mentioned this pull request Feb 6, 2025

Add video models + functions #814

Closed

dreadatour added 2 commits February 6, 2025 23:02

Update video models

a5e2e47

Merge branch 'main' into video-models-2

94d363c

dreadatour changed the title ~~Video models (take 2)~~ Video models Feb 6, 2025

dreadatour marked this pull request as ready for review February 6, 2025 16:04

dreadatour merged commit 86fc806 into main Feb 6, 2025
36 of 37 checks passed

dreadatour deleted the video-models-2 branch February 6, 2025 16:32
