Commit

Update docs: refactor data types pages (#921)
dreadatour authored Feb 14, 2025
1 parent 0ed0d13 commit c889406
Showing 14 changed files with 148 additions and 28 deletions.
3 changes: 3 additions & 0 deletions docs/references/data-types/arrowrow.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
# ArrowRow

::: datachain.lib.file.ArrowRow
5 changes: 5 additions & 0 deletions docs/references/data-types/bbox.md
@@ -0,0 +1,5 @@
# Bounding Box

::: datachain.model.bbox.BBox

::: datachain.model.bbox.OBBox
35 changes: 35 additions & 0 deletions docs/references/data-types/file.md
@@ -0,0 +1,35 @@
# File

`File` is a special [`DataModel`](index.md#datachain.lib.data_model.DataModel),
which is automatically generated when a `DataChain` is created from files,
such as in [`DataChain.from_storage`](../datachain.md#datachain.lib.dc.DataChain.from_storage):

```python
from datachain import DataChain

dc = DataChain.from_storage("gs://datachain-demo/dogs-and-cats")
dc.print_schema()
```

Output:

```
file: File@v1
source: str
path: str
size: int
version: str
etag: str
is_latest: bool
last_modified: datetime
location: Union[dict, list[dict], NoneType]
```

`File` classes include various metadata fields describing the underlying file,
along with methods to read and manipulate file contents.
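
The metadata fields in the schema above can be pictured with a small stand-in. The sketch below is not the real `datachain.lib.file.File`: `FileSketch` and its `uri()` helper are hypothetical names used only for illustration. Field names follow the printed schema, and the defaults follow the `File` docstring in `src/datachain/lib/file.py`; treating the epoch default as UTC is an assumption.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional, Union


def _epoch() -> datetime:
    # Unix epoch, the documented default for `last_modified` (UTC is assumed here).
    return datetime(1970, 1, 1, tzinfo=timezone.utc)


@dataclass
class FileSketch:
    """Hypothetical stand-in mirroring the documented `File` fields."""

    path: str  # the only field without a default
    source: str = ""
    size: int = 0
    version: str = ""
    etag: str = ""
    is_latest: bool = True
    last_modified: datetime = field(default_factory=_epoch)
    location: Optional[Union[dict, list[dict]]] = None

    def uri(self) -> str:
        # Hypothetical helper: join the storage root and the relative path.
        return f"{self.source.rstrip('/')}/{self.path}"


f = FileSketch(path="path/to/file.txt", source="s3://bucket-name/")
print(f.uri())
```

Only `path` is required in this sketch, which matches the docstring: every other attribute is documented with a default.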

::: datachain.lib.file.File

::: datachain.lib.file.FileError

::: datachain.lib.file.TarVFile
15 changes: 15 additions & 0 deletions docs/references/data-types/imagefile.md
@@ -0,0 +1,15 @@
# ImageFile

`ImageFile` extends [`File`](file.md) and provides additional methods for working with image files.

`ImageFile` instances are created when a `DataChain` is initialized [from storage](../datachain.md#datachain.lib.dc.DataChain.from_storage) with the `type="image"` parameter:

```python
from datachain import DataChain

dc = DataChain.from_storage("s3://bucket-name/", type="image")
```

::: datachain.lib.file.ImageFile

::: datachain.lib.file.Image
@@ -1,4 +1,4 @@
# DataType
# Data Types

Data types supported by `DataChain` must be of type
[`DataType`](#datachain.lib.data_model.DataType). `DataType` includes most Python types
5 changes: 5 additions & 0 deletions docs/references/data-types/pose.md
@@ -0,0 +1,5 @@
# Pose

::: datachain.model.pose.Pose

::: datachain.model.pose.Pose3D
3 changes: 3 additions & 0 deletions docs/references/data-types/segment.md
@@ -0,0 +1,3 @@
# Segment

::: datachain.model.segment.Segment
3 changes: 3 additions & 0 deletions docs/references/data-types/tarvfile.md
@@ -0,0 +1,3 @@
# TarVFile

::: datachain.lib.file.TarVFile
13 changes: 13 additions & 0 deletions docs/references/data-types/textfile.md
@@ -0,0 +1,13 @@
# TextFile

`TextFile` extends [`File`](file.md) and provides additional methods for working with text files.

`TextFile` instances are created when a `DataChain` is initialized [from storage](../datachain.md#datachain.lib.dc.DataChain.from_storage) with the `type="text"` parameter:

```python
from datachain import DataChain

dc = DataChain.from_storage("s3://bucket-name/", type="text")
```

::: datachain.lib.file.TextFile
29 changes: 29 additions & 0 deletions docs/references/data-types/videofile.md
@@ -0,0 +1,29 @@
# VideoFile

`VideoFile` extends [`File`](file.md) and provides additional methods for working with video files.

`VideoFile` instances are created when a `DataChain` is initialized [from storage](../datachain.md#datachain.lib.dc.DataChain.from_storage) with the `type="video"` parameter:

```python
from datachain import DataChain

dc = DataChain.from_storage("s3://bucket-name/", type="video")
```

There are additional models for working with video files:

- `VideoFrame` - represents a single frame of a video file.
- `VideoFragment` - represents a fragment of a video file.

These are virtual models: they do not create physical files.
Instead, each one represents part of the data in the `VideoFile` it refers to.
To persist that data, call the model's `save` method,
which can write it locally or upload it to a storage service.
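
The idea of a virtual model can be sketched without DataChain at all. In the stand-in below, `VideoRef`, `FrameRef`, and `FragmentRef` are hypothetical names, not the library's classes, and the validation rule is an assumption. Each frame or fragment stores only a reference to its parent video plus a position, so no new file exists until something explicitly saves it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VideoRef:
    """Hypothetical stand-in for a stored video file."""

    source: str
    path: str


@dataclass(frozen=True)
class FrameRef:
    """One frame, identified by its index within the parent video."""

    video: VideoRef
    frame: int


@dataclass(frozen=True)
class FragmentRef:
    """A time range, in seconds, within the parent video."""

    video: VideoRef
    start: float
    end: float

    def __post_init__(self) -> None:
        # Assumed rule for this sketch: a fragment must have positive length.
        if not 0 <= self.start < self.end:
            raise ValueError("expected 0 <= start < end")

    @property
    def duration(self) -> float:
        return self.end - self.start


video = VideoRef(source="s3://bucket-name/", path="clips/demo.mp4")
frame = FrameRef(video=video, frame=42)
fragment = FragmentRef(video=video, start=2.5, end=10.0)
print(fragment.duration)
```

Both `frame` and `fragment` point at the same `video` object; neither copies any video bytes.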

::: datachain.lib.file.VideoFile

::: datachain.lib.file.VideoFrame

::: datachain.lib.file.VideoFragment

::: datachain.lib.file.Video
22 changes: 0 additions & 22 deletions docs/references/file.md

This file was deleted.

12 changes: 10 additions & 2 deletions docs/references/index.md
@@ -7,8 +7,16 @@ title: API Reference
DataChain's API is organized into several modules:

- [DataChain](./datachain.md) - Core chain operations and dataset management
- [DataType](./datatype.md) - Type system and schema definitions
- [File](./file.md) - File handling and storage operations
- [Data Types](./data-types/index.md) - Supported data types and schema definitions
- [File](./data-types/file.md) - File handling and storage operations
- [TextFile](./data-types/textfile.md) - Text file
- [ImageFile](./data-types/imagefile.md) - Image file
- [VideoFile](./data-types/videofile.md) - Video file
- [TarVFile](./data-types/tarvfile.md) - Virtual file model for files extracted from tar archives
- [ArrowRow](./data-types/arrowrow.md) - Working with Arrow-supported files
- [BBox](./data-types/bbox.md) - Bounding box data type
- [Pose](./data-types/pose.md) - Pose data type
- [Segment](./data-types/segment.md) - Segment data type
- [UDF](./udf.md) - User-defined functions and transformations
- [Functions](./func.md) - Built-in functions for data manipulation and analysis
- [Torch](./torch.md) - PyTorch data loading utilities
13 changes: 11 additions & 2 deletions mkdocs.yml
@@ -69,8 +69,17 @@ nav:
  - 🐍 API Reference:
      - Overview: references/index.md
      - DataChain: references/datachain.md
      - DataType: references/datatype.md
      - File: references/file.md
      - Data Types:
          - Overview: references/data-types/index.md
          - File: references/data-types/file.md
          - TextFile: references/data-types/textfile.md
          - ImageFile: references/data-types/imagefile.md
          - VideoFile: references/data-types/videofile.md
          - TarVFile: references/data-types/tarvfile.md
          - ArrowRow: references/data-types/arrowrow.md
          - BBox: references/data-types/bbox.md
          - Pose: references/data-types/pose.md
          - Segment: references/data-types/segment.md
      - UDF: references/udf.md
      - Torch: references/torch.md
      - Functions: references/func.md
16 changes: 15 additions & 1 deletion src/datachain/lib/file.py
@@ -122,7 +122,21 @@ def resolve(cls, file: "File", location: list[dict]):


class File(DataModel):
    """`DataModel` for reading binary files."""
    """
    `DataModel` for reading binary files.

    Attributes:
        source (str): The source of the file (e.g., 's3://bucket-name/').
        path (str): The path to the file (e.g., 'path/to/file.txt').
        size (int): The size of the file in bytes. Defaults to 0.
        version (str): The version of the file. Defaults to an empty string.
        etag (str): The ETag of the file. Defaults to an empty string.
        is_latest (bool): Whether the file is the latest version. Defaults to `True`.
        last_modified (datetime): The last modified timestamp of the file.
            Defaults to Unix epoch (`1970-01-01T00:00:00`).
        location (dict | list[dict], optional): The location of the file.
            Defaults to `None`.
    """

    source: str = Field(default="")
    path: str
