"Golden spike" PR #488

Draft
wants to merge 47 commits into
base: main
47 commits
a859b58
Stub out index_delta(), index_lance(), index_parquet().
knighton Oct 28, 2023
bd0208a
index_backend().
knighton Oct 28, 2023
69575bc
Fix.
knighton Oct 28, 2023
42b59f1
task.py for benchmarking.
knighton Oct 28, 2023
2fb1b09
generate_datasets.py.
knighton Oct 28, 2023
11dd673
Fix.
knighton Oct 28, 2023
82737e0
Organize/divide streaming/base/util.py:
knighton Oct 28, 2023
3212f66
Completely rip out and rewrite pretty args handling:
knighton Oct 28, 2023
eb93bea
Layer several new storage APIs wrapping/complementing streaming/base/…
knighton Oct 28, 2023
23554ac
Use those APIs to index a Parquet dataset (single-threaded).
knighton Oct 28, 2023
c711567
Add cli/index_parquet.py.
knighton Oct 28, 2023
4ea01b2
Rename get_list_arg() -> parse_strs() in keeping with parse_str2str()…
knighton Oct 28, 2023
157381a
Rename parse_(args stuff) -> unpack_(args stuff).
knighton Oct 28, 2023
c72127f
Long lines.
knighton Oct 28, 2023
d2be6a0
Populate streaming/examples/ with SD subclasses, also streaming/bench…
knighton Oct 28, 2023
da6f4af
Fix.
knighton Oct 28, 2023
b0fa3d7
Move benchmarks up and out.
knighton Oct 28, 2023
4a22638
Fix.
knighton Oct 28, 2023
cb80865
Now, rename streaming/base/... -> streaming/....
knighton Oct 28, 2023
4851888
Update paths accordingly.
knighton Oct 28, 2023
1051474
Update more paths.
knighton Oct 28, 2023
65ef0de
Formatting.
knighton Oct 28, 2023
408999a
Fix.
knighton Oct 28, 2023
b38f8a3
Move examples/ to top level.
knighton Oct 28, 2023
ff90826
Update multimodal.
knighton Oct 28, 2023
a7808ae
Update vision dataset examples -> kwargs.
knighton Oct 28, 2023
c857ed6
Update vision datasets to use kwargs (save us from bitrot, o kwargs).
knighton Oct 28, 2023
89d5719
Generalize `keep_zip` argument to `keep_packed`.
knighton Oct 28, 2023
c09248c
Add graceful migration from keep_zip to keep_packed.
knighton Oct 28, 2023
9befaa6
First take on a MDS write_dataset().
knighton Oct 29, 2023
c4a5094
Add enough column inference to keep going.
knighton Oct 29, 2023
48dce5c
Writing all given samples as one indexless MDS shard, returning its …
knighton Oct 29, 2023
b0d1543
Naming.
knighton Oct 29, 2023
99ad0c0
Fixes.
knighton Oct 29, 2023
b38fce0
cli/hash.py.
knighton Oct 29, 2023
7a9fc90
walk_prefix() including local fs.
knighton Oct 29, 2023
5247bfe
generate_datasets.py: Tabulator.
knighton Nov 5, 2023
6dc5e22
Fix (passing `left`, and spacing).
knighton Nov 5, 2023
18f6474
Switch to box-drawing chars in Tabulator. Example:
knighton Nov 5, 2023
52af2cb
Rewrite task.py.
knighton Nov 5, 2023
bc125b4
Fixes.
knighton Nov 5, 2023
a2ff86f
Fix.
knighton Nov 5, 2023
57e7571
Misc.
knighton Nov 5, 2023
52dcb42
Merge branch 'main' into james/proto
knighton Nov 5, 2023
f1e10bb
Split out Tabulator.
knighton Nov 5, 2023
56674e8
Merge branch 'james/proto' of github.com:mosaicml/streaming into jame…
knighton Nov 5, 2023
cbfcab3
Refactor.
knighton Nov 5, 2023
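
Commits 89d5719 and c09248c rename the `keep_zip` argument to `keep_packed` and add a graceful migration between the two. The PR's own implementation is not shown on this page, so the following is only a minimal sketch of how such a migration is commonly done, assuming a hypothetical helper named `resolve_keep_packed()` and a deprecation warning for the old name. The helper name, the default value, and the error on conflicting values are all illustrative assumptions, not the PR's code.

```python
import warnings
from typing import Optional


def resolve_keep_packed(keep_packed: Optional[bool] = None,
                        keep_zip: Optional[bool] = None,
                        default: bool = False) -> bool:
    """Hypothetical helper: reconcile the new ``keep_packed`` with the legacy ``keep_zip``.

    Args:
        keep_packed: New-style argument, or None if the caller did not set it.
        keep_zip: Legacy argument, or None if the caller did not set it.
        default: Value to use when neither argument is provided (assumed here).

    Returns:
        The effective ``keep_packed`` setting.
    """
    if keep_zip is not None:
        # The old name still works, but callers are told to move to the new one.
        warnings.warn('`keep_zip` is deprecated; use `keep_packed` instead.',
                      DeprecationWarning,
                      stacklevel=2)
        if keep_packed is not None and keep_packed != keep_zip:
            # Both were given and they disagree: refuse to guess.
            raise ValueError('Conflicting values for `keep_packed` and `keep_zip`.')
        return keep_zip
    return default if keep_packed is None else keep_packed
```

Under this sketch, a caller that still passes `keep_zip=True` gets the same behavior as `keep_packed=True` plus a `DeprecationWarning`, while passing both names with different values fails loudly instead of silently choosing one.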
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -2,7 +2,7 @@ default_language_version:
   python: python3
 # Skip the pre-commit check for below directories to have
 # a consistency with the official tfrecord preprocessing scripts
-exclude: "^(streaming/text/convert/enwiki/)"
+exclude: "^(examples/text/enwiki_tok/)"
 repos:
 - repo: https://github.com/astral-sh/ruff-pre-commit
 # Ruff version.
14 changes: 7 additions & 7 deletions STYLE_GUIDE.md
@@ -142,10 +142,10 @@ so other contributors will know why this error was silenced.
 A public API, generally speaking, can be invoked by a user without a leading underscore in any portion of the path.
 The following are examples of public APIs:

-* Standalone functions in public modules (e.g. `streaming.base.distributed.get_world_size`)
-* Classes in public modules (e.g. `streaming.base.format.MDSWriter`)
-* Public methods in public classes (e.g. `streaming.base.format.MDSWriter.write`)
-* Public modules (e.g. `streaming.base.dataset`)
+* Standalone functions in public modules (e.g. `streaming.distributed.get_world_size`)
+* Classes in public modules (e.g. `streaming.format.MDSWriter`)
+* Public methods in public classes (e.g. `streaming.format.MDSWriter.write`)
+* Public modules (e.g. `streaming.dataset`)

 The following rules apply to public APIs:
 1. All public APIs must have a docstring (see the Documentation section below)
@@ -201,14 +201,14 @@ All public modules must define `__all__` to be the list of members that should b
 The variable is necessary to 1) limit what `from XXX import *` imports, and 2) ensure that the documentation only
 includes exported members, not unrelated re-imports.

-For example, from [streaming/base/dataset.py](streaming/base/dataset.py)
+For example, from [streaming/dataset.py](streaming/dataset.py)

 ```python
 """The :class:`Dataset` class, used for building streaming iterable datasets."""
 from torch.utils.data import IterableDataset

-from streaming.base.format import reader_from_json
-from streaming.base.spanner import Spanner
+from streaming.format import reader_from_json
+from streaming.spanner import Spanner

 __all__ = ["Dataset"] # export only the Dataset, not other imports like `Spanner` or `reader_from_json`

@@ -1,4 +1,4 @@
 # Copyright 2023 MosaicML Streaming authors
 # SPDX-License-Identifier: Apache-2.0

-"""LAION dataset creation."""
+"""Streaming benchmarks."""
4 changes: 4 additions & 0 deletions benchmarks/backends/__init__.py
@@ -0,0 +1,4 @@
+# Copyright 2023 MosaicML Streaming authors
+# SPDX-License-Identifier: Apache-2.0
+
+"""Benchmarking generating/iterating datasets of different backends and formats."""