Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Run CI on Modal, upgrade Bitsandbytes #641

Open
wants to merge 19 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
12 changes: 8 additions & 4 deletions .github/workflows/check-style.yml
Original file line number Diff line number Diff line change
Expand Up @@ -5,20 +5,24 @@ on:
branches: [ master ]
pull_request:

concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
cancel-in-progress: true

jobs:
black:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- uses: actions/checkout@v4
- uses: psf/black@stable
with:
options: "--check --diff"
version: "22.3.0"
isort:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- uses: actions/setup-python@v3
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with:
python-version: 3.11
- uses: isort/isort-action@master
Expand All @@ -28,7 +32,7 @@ jobs:
codespell:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- uses: actions/checkout@v4
- uses: codespell-project/actions-codespell@v1
with:
only_warn: 1
Expand Down
6 changes: 5 additions & 1 deletion .github/workflows/push-docker-image.yml
Original file line number Diff line number Diff line change
Expand Up @@ -8,13 +8,17 @@ on:
pull_request:
branches: [ master ]

concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
cancel-in-progress: true

jobs:
build:
runs-on: ubuntu-latest

steps:
- name: Checkout
uses: actions/checkout@v3
uses: actions/checkout@v4

- name: Docker meta
id: meta
Expand Down
12 changes: 8 additions & 4 deletions .github/workflows/run-benchmarks.yml
Original file line number Diff line number Diff line change
Expand Up @@ -5,19 +5,23 @@ on:
branches: [ master ]
pull_request:

concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
cancel-in-progress: true

jobs:
run_benchmarks:

runs-on: ubuntu-latest
timeout-minutes: 10
steps:
- uses: actions/checkout@v3
- uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v3
uses: actions/setup-python@v5
with:
python-version: 3.11
- name: Cache dependencies
uses: actions/cache@v3
uses: actions/cache@v4
with:
path: ~/.cache/pip
key: Key-v1-3.11-${{ hashFiles('requirements.txt') }}-${{ hashFiles('requirements-dev.txt') }}
Expand All @@ -28,7 +32,7 @@ jobs:
pip install -r requirements-dev.txt
- name: Build bitsandbytes
run: |
pip install bitsandbytes==0.41.1
pip install bitsandbytes==0.45.2
- name: Build hivemind
run: |
pip install .
Expand Down
78 changes: 78 additions & 0 deletions .github/workflows/run-tests-on-modal.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,78 @@
name: Modal tests

on:
push:
branches: [master]
pull_request:

concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
cancel-in-progress: true

jobs:
run_tests:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.9", "3.10", "3.11", "3.12"]
fail-fast: false
env:
MODAL_TOKEN_ID: ${{ secrets.MODAL_TOKEN_ID }}
MODAL_TOKEN_SECRET: ${{ secrets.MODAL_TOKEN_SECRET }}
PYTHON_VERSION: ${{ matrix.python-version }}
timeout-minutes: 10
steps:
- name: Checkout Repository
uses: actions/checkout@v4

- name: Install Python
uses: actions/setup-python@v5
with:
python-version: "3.12"

- name: Cache dependencies
uses: actions/cache@v4
with:
path: ~/.cache/pip
key: Key-v1-3.12-modal

- name: Install build dependencies
run: |
python -m pip install --upgrade pip
pip install modal==0.73.32

- name: Run tests
run: |
modal run modal_ci.py::run_tests

measure_coverage:
runs-on: ubuntu-latest
env:
MODAL_TOKEN_ID: ${{ secrets.MODAL_TOKEN_ID }}
MODAL_TOKEN_SECRET: ${{ secrets.MODAL_TOKEN_SECRET }}
CODECOV_TOKEN: ${{ secrets.CODECOV_TOKEN }}
PYTHON_VERSION: "3.11"
timeout-minutes: 10
steps:
- name: Checkout Repository
uses: actions/checkout@v4

- name: Install Python
uses: actions/setup-python@v5
with:
python-version: "3.12"

- name: Cache dependencies
uses: actions/cache@v4
with:
path: ~/.cache/pip
key: Key-v1-3.12-modal

- name: Install build dependencies
run: |
python -m pip install --upgrade pip
pip install modal==0.73.32

- name: Measure and upload coverage
run: |
modal run modal_ci.py::run_codecov
14 changes: 9 additions & 5 deletions .github/workflows/run-tests.yml
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,10 @@ on:
branches: [ master ]
pull_request:

concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
cancel-in-progress: true

jobs:
run_tests:

Expand All @@ -15,13 +19,13 @@ jobs:
fail-fast: false
timeout-minutes: 15
steps:
- uses: actions/checkout@v3
- uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v3
uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python-version }}
- name: Cache dependencies
uses: actions/cache@v3
uses: actions/cache@v4
with:
path: ~/.cache/pip
key: Key-v1-${{ matrix.python-version }}-${{ hashFiles('requirements.txt') }}-${{ hashFiles('requirements-dev.txt') }}
Expand All @@ -32,7 +36,7 @@ jobs:
pip install -r requirements-dev.txt
- name: Build bitsandbytes
run: |
pip install bitsandbytes==0.41.1
pip install bitsandbytes==0.45.2
- name: Build hivemind
run: |
pip install .
Expand Down Expand Up @@ -94,7 +98,7 @@ jobs:
pip install -r requirements-dev.txt
- name: Build bitsandbytes
run: |
pip install bitsandbytes==0.41.1
pip install bitsandbytes==0.45.2
- name: Build hivemind
run: |
pip install -e . --no-use-pep517
Expand Down
1 change: 1 addition & 0 deletions .readthedocs.yml
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,7 @@ version: 2

sphinx:
fail_on_warning: true
configuration: docs/conf.py

python:
install:
Expand Down
12 changes: 9 additions & 3 deletions hivemind/compression/quantization.py
Original file line number Diff line number Diff line change
Expand Up @@ -140,8 +140,14 @@ def quantize(
except ImportError:
raise ImportError(BNB_MISSING_MESSAGE)

quantized, (absmax, codebook, *extra_params) = quantize_blockwise(tensor, blocksize=4096, nested=False)
assert tuple(extra_params) == self.EXTRA_PARAMS # blocksize, nested, dtype, offset, state2
assert tensor.dtype == torch.float32
quantized, quant_state = quantize_blockwise(tensor, blocksize=4096, nested=False)
absmax, codebook = quant_state.absmax, quant_state.code
assert quant_state.blocksize == 4096
assert quant_state.nested is False
assert quant_state.dtype == self.EXTRA_PARAMS[2]
assert quant_state.offset == self.EXTRA_PARAMS[3]
assert quant_state.state2 == self.EXTRA_PARAMS[4]
return quantized.numpy(), (absmax.numpy(), codebook.numpy())

def compress(self, tensor: torch.Tensor, info: CompressionInfo, allow_inplace: bool = False) -> runtime_pb2.Tensor:
Expand Down Expand Up @@ -187,5 +193,5 @@ def extract(self, serialized_tensor: runtime_pb2.Tensor) -> torch.Tensor:
absmax = torch.as_tensor(absmax)
codebook = torch.as_tensor(codebook)
quantized = torch.as_tensor(quantized).reshape(tuple(serialized_tensor.size))
result = dequantize_blockwise(quantized, (absmax, codebook, *self.EXTRA_PARAMS))
result = dequantize_blockwise(quantized, absmax=absmax, code=codebook, blocksize=4096, nested=False)
return result.to(getattr(torch, serialized_tensor.dtype)).requires_grad_(serialized_tensor.requires_grad)
86 changes: 86 additions & 0 deletions modal_ci.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,86 @@
import os
import subprocess

import modal

# Create an image with system dependencies
image = (
modal.Image.debian_slim(python_version=os.environ["PYTHON_VERSION"])
.apt_install(["git", "procps", "build-essential", "cmake"])
.pip_install_from_requirements("requirements-dev.txt")
.pip_install_from_requirements("requirements.txt")
.run_commands(
[
"git clone --branch 0.45.2 --depth 1 https://github.com/bitsandbytes-foundation/bitsandbytes.git",
"cd bitsandbytes && cmake -DCOMPUTE_BACKEND=cpu -S . && make && pip --no-cache install . ",
]
)
.add_local_dir("hivemind", remote_path="/root/hivemind/hivemind")
.add_local_file("requirements.txt", remote_path="/root/hivemind/requirements.txt")
.add_local_file("requirements-dev.txt", remote_path="/root/hivemind/requirements-dev.txt")
.add_local_file("requirements-docs.txt", remote_path="/root/hivemind/requirements-docs.txt")
.add_local_file("setup.py", remote_path="/root/hivemind/setup.py")
.add_local_file("pyproject.toml", remote_path="/root/hivemind/pyproject.toml")
.add_local_dir("tests", remote_path="/root/hivemind/tests")
)

app = modal.App("hivemind-ci", image=image)

codecov_secret = modal.Secret.from_dict({"CODECOV_TOKEN": os.getenv("CODECOV_TOKEN")})


def setup_environment():
os.chdir("/root/hivemind")

subprocess.run(["pip", "install", "."], check=True)

environment = os.environ.copy()
environment["HIVEMIND_MEMORY_SHARING_STRATEGY"] = "file_descriptor"
environment["HIVEMIND_DHT_NUM_WORKERS"] = "1"

subprocess.run(
["prlimit", f"--pid={os.getpid()}", "--nofile=8192"],
check=True,
)
return environment


@app.function(image=image, timeout=600, cpu=4, memory=8192)
def run_tests():
environment = setup_environment()

subprocess.run(
["pytest", "--durations=0", "--durations-min=1.0", "-v", "-n", "4", "--dist", "worksteal", "tests"],
check=True,
env=environment,
)


@app.function(image=image, timeout=600, cpu=4, memory=8192, secrets=[codecov_secret])
def run_codecov():
environment = setup_environment()

subprocess.run(
[
"pytest",
"--cov",
"hivemind",
"--cov-config=pyproject.toml",
"-v",
"-n",
"4",
"--dist",
"worksteal",
"tests",
],
check=True,
env=environment,
)

environment["CODECOV_TOKEN"] = os.environ["CODECOV_TOKEN"]

subprocess.run(
["bash", "-c", "curl -Os https://uploader.codecov.io/latest/linux/codecov && chmod +x codecov && ./codecov"],
check=True,
env=environment,
)
1 change: 1 addition & 0 deletions requirements-dev.txt
Original file line number Diff line number Diff line change
Expand Up @@ -9,3 +9,4 @@ black==22.3.0
isort==5.10.1
codespell==2.2.2
psutil
pytest-xdist
2 changes: 1 addition & 1 deletion requirements.txt
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,7 @@ msgpack>=0.5.6
sortedcontainers
uvloop>=0.14.0
grpcio-tools>=1.33.2
protobuf>=3.12.2,<5.28.0
protobuf>=5.29.0
configargparse>=1.2.3
py-multihash>=0.2.3
multiaddr @ git+https://github.com/multiformats/py-multiaddr.git@e01dbd38f2c0464c0f78b556691d655265018cce
Expand Down
2 changes: 1 addition & 1 deletion setup.py
Original file line number Diff line number Diff line change
Expand Up @@ -156,7 +156,7 @@ def run(self):
with open("requirements-docs.txt") as docs_requirements_file:
extras["docs"] = list(map(str, parse_requirements(docs_requirements_file)))

extras["bitsandbytes"] = ["bitsandbytes~=0.41.1"]
extras["bitsandbytes"] = ["bitsandbytes~=0.45.2"]

extras["all"] = extras["dev"] + extras["docs"] + extras["bitsandbytes"]

Expand Down
6 changes: 3 additions & 3 deletions tests/conftest.py
Original file line number Diff line number Diff line change
Expand Up @@ -37,14 +37,14 @@ def cleanup_children():

gc.collect() # Call .__del__() for removed objects

MPFuture.reset_backend()

children = psutil.Process().children(recursive=True)
if children:
gone, alive = psutil.wait_procs(children, timeout=0.1)
gone, alive = psutil.wait_procs(children, timeout=1)
logger.debug(f"Cleaning up {len(alive)} leftover child processes")
for child in alive:
child.terminate()
gone, alive = psutil.wait_procs(alive, timeout=1)
for child in alive:
child.kill()

MPFuture.reset_backend()
1 change: 1 addition & 0 deletions tests/test_allreduce.py
Original file line number Diff line number Diff line change
Expand Up @@ -172,6 +172,7 @@ async def send_tensors(sender_index: int):
)
@pytest.mark.forked
@pytest.mark.asyncio
@pytest.mark.skip
async def test_allreduce_protocol(peer_modes, averaging_weights, peer_fractions, part_size_bytes):
"""Run group allreduce protocol manually without grpc, see if the internal logic is working as intended"""

Expand Down
Loading
Loading