neptune.ai logger produces lots of errors when logging "training/epoch" #19679

Open
simon-ging opened this issue Mar 20, 2024 · 1 comment
Labels: bug (Something isn't working), help wanted (Open to be worked on), logger: neptune

@simon-ging

### Bug description

The Neptune logger emits many errors like "[neptune] [error ] Error occurred during asynchronous operation processing: X-coordinates (step) must be strictly increasing for series attribute: training/epoch. Invalid point: 34.0"

These appear to be false positives: the "training/epoch" curve in the Neptune UI looks fine.

Similar to #2946.

### What version are you seeing the problem on?

v2.2

### How to reproduce the bug

Set the NEPTUNE_API_TOKEN and NEPTUNE_PROJECT environment variables first so that the logger can connect to neptune.ai.
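A quick pre-flight check can catch a missing variable before the training run starts. The sketch below only inspects the environment; the helper name `missing_neptune_vars` is made up for illustration and is not part of Neptune or Lightning.

```python
import os

# The two variables named in the reproduction instructions.
REQUIRED_VARS = ("NEPTUNE_API_TOKEN", "NEPTUNE_PROJECT")


def missing_neptune_vars(environ=None):
    """Return the required variable names that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not environ.get(name)]


missing = missing_neptune_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Passing a plain dict instead of `os.environ` makes the helper easy to try out without touching the real environment.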


```python
import os

import lightning as lit
import torch
from lightning.pytorch.loggers import NeptuneLogger
from torch.utils.data import Dataset, DataLoader


class DummyDataset(Dataset):
    """100 random 3x16x16 images with random labels in [0, 100)."""

    def __init__(self):
        pass

    def __len__(self):
        return 100

    def __getitem__(self, item):
        return {"image": torch.rand(3, 16, 16), "label": torch.randint(0, 100, (1,))}


class DummyModel(lit.LightningModule):
    def __init__(self):
        super().__init__()
        self.model = torch.nn.Linear(3 * 16 * 16, 100)
        self.epoch_identifier = "dummy"

    def forward(self, x):
        return self.model(x)

    def training_step(self, batch, batch_idx):
        x, y = batch["image"], batch["label"]
        x = x.view(x.size(0), -1)  # flatten images to (batch, 768)
        y = y.view(-1)  # labels from (batch, 1) to (batch,)
        logits = self.model(x)
        loss = torch.nn.functional.cross_entropy(logits, y)
        # Logged both per step and per epoch
        self.log("train_loss", loss, on_step=True, on_epoch=True, prog_bar=True, logger=True)
        return loss

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        return optimizer

    def validation_step(self, batch, batch_idx):
        x, y = batch["image"], batch["label"]
        x = x.view(x.size(0), -1)
        y = y.view(-1)
        logits = self.model(x)
        loss = torch.nn.functional.cross_entropy(logits, y)
        # Also logged both per step and per epoch
        self.log("val_loss", loss, on_step=True, on_epoch=True, prog_bar=True, logger=True)

    def test_step(self, batch, batch_idx):
        return self.validation_step(batch, batch_idx)


def main():
    # Reads NEPTUNE_API_TOKEN and NEPTUNE_PROJECT from the environment
    wlogger = NeptuneLogger(log_model_checkpoints=False)
    output_dir = "temp_lit"
    os.makedirs(output_dir, exist_ok=True)

    trainer = lit.Trainer(
        devices=1,
        default_root_dir=output_dir,
        logger=wlogger,
        max_epochs=5,
        enable_progress_bar=False,
        log_every_n_steps=5,
    )
    model = DummyModel()
    dataset = DummyDataset()
    # The same dataset serves as both training and validation data
    train_loader = DataLoader(dataset, batch_size=16, num_workers=4)
    val_loader = DataLoader(dataset, batch_size=16, num_workers=4)
    trainer.fit(model, train_loader, val_loader)


if __name__ == "__main__":
    main()
```


### Error messages and logs

```
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
[neptune] [info ] Neptune initialized. Open in the app: https://app.neptune.ai/ [...]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]

| Name | Type | Params

0 | model | Linear | 76.9 K

76.9 K Trainable params
0 Non-trainable params
76.9 K Total params
0.308 Total estimated model params size (MB)
[neptune] [error ] Error occurred during asynchronous operation processing: X-coordinates (step) must be strictly increasing for series attribute: training/epoch. Invalid point: 6.0
[neptune] [error ] Error occurred during asynchronous operation processing: X-coordinates (step) must be strictly increasing for series attribute: training/epoch. Invalid point: 13.0
[neptune] [error ] Error occurred during asynchronous operation processing: X-coordinates (step) must be strictly increasing for series attribute: training/epoch. Invalid point: 20.0
[neptune] [error ] Error occurred during asynchronous operation processing: X-coordinates (step) must be strictly increasing for series attribute: training/epoch. Invalid point: 27.0
Trainer.fit stopped: max_epochs=5 reached.
[neptune] [info ] Shutting down background jobs, please wait a moment...
[neptune] [info ] Done!
[neptune] [info ] Waiting for the remaining 17 operations to synchronize with Neptune. Do not kill this process.
[neptune] [error ] Error occurred during asynchronous operation processing: X-coordinates (step) must be strictly increasing for series attribute: training/epoch. Invalid point: 34.0
[neptune] [error ] Error occurred during asynchronous operation processing: X-coordinates (step) must be strictly increasing for series attribute: training/epoch. Invalid point: 34.0
[neptune] [info ] All 17 operations synced, thanks for waiting!
[neptune] [info ] Explore the metadata in the Neptune app: https://app.neptune.ai/ [...]
```
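The invalid points reported in the log (6, 13, 20, 27, 34) look like the zero-based index of the last training batch of each epoch. With 100 samples and a batch size of 16 there are 7 batches per epoch, so the last batch of epoch k (counting from 1) has index 7k - 1. The snippet below only reproduces that arithmetic from the reproduction script's settings; that the logger uses this index as its step at epoch end is an inference from the log, not something confirmed here, and `last_step_per_epoch` is a hypothetical helper.

```python
import math


def last_step_per_epoch(num_samples, batch_size, max_epochs):
    """Zero-based global index of the final training batch in each epoch."""
    batches_per_epoch = math.ceil(num_samples / batch_size)
    return [batches_per_epoch * (epoch + 1) - 1 for epoch in range(max_epochs)]


# Settings taken from the reproduction script: 100 samples, batch_size=16, max_epochs=5
print(last_step_per_epoch(100, 16, 5))  # [6, 13, 20, 27, 34]
```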



### Environment

<details>
  <summary>Current environment</summary>

* CUDA:
	- GPU:
		- NVIDIA GeForce GTX 1070 Ti
		- Quadro P400
	- available:         True
	- version:           12.1
* Lightning:
	- lightning:         2.2.1
	- lightning-utilities: 0.11.0
	- pytorch-lightning: 2.2.1
	- torch:             2.2.1
	- torchaudio:        2.2.1
	- torchmetrics:      1.3.2
	- torchvision:       0.17.1
* Packages:
	- aiohttp:           3.9.3
	- aiosignal:         1.3.1
	- arrow:             1.3.0
	- async-timeout:     4.0.3
	- attrs:             23.2.0
	- boto3:             1.34.66
	- botocore:          1.34.66
	- bravado:           11.0.3
	- bravado-core:      6.1.1
	- brotli:            1.0.9
	- certifi:           2024.2.2
	- charset-normalizer: 2.0.4
	- click:             8.1.7
	- filelock:          3.13.1
	- fqdn:              1.5.1
	- frozenlist:        1.4.1
	- fsspec:            2024.3.1
	- future:            1.0.0
	- gitdb:             4.0.11
	- gitpython:         3.1.42
	- gmpy2:             2.1.2
	- idna:              3.4
	- isoduration:       20.11.0
	- jinja2:            3.1.3
	- jmespath:          1.0.1
	- jsonpointer:       2.4
	- jsonref:           1.1.0
	- jsonschema:        4.21.1
	- jsonschema-specifications: 2023.12.1
	- lightning:         2.2.1
	- lightning-utilities: 0.11.0
	- markupsafe:        2.1.3
	- mkl-fft:           1.3.8
	- mkl-random:        1.2.4
	- mkl-service:       2.4.0
	- monotonic:         1.6
	- mpmath:            1.3.0
	- msgpack:           1.0.8
	- multidict:         6.0.5
	- neptune:           1.9.1
	- networkx:          3.1
	- numpy:             1.26.4
	- oauthlib:          3.2.2
	- packaging:         24.0
	- pandas:            2.2.1
	- pillow:            10.2.0
	- pip:               23.3.1
	- psutil:            5.9.8
	- pyjwt:             2.8.0
	- pysocks:           1.7.1
	- python-dateutil:   2.9.0.post0
	- pytorch-lightning: 2.2.1
	- pytz:              2024.1
	- pyyaml:            6.0.1
	- referencing:       0.34.0
	- requests:          2.31.0
	- requests-oauthlib: 1.4.0
	- rfc3339-validator: 0.1.4
	- rfc3986-validator: 0.1.1
	- rpds-py:           0.18.0
	- s3transfer:        0.10.1
	- setuptools:        68.2.2
	- simplejson:        3.19.2
	- six:               1.16.0
	- smmap:             5.0.1
	- swagger-spec-validator: 3.0.3
	- sympy:             1.12
	- torch:             2.2.1
	- torchaudio:        2.2.1
	- torchmetrics:      1.3.2
	- torchvision:       0.17.1
	- tqdm:              4.66.2
	- triton:            2.2.0
	- types-python-dateutil: 2.9.0.20240316
	- typing-extensions: 4.9.0
	- tzdata:            2024.1
	- uri-template:      1.3.0
	- urllib3:           2.1.0
	- webcolors:         1.13
	- websocket-client:  1.7.0
	- wheel:             0.41.2
	- yarl:              1.9.4
* System:
	- OS:                Linux
	- architecture:
		- 64bit
		- ELF
	- processor:         x86_64
	- python:            3.10.13
	- release:           5.4.0-172-generic
	- version:           #190-Ubuntu SMP Fri Feb 2 23:24:22 UTC 2024

</details>

### More info

_No response_
@simon-ging added the labels bug (Something isn't working) and needs triage (Waiting to be triaged by maintainers) on Mar 20, 2024.
@awaelchli added the labels logger: neptune and help wanted (Open to be worked on), and removed the label needs triage (Waiting to be triaged by maintainers), on Mar 23, 2024.
@SiddhantSadangi
Contributor

You can follow updates to this issue here: neptune-ai/neptune-client#1702

Be assured, though, that this "error" does not lead to data loss. It is caused by the training and validation loops both trying to log the epoch number to the same namespace. Since the value is already logged, and Neptune expects the "step" of a series to be strictly increasing, the duplicate epoch value at the same step is dropped.
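The behaviour described above can be illustrated with a toy series that accepts a point only when its step is greater than the last accepted step. This is a simplified stand-in written for this thread, not Neptune's implementation; the class name `ToySeries` and the logged values are hypothetical.

```python
class ToySeries:
    """Toy series that only accepts strictly increasing steps."""

    def __init__(self, name):
        self.name = name
        self.points = []    # accepted (step, value) pairs
        self.rejected = []  # points dropped because the step did not increase

    def append(self, value, step):
        if self.points and step <= self.points[-1][0]:
            self.rejected.append((step, value))
            return False
        self.points.append((step, value))
        return True


series = ToySeries("training/epoch")
# Hypothetical sequence: epoch 0 is logged three times at the same step 6.
for _ in range(3):
    series.append(0, step=6)
series.append(1, step=13)

print(series.points)    # [(6, 0), (13, 1)]
print(series.rejected)  # [(6, 0), (6, 0)]
```

The first write at step 6 is kept and the repeats are rejected, which is consistent with the curve in the UI looking complete while errors are still reported.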

More details about this error here: https://docs.neptune.ai/help/error_step_must_be_increasing/
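If the messages are only noise for a given run, one possible workaround is a standard-library logging filter that drops them. The sketch below assumes the client emits these records through a logger named "neptune", which the `[neptune]` prefix in the output suggests but which is not verified here. A filter attached to a logger only sees records logged directly on that logger, so it may need to be attached to a handler or a more specific logger instead. `DropStepOrderErrors` is a hypothetical name.

```python
import logging

# Fragment of the error text quoted in this issue.
STEP_ERROR_MARKER = "X-coordinates (step) must be strictly increasing"


class DropStepOrderErrors(logging.Filter):
    """Suppress step-order errors for one series; let everything else through."""

    def __init__(self, series_name="training/epoch"):
        super().__init__()
        self.series_name = series_name

    def filter(self, record):
        message = record.getMessage()
        is_noise = STEP_ERROR_MARKER in message and self.series_name in message
        return not is_noise


# Assumption: the Neptune client logs through the logger named "neptune".
logging.getLogger("neptune").addFilter(DropStepOrderErrors())
```

Restricting the filter to a single series name keeps genuine step-order errors on other series visible.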
