Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pyfakefs/tests/patched_packages_test.py::TestPatchedPackages::test_read_{csv,table} fail if zstandard is used with CFFI backend, on Python 3.12 #910

Closed
mgorny opened this issue Nov 19, 2023 · 21 comments · Fixed by #911
Labels

Comments

@mgorny
Copy link
Contributor

mgorny commented Nov 19, 2023

Describe the bug
When the zstandard package is using the CFFI backend, the two following tests fail on Python 3.12:

========================================================= test session starts =========================================================
platform linux -- Python 3.12.0, pytest-7.4.3, pluggy-1.3.0
rootdir: /tmp/pyfakefs
plugins: pyfakefs-5.4.dev0
collected 4 items                                                                                                                     

pyfakefs/tests/patched_packages_test.py F.F.                                                                                    [100%]

============================================================== FAILURES ===============================================================
__________________________________________________ TestPatchedPackages.test_read_csv __________________________________________________

self = <pyfakefs.tests.patched_packages_test.TestPatchedPackages testMethod=test_read_csv>

    def test_read_csv(self):
        path = "/foo/bar.csv"
        self.fs.create_file(path, contents="1,2,3,4")
>       df = pd.read_csv(path)

pyfakefs/tests/patched_packages_test.py:52: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
.tox/py312/lib/python3.12/site-packages/pandas/io/parsers/readers.py:948: in read_csv
    return _read(filepath_or_buffer, kwds)
.tox/py312/lib/python3.12/site-packages/pandas/io/parsers/readers.py:611: in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
.tox/py312/lib/python3.12/site-packages/pandas/io/parsers/readers.py:1448: in __init__
    self._engine = self._make_engine(f, self.engine)
.tox/py312/lib/python3.12/site-packages/pandas/io/parsers/readers.py:1705: in _make_engine
    self.handles = get_handle(
.tox/py312/lib/python3.12/site-packages/pandas/io/common.py:709: in get_handle
    if _is_binary_mode(path_or_buf, mode) and "b" not in mode:
.tox/py312/lib/python3.12/site-packages/pandas/io/common.py:1171: in _is_binary_mode
    return isinstance(handle, _get_binary_io_classes()) or "b" in getattr(
.tox/py312/lib/python3.12/site-packages/pandas/io/common.py:1186: in _get_binary_io_classes
    zstd = import_optional_dependency("zstandard")
.tox/py312/lib/python3.12/site-packages/pandas/compat/_optional.py:132: in import_optional_dependency
    module = importlib.import_module(name)
/usr/lib/python3.12/importlib/__init__.py:90: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
<frozen importlib._bootstrap>:1381: in _gcd_import
    ???
<frozen importlib._bootstrap>:1354: in _find_and_load
    ???
<frozen importlib._bootstrap>:1325: in _find_and_load_unlocked
    ???
<frozen importlib._bootstrap>:929: in _load_unlocked
    ???
<frozen importlib._bootstrap_external>:994: in exec_module
    ???
<frozen importlib._bootstrap>:488: in _call_with_frames_removed
    ???
.tox/py312/lib/python3.12/site-packages/zstandard/__init__.py:73: in <module>
    from .backend_cffi import *
.tox/py312/lib/python3.12/site-packages/zstandard/backend_cffi.py:89: in <module>
    from ._cffi import (  # type: ignore
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <pyfakefs.fake_io.FakeIoModule object at 0x7feba17ec7d0>, name = '_IOBase'

    def __getattr__(self, name):
        """Forwards any unfaked calls to the standard io module."""
>       return getattr(self._io_module, name)
E       AttributeError: module 'io' has no attribute '_IOBase'. Did you mean: 'IOBase'?

pyfakefs/fake_io.py:150: AttributeError
_________________________________________________ TestPatchedPackages.test_read_table _________________________________________________

self = <pyfakefs.tests.patched_packages_test.TestPatchedPackages testMethod=test_read_table>

    def test_read_table(self):
        path = "/foo/bar.csv"
        self.fs.create_file(path, contents="1|2|3|4")
>       df = pd.read_table(path, delimiter="|")

pyfakefs/tests/patched_packages_test.py:58: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
.tox/py312/lib/python3.12/site-packages/pandas/io/parsers/readers.py:1282: in read_table
    return _read(filepath_or_buffer, kwds)
.tox/py312/lib/python3.12/site-packages/pandas/io/parsers/readers.py:611: in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
.tox/py312/lib/python3.12/site-packages/pandas/io/parsers/readers.py:1448: in __init__
    self._engine = self._make_engine(f, self.engine)
.tox/py312/lib/python3.12/site-packages/pandas/io/parsers/readers.py:1705: in _make_engine
    self.handles = get_handle(
.tox/py312/lib/python3.12/site-packages/pandas/io/common.py:709: in get_handle
    if _is_binary_mode(path_or_buf, mode) and "b" not in mode:
.tox/py312/lib/python3.12/site-packages/pandas/io/common.py:1171: in _is_binary_mode
    return isinstance(handle, _get_binary_io_classes()) or "b" in getattr(
.tox/py312/lib/python3.12/site-packages/pandas/io/common.py:1186: in _get_binary_io_classes
    zstd = import_optional_dependency("zstandard")
.tox/py312/lib/python3.12/site-packages/pandas/compat/_optional.py:132: in import_optional_dependency
    module = importlib.import_module(name)
/usr/lib/python3.12/importlib/__init__.py:90: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
<frozen importlib._bootstrap>:1381: in _gcd_import
    ???
<frozen importlib._bootstrap>:1354: in _find_and_load
    ???
<frozen importlib._bootstrap>:1325: in _find_and_load_unlocked
    ???
<frozen importlib._bootstrap>:929: in _load_unlocked
    ???
<frozen importlib._bootstrap_external>:994: in exec_module
    ???
<frozen importlib._bootstrap>:488: in _call_with_frames_removed
    ???
.tox/py312/lib/python3.12/site-packages/zstandard/__init__.py:73: in <module>
    from .backend_cffi import *
.tox/py312/lib/python3.12/site-packages/zstandard/backend_cffi.py:89: in <module>
    from ._cffi import (  # type: ignore
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <pyfakefs.fake_io.FakeIoModule object at 0x7feba16eeed0>, name = '_IOBase'

    def __getattr__(self, name):
        """Forwards any unfaked calls to the standard io module."""
>       return getattr(self._io_module, name)
E       AttributeError: module 'io' has no attribute '_IOBase'. Did you mean: 'IOBase'?

pyfakefs/fake_io.py:150: AttributeError
========================================================== warnings summary ===========================================================
.tox/py312/lib/python3.12/site-packages/dateutil/tz/tz.py:37
  /tmp/pyfakefs/.tox/py312/lib/python3.12/site-packages/dateutil/tz/tz.py:37: DeprecationWarning: datetime.datetime.utcfromtimestamp() is deprecated and scheduled for removal in a future version. Use timezone-aware objects to represent datetimes in UTC: datetime.datetime.fromtimestamp(timestamp, datetime.UTC).
    EPOCH = datetime.datetime.utcfromtimestamp(0)

pyfakefs/tests/patched_packages_test.py::TestPatchedPackages::test_read_excel
pyfakefs/tests/patched_packages_test.py::TestPatchedPackages::test_read_excel
pyfakefs/tests/patched_packages_test.py::TestPatchedPackages::test_write_excel
pyfakefs/tests/patched_packages_test.py::TestPatchedPackages::test_write_excel
pyfakefs/tests/patched_packages_test.py::TestPatchedPackages::test_write_excel
  /tmp/pyfakefs/.tox/py312/lib/python3.12/site-packages/openpyxl/packaging/core.py:99: DeprecationWarning: datetime.datetime.utcnow() is deprecated and scheduled for removal in a future version. Use timezone-aware objects to represent datetimes in UTC: datetime.datetime.now(datetime.UTC).
    now = datetime.datetime.utcnow()

pyfakefs/tests/patched_packages_test.py::TestPatchedPackages::test_write_excel
  /tmp/pyfakefs/.tox/py312/lib/python3.12/site-packages/openpyxl/writer/excel.py:292: DeprecationWarning: datetime.datetime.utcnow() is deprecated and scheduled for removal in a future version. Use timezone-aware objects to represent datetimes in UTC: datetime.datetime.now(datetime.UTC).
    workbook.properties.modified = datetime.datetime.utcnow()

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
======================================================= short test summary info =======================================================
FAILED pyfakefs/tests/patched_packages_test.py::TestPatchedPackages::test_read_csv - AttributeError: module 'io' has no attribute '_IOBase'. Did you mean: 'IOBase'?
FAILED pyfakefs/tests/patched_packages_test.py::TestPatchedPackages::test_read_table - AttributeError: module 'io' has no attribute '_IOBase'. Did you mean: 'IOBase'?
=============================================== 2 failed, 2 passed, 7 warnings in 1.58s ===============================================

How To Reproduce

In a Python 3.12 venv:

pip install pytest pandas zstandard cffi
export PYTHON_ZSTANDARD_IMPORT_POLICY=cffi
pytest pyfakefs/tests/patched_packages_test.py::TestPatchedPackages

Your environment

$ python -c "import platform; print(platform.platform())"
Linux-6.6.1-gentoo-dist-x86_64-AMD_Ryzen_5_3600_6-Core_Processor-with-glibc2.38
$ python -c "import sys; print('Python', sys.version)"
Python 3.12.0 (main, Oct 25 2023, 07:20:32) [GCC 13.2.1 20231014]
$ python -c "from pyfakefs import __version__; print('pyfakefs', __version__)"
pyfakefs 5.4.dev0
$ python -c "import pytest; print('pytest', pytest.__version__)"
pytest 7.4.3

Reproduced on 8c7a99c.

@mrbean-bremen
Copy link
Member

Thanks for the report! Did you test this also with other Python versions, or only with 3.12? I assume that pandas and openpyxl had the version as defined in extra_requirements.txt, e.g. 2.1.3 and 3.1.2, is this correct?

@mgorny
Copy link
Contributor Author

mgorny commented Nov 19, 2023

Thanks for the report! Did you test this also with other Python versions, or only with 3.12?

I did with 3.10, 3.11 and PyPy3.10. The issue seems to be specific to 3.12.

I assume that pandas and openpyxl had the version as defined in extra_requirements.txt, e.g. 2.1.3 and 3.1.2, is this correct?

Yes, this is the versions I've originally used. The "reproducer" above doesn't install openpyxl at all, the issue happens regardless of whether it's installed or not.

I'm pretty sure it's somehow related to CFFI — site-packages/_cffi_backend.cpython-312-x86_64-linux-gnu.so is the only file featuring the string _IOBase.

@mgorny
Copy link
Contributor Author

mgorny commented Nov 19, 2023

FWICS cffi expects to be able to import _io._IOBase here:

https://github.com/python-cffi/cffi/blob/49127c6929bfc7186fbfd3819dd5e058ad888de4/src/c/file_emulator.h#L6-L17

It seems that for some reason pyfakefs intercepts that.

@mgorny
Copy link
Contributor Author

mgorny commented Nov 19, 2023

Oh, that's probably because pyfakefs overrides _io in 3.12 but not earlier versions:

if IS_PYPY or sys.version_info >= (3, 12):
# in PyPy and later cpython versions, the module is referenced as _io
self._fake_module_classes["_io"] = fake_io.FakeIoModule

@mgorny
Copy link
Contributor Author

mgorny commented Nov 19, 2023

FWICS all Base classes are prefixed with _ in the _io module, so pyfakefs probably needs to mirror that:

>>> _io._
_io._BufferedIOBase()  _io._BytesIOBuffer()   _io._IOBase()          _io._RawIOBase()       _io._TextIOBase()

@mrbean-bremen
Copy link
Member

Ah ok - thank you, that makes sense! Is there any possibility to reproduce this in a docker image which uses cffi? I don't know cffi (or gentoo, for that matter), so I'm not sure how I would test that...

@mrbean-bremen
Copy link
Member

Ok, thinking about this I probably should just use a separate fake wrapper class for _io, not the same as for io, as this currently probably leads to it looking up _IOBase in io instead of _io. I had the impression that _io is just an alias for io, but obviously this is not completely true.

@mrbean-bremen
Copy link
Member

I will put something together a bit later today.

@mgorny
Copy link
Contributor Author

mgorny commented Nov 19, 2023

Ah ok - thank you, that makes sense! Is there any possibility to reproduce this in a docker image which uses cffi? I don't know cffi (or gentoo, for that matter), so I'm not sure how I would test that...

I think so. The reproducer I gave above should work in Python3.12 venv anywhere.

@mrbean-bremen
Copy link
Member

Yes, sorry - I had misread the issue (for some reason thought that a system package was involved).
It is even reproducible under Windows.

mrbean-bremen added a commit to mrbean-bremen/pyfakefs that referenced this issue Nov 19, 2023
- in Python 3.12, _io is not a simple alias for io
- fixes pytest-dev#910
mrbean-bremen added a commit to mrbean-bremen/pyfakefs that referenced this issue Nov 19, 2023
- in Python 3.12, _io is not a simple alias for io
- fixes pytest-dev#910
mrbean-bremen added a commit that referenced this issue Nov 19, 2023
- in Python 3.12, _io is not a simple alias for io
- fixes #910
@mrbean-bremen
Copy link
Member

@mgorny - shall be fixed in main now, please check!
Do you need a patch release?

@mgorny
Copy link
Contributor Author

mgorny commented Nov 19, 2023

@mgorny - shall be fixed in main now, please check!

It's green now. Thanks!

Do you need a patch release?

I don't think there's a need for an urgent release. I don't think this problem is likely to affect many people in the wild, and I've just deselected the two tests from the current version of Gentoo ebuild.

@mrbean-bremen
Copy link
Member

Ok, in this case I'll wait until something else comes up, thanks!

@mrbean-bremen
Copy link
Member

FYI: A new patch release is now out.

@mgorny
Copy link
Contributor Author

mgorny commented Dec 1, 2023

Thanks for the ping! I've already added the new version to Gentoo, and ofc I forgot to reenable the test ;-).

@sodul
Copy link

sodul commented Dec 2, 2023

I do not know if it is related but our internal CI is getting OOM killed apparently on a unittest that uses pyfakefs. The memory goes from 1.6GB usage with python 3.11.6 and pyfakefs 5.3.1 to 2.5GB (and OOM killed) with python 3.12.0 and pyfakefs 5.3.2. The container runs on EKS and is based off Ubuntu 22.04.

I'm testing various variations to identify the exact test that triggers the memory leak and see if the problem is with the pyfakefs version.

@mrbean-bremen
Copy link
Member

If it is indeed related to pyfakefs, it should have to do with Python 3.12. There are a few changes in patching (mostly pathlib and io) specifically for Python 3.12, though I have no idea how these could be cause that increased memory usage. It is unlikely that the change appeared in pyfakefs 5.3.2 (this can easilly be verified by using 5.3.1 with Python 3.12).
It could also be that the change has to do with Python 3.12 unrelated to pyfakefs, though I don't know how best to test this...

@mrbean-bremen
Copy link
Member

@sodul - any luck with finding the cause of the problem or a reproducible example ? If yes, please write a new issue for this.

@sodul
Copy link

sodul commented Dec 20, 2023

@mrbean-bremen I could not get a true root cause unfortunately.

One of the issues was that we use EKS (Kubernetes in AWS) and our running nodes switched from cgroup v1 to v2 which meant that we ended up with much larger parallelism than before since v2 does not expose the CPU limits of the container. We hardcoded the parallelism to a reasonable amount, but the memory is still larger than before, but sufficiently under control to no longer get OOM killed.

@mrbean-bremen
Copy link
Member

One thing I could think of is if you are mapping large files into the fake system and reading them. They will be read into memory at first access and stay there, as long as the fake fs instance lives. Not sure if you are doing something like this, but I can't think of anything else pyfakefs-related that would use noticable amounts of RAM.

@sodul
Copy link

sodul commented Dec 29, 2023

@mrbean-bremen Unfortunately I do not really have much more details. It is possible that some of the files we read are somewhat large, we do have a folder with under 800kB of yaml data loaded by some legacy code we know to be scaling inefficiently, but it hardly explains the increase in memory usage we have noticed. If we do get more specific details we'll make sure to post them here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants