
free() invalid pointer #3777

Closed
Slyne opened this issue Jan 7, 2022 · 23 comments
Labels
bug (Something isn't working), investigating (The development team is investigating this issue)

Comments

@Slyne

Slyne commented Jan 7, 2022

Description
When I shut down Triton Inference Server, the following line appears:
(screenshot attachment showing the free(): invalid pointer message)

Triton Information
What version of Triton are you using? 21.12

Are you using the Triton container or did you build it yourself?
Here's the dockerfile:

FROM nvcr.io/nvidia/tritonserver:21.12-py3
LABEL maintainer="NVIDIA"
LABEL repository="tritonserver"

RUN apt-get update && apt-get install -y swig python3-dev cmake
RUN pip3 install torch==1.10.1+cu113 torchvision==0.11.2+cu113 torchaudio==0.10.1+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html
RUN pip3 install -v kaldifeat

Here's the model.py.

import kaldifeat

class TritonPythonModel:

    def initialize(self, args):
        pass

    def execute(self, requests):
        pass

    def finalize(self):
        """`finalize` is called only once when the model is being unloaded.
        Implementing `finalize` function is OPTIONAL. This function allows
        the model to perform any necessary clean ups before exit.
        """
        print('Cleaning up...')

config.pbtxt

name: "model"
backend: "python"
max_batch_size: 64

input [
  {
    name: "wav"
    data_type: TYPE_FP32
    dims: [-1]
  },
  {
    name: "wav_lens"
    data_type: TYPE_INT32
    dims: [1]
  }
]

output [
  {
    name: "speech"
    data_type: TYPE_FP16
    dims: [-1, 80]  # 80
  },
  {
    name: "speech_lengths"
    data_type: TYPE_INT32
    dims: [1]
  }
]

dynamic_batching {
    preferred_batch_size: [ 16, 32 ]
  }
instance_group [
    {
      count: 1
      kind: KIND_GPU
    }
]

To Reproduce

  1. Build a Docker image from the Dockerfile above.
  2. Start tritonserver with a model repository containing the model.py and config.pbtxt above.
  3. Shut down Triton with Ctrl-C.

Expected behavior
Expect no such line.

I tested on two different machines. Both print this line (error? warning?). One does not generate a core file, and the other does.

@jbkyang-nvi
Contributor

Hey, I see you have just used `pass` in the execute step. The user must return a list of InferResponse objects of exactly the same size as requests.
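For reference, a minimal sketch of what such an execute could look like against the config.pbtxt above; the zero-filled outputs are only placeholders standing in for real feature extraction:

import numpy as np
import triton_python_backend_utils as pb_utils

class TritonPythonModel:

    def execute(self, requests):
        responses = []
        for request in requests:
            # Batched input declared in config.pbtxt: shape [batch, -1]
            wav = pb_utils.get_input_tensor_by_name(request, "wav").as_numpy()
            batch = wav.shape[0]
            # Placeholder outputs matching the declared dtypes and shapes
            speech = pb_utils.Tensor("speech", np.zeros((batch, 1, 80), dtype=np.float16))
            speech_lengths = pb_utils.Tensor("speech_lengths", np.ones((batch, 1), dtype=np.int32))
            responses.append(pb_utils.InferenceResponse(output_tensors=[speech, speech_lengths]))
        # One InferenceResponse per request, in the same order
        return responses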

@Slyne
Author

Slyne commented Jan 9, 2022

Hey, I see you have just used `pass` in the execute step. The user must return a list of InferResponse objects of exactly the same size as requests.

@jbkyang-nvi This is the minimal code to reproduce this issue. I removed the body of the execute function to simplify it; just add the config file. Can you reproduce this issue?

@jbkyang-nvi
Contributor

Hey, I see you have just used `pass` in the execute step. The user must return a list of InferResponse objects of exactly the same size as requests.

@jbkyang-nvi This is the minimal code to reproduce this issue. I removed the body of the execute function to simplify it; just add the config file. Can you reproduce this issue?

Can you try adding the InferResponse to your minimal example and see if the free() invalid pointer issue persists?

@Slyne
Author

Slyne commented Jan 11, 2022

Hey, I see you have just used `pass` in the execute step. The user must return a list of InferResponse objects of exactly the same size as requests.

@jbkyang-nvi This is the minimal code to reproduce this issue. I removed the body of the execute function to simplify it; just add the config file. Can you reproduce this issue?

Can you try adding the InferResponse to your minimal example and see if the free() invalid pointer issue persists?

Yes, I can still see the issue. In fact, the original code is quite long; that's why I deleted all of it.

@deadeyegoodwin
Contributor

Can you run Triton with valgrind? That should produce a stack trace showing exactly where the invalid pointer error is occurring.
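For example, a minimal sketch of such a run (the model repository path is a placeholder):

# Placeholder path; point this at the model repository that reproduces the issue.
valgrind --leak-check=full --log-file=valgrind.log \
    tritonserver --model-repository=/path/to/model_repo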

@deadeyegoodwin
Contributor

Because you are not even sending any requests to the server, and because your model is stripped down, I would assume that you don't need the dockerfile build steps. Do you see the same failure even if you run with the NGC Tritonserver image? We will try your steps and see if we can reproduce.

@tanmayv25
Contributor

tanmayv25 commented Jan 25, 2022

@Slyne I was able to reproduce the bug with your steps. The most interesting thing is that if I remove import kaldifeat from the Python model.py, then I don't see this error. This appears to be something wrong with unloading the kaldifeat module.
We will investigate further and let you know.

@tanmayv25 tanmayv25 added the investigating (The development team is investigating this issue) and bug (Something isn't working) labels on Jan 25, 2022
@Slyne
Author

Slyne commented Jan 26, 2022

@Slyne I was able to reproduce the bug with your steps. The most interesting thing is that if I remove import kaldifeat from the Python model.py, then I don't see this error. This appears to be something wrong with unloading the kaldifeat module. We will investigate further and let you know.

Maybe. I don't know how to read the valgrind log and didn't find any keywords related to kaldifeat.
Thank you again!
During inference (after completing the initialize, execute, and finalize parts), everything seems fine, and I can also run perf_analyzer against the model without seeing any errors. This issue only occurs when shutting down Triton.

@tanmayv25
Contributor

I am certain that the issue is specific to kaldifeat and its use within Triton's Python infrastructure. I have tried various other Python modules and they all work perfectly fine. Looking more closely at the kaldifeat module, we can see that it compiles shared object files.
Looking at the valgrind logs, the invalid reads occur when Triton is trying to unload the model instance.


==21418== Thread 32:
==21418== Invalid read of size 8
==21418== at 0x4CCA20F: nvidia::inferenceserver::RateLimiter::ModelInstanceContext::RequestRemoval() (in\
/opt/tritonserver/lib/libtritonserver.so)
==21418== by 0x4CCA285: nvidia::inferenceserver::RateLimiter::ModelInstanceContext::WaitForRemoval() (in\
/opt/tritonserver/lib/libtritonserver.so)
==21418== by 0x4CD1480: nvidia::inferenceserver::RateLimiter::UnregisterModel(nvidia::inferenceserver::T\
ritonModel const*) (in /opt/tritonserver/lib/libtritonserver.so)
==21418== by 0x4DC498A: nvidia::inferenceserver::TritonModel::~TritonModel() (in /opt/tritonserver/lib/l\
ibtritonserver.so)
==21418== by 0x4DC504C: nvidia::inferenceserver::TritonModel::~TritonModel() (in /opt/tritonserver/lib/l\
ibtritonserver.so)
==21418== by 0x4C61E46: std::thread::_State_impl<std::thread::_Invoker<std::tuple<nvidia::inferenceserve\
r::(anonymous namespace)::ModelDeleter::operator()(nvidia::inferenceserver::Model*)::{lambda()#1}> > >::M\
run() (in /opt/tritonserver/lib/libtritonserver.so)
==21418== by 0x5BE3DE3: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.28)
==21418== by 0x578A608: start_thread (pthread_create.c:477)
==21418== by 0x5F7D292: clone (clone.S:95)
==21418== Address 0xf3d5d650 is 80 bytes inside a block of size 312 free'd
==21418== at 0x483F651: operator delete(void*) (vg_replace_malloc.c:923)
==21418== by 0x4DC48F0: nvidia::inferenceserver::TritonModel::~TritonModel() (in /opt/tritonserver/lib/l\
ibtritonserver.so)
==21418== by 0x4DC504C: nvidia::inferenceserver::TritonModel::~TritonModel() (in /opt/tritonserver/lib/l\
ibtritonserver.so)
==21418== by 0x4C61E46: std::thread::_State_impl<std::thread::_Invoker<std::tuple<nvidia::inferenceserve\
r::(anonymous namespace)::ModelDeleter::operator()(nvidia::inferenceserver::Model*)::{lambda()#1}> > >::M\
run() (in /opt/tritonserver/lib/libtritonserver.so)
==21418== by 0x5BE3DE3: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.28)
==21418== by 0x578A608: start_thread (pthread_create.c:477)
==21418== by 0x5F7D292: clone (clone.S:95)

Because the model with kaldifeat is running in a forked process, we cannot see those logs.
Running valgrind with the --trace-children=yes option will provide more details to go on (a sketch is shown below). There might be some memory leaks or corruption in the kaldifeat module when it is unloaded.
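For example (a sketch; the model repository path is a placeholder, and %p expands to the PID of each traced process):

# Follow the forked Python stub processes in addition to the main tritonserver process.
valgrind --trace-children=yes --log-file=valgrind.%p.log \
    tritonserver --model-repository=/path/to/model_repo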

@deadeyegoodwin
Contributor

So the python stub process may have some leaks or other issues due to kaldifeat, but that should not cause the tritonserver process to have an invalid free, should it? Perhaps there is something wrong in the cleanup logic of the python backend?

@Slyne
Author

Slyne commented Jan 27, 2022

Opened an issue in kaldifeat project for tracking: csukuangfj/kaldifeat#26

@tanmayv25
Contributor

tanmayv25 commented Jan 27, 2022

Thanks @Slyne for filing the issue. We will continue investigating on our side and let you know if we find anything strange.

Perhaps there is something wrong in the cleanup logic of the python backend?

The cleanup logic looks OK to me, but I haven't had a chance to do a deep dive. The respective finalize functions are being called appropriately. I will do a deep dive once I get some cycles to work on it.

@tanmayv25
Contributor

PR #3868 fixes the ==21418== Invalid read of size 8 described at the top. However, I still see the free() invalid pointer.
Running valgrind with --trace-children=yes, I see a bunch of other errors.

@tanmayv25
Contributor

The module kaldifeat has lots of leaks and invalid read/writes on import.

This can be verified using:

valgrind python3 -c "import kaldifeat; print(kaldifeat.__version__)"

However, we do not see the free() invalid pointer error in this case. Running Triton in valgrind with --trace-children=yes gives more details about the invalid free:

==16111== Invalid free() / delete / delete[] / realloc()
==16111==    at 0x483CFBF: operator delete(void*) (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==16111==    by 0x13E724: pybind11::finalize_interpreter() (in /tmp/host/model_repo/test_model/triton_python_backend_stub)
==16111==    by 0x11C363: main (in /tmp/host/model_repo/test_model/triton_python_backend_stub)
==16111==  Address 0x44eedf48 is 24 bytes inside a block of size 65 alloc'd
==16111==    at 0x483BE63: operator new(unsigned long) (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==16111==    by 0x5219378: std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocator<char> const&) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.28)
==16111==    by 0x521A271: std::string::_Rep::_M_clone(std::allocator<char> const&, unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.28)
==16111==    by 0x521A327: std::string::reserve(unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.28)
==16111==    by 0x521A5E1: std::string::append(char const*, unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.28)
==16111==    by 0x4B03EF77: ??? (in /usr/local/lib/python3.8/dist-packages/_kaldifeat.cpython-38-x86_64-linux-gnu.so)
==16111==    by 0x4B041147: ??? (in /usr/local/lib/python3.8/dist-packages/_kaldifeat.cpython-38-x86_64-linux-gnu.so)
==16111==    by 0x4B0301D8: ??? (in /usr/local/lib/python3.8/dist-packages/_kaldifeat.cpython-38-x86_64-linux-gnu.so)
==16111==    by 0x4B02941F: PyInit__kaldifeat (in /usr/local/lib/python3.8/dist-packages/_kaldifeat.cpython-38-x86_64-linux-gnu.so)
==16111==    by 0x4D7C095: _PyImport_LoadDynamicModuleWithSpec (in /usr/lib/x86_64-linux-gnu/libpython3.8.so.1.0)
==16111==    by 0x4D7E104: ??? (in /usr/lib/x86_64-linux-gnu/libpython3.8.so.1.0)
==16111==    by 0x4E34526: ??? (in /usr/lib/x86_64-linux-gnu/libpython3.8.so.1.0)
==16111== 

The trace demonstrates that the free() invalid pointer originates in the pybind11::finalize_interpreter() clean-up. The issue comes up when importing kaldifeat with pybind11. A simple reproducer is described below:

main.cpp :

#include <pybind11/embed.h> // everything needed for embedding
#include <iostream>

namespace py = pybind11;

int main() {
    py::scoped_interpreter guard{}; // start the interpreter and keep it alive
    py::module_ kaldifeat = py::module_::import("kaldifeat");
    std::cerr << "Module Loaded" << std::endl;
}

CMakeLists.txt

cmake_minimum_required(VERSION 3.17)
project(example)

include(FetchContent)

FetchContent_Declare(
  pybind11
  GIT_REPOSITORY "https://github.com/pybind/pybind11"
  GIT_TAG "v2.6"
  GIT_SHALLOW ON
)
FetchContent_MakeAvailable(pybind11)


add_executable(example main.cpp)
target_link_libraries(example PRIVATE pybind11::embed)

In the directory with these files, run the following commands:

cmake .
make example
./example

When running the example, we see the following:

./example 
Module Loaded
free(): invalid pointer
Aborted (core dumped)

The backtrace for the invalid free in the example:

==16174== Invalid free() / delete / delete[] / realloc()
==16174==    at 0x483CFBF: operator delete(void*) (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==16174==    by 0x129EA3: void __gnu_cxx::new_allocator<std::_Fwd_list_node<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >::destroy<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*) (in /tmp/host/py_invalid_free/example)
==16174==    by 0x1251CC: void std::allocator_traits<std::allocator<std::_Fwd_list_node<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >::destroy<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >(std::allocator<std::_Fwd_list_node<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*) (in /tmp/host/py_invalid_free/example)
==16174==    by 0x120B06: std::_Fwd_list_base<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >::_M_erase_after(std::_Fwd_list_node_base*, std::_Fwd_list_node_base*) (in /tmp/host/py_invalid_free/example)
==16174==    by 0x11CF41: std::_Fwd_list_base<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >::~_Fwd_list_base() (in /tmp/host/py_invalid_free/example)
==16174==    by 0x11C9DB: std::forward_list<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >::~forward_list() (in /tmp/host/py_invalid_free/example)
==16174==    by 0x11246C: pybind11::detail::internals::~internals() (in /tmp/host/py_invalid_free/example)
==16174==    by 0x11C05E: pybind11::finalize_interpreter() (in /tmp/host/py_invalid_free/example)
==16174==    by 0x11C14B: pybind11::scoped_interpreter::~scoped_interpreter() (in /tmp/host/py_invalid_free/example)
==16174==    by 0x10E5A5: main (in /tmp/host/py_invalid_free/example)
==16174==  Address 0x20f9cce8 is 24 bytes inside a block of size 65 alloc'd
==16174==    at 0x483BE63: operator new(unsigned long) (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==16174==    by 0x4EA2378: std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocator<char> const&) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.28)
==16174==    by 0x4EA3271: std::string::_Rep::_M_clone(std::allocator<char> const&, unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.28)
==16174==    by 0x4EA3327: std::string::reserve(unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.28)
==16174==    by 0x4EA35E1: std::string::append(char const*, unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.28)
==16174==    by 0x47D5AF77: ??? (in /usr/local/lib/python3.8/dist-packages/_kaldifeat.cpython-38-x86_64-linux-gnu.so)
==16174==    by 0x47D5D147: ??? (in /usr/local/lib/python3.8/dist-packages/_kaldifeat.cpython-38-x86_64-linux-gnu.so)
==16174==    by 0x47D4C1D8: ??? (in /usr/local/lib/python3.8/dist-packages/_kaldifeat.cpython-38-x86_64-linux-gnu.so)
==16174==    by 0x47D4541F: PyInit__kaldifeat (in /usr/local/lib/python3.8/dist-packages/_kaldifeat.cpython-38-x86_64-linux-gnu.so)
==16174==    by 0x4A05095: _PyImport_LoadDynamicModuleWithSpec (in /usr/lib/x86_64-linux-gnu/libpython3.8.so.1.0)
==16174==    by 0x4A07104: ??? (in /usr/lib/x86_64-linux-gnu/libpython3.8.so.1.0)
==16174==    by 0x4ABD526: ??? (in /usr/lib/x86_64-linux-gnu/libpython3.8.so.1.0)

As you can see, the free() invalid pointer is raised even when running outside the Triton Python backend. It comes from pybind11::finalize_interpreter() both within Triton and outside Triton. I have tried the latest pybind11 v2.9.0; it gives the same issue.

Closing the issue, as it is reproducible outside Triton and is shown to manifest when importing kaldifeat within a pybind11 embedded interpreter.

@csukuangfj

csukuangfj commented Feb 7, 2022

For the given demo main.cpp:

#include <pybind11/embed.h> // everything needed for embedding
#include <iostream>

namespace py = pybind11;

int main() {
    py::scoped_interpreter guard{}; // start the interpreter and keep it alive
    py::module_ kaldifeat = py::module_::import("kaldifeat");
    std::cerr << "Module Loaded" << std::endl;
}

If you change

    py::module_ kaldifeat = py::module_::import("kaldifeat");

to

    py::module_ kaldifeat = py::module_::import("torch");

it will still abort when running ./example. Therefore, this issue is not unique to kaldifeat.


valgrind python3 -c "import kaldifeat; print(kaldifeat.__version__)"

produces as many warnings as

valgrind python3 -c "import torch; print(torch.__version__)"

Also,

valgrind python3 -c "import sys"

produces many warnings.

I am not an expert on valgrind and am not sure whether the above invocation of valgrind is the correct approach for detecting memory leaks once the Python interpreter comes into scope.

@Slyne
Author

Slyne commented Feb 7, 2022

@csukuangfj Thank you!

Did you see the free() invalid pointer issue when using the below code:

py::module_ kaldifeat = py::module_::import("torch");

It's quite strange: when we import torch in Triton, we don't see this issue, but when it comes to kaldifeat, the issue occurs...

@csukuangfj

@csukuangfj Thank you!

Did you see the free() invalid pointer issue when using the below code:

py::module_ kaldifeat = py::module_::import("torch");

It's quite strange: when we import torch in Triton, we don't see this issue, but when it comes to kaldifeat, the issue occurs...

Yes, it is reproducible.

(screenshot showing the free(): invalid pointer abort when importing torch)

@tanmayv25
Contributor

tanmayv25 commented Feb 9, 2022

@csukuangfj @Slyne That is very strange. I changed kaldifeat to torch in the reproducer:

py::module_ kaldifeat = py::module_::import("torch");

and I don't see the free() invalid pointer anymore.

root@c24bff507b14:/tmp/host/py_invalid_free# ./example 
Module Loaded

However, for kaldifeat I see:

root@c24bff507b14:/tmp/host/py_invalid_free# ./example 
Module Loaded
free(): invalid pointer
Aborted (core dumped)

The kaldifeat version:

root@c24bff507b14:/tmp/host/py_invalid_free# pip show kaldifeat          
Name: kaldifeat
Version: 1.12

The torch version:

pip show torch    
Name: torch
Version: 1.10.1+cu113

I tried numpy too and it runs smoothly. @Slyne, can you try confirming whether torch core dumps for you?
I understand running valgrind like that is not correct, and the leak information may not be accurate. The discussion here would be helpful for getting accurate information:
https://stackoverflow.com/questions/20112989/how-to-use-valgrind-with-python
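For example, one commonly suggested approach from that discussion (a sketch; the suppression-file path is a placeholder and assumes a CPython source checkout):

# Force CPython to use the system allocator so valgrind can track allocations,
# and suppress known CPython false positives.
PYTHONMALLOC=malloc valgrind --suppressions=/path/to/cpython/Misc/valgrind-python.supp \
    python3 -c "import kaldifeat; print(kaldifeat.__version__)"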

@csukuangfj

csukuangfj commented Feb 14, 2022

I just created a GitHub repo to reproduce the core dump issue by changing import kaldifeat to import torch.
Please see https://github.com/csukuangfj/memory-leak-example

You can see the output from GitHub actions at
https://github.com/csukuangfj/memory-leak-example/runs/5179267107?check_suite_focus=true

A screenshot of the output is given below:
(screenshot of the GitHub Actions output)

kaldifeat uses the PyTorch C++ API, and it is PyTorch's responsibility to manage the memory.


[edited]: So memory issues with kaldifeat should be reproducible by replacing kaldifeat with torch.

@Slyne
Author

Slyne commented Feb 23, 2022

@csukuangfj @Slyne That is very strange. I changed kaldifeat to torch in the reproducer:

py::module_ kaldifeat = py::module_::import("torch");

and I don't see the free() invalid pointer anymore.

root@c24bff507b14:/tmp/host/py_invalid_free# ./example 
Module Loaded

However, for kaldifeat I see:

root@c24bff507b14:/tmp/host/py_invalid_free# ./example 
Module Loaded
free(): invalid pointer
Aborted (core dumped)

The kaldifeat version:

root@c24bff507b14:/tmp/host/py_invalid_free# pip show kaldifeat          
Name: kaldifeat
Version: 1.12

The torch version:

pip show torch    
Name: torch
Version: 1.10.1+cu113

I tried numpy too and it runs smoothly. @Slyne, can you try confirming whether torch core dumps for you? I understand running valgrind like that is not correct, and the leak information may not be accurate. The discussion here would be helpful for getting accurate information: https://stackoverflow.com/questions/20112989/how-to-use-valgrind-with-python

@tanmayv25 Are you running in a Docker environment? One of our colleagues also found this issue, and he didn't use kaldifeat. He only imports pytorch and tensorrt in the Python backend.

@tanmayv25
Contributor

Yes. I am compiling and running the reproducer in a Docker container.

@ZR19971001

Hello, I would like to ask a question. I am using the Triton 22.04-py3 Docker image. When the specified backend is python, the free() problem also occurs when unloading the model. Is it because of kaldifeat?
