Memory leak in client when using python wrapper to send request to service #822

Closed
crazyhank opened this issue Sep 14, 2021 · 16 comments
Labels: more-information-needed (Further information is required)

@crazyhank

crazyhank commented Sep 14, 2021

I found this problem in the latest Galactic release. It is simple to reproduce: write a simple service (C++) and a client (Python), and the memory leak will definitely happen. It does not happen when the client is written in C++.

1. Data struct:
int32 seq
uint64 time
byte[1048576] input_tensor
---
int32 seq
uint64 time
byte[1048576] output_tensor

2. Client side code:
import sys

from my_struct.srv import InferService
import rclpy
from rclpy.node import Node

import time


class MinimalClientAsync(Node):

    def __init__(self):
        super().__init__('minimal_client_async')
        self.cli = self.create_client(InferService, 'infer')
        while not self.cli.wait_for_service(timeout_sec=1.0):
            self.get_logger().info('service not available, waiting again...')
        self.req = InferService.Request()

    def send_request(self):
        self.req.seq = 2000
        self.req.time = 200
        self.future = self.cli.call_async(self.req)


def main(args=None):
    rclpy.init(args=args)

    minimal_client = MinimalClientAsync()
    for _ in range(100000):
        minimal_client.send_request()

        while rclpy.ok():
            rclpy.spin_once(minimal_client)
            if minimal_client.future.done():
                try:
                    response = minimal_client.future.result()
                except Exception as e:
                    minimal_client.get_logger().info(
                        'Service call failed %r' % (e,))
                else:
                    minimal_client.get_logger().info(
                        'Result of inference: resp seq %d' % (response.seq))
                break
            #time.sleep(0.5)


    minimal_client.destroy_node()
    rclpy.shutdown()


if __name__ == '__main__':
    main()

You can write a simple service yourself: just receive the request, do nothing, and send a response back to the client.
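For reference, a minimal do-nothing service can also be sketched in Python. This is only a sketch: it assumes the same my_struct/srv/InferService interface as above and simply echoes the request's seq field back (the original report uses a C++ service, but a Python one should exercise the client the same way):

import rclpy
from rclpy.node import Node

from my_struct.srv import InferService


class MinimalService(Node):

    def __init__(self):
        super().__init__('infer_service')
        # Do-nothing service: ignore the request payload and return a response.
        self.srv = self.create_service(InferService, 'infer', self.infer_callback)

    def infer_callback(self, request, response):
        response.seq = request.seq
        return response


def main(args=None):
    rclpy.init(args=args)
    rclpy.spin(MinimalService())
    rclpy.shutdown()


if __name__ == '__main__':
    main()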

@aprotyas
Member

Do you mind explicating why you think there will be a memory leak?

@fujitatomoya fujitatomoya added the more-information-needed Further information is required label Sep 14, 2021
@crazyhank
Author

This problem was found in our project; I needed time to separate out test code to reproduce it. As you can see, there is a 1 MB data member in the response struct. Each time I receive a response from the server, I see about 1 MB of additional memory used by the client node, so memory usage increases as the test continues.
It seems that the response buffer is not released by the Python wrapper; no problem is found when I switch to C++ code.
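One way to quantify the growth is to log the client process's resident set size after each completed call, for example with a small helper like the sketch below (psutil is a third-party package, not part of rclpy, and the helper name is just for illustration):

import os

import psutil  # third-party: pip install psutil

_process = psutil.Process(os.getpid())


def log_rss(node, request_index):
    # Resident set size in MiB; with a 1 MiB field in the response this grows
    # by roughly 1 MiB per completed call while the leak is present.
    rss_mib = _process.memory_info().rss / (1024 * 1024)
    node.get_logger().info('request %d: RSS = %.1f MiB' % (request_index, rss_mib))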

@aprotyas
Member

Thanks for the explanation! It would be great if you could provide a self-contained example that could be used for stress-testing.

@crazyhank
Author

Hi, I list the key information below; I think it is enough to easily reproduce the problem.

1. Request/Response data struct:

int32 seq
---
int32 seq
byte[1048576] output_tensor

2. Service code (C++):

#include "rclcpp/rclcpp.hpp"
#include "my_struct/srv/infer_service.hpp"
#include <memory>

void do_infer(const std::shared_ptr<my_struct::srv::InferService::Request> request,
              std::shared_ptr<my_struct::srv::InferService::Response> response)
{
        static int index = 0;
        RCLCPP_INFO(rclcpp::get_logger("server"), "Incoming request: seq = %d", request->seq);

        response->seq           = index++;
}

int main(int argc, char **argv)
{
        rclcpp::init(argc, argv);

        std::shared_ptr<rclcpp::Node> node = rclcpp::Node::make_shared("infer_service");

        rclcpp::Service<my_struct::srv::InferService>::SharedPtr service =
                node->create_service<my_struct::srv::InferService>("infer", &do_infer);

        RCLCPP_INFO(rclcpp::get_logger("server"), "Ready to receive infer request ...");

        rclcpp::spin(node);
        rclcpp::shutdown();
}

3. Client code (Python):

import sys

from my_struct.srv import InferService
import rclpy
from rclpy.node import Node

import time


class MinimalClientAsync(Node):

    def __init__(self):
        super().__init__('minimal_client_async')
        self.cli = self.create_client(InferService, 'infer')
        while not self.cli.wait_for_service(timeout_sec=1.0):
            self.get_logger().info('service not available, waiting again...')
        self.req = InferService.Request()
        self.index = 0

    def send_request(self):
        self.index += 1
        self.req.seq = self.index
        self.future = self.cli.call_async(self.req)


def main(args=None):
    rclpy.init(args=args)

    minimal_client = MinimalClientAsync()
    for _ in range(100000):
        minimal_client.send_request()

        while rclpy.ok():
            rclpy.spin_once(minimal_client)
            if minimal_client.future.done():
                try:
                    response = minimal_client.future.result()
                except Exception as e:
                    minimal_client.get_logger().info(
                        'Service call failed %r' % (e,))
                else:
                    minimal_client.get_logger().info(
                        'Result of inference: resp seq %d' % (response.seq))
                break
            time.sleep(0.5)


    minimal_client.destroy_node()
    rclpy.shutdown()


if __name__ == '__main__':
    main()

BTW, the Python wrapper's performance is very slow when the data struct includes a big byte array, as you can see in the above example, but that is another problem.
Hope to get your feedback and to use ROS 2 in our project successfully!

@aprotyas
Member

aprotyas commented Sep 15, 2021

BTW, the Python wrapper's performance is very slow when the data struct includes a big byte array, as you can see in the above example, but that is another problem.

Yeah, that's a known problem: ros2/rosidl_python#134 (Edit: looks like you're aware of this already)

I will try the example that you provided. Thanks!

@crazyhank
Author

Hi, have you reproduced the problem?

fujitatomoya added a commit to fujitatomoya/ros2_test_prover that referenced this issue Sep 21, 2021
@fujitatomoya
Collaborator

Problem confirmed: in the process space there is a lot of heap memory mapped, and as long as the client/service keeps running, virtual/physical memory increases.
I created a reproducible sample program: https://github.com/fujitatomoya/ros2_test_prover/tree/master/prover_rclpy.

Under a colcon environment:

colcon build --symlink-install --packages-select prover_interfaces prover_rclpy
source install/local_setup.bash
ros2 run prover_rclpy rclpy_server_822
ros2 run prover_rclpy rclpy_client_822

CC: @Barry-Xu-2018 @iuhilnehc-ynos could you take a look if you have time? I guess this is a memory leak, if I am not mistaken...

@aprotyas
Member

aprotyas commented Sep 21, 2021

I can confirm the reported issue as well using the sample program linked above. For discussion/debugging convenience, I've produced a couple of charts that show what's happening. Use this script to reproduce said charts when demonstrating a fix.
[Charts: client and service process memory usage over time (test_client, test_service)]


I won't have the bandwidth to return to this issue for a while, but from a cursory overview it does look like a memory leak in the client.

@iuhilnehc-ynos
Contributor

I'd like to share something about this issue.

__convert_to_py(void * raw_ros_message) doesn't own the raw_ros_message.

service server

Service::service_take_request
	auto taken_request = create_from_py(pyrequest_type);   // allocate a buffer
	...
	result_tuple[0] = convert_to_py(taken_request.get(), pyrequest_type);

	taken_request.release();  // Delete this line because this function has the responsibility to deallocate the buffer
	...

convert_to_py (using PyBytes_FromStringAndSize, Py_BuildValue, etc.) copies data from the raw message instead of taking ownership of it.

The same applies to the service client.
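Roughly, the ownership pattern looks like this (a simplified sketch, not the actual rclpy code; the message type and deleter here are stand-ins):

#include <memory>

// Stand-in for the ROS message buffer that create_from_py() allocates.
struct RosMessage
{
  unsigned char data[1048576];
};

// Stand-in for the generated destroy_ros_message_function.
void destroy_ros_message(RosMessage * msg)
{
  delete msg;
}

void service_take_request_sketch()
{
  // create_from_py(): the buffer is owned by a unique_ptr with a custom deleter.
  std::unique_ptr<RosMessage, void (*)(RosMessage *)> taken_request(
    new RosMessage(), destroy_ros_message);

  // convert_to_py(): copies the contents into Python objects
  // (PyBytes_FromStringAndSize, Py_BuildValue, ...) and does NOT take ownership.

  // Bug: release() discards the deleter without freeing the buffer -> leaked per call.
  // taken_request.release();

  // Fix: do nothing here; the unique_ptr destructor frees the buffer when it goes
  // out of scope (or call taken_request.reset() explicitly).
}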

@llapx
Contributor

llapx commented Sep 28, 2021

unique_ptr's release() just releases ownership, but does not free the memory it points to; we can replace it with reset(nullptr).
service,
client.

@aprotyas
Member

unique_ptr's release() just releases ownership, but does not free the memory it points to

@llapx you are right. For reference, per std::unique_ptr<T,Deleter>::release: "The caller is responsible for deleting the object."

I believe replacing with just reset() would suffice too.
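As a plain illustration of the difference (nothing rclpy-specific):

#include <memory>

int main()
{
  auto p = std::make_unique<int>(42);
  // p.release();   // returns the raw pointer and frees nothing -> leak unless the caller deletes it
  p.reset(nullptr);  // frees the int immediately; equivalent to p.reset()
  // Even without an explicit reset(), p's destructor frees the object at end of scope.
}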

@iuhilnehc-ynos
Contributor

You don't need to call reset() manually; just let the unique_ptr, with its destroy_ros_message_function, do the magic.

Refer to

#define TAKE_SERVICE_RESPONSE(Type) \
  /* taken_msg is always destroyed in this function */ \
  auto taken_msg = create_from_py(pymsg_type); \
  rmw_request_id_t header; \
  rcl_ret_t ret = rcl_action_take_ ## Type ## _response( \
    rcl_action_client_.get(), &header, taken_msg.get()); \
  int64_t sequence = header.sequence_number; \
  /* Create the tuple to return */ \
  if (RCL_RET_ACTION_CLIENT_TAKE_FAILED == ret || RCL_RET_ACTION_SERVER_TAKE_FAILED == ret) { \
    return py::make_tuple(py::none(), py::none()); \
  } else if (RCL_RET_OK != ret) { \
    throw rclpy::RCLError("Failed to take " #Type); \
  } \
  return py::make_tuple(sequence, convert_to_py(taken_msg.get(), pymsg_type));

py::tuple
ActionClient::take_goal_response(py::object pymsg_type)
{
  TAKE_SERVICE_RESPONSE(goal)
}

@crazyhank
Author

@iuhilnehc-ynos
After deleting these two release() lines, the problem disappeared.

In my understanding, the release() function transfers ownership of the buffer from rclpy to the caller, right? But the Python wrapper never takes over that unique pointer, so the buffer never has a chance to be freed. Correct me if I am wrong!

@fujitatomoya
Collaborator

@iuhilnehc-ynos @llapx

Either of you, can you make a PR against this issue? Let's review and fix the problem in the mainline.

@fujitatomoya
Collaborator

Okay, I see #828, one step behind... 😢 Sorry!

@fujitatomoya
Collaborator

I will go ahead and close this.
