Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[core] Add a thread checker #47966

Merged
merged 2 commits into from
Oct 12, 2024
Merged

Conversation

dentiny
Copy link
Contributor

@dentiny dentiny commented Oct 10, 2024

As @rynewang mentioned at comment #47793 (comment), ray core uses ASIO io context for event scheduling and callback execution.

Merely reading through the code, it's not always easy to justify correct threading usage, whether data race.

TSAN is a great tool to verify against data race and invalid memory access at runtime, but is generally not enough, since

  • It completely relies on the coverage for unit tests, which is hard to enforce along the codebase;
  • Certain tests cannot run with TSAN, for example, if we have non-default memory allocator (i.e. tcmalloc); or large tests which exceeds binary size (for static link), or too long runtime (usually for dynamic link)

ThreadChecker is a cheap runtime checker, which only takes two atomic operations, which

  • guards against threading invariants;
  • serves for better code readability, so code readers and developers are easy to understand thread execution status

Intended usage for GcsJobManager:

void CreateJob() {
  auto callback = [this]() {
    RAY_CHECK(this->thread_checker.ok());
  };
}

Signed-off-by: dentiny <[email protected]>
@dentiny dentiny mentioned this pull request Oct 10, 2024
8 tasks
@@ -41,3 +41,10 @@ cc_library(
"@nlohmann_json",
],
)

cc_library(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to propose we separate libraries and binaries into separate targets, as I mentioned here: #47928 (comment)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand thread_checker is duplicate in two targets: utils and thread_checker, since utils simply declares all header/impls as its dependency; I will make another PR to cleanup.


namespace ray {

bool ThreadChecker::ok() const {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The overall runtime cost is ignorably small -- I'm pretty confident I could squeeze these CPU cycles from elsewhere.

If developers are really concerns about it (which means they care about nano seconds latency), they could use RAY_DCHECK to ignore at release.

@dentiny
Copy link
Contributor Author

dentiny commented Oct 10, 2024

@rynewang / @jjyao would appreciate a code review :)

@rynewang
Copy link
Contributor

hmm. I'm ok with this since std::this_thread::get_id() overhead is arguably small, as compared to the lots of "generous" copies sprinkled around the code base.

However let's come up with a good name. ThreadChecker::ok() is kinda vague - there are a lot of possible thread properties. Maybe: SingleThreadAccess::check()? or ThreadChecker::CheckOnSameThread()? or something understandable by looking at callsite.

Next, I don't think we need the mutable and const ok(). We can just expose it as a non-const method and it's totally fine.

Next. do we want it copyable/movable? needs some thoughts.

  • maybe we need copyable, because we want to check 1 callsite affines with 2 callsites in 2 callbacks.
  • maybe we should disable move, since it's easy to accidentally use the moved-from one, leading to false negative.

thoughts?

@dentiny
Copy link
Contributor Author

dentiny commented Oct 10, 2024

Next. do we want it copyable/movable? needs some thoughts.
maybe we need copyable, because we want to check 1 callsite affines with 2 callsites in 2 callbacks.
maybe we should disable move, since it's easy to accidentally use the moved-from one, leading to false negative.
thoughts?

Atomic variable is not copiable or movable, so same as ThreadChecker.

@dentiny
Copy link
Contributor Author

dentiny commented Oct 10, 2024

However let's come up with a good name. ThreadChecker::ok() is kinda vague - there are a lot of possible thread properties. Maybe: SingleThreadAccess::check()? or ThreadChecker::CheckOnSameThread()? or something understandable by looking at callsite.

Sounds good! Updated to IsOnSameThread, hopefully it reads better.

@rynewang
Copy link
Contributor

Next. do we want it copyable/movable? needs some thoughts.
maybe we need copyable, because we want to check 1 callsite affines with 2 callsites in 2 callbacks.
maybe we should disable move, since it's easy to accidentally use the moved-from one, leading to false negative.
thoughts?

Atomic variable is not copiable or movable, so same as ThreadChecker.

then we'd have some bad times. Most of our usage is about "this function and its callbacks are on same thread". if we can't copy/move the atomic to lambda callback we can't really use it.

@dentiny
Copy link
Contributor Author

dentiny commented Oct 10, 2024

then we'd have some bad times. Most of our usage is about "this function and its callbacks are on same thread". if we can't copy/move the atomic to lambda callback we can't really use it.

No worries at all, generally thread checker should be data member, a this pointer is enough.

Checkout current code impl,

auto on_done = [this, job_id, mutable_job_table_data, reply, send_reply_callback](

the callback is already capturing this.

@dentiny
Copy link
Contributor Author

dentiny commented Oct 11, 2024

@rynewang Let me clarify more

Intended use case is

class GcsJobManager {
 public:
  void HandleDone() {
    auto on_done = [this]() {
      thread_checker_.CheckThread();
    };
  }

 private:
  ThreadChecker thread_checker_;
};

No copy or move is needed.

Let me know if you have another thought, thanks.

@rynewang
Copy link
Contributor

Makes sense. I found actually we already have something similar to yours - with mutex. Once this PR is merged we can replace that thing with yours.

void ThreadCheck() const {

@dentiny
Copy link
Contributor Author

dentiny commented Oct 11, 2024

Makes sense. I found actually we already have something similar to yours - with mutex. Once this PR is merged we can replace that thing with yours.

Sounds good, I could followup with another PR after this one merged.

@dentiny
Copy link
Contributor Author

dentiny commented Oct 12, 2024

@rynewang Could you please help me to merge the PR? Thanks!

@rynewang rynewang enabled auto-merge (squash) October 12, 2024 18:19
@github-actions github-actions bot added the go add ONLY when ready to merge, run all tests label Oct 12, 2024
@rynewang rynewang merged commit b539440 into ray-project:master Oct 12, 2024
7 checks passed
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Oct 15, 2024
`ThreadChecker` is a cheap runtime checker, which only takes two atomic
operations, which
- guards against threading invariants;
- serves for better code readability, so code readers and developers are
easy to understand thread execution status

Signed-off-by: dentiny <[email protected]>
Signed-off-by: ujjawal-khare <[email protected]>
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Oct 15, 2024
`ThreadChecker` is a cheap runtime checker, which only takes two atomic
operations, which
- guards against threading invariants;
- serves for better code readability, so code readers and developers are
easy to understand thread execution status

Signed-off-by: dentiny <[email protected]>
Signed-off-by: ujjawal-khare <[email protected]>
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Oct 15, 2024
`ThreadChecker` is a cheap runtime checker, which only takes two atomic
operations, which
- guards against threading invariants;
- serves for better code readability, so code readers and developers are
easy to understand thread execution status

Signed-off-by: dentiny <[email protected]>
Signed-off-by: ujjawal-khare <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
go add ONLY when ready to merge, run all tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants