-
Notifications
You must be signed in to change notification settings - Fork 109
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bugs fix and improvements to Node.Start() and Node.Stop() #414
base: main
Are you sure you want to change the base?
Conversation
- Create Meaningful error macros - Make IsRunning flag atomic for safety purpose. - Fix deso-protocol#413 - Add utility functions for change the status of the node.
Unit tests for the following utility functions to change the status of the node. - *Node.IsRunning() - *Node.StatusStartToStop() - *Node.StatusStopToStart() - changeRunningStatus
Duplicating two test utility functions to avoid the cyclic issues. Cannot import integration_testing in cmd/ packager, this creates a cycle.
Changing the names of the test utility functions to improve the readability. 1. Changing getDirectory to getTestDirectory. 2. Changing generateConfig to GenerateTestConfig()
Updating test utility function names to match the latest updates.
Wow great work! We're reviewing and should get back shortly. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is awesome Karthic, thank you so much for this contribution! I've added some small comments during my review. It would be great if we could incorporate them into the PR before we merge. Let me know what you think 😀
cmd/node_test.go
Outdated
|
||
// expected running status should be false before the node is started | ||
expectedRunningStatus := false | ||
actualRunningstatus := testNode.IsRunning() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Can we camel case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm particular about these things too, missed it by mistake. Thanks for pointing out.
cmd/node.go
Outdated
// set to false after Stop() is called. Mainly used in testing. | ||
// Use the convinience methods changeRunningStatus/StatusStartToStop/StatusStopToStart to change node status | ||
// Use *Node.IsRunning() to retreive the node running status. | ||
running *atomic.Bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good idea; however, I believe atomic.Bool
was added in 1.19. Our code is currently only compatible with 1.18, so I think sadly we'll have to revert running
to a bool. While we're at it, there is a chance that in the future we might want to have more statuses than "running" or "not running", and it looks like the functionality you've written already accounts for that which is great. In light of that, could we actually rename this running
variable to status
with a custom byte type NodeStatus
that will, for now, take values of stopped
(NodeStatus
is byte(0)
), running
(NodeStatus
is byte(1)
). If we feel like it, optionally we could add statuses of new
, restarting
, etc. feel free to get creative lol. Let's make sure that the status updating functions you wrote also reflect this change. Lastly, since we don't use the atomic package, could we use the runningMutex
to nail all the race conditions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the review. Is the Go 1.18 Compatibility documented anywhere in the docs? If not I'll make a documentation/README update.
Hmm, Atomic makes code much cleaner, but yes, will wait for a future 1.19 bump! Your recommendation sounds perfect, I'll soon ship the changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@AeonSw4n I've incorporated the changes. Let me know what you think.
integration_testing/tools.go
Outdated
// It's usually the first step to starting a node. | ||
func generateConfig(t *testing.T, port uint32, dataDir string, maxPeers uint32) *cmd.Config { | ||
func GenerateTestConfig(t *testing.T, port uint32, dataDir string, maxPeers uint32) *cmd.Config { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any way we can deduplicate this and keep only a single GenerateTestConfig
in cmd?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SGTM. I'll do this change
- Moving GenerateTestConfig to cmd package to avoid issues related to cyclic usage of the test utility inside the cmd package.
- Updating testconfig in blocksync tests. GenerateTestConfig is moved to cmd package.
- changing node status to custom NodeStatus byte type - Setting valid status codes - Utility functions to validate and change the status codes - Using new status locks for safely reading and writing the node status.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey man, just reviewed your changes. Really solid code overall! I think before we're ready to merge this boyo into our codebase, I have to be confident we won't get screwed by the additional complexity entailed by the status changes and locking/unlocking of the statusMutex
. We gotta make sure everything is intuitively forward compatible, and no dev in the future can break things too easily. Left you some suggestions, let me know what you think! I'd say after we address these things, we'll be ready to sync some nodes and make sure everything runs smooth as butter.
cmd/node.go
Outdated
@@ -273,7 +358,9 @@ func (node *Node) Start(exitChannels ...*chan struct{}) { | |||
case <-syscallChannel: | |||
} | |||
|
|||
node.statusMutex.Lock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It feels like you're using statusMutex
very similarly to how runningMutex
is used. How about we just get rid of runningMutex
and replace it with the statusMutex
? In this case, we would also get rid of node.statusMutex.Lock()/Unlock()
in here, since the mutex will be held inside of node.Stop()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed with all feedback from @AeonSw4n . Main input from me is to move all the mutex handling into the helper functions that handle stopping and updating status so the caller doesn't have to worry about it.
Agreed.
cmd/node.go
Outdated
// Loads the status of Node. | ||
// Modifying LoadStatus() and the validateStatus() functions and their tests are hard requirements | ||
// to add new status codes for the node. | ||
func (node *Node) LoadStatus() (NodeStatus, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find "get/set" more intuitive than "load/change" (though it's just my personal preference). Since this getter isn't holding a mutex, maybe we rename it to GetStatusWithoutLock
-- this way other developers will immediately know that this function is supposed to be called with a status mutex.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also maybe we make this method private and wrap it with some other GetStatus()
that does actually hold the status mutex?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would be best if all the mutex logic lives within these functions instead of handled by the callers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense.
cmd/node.go
Outdated
|
||
// changes the running status of the node | ||
// Always use this function to change the status of the node. | ||
func (node *Node) changeNodeStatus(newStatus NodeStatus) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the setter for the node status
, similarly to my comment about the getter LoadStatus()
, maybe we rename it to setNodeStatusWithoutLock()
?
cmd/node.go
Outdated
switch operation { | ||
case lib.NodeRestart: | ||
// Using Mutex while accessing the Node status to avoid any race conditions | ||
node.statusMutex.Lock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's move the node.statusMutex.Lock()/Unlock()
and the if !node.IsRunning(){...}
condition inside the node.Stop()
. Since Stop()
should never be called when node is already stopped it feels reasonable to isolate everything inside Stop
. Same goes for the NodeErase
case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes^ - big piece from my review is that we don't want to have these mutexes flying around all the calling functions, but should instead handle the lock/unlock in the helper functions as much as we can.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cmd/node.go
Outdated
// status is nil when a NewNode is created, it is initialized and set to RUNNING [byte(1)] on node.Start(), | ||
// set to STOPPED [byte(2)] after Stop() is called. | ||
|
||
// Use the convinience methods changeNodeStatus/UpdateStatusRunning/UpdateStatusStopped to change node status. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: convenience
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Use the convinience methods changeNodeStatus/UpdateStatusRunning/UpdateStatusStopped to change node status. | |
// Use the convenience methods changeNodeStatus/UpdateStatusRunning/UpdateStatusStopped to change node status. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cmd/node.go
Outdated
// Use *Node.IsRunning() to check if the node is running. | ||
// Use *Node.LoadStatus() to retrieve the status of the node. | ||
status *NodeStatus | ||
// Held whenver the status of the node is read or altered. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: whenever
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Held whenver the status of the node is read or altered. | |
// Held whenever the status of the node is read or altered. |
cmd/node.go
Outdated
|
||
node.statusMutex.Lock() | ||
// Load the node status. | ||
// This is identify whether the node is initialized for the first time or it's a restart. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: to identify
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// This is identify whether the node is initialized for the first time or it's a restart. | |
// This is to identify whether the node is initialized for the first time or it's a restart. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed with all feedback from @AeonSw4n . Main input from me is to move all the mutex handling into the helper functions that handle stopping and updating status so the caller doesn't have to worry about it.
cmd/node.go
Outdated
// ErrAlreadyStopped is returned when somebody tries to stop an already | ||
// stopped service (without resetting it). | ||
ErrAlreadyStopped = errors.New("already stopped") | ||
// ErrNodeNeverStarted is returned when somebody tries to find or changge the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// ErrNodeNeverStarted is returned when somebody tries to find or changge the | |
// ErrNodeNeverStarted is returned when somebody tries to find or change the |
cmd/node.go
Outdated
// NEVERSTARTED is the default status, cannot set deliberately. | ||
ErrCannotSetToNeverStarted = errors.New("cannot set the node status to neverstarted") | ||
// Error if the atomic compare of swap of the node status fails | ||
ErrFailedtoChangeNodeStatus = errors.New("failed to change the status of the node") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this variable does not appear to be used - if so, can we please remove it? If we do keep it, let's fix the capitalization on the name
ErrFailedtoChangeNodeStatus = errors.New("failed to change the status of the node") | |
ErrFailedToChangeNodeStatus = errors.New("failed to change the status of the node") |
cmd/node.go
Outdated
// status is nil when a NewNode is created, it is initialized and set to RUNNING [byte(1)] on node.Start(), | ||
// set to STOPPED [byte(2)] after Stop() is called. | ||
|
||
// Use the convinience methods changeNodeStatus/UpdateStatusRunning/UpdateStatusStopped to change node status. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Use the convinience methods changeNodeStatus/UpdateStatusRunning/UpdateStatusStopped to change node status. | |
// Use the convenience methods changeNodeStatus/UpdateStatusRunning/UpdateStatusStopped to change node status. |
cmd/node.go
Outdated
// Use *Node.IsRunning() to check if the node is running. | ||
// Use *Node.LoadStatus() to retrieve the status of the node. | ||
status *NodeStatus | ||
// Held whenver the status of the node is read or altered. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Held whenver the status of the node is read or altered. | |
// Held whenever the status of the node is read or altered. |
cmd/node.go
Outdated
|
||
node.statusMutex.Lock() | ||
// Load the node status. | ||
// This is identify whether the node is initialized for the first time or it's a restart. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// This is identify whether the node is initialized for the first time or it's a restart. | |
// This is to identify whether the node is initialized for the first time or it's a restart. |
cmd/node_test.go
Outdated
expectedRunningStatus = true | ||
actualRunningStatus = testNode.IsRunning() | ||
require.Equal(t, actualRunningStatus, expectedRunningStatus) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
expectedRunningStatus = true | |
actualRunningStatus = testNode.IsRunning() | |
require.Equal(t, actualRunningStatus, expectedRunningStatus) | |
require.True(t, testNode.IsRunning(), expectedRunningStatus) |
cmd/node_test.go
Outdated
expectedRunningStatus = false | ||
actualRunningStatus = testNode.IsRunning() | ||
require.Equal(t, actualRunningStatus, expectedRunningStatus) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
expectedRunningStatus = false | |
actualRunningStatus = testNode.IsRunning() | |
require.Equal(t, actualRunningStatus, expectedRunningStatus) | |
require.False(t, testNode.IsRunning()) |
cmd/node_test.go
Outdated
expectedRunningStatus := false | ||
actualRunningStatus := testNode.IsRunning() | ||
require.Equal(t, actualRunningStatus, expectedRunningStatus) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
expectedRunningStatus := false | |
actualRunningStatus := testNode.IsRunning() | |
require.Equal(t, actualRunningStatus, expectedRunningStatus) | |
require.False(t, testNode.IsRunning()) |
cmd/node_test.go
Outdated
actualErr = testNode.UpdateStatusRunning() | ||
require.NoError(t, actualErr) | ||
// Once the running flag is changed, the isRunning function should return true | ||
require.Equal(t, true, testNode.IsRunning()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
require.Equal(t, true, testNode.IsRunning()) | |
require.True(t, testNode.IsRunning()) |
- replacing a panic with glag.Fatalf - simplifying the logic of node.IsRunning() - Deduplicating the glog messages, shifting the responsibility of logging to the caller.
- Removing runningMutex and replacing it with statusMutex - Adding statusMutex inside *Node.Start() and *Node.Stop() - Changing LoadStatus to GetStatus - Creating private functions getStatusWithoutLock and updateStatusWithoutLock to change node status without holding statusLocks. - Creating wrapper functions GetStatus and UpdateStatus to fetch and update node status while holding statusMutex locks.
- Updating function names to match the updates to utility functions in node.go - Adding test to verify graceful shutdown of the node.
@AeonSw4n @lazynina : I made the changes as per your suggestions, thank you for the great inputs and the timely reviews. Here's the summary of my changes.
Changed LoadStatus->GetStatus and ChangeStatus-> SetStatus
created private functions with suffix
Got rid of
Node.Stop() automatically takes care of validating the current status of the node before stopping. Hence, removed the
Renamed
Mutexes don't fly anymore :) 🚀 |
@hackintoshrao thanks for making the updates! Just wanted to make a quick update on my end and let you know reviewing and testing all your code is a high priority for me, and will get to it once I finish being tunnel-visioned on access groups & group chats. |
Thanks @AeonSw4n . Just that this PR will enable me to do tons of contributions. I'm sweeping through the codebase writing tests, but I can't move forward till this is merged. May be I can help with access groups and chats by writing tests and manually testing them. |
Oh, that's so awesome @hackintoshrao I'm so excited to see your other contributions! Okay, I'll look into your code in the evening. Speaking of tests, I've actually been working on a new transaction unit testing framework lately. I'll PR it out of my current code so it's easier to read lol -- figured that might be interesting to you. So cool to see your readiness to help with access groups! If you feel like checking out the code the latest branch is in this PR #412. Let's wrap this PR, and I'll think of what could be the best use of your time. :) |
Thanks @AeonSw4n . Getting this merged will also help me fix tests for the #412 PR. Eager to check out the transaction test framework. (On a different note, so cool to see Badger! I used to work at Dgraph, the company behind Badger.) I like the idea of the transaction test suite. I've been thinking of a framework with parallel tests to benchmark and stress test the core's performance. But, many things need to clean up before getting there. Hence, I'm running across the codebase with tests. I'm curious to monitor the Go routine count and graceful exits while stress-testing the transactions, I have some early suspicions of Go-routine leaks. PS: I have OCD for an unusually excessive amount of testing😉 Overall this new transaction test framework can be used for testing a lot of things around leaks, performance, and correctness. Keen on checking it out and contributing to it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just reviewed your latest code, and I've got some minor change requests, but giving you an approve because your code is really awesome! One more comment I have is, do we need these .DS_Store
files? We didn't have them before and they feel a bit redundant.
Once again thank you for this contribution and your flexibility when faced with our feedback 😃 @lazynina @diamondhands0 can you guys also take a look so we can merge?
Thank you @AeonSw4n :) I'll remove the .DS files, they are redundant.
I learned quite a bit from the reviews :) |
- Removing the .DS files - Adding error wrapper %w - Removing the redundant lock
@AeonSw4n Found the transaction test framework inside the Is there a better channel to communicate while I work on the codebase than Github comments😅 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hackintoshrao - looking very good. I've opened a PR against your fork with some changes that would simplify the diff between your fork and the existing code in core. I think it'll be faster for us to get it through with a smaller diff - we can keep the existing function names and move all the testing logic into the integration_testing
package.
Ln/pr 414 suggested changes
Renaming setstatus functions with a suffix of without locks.
Prallelizing tests/transactions could be fun @hackintoshrao :) Though keep in mind they will always be processed sequentially in the mempool. We are planning to build actual prallelization for transaction processing a little further on the horizon -- probably after/concurrently with PoS. I've also refactored the testing framework into another PR #422. Before it's merged I'm planning on adding a state verification for before & after connect-disconnect to Postgres tests (currently this only works for Badger tests) and just documenting everything a little better. Re communication channel: certainly! I think you were chatting with Dylan over Discord, is that a good channel? I'll reach out to you soon (almost done with access groups 🤞). |
@lazynina The PR is ready as per the suggestions, do let me know if there's anything more needs to be done for this to be merged :) |
@hackintoshrao I have been testing this out by running and stopping a node, but ran into a case where one of the status change validations returns an error and the node will not switch to running. Trying to figure out exactly what's going on and to test out more stop + restart situations to make sure we don't have nodes that get stuck. I have only reproduced it once but it's a little unclear what is happening. As I have more details, I'll share them |
Thank you @lazynina . Please do report the edge scenarios when you have them. |
I can't reproduce it - maybe it happened in a dream! @diamondhands0 for final review. |
Thanks @lazynina . Will wait for the review by @diamondhands0 . |
@hackintoshrao - can you provide more details on how to reproduce the bug ( #413 ) and what specifically in this PR is fixing that bug? @diamondhands0 |
The bug occurs when the In this scenario, Hence, instead of safely recovering, it panics here. Here is a quick screen recording explainer https://www.loom.com/share/410232d7a6ab4d1e8610c9259de1cbf6 . |
Context: I love the project and what it stands for. It's very important for an impactful and complex project like this to have great coverage of tests. I started working on writing unit and integration tests for the entire codebase. While I work on tests I'm fixing things as I find them. Here is the summary of the PR
node.nodeMessageChan <- lib.NodeErase
message is sent for the first time #413