Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(mirror): Pass epoch config overrides on forknet init #12421

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

VanBarbascu
Copy link
Contributor

Context

Forknet is initialized by provisioning a host with a setup directory containing the source data folder. The following commands are executed to prepare the network:

  • neard init – generates the configs and keys.
  • neard fork-network set-validators – amends the database, inserts the specified set of validators, and creates the genesis config for the new network.
  • neard fork-network finalize – completes the forking process.

In this PR, we focus on the first two steps. If we need to override the epoch configs, they will be generated by neard init and then modified by the neard-runner process with the provided overrides. We first apply the general overrides under the "all" key, followed by protocol version-specific ones.

When overriding an epoch config that does not exist, the contents are copied from the previously available epoch config and then overwritten. i.e. let's assume that mainnet only has epoch configs for protocol 72 and 74. If you want to add overrides for p.v. 75, they will be applied on top of the original epoch config for p.v. 74.

When creating the genesis in set-validators, we read the epoch config corresponding to the genesis protocol version.

How to use it

When calling mirror new-test add the --epoch-config-overrides.
This will accept a JSON formatted string that has the protocol version as keys and a list of jq styled overrides sepparated by |.

By protocol version

For example, if you need to update the num_block_producer_seats, num_chunk_validator_seats and num_chunk_producer_seats for protocol version 73, you would use the following argument: { "74": ".num_block_producer_seats = 3 | .validator_selection_config.num_chunk_validator_seats = 22 | .validator_selection_config.num_chunk_producer_seats = 3"}

With all key

A protocol version override only affects a specific epoch config. If you want to apply a change to all epoch configs, add that change under the all key instead.

Priority

Use the all key to apply overrides across all configs simultaneously. Protocol version-specific overrides will then be applied afterward.

Testing

For testing, I used a 6 node forknet with different scenarios:

For backwards compatibility: old binary(2.3.0), new neard-runner

mirror new-test --epoch-length 50 --genesis-protocol-version 71 --num-validators 6 --num-seats 6 --stateless-setup --new-chain-id modknet

Result: network progessed ok with all 6 ndoes as validators.

New binary, new neard-runner, with overrides

mirror new-test --epoch-length 20 --genesis-protocol-version 73 --num-validators 6 --num-seats 6 --stateless-setup --new-chain-id modknet --epoch-config-overrides '{"all": ".validator_selection_config.num_chunk_validator_seats = 22 | .validator_selection_config.num_chunk_producer_seats = 3 | .num_block_producer_seats = 3 | .validator_selection_config.shuffle_shard_assignment_for_chunk_producers = true"}'
Result:
image
This was for testing purposes. The same result could have been achieved by using the --num-seats 3 parameter and overriding only the shuffle_shard_assignment_for_chunk_producers parameter.

New binary, new neard-runner, with overrides on all and on version 73

mirror new-test --epoch-length 20 --genesis-protocol-version 71 --num-validators 6 --num-seats 6 --stateless-setup --new-chain-id modknet --epoch-config-overrides '{"all": ".validator_selection_config.num_chunk_validator_seats = 22 | .validator_selection_config.num_chunk_producer_seats = 3 | .num_block_producer_seats = 3", "73": ".validator_selection_config.num_chunk_producer_seats = 6"}'

Result:
image
The network starts at protocol version 71. In the image, it is in the first epoch after the genesis epoch. In the next epoch, protocol version 73 will take effect, along with the num_chunk_producer_seats = 6 override.

Copy link

codecov bot commented Nov 8, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 71.68%. Comparing base (6ccbd62) to head (4d2d660).

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #12421      +/-   ##
==========================================
+ Coverage   71.16%   71.68%   +0.51%     
==========================================
  Files         843      843              
  Lines      170788   170788              
  Branches   170788   170788              
==========================================
+ Hits       121547   122432     +885     
+ Misses      43913    42993     -920     
- Partials     5328     5363      +35     
Flag Coverage Δ
backward-compatibility 0.16% <ø> (?)
db-migration 0.16% <ø> (ø)
genesis-check 1.27% <ø> (ø)
integration-tests 39.35% <ø> (+1.00%) ⬆️
linux 70.97% <ø> (+<0.01%) ⬆️
linux-nightly 71.26% <ø> (+7.32%) ⬆️
macos 50.65% <ø> (-0.01%) ⬇️
pytests 1.57% <ø> (ø)
sanity-checks 1.38% <ø> (ø)
unittests 64.26% <ø> (+<0.01%) ⬆️
upgradability 0.21% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

remove the epoch config dir on new_test
@@ -141,6 +143,77 @@ class TestState(Enum):
}"""


class EpochConfigOverrides:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: feel free to keep it if it's more your style and you want it that way, but why a class? The only place it's used is in the single statement EpochConfigOverrides(...).override(...), which to me is an indication you could just keep it simpler and write this as a normal function


existing_epoch_config_files = os.listdir(epoch_config_dir)
existing_protocol_versions = set([
int(re.match(r'(\d+)\.json', f).group(1))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a bit fragile in that if someone ever adds a random file to this directory maybe while debugging some problem with the forknet setup or something, we will crash here instead of just ignoring it. Maybe not super likely but I think it's prob best to be more robust and skip filenames that dont match

int(re.match(r'(\d+)\.json', f).group(1))
for f in existing_epoch_config_files
])
previous_protocol_version_file = min(existing_protocol_versions)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This min() will throw an exception if it's empty. Even if it's guaranteed not to be after neard init --dump-epoch-config, it's prob still not really needed for this program to crash here rather than just reporting an error, or maybe even just continuing without epoch config modifications but printing an error or something

also nit: previous_protocol_version_file seems to refer to the number, not the filename, so maybe just previous_protocol_version?

self.neard_init(rpc_port, protocol_port, validator_id,
genesis_protocol_version,
epoch_config_overrides)
except Exception as e:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please try to catch a more specific set of exceptions, or to express the logic here in some other way if you want. Catching all exceptions is not great because some of them should really just cause us to exit if they indicate a bug in the program rather than some expected thing

see "Never use catch-all except: statements" here

@@ -591,6 +594,7 @@ def __call__(self, parser, namespace, values, option_string=None):
new_test_parser.add_argument('--genesis-protocol-version', type=int)
new_test_parser.add_argument('--stateless-setup', action='store_true')
new_test_parser.add_argument('--gcs-state-sync', action='store_true')
new_test_parser.add_argument('--epoch-config-overrides', type=json.loads)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add some help text with an example like in the PR description? Also might be good to warn the user in this help text that this won't work unless the binary has the --dump-epoch-config option in neard init

])
previous_protocol_version_file = min(existing_protocol_versions)
for protocol_version in sorted(self.new_protocol_versions):
if protocol_version in existing_protocol_versions:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check will not work because self.new_protocol_versions contains strings. Also note that self.epoch_config_overrides has string keys. So this needs to all be handled uniformly somehow

@marcelo-gonzalez
Copy link
Contributor

I know you've already made some progress on the different things that go into making this epoch config stuff work, but if I may, I have a few thoughts about how this might be made simpler, since some of it seems a bit strange to me. What I'm saying here isn't blocking this PR (I would have approved if not for the bug and except Exception as e:), but I think this is prob worth pointing out.

So when we run a forknet with this --epoch-config-overrides flag set, neard_runner.py will initialize the home directory by running neard init with the --dump-epoch-config config option you added. This command will look through the mainnet epoch configs between --genesis-protocol-version and the node's current stable protocol version and write them to files in ~/.near/epoch_configs. Where does it get these mainnet epoch configs? It gets them from the ones checked into git. Without even continuing to think about the flow here, this to me already indicates that something is not quite right. We're adding a flag to neard init to dump files checked into git so that we can then use them in forknet tests? Why not just upload the files we want? We have them checked into git. This is also true of testnet.

It seems much simpler to me to just directly upload the epoch config files. You could easily include them in the list of files we upload to the neard runner's directory here. And in light of that, maybe consider just reverting the change that added this neard init --dump-epoch-config flag? We already have the exact files we wanted it to dump ready to be uploaded wherever we want

So why not just do this? If the user gives --epoch-config-overrides, have neard_runner.py make the epoch_configs dir after running neard init, and copy over any needed mainnet or testnet epoch configs to it (these can all be uploaded when initializing a forknet like in the link I gave above). Then for any protocol version explicitly given in the --epoch-config-overrides flag, just do the thing in this PR where we copy the epoch config with the highest version less than or equal to that one, and make the edits we want

json_data = json.load(f)

# Apply the overrides
updated_json = jq.compile(overrides).input(json_data).first()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of jq, did you consider just having the --epoch-config-overrides flag take a path to a json file where the wanted changes are? It might be a bit easier if someone wants to make lots of edits (imagine editing the fields you included in the PR description, along with editing shard_layout and num_block_producer_seats_per_shard or something). In that case a command line arg with a bunch of big jq programs might be kind of unwieldy

@Longarithm
Copy link
Member

I have a system design concern that we are building separate logic to handle epoch configs in Python.

This feels more fragile than handling most of it in Rust nearcore code, which already knows EpochConfig structure. If we change EpochConfig, forknet will silently break and we'd need to find a fix an issue.

My another perspective is that nearcore development heavily relies on forknet, as it shows whether new features are ready for mainnet. Because of that, we should aim to minimise number of dependencies and assumptions made. Then, practically, using jq and re seems like an overkill - we should do something similar to RuntimeConfigStore, it already handles upgrades quite well.

Alternatively, it could be better to refactor EpochConfig by replacing validator_selection_config.X fields with just X - it would make structure more flat. Or add TODO, and just assume that if field X is not present, assume that it is inside validator_selection_config. With that we could use simple overrides like key_string -> ParameterValue. ParameterValue is an actual struct from RuntimeConfigStore, we can make our own one. Yeah we have stuff like num_block_producer_seats_per_shard - but we can use a moment and deprecate it as we had to long time ago!

Then, we could make it a command like neard fork-network init --overrides LIST where LIST is "74 num_block_producer_seats 3" "75 num_chunk_validator_seats = 22" "76 num_chunk_producer_seats 3" (I don't know how to pass a list in bash).

Let me know if you have questions. I think it's quite crucial to discuss to make forknet more reliable.

github-merge-queue bot pushed a commit that referenced this pull request Nov 13, 2024
…2420)

When setting up a forknet, the `neard fork-network set-validators`
command is run after `neard init`. If `init` was executed with
`--dump-epoch-config`, the `epoch_configs` directory is populated with
the `mainnet` epoch configs.

Additionally, the mirror tool now includes an option to provide
overrides to the epoch configurations.

 ### For further details, check this PR (#12421).

If epoch config files are present, they will be utilized to create the
genesis configuration for the fork network.
@VanBarbascu
Copy link
Contributor Author

have a system design concern that we are building separate logic to handle epoch configs in Python.

This feels more fragile than handling most of it in Rust nearcore code, which already knows EpochConfig structure. If we change EpochConfig, forknet will silently break and we'd need to find a fix an issue.

Thanks for taking the time to look over this!

The logic introduced here only depends on the fact that EpochConfig is represented as a JSON. All the overrides are provided by the user that has the knowledge of the EpochConfig structure. So changing the EpochConfig struct will not break forknet.

@Longarithm
Copy link
Member

All the overrides are provided by the user that has the knowledge of the EpochConfig structure.

I think it's also reasonable to have a goal of running forknet in a single command.
With that in mind, forknet init is likely to be scripted, and I expect init to be hidden from the end user.
Another case: I easily see some parameters as overridden in the same way all the time (num_block|chunk_producer_seats = 10, 20... if we want chunk validator-only seats), and again, I think the end user wouldn't like to always have that in mind :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants