
tests: describe and implement tests suite #7

Open
maxux opened this issue Apr 6, 2021 · 5 comments


maxux commented Apr 6, 2021

To get stability and reliability, we need tests.

  • We need to define and describe what we expect and what we support
  • We need to write test scenarios that exercise the expected behavior

Ideally this is automated with GitHub Actions here.

maxux added this to the next milestone Apr 6, 2021

LeeSmet commented Apr 6, 2021

I'd like to have at least a full integration test, looking a bit like this:

  • Set up everything: some 0-dbs (can be local), metastor, configs, monitor, and 0-db-fs
    • Limit the size of the zdbfs-data dir to 5GiB in the monitor config
  • Write 50 GiB worth of data on the filesystem (a sketch of this step follows after the list)
    • Monitor the size of the local 0-db data namespace, and check that it stays within the configured 5 GiB limit
    • Until this feature is implemented, the size will grow larger but should eventually settle below 5 GiB
  • Wait for 15 or so minutes after everything is written (so 0-db can upload the dirty data file if there is one)
  • Simulate a node crash by killing the filesystem itself (0-db-fs), and removing the backing 0-db (the one used by 0-db-fs directly). Also kill the monitor
  • Set up the local 0-db again (simulate setup on a new node)
  • Start 0-db-fs, verify that it has a clean (i.e. fs is empty) state
  • Start monitor
    • Monitor should rebuild the index files of 0-db-fs, verify this
  • Verify that all files can be seen as expected
  • Verify that the content of the files is readable and correct (md5sum?)
  • Simulate a backend data storage loss (within the limits of the configured redundancy) by killing one of the 0-dbs serving as storage backend
  • Start and add a new 0-db to the zstor config
  • Check and verify that data stored on the removed 0-db is rebuilt on the live ones
  • Wait for rebuild to finish
  • Simulate further storage node loss, such that all data should still be readable according to the new zstor config (after the rebuild), but not according to the old one. Example: if redundancy in zstor is set to 2 data and 1 parity shards, the previous step removed one 0-db and added a new one. This step now removes another one of the ORIGINAL 0-dbs, so the data ends up on one original 0-db plus the new one. That still leaves 2 live shards, sufficient to keep all data readable; however, only 1 original 0-db remains, so if the repair somehow failed the data would be lost.
  • Again verify that all data is still correct and reachable
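
As an illustration of the write-and-verify part, here is a minimal bash sketch. The mount point and data directory paths are assumptions; adjust them to the actual setup:

```bash
#!/usr/bin/env bash
# Sketch only: write 50 GiB, watch the local data dir size, and record checksums
# so the post-recovery steps can verify content with `md5sum -c`.
set -euo pipefail

ZDBFS_MOUNT=/mnt/zdbfs          # 0-db-fs mount point (assumption)
ZDBFS_DATA=/var/lib/zdbfs-data  # local 0-db data namespace dir (assumption)
LIMIT_GIB=5

# Background watcher: log the data dir size every 30s; it may overshoot,
# but should eventually settle below the configured limit.
(
  while sleep 30; do
    size=$(du -s --block-size=1G "$ZDBFS_DATA" | cut -f1)
    echo "$(date -Is) zdbfs-data: ${size} GiB (limit ${LIMIT_GIB} GiB)"
  done
) &
watcher=$!

# Write 50 files of 1 GiB each and record their checksums.
for i in $(seq 1 50); do
  dd if=/dev/urandom of="$ZDBFS_MOUNT/file-$i" bs=1M count=1024 status=none
done
( cd "$ZDBFS_MOUNT" && md5sum file-* ) > checksums.md5

kill "$watcher"

# After the crash/recovery steps: (cd "$ZDBFS_MOUNT" && md5sum -c) < checksums.md5
```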

The above should be able to test all essential features already, and in a reasonably isolated way (i.e. if there is an issue, we should easily be able to identify what is not working based on which part of the test fails). For data, about 50 GiB should be sufficient to get a good idea of performance and stability. It would also be interesting to see if the solution can handle different data shapes (a rough sketch for generating them follows after the list), such as:

  • a single 50 GiB file
  • a large number of small files totalling 50 GiB when summed (small being a couple of KiB each)
  • something like a PostgreSQL database using 50 GiB of backing storage.
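
A rough sketch for generating the first two shapes; the mount-point variable and file counts are assumptions, and the small-file loop is intentionally naive:

```bash
ZDBFS_MOUNT=/mnt/zdbfs   # 0-db-fs mount point (assumption)

# 1. A single 50 GiB file.
dd if=/dev/urandom of="$ZDBFS_MOUNT/big.bin" bs=1M count=51200 status=progress

# 2. ~50 GiB as many small files (4 KiB each, 13,107,200 files in total).
mkdir -p "$ZDBFS_MOUNT/small"
for ((i = 1; i <= 13107200; i++)); do
  head -c 4096 /dev/urandom > "$ZDBFS_MOUNT/small/f-$i"
done

# 3. For the database shape, a PostgreSQL instance with its data directory on the
#    mount, filled via pgbench with a large scale factor, would be one option (not shown).
```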

For writing, we know there is a potential issue in the current flow, in that 0-stor can be called multiple times concurrently, which might lead to index corruption. We should test both slow writing (thereby avoiding this issue) and fast writing (so we can see the actual impact of this issue).
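
A sketch covering both write patterns; throttling via pv and the 8 GiB / 4x4 GiB sizes are assumptions, any rate-limiting and concurrency mechanism would do:

```bash
ZDBFS_MOUNT=/mnt/zdbfs   # 0-db-fs mount point (assumption)

# Slow writing: throttle the stream (pv -L caps throughput at ~10 MB/s) so each
# data file can be sealed and handed to zstor before the next one fills up.
head -c 8G /dev/urandom | pv -q -L 10m > "$ZDBFS_MOUNT/slow.bin"

# Fast writing: several concurrent writers, to provoke overlapping zstor invocations.
for i in 1 2 3 4; do
  dd if=/dev/urandom of="$ZDBFS_MOUNT/fast-$i.bin" bs=1M count=4096 status=none &
done
wait
```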

The above should already provide a large amount of data regarding stability and performance, and if this manages to work bug-free, I'd say we have a solid foundation / first alpha version.


maxux commented Apr 6, 2021

  • Simulate a node crash by killing the filesystem itself (0-db-fs), and removing the backing 0-db (the one used by 0-db-fs directly). Also kill the monitor
  • Set up the local 0-db again (simulate setup on a new node)
  • Start 0-db-fs, verify that it has a clean (i.e. fs is empty) state
  • Start monitor
    • Monitor should rebuild the index files of 0-db-fs, verify this
  • Verify that all files can be seen as expected

In this workflow, do you expect to start the filesystem on a fresh 0-db and then restore the 0-db content from backup?


LeeSmet commented Apr 6, 2021

Yeah, although the reverse (restore the index with the monitor first) should also work, right? We need to identify which one is best.

And indeed, start the filesystem on a completely fresh 0-db.


maxux commented Apr 9, 2021

I'm doing some local tests following these scenarios; first impressions:

  • Consistency seems good
    • 0-db data stays at ~5 GB; files are deleted and restored correctly on demand
    • Size goes up to 7 GB before the deletion of files kicks in
    • md5sums of files are consistent across multiple test runs
  • Performance:
    • Writing: 8388608000 bytes (8.4 GB, 7.8 GiB) copied, 156.055 s, 53.8 MB/s
    • Reading the 8 GB file at ~35 MB/s (with almost all parts fetched from the backend); see the reproduction sketch after the settings below

For comparison, with hooks disabled to keep everything local without any zstor usage:

  • Writing: 8388608000 bytes (8.4 GB, 7.8 GiB) copied, 133.118 s, 63.0 MB/s
  • Reading (md5sum): 8000M in 22 sec (~ 363 MB/s)

Settings:

  • Tests made using a remote zdb backend (same LAN, 1 Gbps link).
  • 0-db splits data files after 16 MB
  • Reading urandom to /dev/null goes at 236 MB/s
  • Writing the 8 GB file using a 32 MB data split writes at 106 MB/s
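
For reference, a sketch of how numbers like these can be reproduced; the exact dd invocation is an assumption, and dropping the page cache (to force reads from the backend) needs root:

```bash
ZDBFS_MOUNT=/mnt/zdbfs   # 0-db-fs mount point (assumption)

# Write throughput: 8000 MiB (8388608000 bytes) from urandom onto the filesystem.
dd if=/dev/urandom of="$ZDBFS_MOUNT/bench.bin" bs=1M count=8000

# Read throughput + integrity: flush and drop the page cache first, then checksum.
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null
time md5sum "$ZDBFS_MOUNT/bench.bin"
```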

Bugs:
A couple of times I hit a weird bug where zstor did not push data to zdb before zdb was restarted, but I could not reproduce it reliably yet.

scottyeager commented

I've done quite a bit of non-systematic testing of storing data, removing the zdb data files, and comparing checksums when the files are restored. My method for simulating node failure, in case it's helpful, is adding an iptables rule like the following, targeting the backend's IP and port:

sudo iptables -I OUTPUT -p tcp -d [insert IP] -j REJECT --reject-with tcp-reset --dport [insert port]

Then remove it when done with -D instead of -I.
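
A tiny hypothetical wrapper around that rule, with example IP/port values, to make the block/unblock steps symmetric:

```bash
# Block/unblock outgoing traffic to one backend 0-db to simulate it going down.
block_zdb()   { sudo iptables -I OUTPUT -p tcp -d "$1" --dport "$2" -j REJECT --reject-with tcp-reset; }
unblock_zdb() { sudo iptables -D OUTPUT -p tcp -d "$1" --dport "$2" -j REJECT --reject-with tcp-reset; }

block_zdb 10.0.0.12 9900     # "kill" the backend (example IP/port)
# ... run the test, wait for zstor to rebuild ...
unblock_zdb 10.0.0.12 9900   # bring it back
```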

From what I've seen, zstor reliably rebuilds data files with only two of the three used backend nodes available, using the default redundancy settings. I've played a bit with tweaking the redundancy settings and the results are as expected.

I also noticed in the latest version that zstor no longer waits on a timeout when a node it's trying to retrieve from is unreachable 🙂
