
tests: describe and implement tests suite #7

Open
maxux opened this issue Apr 6, 2021 · 5 comments


maxux commented Apr 6, 2021

To get stability and reliability, we need tests.

  • We need to define and describe what we expect and what we support
  • We need to write test scenarios that exercise the expected behavior

Ideally this is automated with GitHub Actions here.

maxux added this to the next milestone Apr 6, 2021

LeeSmet commented Apr 6, 2021

I'd like to have at least a full integration test, looking a bit like this:

  • Set up everything: some 0-dbs (can be local), metastor, configs, monitor, and 0-db-fs
    • Limit the size of the zdbfs-data dir to 5GiB in the monitor config
  • Write 50 GiB worth of data on the filesystem (a sketch of this step follows after the list)
    • Monitor the size of the local 0-db data namespace, and check that it stays within the configured 5 GiB limit
    • Until this feature is implemented, the size will grow larger but should eventually settle below 5 GiB
  • Wait for 15 or so minutes after everything is written (so 0-db can upload the dirty data file if there is one)
  • Simulate a node crash by killing the filesystem itself (0-db-fs), and removing the backing 0-db (the one used by 0-db-fs directly). Also kill the monitor
  • Set up the local 0-db again (simulate setup on a new node)
  • Start 0-db-fs, verify that it has a clean (i.e. fs is empty) state
  • Start monitor
    • Monitor should rebuild the index files of 0-db-fs, verify this
  • Verify that all files can be seen as expected
  • Verify that the content of the files is readable and correct (md5sum?)
  • Simulate a backend data storage loss (within the limits of the configured redundancy) by killing one of the 0-dbs serving as storage backend
  • Start and add a new 0-db to the zstor config
  • Check and verify that data stored on the removed 0-db is rebuilt on the live ones
  • Wait for rebuild to finish
  • Simulate further storage node loss, such that all data should still be readable according to the new zstor config (after the rebuild), but not according to the old one. Example: if redundancy in zstor is set to 2 data and 1 parity shards, the previous step removed one 0-db and added a new one. This step now removes another one of the ORIGINAL 0-dbs, so the data ends up on one original 0-db plus the new one. That still leaves 2 live shards, sufficient to keep all data readable; however, only 1 original 0-db remains, so if the repair somehow failed the data would be lost.
  • Again verify that all data is still correct and reachable
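
As an illustration of the write-and-verify part, here is a minimal bash sketch. The mount point and data directory paths are assumptions; adjust them to the actual setup:

```bash
#!/usr/bin/env bash
# Sketch only: write 50 GiB, watch the local data dir size, and record checksums
# so the post-recovery steps can verify content with `md5sum -c`.
set -euo pipefail

ZDBFS_MOUNT=/mnt/zdbfs          # 0-db-fs mount point (assumption)
ZDBFS_DATA=/var/lib/zdbfs-data  # local 0-db data namespace dir (assumption)
LIMIT_GIB=5

# Background watcher: log the data dir size every 30s; it may overshoot,
# but should eventually settle below the configured limit.
(
  while sleep 30; do
    size=$(du -s --block-size=1G "$ZDBFS_DATA" | cut -f1)
    echo "$(date -Is) zdbfs-data: ${size} GiB (limit ${LIMIT_GIB} GiB)"
  done
) &
watcher=$!

# Write 50 files of 1 GiB each and record their checksums.
for i in $(seq 1 50); do
  dd if=/dev/urandom of="$ZDBFS_MOUNT/file-$i" bs=1M count=1024 status=none
done
( cd "$ZDBFS_MOUNT" && md5sum file-* ) > checksums.md5

kill "$watcher"

# After the crash/recovery steps: (cd "$ZDBFS_MOUNT" && md5sum -c) < checksums.md5
```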

The above should be able to test all essential features already, and in a reasonably isolated way (i.e. if there is an issue, we should easily be able to identify what is not working based on which part of the test fails). For data, about 50 GiB should be sufficient to get a good idea of performance and stability. It would also be interesting to see if the solution can handle different data shapes (a rough sketch for generating them follows after the list), such as:

  • a single 50 GiB file
  • a large number of small files totalling 50 GiB when summed (small being a couple of KiB each)
  • something like a PostgreSQL database using 50 GiB of backing storage.
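
A rough sketch for generating the first two shapes; the mount-point variable and file counts are assumptions, and the small-file loop is intentionally naive:

```bash
ZDBFS_MOUNT=/mnt/zdbfs   # 0-db-fs mount point (assumption)

# 1. A single 50 GiB file.
dd if=/dev/urandom of="$ZDBFS_MOUNT/big.bin" bs=1M count=51200 status=progress

# 2. ~50 GiB as many small files (4 KiB each, 13,107,200 files in total).
mkdir -p "$ZDBFS_MOUNT/small"
for ((i = 1; i <= 13107200; i++)); do
  head -c 4096 /dev/urandom > "$ZDBFS_MOUNT/small/f-$i"
done

# 3. For the database shape, a PostgreSQL instance with its data directory on the
#    mount, filled via pgbench with a large scale factor, would be one option (not shown).
```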

For writing, we know there is a potential issue in the current flow, in that 0-stor can be called multiple times concurrently, which might lead to index corruption. We should test both slow writing (thereby avoiding this issue) and fast writing (so we can see the actual impact of this issue).
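
A sketch covering both write patterns; throttling via pv and the 8 GiB / 4x4 GiB sizes are assumptions, any rate-limiting and concurrency mechanism would do:

```bash
ZDBFS_MOUNT=/mnt/zdbfs   # 0-db-fs mount point (assumption)

# Slow writing: throttle the stream (pv -L caps throughput at ~10 MB/s) so each
# data file can be sealed and handed to zstor before the next one fills up.
head -c 8G /dev/urandom | pv -q -L 10m > "$ZDBFS_MOUNT/slow.bin"

# Fast writing: several concurrent writers, to provoke overlapping zstor invocations.
for i in 1 2 3 4; do
  dd if=/dev/urandom of="$ZDBFS_MOUNT/fast-$i.bin" bs=1M count=4096 status=none &
done
wait
```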

The above should already provide a large amount of data regarding stability and performance, and if this manages to work bug-free, I'd say we have a solid foundation / first alpha version.


maxux commented Apr 6, 2021

  • Simulate a node crash by killing the filesystem itself (0-db-fs), and removing the backing 0-db (the one used by 0-db-fs directly). Also kill the monitor
  • Set up the local 0-db again (simulate setup on a new node)
  • Start 0-db-fs, verify that it has a clean (i.e. fs is empty) state
  • Start monitor
    • Monitor should rebuild the index files of 0-db-fs, verify this
  • Verify that all files can be seen as expected

In this workflow, do you expect to start the filesystem on a fresh 0-db and then restore the 0-db content from backup?


LeeSmet commented Apr 6, 2021

Yeah, although the reverse (restore the index with the monitor first) should also work, right? We need to identify which one is best.

And indeed, start the filesystem on a completely fresh 0-db.


maxux commented Apr 9, 2021

I'm doing some local tests following these scenarios; first impressions:

  • Consistency seems good
    • 0-db data stays at ~5 GB; files are deleted and restored correctly on demand
    • Size goes up to 7 GB before the deletion of files kicks in
    • md5sums of files are consistent across multiple test runs
  • Performance:
    • Writing: 8388608000 bytes (8.4 GB, 7.8 GiB) copied, 156.055 s, 53.8 MB/s
    • Reading the 8 GB file at ~35 MB/s (with almost all parts fetched from the backend); see the reproduction sketch after the settings below

For comparison, with hooks disabled to keep everything local without any zstor usage:

  • Writing: 8388608000 bytes (8.4 GB, 7.8 GiB) copied, 133.118 s, 63.0 MB/s
  • Reading (md5sum): 8000M in 22 sec (~ 363 MB/s)

Settings:

  • Tests made using a remote zdb backend (same LAN, 1 Gbps link).
  • 0-db splits data files after 16 MB
  • Reading urandom to /dev/null goes at 236 MB/s
  • Writing the 8 GB file using a 32 MB data split writes at 106 MB/s
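
For reference, a sketch of how numbers like these can be reproduced; the exact dd invocation is an assumption, and dropping the page cache (to force reads from the backend) needs root:

```bash
ZDBFS_MOUNT=/mnt/zdbfs   # 0-db-fs mount point (assumption)

# Write throughput: 8000 MiB (8388608000 bytes) from urandom onto the filesystem.
dd if=/dev/urandom of="$ZDBFS_MOUNT/bench.bin" bs=1M count=8000

# Read throughput + integrity: flush and drop the page cache first, then checksum.
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null
time md5sum "$ZDBFS_MOUNT/bench.bin"
```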

Bugs:
A couple of times I hit a weird bug where zstor did not push data to zdb before zdb was restarted, but I could not reproduce it reliably yet.

scottyeager commented

I've done quite a bit of non-systematic testing of storing data, removing the zdb data files, and comparing checksums when the files are restored. My method for simulating node failure, in case it's helpful, is adding an iptables rule like the following, targeting the backend's IP and port:

sudo iptables -I OUTPUT -p tcp -d [insert IP] -j REJECT --reject-with tcp-reset --dport [insert port]

Then remove it when done with -D instead of -I.
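
A tiny hypothetical wrapper around that rule, with example IP/port values, to make the block/unblock steps symmetric:

```bash
# Block/unblock outgoing traffic to one backend 0-db to simulate it going down.
block_zdb()   { sudo iptables -I OUTPUT -p tcp -d "$1" --dport "$2" -j REJECT --reject-with tcp-reset; }
unblock_zdb() { sudo iptables -D OUTPUT -p tcp -d "$1" --dport "$2" -j REJECT --reject-with tcp-reset; }

block_zdb 10.0.0.12 9900     # "kill" the backend (example IP/port)
# ... run the test, wait for zstor to rebuild ...
unblock_zdb 10.0.0.12 9900   # bring it back
```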

From what I've seen, zstor reliably rebuilds data files with only two of the three used backend nodes available, using the default redundancy settings. I've played a bit with tweaking the redundancy settings and the results are as expected.

I also noticed in the latest version that zstor no longer waits on a timeout when a node it's trying to retrieve from is unreachable 🙂
