QSFS: Performance numbers are not matching the requirements #24

Open
mohamedamer453 opened this issue May 24, 2022 · 5 comments
@mohamedamer453

mohamedamer453 commented May 24, 2022

According to TC343 & REQ177, the minimum performance for big files (1 GB) should be 100 MB/s, but the actual figure is much lower.

[screenshot: benchmark output for 1 GB files]

According to TC344 & REQ178, the minimum performance for mid files (1 MB) should be 100 MB/s, but the actual figure is lower.

[screenshot: benchmark output for 1 MB files]

According to TC345 & REQ179, the minimum performance for small files (1 KB) should be 1 MB/s, but the actual figure is lower.

[screenshot: benchmark output for 1 KB files]
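
For reference, a minimal way to measure raw write throughput to the QSFS mount looks like the following (a sketch, not necessarily the procedure used for the screenshots above; it assumes the /qsfs mount point from the config below and reads from /dev/zero so the data source itself is not the bottleneck):

dd if=/dev/zero of=/qsfs/big.bin bs=1M count=1024 conv=fsync   # ~1 GB file
dd if=/dev/zero of=/qsfs/mid.bin bs=1M count=1 conv=fsync      # 1 MB file
dd if=/dev/zero of=/qsfs/small.bin bs=1K count=1 conv=fsync    # 1 KB file

conv=fsync makes dd flush the data before reporting the rate, so the number reflects QSFS rather than the page cache.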

TC346 & REQ180 state that it should be possible to create 10 million small files on a single QSFS, but when I tried to do that with the following script, I lost connection to the VM.

for i in {0..10000000}; do dd if=/dev/urandom of="File$(printf "%03d" "$i").txt" bs=1K count=1; done;

[screenshot: 2022-05-23_16-19]

However, when the number was changed to 1 million instead of 10 million, I was able to create the files, as can be seen in the metrics.

[screenshot: 2022-05-24_12-46]

I then tried to create another 1 million files after the first million were done; it didn't crash or lose the connection, but the process was much slower than for the first million.

The same issues also apply to the test cases and requirements TC347, TC348, TC349 & REQ181, REQ182, REQ183. These are the same scenarios as above but with S3 (MinIO), and the results are very similar to the previous scenarios.

[screenshot: 2022-05-23_16-14]

[screenshot: 2022-05-23_16-12]

Config

  • main.tf
terraform {
  required_providers {
    grid = {
      source = "threefoldtech/grid"
    }
  }
}

provider "grid" {
}

locals {
  metas = ["meta1", "meta2", "meta3", "meta4"]
  datas = ["data1", "data2", "data3", "data4",
  "data5", "data6", "data7", "data8",
  "data9", "data10", "data11", "data12",
  "data13", "data14", "data15", "data16",
  "data17", "data18", "data19", "data20",
  "data21", "data22", "data23", "data24"]
}

resource "grid_network" "net1" {
    nodes = [7]
    ip_range = "10.1.0.0/16"
    name = "network"
    description = "newer network"
}

resource "grid_deployment" "d1" {
    node = 7
    dynamic "zdbs" {
        for_each = local.metas
        content {
            name = zdbs.value
            description = "description"
            password = "password"
            size = 10
            mode = "user"
        }
    }
    dynamic "zdbs" {
        for_each = local.datas
        content {
            name = zdbs.value
            description = "description"
            password = "password"
            size = 1
            mode = "seq"
        }
    }
}

resource "grid_deployment" "qsfs" {
  node = 7
  network_name = grid_network.net1.name
  ip_range = lookup(grid_network.net1.nodes_ip_range, 7, "")
  qsfs {
    name = "qsfs"
    description = "description6"
    cache = 10240 # 10 GB
    minimal_shards = 16
    expected_shards = 20
    redundant_groups = 0
    redundant_nodes = 0
    max_zdb_data_dir_size = 512 # 512 MB
    encryption_algorithm = "AES"
    encryption_key = "4d778ba3216e4da4231540c92a55f06157cabba802f9b68fb0f78375d2e825af"
    compression_algorithm = "snappy"
    metadata {
      type = "zdb"
      prefix = "hamada"
      encryption_algorithm = "AES"
      encryption_key = "4d778ba3216e4da4231540c92a55f06157cabba802f9b68fb0f78375d2e825af"
      dynamic "backends" {
          for_each = [for zdb in grid_deployment.d1.zdbs : zdb if zdb.mode != "seq"]
          content {
              address = format("[%s]:%d", backends.value.ips[1], backends.value.port)
              namespace = backends.value.namespace
              password = backends.value.password
          }
      }
    }
    groups {
      dynamic "backends" {
          for_each = [for zdb in grid_deployment.d1.zdbs : zdb if zdb.mode == "seq"]
          content {
              address = format("[%s]:%d", backends.value.ips[1], backends.value.port)
              namespace = backends.value.namespace
              password = backends.value.password
          }
      }
    }
  }


  vms {
    name = "vm"
    flist = "https://hub.grid.tf/tf-official-apps/threefoldtech-ubuntu-20.04.flist"
    cpu = 2
    memory = 1024
    entrypoint = "/init.sh"
    planetary = true
    env_vars = {
      SSH_KEY = "ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQC533B35CELELtgg2d7Tsi5KelLxR0FYUlrcTmRRQuTNP9arP01JYD8iHKqh6naMbbzR8+M0gdPEeRK4oVqQtEcH1C47vLyRI/4DqahAE2nTW08wtJM5uiIvcQ9H2HMzZ3MXYWWlgyHMgW2QXQxzrRS0NXvsY+4wxe97MMZs9MDs+d+X15DfG6JffjMHydi+4tHB50WmHe5tFscBFxLbgDBUxNGiwi3BQc1nWIuYwMMV1GFwT3ndyLAp19KPkEa/dffiqLdzkgs2qpXtfBhTZ/lFeQRc60DHCMWExr9ySDbavIMuBFylf/ZQeJXm9dFXJN7bBTbflZIIuUMjmrI7cU5eSuZqAj5l+Yb1mLN8ljmKSIM3/tkKbzXNH5AUtRVKTn+aEPvJAEYtserAxAP5pjy6nmegn0UerEE3DWEV2kqDig3aPSNhi9WSCykvG2tz7DIr0UP6qEIWYMC/5OisnSGj8w8dAjyxS9B0Jlx7DEmqPDNBqp8UcwV75Cot8vtIac= root@mohamed-Inspiron-3576"
    }
    mounts {
        disk_name = "qsfs"
        mount_point = "/qsfs"
    }
  }
}
output "metrics" {
    value = grid_deployment.qsfs.qsfs[0].metrics_endpoint
}
output "ygg_ip" {
    value = grid_deployment.qsfs.vms[0].ygg_ip
}
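
The deployment itself is brought up with the standard Terraform workflow, and the two outputs above expose the QSFS metrics endpoint and the VM's planetary IP (a sketch; terraform output -raw needs Terraform 0.14+):

terraform init
terraform apply
terraform output -raw metrics   # QSFS/zstor metrics endpoint defined above
terraform output -raw ygg_ip    # planetary (Yggdrasil) IP used to SSH into the VM

The metrics endpoint can be polled while the file-creation tests run to watch progress.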

ramezsaeed transferred this issue from threefoldtech/test_feedback on Jun 1, 2022
@maxux
Collaborator

maxux commented Jun 1, 2022

Can you first check how fast urandom is? It can be slow.

dd if=/dev/urandom of=/dev/null bs=1M count=1000

Can you show how zdbfs is started?

About the 10 million files crash, I'll open an issue on zdbfs and see if I can reproduce it.

For the slowdown, this can happen because adding 1 million files to a single directory is a really bad thing to do, even on a real filesystem. I have already noticed a huge drop after reaching some point but could not reproduce it yet; I'll open an issue for that as well and will keep this thread notified.
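
As an illustration of the single-directory issue, a variant of the test script that spreads the files over, say, 1000 subdirectories could look like this (a sketch with arbitrary directory names; it also reads from /dev/zero so data generation is not the bottleneck):

cd /qsfs
for d in $(seq 0 999); do mkdir -p "dir$d"; done   # pre-create the buckets
for ((i = 0; i < 10000000; i++)); do
    dd if=/dev/zero of="dir$((i % 1000))/File$i.txt" bs=1K count=1 status=none
done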

@maxux
Collaborator

maxux commented Jun 1, 2022

After investigating and some debugging, I think the 10 million files problem you had (connection closed or VM killed) is an out-of-memory issue and not a qsfs/zdbfs issue. When inserting a lot of files into the same directory, memory usage grows (800+ MB in my test), which is probably over the limit you allowed.

I have 7+ million files in a single directory without a crash (but it's slow).

I'm looking into improving that memory usage.

@mohamedamer453
Author

So for the performance numbers, urandom was indeed slow; I tried testing with the command you mentioned and got much faster results.

dd if=/dev/urandom of=/dev/null bs=1M count=1000
  • For large files (1 GB) the speed averaged around 190 MB/s.

    [screenshot]

  • For medium files (1 MB) the speed averaged around 160 MB/s.

    [screenshot]

  • For small files (1 KB) the speed averaged around 1 MB/s.

    [screenshot]

Can you show how zdbfs is started?

zdbfs was started as part of a QSFS deployment from the Terraform grid provider; the full setup can be seen in the included main.tf.

@maxux
Collaborator

maxux commented Jun 2, 2022

You misunderstood the test regarding urandom: the command I asked for was just to ensure you can reach at least 100 MB/s by reading urandom (which seems to be the case). Since that command writes to /dev/null, it never touches qsfs; it was just a sanity check. Thanks :p

For the memory usage, I guess /qsfs mounted in the VM means zdbfs is running inside the VM.
You can confirm by executing ps aux | grep zdbfs. If that's true, your VM has only 1 GB of memory, which could be the issue for 10 million files in a single directory. Try increasing the VM memory to 4 GB or 8 GB and see if you can reproduce the crash :)
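
To check whether the OOM killer is indeed involved, something like the following can be run inside the VM (a sketch; exact log wording varies by kernel):

free -m                                                      # total/used memory in the VM
ps aux | grep '[z]dbfs'                                      # the [z] avoids matching the grep process itself
dmesg | grep -iE 'out of memory|oom-killer|killed process'   # traces left by the OOM killer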

@mohamedamer453
Author

You misunderstood the test regarding urandom: the command I asked for was just to ensure you can reach at least 100 MB/s by reading urandom (which seems to be the case). Since that command writes to /dev/null, it never touches qsfs; it was just a sanity check. Thanks :p

Yep, my bad :D I got confused there for a second.

For the memory usage, I guess /qsfs mounted in the VM have zdbfs running inside the VM. You can confirm by executing ps aux | grep zdbfs. If that's true, your VM have only 1 GB of memory, which could be the issue for 10 millions files in a single directory. Try to increase VM memory to like 4 GB or 8 GB and see if you can reproduce the crash :)

Indeed, /qsfs was mounted in the VM.

[screenshot]

and the result of executing ps aux | grep zdbfs in the VM is:

root@vm:~# ps aux | grep zdbfs
root       172  1.0  0.0   5196  1508 pts/0    S+   10:24   0:00 grep --color=auto zdbfs

After increasing the memory to 4 GB, the command for the 10 million files didn't crash the VM, and the metrics show the files being created, but I'm still not sure whether it will be able to complete the process of creating all 10 million files or not.

LeeSmet transferred this issue from threefoldtech/0-stor_v2 on Nov 17, 2022