Skip to content

Latest commit

 

History

History
163 lines (140 loc) · 6.12 KB

NOTES.md

File metadata and controls

163 lines (140 loc) · 6.12 KB

= Notes / To Do

== Common

== Per Service

  • Check packages installed
  • Check services running
  • Check configuration files - at least for basic things
  • Check iptables / firewalld rules
  • Verify connection from string in config to DB
    • Verify DB has tables
  • Confirm user/pw auth thru keystone
    • Confirm endpoints?
    • Confirm service?
  • Test main list functionality (As shown below in basic functionality test)
  • Check number of workers per component? Like number of api_workers for glance?
    == High Level Functionality
  • Add ceph node checks
  • Add ceph backend checks (Verify pools, auth, etc)
  • Add compute node checks
  • Add block storage node checks
  • Add swift storage node checks
  • Add/Document some way to enable/disable checks. No need to check cinder backup if its not configured... or swift...
  • Check for deleted objects in databases?
    • Check for cron jobs where appropriate?
  • Detect which openstack version somehow... RPM name?

== Heat

  • Simplifying nested stack troubleshooting with Director

== Ceilometer

== Neutron

== Nova

  • Look for 'XXX' in nova-manage service list
  • Create a playbook that tracks an instance across logs. (Basically, by UUID see all log messages for an instance?)

== Cinder

  • Verify api and scheduler are up
  • Verify there is space left on the backing storage (if possible)
  • LVM backend
    • In logs: volume group cinder-volumes does not exist
    • In logs: AMQP server on ... ECONNREFUSED. Trying again in xx seconds.

== Glance

  • ADD IPTABLES CHECKS? - For example... 9292 for glance api
    • 9191 for glance registry
    • But no external communication needed to glance registry if colocated with api
  • VALIDATE BACKEND CHECKS
  • Finish backend Checks (Note likely different from kilo to liberty to mitaka)
    • RBD
      • Validate ceph auth correct (Capture what the log should look like if this is not)
      • Validate settings in conf file
    • File (NFS)
      • Validate settings in conf file
      • Validate write permissions on backend
      • Validate mounts on correct servers
    • Swift?
    • Cinder?

== Swift

== RabbitMQ

  • In individual modules, the rabbit connection check doesn't print an 'OK' message if successful. Ideally it would so you know it was checked.
  • Make certain the config specifies rabbit and not another backend
  • Check for AMQP connections for applicable services
  • check that management plugin enabled
    • rabbitmq-plugins enable rabbitmq_management (on port 15672)
  • Watch queue levels. Set alerts on queues backing up
  • rabbitmqctl status
    • File Descriptors (Total + Sockets)
    • Processes - limit vs used
  • Check nodes / cluster partition
  • Check number of messages_ready in the queue (via localhost:15672/api/queues) and alert if > a specific number... Suggesting the queue is backing up?

== Keystone

  • Validate users in correct roles? (member or admin?)
  • Validate users belong to projects?

== HAProxy

  • Check maxconn?

== Galera / MariaDB

== Pacemaker

  • Some help troubleshooting would be nice. For instance, when I tried manually removing pacemaker remote all of a sudden my cluster crashed and wouldn't restart. In this case, answers were in /var/log/messages

== HA tests

  • Check mysql data replication
  • verify amount of tables in DB is same on each node
  • Check galera envionment state
  • RabbitMQ availability
  • RabbitMQ replication
  • check pacemaker status / health

== Functional Tests / Smoke Test

  • Make this optional or a separate role
  • Run 'list' commands per service
  • Check that required services are running (and enabled via systemd or pacemaker)
  • Functional tests (like cinder create/delete, glance image-create, download, etc) - A minimal subset of what Tempest can test
  • Glance Functional Tests Example

== Create testing framework (so we can validate all the checks work) Thoughts of this...

  • Glance
    • Process not running
    • DB auth incorrect
    • DB not initialized
    • Rabbit auth incorrect
    • Create RBD test case (ceph auth incorrect)
    • Create RBD test case (no pool defined)
    • Create RBD test case (can't reach storage network)
    • Create NFS test case (failing perms)
    • Create NFS test case (not mounted)
    • Create NFS test case (can't reach storage network)
  • Create tests
    • playbooks to break the deployment
    • Validate the break
    • Maybe even do a random break that people can use for training purposes outside of using the playbooks to detect problems
  • Potential deployment test scenarios
    • All in one
    • Distributed HA (standard 3-node control plane)
    • Fully distributed components (maybe achieved via Kolla)
      • DB not on same nodes as APIs
      • RabbitMQ not on same nodes as APIs

Nova compute error message: libvirtError: invalid argument: could not find capabilities for domaintype=kvm

== Longer term

  • Test against Kolla - Some things like systemctl may have to change
  • Test aginst containerized compute