Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merge master into feature/pool-licensing #6234

Conversation

changlei-li
Copy link
Contributor

@changlei-li changlei-li commented Jan 17, 2025

No conflict
$ git show
commit 939a3da (HEAD -> private/changleli/pool-licensing, github/private/changleli/pool-licensing)
Merge: 47df335 43d01ca
Author: Changlei Li [email protected]
Date: Fri Jan 17 13:53:31 2025 +0800

Merge remote-tracking branch 'github/master' into private/changleli/pool-licensing

psafont and others added 30 commits December 19, 2024 10:51
Historically parallel operations were run in tasks as part of serial
operations, and serial tasks were not run as part of parallel ones. This
changed recently, causing some timeouts that did not happen before. To
mitigate this issue, now the timeouts for tasks are 20 minutes per single
serialized operation, instead of 20 minutes per task.

Signed-off-by: Pau Ruiz Safont <[email protected]>
The latest changes in the metrics daemon changes how archived and live metrics
are synchronised, and archived sources are created less often. This meant that
on some cases, like SR.forget, the operations failed because they could query
for existing sources, but the call to remove them failed, because the metrics
only exist in live form, not archived one.

(what happens with the live one, do they disappear when the SR is forgotten?)

Signed-off-by: Pau Ruiz Safont <[email protected]>
Also removes some traversals when reading from XML data

Signed-off-by: Pau Ruiz Safont <[email protected]>
This has two advantages:
1. Always non-negative: they represent absolute differences in time
2. Forces users to define the units of time, allowing to read the time in
   minutes, when appropriate

Signed-off-by: Pau Ruiz Safont <[email protected]>
The xapi on a supporter host would restart when it received HTTP error
from the xapi on the coordinator host.

This breaks the pool.designate_new_master use case for a big pool, e.g.
64-host pool. In this case, some supporters may restart unexpectedly
within the phase of committing new coordinator due to the logic above.

Additionally, the purpose of this logic, explained by the error message,
is not correct also. Not all HTTP errors are caused by "our master
address is wrong".

On the other hand, if a use case requires to restart the xapi, an more
explicit logic should ensure that, instead of leveraging an implicit
HTTP error code. Furhtermore, if a supporter indeed is connecting to a
wrong coordinator, this should be a bug and can be recovered manually.

Based on above arguments, the restarting xapi after receiving HTTP error
is removed. This follows the TODO concluded in CA-36936 as well.

Signed-off-by: Ming Lu <[email protected]>
Signed-off-by: Ming Lu <[email protected]>
FCoE support will be removed from XS9 and dom0 will no longer
provide fcoe_driver. However, XS8 still actively support it.

Xapi checks the existence of fcoe_driver, and thinks all PIFs
no longer support FCoE if fcoe_driver not found

Signed-off-by: Lin Liu <[email protected]>
…project#6192)

Historically parallel operations were run in tasks as part of serial
operations, and serial tasks were not run as part of parallel ones. This
changed recently, causing some timeouts that did not happen before. To
mitigate this issue, now the timeouts for tasks are 20 minutes per
single
serialized operation, instead of 20 minutes per task.

Passed one of the VDI / VM scalibility tests that previously failed
reliably: Job 4182581
…ect#6201)

The xapi on a supporter host would restart when it received HTTP error
from the xapi on the coordinator host.

This breaks the pool.designate_new_master use case for a big pool, e.g.
64-host pool. In this case, some supporters may restart unexpectedly
within the phase of committing new coordinator due to the logic above.

Additionally, the purpose of this logic, explained by the error message,
is not correct also. Not all HTTP errors are caused by "our master
address is wrong".

On the other hand, if a use case requires to restart the xapi, an more
explicit logic should ensure that, instead of leveraging an implicit
HTTP error code. Furhtermore, if a supporter indeed is connecting to a
wrong coordinator, this should be a bug and can be recovered manually.

Based on above arguments, the restarting xapi after receiving HTTP error
is removed. This follows the TODO concluded in CA-36936 as well.
Minor style.
"uuid" is always converted to string, avoid doing it every time it's used.

Signed-off-by: Frediano Ziglio <[email protected]>
Minor style.

`uuid` is always converted to string, avoid doing it every time it's
used.
…project#6199)

The latest changes in the metrics daemon changes how archived and live
metrics
are synchronised, and archived sources are created less often. This
meant that
on some cases, like SR.forget, the operations failed because they could
query
for existing sources, but the call to remove them failed, because the
metrics
only exist in live form, not archived one.

(what happens with the live one, do they disappear when the SR is
forgotten?)

Signed-off-by: Pau Ruiz Safont <[email protected]>
The drawback of fuser is that it gets too many things involved. E.g. it
is observed that it got stuck on cifs kernel module.

This change uses a cleaner way to remember the stunnel client proxy.
Even when the xapi restarted unexpectedly, it can stop the remnant
stunnel proxy and start a new one.

Signed-off-by: Ming Lu <[email protected]>
Making it configurable can avoid the situation when the port conflicts
with others, e.g. an external program from users.

Signed-off-by: Ming Lu <[email protected]>
Helpful for debugging pem-parsing issues

Signed-off-by: Pau Ruiz Safont <[email protected]>
…rtificates

This is needed in order to be compliant with RFC 7468:
https://datatracker.ietf.org/doc/html/rfc7468#section-2

Data before the encapsulation boundaries are permitted, and parsers MUST NOT
malfunction when processing such data.

Signed-off-by: Pau Ruiz Safont <[email protected]>
- Add logs when fcoe_driver does not exist
- Use List.assoc_opt instead of try catch
- Add __FUNCTION__ to the logs

Signed-off-by: Lin Liu <[email protected]>
…rtificates (xapi-project#6207)

CA-404236, gencert: when parsing pems, ignore data between key and
crrtificates

This is needed in order to be compliant with RFC 7468:
https://datatracker.ietf.org/doc/html/rfc7468#section-2

Data before the encapsulation boundaries are permitted, and parsers MUST
NOT
malfunction when processing such data.

Signed-off-by: Pau Ruiz Safont <[email protected]>
Co-authored-by: Edwin Török <[email protected]>
Signed-off-by: Vincent Liu <[email protected]>
Signed-off-by: Konstantina Chremmou <[email protected]>
- Moved certificate methods to the Connect-XenServer cmdlet and refactored them to avoid multiple loads of the global variable KnownServerCertificatesFilePath.
- Fixed accessibility of CommonCmdletFunctions members.

Signed-off-by: Konstantina Chremmou <[email protected]>
… version is known.

Signed-off-by: Konstantina Chremmou <[email protected]>
…oject#6202)

FCoE support will be removed from XS9 and dom0 will no longer provide
fcoe_driver. However, XS8 still actively support it.

Xapi checks the existence of fcoe_driver, and thinks all PIFs no longer
support FCoE if fcoe_driver not found
Christian Lindig and others added 27 commits January 8, 2025 16:07
Be explicit about the file system of an update ISO. Remove dead code.

Signed-off-by: Christian Lindig <[email protected]>
rrd2csv itself prints out a list of comma-separated metrics names, but could't
accept this list as command line arguments:

```
$ rrd2csv
.....memory_total_kib, memory_free_kib, ....

$ rrd2csv memory_total_kib, memory_free_kib
WARNING: Requested metric AVERAGE:host:05e817e2-3a65-484d-b0da-a7163f9ffc12:memory_total_kib,
is disabled or non-existant
timestamp, AVERAGE:host:05e817e2-3a65-484d-b0da-a7163f9ffc12:memory_free_kib
2025-01-07T14:06:45Z, 30042000
```

Now this works just fine:
```
$ rrd2csv memory_total_kib, memory_free_kib
timestamp, AVERAGE:host:92bc3b1e-e0a3-49ba-8994-fc305ff882b7:memory_total_kib, AVERAGE:host:92bc3b1e-e0a3-49ba-8994-fc305ff882b7:memory_free_kib
2025-01-07T15:04:50Z, 33350000, 30023000
```

Signed-off-by: Andrii Sultanov <[email protected]>
rrd2csv itself prints out a list of comma-separated metrics names, but
could't accept this list as command line arguments:

```
$ rrd2csv
.....memory_total_kib, memory_free_kib, ....

$ rrd2csv memory_total_kib, memory_free_kib
WARNING: Requested metric AVERAGE:host:05e817e2-3a65-484d-b0da-a7163f9ffc12:memory_total_kib,
is disabled or non-existant
timestamp, AVERAGE:host:05e817e2-3a65-484d-b0da-a7163f9ffc12:memory_free_kib
2025-01-07T14:06:45Z, 30042000
```

Now this works just fine:
```
$ rrd2csv memory_total_kib, memory_free_kib
timestamp, AVERAGE:host:92bc3b1e-e0a3-49ba-8994-fc305ff882b7:memory_total_kib, AVERAGE:host:92bc3b1e-e0a3-49ba-8994-fc305ff882b7:memory_free_kib
2025-01-07T15:04:50Z, 33350000, 30023000
```
…#6215)

The point of using try_lock is to not get lock while trying to hold the
mutex.
Releasing it and blocking to lock it defeats the purpose of using
try_lock.
Reorganise the sequence to read and copy all the rrds first while under
the
locked mutex, release it, and then archive the copies.

This doesn't fix CA-404013 directly, so this can wait until the release
is cut. I'm opening otherwise I'll forget to open it when things are
unlocked.

I've run the smoke and validation tests, as well as checking that the
rrd files in the bugtools are well formed for hosts and vms (they are
created suing this function)
…ect#6197)

Two changes in the PR:
**Drop the usage of fuser in stunnel client proxy**
The drawback of fuser is that it gets too many things involved. E.g. it
is observed that it got stuck on cifs kernel module.

This change uses a cleaner way to remember the stunnel client proxy.
Even when the xapi restarted unexpectedly, it can stop the remnant
stunnel proxy and start a new one.

**Make the stunnel proxy local port configurable**
Making it configurable can avoid the situation when the port conflicts
with others, e.g. an external program from users.
Commits best reviewed separately.

- CP-53003: Use JsonRpc v1.0 by default and switch to v2.0 once the API
version is known.
- Removal of deprecated methods and unused internal methods.
- Refactoring to fix some inefficiencies when loading and saving
certificates in PS.
The new clustering interface uses a constructor `Extended` address when
a cluster hots is trying to join a cluster. This causes problems during
upgrades as a newly upgraded host might send out the new address format
to old hosts, which do not understand this format, causing the new hosts
not able to join.

The fix would be to use a new cluster_address feature flag/pool
restrictions to control the use of the new clustering interface. This makes
sure that the new interface would only be used if all of the hosts
understand this new interface, i.e. have this feature enabled. The
cluster_address feature is controlled by v6d and is pool-wide, therefore
the new interface would only be enabled if all v6ds are updated to the
correct level, which also implies that the accompanying xapi are updated
to the correct level.

Signed-off-by: Vincent Liu <[email protected]>
…ject#6217)

The new clustering interface uses a constructor `Extended` address when
a cluster hots is trying to join a cluster. This causes problems during
upgrades as a newly upgraded host might send out the new address format
to old hosts, which do not understand this format, causing the new hosts
not able to join.

The fix would be to use the cluster_health feature flag/pool
restrictions to gate the use of the new clustering interface. This makes
sure that the new interface would only be used if all of the hosts
understand this new interface, i.e. have this feature enabled. The
cluster_health feature is controlled by v6d and is pool-wide, therefore
the new interface would only be enabled if all v6ds are updated to the
correct level, which also implies that the accompanying xapi are updated
to the correct level.
Be explicit about the file system of an update ISO. Remove dead code.
This has two advantages:
1. Always non-negative: they represent absolute differences in time
2. Forces users to define the units of time, allowing to read the time
in minutes, when appropriate

Passed the toolstack's smoke and validation tests: Job 209833
The package dune has been replaced with ocaml-dune

Signed-off-by: Pau Ruiz Safont <[email protected]>
The package dune has been replaced with ocaml-dune

> Package dune is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source
However the following packages replace it:
  ocaml-dune
Ubuntu's dune is way too old for the version generally used. On top of that the
command doesn't fail when this happens, making the setup brittle. Instead write
the version variable to config.mk and run make.

Signed-off-by: Pau Ruiz Safont <[email protected]>
Ubuntu's dune is way too old for the version generally used. On top of
that the command doesn't fail when this happens, making the setup
brittle. Instead write the version variable to config.mk and run make.

You can see this change running as expected in this run:
https://github.com/xapi-project/xen-api/actions/runs/12792091034/job/35661725368
We have so far hard-coded the exectation that in a PEM file a private
key is followed by a certficate but this is actually not required by the
PEM standard and let to a failure in XSI-1781.

This is a simple fix first collects all keys and certificates while
skipping over other content and the uses the first key and certificate.

Signed-off-by: Christian Lindig <[email protected]>
Add back a unit test.

Signed-off-by: Christian Lindig <[email protected]>
The integer values that OCaml uses for signals should never be printed as they
are. They can cause confusion because they don't match the C POSIX values.

Change the unixext function that converts them to string to stop building a
list and finding a value in the list to instead use pattern-matching. Also
added some more values that got introduced in OCaml 4.03, and return a more
compact value for unknown signals, following the same format as Fmt.Dump.signal

Signed-off-by: Pau Ruiz Safont <[email protected]>
When signals are are written to logs, the POSIX name should be used to minimize
confusion. It makes sense that the function that does this is in the logging
library instead of the unix one, as most users will be already be using the
logging library, but not all the unix one.

Moving it there also allows for a more ergonomic usage with the logging
functions.

Signed-off-by: Pau Ruiz Safont <[email protected]>
…6227)

We have so far hard-coded the expectation that in a PEM file a private
key is followed by a certificate but this is actually not required by
the PEM standard and let to a failure in XSI-1781.

This is a simple fix that permits both. However, it would still fail if
we have a certificate with multiple certificates followed by a key.
Another strategy could be to parse all blocks in a PEM file and to pick
out a key and certificate later rather than expecting a particular order
up front.

Added new test data for unit tests; fail-01.pem now becomes a passing
test case.
…es correctly

Other unit tests only verify the interoperability of the RRDs - dumping them to
JSON/XML and reading back in, verifying that the same data was decoded. We're
now seeing a problem where Gauge data sources, which should be absolute values
provided by the plugin, fluctuate wildly when processed by the RRD library.

Ensure we have an easy way to test this for both Gauge and Absolute data
sources - these values should be passed as-is by the RRD library, without any
time-based transformations.

This test currently fails and will be passing with the fix commits.

Signed-off-by: Andrii Sultanov <[email protected]>
The integer values that OCaml uses for signals should never be printed
as integers. They can cause confusion because they don't match the C
POSIX values.

Change the unixext function that converts them to string to stop
building a list and finding a value in the list to instead use
pattern-matching. Also added some more values that got introduced in
OCaml 4.03, and return a more compact value for unknown signals,
following the same format as Fmt.Dump.signal.

Typically, engineers see signal -11 and assume it's SIGSEGV, when it's
SIGTERM.

Fixes xapi-project#6225
Some recent changes related to RRDs likely exposed a long-standing latent issue
where the RRD library would process the passed-in values for Gauge and Absolute
data sources incorrectly leading to constant values changing from update to
update, for example:
```
$ rrd2csv memory_total_kib
timestamp, AVERAGE:host:8b533333-91e1-4698-bd17-95b9732ffbb6:memory_total_kib
2025-01-15T08:41:40Z, 33351000
2025-01-15T08:41:45Z, 33350000
2025-01-15T08:41:50Z, 33346000
2025-01-15T08:41:55Z, 33352000
```

Instead of treating Gauge and Absolute data sources as a
variation on the rate-based Derive data source type, expecting time-based
calculations to cancel each other out, do not undertake any calculations on
non-rate data sources at all.

This makes the unit test added in the previous commit pass.

Signed-off-by: Andrii Sultanov <[email protected]>
…sources (xapi-project#6233)

Some recent changes related to RRDs likely exposed a long-standing
latent issue
where the RRD library would process the passed-in values for Gauge and
Absolute
data sources incorrectly leading to constant values changing from update
to
update, for example:
```
$ rrd2csv memory_total_kib
timestamp, AVERAGE:host:8b533333-91e1-4698-bd17-95b9732ffbb6:memory_total_kib
2025-01-15T08:41:40Z, 33351000
2025-01-15T08:41:45Z, 33350000
2025-01-15T08:41:50Z, 33346000
2025-01-15T08:41:55Z, 33352000
```

Instead of treating Gauge and Absolute data sources as a
variation on the rate-based Derive data source type, expecting
time-based
calculations to cancel each other out, do not undertake any calculations
on
non-rate data sources at all.

First commit adds a failing unit test, second makes it pass.

===

I've verified these changes through manual testing, they've also passed
the testcases that discovered this issue: SNMP memory testcases (JobIDs
4197305, 4196759, 4196744) and ShimMemory testcase (4197050). This
branch also passed Ring3 BST+BVT (210577)
@changlei-li changlei-li merged commit 735e5a4 into xapi-project:feature/pool-licensing Jan 17, 2025
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.