
Conversation

@richm (Contributor) commented Apr 8, 2025

There is a new QEMU-based test which uses the qemu/kvm capability of
GitHub Actions runners. This is the basis for new bootc/image mode tests
which we will be rolling out in the near future.
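
For reference, a minimal sketch of how qemu/kvm access is typically enabled for the unprivileged runner user on GitHub-hosted Ubuntu runners (a udev rule for /dev/kvm); whether the workflow added here does exactly this is an assumption:

```yaml
# Sketch only: the step name and rule file name are illustrative.
- name: Enable KVM group perms for the runner user
  run: |
    echo 'KERNEL=="kvm", GROUP="kvm", MODE="0666", OPTIONS+="static_node=kvm"' \
      | sudo tee /etc/udev/rules.d/99-kvm4all.rules
    sudo udevadm control --reload-rules
    sudo udevadm trigger --name-match=kvm
```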

ansible-lint requires that the collection path is set so that the requirements
it installs end up in the correct place.
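
A minimal sketch of one way to pin the collection path, assuming the `ANSIBLE_COLLECTIONS_PATH` environment variable and the usual `meta/collection-requirements.yml` file; the actual mechanism in the workflow may differ:

```yaml
# Sketch only; the path and requirements file name are assumptions.
- name: Run ansible-lint with a fixed collection path
  env:
    ANSIBLE_COLLECTIONS_PATH: ${{ github.workspace }}/.collections
  run: |
    ansible-galaxy collection install -p "$ANSIBLE_COLLECTIONS_PATH" \
      -r meta/collection-requirements.yml
    ansible-lint
```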

There has been some general GitHub Actions deprecation of Python versions and
Ubuntu versions that we have had to fix.

Remove `CONTRIBUTOR` from the list of users who can trigger citest.

For more information, see

* linux-system-roles/.github#98
* linux-system-roles/.github#94
* linux-system-roles/.github#93
* linux-system-roles/.github#92
* linux-system-roles/.github#91
@richm self-assigned this Apr 8, 2025
@martinpitt (Contributor)

Uh, qemu tests run into ENOSPC on the host. I suppose we can tone down the space requirements in the test and make smaller images etc., but I haven't looked yet. I can if you want me to.

@richm force-pushed the changes-20250408 branch 4 times, most recently from 5a76e5b to 8b2088f on April 11, 2025 00:09
@martinpitt marked this pull request as draft on April 11, 2025 07:50
@martinpitt self-assigned this Apr 11, 2025
@martinpitt (Contributor) commented Apr 11, 2025

On F42, the runner did not hit ENOSPC; it just failed the tests. The first failed test on CI is tests/tests_luks.yml. `tox -e qemu-ansible-core-2.17 -- --image-name fedora-42 --log-level=debug --debug tests/tests_luks.yml` by itself passes, so this is some complication in batch mode.

In F42, I see a large `/dev/dm-1 4.0G 110M 3.9G 3% /opt/test1` which is being created during the test in the VM (PV on /dev/sda), but it does get cleaned up again. I'm running `watch -n1 du -hsc /tmp/tmp*` on the host to check the usage of the sparse backend files, but curiously they always stay at 0 bytes, i.e. they aren't even being used; they are also cleaned up properly after the test ends.

I cannot reproduce the "Unable to find enough unused disks. Exiting playbook." either.

For the ENOSPC on C9, I ran `tox -e qemu-ansible-core-2.16 -- --image-name centos-9 --log-level=debug --debug tests/tests_filesystem_one_disk.yml` locally, and that also passed and didn't leave cruft behind.

Moving this to draft; most likely we'll just drop these tests on GH for storage, as they are just too big. But I want to experiment with this a bit on my fork, getting an interactive shell on a GH runner. I'll see if running a subset of tests works, or what is actually hitting ENOSPC.

My current idea is to generalize `--skip-tags tests::infiniband` to `--skip-tags tests::no-github`, and mark tests accordingly. An initial run with most tests skipped looks quite promising (the c10s failure is a VM download glitch, unrelated).

Next attempt: a run skipping `tests::nvme,tests::scsi`.
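
A hypothetical illustration of the tagging scheme; the play and its tasks are made up, and whether the repo uses play-level tags exactly like this is an assumption. Tests carrying `tests::no-github` would then be excluded in CI with `--skip-tags tests::no-github`:

```yaml
# Illustration only: the play and its tasks are placeholders.
- name: Test behaviour that needs real NVMe hardware
  hosts: all
  tags:
    - tests::nvme       # existing hardware-specific tag
    - tests::no-github  # generalized "skip on GitHub runners" tag
  tasks:
    - name: Placeholder for the actual test steps
      ansible.builtin.debug:
        msg: This only runs when tests::no-github is not skipped.
```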

…rsions, ubuntu versions

There is a new QEMU-based test which uses the qemu/kvm capability of
GitHub Actions runners.  This is the basis for new bootc/image mode tests
which we will be rolling out in the near future.

ansible-lint requires that the collection path is set so that the requirements
it installs end up in the correct place.

There has been some general GitHub Actions deprecation of Python versions and
Ubuntu versions that we have had to fix.

Remove `CONTRIBUTOR` from the list of users who can trigger citest.

For more information, see

* linux-system-roles/.github#98
* linux-system-roles/.github#94
* linux-system-roles/.github#93
* linux-system-roles/.github#92
* linux-system-roles/.github#91

Signed-off-by: Rich Megginson <rmeggins@redhat.com>
@martinpitt (Contributor)

@richm the run on my fork is pretty okay -- some "honest" failures on f42 which I haven't yet looked into. But C9/C10 pass now, and it's a sane set of tests now. This includes the change from linux-system-roles/.github#99 but if you prefer I'm happy to remove that here. Or feel free to remove it yourself if you'd like to land this soon otherwise.

@martinpitt marked this pull request as ready for review on April 14, 2025 17:20
@martinpitt (Contributor) commented Apr 15, 2025

f42 failures:

✅ Unused disk

Many many tests fail with

Unable to find enough unused disks. Exiting playbook

This must be some leakage in batch mode. Running these individually works fine. I already investigated this a bit above, but no result yet. This will be my task today.

Fixed in "tests: Fix stratis block device clean up" commit.

tests_stratis

This has an "honest" failure:

The conditional check 'storage_test_pool.name in _stratis_pool_info.pools[0]['blockdevs']['datadevs'][0]['key_description']' failed. The error was: error while evaluating conditional (storage_test_pool.name in _stratis_pool_info.pools[0]['blockdevs']['datadevs'][0]['key_description']): 'dict object' has no attribute 'key_description'

This could be related to stratis-storage/stratisd#3805 and be due to an API change. This reproduces locally.

tests_resize

failure:

TASK [linux-system-roles.storage : Manage the pools and volumes to match the specified state] ***
volume 'foo-test1' cannot be resized from 9 GiB to 5 GiB: new size same as old size

This does not reproduce in isolation with `tox -e qemu-ansible-core-2.17 -- --image-name fedora-42 --log-level=debug tests/tests_resize.yml`.

tests_lvm_pool_pv_grow.yml

failure

/dev/foo: already exists in filesystem

This smells like another cleanup failure. Now, just about every test uses "foo" as the pool name, but I figure it's left over from one of the other failures. I'll go through the "Clean up" actions once again with a fine-toothed comb and see which ones need fixing.

When I run it locally in isolation, it fails differently:

TASK [Verify each PV size] ***************************************************************************************************************************
task path: /var/home/martin/upstream/lsr/storage/tests/verify-pool-member-pvsize.yml:18
Tuesday 15 April 2025  10:45:41 +0200 (0:00:00.413)       0:00:36.427 ********* 
fatal: [/var/home/martin/.cache/linux-system-roles/fedora-42.qcow2]: FAILED! => {
    "assertion": "(dev_size.bytes - actual_pv_size.stdout | int) <= (4194304 + pv_pe_start.stdout | int)",
    "changed": false,
    "evaluated_to": false
}

MSG:

Unexpected difference between PV size and block device size: (device size: 1099511627776) (actual PV size:   8585740288)

#519 should fix the CI run to fail "correctly".
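
For context, the assertion allows the PV to be smaller than its backing block device by at most one 4 MiB extent (4194304 bytes) plus the PE start offset; here the device is 1 TiB (1099511627776 bytes) while the PV is only about 8 GiB, so the check fails by a huge margin. A reconstruction of the check as a task, based only on the assertion string and message quoted above (the tasks that register dev_size, actual_pv_size, and pv_pe_start are not shown):

```yaml
# Reconstruction for illustration; not copied from the test file.
- name: Verify each PV size
  ansible.builtin.assert:
    that:
      # PV size may lag the device size by one 4 MiB extent plus the PE start.
      - (dev_size.bytes - actual_pv_size.stdout | int) <= (4194304 + pv_pe_start.stdout | int)
    msg: >-
      Unexpected difference between PV size and block device size:
      (device size: {{ dev_size.bytes }})
      (actual PV size: {{ actual_pv_size.stdout }})
```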

✅ tests_remove_mount.yml

failure

cannot remove members 'sda' from pool 'foo' in safe mode

This also smells like a cleanup failure; it doesn't fail in isolation. Fixed by #519.

✅ tests_existing_lvm_pool.yml

failure

cannot remove members 'sda' from pool 'foo' in safe mode

Exactly the same failure as tests_remove_mount.yml, so a cleanup failure. Also passes locally in isolation.

Fixing the tests_resize cleanup didn't fix the follow-ups above. Testing a run without stratis, and a run without the resize tests, to see which one breaks stuff. Update: neither change helped at all; the failures are independent.

Fixed by #519.

Always run the "Clean up" tasks, so that the volumes, pool, and most
importantly the block devices are also removed when a failure occurs.
This unbreaks subsequent tests.
Comment on lines -147 to +71
- name: Create encrypted Stratis pool
- name: Create one Stratis pool with one volume
Contributor

For reviewers: git and GitHub make an utter mess of the diff view of "tests: Fix stratis block device clean up". What this does is wrap the steps of a test into `block: ... always: Cleanup`. This changes the indentation, and due to the repetitive nature of the test the diff view "locks onto" the wrong context. I didn't change any logic.
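
Roughly, the resulting shape of each test looks like the sketch below; the task bodies are placeholders rather than the actual diff, and the real tests set many more parameters:

```yaml
# Sketch of the block/always restructuring; placeholder parameters only.
- name: Run the test steps and always clean up
  block:
    - name: Create one Stratis pool with one volume
      ansible.builtin.include_role:
        name: linux-system-roles.storage
      vars:
        storage_pools:
          - name: foo
            type: stratis
            disks: "{{ unused_disks }}"  # placeholder; found by a helper in the real tests
            volumes:
              - name: test1
                mount_point: /opt/test1

    - name: Verify role results
      ansible.builtin.include_tasks: verify-role-results.yml

  always:
    - name: Clean up
      ansible.builtin.include_role:
        name: linux-system-roles.storage
      vars:
        storage_pools:
          - name: foo
            state: absent  # placeholder cleanup invocation
```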

@martinpitt (Contributor)

OK, this run is down to 5 failures, from the previous 26, i.e. the "Unable to find enough unused disks" failures are gone with the stratis cleanup fix. This may be good enough to land now, also to avoid regressing the c9/c10 tests?

@vojtechtrefny (Collaborator)

> tests_stratis
>
> This has an "honest" failure:
>
> The conditional check 'storage_test_pool.name in _stratis_pool_info.pools[0]['blockdevs']['datadevs'][0]['key_description']' failed. The error was: error while evaluating conditional (storage_test_pool.name in _stratis_pool_info.pools[0]['blockdevs']['datadevs'][0]['key_description']): 'dict object' has no attribute 'key_description'
>
> This could be related to stratis-storage/stratisd#3805 and be due to an API change. This reproduces locally.

This is caused by the changes to Stratis metadata and encryption in Stratis 3.8.0, but the problem should only be in the tests in verify-pool-stratis, not in the role itself (or blivet). I'll try to find a way to verify the test results that works with both old and new Stratis.
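
A hypothetical sketch of what a tolerant check could look like (the actual fix may differ); it only asserts on `key_description` when the field is present, since Stratis 3.8 apparently no longer reports it per blockdev:

```yaml
# Hypothetical sketch; variable names are taken from the failing conditional.
- name: Check the key description against the pool name (old Stratis only)
  ansible.builtin.assert:
    that:
      - storage_test_pool.name in
        _stratis_pool_info.pools[0].blockdevs.datadevs[0].key_description
  when: "'key_description' in _stratis_pool_info.pools[0].blockdevs.datadevs[0]"
```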

@richm (Contributor, Author) commented Apr 15, 2025

> OK, this run is down to 5 failures, from the previous 26, i.e. the "Unable to find enough unused disks" failures are gone with the stratis cleanup fix. This may be good enough to land now, also to avoid regressing the c9/c10 tests?

Yes. Once you push the ansible-lint fix, we can merge this.

    - name: Verify role results
      include_tasks: verify-role-results.yml

  always:
Collaborator

I guess we should add this style of cleanup to all tests, but that's definitely out of scope for this PR.

Always run the "Clean up" tasks, even on failure, so that the used safe
mode disks don't leak into the following tests.

Some roles produce ansible/test logs which are too large to
sensibly/comfortably view in the main workflow log view. Show only the
interesting failures there (the lines around `fatal:`) and upload the
full logs as an artifact.
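
A sketch of the idea as workflow steps; the log path, artifact name, and the amount of grep context are assumptions, not necessarily what the workflow here does:

```yaml
# Sketch only; paths and names are illustrative.
- name: Show only the interesting failures in the workflow log
  if: always()
  run: grep -B 5 -A 20 'fatal:' tests/*.log || true

- name: Upload the full test logs as an artifact
  if: always()
  uses: actions/upload-artifact@v4
  with:
    name: test-logs
    path: tests/*.log
```
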
Running all tests is too much for the GitHub runner capacity, and many
of them fail.
@martinpitt (Contributor)

Pushed the ansible-lint fix. I was trying to make some more progress on the other failures, but they are fighting back well. So yeah, let's address these in follow-ups.

@richm merged commit ae35218 into main on Apr 15, 2025
18 of 19 checks passed
@richm deleted the changes-20250408 branch on April 15, 2025 16:48