Skip to content

Az implementation #4061

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 9 commits into
base: az-integration
Choose a base branch
from
Open

Az implementation #4061

wants to merge 9 commits into from

Conversation

Sploder12
Copy link
Contributor

@Sploder12 Sploder12 commented Apr 24, 2025

This PR adds implementations for making VMs unavailable and available. Also has unavailable VMs ignore commands that change anything related to that VM. There are a few known race conditions that are outside the scope of AZs to fix.

MULTI-1789

Copy link

codecov bot commented Apr 24, 2025

Codecov Report

Attention: Patch coverage is 23.72881% with 45 lines in your changes missing coverage. Please review.

Project coverage is 89.04%. Comparing base (a9be08a) to head (d597d38).

Files with missing lines Patch % Lines
src/daemon/daemon.cpp 23.33% 23 Missing ⚠️
.../platform/backends/shared/base_virtual_machine.cpp 13.33% 13 Missing ⚠️
...latform/backends/shared/base_availability_zone.cpp 18.18% 9 Missing ⚠️
Additional details and impacted files
@@                Coverage Diff                 @@
##           az-integration    #4061      +/-   ##
==================================================
- Coverage           89.30%   89.04%   -0.26%     
==================================================
  Files                 264      264              
  Lines               14953    15004      +51     
==================================================
+ Hits                13354    13361       +7     
- Misses               1599     1643      +44     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@Sploder12 Sploder12 marked this pull request as ready for review April 24, 2025 18:22
@Sploder12 Sploder12 force-pushed the az-implementation branch 2 times, most recently from d7eb32d to 94e31e3 Compare May 12, 2025 11:17
@Sploder12 Sploder12 force-pushed the az-implementation branch from 94e31e3 to d597d38 Compare May 14, 2025 09:09
@@ -2028,6 +2030,12 @@ try // clang-format on
continue;
}

if (vm->current_state() == VirtualMachine::State::unavailable)
{
mpl::log(mpl::Level::info, name, "Ignoring mount since instance unavailable.");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... I think we need a better way to represent "unavailability". My main concern is that this approach is error-prone and repetitive. The problem is that "unavailable" instances are still a part of the operative instances, which is wrong. An unavailable instance is not operative since it can't be put into an operative state unless it's made available, so an unavailable instance shouldn't be a part of the "operative" set of VMs in the first place.

Right now, we have two sets of VMs: "operative" and "deleted". I think what we need here is a third category, "unavailable". This way, we can just update the error messages as does not exist or unavailable and call it a day, until we overhaul the VM management logic. @ricab, what do you think?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm good point. I haven't thought a lot about it, but I think I agree with you @xmkg. We need to define also what happens at the intersection of deleted/unavailable i.e., deleted instances that belong to an unavailable AZ. I guess we just declare "deleted" takes precedence over "unavailable"? We need to answer a few questions:

  • does deleting and instance unsubscribe it from the AZ?
  • does enabling/disabling an AZ affect deleted instances? (I hope not)
  • what happens on recover of an instance whose AZ is unavailable?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @ricab, good questions. Regarding the deleted + unavailable at the same time, I think it's reasonable to make deleted take precedence over unavailable. Another avenue might be to consolidate all instances to a single container and represent the "availability" state in a bit flag. This way, an instance can be deleted & unavailable at the same time.

Consolidating all instances into a single container carries a cost: lookup and iteration speed. This can be troublesome for users with many instances. Otherwise, std::ranges filtering would work just fine. If iteration speed matters, another option might be using a multi-index container, which allows for different indices over a set of values instead of just one.

Regarding the last three questions, those should be answered in the AZ spec itself, if not already. @Sploder12 😉

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xmkg I fully agree we need a better way to represent "unavailability", it is known that my current approach is terrible. But, my issue with using operative instances and a new unavailable table is that we'd be moving even more responsibility into the daemon. Also operative instances is the most unsafe thing in our entire code base at the moment. It guarantees race conditions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does deleting and instance unsubscribe it from the AZ?

Based on spec and implementation, no. Although it'd be nice if this were possible.

does enabling/disabling an AZ affect deleted instances? (I hope not)

Yes, it is necessary for recover in the current implementation :)

what happens on recover of an instance whose AZ is unavailable?

This is unaccounted for by the spec. Implementation recovers the instance in an unavailable state.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants