Skip to content

fix: get the right plan if there were several attempts #607

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

michael-todorovic
Copy link
Contributor

This PR fixes #606

I created 1000 random_pet layers, added them all at once to stress the system and observed the behavior.

To force retries, I randomly crashed running pods on-purpose:

╰ while [ true ] ; do sleep 3; kubectl get pods | awk '/test-layer.*Running/{print $1}' | parallel -j20 'kubectl delete pod {}'; done
pod "test-layer-937-plan-bqcpl" deleted
pod "test-layer-502-plan-88vtd" deleted
pod "test-layer-502-plan-429nd" deleted
[...]

So I got my retries as expected

╰ kubectl get terraformrun
NAME                        STATE       RETRIES   CREATED ON             RUNNER POD
test-layer-787-plan-rs42h   Succeeded   4         2025-05-25T08:05:34Z   test-layer-787-plan-9ggmd
test-layer-788-plan-zv6c8   Succeeded   0         2025-05-25T08:06:14Z   test-layer-788-plan-cp876
test-layer-840-plan-8zcph   Succeeded   1         2025-05-25T08:06:24Z   test-layer-840-plan-ctf7v
test-layer-880-plan-s4q89   Succeeded   0         2025-05-25T08:04:44Z   test-layer-880-plan-b8k77
test-layer-889-plan-c6fmt   Succeeded   0         2025-05-25T08:06:24Z   test-layer-889-plan-tfplx
test-layer-937-plan-bq9st   Succeeded   2         2025-05-25T08:06:44Z   test-layer-937-plan-m5rf2
test-layer-951-plan-hlqrk   Succeeded   0         2025-05-25T08:08:35Z   test-layer-951-plan-rs42h

I checked datastore logs and saw that now, reconciliation autocorrects the attempt, thus the status changes from 404 to 200 😄

time="2025-05-25T08:07:04Z" level=info bytes_in= bytes_out=24 error="<nil>" latency="29.305µs" method=GET remote_ip=10.244.0.225 service_account="system:serviceaccount:burrito-system:burrito-controllers" status=404 uri="/api/plans?attempt=1&format=short&layer=test-layer-937&namespace=burrito-project&run=test-layer-937-plan-bq9st"
time="2025-05-25T08:07:08Z" level=info bytes_in=798 bytes_out=0 error="<nil>" latency="980.393µs" method=PUT remote_ip=10.244.0.14 service_account="system:serviceaccount:burrito-project:burrito-runner" status=200 uri="/api/plans?attempt=2&format=pretty&layer=test-layer-937&namespace=burrito-project&run=test-layer-937-plan-bq9st"

@LucasMrqes
Copy link
Collaborator

During your debugging, did you figure out why the initial behavior doesn't reliably work ? Calling the datastore with an empty string as attempts parameter is supposed to make the datastore list the bucket contents and fetch the latest attempt on the external storage, and just by reading the code, I can't see how would that fail. My guess is there might be a caching issue when listing the bucket contents ?
Retrieving the latest attempt client-side by checking the TerraformRun’s status essentially implements the same behavior, but it relies on a different source of truth for the list of attempts.

@corrieriluca
Copy link
Member

@LucasMrqes @michael-todorovic I just checked the datastore code, when attempts is empty it indeeds call GetLatestPlan, and computes the last attempt by listing all objects by prefix:

func (s *Storage) GetAttempts(namespace string, layer string, run string) (int, error) {
key := fmt.Sprintf("%s/%s/%s/%s", LayersPrefix, namespace, layer, run)
attempts, err := s.Backend.List(key)
return len(attempts), err
}

A bug might be in this function 🤔

@michael-todorovic
Copy link
Contributor Author

michael-todorovic commented May 26, 2025

Indeed, it seems like the List method is buggy. In some cases, burrito can't store attempts because nothing ran. When the next try succeeds, retries is 1 so the attempt is stored at /layers/ns/layer/run/1/.
List returns the number of keys under /layers/ns/layer/run/ instead of the biggest number there.
I'll update the PR accordingly

@corrieriluca
Copy link
Member

corrieriluca commented May 26, 2025

List returns the number of keys under /layers/ns/layer/run/ instead of the biggest number there.

Yes exactly! And "number of keys" is not "number of attempts" because with 2 attempts the number of keys may be more. Example on S3 with first attempt that fails:

/layers/ns/layer/run/0/run.log
/layers/ns/layer/run/1/short.diff
/layers/ns/layer/run/1/plan.bin
/layers/ns/layer/run/1/run.log

Here S3 will return 4 keys for 2 attempts, which ultimately results in the wrong subsequent call to S3: Burrito will try to fetch the last plan short diff of attempt 3 (keys-1)!

Signed-off-by: Michael Todorovic <[email protected]>
@michael-todorovic
Copy link
Contributor Author

Actually, List isn't recursive so we only get

/layers/ns/layer/run/1/
/layers/ns/layer/run/2/

I changed a bit the logic to get and address directly the right latest attempt

@corrieriluca
Copy link
Member

Is it not recursive across all backend implementations (GCS, Azure & S3)?

@michael-todorovic
Copy link
Contributor Author

You're right, Azure looks to be recursive by default, gcs isn't because of Delimiter being used. I'll adjust again

Signed-off-by: Michael Todorovic <[email protected]>
@codecov-commenter
Copy link

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 45.13%. Comparing base (765acea) to head (bff7694).
Report is 1 commits behind head on main.

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #607      +/-   ##
==========================================
+ Coverage   44.93%   45.13%   +0.20%     
==========================================
  Files          79       79              
  Lines        5759     5780      +21     
==========================================
+ Hits         2588     2609      +21     
  Misses       2956     2956              
  Partials      215      215              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Comment on lines 133 to 134
// Azure returns the full path, so we need to split by "/"
attemptId := strings.Split(attemptStr, "/")[0]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of making custom code for Azure here, why not changing the List implementation of the Azure backend?

I've checked GCS and S3 implementations, they seem to build a list of prefixes whereas the Azure one (see below) seems to build a list of filenames 🤔

func (a *Azure) List(prefix string) ([]string, error) {
keys := []string{}
marker := ""
pager := a.Client.NewListBlobsFlatPager(a.Config.Container, &container.ListBlobsFlatOptions{
Prefix: &prefix,
Marker: &marker,
})
for pager.More() {
resp, err := pager.NextPage(context.TODO())
if err != nil {
return nil, err
}
for _, blob := range resp.Segment.BlobItems {
keys = append(keys, *blob.Name)
}
}
return keys, nil
}

Copy link
Contributor Author

@michael-todorovic michael-todorovic May 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I preferred a systemic fix for all buckets types to make sure we get the right stuff. For example, GCS could list recursively just like Azure if this line gets removed (during a future refacto)

I found it safer to deal with all cases in a single location and still avoid a regex to extract the attempt id :) Maybe the comment is misleading though and could be adjusted, which way would you prefer?

Copy link
Member

@corrieriluca corrieriluca May 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay I agree on the safeness of this systematic fix. The comment is indeed misleading and should be generic yes 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be better now.
I also changed GetAttempts to use []int{} internally so we can sort as integers (so 10 comes after 2) and deduplicate

Signed-off-by: Michael Todorovic <[email protected]>
@AlanLonguet
Copy link
Collaborator

I see the issue thx @michael-todorovic to have found it, the function just made a big hypothesis that all attempts would be stored in the datastore but we can't rely on that however I feel we need to put less logic in the generic function that interacts with the backend and make providers compliant with the output we want to obtain from a List function

@michael-todorovic
Copy link
Contributor Author

I see the issue thx @michael-todorovic to have found it, the function just made a big hypothesis that all attempts would be stored in the datastore but we can't rely on that however I feel we need to put less logic in the generic function that interacts with the backend and make providers compliant with the output we want to obtain from a List function

@corrieriluca suggested it as well, I can adjust accordingly
Though I don't have azure nor gcp setups to really test so I would use ephemeral containers (azurite, localstack, gcs emulator, minio, ceph) that could be used for unit tests as well. I would do it via docker-compose as it's well-known, rather than testcontainers. This is more work but no more surprises :)
Wdyt?

@AlanLonguet
Copy link
Collaborator

If that's ok for you, even if it's a bit of extra work, I'd advocate for that in favor of keeping "sanitizing" logic in the generic GetAttempts. This makes the GetAttempts function more easily testable and delegate bugs to the underlying provider implementation.

@michael-todorovic
Copy link
Contributor Author

Sure, we saw in #602 that it could be useful as well 😄
I'll then proceed with:

  • List should return only files/folders under prefix, without being recursive
  • Implement a ListRecursive later on, when needed. It would show anything under prefix

@michael-todorovic michael-todorovic marked this pull request as draft May 27, 2025 14:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Error getting last Result on many layers
5 participants