fix: get the right plan if there were several attempts #607

michael-todorovic · 2025-05-25T08:25:03Z

This PR fixes #606

I created 1000 random_pet layers, added them all at once to stress the system and observed the behavior.

To force retries, I deliberately killed running pods at random:

╰ while [ true ] ; do sleep 3; kubectl get pods | awk '/test-layer.*Running/{print $1}' | parallel -j20 'kubectl delete pod {}'; done
pod "test-layer-937-plan-bqcpl" deleted
pod "test-layer-502-plan-88vtd" deleted
pod "test-layer-502-plan-429nd" deleted
[...]

So I got my retries, as expected:

╰ kubectl get terraformrun
NAME                        STATE       RETRIES   CREATED ON             RUNNER POD
test-layer-787-plan-rs42h   Succeeded   4         2025-05-25T08:05:34Z   test-layer-787-plan-9ggmd
test-layer-788-plan-zv6c8   Succeeded   0         2025-05-25T08:06:14Z   test-layer-788-plan-cp876
test-layer-840-plan-8zcph   Succeeded   1         2025-05-25T08:06:24Z   test-layer-840-plan-ctf7v
test-layer-880-plan-s4q89   Succeeded   0         2025-05-25T08:04:44Z   test-layer-880-plan-b8k77
test-layer-889-plan-c6fmt   Succeeded   0         2025-05-25T08:06:24Z   test-layer-889-plan-tfplx
test-layer-937-plan-bq9st   Succeeded   2         2025-05-25T08:06:44Z   test-layer-937-plan-m5rf2
test-layer-951-plan-hlqrk   Succeeded   0         2025-05-25T08:08:35Z   test-layer-951-plan-rs42h

I checked the datastore logs and saw that reconciliation now autocorrects the attempt, so the status changes from 404 to 200 😄

time="2025-05-25T08:07:04Z" level=info bytes_in= bytes_out=24 error="<nil>" latency="29.305µs" method=GET remote_ip=10.244.0.225 service_account="system:serviceaccount:burrito-system:burrito-controllers" status=404 uri="/api/plans?attempt=1&format=short&layer=test-layer-937&namespace=burrito-project&run=test-layer-937-plan-bq9st"
time="2025-05-25T08:07:08Z" level=info bytes_in=798 bytes_out=0 error="<nil>" latency="980.393µs" method=PUT remote_ip=10.244.0.14 service_account="system:serviceaccount:burrito-project:burrito-runner" status=200 uri="/api/plans?attempt=2&format=pretty&layer=test-layer-937&namespace=burrito-project&run=test-layer-937-plan-bq9st"

Signed-off-by: Michael Todorovic <[email protected]>

LucasMrqes · 2025-05-25T09:56:05Z

During your debugging, did you figure out why the initial behavior doesn't work reliably? Calling the datastore with an empty string as the attempts parameter is supposed to make the datastore list the bucket contents and fetch the latest attempt from the external storage, and just by reading the code, I can't see how that would fail. My guess is that there might be a caching issue when listing the bucket contents.
Retrieving the latest attempt client-side by checking the TerraformRun's status essentially implements the same behavior, but it relies on a different source of truth for the list of attempts.

corrieriluca · 2025-05-25T17:50:28Z

@LucasMrqes @michael-todorovic I just checked the datastore code: when attempts is empty it indeed calls GetLatestPlan, and computes the last attempt by listing all objects by prefix:

burrito/internal/datastore/storage/common.go

Lines 123 to 127 in 847331e

    
    func (s *Storage) GetAttempts(namespace string, layer string, run string) (int, error) {
    	key := fmt.Sprintf("%s/%s/%s/%s", LayersPrefix, namespace, layer, run)
    	attempts, err := s.Backend.List(key)
    	return len(attempts), err
    }

A bug might be in this function 🤔

michael-todorovic · 2025-05-26T07:03:57Z

Indeed, it seems like the List method is buggy. In some cases, burrito can't store attempts because nothing ran. When the next try succeeds, retries is 1 so the attempt is stored at /layers/ns/layer/run/1/.
List returns the number of keys under /layers/ns/layer/run/ instead of the biggest number there.
I'll update the PR accordingly
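To make the mismatch concrete, here is a minimal sketch, not Burrito's actual code, that compares the number of listed entries with the highest attempt number. The helper names `countEntries` and `highestAttempt` and the sample key are hypothetical, and the sketch assumes the listing returns one directory-like entry per stored attempt.

```go
package main

import (
	"fmt"
	"path"
	"strconv"
)

// countEntries mimics the behavior under discussion: the number of listed
// entries is used as if it were the attempt number.
func countEntries(entries []string) int {
	return len(entries)
}

// highestAttempt parses the last path element of each entry as an attempt
// number and returns the largest one. ok is false if nothing could be parsed.
func highestAttempt(entries []string) (attempt int, ok bool) {
	for _, e := range entries {
		// path.Base drops trailing slashes, so "run/2/" yields "2".
		n, err := strconv.Atoi(path.Base(e))
		if err != nil {
			continue // skip entries that are not attempt directories
		}
		if !ok || n > attempt {
			attempt, ok = n, true
		}
	}
	return attempt, ok
}

func main() {
	// Hypothetical case: attempts 0 and 1 stored nothing, attempt 2 succeeded.
	entries := []string{"layers/ns/layer/run/2/"}
	latest, _ := highestAttempt(entries)
	fmt.Println("entries:", countEntries(entries), "highest attempt:", latest)
}
```

With a single stored attempt numbered 2, the count is 1 while the highest attempt is 2, so any lookup derived from the count points at the wrong path.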

corrieriluca · 2025-05-26T09:37:49Z

> List returns the number of keys under /layers/ns/layer/run/ instead of the biggest number there.

Yes, exactly! And "number of keys" is not "number of attempts", because with 2 attempts there may be more than 2 keys. Example on S3 with a first attempt that fails:

/layers/ns/layer/run/0/run.log
/layers/ns/layer/run/1/short.diff
/layers/ns/layer/run/1/plan.bin
/layers/ns/layer/run/1/run.log

Here S3 will return 4 keys for 2 attempts, which ultimately results in the wrong subsequent call to S3: Burrito will try to fetch the last plan short diff of attempt 3 (keys-1)!
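The arithmetic in this example can be checked with a short sketch. It assumes a recursive listing that returns full object keys; the helper `attemptDirs` is hypothetical and only groups keys by the first path segment after the run prefix.

```go
package main

import (
	"fmt"
	"strings"
)

// attemptDirs returns the set of first path segments found under prefix,
// i.e. the attempt directories that the listed object keys belong to.
func attemptDirs(prefix string, keys []string) map[string]struct{} {
	dirs := make(map[string]struct{})
	for _, k := range keys {
		if !strings.HasPrefix(k, prefix) {
			continue // key is not under prefix
		}
		dir, _, _ := strings.Cut(k[len(prefix):], "/")
		if dir != "" {
			dirs[dir] = struct{}{}
		}
	}
	return dirs
}

func main() {
	prefix := "/layers/ns/layer/run/"
	keys := []string{
		"/layers/ns/layer/run/0/run.log",
		"/layers/ns/layer/run/1/short.diff",
		"/layers/ns/layer/run/1/plan.bin",
		"/layers/ns/layer/run/1/run.log",
	}
	fmt.Println("keys:", len(keys))                                     // keys: 4
	fmt.Println("attempt directories:", len(attemptDirs(prefix, keys))) // attempt directories: 2
}
```

Four keys collapse to two attempt directories (0 and 1), which is why a value derived from the key count (3 here) does not correspond to any stored attempt.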

michael-todorovic · 2025-05-26T10:06:41Z

Actually, List isn't recursive so we only get

/layers/ns/layer/run/1/
/layers/ns/layer/run/2/

I changed the logic a bit to directly get and address the right latest attempt.

corrieriluca · 2025-05-26T10:10:30Z

Is it not recursive across all backend implementations (GCS, Azure & S3)?

michael-todorovic · 2025-05-26T10:38:41Z

You're right, Azure looks to be recursive by default; GCS isn't, because of the Delimiter being used. I'll adjust again.
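The difference between a flat (recursive) listing and a delimiter-based one can be sketched without any cloud SDK. The function below is a hypothetical, backend-agnostic illustration: it takes the keys a flat listing would return and reduces them to the immediate children of a prefix, which is roughly what a listing with a "/" delimiter reports.

```go
package main

import (
	"fmt"
	"strings"
)

// immediateChildren emulates a delimiter-based (non-recursive) listing on top
// of a flat, recursive one: every key under prefix is cut at the first
// delimiter that follows the prefix, and duplicates are dropped.
func immediateChildren(prefix, delimiter string, keys []string) []string {
	seen := make(map[string]bool)
	children := []string{}
	for _, k := range keys {
		if !strings.HasPrefix(k, prefix) {
			continue
		}
		rest := k[len(prefix):]
		child := k // object stored directly under prefix
		if i := strings.Index(rest, delimiter); i >= 0 {
			child = prefix + rest[:i+len(delimiter)] // directory-like entry
		}
		if !seen[child] {
			seen[child] = true
			children = append(children, child)
		}
	}
	return children
}

func main() {
	flat := []string{
		"/layers/ns/layer/run/1/plan.bin",
		"/layers/ns/layer/run/1/run.log",
		"/layers/ns/layer/run/2/plan.bin",
	}
	fmt.Println(immediateChildren("/layers/ns/layer/run/", "/", flat))
	// [/layers/ns/layer/run/1/ /layers/ns/layer/run/2/]
}
```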

codecov-commenter · 2025-05-26T13:43:23Z

⚠️ Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 45.13%. Comparing base (765acea) to head (bff7694).
Report is 1 commits behind head on main.

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #607      +/-   ##
==========================================
+ Coverage   44.93%   45.13%   +0.20%     
==========================================
  Files          79       79              
  Lines        5759     5780      +21     
==========================================
+ Hits         2588     2609      +21     
  Misses       2956     2956              
  Partials      215      215

corrieriluca · 2025-05-26T13:44:41Z

internal/datastore/storage/common.go

+		// Azure returns the full path, so we need to split by "/"
+		attemptId := strings.Split(attemptStr, "/")[0]


Instead of writing custom code for Azure here, why not change the List implementation of the Azure backend?

I've checked GCS and S3 implementations, they seem to build a list of prefixes whereas the Azure one (see below) seems to build a list of filenames 🤔

burrito/internal/datastore/storage/azure/azure.go

Lines 101 to 119 in 847331e

    func (a *Azure) List(prefix string) ([]string, error) {
    	keys := []string{}
    	marker := ""
    	pager := a.Client.NewListBlobsFlatPager(a.Config.Container, &container.ListBlobsFlatOptions{
    		Prefix: &prefix,
    		Marker: &marker,
    	})
    	for pager.More() {
    		resp, err := pager.NextPage(context.TODO())
    		if err != nil {
    			return nil, err
    		}
    		for _, blob := range resp.Segment.BlobItems {
    			keys = append(keys, *blob.Name)
    		}
    	}
    	return keys, nil
    }

michael-todorovic · 2025-05-26

I preferred a systemic fix for all bucket types to make sure we get the right result. For example, GCS could list recursively just like Azure if this line gets removed (during a future refactoring):

burrito/internal/datastore/storage/gcs/gcs.go

Line 123 in 847331e

    Delimiter: "/",

I found it safer to deal with all cases in a single location while still avoiding a regex to extract the attempt ID :) Maybe the comment is misleading though and could be adjusted. Which way would you prefer?

corrieriluca · 2025-05-26

Okay, I agree on the safety of this systematic fix. The comment is indeed misleading and should be generic, yes 👍

michael-todorovic · 2025-05-27

It should be better now.
I also changed GetAttempts to use []int{} internally so we can sort as integers (so 10 comes after 2) and deduplicate.
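As an illustration of the approach described here, the following sketch parses attempt numbers from listing entries, deduplicates them, and sorts them numerically. It is not the code from the PR: the function name `parseAttempts` and the sample entries are hypothetical, and it assumes entries are full keys starting with the run prefix.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// parseAttempts extracts attempt numbers from listing entries, whether the
// backend returned directory-like prefixes ("<prefix>2/") or full object
// paths ("<prefix>2/plan.bin"). The result is deduplicated and sorted
// numerically, so 10 comes after 2.
func parseAttempts(prefix string, entries []string) []int {
	seen := make(map[int]bool)
	attempts := []int{}
	for _, e := range entries {
		rest := strings.TrimPrefix(e, prefix)
		id := strings.Split(rest, "/")[0]
		n, err := strconv.Atoi(id)
		if err != nil || seen[n] {
			continue // not an attempt number, or already recorded
		}
		seen[n] = true
		attempts = append(attempts, n)
	}
	sort.Ints(attempts)
	return attempts
}

func main() {
	prefix := "layers/ns/layer/run/"
	entries := []string{
		"layers/ns/layer/run/10/plan.bin",
		"layers/ns/layer/run/2/",
		"layers/ns/layer/run/2/run.log",
		"layers/ns/layer/run/1/",
	}
	attempts := parseAttempts(prefix, entries)
	fmt.Println(attempts)                  // [1 2 10]
	fmt.Println(attempts[len(attempts)-1]) // 10
}
```

Sorting the same identifiers as strings would place "10" before "2", which is the ordering problem that the integer slice avoids.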

AlanLonguet · 2025-05-27T07:28:03Z

I see the issue, thanks @michael-todorovic for finding it. The function made a big assumption that all attempts would be stored in the datastore, but we can't rely on that. However, I feel we should put less logic in the generic function that interacts with the backend, and instead make providers compliant with the output we want to obtain from a List function.

michael-todorovic · 2025-05-27T08:16:20Z

> I see the issue, thanks @michael-todorovic for finding it. The function made a big assumption that all attempts would be stored in the datastore, but we can't rely on that. However, I feel we should put less logic in the generic function that interacts with the backend, and instead make providers compliant with the output we want to obtain from a List function.

@corrieriluca suggested it as well, I can adjust accordingly.
However, I have neither Azure nor GCP setups to really test against, so I would use ephemeral containers (azurite, localstack, gcs emulator, minio, ceph), which could be used for unit tests as well. I would do it via docker-compose, as it's well known, rather than testcontainers. This is more work, but no more surprises :)
Wdyt?

AlanLonguet · 2025-05-27T13:07:40Z

If that's OK for you, even if it's a bit of extra work, I'd advocate for that rather than keeping the "sanitizing" logic in the generic GetAttempts. This makes the GetAttempts function easier to test and delegates bugs to the underlying provider implementation.

michael-todorovic · 2025-05-27T13:24:36Z

Sure, we saw in #602 that it could be useful as well 😄
I'll then proceed with:

- List should return only files/folders under prefix, without being recursive
- Implement a ListRecursive later on, when needed. It would show anything under prefix
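One way to hold each backend to the non-recursive contract proposed above is a small conformance check that can be run against the result of any List implementation, for example in tests against storage emulators. This is only a sketch: `checkNonRecursive` is a hypothetical helper, and it assumes List returns full keys that start with the requested prefix.

```go
package main

import (
	"fmt"
	"strings"
)

// checkNonRecursive reports an error if any entry is not an immediate child
// of prefix. A trailing "/" (directory-like entry) is allowed; a "/" anywhere
// else after the prefix means the listing went deeper than one level.
func checkNonRecursive(prefix string, entries []string) error {
	for _, e := range entries {
		if !strings.HasPrefix(e, prefix) {
			return fmt.Errorf("entry %q is not under prefix %q", e, prefix)
		}
		rest := strings.TrimSuffix(e[len(prefix):], "/")
		if rest == "" {
			return fmt.Errorf("entry %q is the prefix itself", e)
		}
		if strings.Contains(rest, "/") {
			return fmt.Errorf("entry %q is nested below %q", e, prefix)
		}
	}
	return nil
}

func main() {
	prefix := "layers/ns/layer/run/"
	shallow := []string{"layers/ns/layer/run/1/", "layers/ns/layer/run/2/"}
	flat := []string{"layers/ns/layer/run/1/plan.bin"}
	fmt.Println(checkNonRecursive(prefix, shallow)) // <nil>
	fmt.Println(checkNonRecursive(prefix, flat))    // reports a nested entry
}
```

With such a check in place, the generic GetAttempts can stay thin and any deviation shows up as a failure in the provider that caused it.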

fix: get the right plan if there were several attempts

c28b1e7

Signed-off-by: Michael Todorovic <[email protected]>

michael-todorovic added 2 commits May 26, 2025 12:03

fix: cleanup

48ecab7

Signed-off-by: Michael Todorovic <[email protected]>

fix: get logs and plan from latest attempt

8b3c02f

Signed-off-by: Michael Todorovic <[email protected]>

fix: manage all backends

bff7694

Signed-off-by: Michael Todorovic <[email protected]>

corrieriluca requested changes May 26, 2025

fix: sort slice and adjust comments

12b7f3f

Signed-off-by: Michael Todorovic <[email protected]>

michael-todorovic marked this pull request as draft May 27, 2025 14:42
