
Hybrid crawler: the concurrency doesn't work the way we think it does #1215


Open
alban-stourbe-wmx opened this issue Mar 18, 2025 · 2 comments
Labels
Type: Bug Inconsistencies or issues which will cause an issue or problem for users or implementors.

Comments

@alban-stourbe-wmx (Contributor)

katana version:

The problem is present in the latest version (v1.1.2).

Current Behavior:

In a hybrid crawl, the default concurrency is 10, which means the browser processes 10 pages at the same time. This is controlled by the `-c, -concurrency int` option (default 10).

However, our investigation shows that concurrency doesn’t work as expected in the browser environment. Here’s why:

With the default setting, 10 pages open in the browser during a crawl, but only the last page opened is actually processed because the browser’s context is tied to that active page.

The navigateRequest function handles this active page by waiting for its WaitLoad and WaitIdle calls to return before it parses the response and completes its execution. Then a new task starts, a new page opens, and the cycle continues. As a result, only the last page opened is processed without any errors.

Meanwhile, the other 9 pages, even though they are loaded and stable, never have their WaitLoad and WaitIdle calls return, because those events only fire on the active (last-opened) page. Consequently, when the timeout is reached, these pages close and a timeout error is returned (this isn't a DOM error but simply a timeout). The cycle then repeats, with 9 pages timing out and a new set of pages opening.
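To make the failure mode concrete, here is a minimal reproduction sketch, assuming go-rod (the browser library katana's hybrid engine appears to build on); the websocket endpoint and target URL are placeholders. With concurrency > 1, only one of the WaitLoad calls is expected to return before the deadline:

```go
// Sketch: open N pages concurrently against an already-running Chrome
// and report which ones receive their load event before the timeout.
package main

import (
	"fmt"
	"sync"
	"time"

	"github.com/go-rod/rod"
)

func main() {
	// Connect to an existing browser, mirroring katana's -cwu option.
	browser := rod.New().
		ControlURL("ws://127.0.0.1:9222/devtools/browser/<id>"). // placeholder endpoint
		MustConnect()

	const concurrency = 10
	var wg sync.WaitGroup
	for i := 0; i < concurrency; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			page := browser.MustPage("https://example.com") // placeholder target
			defer page.Close()

			// Wait for the load event with a deadline, as navigateRequest does.
			if err := page.Timeout(60 * time.Second).WaitLoad(); err != nil {
				// Per the behavior described above, the background pages
				// are expected to end up here: a timeout, not a DOM error.
				fmt.Printf("page %d: %v\n", n, err)
				return
			}
			fmt.Printf("page %d: load event received\n", n)
		}(i)
	}
	wg.Wait()
}
```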

The only way to avoid these errors is to set the concurrency to 1, ensuring that only one page is open and active at a time.
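For reference, the workaround invocation looks like this (the flags are katana's own, the URL is a placeholder):

```
katana -headless -c 1 -u https://example.com -timeout 60
```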

In my tests, when I manually refreshed a page that would otherwise have timed out, it was processed correctly. This confirms that only the active page can be properly processed.

I filed an issue about this before (#919), but at the time I hadn't investigated enough to understand the real problem.

However, I'm now sure the problem comes from the concurrency, and that the error is triggered by the page timeout: the page never receives its load event because it is not the active page in the browser.

Expected Behavior:

All 10 pages would have to be considered active by the browser so that none of them times out. However, I don't know whether true concurrency is actually possible within the browser context.
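One possible direction, purely as an untested sketch reusing the go-rod imports from the reproduction above: explicitly activate each page before waiting on its load event. Since only one target can be in the foreground at a time, this would effectively serialize the waits rather than provide true concurrency, which is exactly the open question:

```go
// Hypothetical mitigation: bring the page to the foreground before
// waiting on its load event, so the browser delivers the event.
func visit(browser *rod.Browser, url string) error {
	page := browser.MustPage(url)
	defer page.Close()

	// Target.activateTarget: make this page the active (foreground) target.
	page.MustActivate()
	return page.Timeout(60 * time.Second).WaitLoad()
}
```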

Steps To Reproduce:

Just run katana in headless mode and set a one-minute timeout. You'll notice that all pages get stuck except for the last one processed. In the results file, every stuck page will show the error `could not get the dom`.

```
katana -headless -u -cwu ws://127.0.0.1:9222/devtools/browser/ea01541b-7878-4465-8a18-ce03133610e8 -no-incognito -depth 1 -jsonl -o result.json -timeout 60 -delay 1 -debug
```

@alban-stourbe-wmx alban-stourbe-wmx added the Type: Bug Inconsistencies or issues which will cause an issue or problem for users or implementors. label Mar 18, 2025
@alban-stourbe-wmx (Contributor, Author)

hi guys ;) @dogancanbakir @Mzack9999

@ehsandeep (Member)

Hey @alban-stourbe-wmx!

Thanks for flagging this. We are aware of the issue and have started testing possible solutions internally. We will update Katana once we ensure it’s working correctly.
