
Hybrid crawler: the concurrency doesn't work the way we think it does #1215


Open
alban-stourbe-wmx opened this issue Mar 18, 2025 · 2 comments
Labels
Type: Bug Inconsistencies or issues which will cause an issue or problem for users or implementors.

Comments

@alban-stourbe-wmx (Contributor)

katana version:

The problem is present in the latest version (v1.1.2).

Current Behavior:

In a hybrid crawl, the default concurrency is 10, which means the browser processes 10 pages at the same time. This is controlled by the `-c, -concurrency int` option (default 10).

However, our investigation shows that concurrency doesn’t work as expected in the browser environment. Here’s why:

With the default setting, 10 pages open in the browser during a crawl, but only the last page opened is actually processed because the browser’s context is tied to that active page.

The navigateRequest function handles this active page by waiting for its WaitLoad and WaitIdle calls to return before it parses the response and completes its execution. Then a new task starts, a new page opens, and the cycle continues. As a result, only the last page opened is processed without any errors.

Meanwhile, the other 9 pages, even though they are loaded and stable, never have their WaitLoad and WaitIdle calls return, because those events only fire on the active (last-opened) page. Consequently, when the timeout is reached, these pages close and a timeout error is returned (this isn't a DOM error but simply a timeout). The cycle then repeats, with 9 pages timing out and a new set of pages opening.
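To make the failure mode concrete, here is a minimal reproduction sketch, assuming go-rod (the browser library katana's hybrid engine appears to build on); the websocket endpoint and target URL are placeholders. With concurrency > 1, only one of the WaitLoad calls is expected to return before the deadline:

```go
// Sketch: open N pages concurrently against an already-running Chrome
// and report which ones receive their load event before the timeout.
package main

import (
	"fmt"
	"sync"
	"time"

	"github.com/go-rod/rod"
)

func main() {
	// Connect to an existing browser, mirroring katana's -cwu option.
	browser := rod.New().
		ControlURL("ws://127.0.0.1:9222/devtools/browser/<id>"). // placeholder endpoint
		MustConnect()

	const concurrency = 10
	var wg sync.WaitGroup
	for i := 0; i < concurrency; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			page := browser.MustPage("https://example.com") // placeholder target
			defer page.Close()

			// Wait for the load event with a deadline, as navigateRequest does.
			if err := page.Timeout(60 * time.Second).WaitLoad(); err != nil {
				// Per the behavior described above, the background pages
				// are expected to end up here: a timeout, not a DOM error.
				fmt.Printf("page %d: %v\n", n, err)
				return
			}
			fmt.Printf("page %d: load event received\n", n)
		}(i)
	}
	wg.Wait()
}
```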

The only way to avoid these errors is to set the concurrency to 1, ensuring that only one page is open and active at a time.
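For reference, the workaround invocation looks like this (the flags are katana's own, the URL is a placeholder):

```
katana -headless -c 1 -u https://example.com -timeout 60
```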

In my tests, when I manually refreshed a page that would otherwise have timed out, it was processed correctly. This confirms that only the active page can be properly processed.

I filed an issue about this before (#919), but at the time I hadn't investigated enough to understand the real problem.

However, I'm now sure the problem comes from the concurrency, and that the error is triggered by the page timeout: the page never receives its load event because it is not the active page in the browser.

Expected Behavior:

All 10 pages would have to be considered active by the browser so that none of them times out. However, I don't know whether true concurrency is actually possible within the browser context.
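One possible direction, purely as an untested sketch reusing the go-rod imports from the reproduction above: explicitly activate each page before waiting on its load event. Since only one target can be in the foreground at a time, this would effectively serialize the waits rather than provide true concurrency, which is exactly the open question:

```go
// Hypothetical mitigation: bring the page to the foreground before
// waiting on its load event, so the browser delivers the event.
func visit(browser *rod.Browser, url string) error {
	page := browser.MustPage(url)
	defer page.Close()

	// Target.activateTarget: make this page the active (foreground) target.
	page.MustActivate()
	return page.Timeout(60 * time.Second).WaitLoad()
}
```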

Steps To Reproduce:

Just run katana in headless mode and set a one-minute timeout. You'll notice that all pages get stuck except for the last one processed. In the results file, every stuck page will show the error `could not get the dom`.

```
katana -headless -u -cwu ws://127.0.0.1:9222/devtools/browser/ea01541b-7878-4465-8a18-ce03133610e8 -no-incognito -depth 1 -jsonl -o result.json -timeout 60 -delay 1 -debug
```

@alban-stourbe-wmx alban-stourbe-wmx added the Type: Bug Inconsistencies or issues which will cause an issue or problem for users or implementors. label Mar 18, 2025
@alban-stourbe-wmx (Contributor, Author)

hi guys ;) @dogancanbakir @Mzack9999

@ehsandeep (Member)

Hey @alban-stourbe-wmx!

Thanks for flagging this. We are aware of the issue and have started testing possible solutions internally. We will update Katana once we ensure it’s working correctly.
