katana version:
The problem is linked to the latest version (v1.1.2)
Current Behavior:
In a hybrid crawl, the default setting for concurrency is 10, which means that 10 pages are processed by the browser at the same time. This is controlled by the option -c, -concurrency int (default 10).
However, our investigation shows that concurrency doesn’t work as expected in the browser environment. Here’s why:
With the default setting, 10 pages open in the browser during a crawl, but only the last page opened is actually processed because the browser’s context is tied to that active page.
The navigateRequest function handles this active page by waiting for its WaitLoad and WaitIdle functions to finish before it parses the request and completes its execution. Then, a new process starts and a new page opens, continuing the cycle. As a result, only the last page opened is processed without any errors.
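For context, this is roughly the wait pattern being described, sketched with go-rod (the headless library katana builds on). It is only a minimal illustration under my own assumptions, not katana's actual navigateRequest code; the package and function names are hypothetical.

```go
// Hypothetical sketch (package and function names are mine, not katana's).
package crawlsketch

import (
	"time"

	"github.com/go-rod/rod"
	"github.com/go-rod/rod/lib/proto"
)

// visit opens a tab for url and blocks until the page's load event and an
// idle callback fire, or until the timeout cancels the page's context.
func visit(browser *rod.Browser, url string, timeout time.Duration) error {
	page, err := browser.Page(proto.TargetCreateTarget{URL: url})
	if err != nil {
		return err
	}
	defer page.Close()

	// Every Wait* call below inherits this deadline.
	page = page.Timeout(timeout)

	// These waits only return once the events actually fire. If the tab is
	// never the active one, the events may never arrive and the deadline hits.
	if err := page.WaitLoad(); err != nil {
		return err
	}
	if err := page.WaitIdle(timeout); err != nil {
		return err
	}

	// ...parse the DOM, extract links, etc.
	return nil
}
```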
Meanwhile, the other 9 pages, even though they are loaded and stable, never trigger their WaitLoad and WaitIdle functions because these events only occur on the active (last opened) page. Consequently, when the timeout is reached, these pages close and a timeout error is returned (this isn’t a DOM error but simply a timeout). The cycle then repeats with 9 pages timing out and a new set of pages opening.
The only way to avoid these errors is to set the concurrency to 1, ensuring that only one page is open and active at a time.
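To make the relationship between the -c value and the number of open tabs concrete, here is a hedged worker-pool sketch (same hypothetical package as above, reusing the visit helper, plus "log" and "sync" imports). The semaphore size plays the role of the concurrency flag; setting it to 1 reproduces the only configuration that avoids the timeouts.

```go
// crawlAll caps in-flight tabs at `concurrency`, mirroring katana's -c flag.
// With concurrency = 1 only one tab exists at a time.
func crawlAll(browser *rod.Browser, urls []string, concurrency int, timeout time.Duration) {
	sem := make(chan struct{}, concurrency) // counting semaphore: one slot per open tab
	var wg sync.WaitGroup
	for _, u := range urls {
		wg.Add(1)
		sem <- struct{}{} // block until a tab slot is free
		go func(u string) {
			defer wg.Done()
			defer func() { <-sem }() // free the slot once the page is done
			if err := visit(browser, u, timeout); err != nil {
				log.Printf("visit %s: %v", u, err) // e.g. the timeout errors described above
			}
		}(u)
	}
	wg.Wait()
}
```

Whether katana's own worker pool looks like this I can't say; the point is only that each unit of concurrency corresponds to an extra open tab competing to be the "active" one.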
In my tests, when I manually refreshed a page that would otherwise have timed out, it was processed correctly. This confirms that only the active page can be properly processed.
I had written up this issue before, but I hadn't investigated it enough to understand the real problem: #919
Now, however, I'm sure that the problem comes from the concurrency: the error is triggered by the page timeout, because the page never receives its load event while it isn't the active page in the browser.
Expected Behavior:
All 10 pages would have to be considered active by the browser so that none of them times out. However, I don't know whether that kind of concurrency is really possible within the browser context.
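One possible direction, purely as an untested sketch and assuming go-rod's Page.Activate behaves as I expect, would be to bring each tab to the foreground before waiting on it:

```go
// visitActive is a variant of the visit sketch above (same hypothetical
// package). This is an untested idea, not a confirmed fix; with several tabs
// open concurrently they would keep stealing focus from one another, so it
// may only move the problem around.
func visitActive(browser *rod.Browser, url string, timeout time.Duration) error {
	page, err := browser.Page(proto.TargetCreateTarget{URL: url})
	if err != nil {
		return err
	}
	defer page.Close()
	page = page.Timeout(timeout)

	// Ask the browser to make this tab the active one so that, if the
	// hypothesis above is right, its load/idle events can actually fire.
	if _, err := page.Activate(); err != nil {
		return err
	}

	if err := page.WaitLoad(); err != nil {
		return err
	}
	return page.WaitIdle(timeout)
}
```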
Steps To Reproduce:
Just run katana in headless mode and set a one-minute timeout. You'll notice that all pages get stuck except for the last one processed. In the results file, every stuck page will show the error: could not get the dom.
katana -headless -u -cwu ws://127.0.0.1:9222/devtools/browser/ea01541b-7878-4465-8a18-ce03133610e8 -no-incognito -depth 1 -jsonl -o result.json -timeout 60 -delay 1 -debug
Thanks for flagging this. We are aware of the issue and have started testing possible solutions internally. We will update Katana once we ensure it’s working correctly.