On the same website, Playwright succeeded but PlaywrightCrawler failed with error: page.goto: net::ERR_EMPTY_RESPONSE #2989
Comments
Hello, and thank you for your interest in this project! It seems that the network proxying layer in Crawlee (proxy-chain) exhibits different behaviour than a regular browser while loading the page (https://www.hbhtcm.com).
import { chromium } from 'playwright';
import { Server as ProxyChainServer } from 'proxy-chain';
const server = new ProxyChainServer({
port: 0,
});
await server.listen();
const url = 'https://www.hbhtcm.com';
const browser = await chromium.launch({
headless: false,
proxy: { // comment me out to make this "work"
server: `http://127.0.0.1:${server.port}`, // comment me out to make this "work"
} // comment me out to make this "work"
});
const page = await browser.newPage();
await page.goto(url); // Fails with `page.goto: net::ERR_EMPTY_RESPONSE at https://www.hbhtcm.com/`
await browser.close();
By the way, locally, I didn't manage to load the page even without the proxy (vanilla Playwright). The This might be just those two libraries behaving differently on an empty HTTP response - while
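To make "empty HTTP response" concrete: Chromium reports net::ERR_EMPTY_RESPONSE when the connection is closed before any response bytes arrive. The sketch below reproduces that condition using only Node's built-in modules. It is an illustration under assumptions, not a claim about what https://www.hbhtcm.com or proxy-chain actually does, and the names startEmptyResponseServer, probe and demo are hypothetical.

```javascript
// Sketch: a TCP server that accepts a connection and closes it without
// sending a single byte, i.e. an "empty response" from the client's view.
const net = require('node:net');
const http = require('node:http');

// Starts the server on a random free port and resolves with the server.
function startEmptyResponseServer() {
  return new Promise((resolve) => {
    const server = net.createServer((socket) => {
      // Wait for the request bytes, then hang up without replying.
      socket.once('data', () => socket.destroy());
      socket.on('error', () => {}); // ignore resets caused by the client
    });
    server.listen(0, '127.0.0.1', () => resolve(server));
  });
}

// Sends one GET request and resolves with a description of the outcome.
function probe(port) {
  return new Promise((resolve) => {
    const options = { host: '127.0.0.1', port, path: '/', agent: false };
    const req = http.get(options, (res) => {
      res.resume();
      resolve({ ok: true, status: res.statusCode });
    });
    req.on('error', (err) => {
      resolve({ ok: false, code: err.code, message: err.message });
    });
  });
}

// Runs the server, probes it once, and shuts the server down again.
async function demo() {
  const server = await startEmptyResponseServer();
  try {
    return await probe(server.address().port);
  } finally {
    server.close();
  }
}
```

Calling demo().then(console.log) shows how Node's own HTTP client reports the situation (an error with code ECONNRESET rather than a response). Different clients and proxies surface this same condition under different names, which is one way two libraries can disagree on the same site.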
Hello, thank you for your reply. I have reproduced your code: without the proxy it succeeds, and with the proxy it fails. However, if I first start Fiddler and point the proxy at port 8888 (the port Fiddler listens on), the same code with the proxy succeeds. In the PlaywrightCrawler module, access succeeds when Fiddler is used as the proxy and fails without the Fiddler proxy. I don't know where the problem lies.
const playwright = require('playwright'); (async () => {
})();
Again, using the same code as above, I tested more websites. Result:
// Fiddler listens on port 8888
With plain Playwright, Fiddler can capture the traffic, but with the proxy-chain proxy module, Fiddler cannot capture any packets. From this it can be inferred that the proxy misses something when processing certain responses, which causes page.goto: net::ERR_EMPTY_RESPONSE. However, when the proxy port is pointed at Fiddler's port 8888 so that the traffic passes through Fiddler, Fiddler compensates for whatever is missing and everything works normally again. I don't know what processing the proxy module performs, but these tests show that whatever proxy-chain does not handle properly works again once it goes through Fiddler. It may also be due to differences in the network. If PlaywrightCrawler could capture packets through an intermediary, it should be possible to discover something.
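One way to look at the traffic through an intermediary without Fiddler is a small logging proxy. The following is a minimal sketch built on Node's built-in modules only. It is neither proxy-chain nor Fiddler and makes no claim about how either works internally. It handles only HTTPS tunnels (CONNECT), and createLoggingProxy with its log entries is a hypothetical helper introduced for illustration.

```javascript
const http = require('node:http');
const net = require('node:net');

// Creates a forward proxy that only handles CONNECT tunnels and records,
// per tunnel, how many bytes flowed in each direction.
function createLoggingProxy(log = []) {
  // Plain HTTP proxying is out of scope for this sketch.
  const proxy = http.createServer((req, res) => res.writeHead(501).end());

  proxy.on('connect', (req, clientSocket, head) => {
    // req.url has the form "host:port" for CONNECT requests.
    const separator = req.url.lastIndexOf(':');
    const host = req.url.slice(0, separator).replace(/^\[|\]$/g, '');
    const port = Number(req.url.slice(separator + 1));
    const entry = { target: req.url, sent: 0, received: 0, error: null };
    log.push(entry);

    let established = false;
    const upstream = net.connect(port, host, () => {
      established = true;
      clientSocket.write('HTTP/1.1 200 Connection Established\r\n\r\n');
      if (head.length > 0) {
        entry.sent += head.length;
        upstream.write(head);
      }
      // Count bytes, then relay them unchanged in both directions.
      clientSocket.on('data', (chunk) => { entry.sent += chunk.length; });
      upstream.on('data', (chunk) => { entry.received += chunk.length; });
      clientSocket.pipe(upstream);
      upstream.pipe(clientSocket);
    });

    upstream.on('error', (err) => {
      entry.error = err.code || err.message;
      if (established) clientSocket.destroy();
      else clientSocket.end('HTTP/1.1 502 Bad Gateway\r\n\r\n');
    });
    clientSocket.on('error', () => upstream.destroy());
    clientSocket.on('close', () => upstream.destroy());
    upstream.on('close', () => clientSocket.destroy());
  });

  return { proxy, log };
}
```

After proxy.listen(0, '127.0.0.1'), the port from proxy.address().port can be used in the proxy.server launch option in place of the proxy-chain port. A tunnel whose entry shows received: 0 or a non-null error would indicate that the target closed the connection without sending any data through this proxy as well.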
Which package is this bug report for? If unsure which one to select, leave blank
@crawlee/playwright (PlaywrightCrawler)
Issue description
run code:
Console output:
use Playwright is success
INFO PlaywrightCrawler: Starting the crawler.
use Playwright message: 湖北省中医院
use crawlee is fail
WARN PlaywrightCrawler: Reclaiming failed request back to the list or queue. page.goto: net::ERR_EMPTY_RESPONSE at https://www.hbhtcm.com/
Call log:
{"id":"dcHa1anqjn3kF91","url":"https://www.hbhtcm.com","retryCount":1}
ERROR PlaywrightCrawler: Request failed and reached maximum retries. page.goto: net::ERR_EMPTY_RESPONSE at https://www.hbhtcm.com/
Call log:
navigating to "https://www.hbhtcm.com/", waiting until "load"
at gotoExtended (D:\crawleeRunVersion\notice_project\crawlee1.1.2\node_modules\@crawlee\playwright\internals\utils\playwright-utils.js:165:17)
at PlaywrightCrawler._navigationHandler (D:\crawleeRunVersion\notice_project\crawlee1.1.2\node_modules\@crawlee\playwright\internals\playwright-crawler.js:117:52)
at PlaywrightCrawler._handleNavigation (D:\crawleeRunVersion\notice_project\crawlee1.1.2\node_modules\@crawlee\browser\internals\browser-crawler.js:331:52)
at async PlaywrightCrawler._runRequestHandler (D:\crawleeRunVersion\notice_project\crawlee1.1.2\node_modules\@crawlee\browser\internals\browser-crawler.js:260:13)
at async PlaywrightCrawler._runRequestHandler (D:\crawleeRunVersion\notice_project\crawlee1.1.2\node_modules\@crawlee\playwright\internals\playwright-crawler.js:114:9)
at async wrap (D:\crawleeRunVersion\notice_project\crawlee1.1.2\node_modules\@apify\timeout\cjs\index.cjs:54:21) {"id":"dcHa1anqjn3kF91","url":"https://www.hbhtcm.com","method":"GET","uniqueKey":"https://www.hbhtcm.com"}
So, on the same website, Playwright succeeded but PlaywrightCrawler failed with the error page.goto: net::ERR_EMPTY_RESPONSE.
This is very strange.
Running the same site with Puppeteer also succeeds.
With Fiddler running and the proxy set to 127.0.0.1:8888, PlaywrightCrawler succeeds again and packet capture works normally.
This is a very strange phenomenon, and I would like to know the reason.
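When comparing runs across Playwright, Puppeteer and PlaywrightCrawler, it can help to pull the Chromium net error code out of the error message so that results are easy to tabulate. The helper below is a hypothetical sketch, not part of Crawlee or Playwright. It assumes the error message contains a net::ERR_... token, as in the console output above.

```javascript
// Hypothetical helper: extracts the Chromium net error code (for example
// "ERR_EMPTY_RESPONSE") from a navigation error, or returns null if the
// message does not contain one.
function getNetErrorCode(error) {
  const message = error instanceof Error ? error.message : String(error);
  const match = /net::(ERR_[A-Z0-9_]+)/.exec(message);
  return match ? match[1] : null;
}
```

For the failure reported here, getNetErrorCode(error) returns the string ERR_EMPTY_RESPONSE, which can be logged next to the tool and proxy configuration used for each attempt.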
Code sample
Package version
[email protected] D:\crawleeRunVersion\notice_project\crawlee1.1.2 ├── @crawlee/[email protected] ├── [email protected] ├── [email protected] ├── [email protected] ├── [email protected] ├── [email protected] ├── [email protected] ├── [email protected] ├── [email protected] ├── [email protected] ├── [email protected] ├── [email protected] ├── [email protected] ├── [email protected] ├── [email protected] ├── [email protected] ├── [email protected] ├── [email protected] ├── [email protected] ├── [email protected] ├── [email protected] └── [email protected]
Node.js version
Node v22.12.0
Operating system
windows
Apify platform
I have tested this on the next release
No response
Other context
No response