
On the same website, Playwright succeeded but PlaywrightCrawler failed with error: page.goto: net::ERR_EMPTY_RESPONSE #2989


Open
jeff1998-git opened this issue May 28, 2025 · 3 comments
Labels
bug Something isn't working. t-tooling Issues with this label are in the ownership of the tooling team.

Comments

@jeff1998-git

Which package is this bug report for? If unsure which one to select, leave blank

@crawlee/playwright (PlaywrightCrawler)

Issue description

Running the code sample below produces the following console output.

Using Playwright succeeds:
INFO PlaywrightCrawler: Starting the crawler.
use Playwright message: 湖北省中医院

Using Crawlee (PlaywrightCrawler) fails:
WARN PlaywrightCrawler: Reclaiming failed request back to the list or queue. page.goto: net::ERR_EMPTY_RESPONSE at https://www.hbhtcm.com/
Call log:

ERROR PlaywrightCrawler: Request failed and reached maximum retries. page.goto: net::ERR_EMPTY_RESPONSE at https://www.hbhtcm.com/
Call log:

  • navigating to "https://www.hbhtcm.com/", waiting until "load"

    at gotoExtended (D:\crawleeRunVersion\notice_project\crawlee1.1.2\node_modules\@crawlee\playwright\internals\utils\playwright-utils.js:165:17)
    at PlaywrightCrawler._navigationHandler (D:\crawleeRunVersion\notice_project\crawlee1.1.2\node_modules\@crawlee\playwright\internals\playwright-crawler.js:117:52)
    at PlaywrightCrawler._handleNavigation (D:\crawleeRunVersion\notice_project\crawlee1.1.2\node_modules\@crawlee\browser\internals\browser-crawler.js:331:52)
    at async PlaywrightCrawler._runRequestHandler (D:\crawleeRunVersion\notice_project\crawlee1.1.2\node_modules\@crawlee\browser\internals\browser-crawler.js:260:13)
    at async PlaywrightCrawler._runRequestHandler (D:\crawleeRunVersion\notice_project\crawlee1.1.2\node_modules\@crawlee\playwright\internals\playwright-crawler.js:114:9)
    at async wrap (D:\crawleeRunVersion\notice_project\crawlee1.1.2\node_modules\@apify\timeout\cjs\index.cjs:54:21) {"id":"dcHa1anqjn3kF91","url":"https://www.hbhtcm.com","method":"GET","uniqueKey":"https://www.hbhtcm.com"}

So: on the same website, plain Playwright succeeds but PlaywrightCrawler fails with the error page.goto: net::ERR_EMPTY_RESPONSE.
This is very strange. Running the same site with Puppeteer also succeeds.
When I use Fiddler and point the proxy at 127.0.0.1:8888, PlaywrightCrawler succeeds again and the captured packets look normal.
This is a very strange phenomenon and I would like to know the reason.

Code sample

import { PlaywrightCrawler } from 'crawlee';
import * as playwright from 'playwright';

const url = 'https://www.hbhtcm.com';    // fail
// const url = 'https://www.baidu.com';  // success

// 1. use crawlee
const crawleeTest = new PlaywrightCrawler({

    useSessionPool: false,
    navigationTimeoutSecs: 120,
    requestHandlerTimeoutSecs: 300,
    maxRequestsPerCrawl: 50,

    launchContext: {
        // Launch options here are passed through to the Playwright launcher
        launchOptions: {
            // Other playwright options work as usual
            headless: true,
            channel: 'chrome',
            launcher: playwright.chromium,
            args: [
                '--disable-http2', // disable HTTP/2, force HTTP/1.1

                '--no-sandbox',
                '--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
                '--disable-blink-features=AutomationControlled',
                "--ignore-certificate-errors",
                "--ignore-certificate-errors-spki-list",

            ]
        },
    },

    // ------------------------------------------
    preNavigationHooks: [
        async ({ page }) => {
            // set headers
            await page.setExtraHTTPHeaders({
                'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36 Edg/131.0.0.0',
                'sec-ch-ua': '"Microsoft Edge";v="131", "Chromium";v="131", "Not_A Brand";v="24"',
                'sec-ch-ua-mobile': '?0',
                'sec-ch-ua-platform': '"Windows"',
                'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
                'Accept-Encoding': 'gzip, deflate, br, zstd',
                'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6',
                'Cache-Control': 'max-age=0',
                'Connection': 'keep-alive',
                'Sec-Fetch-Dest': 'document',
                'Sec-Fetch-Mode': 'navigate',
                'Sec-Fetch-Site': 'none',
                'Sec-Fetch-User': '?1',
                'Upgrade-Insecure-Requests': '1',
            });

            // set Cookie
            // await page.context().addCookies([
            //     { name: 'view', value: '1748198195', domain: 'www.hbhtcm.com', path: '/' },
            //     { name: 'PHPSESSID', value: 'aa5m36c3b87m38ofumijqilbvm', domain: 'www.hbhtcm.com', path: '/' },
            // ]);
        },
    ],
    // --------------------------------------------

    async requestHandler({ request, page, enqueueLinks, log }) {
        log.info(`Processing ${request.url}...`);
        const title = await page.title()
        console.log("use PlaywrightCrawler message::", title);
    },

    // This function is called if the page processing failed more than maxRequestRetries+1 times.
    failedRequestHandler({ request, log }) {
        log.info(`Request ${request.url} failed too many times.`);
    },
});

await crawleeTest.addRequests([url]);

// ==================================================================
// 2. use Playwright 
const playwrightTest = async () => {

    const browser = await playwright.chromium.launch({
        headless: true,
    });

    const page = await browser.newPage();
    await page.goto(url);
    console.log('use Playwright message:', await page.title());
    await browser.close();
};

// Running results:
await Promise.all([crawleeTest.run(), playwrightTest()]);

Package version

[email protected] D:\crawleeRunVersion\notice_project\crawlee1.1.2 ├── @crawlee/[email protected] ├── [email protected] ├── [email protected] ├── [email protected] ├── [email protected] ├── [email protected] ├── [email protected] ├── [email protected] ├── [email protected] ├── [email protected] ├── [email protected] ├── [email protected] ├── [email protected] ├── [email protected] ├── [email protected] ├── [email protected] ├── [email protected] ├── [email protected] ├── [email protected] ├── [email protected] ├── [email protected] └── [email protected]

Node.js version

Node v22.12.0

Operating system

Windows

Apify platform

  • Tick me if you encountered this issue on the Apify platform

I have tested this on the next release

No response

Other context

No response

@jeff1998-git jeff1998-git added the bug Something isn't working. label May 28, 2025
@github-actions github-actions bot added the t-tooling Issues with this label are in the ownership of the tooling team. label May 28, 2025
@barjin
Contributor

barjin commented May 29, 2025

Hello, and thank you for your interest in this project!

It seems that the network proxying layer in Crawlee (proxy-chain) exhibits different behaviour than a regular browser while loading the page (https://www.hbhtcm.com).

import { chromium } from 'playwright';
import { Server as ProxyChainServer } from 'proxy-chain';

const server = new ProxyChainServer({
    port: 0,
});
await server.listen();

const url = 'https://www.hbhtcm.com';

const browser = await chromium.launch({
    headless: false,
    proxy: {                                       // comment me out to make this "work"
        server: `http://127.0.0.1:${server.port}`, // comment me out to make this "work"
    }                                              // comment me out to make this "work"
});

const page = await browser.newPage();
await page.goto(url); // Fails with `page.goto: net::ERR_EMPTY_RESPONSE at https://www.hbhtcm.com/`
await browser.close();

By the way, locally, I didn't manage to load the page even without the proxy (vanilla Playwright). The page.goto call has timed out for me. Can you confirm that your Playwright instance can load the page contents (without Crawlee)?

This might be just those two libraries behaving differently on an empty HTTP response - while proxy-chain fails immediately, the browser might try to wait for the response for longer. If this is the case, it's IMO wontfix, as both behaviours seem reasonable.
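A minimal way to check this outside both libraries would be to probe the origin with Node's built-in https module and watch whether any response bytes arrive before the connection closes. This is only a rough diagnostic sketch (not something Crawlee or proxy-chain provides); the URL is the one from the report:

import https from 'node:https';

// Request the page directly (no browser, no proxy) and log whether the server
// sends any response at all or simply closes the connection.
const req = https.request('https://www.hbhtcm.com/', { method: 'GET' }, (res) => {
    console.log('status:', res.statusCode);
    console.log('headers:', res.headers);
    res.resume();
    res.on('end', () => console.log('response ended normally'));
});

req.on('error', (err) => {
    // A connection closed with zero response bytes typically surfaces here
    // (e.g. ECONNRESET), which the browser reports as net::ERR_EMPTY_RESPONSE.
    console.error('request failed:', err.code ?? err.message);
});

req.end();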

@jeff1998-git
Author

Hello, thank you for your reply. I reproduced your code: without the proxy it succeeds, and with the proxy it fails. However, if I start Fiddler first and point the proxy at Fiddler's working port 8888, the proxied run succeeds as well.

Likewise with the PlaywrightCrawler module: when Fiddler is used as the proxy, access succeeds; without the Fiddler proxy, it fails (a minimal Crawlee configuration for this is sketched after the snippet below).

I don't know where the problem lies.

const playwright = require('playwright');
const { Server: ProxyChainServer } = require('proxy-chain');

(async () => {
    // create a local proxy server
    const server = new ProxyChainServer({
        port: 0,
        prepareRequestFunction: ({ request }) => {
            return {
                upstreamProxyUrl: null,
            };
        },
    });

    await server.listen();
    console.log(`proxy running, port: ${server.port}`);

    const url = 'https://www.hbhtcm.com';

    try {
        // 1. launch the browser with the proxy
        const browser = await playwright.chromium.launch({
            headless: false,
            proxy: {
                server: `http://127.0.0.1:${server.port}`,
            },
        });

        const page = await browser.newPage();

        console.log(`accessing page: ${url}`);

        await page.goto(url, { timeout: 60000 });  // Fails with `page.goto: net::ERR_EMPTY_RESPONSE at https://www.hbhtcm.com/`

        console.log('page loaded successfully...');

        await page.waitForTimeout(5000);

        await browser.close();
    } catch (error) {
        console.error(`Navigation with proxy failed: ${error.message}`);

        // 2. try again --- launch the browser without the proxy
        // to determine whether the proxy is the cause
        try {
            console.log('trying without a proxy...');
            const directBrowser = await playwright.chromium.launch({ headless: false });
            const directPage = await directBrowser.newPage();
            await directPage.goto(url, { timeout: 60000 });
            console.log('Without a proxy: success...');   // output: Without a proxy: success...
            await directBrowser.close();
        } catch (directError) {
            console.error(`Without a proxy it fails too: ${directError.message}`);
            console.error('What is the problem....');
        }
    } finally {
        await server.close();
        console.log('proxy server closed');
    }
})();
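For completeness, here is a minimal sketch (my own addition, not code from the thread) of how the Fiddler workaround could be expressed with Crawlee itself. It assumes Fiddler is running on 127.0.0.1:8888 and that launchOptions are passed straight through to Playwright, so the browser talks to Fiddler directly instead of going through proxy-chain:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    launchContext: {
        launchOptions: {
            headless: true,
            // Route the browser through Fiddler directly, bypassing proxy-chain.
            proxy: { server: 'http://127.0.0.1:8888' },
            // Fiddler decrypts HTTPS with its own certificate, so ignore cert errors.
            args: ['--ignore-certificate-errors'],
        },
    },
    async requestHandler({ request, page, log }) {
        log.info(`Processed ${request.url}: ${await page.title()}`);
    },
});

await crawler.run(['https://www.hbhtcm.com']);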

@jeff1998-git
Author

Again, using the same code as above, I tested more websites. Results:

// const url = 'https://www.google.com';     // use proxy is fail    && no proxy is success  ||  Fiddler port->8888 success
// const url = 'https://www.youtube.com/';   // use proxy is fail    && no proxy is success  ||  Fiddler port->8888 success
// const url = 'https://www.baidu.com/';     // use proxy is success && no proxy is success  ||  Fiddler port->8888 success
// const url = 'https://www.163.com/'        // use proxy is success && no proxy is success  ||  Fiddler port->8888 success
// const url = 'https://www.bing.com/';      // use proxy is success && no proxy is success  ||  Fiddler port->8888 success
// const url = 'https://www.hbhtcm.com';     // use proxy is fail    && no proxy is success  ||  Fiddler port->8888 success

// Fiddler is listening on port 8888
// use port 8888, where the Fiddler software is listening
const server = new ProxyChainServer({
    port: 8888,
    prepareRequestFunction: ({ request }) => {
        return {
            upstreamProxyUrl: null,
        };
    },
});
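A rough sketch that automates this comparison (using only the playwright and proxy-chain packages already shown above; nothing here comes from Crawlee): each URL is loaded once through a fresh proxy-chain server and once directly, and the outcome is printed.

import { chromium } from 'playwright';
import { Server as ProxyChainServer } from 'proxy-chain';

const urls = [
    'https://www.google.com',
    'https://www.youtube.com/',
    'https://www.baidu.com/',
    'https://www.163.com/',
    'https://www.bing.com/',
    'https://www.hbhtcm.com',
];

const server = new ProxyChainServer({ port: 0 });
await server.listen();

// Load a URL in a fresh browser, optionally through the local proxy, and report the outcome.
const tryUrl = async (url, proxy) => {
    const browser = await chromium.launch(proxy ? { headless: true, proxy } : { headless: true });
    const page = await browser.newPage();
    try {
        await page.goto(url, { timeout: 60000 });
        return 'ok';
    } catch (err) {
        return `fail (${err.message.split('\n')[0]})`;
    } finally {
        await browser.close();
    }
};

for (const url of urls) {
    const viaProxy = await tryUrl(url, { server: `http://127.0.0.1:${server.port}` });
    const direct = await tryUrl(url);
    console.log(`${url}  via proxy-chain: ${viaProxy}  direct: ${direct}`);
}

await server.close();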

With plain Playwright, Fiddler can capture the data packets, but when the proxy-chain module is used, Fiddler cannot capture anything.
Fiddler also cannot capture the packets when the PlaywrightCrawler module is used. I suspect PlaywrightCrawler wraps proxy-chain internally, which would explain why Fiddler cannot see PlaywrightCrawler's traffic.

It can be inferred that this kind of proxy misses something when handling certain responses, causing page.goto: net::ERR_EMPTY_RESPONSE. However, when the proxy is pointed at Fiddler's working port 8888 and the traffic passes through Fiddler, Fiddler apparently compensates for whatever was missed and everything works normally again.

Although I don't know what processing the proxy module actually does, these tests show that whatever proxy-chain fails to handle properly is handled correctly once the traffic passes through Fiddler.

It may also be due to differences in the network. If PlaywrightCrawler's traffic could be inspected through an intermediary, something more might be discovered.
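One more experiment that might narrow this down (again only a sketch assembled from the snippets above, with Fiddler assumed to be on 127.0.0.1:8888): keep proxy-chain in the path but forward everything to Fiddler as an upstream proxy via upstreamProxyUrl. If this succeeds while the plain proxy-chain run fails, the difference lies in how proxy-chain talks to the origin rather than in the browser.

import { chromium } from 'playwright';
import { Server as ProxyChainServer } from 'proxy-chain';

const server = new ProxyChainServer({
    port: 0,
    prepareRequestFunction: ({ request }) => {
        // Log what proxy-chain sees and forward the tunnel to Fiddler.
        console.log('proxying', request.url);
        return { upstreamProxyUrl: 'http://127.0.0.1:8888' };
    },
});
await server.listen();

const browser = await chromium.launch({
    headless: false,
    proxy: { server: `http://127.0.0.1:${server.port}` },
});
const page = await browser.newPage();
await page.goto('https://www.hbhtcm.com', { timeout: 60000 });
console.log('title:', await page.title());
await browser.close();
await server.close();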
