Load a SPA webpage via AJAX

北战南征 提交于 2019-12-05 01:35:04

You will never be able to fully replicate by yourself what an arbitrary (SPA) page does.

The only way I see is using a headless browser such as PhantomJS or Headless Chrome, or Headless Firefox.

I wanted to try Headless Chrome so let's see what it can do with your page:

Quick check using internal REPL

Load that page with Chrome Headless (you'll need Chrome 59 on Mac/Linux, Chrome 60 on Windows), and find page title with JavaScript from the REPL:

% chrome --headless --disable-gpu --repl https://connect.garmin.com/modern/activity/1915361012
[0830/171405.025582:INFO:headless_shell.cc(303)] Type a Javascript expression to evaluate or "quit" to exit.
>>> $('body').find('.page-title').text().trim() 
{"result":{"type":"string","value":"Daily Mile - Round 2 - Day 27"}}

NB: to get chrome command line working on a Mac I did this beforehand:

alias chrome="'/Applications/Google Chrome.app/Contents/MacOS/Google Chrome'"

Using programmatically with Node & Puppeteer

Puppeteer is a Node library (by Google Chrome developers) which provides a high-level API to control headless Chrome over the DevTools Protocol. It can also be configured to use full (non-headless) Chrome.

(Step 0 : Install Node & Yarn if you don't have them)

In a new directory:

yarn init
yarn add puppeteer

Create index.js with this:

const puppeteer = require('puppeteer');
(async() => {
    const url = 'https://connect.garmin.com/modern/activity/1915361012';
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    // Go to URL and wait for page to load
    await page.goto(url, {waitUntil: 'networkidle'});
    // Wait for the results to show up
    await page.waitForSelector('.page-title');
    // Extract the results from the page
    const text = await page.evaluate(() => {
        const title = document.querySelector('.page-title');
        return title.innerText.trim();
    });
    console.log(`Found: ${text}`);
    browser.close();
})();

Result:

$ node index.js 
Found: Daily Mile - Round 2 - Day 27

First off: avoid eval - your content security policy should block it and it leaves you open to easy XSS attacks. Scraping bots definitely won't run it.

The problem you're describing is common to all SPAs - when a person visits they get your app shell script, which then loads in the rest of the content - all good. When a bot visits they ignore the scripts and return the empty shell.

The solution is server side rendering. One way to do this is if you're using a JS renderer (say React) and Node.js on the server you can fairly easily build the JS and serve it statically.

However, if you aren't then you'll need to run a headless browser on your server that executes all the JS a user would and then serves up the result to the bot.

Fortunately someone else has already done all the work here. They've put a demo online that you can try out with your site:

I think you should know the concept of SPA, SPA is Single Page Application, it is only static html file. when the route changs, the page will create or modify DOM nodes dynamically to achieve the effect of switch page by using Javascript.

Therefore, if you use $.get(), the server will response a static html file that has a stable page, so you won't load what you want.

If you wants to use $.get() , it has two ways, the first is using headless browser, for example, headless chrome, phantomJS and etc. It will help you load the page and you can get dom nodes of the loaded page.The second is SSR (Server Slide Render), if you use SSR, you will get HTML data of page directly by $.get, because the server response HTML data of correspond page when requesting different routes.

Reference:

SSR

the SRR frame of vue: Nuxt.js

PhantomJS

Node API of Headless Chrome

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!