What's the least redundant way to make a site with JavaScript-generated HTML crawlable?

半阙折子戏 · 2021-01-30 09:30

After reading Google's policy on making Ajax-generated content crawlable, along with many developers' blog posts and Stack Overflow Q&A threads on the subject, I'm left wi…

5 Answers
  挽巷 (OP) · 2021-01-30 09:54

    I have found a solution that does not require Java, Node.js, or any other tool to produce a redundant copy of a JS-generated website, and it supports all browsers.

    What you need to do is serve a snapshot to Google. It's the cleanest solution, because you don't need to mess with extra URLs, and you don't have to add a noscript fallback to your base page, which keeps it lighter.

    How do you make a snapshot? PhantomJS, HtmlUnit, and the like require a server where you can install and invoke them; you have to configure them and wire them into your website, which is a mess. Unfortunately, there is no headless browser written in PHP, which is understandable given the nature of PHP.

    So what is the other way of getting a snapshot? Well... when a user opens the website, you can capture what they see with JS (innerHTML).

    So what you need to do is:

    • check whether a snapshot for the current page already exists (if it does, you don't need to take another)
    • send the snapshot to the server to be saved to a file (a PHP script handles the POST request and writes it to disk); a client-side sketch follows this list
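
    A minimal client-side sketch of those two steps, assuming your templating code fires a custom `app-rendered` event when rendering finishes and a `save_snapshot.php` endpoint on the server (both names are placeholders, not something the answer specifies):

    ```javascript
    // Minimal sketch: "app-rendered" and save_snapshot.php are assumed names.
    document.addEventListener('app-rendered', function () {
      var page = encodeURIComponent(location.hash.replace('#!', '') || '/');

      // Step 1: ask the server whether a snapshot for this page exists.
      var check = new XMLHttpRequest();
      check.open('GET', 'save_snapshot.php?exists=1&page=' + page, true);
      check.onload = function () {
        if (check.responseText === '1') { return; } // already have one

        // Step 2: POST the rendered markup (innerHTML) for PHP to save.
        var save = new XMLHttpRequest();
        save.open('POST', 'save_snapshot.php?page=' + page, true);
        save.setRequestHeader('Content-Type', 'text/html');
        save.send(document.documentElement.innerHTML);
      };
      check.send();
    });
    ```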

    Then, when Googlebot visits your hash-bang site, you serve the saved snapshot file for the requested page.
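
    The answer leaves the PHP handler unwritten, so the routing logic is sketched below with Node's built-in modules instead (the `snapshots/` directory and `index.html` fallback are placeholders). Under Google's Ajax crawling scheme, a crawler requests a `#!page` URL as `?_escaped_fragment_=page`:

    ```javascript
    // Sketch of the serving logic (the answer uses PHP; none is shown).
    var http = require('http');
    var fs = require('fs');
    var url = require('url');

    http.createServer(function (req, res) {
      var fragment = url.parse(req.url, true).query._escaped_fragment_;
      if (fragment !== undefined) {
        var file = 'snapshots/' + encodeURIComponent(fragment) + '.html';
        if (fs.existsSync(file)) {
          res.writeHead(200, { 'Content-Type': 'text/html' });
          fs.createReadStream(file).pipe(res);
          return;
        }
      }
      // Failover: no snapshot yet, serve the normal JS-driven page.
      res.writeHead(200, { 'Content-Type': 'text/html' });
      fs.createReadStream('index.html').pipe(res);
    }).listen(8080);
    ```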

    Things to solve:

    • safety: you don't want scripts injected by a user or their browser to be saved into the snapshot; it may be best if only you can generate snapshots (see the sitemap below)
    • compatibility: don't save snapshots from just any browser, only from one that renders your website best
    • don't bother mobile: skip snapshot generation for mobile users so the page isn't slower for them
    • failover: if you don't have a snapshot, output the standard website; that's not great for Google, but it's still better than nothing. A rough sketch of these checks follows this list.
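
    A rough sketch of those checks, meant to run before the snapshot code above. The user-agent patterns are crude heuristics, and the regex-based script stripping is only a first pass at sanitization, not a complete defense:

    ```javascript
    // Illustrative heuristics only, not a definitive detection method.
    function shouldTakeSnapshot() {
      var ua = navigator.userAgent;
      var supported = /Chrome|Firefox/.test(ua);         // compatibility
      var mobile = /Mobi|Android|iPhone|iPad/i.test(ua); // skip mobile
      return supported && !mobile;
    }

    // safety: strip <script> tags so injected code is never persisted
    function sanitizeSnapshot(html) {
      return html.replace(/<script[\s\S]*?<\/script>/gi, '');
    }
    ```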

    There is one more thing: not every page will be visited by users, but you need snapshots ready for Google before it arrives.

    So what do you do? There is a solution for this too:

    • generate a sitemap listing all the pages on your website (it must be generated on the fly to stay up to date, and crawler software won't help here because it doesn't execute JS); a sketch follows this list
    • visit, by any means, the sitemap pages that don't have a snapshot yet; this triggers the snapshot code and generates one properly
    • revisit regularly (daily?)
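
    The answer generates the sitemap in PHP; since no code is shown, here is the shape of it sketched in JavaScript, with `pages.json` and `example.com` standing in for your real page list and domain:

    ```javascript
    // Sketch of an on-the-fly sitemap built from your own page data.
    var fs = require('fs');

    function buildSitemap() {
      var slugs = JSON.parse(fs.readFileSync('pages.json', 'utf8'));
      var urls = slugs.map(function (slug) {
        return '  <url><loc>https://example.com/#!' + slug + '</loc></url>';
      });
      return '<?xml version="1.0" encoding="UTF-8"?>\n' +
             '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n' +
             urls.join('\n') + '\n</urlset>';
    }
    ```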

    But how do you visit all those pages? There are a few options:

    • write an app in Java, C#, or another language that fetches the list of pages from the server and visits each one with a built-in browser control; add it to a schedule on the server
    • write a JS script that opens the required pages in an iframe one after another (sketched after this list); add it to a schedule on any computer
    • or just run that script manually if your site is mostly static
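
    A sketch of the second option: load each page in a hidden iframe so the snapshot code runs for it. The page list and the five-second render delay are assumptions to tune for your site:

    ```javascript
    // Visit each page in turn; the snapshot code inside the iframe
    // captures and POSTs the rendered markup once the page's JS runs.
    (function visitNext(pages) {
      if (pages.length === 0) { return; }
      var frame = document.createElement('iframe');
      frame.style.display = 'none';
      frame.src = pages[0];
      frame.onload = function () {
        setTimeout(function () {      // give the page's JS time to render
          document.body.removeChild(frame);
          visitNext(pages.slice(1));
        }, 5000);
      };
      document.body.appendChild(frame);
    })(['/#!home', '/#!about']); // replace with the pages from your sitemap
    ```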

    Also remember to refresh old snapshots occasionally to keep them up to date.

    I'd like to hear what you think about this solution.
