Copy innerHTML to a text file daily using JavaScript

Question

I am trying to write a JavaScript program that grabs the inner HTML of the top news story on the BBC website (http://www.bbc.co.uk/news) and puts it in a .txt document. I don't know much about JavaScript; I know more about .BAT and .VBS, but I know that they can't do this.

I'm not sure how to approach this. I thought of making it scan for a fixed outerHTML string and then copy the inner HTML to a text file.

However, I can't seem to find an outerHTML string that stays the same every day. For example, this is today's title:

<span class="title-link__title-text">Benefit plan 'could hit young Britons'</span>

As you can see, the headline itself is part of it.

I'm using Firefox, if that makes a difference.

Any help would be much appreciated.

Regards,

Master-chip.

Answer 1:

Pure client-side browser approach:

OK, I made this fiddle for you; it may help others too. This was an interesting challenge. Here is how I arrived at a possible solution:

Used the Blob API (part of the File API rather than ECMAScript 5) to create the text file on the fly.
Loaded http://www.bbc.co.uk/news in an iframe (cross-origin access; see the Note section below).
On the iframe's load event, started a timer with setTimeout, or with setInterval (commented out) for repeated execution hourly or daily; adjust the delay as needed.
Queried the headline elements with querySelectorAll(".title-link span") on the iframe's document, a selector that looked generic enough after examining the page source.
Check out the fiddle link.

JavaScript:

(function () {
    var textFile = null,
        makeTextFile = function (text) {
            var data = new Blob([text], {
                type: 'text/plain'
            });

            // If we are replacing a previously generated file we need to
            // manually revoke the object URL to avoid memory leaks.
            if (textFile !== null) {
                window.URL.revokeObjectURL(textFile);
            }

            textFile = window.URL.createObjectURL(data);

            return textFile;
        };

    var iframe = document.getElementById('frame');
    var commFunc = function () {
        // Look the iframe up again so we read the freshly loaded DOM.
        var iframe2 = document.getElementById('frame');
        var innerDoc = iframe2.contentDocument || iframe2.contentWindow.document;
        var spans = innerDoc.querySelectorAll(".title-link span");
        var dummy = "";
        // Index loop: for...in on an array would also visit inherited enumerable keys.
        for (var i = 0; i < spans.length; i++) {
            // textContent also works in Firefox versions that lack innerText.
            dummy += "\n" + spans[i].textContent;
        }
        var link = document.createElement("a");
        link.href = makeTextFile(dummy);
        link.download = "sample.txt";
        // Some browsers (older Firefox among them) ignore click() on a link
        // that is not attached to the document.
        document.body.appendChild(link);
        link.click();
        document.body.removeChild(link);
        console.log("Downloaded the sample.txt file");
    };

    iframe.onload = function () {
        setTimeout(commFunc, 1000); // Adjust to the time the page needs to load
        //setInterval(commFunc, 1000);
    };

    // The button triggers the download manually; click it once the page inside the iframe has loaded.
    document.getElementById('create').addEventListener('click', commFunc);
})();

HTML:

<span class="title-link__title-text">Benefit plan 'could hit young Britons'</span>
    <div>
        <iframe id="frame" src="http://www.bbc.co.uk/news"></iframe>
    </div>
    <button id="create">Download</button>

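Note:

Reading the document of a cross-origin iframe is blocked by the same-origin policy, so to run the above JavaScript in Chrome you need to disable web security. The script was reported to run in Firefox without tweaks, but Firefox enforces the same-origin policy as well, so expect the same restriction there.
This is only an illustration of what pure browser scripting can achieve. The tab must stay open and active for periodic grabbing.
Targeted at modern browsers.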

Suggested approach:

Use a Node.js server; the above script can be modified to run standalone.
Or use any server-side scripting framework such as PHP, Java Spring, etc.

Node.js approach:

JavaScript:

var jsdom = require("node-jsdom");
var fs = require("fs");
jsdom.env({
  url: "http://www.bbc.co.uk/news",
  scripts: ["http://code.jquery.com/jquery.js"],
  done: function (errors, window) {
    if (errors) {
      console.error(errors);
      return;
    }
    var $ = window.$;
    console.log("BBC headlines");
    $(".title-link span").each(function () {
      //console.log(" -", $(this).text());
      // appendFileSync creates sample.txt if it does not exist yet, so no
      // existence check is needed, and the lines are written in order.
      fs.appendFileSync("sample.txt", $(this).text() + "\n");
    });
  }
});

Dependencies for the above code:

Node.js
JSDOM
jQuery
Node.js file system (fs) module
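
For the daily part of the question, the timer idea from the browser version can be carried over to Node.js. The following is a minimal sketch, assuming the script is left running as a long-lived process. The names msUntilNextRun, datedFileName and scheduleDaily, and the "headlines-" file prefix, are hypothetical and not part of the code above; job stands for whatever function performs the scrape, for example a wrapper around the jsdom.env call.

```javascript
// Hypothetical helpers for a once-a-day run; names and file prefix are illustrative.

// Milliseconds from `now` until the next occurrence of `hour`:00 local time.
function msUntilNextRun(now, hour) {
  var next = new Date(now.getTime());
  next.setHours(hour, 0, 0, 0);
  if (next <= now) {
    next.setDate(next.getDate() + 1); // already past today's slot: use tomorrow
  }
  return next.getTime() - now.getTime();
}

// File name such as "headlines-2015-08-11.txt", built from the local date.
function datedFileName(now) {
  function pad(n) {
    return String(n).padStart(2, "0");
  }
  return "headlines-" + now.getFullYear() + "-" +
    pad(now.getMonth() + 1) + "-" + pad(now.getDate()) + ".txt";
}

// Call `job(fileName)` at the next `hour`:00, then re-arm for the following day.
function scheduleDaily(job, hour) {
  setTimeout(function () {
    job(datedFileName(new Date()));
    scheduleDaily(job, hour);
  }, msUntilNextRun(new Date(), hour));
}
```

Calling scheduleDaily(job, 7) would run job every day at 07:00 local time and hand it a per-day file name. Re-arming a setTimeout each day, rather than using a fixed 24-hour setInterval, keeps the run time from drifting. An operating-system scheduler (cron, or Task Scheduler on Windows) that starts the script once a day is an alternative that avoids the long-lived process.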

Hope this helps you and others too.

Answer 2:

My thoughts:

JS can be used to get data/text from pages, but to save it into a file you need something on the back end, such as Python or PHP.
Why use JS at all? You can scrape the web very well using cURL. Use PHP cURL if that's easier for you.

You can scrape/download the webpage using:

<?php
    // Defining the basic cURL function
    function curl($url) {
        $ch = curl_init();  // Initialising cURL
        curl_setopt($ch, CURLOPT_URL, $url);    // Setting cURL's URL option with the $url variable passed into the function
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE); // Setting cURL's option to return the webpage data
        $data = curl_exec($ch); // Executing the cURL request and assigning the returned data to the $data variable
        curl_close($ch);    // Closing cURL
        return $data;   // Returning the data from the function
    }
?>

Then use the function at your discretion:

<?php
    $scraped_website = curl("http://www.yahoo.com");  // Executing our curl function to scrape the webpage http://www.yahoo.com and return the results into the $scraped_website variable
?>

Reference links:

Web scraping with PHP and cURL

Scraping in PHP with cURL

You can scrape more precisely by targeting DIVs and other nodes of the HTML document. Check these out: Part 1, Part 2, Part 3.

Hope it helps. Happy Coding!

Answer 3:

You want to download a .txt file with content taken from the HTML, is that right? You can use this to create the .txt file and download it. If you want to get the text from all the title spans, do this:

var txt = "";
var nodeList = document.querySelectorAll(".title-link__title-text");
for (var i = 0; i < nodeList.length; i++) {
    // textContent also works in Firefox versions that lack innerText
    txt += "\n" + nodeList[i].textContent;
}

Then write the txt variable to a file, as in the post I mentioned above.
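
The question asks for the top story only, while the loop above collects every headline. A minimal sketch of both variants follows, assuming the first .title-link__title-text element in document order is the top story (an assumption about the BBC markup, not something the page guarantees). The helper names topHeadline and allHeadlines are hypothetical, and root is whatever document is being queried: document itself, or an iframe's contentDocument.

```javascript
// Hypothetical helpers; the selector is taken from the markup shown in the question.
var HEADLINE_SELECTOR = ".title-link__title-text";

// Text of the first matching element, or null when nothing matches.
// Assumption: the first match in document order is the top story.
function topHeadline(root) {
    var node = root.querySelector(HEADLINE_SELECTOR);
    return node ? node.textContent.trim() : null;
}

// Text of every matching element, one headline per line.
function allHeadlines(root) {
    var nodes = root.querySelectorAll(HEADLINE_SELECTOR);
    var lines = [];
    for (var i = 0; i < nodes.length; i++) {
        lines.push(nodes[i].textContent.trim());
    }
    return lines.join("\n");
}
```

Run on the news page, topHeadline(document) would give the single string to hand to the file-writing step, and allHeadlines(document) the full list. Passing the document in as a parameter keeps the helpers usable for an iframe's document as well.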

Source: https://stackoverflow.com/questions/31939468/copy-innerhtml-to-text-file-daily-using-javascript

Tags

javascript

html

node.js

feed