Getting a JavaScript variable's value while scraping with Python

Submitted by 。_饼干妹妹 on 2021-02-05 08:19:07

Question


I know this has been asked before, but I am a newbie to scraping and Python. Any help would be very valuable on my learning path.

I am scraping a news site using Python with packages such as Beautiful Soup.

I am having difficulty getting the value of a JavaScript variable that is declared in a script tag and is also updated there.

Here is the part of the HTML page that I am scraping (only the script part):

<!-- Eliminate render-blocking JavaScript and CSS in above-the-fold content -->
  <script src="https://ajax.googleapis.com/ajax/libs/jquery/1.8/jquery.min.js"></script>
  <script src="https://cdnjs.cloudflare.com/ajax/libs/materialize/0.97.0/js/materialize.min.js"></script>

  <script type="text/javascript" src="/dist/scripts/index.js"></script>
  <script type="text/javascript" src="/dist/scripts/read.js"></script>
  <script src="/dist/scripts/jquery.scrolldepth.min.js"></script>
  <script type="text/javascript">

    var min_news_id = "d7zlgjdu-1"; // line 1
    function loadMoreNews(){
      $("#load-more-btn").hide();
      $("#load-more-gif").show();
      $.post("/en/ajax/more_news",{'category':'politics','news_offset':min_news_id},function(data){
          data = JSON.parse(data);
          min_news_id = data.min_news_id||min_news_id; // line 2
          $(".card-stack").append(data.html);
      })
      .fail(function(){alert("Error : unable to load more news");})
      .always(function(){$("#load-more-btn").show();$("#load-more-gif").hide();});
    }
    jQuery.scrollDepth();
  </script>

From the above, I want to get the value of min_news_id in Python. I also want to get the value of the same variable after it is updated at line 2.

Here is how I am doing it:

    self.pattern = re.compile('var min_news_id = (.+?);')  # or self.pattern = re.compile('min_news_id = (.+?);')
    page = bs(htmlPage, "html.parser")
    # find all the script tags
    scripts = page.find_all("script")
    for script in scripts:
        for line in script:
            scriptString = str(line)
            if "min_news_id" in scriptString:
                scriptString.replace('"', '\\"')
                print(scriptString)
                if(self.pattern.match(str(scriptString))):
                    print("matched")
                    data = self.pattern.match(scriptString)
                    jsVariable = json.loads(data.groups()[0])
                    InShortsScraper.newsOffset = jsVariable
                    print(InShortsScraper.newsOffset)

But I never get the value of the variable. Is the problem with my regular expression, or is it something else? Please help me. Thank you in advance.


Answer 1:


import re

html = '''<!-- Eliminate render-blocking JavaScript and CSS in above-the-fold content -->
  <script src="https://ajax.googleapis.com/ajax/libs/jquery/1.8/jquery.min.js"></script>
  <script src="https://cdnjs.cloudflare.com/ajax/libs/materialize/0.97.0/js/materialize.min.js"></script>

  <script type="text/javascript" src="/dist/scripts/index.js"></script>
  <script type="text/javascript" src="/dist/scripts/read.js"></script>
  <script src="/dist/scripts/jquery.scrolldepth.min.js"></script>
  <script type="text/javascript">

    var min_news_id = "d7zlgjdu-1"; // line 1
    function loadMoreNews(){
      $("#load-more-btn").hide();
      $("#load-more-gif").show();
      $.post("/en/ajax/more_news",{'category':'politics','news_offset':min_news_id},function(data){
          data = JSON.parse(data);
          min_news_id = data.min_news_id||min_news_id; // line 2
          $(".card-stack").append(data.html);
      })
      .fail(function(){alert("Error : unable to load more news");})
      .always(function(){$("#load-more-btn").show();$("#load-more-gif").hide();});
    }
    jQuery.scrollDepth();
  </script>'''

finder = re.findall(r'min_news_id = .*;', html)
print(finder)

Output:
['min_news_id = "d7zlgjdu-1";', 'min_news_id = data.min_news_id||min_news_id;']

#2 Or, to strip the first match down to just the value:

print(finder[0].replace('min_news_id = ', '').replace('"','').replace(';','').strip())

Output:
d7zlgjdu-1

#3 Or match the ID format directly (this assumes every ID is 8 lowercase alphanumerics, a hyphen, and one digit):

finder = re.findall(r'[a-z0-9]{8}-[0-9]', html)
print(finder)   

Output:
['d7zlgjdu-1'] 
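
As a complement to the options above, here is a minimal sketch that captures only the quoted value. It also shows why the code in the question never printed "matched": re.match only matches at the very start of the string, and the script text begins with whitespace, so re.search is needed instead. In addition, str.replace returns a new string, so the unassigned replace call in the question had no effect. The names SCRIPT, PATTERN and extract_min_news_id are made up for this illustration, and the pattern assumes the value is a double-quoted string literal.

```python
import json
import re

# Shortened copy of the inline script from the question.
SCRIPT = '''
    var min_news_id = "d7zlgjdu-1"; // line 1
    function loadMoreNews(){
          min_news_id = data.min_news_id||min_news_id; // line 2
    }
'''

# Capture only a double-quoted string literal assigned to min_news_id.
# The reassignment on "line 2" has no quoted literal, so it is skipped.
PATTERN = re.compile(r'min_news_id\s*=\s*("[^"]*")\s*;')


def extract_min_news_id(text):
    """Return the initial min_news_id string, or None if it is absent."""
    found = PATTERN.search(text)  # search() scans the whole text
    if found is None:
        return None
    # The captured group still includes the quotes; json.loads decodes it.
    return json.loads(found.group(1))


print(extract_min_news_id(SCRIPT))
```

Note that this only reads the value written into the page source. The value assigned at "line 2" exists only after the browser runs the Ajax call, so it cannot be recovered from the static HTML.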



Answer 2:


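You can't monitor changes to a JavaScript variable with BeautifulSoup, because it only parses the static HTML and does not execute scripts. Here is how to get the next page of news using a while loop, re, and the JSON returned by the Ajax endpoint: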

from bs4 import BeautifulSoup
import requests, re

page_url = 'https://inshorts.com/en/read/politics'
ajax_url = 'https://inshorts.com/en/ajax/more_news'

htmlPage = requests.get(page_url).text
# BeautifulSoup extract article summary
# page = BeautifulSoup(htmlPage, "html.parser")
# ...

# get current min_news_id
min_news_id = re.search(r'min_news_id\s+=\s+"([^"]+)', htmlPage).group(1)  # result: d7zlgjdu-1

customHead = {'X-Requested-With': 'XMLHttpRequest', 'Referer': page_url}

while min_news_id:
    # change "politics" if in different category
    reqBody = {'category' : 'politics', 'news_offset' : min_news_id }
    # get Ajax next page
    ajax_response = requests.post(ajax_url, headers=customHead, data=reqBody).json() # parse string to json
    # again, do extract article summary
    page = BeautifulSoup(ajax_response["html"], "html.parser")
    # ....
    # ....

    # new min_news_id
    min_news_id = ajax_response["min_news_id"]

    # remove this break to loop through all pages (possibly thousands)
    break
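
The loop above can also be sketched as a generator that takes the request function as a parameter, which makes the pagination logic easy to try without touching the network. This is only an illustration under assumptions: iter_pages, Fetcher, fake_fetch, FAKE_REPLIES and the max_pages cap are hypothetical names and data, and the reply is assumed to carry the "html" and "min_news_id" keys used in the code above.

```python
from typing import Callable, Dict, Iterator, Optional

# A fetcher takes the current offset and returns the decoded JSON reply,
# assumed to contain "html" and "min_news_id" keys.
Fetcher = Callable[[str], Dict[str, Optional[str]]]


def iter_pages(first_offset: str, fetch: Fetcher, max_pages: int = 5) -> Iterator[str]:
    """Yield each page's HTML fragment, following min_news_id offsets."""
    offset: Optional[str] = first_offset
    seen = set()  # guards against the server repeating an offset forever
    pages = 0
    while offset and offset not in seen and pages < max_pages:
        seen.add(offset)
        reply = fetch(offset)
        yield reply.get("html") or ""
        offset = reply.get("min_news_id")
        pages += 1


# Offline stand-in for the POST request (made-up data for illustration).
FAKE_REPLIES = {
    "a-1": {"html": "<div>page 2</div>", "min_news_id": "b-1"},
    "b-1": {"html": "<div>page 3</div>", "min_news_id": None},
}


def fake_fetch(offset):
    return FAKE_REPLIES[offset]


for fragment in iter_pages("a-1", fake_fetch):
    print(fragment)
```

In real use, fetch would presumably wrap the requests.post(...).json() call shown above, and each yielded fragment would be passed to BeautifulSoup.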



Answer 3:


Thank you for the responses. I finally solved it using the requests package after reading its documentation.

Here is my code:

if InShortsScraper.firstLoad:
    self.pattern = re.compile('var min_news_id = (.+?);')
else:
    self.pattern = re.compile('min_news_id = (.+?);')
page = None
# print("Pattern: " + str(self.pattern))
if news_offset is None:
    # no offset given: fetch the full HTML page
    htmlPage = urlopen(url)
    page = bs(htmlPage, "html.parser")
else:
    # offset given: POST it to the Ajax endpoint
    self.loadMore['news_offset'] = InShortsScraper.newsOffset
    # print("payload : " + str(self.loadMore))
    try:
        r = myRequest.post(
            url=url,
            data=self.loadMore
        )
    except TypeError:
        print("Error in loading")

    InShortsScraper.newsOffset = r.json()["min_news_id"]
    page = bs(r.json()["html"], "html.parser")
# print(page)
if InShortsScraper.newsOffset is None:
    # no offset stored yet: read it from the inline script
    scripts = page.find_all("script")
    for script in scripts:
        for line in script:
            scriptString = str(line)
            if "min_news_id" in scriptString:
                finder = re.findall(self.pattern, scriptString)
                InShortsScraper.newsOffset = finder[0].replace('min_news_id = ', '').replace('"','').replace(';','').strip()


Source: https://stackoverflow.com/questions/53283742/getting-javascript-variable-value-while-scraping-with-python
