Extract multi-line javascript content from [removed] tag using Scrapy

前端未结
关注
 3  1723
借酒劲吻你 2021-01-13 15:37
I\'m trying to extract data from this script tag using Scrapy:

      
      
        
          3条回答        

        
                    
            
            
                         
                
              
              
                
                   暖寄归人
                                             
                
                
                (楼主)
            
              
              
                2021-01-13 15:45
              

            
            
                        
If you don't want to play around with regular expressions, there's js2xml, which parses Javascript code and converts it to an lxml document.
Then you can use XPath to query things from the Javascript statements.
(disclaimer: I wrote and maintain js2xml)

Here's sample code on how to get those data.bundles assignments:

import scrapy

selector = scrapy.Selector(text="""""")


(this first part is to get the HTML input in a Scrapy Selector)

import js2xml
import pprint

data_bundles = {}
for script in selector.xpath('//script/text()').extract():
    # this is how you turn Javascript code into an XML document (lxml document in fact)
    jstree = js2xml.parse(script)

    # then, we're interested in assignments of data.bundles object
    for a in jstree.xpath('//assign[left//property/identifier/@name="bundles" and right/object]'):
        # the assigned property is give by a  property from a 
        bundle_prop = a.xpath('./left/bracketaccessor/property/string/text()')
        if bundle_prop is not None:
            curr_prop = bundle_prop[0]

        data_bundles[curr_prop] = {}

        # the left object is assigned an object (inside a  element)
        # let's loop on the  elements)
        # the values are either numbers or string arguments of a function call
        for prop in a.xpath('./right/object/property'):
            data_bundles[curr_prop][prop.xpath('@name')[0]] = prop.xpath('.//number/@value | .//string/text()')[0]

pprint.pprint(data_bundles)


This is what you get out of this:

{'KONTANT_KOMPLETT.REGULAR': {'commitmentTime': '0',
                              'monthlyPrice': '0',
                              'newMsisdnFee': '0',
                              'offeringTitle': 'SMART Kontant',
                              'offeringType': 'PREPAID',
                              'signupFee': '0',
                              'upfrontPrice': '3499'},
 'SMART_BASIC.TLF12PLEAS': {'commitmentTime': '12',
                            'monthlyPrice': '299',
                            'newMsisdnFee': '199',
                            'offeringTitle': 'SMART Basis',
                            'offeringType': 'VOICE',
                            'signupFee': '0',
                            'upfrontPrice': '2199'},
 'SMART_MINI.TLF12PLEAS': {'commitmentTime': '12',
                           'monthlyPrice': '199',
                           'newMsisdnFee': '199',
                           'offeringTitle': 'SMART Mini',
                           'offeringType': 'VOICE',
                           'signupFee': '0',
                           'upfrontPrice': '2999'},
 'SMART_PLUSS.TLF12PLEAS': {'commitmentTime': '12',
                            'monthlyPrice': '399',
                            'newMsisdnFee': '199',
                            'offeringTitle': 'SMART Pluss',
                            'offeringType': 'VOICE',
                            'signupFee': '0',
                            'upfrontPrice': '1599'},
 'SMART_SUPERX.TLF12PLEAS': {'commitmentTime': '12',
                             'monthlyPrice': '499',
                             'newMsisdnFee': '199',
                             'offeringTitle': 'SMART Super',
                             'offeringType': 'VOICE',
                             'signupFee': '0',
                             'upfrontPrice': '1099'}}


For more info on the XML schema you get with  js2xml.parse(), you can check https://github.com/redapple/js2xml/blob/master/SCHEMA.rst
    
             
                                                        
            
            
              
                
                0
              
                   
                
               讨论(0)
              
                                                  
              
              
                          
             
       
          
              
                                       
     查看其它3个回答


            
                         
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
                              			
        
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复