Feedparser to DataFrame doesn't output all columns

既然无缘 2021-01-27 23:49

I am parsing a URL with feedparser and trying to get all columns, but I do not get all of them in the output, and I am not sure where the issue is. If you execute the code below, I do not get dat

2 answers
  • 2021-01-28 00:05

    Because the eBay XML declares a default namespace on its root element, you need to map a prefix to that namespace URI in order to select nodes by name. Note how the namespace dictionary is passed as the second argument to findall, and how the namespace has to be stripped from each value returned by .tag. Also note that the opening for loop is not required for the list/dict comprehension solution below.

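    import urllib.request  # a plain "import urllib" does not reliably load the request submodule

    import lxml.etree as ET
    import pandas as pd

    response = urllib.request.urlopen('http://www.ebay.com/rps/feed/v1.1/epnexcluded/EBAY-US')
    xml = response.read()

    root = ET.fromstring(xml)
    # Prefix mapped to the feed's default namespace, used in the XPath below
    nmsp = {'doc': 'http://www.ebay.com/marketplace/rps/v1/feed'}

    # One dict per <item>: keys are child tag names with the namespace stripped,
    # values are the child's text, or its first nested element's text when the child has none
    df = pd.DataFrame([{item.tag.replace(f"{{{nmsp['doc']}}}", ''): item.text
                               if (item.text or "").strip() != "" else item.find("*").text
                       for item in lnk.findall("*") if item is not None}
                       for lnk in root.findall('.//doc:item', nmsp)])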
                           
    

    Output (running the exact code posted above)

    df
    #         itemId                                              title  ... shippingCost                                dealUrl
    #0  372639986116  Samsung Galaxy BUDS SM-R170 (Bluetooth 5.0) He...  ...         0.00  https://www.ebay.com/deals/6052526231
    #1  153918933129  Lenovo ThinkPad X1 Carbon Gen7, 14" FHD IPS, i...  ...         0.00  https://www.ebay.com/deals/6052642213
    #2  283899231838  Ray Ban RB4278 628271 51 Black Matte Black Pla...  ...         0.00  https://www.ebay.com/deals/6051914268
    #3  283957227324                  Ghost of Tsushima - PlayStation 4  ...         0.00  https://www.ebay.com/deals/6052642134
    #4  202905303442  Samsung Galaxy S20+ Plus SM-G985F/DS 128GB 8GB...  ...         0.00  https://www.ebay.com/deals/6052752611
    #5  332946625819  DEWALT DCB6092 20V/60V MAX FLEXVOLT 9 Ah Li-Io...  ...         0.00  https://www.ebay.com/deals/6052523001
    #6  264175647395  Citizen Eco-Drive Men's Silver Dial Black Leat...  ...         0.00  https://www.ebay.com/deals/6051783829
    #7  303374676252  Champion Authentic Cotton 9-Inch Men's Shorts ...  ...         0.00  https://www.ebay.com/deals/6051880500
    #8  202940881433   Samsung QN65Q90TAFXZA 65" 4K QLED Smart UHD T...  ...         0.00  https://www.ebay.com/deals/6052527037
    #9  400789484589  Light Blue by Dolce & Gabbana D&G Perfume Wome...  ...         0.00  https://www.ebay.com/deals/6052122816
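
The namespace handling described in this answer can be tried offline. The sketch below is not the answerer's code: it swaps lxml for the standard library's `xml.etree.ElementTree` and runs on a made-up sample document that only mimics the feed's shape (the element names `shipping` and `shippingCost` as a nested pair are an assumption for illustration). `local_name` and `item_to_row` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the same shape as the feed: a default namespace on the root
SAMPLE = """<feed xmlns="http://www.ebay.com/marketplace/rps/v1/feed">
  <item>
    <itemId>1</itemId>
    <title>First deal</title>
    <shipping>
      <shippingCost>0.00</shippingCost>
    </shipping>
  </item>
  <item>
    <itemId>2</itemId>
    <title>Second deal</title>
  </item>
</feed>"""

# Prefix mapped to the default namespace URI, as in the answer
NS = {"doc": "http://www.ebay.com/marketplace/rps/v1/feed"}


def local_name(tag):
    """Drop the '{namespace-uri}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def item_to_row(item):
    """Map each child tag to its text, falling back to the first nested element."""
    row = {}
    for child in item:
        text = (child.text or "").strip()
        if not text:
            nested = child.find("*")
            text = nested.text if nested is not None else None
        row[local_name(child.tag)] = text
    return row


root = ET.fromstring(SAMPLE)
rows = [item_to_row(item) for item in root.findall(".//doc:item", NS)]
print(rows)
```

Passing `rows` to `pd.DataFrame(rows)` would give one column per tag name, with NaN wherever an item lacks that tag.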
    
  • 2021-01-28 00:21

    Another method.

    import pandas as pd
    from simplified_scrapy import SimplifiedDoc, utils, req
    
    getdeals = ['http://www.ebay.com/rps/feed/v1.1/epnexcluded/EBAY-US?limit=200',
                'http://www.ebay.com/rps/feed/v1.1/epnexcluded/EBAY-US?limit=200&offset=200',
                'http://www.ebay.com/rps/feed/v1.1/epnexcluded/EBAY-US?limit=200&offset=400']
        
    posts=[]
    header = ['title','endsAt','image255','price','originalPrice','discountPercentage','shippingCost','dealUrl']
    for url in getdeals:
        try:  # Catch request/parse errors so one failing URL does not stop the loop
            feed = SimplifiedDoc(req.get(url))
            for deals in feed.selects('item'):
                # select() returns None when the element does not exist
                row = [deals.select(h + ">text()") for h in header]
                posts.append(row)
        except Exception as e:
            print(e)
            
    df=pd.DataFrame(posts,columns=header)
    df.tail()
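
The three paged URLs in this answer follow a regular pattern, so they could be generated rather than typed out. This is only a sketch: `page_urls` is a hypothetical helper, and it assumes the feed keeps accepting the `limit` and `offset` query parameters exactly as they appear in the hard-coded list.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.ebay.com/rps/feed/v1.1/epnexcluded/EBAY-US"


def page_urls(base_url, limit, pages):
    """Return one URL per page; the first page omits 'offset', as in the list above."""
    urls = []
    for page in range(pages):
        params = {"limit": limit}
        if page > 0:
            params["offset"] = page * limit
        urls.append(f"{base_url}?{urlencode(params)}")
    return urls


getdeals = page_urls(BASE_URL, limit=200, pages=3)
for url in getdeals:
    print(url)
```

With `limit=200` and `pages=3` this reproduces the same three URLs as the hand-written `getdeals` list.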
    