Feedparser to dataframe doesn't output all columns

既然无缘 2021-01-27 23:49

I am parsing a URL with feedparser and trying to get all columns, but I do not get all of the columns in the output and I am not sure where the issue is. If you execute the code below, I do not get all the data.
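
The original code is not shown, so the following is only a hypothetical sketch of the kind of feedparser-based attempt described (feed URL assumed to be the eBay RPS feed referenced in the answers). feedparser maps each <item> to an entry dict, and fields from the feed's custom namespace that it does not recognise may simply be absent from those dicts, which would explain why columns go missing once the entries are loaded into a DataFrame:

    import feedparser
    import pandas as pd
    
    # Hypothetical reproduction of the problem: parse the eBay RPS feed with
    # feedparser and load the entries straight into a DataFrame.
    feed = feedparser.parse('http://www.ebay.com/rps/feed/v1.1/epnexcluded/EBAY-US')
    
    # Each entry only carries the fields feedparser recognised; custom eBay
    # elements (e.g. dealUrl, shippingCost) may be missing, so the DataFrame
    # ends up with fewer columns than the raw XML contains.
    df = pd.DataFrame(feed.entries)
    print(df.columns)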

2 Answers
  • 2021-01-28 00:05

    Because the eBay XML declares a default namespace on the root element, you need to map a prefix to that namespace URI in order to search by named nodes. See how the namespace dictionary is used as the second argument of findall, and how the namespace has to be stripped from each retrieved element's .tag. Also note that an explicit opening for loop is not required for the list/dict comprehension solution below.

    import lxml.etree as ET
    import urllib.request
    import pandas as pd
    
    # Retrieve the raw XML of the eBay RPS deals feed
    response = urllib.request.urlopen('http://www.ebay.com/rps/feed/v1.1/epnexcluded/EBAY-US')
    xml = response.read()
    
    root = ET.fromstring(xml)
    # Map a prefix to the feed's default namespace so findall can match named nodes
    nmsp = {'doc': 'http://www.ebay.com/marketplace/rps/v1/feed'}
    
    # Build one dict per <item>: strip the namespace from each child's tag and,
    # when a child has no text of its own, fall back to its first grandchild's text
    df = pd.DataFrame([{item.tag.replace(f"{{{nmsp['doc']}}}", ''): item.text
                            if item.text and item.text.strip() != "" else item.find("*").text
                        for item in lnk.findall("*") if item is not None}
                       for lnk in root.findall('.//doc:item', nmsp)])
    

    Output (running exact posted code above)

    df
    #         itemId                                              title  ... shippingCost                                dealUrl
    #0  372639986116  Samsung Galaxy BUDS SM-R170 (Bluetooth 5.0) He...  ...         0.00  https://www.ebay.com/deals/6052526231
    #1  153918933129  Lenovo ThinkPad X1 Carbon Gen7, 14" FHD IPS, i...  ...         0.00  https://www.ebay.com/deals/6052642213
    #2  283899231838  Ray Ban RB4278 628271 51 Black Matte Black Pla...  ...         0.00  https://www.ebay.com/deals/6051914268
    #3  283957227324                  Ghost of Tsushima - PlayStation 4  ...         0.00  https://www.ebay.com/deals/6052642134
    #4  202905303442  Samsung Galaxy S20+ Plus SM-G985F/DS 128GB 8GB...  ...         0.00  https://www.ebay.com/deals/6052752611
    #5  332946625819  DEWALT DCB6092 20V/60V MAX FLEXVOLT 9 Ah Li-Io...  ...         0.00  https://www.ebay.com/deals/6052523001
    #6  264175647395  Citizen Eco-Drive Men's Silver Dial Black Leat...  ...         0.00  https://www.ebay.com/deals/6051783829
    #7  303374676252  Champion Authentic Cotton 9-Inch Men's Shorts ...  ...         0.00  https://www.ebay.com/deals/6051880500
    #8  202940881433   Samsung QN65Q90TAFXZA 65" 4K QLED Smart UHD T...  ...         0.00  https://www.ebay.com/deals/6052527037
    #9  400789484589  Light Blue by Dolce & Gabbana D&G Perfume Wome...  ...         0.00  https://www.ebay.com/deals/6052122816
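
    The parsed values all come back as strings. If numeric columns are wanted, a small follow-up conversion can be applied to the DataFrame built above (a sketch assuming the column names shown in the output):

    # Optional follow-up: convert price-like columns to numeric.
    # Assumes df from the block above; errors='coerce' turns bad values into NaN.
    for col in ('price', 'originalPrice', 'shippingCost', 'discountPercentage'):
        if col in df.columns:
            df[col] = pd.to_numeric(df[col], errors='coerce')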
    
  • 2021-01-28 00:21

    Another method, using the simplified_scrapy library. Its select returns None when an element is missing, so every requested column is kept in the result.

    import pandas as pd
    from simplified_scrapy import SimplifiedDoc, utils, req
    
    getdeals = ['http://www.ebay.com/rps/feed/v1.1/epnexcluded/EBAY-US?limit=200',
                'http://www.ebay.com/rps/feed/v1.1/epnexcluded/EBAY-US?limit=200&offset=200',
                'http://www.ebay.com/rps/feed/v1.1/epnexcluded/EBAY-US?limit=200&offset=400']
    
    posts = []
    header = ['title', 'endsAt', 'image255', 'price', 'originalPrice',
              'discountPercentage', 'shippingCost', 'dealUrl']
    for url in getdeals:
        try:  # It's a good habit to have try and exception in your code.
            feed = SimplifiedDoc(req.get(url))
            for deals in feed.selects('item'):
                row = []
                for h in header:
                    row.append(deals.select(h + ">text()"))  # Returns None when the element does not exist
                posts.append(row)
        except Exception as e:
            print(e)
    
    df = pd.DataFrame(posts, columns=header)
    df.tail()
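
    Since select returns None for missing elements, it is easy to check afterwards which fields the feed did not supply (a small follow-up on the df built above):

    # Count missing values per column to see which fields came back empty
    print(df.isna().sum())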
    