Python: How to access the elements in a generator object and put them in a Pandas dataframe or in a dictionary?

Submitted by 吃可爱长大的小学妹 on 2021-01-28 00:56:16

Question


I am using the scholarly module in Python to search for a keyword. I get back a generator object, as follows:

import pandas as pd
import numpy as np
import scholarly

search_query = scholarly.search_keyword('Python')
print(next(search_query))

{'_filled': False,
 'affiliation': 'Juelich Center for Neutron Science',
 'citedby': 75900,
 'email': '@fz-juelich.de',
 'id': 'zWxqzzAAAAAJ',
 'interests': ['Physics', 'C++', 'Python'],
 'name': 'Gennady Pospelov',
 'url_picture': 'https://scholar.google.com/citations?view_op=medium_photo&user=zWxqzzAAAAAJ'}

I want to access the element 'citedby', but when I try next(search_query)['citedby'] it raises TypeError: 'Author' object is not subscriptable.

My question is: how can I access the elements in the generator object, and how can I convert that object to a Pandas dataframe?


Answer 1:


This is not a generator problem. The objects the generator produces are not dictionaries.

Granted, the scholarly library does not help matters: it gives Author instances a dictionary-like string representation, and it does not document what API the class actually supports.

Each of the 'keys' in the Author representation is actually an attribute on the object:

author = next(search_query)
print(author.citedby)
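
To see why subscripting fails while attribute access works, here is a minimal sketch. The Author class below is a hypothetical stand-in written for illustration, not the real scholarly class, and the h_index name is likewise made up to show a missing attribute.

```python
class Author:
    """Hypothetical stand-in for the objects scholarly yields; not the real class."""

    def __init__(self, name, citedby):
        self.name = name
        self.citedby = citedby


author = Author("Gennady Pospelov", 75900)

# Attribute access works, because the values are stored as instance attributes.
cited = author.citedby

# Subscripting fails, because this class defines no __getitem__ method.
try:
    author["citedby"]
    subscript_error = None
except TypeError as exc:
    subscript_error = str(exc)

# getattr() looks an attribute up by name and accepts a default for missing ones.
missing = getattr(author, "h_index", None)  # "h_index" is a made-up name
```

getattr() is handy when the attribute name is only known as a string, or when some objects may lack the attribute.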

You can get a dictionary for the object by using the vars() function:

author_dict = vars(author)
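
As a rough illustration of what vars() hands back, the sketch below again uses a hypothetical stand-in class (assumed attribute names are taken from the printed output in the question). It also shows one possible way to leave out underscore-prefixed attributes.

```python
class Author:
    """Hypothetical stand-in for the objects scholarly yields; not the real class."""

    def __init__(self):
        self._filled = False
        self.name = "Gennady Pospelov"
        self.citedby = 75900
        self.interests = ["Physics", "C++", "Python"]


author = Author()

# vars() returns the instance's own attribute dictionary (its __dict__).
author_dict = vars(author)

# Build a new dict without internal, underscore-prefixed attributes.
public = {key: value for key, value in author_dict.items() if not key.startswith("_")}
```

Because vars() returns the live attribute dictionary rather than a copy, building a new dict as above avoids changing the object by accident.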

The data doesn't necessarily map to a dataframe directly, though. How would the interests list be represented in the dataframe tabular data structure, for example? And you wouldn't want to include the _filled internal attribute either (that's a flag to record if author.fill() has been called yet).

That said, you could just create a dataframe from the dictionaries by mapping the vars function over the generator:

search_query = scholarly.search_keyword('Python')
df = pd.DataFrame(map(vars, search_query))

and then drop the _filled column if necessary, and convert the interests column into something a bit more structured, such as separate columns with 0 / 1 values or similar.
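
One possible way to do that cleanup is sketched below. The rows are hand-written sample data shaped like vars(author), not real search results, and the "|" separator is an assumption that only works if no interest contains that character.

```python
import pandas as pd

# Hand-written sample rows shaped like vars(author); not real search results.
rows = [
    {"_filled": False, "name": "A", "citedby": 10, "interests": ["Physics", "Python"]},
    {"_filled": False, "name": "B", "citedby": 5, "interests": ["Python"]},
]

# Drop the internal flag column.
df = pd.DataFrame(rows).drop(columns="_filled")

# Join each list into one string, then expand it into one 0/1 column per interest.
flags = df["interests"].str.join("|").str.get_dummies(sep="|")

# Replace the list column with the indicator columns.
df = df.drop(columns="interests").join(flags)
```

The result has one row per author and one indicator column per distinct interest, which is easier to filter and aggregate than a column of lists.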

Note that this is going to be slow, because the scholarly library pages through the Google search results sequentially, and the library deliberately delays requests with a random sleep interval of 5-10 seconds each time to avoid Google blocking the requests. So you'll have to be patient as the Python keyword search easily produces nearly 30 pages of results.
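
If only the first few results are needed, the generator does not have to be exhausted. The sketch below uses itertools.islice with a hypothetical fake_search() generator standing in for scholarly.search_keyword(); the counter exists only to show how many items were actually produced.

```python
from itertools import islice


def fake_search():
    """Hypothetical stand-in for scholarly.search_keyword(); yields results lazily."""
    for i in range(1000):
        fake_search.fetched += 1  # count how many results were actually produced
        yield {"name": f"author {i}", "citedby": i}


fake_search.fetched = 0

# islice() stops pulling from the generator once it has taken five items.
first_five = list(islice(fake_search(), 5))
```

With the real library, the same pattern should mean that only as many result pages are requested as are needed to supply the items taken, for example pd.DataFrame(map(vars, islice(search_query, 20))).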



Source: https://stackoverflow.com/questions/57167543/python-how-to-access-the-elements-in-a-generator-object-and-put-them-in-a-panda
