Question
I have a program that retrieves a list of PubMed publications, and I want to build a co-authorship graph: for each article, add each author as a vertex (if not already present) and add an undirected edge between every pair of coauthors, or increase its weight if the edge already exists.
I managed to write the first part of the program, which retrieves the list of authors for each publication, and I understand that I could use the NetworkX library to build the graph (and then export it to GraphML for Gephi), but I cannot wrap my head around how to transform the "list of lists" into a graph.
My code follows. Thank you very much.
### if needed install the required modules
### python3 -m pip install biopython
### python3 -m pip install numpy
from Bio import Entrez
from Bio import Medline
Entrez.email = "rja@it.com"
handle = Entrez.esearch(db="pubmed", term='("lung diseases, interstitial"[MeSH Terms] NOT "pneumoconiosis"[MeSH Terms]) AND "artificial intelligence"[MeSH Terms] AND "humans"[MeSH Terms]', retmax="1000", sort="relevance", retmode="xml")
records = Entrez.read(handle)
ids = records['IdList']
h = Entrez.efetch(db='pubmed', id=ids, rettype='medline', retmode='text')
#now h holds all of the articles and their sections
records = Medline.parse(h)
# initialize an empty list for the authors
authors = []
# iterate through all articles
for record in records:
    # for each article (record) get the list of authors;
    # default to an empty list so records without 'AU' add nothing
    au = record.get('AU', [])
    # iterate through each author of the article
    for a in au:
        if a not in authors:
            authors.append(a)
# the following just shows the list of all distinct authors,
# sorted alphabetically (these should become my graph nodes)
authors.sort()
print('Authors: {0}'.format(', '.join(authors)))
Answer 1:
Cool, the code runs, so the data structures are clear! As an approach, we build connectivity matrices for both articles/authors and authors/co-authors.
List of authors: to describe the relation between the articles and the authors, I think you need the author list of each article.
authors = []
author_lists = []                  # <--- new
for record in records:
    au = record.get('AU', [])      # empty list if the record has no authors
    author_lists.append(au)        # <--- new
    for a in au:
        if a not in authors:
            authors.append(a)
authors.sort()
print(authors)
NumPy, pandas, and matplotlib are simply the tools I am used to working with.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
AU = np.array(authors)                       # authors as np-array
NA = AU.shape[0]                             # number of authors
NL = len(author_lists)                       # number of articles/author lists
# the author lists differ in length, so store them as an object array
AUL = np.array(author_lists, dtype=object)
print('NA, NL', NA, NL)
Connectivity articles/authors
CON = np.zeros((NL, NA), dtype=int)    # initialize the connectivity matrix
for j in range(NL):                    # run through the articles' author lists
    aul = np.array(AUL[j])             # get a single author list as np-array
    z = np.zeros(NA, dtype=int)
    for k in range(len(aul)):          # get a single author
        z += (AU == aul[k])            # find its position in AU and add it up
    CON[j, :] = z                      # insert the result in the connectivity matrix
#---- graphics --------
fig = plt.figure(figsize=(20, 10))
plt.spy(CON, marker='s', color='chartreuse', markersize=5)
plt.xlabel('Authors'); plt.ylabel('Articles'); plt.title('Authors of the articles', fontweight='bold')
plt.show()
Connectivity authors/co-authors, the resulting matrix is symmetric
df = pd.DataFrame(CON)              # let's use pandas for the following step
ACON = np.zeros((NA, NA))           # initialize the connectivity matrix
for j in range(NA):                 # run through the authors
    df_a = df[df.iloc[:, j] > 0]    # all rows (articles) with author j involved
    w = np.array(df_a.sum())        # sum the rows, store the result in an np-array
    ACON[j] = w                     # insert it in the connectivity matrix
#---- graphics --------
fig = plt.figure(figsize=(10, 10))
plt.spy(ACON, marker='s', color='chartreuse', markersize=3)
plt.xlabel('Authors'); plt.ylabel('Authors'); plt.title('Authors that are co-authors', fontweight='bold')
plt.show()
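As a cross-check, the same co-author matrix can be obtained without the loop. Assuming every entry of CON is 0 or 1 (each author is listed at most once per article), the loop above computes the matrix product CON.T @ CON. Here is a minimal sketch on a made-up three-article example; the toy data and the names con_toy, acon_loop and acon_prod are illustrative and not part of the PubMed result:

```python
import numpy as np

# Toy article/author matrix: rows = articles, columns = authors (made-up data).
con_toy = np.array([
    [1, 1, 0, 0],   # article 0: authors 0 and 1
    [1, 1, 1, 0],   # article 1: authors 0, 1 and 2
    [0, 0, 1, 1],   # article 2: authors 2 and 3
])

# Loop version, mirroring the pandas loop above with plain NumPy.
n_authors = con_toy.shape[1]
acon_loop = np.zeros((n_authors, n_authors), dtype=int)
for j in range(n_authors):
    rows = con_toy[con_toy[:, j] > 0]   # articles that involve author j
    acon_loop[j] = rows.sum(axis=0)     # co-occurrence counts with every author

# Matrix-product version: entry (i, j) counts the articles shared by i and j.
acon_prod = con_toy.T @ con_toy

print(acon_prod)
```

The diagonal holds the number of articles per author, and the off-diagonal entries are the co-authorship counts, which is what an edge weight would be.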
For the graphics with NetworkX, I think you need a clear idea of what you want to represent, because there are many points and many possibilities too (perhaps you could post an example?). Only a few author circles are plotted below.
import networkx as nx
def set_edges(Q):
    # connect each node index to its cyclic predecessor, forming a ring
    Q1 = np.roll(Q, shift=1)
    Edges = np.vstack((Q, Q1)).T
    return Edges
Q = nx.Graph()
AT = np.triu(ACON)                        # only the upper triangle is needed
fig = plt.figure(figsize=(7, 7))
for k in range(min(9, NA)):               # only the first few authors
    iA = np.argwhere(AT[k] > 0).ravel()   # get the indices with AT[k] > 0
    Edges = set_edges(iA)                 # select the involved nodes and set the edges
    Q.add_edges_from(Edges)
nx.draw(Q, alpha=0.5)
plt.title('Co-author-ship', fontweight='bold')
plt.show()
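If the goal is only the weighted co-authorship graph for Gephi, the matrices can probably be skipped and the edges counted directly from author_lists. The following is a sketch under assumptions, not a tested pipeline: the helper names coauthor_weights and to_graph and the sample lists are made up for illustration, while Graph.add_nodes_from, Graph.add_edge and nx.write_graphml are standard NetworkX calls.

```python
from collections import Counter
from itertools import combinations


def coauthor_weights(author_lists):
    """Count, for every unordered pair of authors, the articles they share."""
    weights = Counter()
    for authors in author_lists:
        # sorted(set(...)) drops duplicates and gives each pair one canonical order
        for a, b in combinations(sorted(set(authors)), 2):
            weights[(a, b)] += 1
    return weights


def to_graph(author_lists):
    """Build an undirected NetworkX graph with a 'weight' attribute per edge."""
    import networkx as nx  # imported here so the counting step works without it

    G = nx.Graph()
    for authors in author_lists:
        G.add_nodes_from(authors)   # keeps single-author articles as isolated nodes
    for (a, b), w in coauthor_weights(author_lists).items():
        G.add_edge(a, b, weight=w)
    return G


# Made-up sample in the same shape as author_lists (hypothetical names).
sample = [["Rossi M", "Bianchi L"], ["Rossi M", "Bianchi L", "Verdi G"], ["Neri A"]]
weights = coauthor_weights(sample)
print(weights)
```

With NetworkX installed, G = to_graph(author_lists) followed by nx.write_graphml(G, 'coauthors.graphml') should give a file that Gephi can open; the file name is an arbitrary choice.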
Source: https://stackoverflow.com/questions/54609288/python-how-to-convert-elements-of-a-list-of-lists-into-an-undirected-graph