regex for loop over list in python

问题

I have this list

[<th align="left">
 <a href="blablabla">F</a>ojweousa</th>,
 <th align="left">
 <a href="blablabla">S</a>awdefrgt</th>, ...]

and want

the one single character after ">
the multiple characters between </a> and </th>,

to be concatenated so that i can move on with my life.

Here is my code

item2 = []
for element in items2:
    first_letter = re.search('">.</a', str(items2))
    second_letter = re.search(r'</a>[a-zA-Z0-9]</th>,', str(items2))
    item2.append([str(first_letter) + str(second_letter)])

I know i should do something like item2.group or item2.join but if i do, the output gets even more messy. Here is the output with the current code

[['<re.Match object; span=(155, 161), match=\'">F</a\'>None'],
 ['<re.Match object; span=(155, 161), match=\'">F</a\'>None'],
 ...]]

I would like the output to look like this so that i can use it in pd.dataframe:

[Fojweousa, Sawdefrgt, ...]

It is a list, that is why i cant use html bs4 or select methods.

回答1:

You can use the BeautifulSoup get_text() to get plain text from each element you found with find_all and strip to get rid of leading and trailing whitespace:

items2 = table.find_all('th', attrs={'align': 'left'})[1:]
result = [x.get_text().strip() for x in items2]

Here, .find_all('th', attrs={'align': 'left'}) finds all th elements with attribute align equal to left, and [1:] skips the first occurrence.

Next, [x.get_text().strip() for x in items2] is a list comprehension that iterates over the found items (items2, x is each single found element) and gets plain text from each x element using x.get_text() and strip() removes leading/trailing whitespace.

来源：https://stackoverflow.com/questions/66152180/regex-for-loop-over-list-in-python

标签

python

html

for-loop

beautifulsoup