Question
I have this list
[<th align="left">
<a href="blablabla">F</a>ojweousa</th>,
<th align="left">
<a href="blablabla">S</a>awdefrgt</th>, ...]
and want the single character after "> and the multiple characters between </a> and </th>, to be concatenated so that I can move on with my life.
Here is my code:
item2 = []
for element in items2:
    first_letter = re.search('">.</a', str(items2))
    second_letter = re.search(r'</a>[a-zA-Z0-9]</th>,', str(items2))
    item2.append([str(first_letter) + str(second_letter)])
I know I should do something like item2.group or item2.join, but if I do, the output gets even messier. Here is the output with the current code:
[['<re.Match object; span=(155, 161), match=\'">F</a\'>None'],
['<re.Match object; span=(155, 161), match=\'">F</a\'>None'],
...]]
I would like the output to look like this so that I can use it in pd.DataFrame:
[Fojweousa, Sawdefrgt, ...]
It is a list, which is why I can't use the bs4 HTML or select methods.
Answer 1:
You can use BeautifulSoup's get_text() to get the plain text from each element you found with find_all, and strip() to get rid of leading and trailing whitespace:
items2 = table.find_all('th', attrs={'align': 'left'})[1:]
result = [x.get_text().strip() for x in items2]
Here, .find_all('th', attrs={'align': 'left'}) finds all th elements whose align attribute equals left, and [1:] skips the first occurrence.
Next, [x.get_text().strip() for x in items2] is a list comprehension that iterates over the found items (items2 is the list, x is each found element), gets the plain text of each x with x.get_text(), and removes leading and trailing whitespace with strip().
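For comparison, the regex approach from the question can probably be made to work as well. The sketch below is an alternative to get_text(), not part of the answer's method. It assumes each item stringifies to the HTML shown in the question (the two sample strings are stand-ins for str(element)), and the names pattern and result are made up for illustration. Compared with the original loop, it searches each element rather than str(items2), captures the text with groups, and reads the matched text with .group() instead of str(match).

```python
import re

# Assumed sample data: stand-ins for str(element) of each item in the question's list.
items2 = [
    '<th align="left">\n<a href="blablabla">F</a>ojweousa</th>',
    '<th align="left">\n<a href="blablabla">S</a>awdefrgt</th>',
]

# Group 1: the single character inside <a>...</a>.
# Group 2: everything after </a> up to the closing </th>.
pattern = re.compile(r'">(.)</a>([^<]*)</th>')

result = []
for element in items2:
    # Search the current element, not the whole list.
    match = pattern.search(str(element))
    if match:
        # .group(n) returns the captured text; str(match) would return the repr.
        result.append(match.group(1) + match.group(2))

print(result)
```

With the sample strings above, result is a flat list of strings that can be passed to pd.DataFrame. If the elements are still BeautifulSoup tags, get_text() is likely the more robust choice, since the regex depends on the exact markup.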
Source: https://stackoverflow.com/questions/66152180/regex-for-loop-over-list-in-python