I have need to get the average div height and width of an html doc.
I have try this solution but it doesn\'t work:
import numpy as np
average_width = np.
There may be better way-
Way -1
Below is my tested code to extract width and height.
from bs4 import BeautifulSoup
html_doc = '''<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:45px; top:81px; width:127px; height:9px;">
<span style="font-family: EICMDA+AdvTrebu-R; font-size:8px">Journal of Infection (2015)
</span>
<span style="font-family: EICMDB+AdvTrebu-B; font-size:8px">xx</span>
<span style="font-family: EICMDA+AdvTrebu-R; font-size:8px">, 1</span>
<span style="font-family: EICMDD+AdvPS44A44B; font-size:7px">e</span>
<span style="font-family: EICMDA+AdvTrebu-R; font-size:8px">4
<br/>
</span>
</div>'''
soup = BeautifulSoup(html_doc,'html.parser')
my_att = [i.attrs['style'] for i in soup.find_all("div")]
dd = ''.join(my_att).split(";")
dd_cln= filter(None, dd)
dd_cln= [i.strip() for i in dd_cln ]
my_dict = dict(i.split(':') for i in dd_cln)
print my_dict['width']
Way-2 Use regular expression as described here.
Working code-
import numpy as np
import re
from bs4 import BeautifulSoup
html_doc = '''<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:45px; top:81px; width:127px; height:9px;">
<span style="font-family: EICMDA+AdvTrebu-R; font-size:8px">Journal of Infection (2015)
</span>
<span style="font-family: EICMDB+AdvTrebu-B; font-size:8px">xx</span>
<span style="font-family: EICMDA+AdvTrebu-R; font-size:8px">, 1</span>
<span style="font-family: EICMDD+AdvPS44A44B; font-size:7px">e</span>
<span style="font-family: EICMDA+AdvTrebu-R; font-size:8px">4
<br/>
</span>
</div>'''
soup = BeautifulSoup(html_doc,'html.parser')
my_att = [i.attrs['style'] for i in soup.find_all("div")]
css = ''.join(my_att)
print css
width_list = map(float,re.findall(r'(?<=width:)(\d+)(?=px;)', css))
height_list = map(float,re.findall(r'(?<=height:)(\d+)(?=px;)', css))
print np.mean(height_list)
print np.mean(width_list)