Beautiful Soup - `findAll` not capturing all tags in SVG (`ElementTree` does)

Question

I was attempting to generate a choropleth map by modifying an SVG map depicting all counties in the US. The basic approach is captured by Flowing Data. Since SVG is basically just XML, the approach leverages the BeautifulSoup parser.

The thing is, the parser does not capture all path elements in the SVG file. The following captured only 149 paths (out of over 3000):

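from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3, as implied by selfClosingTags

# Open SVG file (shp_dir is the directory that holds the SVG)
with open(shp_dir + 'USA_Counties_with_FIPS_and_names.svg', 'r') as f:
    svg = f.read()

# Parse SVG
soup = BeautifulSoup(svg, selfClosingTags=['defs', 'sodipodi:namedview'])

# Identify counties
paths = soup.findAll('path')

len(paths)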

I know that many more exist, however, both from inspecting the file by hand and from the fact that ElementTree captures 3,143 paths with the following routine:

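import xml.etree.ElementTree as ET

# Parse SVG
tree = ET.parse(shp_dir + 'USA_Counties_with_FIPS_and_names.svg')

# Capture the root element
root = tree.getroot()

# Compile list of path IDs from the file
ids = []
for child in root:
    # Tags are namespace-qualified, so test for the 'path' substring
    if 'path' in child.tag:
        ids.append(child.attrib['id'])

len(ids)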

However, I have not yet figured out how to modify the ElementTree object and write it back out without the resulting image being garbled. This is what I tried:

# Define style template string (the fill color is appended per county)
style = ('font-size:12px;fill-rule:nonzero;stroke:#FFFFFF;stroke-opacity:1;'
         'stroke-width:0.1;stroke-miterlimit:4;stroke-dasharray:none;'
         'stroke-linecap:butt;marker-start:none;stroke-linejoin:bevel;fill:')

# For each child of the root...
for child in root:
    # ...if it is a path...
    if 'path' in child.tag:
        try:
            # ...update the style to the new string with a county-specific color...
            child.attrib['style'] = style + col_map[child.attrib['id']]
        except KeyError:
            # ...if it's not a county we have in the ACS, fill it with gray
            child.attrib['style'] = style + '#d0d0d0'

# Write modified SVG to disk
tree.write(shp_dir + 'mhv_by_cty.svg')

The modification/write routine above yields this monstrosity:

My primary question is this: why did BeautifulSoup fail to capture all of the path tags? Second, why would the image modified with the ElementTree objects have all of that extracurricular activity going on? Any advice would be greatly appreciated.

Answer 1:

alexce's answer is correct for your first question. As far as your second question is concerned:

"why would the image modified with the ElementTree objects have all of that extracurricular activity going on?"

the answer is pretty simple - not every <path> element draws a county. Specifically, there are two elements, one with id="State_Lines" and one with id="separator", that should be eliminated. You didn't supply your dataset of colors, so I just used a random hex color generator (adapted from here) for each county, then used lxml to parse the .svg's XML and iterate through each <path> element, skipping the ones I mentioned above:

from lxml import etree as ET
import random

def random_color():
    r = lambda: random.randint(0,255)
    return '#%02X%02X%02X' % (r(),r(),r())

new_style = 'font-size:12px;fill-rule:nonzero;stroke:#FFFFFF;stroke-opacity:1;stroke-width:0.1;stroke-miterlimit:4;stroke-dasharray:none;stroke-linecap:butt;marker-start:none;stroke-linejoin:bevel;fill:'

tree = ET.parse('USA_Counties_with_FIPS_and_names.svg')
root = tree.getroot()
for child in root:
    # Skip the two non-county paths; .get() avoids a KeyError on paths without an id
    if 'path' in child.tag and child.get('id') not in ["separator", "State_Lines"]:
        child.attrib['style'] = new_style + random_color()

tree.write('counties_new.svg')

resulting in this nice image:
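If you would rather stay with the standard-library ElementTree and your own color lookup, the same filtering idea can be sketched as below. This is only a minimal sketch under assumptions: `SAMPLE_SVG`, `BASE_STYLE`, `color_counties` and the two FIPS-style ids are hypothetical stand-ins, not taken from the real file, and the `register_namespace` call is there because the standard-library writer otherwise emits `ns0:` prefixes on output.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
# Keep the default SVG namespace on output instead of generated "ns0:" prefixes.
ET.register_namespace("", SVG_NS)

# Hypothetical stand-in for the county map: two counties plus the two
# non-county paths mentioned above.
SAMPLE_SVG = """<svg xmlns="http://www.w3.org/2000/svg">
  <path id="01001" d="M0 0L1 1"/>
  <path id="01003" d="M1 1L2 2"/>
  <path id="State_Lines" d="M0 0L2 2"/>
  <path id="separator" d="M0 2L2 0"/>
</svg>"""

BASE_STYLE = "stroke:#FFFFFF;stroke-width:0.1;fill:"  # shortened for the sketch
NON_COUNTY_IDS = {"State_Lines", "separator"}
DEFAULT_FILL = "#d0d0d0"  # gray for counties missing from the color map


def color_counties(root, col_map):
    """Set the style of every county path; return how many were updated."""
    updated = 0
    for elem in root.iter("{%s}path" % SVG_NS):
        path_id = elem.get("id")
        # Leave paths that do not draw a county untouched.
        if path_id is None or path_id in NON_COUNTY_IDS:
            continue
        elem.set("style", BASE_STYLE + col_map.get(path_id, DEFAULT_FILL))
        updated += 1
    return updated


root = ET.fromstring(SAMPLE_SVG)
n_updated = color_counties(root, {"01001": "#FF0000"})
svg_out = ET.tostring(root, encoding="unicode")
```

Using `col_map.get(path_id, DEFAULT_FILL)` replaces the try/except from the question, so a county missing from the dataset falls back to gray without also catching unrelated errors.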

Answer 2:

You need to do the following:

Upgrade to beautifulsoup4:
```
pip install beautifulsoup4 -U
```
Import it as:
```
from bs4 import BeautifulSoup
```
Install the latest lxml module:
```
pip install lxml -U
```
Explicitly specify lxml as the parser:
```
soup = BeautifulSoup(svg, 'lxml')
```

Demo:

>>> from bs4 import BeautifulSoup
>>> 
>>> svg = open('USA_Counties_with_FIPS_and_names.svg','r').read()
>>> soup = BeautifulSoup(svg, 'lxml')
>>> paths = soup.findAll('path')
>>> len(paths)
3143
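
To cross-check that number independently of Beautiful Soup, you could count the paths with the standard-library ElementTree. The following is a rough sketch, assuming the file declares the standard SVG namespace; `count_paths` and `SAMPLE` are hypothetical names used only for illustration. Unlike looping over the root's direct children, `iter` also finds paths nested inside groups.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def count_paths(svg_text):
    """Count every <path> element in the SVG namespace, at any depth."""
    root = ET.fromstring(svg_text)
    return sum(1 for _ in root.iter("{%s}path" % SVG_NS))


# Hypothetical miniature SVG: one nested path, one top-level path.
SAMPLE = """<svg xmlns="http://www.w3.org/2000/svg">
  <g><path id="a" d="M0 0"/></g>
  <path id="b" d="M1 1"/>
  <defs/>
</svg>"""

path_count = count_paths(SAMPLE)
```

If the two counts agree on the real file, the parser is seeing every path element.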

Source: https://stackoverflow.com/questions/28016981/beautiful-soup-findall-not-capturing-all-tags-in-svg-elementtree-does

Tags

python, svg, beautifulsoup, elementtree, choropleth