Extending CSS selectors in BeautifulSoup

失恋的感觉 2021-01-17 10:35

The Question:

BeautifulSoup provides very limited support for CSS selectors. For instance, the only supported pseudo-class i

2 Answers
  • 2021-01-17 11:22

    Officially, BeautifulSoup does not support all CSS selectors.

    If Python is not your only option, I strongly recommend jsoup (the Java equivalent of BeautifulSoup). It supports a much wider range of CSS selectors.

    • It is open source (MIT license)
    • The syntax is easy
    • It supports a much wider range of CSS selectors
    • It can be run across multiple threads to scale up
    • Java has rich API support for storing results in databases, so it is easy to integrate

    Alternatively, if you still want to stick with Python, you can call jsoup from Jython.

    http://jsoup.org/

    https://github.com/jhy/jsoup/

  • 2021-01-17 11:30

    After checking the source code, it seems that BeautifulSoup does not provide any convenient point in its interface for extending or monkey patching its selector handling. Using functionality from lxml is not possible either: BeautifulSoup only uses lxml during parsing and builds its own objects from the parsing results. The lxml objects are not preserved and cannot be accessed later.

    That being said, with enough determination and with the flexibility and introspection capabilities of Python, anything is possible. You can modify the BeautifulSoup method internals even at run-time:

    import inspect
    import re
    import textwrap
    
    import bs4.element
    
    
    def replace_code_lines(source, start_token, end_token,
                           replacement, escape_tokens=True):
        """Replace the source code between `start_token` and `end_token`
        in `source` with `replacement`. The `start_token` portion is included
        in the replaced code. If `escape_tokens` is True (default),
        escape the tokens to avoid them being treated as a regular expression."""
    
        if escape_tokens:
            start_token = re.escape(start_token)
            end_token = re.escape(end_token)
    
        def replace_with_indent(match):
            indent = match.group(1)
            return textwrap.indent(replacement, indent)
    
        return re.sub(r"^(\s+)({}[\s\S]+?)(?=^\1{})".format(start_token, end_token),
                      replace_with_indent, source, flags=re.MULTILINE)
    
    
    # Get the source code of the Tag.select() method
    src = textwrap.dedent(inspect.getsource(bs4.element.Tag.select))
    
    # Replace the relevant part of the method
    start_token = "if pseudo_type == 'nth-of-type':"
    end_token = "else"
    replacement = """\
    if pseudo_type == 'nth-of-type':
        try:
            if pseudo_value in ("even", "odd"):
                pass
            else:
                pseudo_value = int(pseudo_value)
        except ValueError:
            raise NotImplementedError(
                'Only numeric values, "even" and "odd" are currently '
                'supported for the nth-of-type pseudo-class.')
        if isinstance(pseudo_value, int) and pseudo_value < 1:
            raise ValueError(
                'nth-of-type pseudo-class value must be at least 1.')
        class Counter(object):
            def __init__(self, destination):
                self.count = 0
                self.destination = destination
    
            def nth_child_of_type(self, tag):
                self.count += 1
                if pseudo_value == "even":
                    return not bool(self.count % 2)
                elif pseudo_value == "odd":
                    return bool(self.count % 2)
                elif self.count == self.destination:
                    return True
                elif self.count > self.destination:
                    # Stop the generator that's sending us
                    # these things.
                    raise StopIteration()
                return False
        checker = Counter(pseudo_value).nth_child_of_type
    """
    new_src = replace_code_lines(src, start_token, end_token, replacement)
    
    # Compile it and execute it in the target module's namespace
    exec(new_src, bs4.element.__dict__)
    # Monkey patch the target method
    bs4.element.Tag.select = bs4.element.select
    

    The portion of code being modified is the nth-of-type branch of Tag.select().
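
The matching rule that the replacement branch introduces can be looked at on its own. The following is only a sketch of that rule, not BeautifulSoup code: `make_nth_of_type_checker` and `select_nth` are hypothetical names, and plain strings stand in for sibling tags of the same type.

```python
def make_nth_of_type_checker(pseudo_value):
    """Return a callable that is called once per candidate, in document order,
    and reports whether that candidate matches nth-of-type(pseudo_value)."""
    if pseudo_value not in ("even", "odd"):
        # int() raises ValueError for anything that is not a whole number.
        pseudo_value = int(pseudo_value)
        if pseudo_value < 1:
            raise ValueError("nth-of-type value must be at least 1.")
    count = 0

    def checker(_candidate):
        nonlocal count
        count += 1  # positions are 1-based, as in CSS
        if pseudo_value == "even":
            return count % 2 == 0
        if pseudo_value == "odd":
            return count % 2 == 1
        return count == pseudo_value

    return checker


def select_nth(candidates, pseudo_value):
    """Filter a list of same-type siblings (hypothetical helper)."""
    checker = make_nth_of_type_checker(pseudo_value)
    return [c for c in candidates if checker(c)]


rows = ["tr1", "tr2", "tr3", "tr4", "tr5"]
print(select_nth(rows, "odd"))   # ['tr1', 'tr3', 'tr5']
print(select_nth(rows, "even"))  # ['tr2', 'tr4']
print(select_nth(rows, "3"))     # ['tr3']
```

Unlike the patched branch, this sketch does not raise StopIteration once the target position has been passed; it simply keeps returning False.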

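    Of course, this is anything but elegant or reliable: it depends on the exact source text of Tag.select(), so any change to that method in another BeautifulSoup release will break it. I don't envision this being seriously used anywhere, ever.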
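
To try the rewrite-and-exec technique in isolation, without depending on any particular BeautifulSoup release, here is a minimal sketch on a toy function. `SRC`, `REPLACEMENT` and `classify` are made up for illustration; `SRC` stands in for what inspect.getsource() would return, and the helper is a trimmed variant of `replace_code_lines` that always escapes its tokens.

```python
import re
import textwrap


def replace_code_lines(source, start_token, end_token, replacement):
    """Replace the block that starts at `start_token` (inclusive) and ends just
    before the next line beginning with `end_token` at the same indentation."""
    pattern = r"^([ \t]+)({}[\s\S]+?)(?=^\1{})".format(
        re.escape(start_token), re.escape(end_token))

    def replace_with_indent(match):
        # Re-indent the replacement to the indentation of the matched block.
        return textwrap.indent(replacement, match.group(1))

    return re.sub(pattern, replace_with_indent, source,
                  count=1, flags=re.MULTILINE)


# Hypothetical stand-in for the text returned by inspect.getsource().
SRC = '''\
def classify(n):
    if n < 0:
        kind = "negative"
    else:
        kind = "non-negative"
    return kind
'''

# New branch to splice in; the original `else` branch is left untouched.
REPLACEMENT = '''\
if n < 0:
    kind = "negative"
elif n == 0:
    kind = "zero"
'''

new_src = replace_code_lines(SRC, "if n < 0:", "else", REPLACEMENT)

# Compile the rewritten source in a scratch namespace and pull out the result.
namespace = {}
exec(new_src, namespace)
classify = namespace["classify"]

print(classify(-1), classify(0), classify(5))
```

The same fragility applies here: if the text of `SRC` changed so that the start token no longer matched, re.sub would silently return the source unchanged, so a real use should check that `new_src != SRC` before calling exec.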
