Regular expression to get an attribute from HTML tag

难免孤独 2020-12-03 03:55

I am looking for a regular expression that can extract the src attribute (case-insensitively) from the following HTML snippets in Java.



        
4 answers
  • 2020-12-03 04:02

    This answer is for people arriving from Google, since it is too late for the original asker.

    Copying cletus's code verbatim gave me an error. Modifying his answer and passing the string src\\s*=\\s*([\"'])?([^\"']*) to Pattern.compile worked for me.

    Here is the full example:

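        import java.util.regex.Matcher;
        import java.util.regex.Pattern;

        String htmlString = "<div class=\"current\"><img src=\"img/HomePageImages/Paris.jpg\"></div>"; // Sample HTML

        // Group 1: optional opening quote; group 2: the attribute value
        String ptr = "src\\s*=\\s*([\"'])?([^\"']*)";
        Pattern p = Pattern.compile(ptr);
        Matcher m = p.matcher(htmlString);
        if (m.find()) {
            String src = m.group(2); // Result: img/HomePageImages/Paris.jpg
        }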
    
  • 2020-12-03 04:08

    One possibility:

    String imgRegex = "<img[^>]+src\\s*=\\s*['\"]([^'\"]+)['\"][^>]*>";
    

    This works if matched case-insensitively (for example with Pattern.CASE_INSENSITIVE). It's a bit of a mess, and it deliberately ignores the case where quotes aren't used. Written without Java string escapes, the pattern is:

    <img[^>]+src\s*=\s*['"]([^'"]+)['"][^>]*>
    

    This matches:

    • <img
    • one or more characters that aren't > (i.e. possible other attributes)
    • src
    • optional whitespace
    • =
    • optional whitespace
    • starting delimiter of ' or "
    • image source (which may not include a single or double quote)
    • ending delimiter
    • although the expression can stop here, I then added:
      • zero or more characters that are not > (more possible attributes)
      • > to close the tag

    Things to note:

    • If you want to include the src= as well, move the opening parenthesis further left :-)
    • This does not care about delimiter balancing or attribute values without delimiters, and it can also choke on badly-formed attributes (such as attributes that include > or image sources that include ' or ").
    • Parsing HTML with regular expressions like this is non-trivial, and at best a quick hack that works in the majority of cases.
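The breakdown above can be tried out with a minimal sketch like the following. The class name ImgSrcRegexDemo and the method findSources are made up for illustration; the pattern is the one from this answer, compiled with Pattern.CASE_INSENSITIVE so that <IMG SRC=...> also matches.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the regex itself comes from the answer.
class ImgSrcRegexDemo {
    // Case-insensitive so that <IMG SRC="..."> is matched as well.
    private static final Pattern IMG_SRC = Pattern.compile(
            "<img[^>]+src\\s*=\\s*['\"]([^'\"]+)['\"][^>]*>",
            Pattern.CASE_INSENSITIVE);

    // Collects group 1 (the quoted src value) of every matching img tag.
    static List<String> findSources(String html) {
        List<String> sources = new ArrayList<>();
        Matcher m = IMG_SRC.matcher(html);
        while (m.find()) {
            sources.add(m.group(1));
        }
        return sources;
    }

    public static void main(String[] args) {
        String html = "<p><IMG alt='x' SRC=\"a.png\"><img src='b/c.jpg' width=\"10\"></p>";
        System.out.println(findSources(html)); // prints [a.png, b/c.jpg]
    }
}
```

As noted in the caveats, an unquoted value such as <img src=noquotes.png> is not matched by this pattern, so findSources returns an empty list for it.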
  • 2020-12-03 04:16

    This question comes up a lot here.

    Regular expressions are a bad way of handling this problem. Do yourself a favour and use an HTML parser of some kind.

    Regexes are flaky for parsing HTML. You'll end up with a complicated expression that still behaves unexpectedly in corner cases, and those corner cases will come up sooner or later.

    Edit: If your HTML is that simple then:

    Pattern p = Pattern.compile("src\\s*=\\s*([\"'])?([^ \"']*)");
    Matcher m = p.matcher(str);
    if (m.find()) {
      String src = m.group(2);
    }
    

    And there are any number of Java HTML parsers out there.
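As one illustration of the parser route, here is a rough sketch that uses the HTML parser bundled with the JDK (javax.swing.text.html). This is only one option and is an assumption on my part, not the parser this answer had in mind; third-party parsers are usually more convenient. The class name ImgSrcParserDemo and the method imgSources are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical demo class using the JDK's built-in (HTML 3.2 era) parser.
class ImgSrcParserDemo {
    // Returns the src value of every img tag the parser reports.
    static List<String> imgSources(String html) {
        List<String> sources = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // img has no closing tag, so it is reported as a "simple" tag.
                if (tag == HTML.Tag.IMG) {
                    Object src = attrs.getAttribute(HTML.Attribute.SRC);
                    if (src != null) {
                        sources.add(src.toString());
                    }
                }
            }
        };
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sources;
    }

    public static void main(String[] args) {
        String html = "<div class=\"current\"><img src=\"img/HomePageImages/Paris.jpg\"></div>";
        System.out.println(imgSources(html));
    }
}
```

The parser takes care of attribute order, quoting style, and tag case, which are the corner cases that make the regex approach fragile.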

  • 2020-12-03 04:25

    Do you mean the src attribute of the img tag? In that case you can use the following:

    <[Ii][Mm][Gg]\\s*([Ss][Rr][Cc]\\s*=\\s*[\"'].*?[\"'])
    

    That should work. The expression src='...' is in parentheses, so it forms a capturing group and can be processed separately.
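A small sketch of how that capturing group could be read in Java follows. The class name FirstAttrSrcDemo and the method srcAttribute are invented for illustration. Note that, as written, the pattern only matches when src is the first attribute after <img, and group 1 contains the whole src='...' text rather than just the value.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the regex is the one from this answer.
class FirstAttrSrcDemo {
    private static final Pattern IMG_SRC = Pattern.compile(
            "<[Ii][Mm][Gg]\\s*([Ss][Rr][Cc]\\s*=\\s*[\"'].*?[\"'])");

    // Returns the full src='...' text captured by group 1, or null if nothing matches.
    static String srcAttribute(String html) {
        Matcher m = IMG_SRC.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(srcAttribute("<IMG Src='logo.png' alt=\"x\">")); // prints Src='logo.png'
        System.out.println(srcAttribute("<img alt=\"x\" src=\"a.png\">")); // prints null
    }
}
```

The second call returns null because another attribute comes before src, which is the main limitation of this pattern.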
