Jsoup parser changes the tags to lowercase letters

灰色年华 2020-12-20 15:39

I did some research, and it seems that Jsoup makes this change by default. I wonder if there is a way to configure this, or whether there is some other parser I can switch to.

5 answers
  • 2020-12-20 15:49

    I am using version 1.11.1-SNAPSHOT, which does not have this piece of code:

    private Tag(String tagName) {
        this.tagName = tagName.toLowerCase();
    }
    

    So I checked ParseSettings as suggested above and changed this piece of code from:

    static {
        htmlDefault = new ParseSettings(false, false);
        preserveCase = new ParseSettings(true, true);
    }
    

    to:

    static {
        htmlDefault = new ParseSettings(true, true);
        preserveCase = new ParseSettings(true, true);
    }
    

    and skipped the test cases while building the JAR.

  • 2020-12-20 16:00

    Here is a code sample (version >= 1.11.x):

    Parser parser = Parser.htmlParser();
    parser.settings(new ParseSettings(true, true));
    Document doc = parser.parseInput(html, baseUrl);
    
  • 2020-12-20 16:12

    The ParseSettings class was introduced in version 1.9.3. It comes with options to preserve case for tags and attributes.
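
    As a minimal sketch of those options (assuming jsoup 1.9.3 or later is on the classpath; the class and helper names below are illustrative and not part of jsoup), the built-in ParseSettings.preserveCase constant can be handed to the HTML parser and compared against ParseSettings.htmlDefault:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.ParseSettings;
import org.jsoup.parser.Parser;

public class PreserveCaseDemo {

    // Parses the markup with the HTML parser under the given settings
    // and returns the first element that ends up inside <body>.
    static Element firstBodyElement(String html, ParseSettings settings) {
        Parser parser = Parser.htmlParser().settings(settings);
        Document doc = Jsoup.parse(html, "", parser);
        return doc.body().child(0);
    }

    // Tag name as stored by jsoup after parsing.
    static String tagName(String html, ParseSettings settings) {
        return firstBodyElement(html, settings).tagName();
    }

    // Name of the first attribute as stored by jsoup after parsing.
    static String firstAttributeName(String html, ParseSettings settings) {
        return firstBodyElement(html, settings).attributes().asList().get(0).getKey();
    }

    public static void main(String[] args) {
        String html = "<camelCaseTag dataFoo=\"1\">some text</camelCaseTag>";

        // Default HTML settings normalise names to lower case.
        System.out.println(tagName(html, ParseSettings.htmlDefault));
        System.out.println(firstAttributeName(html, ParseSettings.htmlDefault));

        // preserveCase keeps both tag and attribute names as written.
        System.out.println(tagName(html, ParseSettings.preserveCase));
        System.out.println(firstAttributeName(html, ParseSettings.preserveCase));
    }
}
```

    Because the HTML parser is still in use here, the document keeps the usual html/head/body structure; only the case normalisation changes.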

  • 2020-12-20 16:13

    Unfortunately not; the constructor of the Tag class converts the name to lower case:

    private Tag(String tagName) {
        this.tagName = tagName.toLowerCase();
    }
    

    But there are two ways to change this behaviour:

    1. If you want a clean solution, you can clone or download the Jsoup Git repository and change this line.
    2. If you want a dirty solution, you can use reflection.

    Example for #2:

    Field tagName = Tag.class.getDeclaredField("tagName"); // Get the field which contains the tagname
    tagName.setAccessible(true); // Set accessible to allow changes
    
    for( Element element : doc.select("*") ) // Iterate over all tags
    {
        Tag tag = element.tag(); // Get the tag of the element
        String value = tagName.get(tag).toString(); // Get the value (= name) of the tag
    
        if( !value.startsWith("#") ) // You can ignore all tags starting with a '#'
        {
            tagName.set(tag, value.toUpperCase()); // Set the tagname to the uppercase
        }
    }
    
    tagName.setAccessible(false); // Revert to false
    
  • 2020-12-20 16:15

    You must use xmlParser instead of htmlParser and the tags will remain unchanged. One line does the trick:

    String html = "<camelCaseTag>some text</camelCaseTag>";
    Document doc = Jsoup.parse(html, "", Parser.xmlParser());
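
    One thing worth checking with this approach is that the XML parser does not wrap the input in an html/head/body skeleton the way the HTML parser does. The sketch below assumes jsoup is on the classpath; the class and helper names are illustrative and not part of jsoup:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

public class XmlParserDemo {

    // Parses the markup with the XML parser and returns the name of the
    // first top-level element, which keeps the case used in the input.
    static String firstTagName(String markup) {
        Document doc = Jsoup.parse(markup, "", Parser.xmlParser());
        return doc.child(0).tagName();
    }

    // Reports whether the parsed document contains a <body> element.
    static boolean hasBody(String markup, Parser parser) {
        return !Jsoup.parse(markup, "", parser).select("body").isEmpty();
    }

    public static void main(String[] args) {
        String markup = "<camelCaseTag>some text</camelCaseTag>";

        System.out.println(firstTagName(markup));
        // The XML parser adds no <body>; the HTML parser does.
        System.out.println(hasBody(markup, Parser.xmlParser()));
        System.out.println(hasBody(markup, Parser.htmlParser()));
    }
}
```

    So code that later relies on a body element being present may need adjusting when switching to the XML parser.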
    