Retrieve useful info from a web page using Jsoup

感情败类 2021-01-23 21:38

How can I retrieve the "Contact Us" link from the footer of any web page on the World Wide Web, using Java?

E.g. find footer element, or an element with id

1 answer
  • 2021-01-23 21:42

    But I cannot be 100% sure on that the fetched link...

    SHORT ANSWER

    You will NEVER be sure.


    LONG ANSWER

    For a given random HTML page, you want to find the "Contact Us" link. This kind of work is trivial for a human. It represents a big challenge for a computer.

    I can see some options in your case:

    Option 1: Crowd sourcing

    • Collect the URLs of all the websites whose "Contact Us" information you want
    • Send them to a crowdsourcing platform and ask real people to find the information for you (Rapidworkers.com, Crowdsource.com, Clickworker.com, Amazon Mechanical Turk, microworkers.com)

    Check whether the platform offers an API.

    + Work is done by humans
    + Adapts dynamically to unknown patterns
    - Costs money
    - Humans are bad at repetitive tasks
    

    Option 2: AI (pattern searching)

    • Train an AI model to extract the information
    • Then throw your websites at it

    Have a look at Weka for instance or Java-ML.

    + Automated task
    + Can perform a repetitive task for a long time
    - May take time to build a robust solution
    - Risk of false positives or complete misses
    

    Option 3: Use Jsoup

    • Carefully study the patterns used by the websites you target
    • Tell Jsoup to find the patterns you have detected

    This option is a never-ending task. You will always have to feed Jsoup new patterns. I suggest setting up a monitoring system that tells you when a website escapes every known pattern.

    + Automated task
    + Can perform a repetitive task for a long time
    - Takes time to study, discover and add new patterns
    - Risk of false positives or complete misses
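
To make Option 3 a bit more concrete, here is a minimal sketch of the pattern approach, assuming Jsoup 1.11.1 or later is on the classpath. The class name, the method name and the selector list are my own illustrative assumptions, not a complete rule set. The patterns are tried in order, and an empty result is the signal you would send to your monitoring system.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.List;
import java.util.Optional;

public class ContactLinkFinder {

    // Hypothetical pattern list (an assumption for illustration):
    // CSS selectors tried in order, most specific first.
    // Add a new entry whenever a site escapes every known pattern.
    private static final List<String> PATTERNS = List.of(
            "footer a[href]:contains(contact)",   // <footer> element, link text mentions "contact"
            "#footer a[href]:contains(contact)",  // element with id="footer"
            ".footer a[href]:contains(contact)",  // element with class="footer"
            "footer a[href*=contact]",            // link target mentions "contact"
            "a[href]:contains(contact)"           // last resort: anywhere on the page
    );

    // Returns the absolute URL of the first link matching a known pattern,
    // or empty when no pattern matches (worth logging for later study).
    public static Optional<String> findContactLink(Document doc) {
        for (String pattern : PATTERNS) {
            Element link = doc.selectFirst(pattern);
            if (link != null) {
                return Optional.of(link.absUrl("href"));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // For a live page you would use Jsoup.connect(url).get() instead.
        String html = "<html><body><p>Hi</p><footer>"
                + "<a href='/about'>About</a> <a href='/contact'>Contact Us</a>"
                + "</footer></body></html>";
        Document doc = Jsoup.parse(html, "https://example.com/");
        System.out.println(findContactLink(doc).orElse("no known pattern matched"));
    }
}
```

Even when a pattern matches, the result can still be a false positive (for example a "Contact lenses" product link), which is why you can never be fully sure.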
    

    Option 4: A mix of the three options above

    You can have the three options working on the websites you target.

    + Reduces the chance of false positives or complete misses
    + Gives a more trustworthy final result
    - Takes time to study, discover and add new patterns
    - Costs money
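
One possible way to combine the options is a simple majority vote over the answers each one returns. This is only a sketch under my own assumptions: the class and method names are hypothetical, and each strategy is assumed to report either one URL or nothing. When no URL wins a strict majority, the page is a good candidate to hand to a human.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

public class ContactLinkVote {

    // Returns the URL proposed by a strict majority of the strategies,
    // or empty when no candidate reaches a majority.
    // Each list entry is one strategy's answer; empty means it found nothing.
    public static Optional<String> majority(List<Optional<String>> answers) {
        Map<String, Integer> votes = new HashMap<>();
        for (Optional<String> answer : answers) {
            answer.ifPresent(url -> votes.merge(url, 1, Integer::sum));
        }
        for (Map.Entry<String, Integer> entry : votes.entrySet()) {
            // Strict majority: more than half of all strategies agree.
            if (entry.getValue() * 2 > answers.size()) {
                return Optional.of(entry.getKey());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<Optional<String>> answers = List.of(
                Optional.of("https://example.com/contact"),  // e.g. from Jsoup patterns
                Optional.of("https://example.com/contact"),  // e.g. from a trained model
                Optional.of("https://example.com/about"));   // e.g. from a crowd worker
        System.out.println(majority(answers).orElse("needs human review"));
    }
}
```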
    