Extract links from a web page

遇见更好的自我 2020-12-01 08:22

Using Java, how can I extract all the links from a given web page?

6 Answers
  • 2020-12-01 08:40

    Download the page as plain text/HTML and pass it through Jsoup or HtmlCleaner. Both are similar and can parse even malformed HTML 4.0 syntax, after which you can use the popular HTML DOM parsing methods like getElementsByName("a"). In Jsoup it's even cooler: you can simply use

    File input = new File("/tmp/input.html");
    Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
    
    Elements links = doc.select("a[href]");       // a elements with an href attribute
    Elements pngs = doc.select("img[src$=.png]"); // img elements whose src ends in .png
    
    Element masthead = doc.select("div.masthead").first();
    

    and find all the links, then get the details using

    String linkhref=links.attr("href");
    

    Taken from http://jsoup.org/cookbook/extracting-data/selector-syntax

    The selectors have the same syntax as jQuery; if you know jQuery function chaining, you will certainly love it.

    EDIT: If you want more tutorials, you can try this one by mkyong:

    http://www.mkyong.com/java/jsoup-html-parser-hello-world-examples/
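
    For completeness, here is a minimal end-to-end sketch (assuming jsoup is on the classpath; the URL is just a placeholder) that fetches a page over HTTP and prints the absolute URL of every link:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    
    public class LinkExtractor {
        public static void main(String[] args) throws Exception {
            // Fetch and parse the page in one step; jsoup remembers the base URL
            // so relative hrefs can be resolved to absolute ones below.
            Document doc = Jsoup.connect("http://example.com/").get();
    
            for (Element link : doc.select("a[href]")) {
                // "abs:href" resolves the href against the document's base URL
                System.out.println(link.attr("abs:href"));
            }
        }
    }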

  • 2020-12-01 08:42

    Either use regular expressions and the appropriate classes, or use an HTML parser. Which one you want to use depends on whether you want to be able to handle the whole web or just a few specific pages whose layout you know and can test against.

    A simple regex which would match 99% of pages could be this:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    
    // Returns every <a ...>...</a> element found in the page, as raw HTML
    static List<String> extractAnchorTags(String htmlPage) {
        Pattern linkPattern = Pattern.compile("(<a[^>]+>.+?</a>)",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        Matcher pageMatcher = linkPattern.matcher(htmlPage);
        List<String> links = new ArrayList<String>();
        while (pageMatcher.find()) {
            links.add(pageMatcher.group());
        }
        // links now contains each link in the page as a full HTML tag,
        // i.e. <a att1="val1" ...>Text inside tag</a>
        return links;
    }
    

    You can edit it to match more and be more standards-compliant, but at that point you would want a real parser. If you are only interested in the href value and the text in between, you can also use this regex:

    Pattern linkPattern = Pattern.compile("<a[^>]+href=[\"']?([^\"'>]+)[\"']?[^>]*>(.+?)</a>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    

    Then access the link part with .group(1) and the text part with .group(2).
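
    A quick usage sketch (the sample HTML string is made up for illustration):

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    
    public class RegexLinkDemo {
        public static void main(String[] args) {
            String html = "<p>See <a href=\"http://example.com/\">Example</a> for details.</p>";
            Pattern linkPattern = Pattern.compile(
                    "<a[^>]+href=[\"']?([^\"'>]+)[\"']?[^>]*>(.+?)</a>",
                    Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
            Matcher m = linkPattern.matcher(html);
            while (m.find()) {
                System.out.println(m.group(1)); // prints: http://example.com/
                System.out.println(m.group(2)); // prints: Example
            }
        }
    }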

  • 2020-12-01 08:43

    You would probably need to use regular expressions on the HTML link tags, i.e. everything from <a href=...> to </a>.

  • 2020-12-01 08:53

    You can use the HTML Parser library to achieve this:

    import java.util.LinkedList;
    import java.util.List;
    
    import org.htmlparser.Parser;
    import org.htmlparser.filters.NodeClassFilter;
    import org.htmlparser.tags.LinkTag;
    import org.htmlparser.util.NodeList;
    import org.htmlparser.util.ParserException;
    
    public static List<String> getLinksOnPage(final String url) {
        final List<String> result = new LinkedList<String>();
    
        try {
            final Parser htmlParser = new Parser(url);
            // Collect every node that is a LinkTag, i.e. an <a> element
            final NodeList tagNodeList = htmlParser.extractAllNodesThatMatch(new NodeClassFilter(LinkTag.class));
            for (int j = 0; j < tagNodeList.size(); j++) {
                final LinkTag loopLink = (LinkTag) tagNodeList.elementAt(j);
                final String loopLinkStr = loopLink.getLink();
                result.add(loopLinkStr);
            }
        } catch (ParserException e) {
            e.printStackTrace(); // TODO handle error
        }
    
        return result;
    }
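
    A hypothetical call, just to show the shape of the result (the URL is only an example):

    List<String> links = getLinksOnPage("http://example.com/");
    for (String link : links) {
        System.out.println(link);
    }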
    
  • 2020-12-01 08:54
    import java.io.*;
    import java.net.*;
    
    // Streams the page and prints every raw line that contains "href=".
    // Note: this prints whole source lines rather than extracted URLs, and it
    // will miss links whose href attribute is split across lines.
    public class NameOfProgram {
        public static void main(String[] args) {
            URL url;
            InputStream is = null;
            BufferedReader br;
            String line;
    
            try {
                url = new URL("http://www.stackoverflow.com");
                is = url.openStream();  // throws an IOException
                br = new BufferedReader(new InputStreamReader(is));
    
                while ((line = br.readLine()) != null) {
                    if(line.contains("href="))
                        System.out.println(line.trim());
                }
            } catch (MalformedURLException mue) {
                 mue.printStackTrace();
            } catch (IOException ioe) {
                 ioe.printStackTrace();
            } finally {
                try {
                    if (is != null) is.close();
                } catch (IOException ioe) {
                    //exception
                }
            }
        }
    }
    
  • 2020-12-01 09:01

    This simple example seems to work, using a regex from here

    import java.util.ArrayList;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    
    // Matches absolute http(s)/ftp/file URLs anywhere in the text,
    // not only those inside <a href="..."> attributes.
    public ArrayList<String> extractUrlsFromString(String content)
    {
        ArrayList<String> result = new ArrayList<String>();
    
        String regex = "(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]";
    
        Pattern p = Pattern.compile(regex);
        Matcher m = p.matcher(content);
        while (m.find())
        {
            result.add(m.group());
        }
    
        return result;
    }
    

    And if you need it, this also seems to work for getting the HTML of a URL as a String, returning null if the page can't be fetched. It works fine with https URLs as well.

    import java.net.URL;
    
    import org.apache.commons.io.IOUtils;
    
    public String getUrlContentsAsString(String urlAsString)
    {
        try
        {
            URL url = new URL(urlAsString);
            String result = IOUtils.toString(url);
            return result;
        }
        catch (Exception e)
        {
            return null;
        }
    }
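
    Putting the two together, a minimal sketch (the URL is just an example, and commons-io is assumed to be on the classpath):

    String html = getUrlContentsAsString("http://example.com/");
    if (html != null) {
        for (String link : extractUrlsFromString(html)) {
            System.out.println(link);
        }
    }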
    