DBpedia resource parsing in Java

Question

Using DBpedia Spotlight, I get DBpedia URIs, for example:

http://dbpedia.org/resource/Part-of-speech_tagging

I need to request this URI from Java so that it returns JSON or XML, from which I can extract the information I need.

For example, for the URI above I need the values of dct:subject.

Below is a screenshot of the response I get in the browser.

Answer 1:

There isn't enough info in your question about what you're trying to achieve to provide the best path by which to reach that goal. You might consider using the Jena or RDF4J/Sesame Frameworks.

Or you might consider just asking the DBpedia endpoint for the thing you want, whether that's the complete description of <http://dbpedia.org/resource/Part-of-speech_tagging>, here in JSON (as linked from the Formats menu seen in your screencap), or using a SPARQL query URI to request just the dct:subject values --

PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dct: <http://purl.org/dc/terms/>
SELECT DISTINCT ?subject
  WHERE { dbr:Part-of-speech_tagging dct:subject ?subject }
LIMIT 100

-- which might be retrieved in various serializations -- here in JSON.
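
As a rough sketch of how such a query could be sent from plain Java without extra libraries: the class name DctSubjectQuery and its method names are made up for illustration, and the endpoint address https://dbpedia.org/sparql is an assumption about the public DBpedia service. The query travels in the standard "query" parameter, and the JSON results format is requested through the Accept header.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class DctSubjectQuery {

    // Assumption: address of the public DBpedia SPARQL endpoint.
    static final String ENDPOINT = "https://dbpedia.org/sparql";

    /** Builds the SELECT query for the dct:subject values of one resource. */
    static String buildQuery(String resourceUri) {
        return "PREFIX dct: <http://purl.org/dc/terms/>\n"
             + "SELECT DISTINCT ?subject\n"
             + "WHERE { <" + resourceUri + "> dct:subject ?subject }\n"
             + "LIMIT 100";
    }

    /** Builds the GET URL with the URL-encoded query in the "query" parameter. */
    static String buildRequestUrl(String resourceUri) {
        try {
            return ENDPOINT + "?query=" + URLEncoder.encode(buildQuery(resourceUri), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on the JVM, so this is not expected.
            throw new IllegalStateException(e);
        }
    }

    /** Sends the request and returns the raw response body as a String. */
    static String fetch(String requestUrl) throws IOException {
        HttpURLConnection con = (HttpURLConnection) new URL(requestUrl).openConnection();
        // Ask for the SPARQL JSON results format via content negotiation.
        con.setRequestProperty("Accept", "application/sparql-results+json");
        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(con.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
        }
        return body.toString();
    }

    public static void main(String[] args) {
        String url = buildRequestUrl("http://dbpedia.org/resource/Part-of-speech_tagging");
        System.out.println(url);
        // To perform the request (needs network access), call fetch(url)
        // and handle the IOException it may throw.
    }
}
```

The response body still has to be parsed; a JSON library, or one of the RDF frameworks mentioned above, would be the usual choice for that step.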

Answer 2:

I'm not sure exactly which values you are looking for, but you can scrape what you want from the page source without any external dependencies. The four Java methods supplied below should get you what you need (one of them is a support method).

Getting the Web Page HTML Source:

First, fetch the web page's HTML source with the getWebPageSource() method. It downloads the entire HTML source of the page located at the supplied link string and returns it as a List<String>, one element per source line. Example usage would be:

String sourceLinkString = "http://dbpedia.org/resource/Part-of-speech_tagging";
List<String> pageSource = getWebPageSource(sourceLinkString);

When this code is run the pageSource List variable will contain all the HTML source code for the web link string you provided which in this case is: "http://dbpedia.org/resource/Part-of-speech_tagging". If you like you can create a loop to iterate through the list and display it in your Console Window with the System.out.println() method like this:

for (int i = 0; i < pageSource.size(); i++) {
    System.out.println(pageSource.get(i));
}

Getting Related Links Using A Reference String:

Now that you have the web page source, you can locate and extract the data you want. The next method is getRelatedLinks(). It scans the source for lines that contain a supplied reference string and, on each such line, retrieves the text found between a supplied start tag string and a supplied end tag string. In your case the reference string would be "rel=\"dct:subject\"", the start tag would be "href=\"" and the end tag would be "\">". So every source line that contains rel="dct:subject" is examined, and if the start tag and the end tag are found on that same line, the text between them is retrieved. Example usage would be:

String sourceLinkString = "http://dbpedia.org/resource/Part-of-speech_tagging";
List<String> pageSource = getWebPageSource(sourceLinkString);
String[] relatedLinksTo = getRelatedLinks("rel=\"dct:subject\"", pageSource, "href=\"", "\">");

All links related to the reference string of: "rel=\"dct:subject\"" will now be held within the String Array variable named relatedLinksTo. If you were to iterate through the Array and display its contents to the Console Window:

// Display Related Links...
for (int i = 0; i < relatedLinksTo.length; i++) {
    System.out.println(relatedLinksTo[i]);
}

you will see:

http://dbpedia.org/resource/Category:Corpus_linguistics
http://dbpedia.org/resource/Category:Markov_models
http://dbpedia.org/resource/Category:Tasks_of_natural_language_processing
http://dbpedia.org/resource/Category:Word-sense_disambiguation

And if you just want the title(s) which each link is related to instead of the entire Link String then you would do it this way:

// Display Related Links Titles...
for (int i = 0; i < relatedLinksTo.length; i++) {
    String rLink = relatedLinksTo[i].substring(relatedLinksTo[i].lastIndexOf(":") + 1);
    System.out.println(rLink);
}

and what you will see within the Console Window is:

Corpus_linguistics
Markov_models
Tasks_of_natural_language_processing
Word-sense_disambiguation

This method utilizes the support method named getBetween() also supplied below.
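
To make the extraction step concrete, here is a minimal, self-contained sketch of what happens on a single source line. The class BetweenSketch and its between() method are illustrative stand-ins for getBetween(), and the HTML line is written by hand as an assumption about what a matching line looks like; it is not copied from the live page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BetweenSketch {

    /** Returns every substring found between left and right (non-greedy match). */
    static List<String> between(String line, String left, String right) {
        List<String> found = new ArrayList<>();
        // Pattern.quote() makes the delimiters literal, so characters such
        // as '"' or '>' carry no special regex meaning.
        Pattern p = Pattern.compile(Pattern.quote(left) + "(.*?)" + Pattern.quote(right));
        Matcher m = p.matcher(line);
        while (m.find()) {
            found.add(m.group(1).trim());
        }
        return found;
    }

    public static void main(String[] args) {
        // Hypothetical source line, hand-written for illustration only.
        String line = "<a class=\"uri\" rel=\"dct:subject\" "
                + "href=\"http://dbpedia.org/resource/Category:Markov_models\">"
                + "<small>dbc</small>:Markov_models</a>";

        // Only lines that carry the reference string are examined.
        if (line.contains("rel=\"dct:subject\"")) {
            for (String link : between(line, "href=\"", "\">")) {
                System.out.println(link);                                      // the full link
                System.out.println(link.substring(link.lastIndexOf(':') + 1)); // the title only
            }
        }
    }
}
```

Note that this kind of matching depends on the exact attribute layout of the page: if another attribute followed href on the same tag, the text captured before the end tag would include it.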

Getting A Specific Link From A Related Link List:

You may not want the entire related link list but only the link or links for a specific title, such as Tasks_of_natural_language_processing. For that you would use the getFromRelatedLinksThatContain() method. Here is how you would achieve this:

String sourceLinkString = "http://dbpedia.org/resource/Part-of-speech_tagging";
List<String> pageSource = getWebPageSource(sourceLinkString);
String[] relatedLinksTo = getRelatedLinks("rel=\"dct:subject\"", pageSource, "href=\"", "\">");
String[] desiredLinks = getFromRelatedLinksThatContain(relatedLinksTo, "Tasks_of_natural_language_processing");

This method requires you to pass what was returned from the getRelatedLinks() method along with the title you want the link for (Tasks_of_natural_language_processing). The title must be text that actually occurs within a link, and the comparison is case-sensitive. If you were to now iterate through the desiredLinks array:

for (int i = 0; i < desiredLinks.length; i++) {
    System.out.println(desiredLinks[i]);
}

You will see the following Link String displayed within the Console Window:

http://dbpedia.org/resource/Category:Tasks_of_natural_language_processing

The TESTED Methods:

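// Imports required by the methods below.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.GZIPInputStream;

/**
 * Returns a List (ArrayList) containing the page source for the supplied web
 * page link, one element per source line.<br><br>
 *
 * @param webLink (String) The URL address of the web page to process.<br>
 *
 * @return (List ArrayList) A List containing the page source for the
 *         supplied web page link, or null if the link is empty or the
 *         page could not be read.
 */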
public List<String> getWebPageSource(String webLink) {
    if (webLink == null || webLink.isEmpty()) {
        return null;
    }
    try {
        URLConnection yc = new URL(webLink).openConnection();
        // Send a browser-like User-Agent for both http and https links;
        // some servers reject requests that lack one.
        yc.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11");

        InputStream inputStream = yc.getInputStream();

        // getContentEncoding() returns the Content-Encoding header (for
        // example "gzip"), or null when the header is absent.
        String encoding = yc.getContentEncoding();
        if ("gzip".equalsIgnoreCase(encoding)) {
            // Compressed with GZip: wrap the stream so it is decompressed.
            inputStream = new GZIPInputStream(inputStream);
        }
        // Decode as UTF-16 only when the header says so; otherwise UTF-8.
        // This also covers unknown values, which previously left the
        // reader null and caused a NullPointerException.
        String charset = "utf-16".equalsIgnoreCase(encoding) ? "UTF-16" : "UTF-8";

        List<String> sourceText = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(inputStream, charset))) {
            String inputLine;
            while ((inputLine = in.readLine()) != null) {
                sourceText.add(inputLine);
            }
        }
        return sourceText;
    }
    catch (MalformedURLException ex) {
        // Do whatever you want with exception.
        ex.printStackTrace();
    }
    catch (IOException ex) {
        // Do whatever you want with exception.
        ex.printStackTrace();
    }
    return null;
}

/**
 * This method will retrieve all links which are contained between specifically 
 * supplied String Tags where the desired Links may reside between and are related 
 * to the supplied <b>Reference String</b>. A String Start Tag and a String End Tag 
 * would be required as well.<br><br>
 * 
 * So, if any Web Page Source line that contains the Reference String of:<pre>
 * 
 *     "rel=\"dct:subject\""</pre><br>
 * 
 * is looked at and if <i>on the same source line</i> the supplied Start Tag 
 * String (ie: "href=\"") and the supplied End Tag String (ie: "\">") are found then 
 * the text between those tags is retrieved.<br><br>
 * 
 * This method utilizes the support method named <b>getBetween()</b>.<br><br>
 * 
 * @param referenceString (String) The reference string to look for on any web 
 * page source line.<br>
 * 
 * @param pageSource (List Interface of String) The List which contains all the 
 * HTML Web Page Source.<br>
 * 
 * @param desiredLinkStartTag (String) The Start Tag String where the desired 
 * Link or links may reside after. This can be any string. Links are retrieved 
 * from between the Start Tag and the End Tag.<br>
 * 
 * @param desiredLinkEndTag (String) The End Tag String where the desired 
 * Link or links may reside before. This can be any string. Links are retrieved 
 * from between the Start Tag and the End Tag.<br>
 * 
 * @return (1D String Array) A String Array containing the Links Found.<br>
 * 
 * @see #getBetween(java.lang.String, java.lang.String, java.lang.String, boolean...) getBetween()
 */
public String[] getRelatedLinks(String referenceString, List<String> pageSource, 
        String desiredLinkStartTag, String desiredLinkEndTag) {
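    List<String> links = new ArrayList<>();
    if (pageSource == null) {
        // getWebPageSource() returns null on failure; nothing to scan.
        return new String[0];
    }
    for (int i = 0; i < pageSource.size(); i++) {
        if (pageSource.get(i).contains(referenceString)) {
            String[] lnks = getBetween(pageSource.get(i), desiredLinkStartTag, desiredLinkEndTag);
            // getBetween() may find nothing on this line; skip it then
            // instead of passing a null array to Arrays.asList().
            if (lnks != null && lnks.length > 0) {
                links.addAll(Arrays.asList(lnks));
            }
        }
    }
    return links.toArray(new String[0]);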
}

/**
 * Retrieves a specific Link from within the Related Links List generated by 
 * the <b>getRelatedLinks()</b> method.<br><br>
 * 
 * This method requires the use of the <b>getRelatedLinks()</b> method.
 * 
 * @param relatedArray (1D String Array) The array returned from the <b>getRelatedLinks()</b> 
 * method.<br>
 * 
 * @param desiredStringInLink (String - Letter Case Sensitive) The string title 
 * contained within the link to retrieve.<br>
 * 
 * @return (1D String Array) Containing any links found.<br>
 * 
 * @see #getRelatedLinks(java.lang.String, java.util.List, java.lang.String, java.lang.String) getRelatedLinks()
 * 
 */
public String[] getFromRelatedLinksThatContain(String[] relatedArray, String desiredStringInLink) {
    List<String> desiredLinks = new ArrayList<>();
    for (int i = 0; i < relatedArray.length; i++) {
        if (relatedArray[i].contains(desiredStringInLink)) {
            desiredLinks.add(relatedArray[i]);
        }
    }
    return desiredLinks.toArray(new String[0]);
}

/**
 * Retrieves any string data located between the supplied string leftString
 * parameter and the supplied string rightString parameter.<br><br>
 *
 * This method will return all instances of a substring located between the
 * supplied Left String and the supplied Right String which may be found
 * within the supplied Input String.<br>
 *
 * @param inputString (String) The string to look for substring(s) in.
 *
 * @param leftString  (String) What may be to the Left side of the substring
 *                    we want within the main input string. Sometimes the
 *                    substring you want may be contained at the very
 *                    beginning of a string and therefore there is no
 *                    Left-String available. In this case you would simply
 *                    pass an empty string ("") to this parameter which
 *                    basically informs the method of this fact. Null can
 *                    not be supplied and will ultimately generate a
 *                    NullPointerException.
 *
 * @param rightString (String) What may be to the Right side of the
 *                    substring we want within the main input string.
 *                    Sometimes the substring you want may be contained at
 *                    the very end of a string and therefore there is no
 *                    Right-String available. In this case you would simply
 *                    pass an empty string ("") to this parameter which
 *                    basically informs the method of this fact. Null can
 *                    not be supplied and will ultimately generate a
 *                    NullPointerException.
 *
 * @param options     (Optional - Boolean - 2 Parameters):<pre>
 *
 *      ignoreLetterCase    - Default is false. This option works against the
 *                            string supplied within the leftString parameter
 *                            and the string supplied within the rightString
 *                            parameter. If set to true then letter case is
 *                            ignored when searching for strings supplied in
 *                            these two parameters. If left at default false
 *                            then letter case is not ignored.
 *
 *      trimFound           - Default is true. By default this method will trim
 *                            off leading and trailing white-spaces from found
 *                            sub-string items. General sentences which obviously
 *                            contain spaces will almost always give you a white-
 *                            space within an extracted sub-string. By setting
 *                            this parameter to false, leading and trailing white-
 *                            spaces are not trimmed off before they are placed
 *                            into the returned Array.</pre>
 *
 * @return (1D String Array) Returns a Single Dimensional String Array
 *         containing all the sub-strings found within the supplied Input
 *         String which are between the supplied Left String and supplied
 *         Right String. You can shorten this method up a little by
 *         returning a List&lt;String&gt; ArrayList and removing the 'List
 *         to 1D Array' conversion code at the end of this method. This
 *         method initially stores its findings within a List object
 *         anyways.
 */
public static String[] getBetween(String inputString, String leftString, String rightString, boolean... options) {
    // Return an empty array if nothing usable was supplied.
    if (inputString.equals("") || (leftString.equals("") && rightString.equals(""))) {
        return new String[0];
    }

    // Prepare optional parameters if any supplied.
    // If none supplied then use Defaults...
    boolean ignoreCase = false; // Default.
    boolean trimFound = true;   // Default.
    if (options.length >= 1) {
        ignoreCase = options[0];
    }
    if (options.length >= 2) {
        trimFound = options[1];
    }

    // Remove any ASCII control characters from the
    // supplied string (if they exist).
    String modString = inputString.replaceAll("\\p{Cntrl}", "");

    // Establish a List to hold our found substrings
    // between the supplied Left String and supplied
    // Right String.
    List<String> list = new ArrayList<>();

    // Use Pattern Matching to locate our possible
    // substrings within the supplied Input String.
    String regEx = Pattern.quote(leftString)
            + (!rightString.equals("") ? "(.*?)" : "(.*)?")
            + Pattern.quote(rightString);
    if (ignoreCase) {
        regEx = "(?i)" + regEx;
    }
    Pattern pattern = Pattern.compile(regEx);
    Matcher matcher = pattern.matcher(modString);
    while (matcher.find()) {
        // Add the found substrings into the List.
        String found = matcher.group(1);
        if (trimFound) {
            found = found.trim();
        }
        list.add(found);
    }

    // Convert the List to a 1D String Array. An empty array (never
    // null) is returned when nothing was found, so callers can
    // iterate over the result without a null check.
    return list.toArray(new String[0]);
}

Or, instead of scraping the HTML, use SPARQL, or request a machine-readable format such as JSON and read it with a proper parser.
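
If the SPARQL route is taken and the results are requested as XML, the DOM parser that ships with the JDK is enough to read them, so this variant also needs no dependencies. The sketch below assumes the standard SPARQL Query Results XML Format, in which each bound URI sits inside a uri element; the class name SparqlXmlSketch and the sample document are hand-written for illustration and are not an actual DBpedia response.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

class SparqlXmlSketch {

    /** Collects the text of every uri element in a SPARQL XML result document. */
    static List<String> extractUris(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("uri");
            List<String> uris = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                uris.add(nodes.item(i).getTextContent().trim());
            }
            return uris;
        } catch (Exception e) {
            // Parser configuration, I/O and malformed-XML errors all end up here.
            throw new IllegalArgumentException("Could not parse the XML result", e);
        }
    }

    public static void main(String[] args) {
        // Hand-written sample in the SPARQL XML results layout (illustrative only).
        String xml =
              "<sparql xmlns=\"http://www.w3.org/2005/sparql-results#\">"
            + "<head><variable name=\"subject\"/></head>"
            + "<results>"
            + "<result><binding name=\"subject\">"
            + "<uri>http://dbpedia.org/resource/Category:Corpus_linguistics</uri>"
            + "</binding></result>"
            + "<result><binding name=\"subject\">"
            + "<uri>http://dbpedia.org/resource/Category:Markov_models</uri>"
            + "</binding></result>"
            + "</results></sparql>";

        for (String uri : extractUris(xml)) {
            System.out.println(uri);
        }
    }
}
```

Unlike the line-based scraping above, this reads a documented result format, so it does not depend on how the HTML page happens to be laid out.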

Source: https://stackoverflow.com/questions/54618940/dbpedia-resource-parsing-in-java

Tags

java

sparql

dbpedia