Matching strings with at least one word in common

失恋的感觉 2021-01-24 11:11

I'm making a query to get the URIs of documents that have a specific title. My query is:

PREFIX rdf: 

        
1 Answer
  •  失恋的感觉
    2021-01-24 11:56

    Let's say you've got some data like (in Turtle):

    @prefix :  .
    @prefix dc:  .
    
    :a dc:title "Great Gatsby" .
    :b dc:title "Boring Gatsby" .
    :c dc:title "Great Expectations" .
    :d dc:title "The Great Muppet Caper" .
    

    Then you can use a query like:

    prefix : 
    prefix dc: 
    
    select ?x ?title where {
      # this is just in place of this.getTitle().  It provides a value for
      # ?TITLE that is "Gatsby Strikes Again".
      values ?TITLE { "Gatsby Strikes Again" }
    
      # Select a thing and its title.
      ?x dc:title ?title .
    
      # Then filter based on whether the ?title matches the result
      # of replacing the spaces in ?TITLE with "|", and matching
      # case insensitively.
      filter( regex( ?title, replace( ?TITLE, " ", "|" ), "i" ))
    }
    

    to get results like:

    ------------------------
    | x  | title           |
    ========================
    | :b | "Boring Gatsby" |
    | :a | "Great Gatsby"  |
    ------------------------
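
    To make the filter's logic concrete, here is a minimal Python sketch of the same idea. It is an illustration only: the `titles` dictionary and the helper `matches_any_word` are hypothetical stand-ins for the graph data and the `regex(...)` filter, and Python's `re` syntax is not identical to the regular expression syntax SPARQL uses.

```python
import re

# Titles copied from the sample Turtle data above.
titles = {
    ":a": "Great Gatsby",
    ":b": "Boring Gatsby",
    ":c": "Great Expectations",
    ":d": "The Great Muppet Caper",
}

def matches_any_word(title, query):
    """Rough analogue of regex(?title, replace(?TITLE, " ", "|"), "i")."""
    pattern = query.replace(" ", "|")  # "Gatsby|Strikes|Again"
    return re.search(pattern, title, flags=re.IGNORECASE) is not None

hits = sorted(k for k, t in titles.items()
              if matches_any_word(t, "Gatsby Strikes Again"))
print(hits)  # [':a', ':b']
```

    Note that this is a substring match, not a whole-word match: the alternative `Again` would also match a title containing "Against".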
    

    What's particularly neat about this is that since you're generating the pattern on the fly, you could even make it based on another value from the graph pattern. For instance, if you want all pairs of things whose titles match on at least one word, you could do:

    prefix : 
    prefix dc: 
    
    select ?x ?xtitle ?y ?ytitle where {
      ?x dc:title ?xtitle .
      ?y dc:title ?ytitle .
      filter( regex( ?xtitle, replace( ?ytitle, " ", "|" ), "i" ) && ?x != ?y )
    }
    order by ?x ?y
    

    to get:

    -----------------------------------------------------------------
    | x  | xtitle                   | y  | ytitle                   |
    =================================================================
    | :a | "Great Gatsby"           | :b | "Boring Gatsby"          |
    | :a | "Great Gatsby"           | :c | "Great Expectations"     |
    | :a | "Great Gatsby"           | :d | "The Great Muppet Caper" |
    | :b | "Boring Gatsby"          | :a | "Great Gatsby"           |
    | :c | "Great Expectations"     | :a | "Great Gatsby"           |
    | :c | "Great Expectations"     | :d | "The Great Muppet Caper" |
    | :d | "The Great Muppet Caper" | :a | "Great Gatsby"           |
    | :d | "The Great Muppet Caper" | :c | "Great Expectations"     |
    -----------------------------------------------------------------
    

    Of course, it's very important to note that you're now generating patterns from your data, which means that anyone who can put data into your system could supply very expensive patterns that bog down the query and cause a denial of service. On a more mundane note, you could run into trouble if any of your titles contain characters that have a special meaning in regular expressions. One interesting problem arises if a title contains consecutive spaces, so that the pattern becomes The|Words|With||Two|Spaces: the empty alternative in there might make everything match. This is an interesting approach, but it comes with a lot of caveats.
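
    Two of these caveats can be reproduced outside the triple store. The sketch below uses Python's `re` module as a stand-in, so it only approximates what a SPARQL engine would do; `naive_pattern` is a hypothetical helper that mirrors the `replace(..., " ", "|")` step, and the sample titles are made up for illustration.

```python
import re

def naive_pattern(title):
    # Same transformation as replace(?ytitle, " ", "|") in the query.
    return title.replace(" ", "|")

# Two consecutive spaces leave an empty alternative between the bars.
double_space = naive_pattern("With  Two")  # "With||Two"

# In Python's re, the empty alternative matches the empty string,
# so the search succeeds even on a completely unrelated title.
matches_unrelated = re.search(double_space, "Great Expectations") is not None

# A title containing regex syntax can produce an invalid pattern:
# here the "(" opens a group that is never closed.
try:
    re.compile(naive_pattern("Gatsby (Draft"))
    compiled = True
except re.error:
    compiled = False

print(double_space, matches_unrelated, compiled)
```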

    In general, you could do this as shown here, or by generating the regular expression in code (where you can take care of escaping, etc.), or you could use a SPARQL engine that supports some text-based extensions (e.g., jena-text, which adds Apache Lucene or Apache Solr to Apache Jena).
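
    As a rough sketch of the "generate the regular expression in code" option, the Python below splits on runs of whitespace and escapes each word before joining. The helpers `word_pattern` and `share_a_word` are hypothetical names, and `re.escape` follows Python's escaping rules, so a pattern destined for a SPARQL `regex` call may need escaping adapted to that engine's syntax.

```python
import re

def word_pattern(title):
    """Build an alternation of the escaped words in `title`."""
    words = title.split()  # splits on any whitespace run, drops empty strings
    if not words:
        return None  # no words at all: better to skip the filter than match everything
    return "|".join(re.escape(w) for w in words)

def share_a_word(a, b):
    """True if `a` contains at least one word of `b`, ignoring case."""
    pattern = word_pattern(b)
    return pattern is not None and re.search(pattern, a, re.IGNORECASE) is not None

print(word_pattern("With  Two"))                       # With|Two
print(share_a_word("Great Gatsby", "gatsby (draft"))   # True
print(share_a_word("Great Expectations", "With  Two")) # False
```

    Splitting rather than replacing single spaces avoids the empty alternative, and escaping keeps characters such as "(" from being read as regex syntax. It does not address the cost of very long alternations, so limiting the number of words may still be worthwhile.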
