问题
I have a reasonable number of nodes (roughly 60,000)
(:Document {title:"A title"})
Given a title, I want to find the matching node, if it exists. The problem is that the title I'm given is not consistent. That is, sometimes the beginning of a new word is Capital, sometimes it is all lower case. Sometimes Key-Words are combined with Kebab case, sometimes they are written normally as key words.
To compensate for that I use apoc and the Levenshtein distance between the given title and every node and only accept a node as a match if it is below some threshold:
MATCH (a:Document)
WHERE apoc.text.distance(a.title, "A title") < 10
RETURN a
This does not scale well. Currently a single lookup takes about 700 ms which is too slow, given that this will likely grow to somewhere around 150,000 nodes.
I was thinking of storing / caching the occurrence of alternative titles in an alias:[...]
property of the node and build an index over all aliases, but I don't know if this is possible in Neo4j.
What is the fastest way to "fuzzy find" a title given a large database of nodes?
回答1:
In Neo4j 3.5 (currently on beta03), there are FTS (Full-Text Search) capabilities.
EDIT : I have written a detailed blog post about FTS in Neo4j : https://graphaware.com/neo4j/2019/01/11/neo4j-full-text-search-deep-dive.html
You can query then your documents using the Lucene Classic Query Parser Syntax.
Create the index :
CALL db.index.fulltext.createNodeIndex('documents', ['Document'], ['title','text'])
Import some documents :
LOAD CSV WITH HEADERS FROM "file:///docs.csv" AS row
CREATE (n:Document) SET n = row
Query documents with title containing "heavy toll"
CALL db.index.fulltext.queryNodes('documents', 'title: "heavy toll"')
YIELD node, score
RETURN node.title, score
╒══════════════════════════════════════════════════════════════════════╤══════════════════╕
│"node.title" │"score" │
╞══════════════════════════════════════════════════════════════════════╪══════════════════╡
│"Among Deaths in 2016, a Heavy Toll in Pop Music - The New York Times"│3.7325966358184814│
└──────────────────────────────────────────────────────────────────────┴──────────────────┘
Query for same title with a typo :
CALL db.index.fulltext.queryNodes('documents', 'title: \\"heavy~ tall~\\"')
YIELD node, score
RETURN node.title, score
Notice the escaping of the quotes => \" , the string passed to the underlying parser should contain the quotes in order to perform a phrase query instead of a boolean query.
Also the tidle
next to the terms indicate to perform a fuzzy search using the Damarau-Levenshtein algo.
╒══════════════════════════════════════════════════════════════════════╤═════════════════════╕
│"node.title" │"score" │
╞══════════════════════════════════════════════════════════════════════╪═════════════════════╡
│"Among Deaths in 2016, a Heavy Toll in Pop Music - The New York Times"│0.868073046207428 │
├──────────────────────────────────────────────────────────────────────┼─────────────────────┤
│"Prisons Run by C.E.O.s? Privatization Under Trump Could Carry a Heavy│0.4014900326728821 │
│ Price - The New York Times" │ │
├──────────────────────────────────────────────────────────────────────┼─────────────────────┤
│"‘All Talk,’ ‘No Action,’ Says Trump in Twitter Attack on Civil Rights│0.28181418776512146 │
│ Icon - The New York Times" │ │
├──────────────────────────────────────────────────────────────────────┼─────────────────────┤
│"Immigrants Head to Washington to Rally While Obama Is Still There - T│0.24634429812431335 │
│he New York Times" │ │
├──────────────────────────────────────────────────────────────────────┼─────────────────────┤
回答2:
Indexing as noted in the answer by Christophe Willemsen is definitely needed for speeding up the search but I would also like to point a another historic function that might be a better fit with your "fuzzy find":
Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English.
来源:https://stackoverflow.com/questions/53149046/neo4j-fast-way-to-match-fuzzy-text-property