Question:
I want to be able to search for numbers like 2.3 using FTS4 in SQLite, but the . is being treated as a token boundary. Short of writing a full bespoke tokenizer, is there any other way of excluding the . from the list of token-boundary characters?
Being able to search for decimal numbers seems like a common use case, but I can't find anything relevant on SO / Google. My best solution at present is to replace all . chars in the text with a known (long) string of letters and substitute accordingly on each search...
Peter
Answer 1:
The tokenizer defines what tokens are, so you would indeed need to write your own.
You could search for the phrase "2 3", which would find the tokens 2 and 3 adjacent to each other, regardless of the separator between them.
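A minimal sketch of that workaround using Python's built-in sqlite3 module (table and column names here are invented for illustration). Under the default tokenizer, '2.3' is indexed as the two adjacent tokens '2' and '3', so a phrase query matches it:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Default 'simple' tokenizer: '.' is a token boundary, so '2.3' is
# indexed as the two adjacent tokens '2' and '3'.
con.execute("CREATE VIRTUAL TABLE docs USING fts4(content)")
con.execute("INSERT INTO docs(content) VALUES ('released version 2.3 today')")
con.execute("INSERT INTO docs(content) VALUES ('chapters 2 and 3 revised')")

# The phrase query "2 3" matches token '2' immediately followed by
# token '3', whatever separator the original text used.
rows = con.execute(
    "SELECT content FROM docs WHERE docs MATCH '\"2 3\"'"
).fetchall()
print(rows)
```

Note the second row does not match: its tokens are '2', 'and', '3', so '2' and '3' are not adjacent. The flip side is that a phrase query for "2 3" would also match text like "2-3" or "2 3", since the separator is lost at indexing time.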
Answer 2:
Actually you don't need to write your own tokenizer.
The 'simple' tokenizer allows the delimiters to be customized, but it's an undocumented feature.
Create your FTS table as follows, and SQLite will tokenize on ' ' (space) and '#' (hash) characters only:
CREATE VIRTUAL TABLE documents USING fts4(title, content, tokenize=simple '' '# ');
There was a discussion of this feature on the SQLite mailing list in 2012, and the supporting source code lives in SQLite's simple tokenizer.
Quote from the mailing list regarding its lack of documentation:
"Likely the reason is that we forgot that this feature even exists. It seems to have existed in the simple tokenizer, unchanged, since the original introduction of FTS1 back in 2006."
Quote regarding whether it's safe to use the feature:
"But it has been in the code for so long now that we dare not change it for fear of breaking long-established programs."
...however it was also mentioned that the feature is not likely to have been tested thoroughly.
I have just emailed the SQLite users mailing list, asking if this feature can be documented.
Update: one thing I'd add is that we found that if characters such as '*' or '-' are configured as delimiters in this way, they are also stripped out of the FTS MATCH clause in queries. So while this works fine for '.' and the like, it can't be used for characters that are also special characters in the FTS MATCH syntax.
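A sketch of the undocumented simple-tokenizer feature above, using Python's built-in sqlite3 module (table layout follows the CREATE statement in this answer; the inserted text is invented). With only ' ' and '#' as delimiters, '.' stays inside tokens, so '2.3' is indexed, and queried, as a single token:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Undocumented: extra arguments to the 'simple' tokenizer replace the
# default delimiter set. Here only ' ' (space) and '#' (hash) split
# tokens, so '2.3' is kept as one token.
con.execute(
    "CREATE VIRTUAL TABLE documents USING fts4("
    "title, content, tokenize=simple '' '# ')"
)
con.execute(
    "INSERT INTO documents(title, content) "
    "VALUES ('notes', 'value is 2.3 exactly')"
)
# The query text is run through the same tokenizer, so '2.3' is a
# single query token as well.
rows = con.execute(
    "SELECT content FROM documents WHERE documents MATCH '2.3'"
).fetchall()
print(rows)
```

Since the feature is undocumented and reportedly not well tested, treat this as a sketch to verify against your own SQLite build.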
Answer 3:
You don't need to write your own tokenizer: some built-in tokenizers let you specify, at table-creation time, which characters should or should not be treated as word characters. For example, the unicode61 tokenizer (see the FTS3 docs; you may have to enable support for it at build time) accepts tokenchars and separators parameters (poke around in the source for details).
In your case, I believe you could use something like this:
CREATE VIRTUAL TABLE whatever USING fts4(
columns,
tokenize=unicode61 "tokenchars=."
);
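The statement above can be exercised end to end with Python's built-in sqlite3 module (the column name and inserted text below are invented; unicode61 requires SQLite 3.7.13 or later, built with that tokenizer enabled). Declaring '.' a token character keeps '2.3'-style numbers as single tokens in both the index and the query:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# tokenchars=. tells unicode61 to treat '.' as part of a token rather
# than a separator, so decimal numbers survive tokenization intact.
con.execute(
    'CREATE VIRTUAL TABLE whatever USING fts4('
    'body, tokenize=unicode61 "tokenchars=.")'
)
con.execute("INSERT INTO whatever(body) VALUES ('pi is roughly 3.14159')")
# The MATCH expression is tokenized the same way, so '3.14159' is one
# query token and matches the indexed token exactly.
rows = con.execute(
    "SELECT body FROM whatever WHERE whatever MATCH '3.14159'"
).fetchall()
print(rows)
```

One design consequence to keep in mind: with '.' as a token character, sentence-final periods also stick to the preceding word (e.g. 'end.' is indexed as the token 'end.'), so weigh that against your corpus before adopting it.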
Source: https://stackoverflow.com/questions/16080307/how-to-stop-being-treated-as-a-separator-in-sqlite-fts4