PHP word index, performance and reasonable results

Question

I'm currently working on an indexer for a search feature. The indexer will work over data from "fields". Fields looks like:

  Field_id   Field_type   Field_name    Field_Data
  101        text         Name          Intel i7
  102        integer      Cores         4 physical, 4 virtual
  103        select       Vendor        Intel
  104        multitext    Description   The i7 is intel's next gen range of cpus.

The indexer would generate the following results/index:

  Keyword    Occurrences
  intel      101, 103, 104
  i7         101, 104
  physical   102
  virtual    102
  next       104
  gen        104
  range      104
  cpus       104   (*)
  cpu        104   (*)
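As a rough illustration of how such an index could be produced in plain PHP, here is a minimal sketch. The function name `buildIndex`, the tiny stop word list, and the choice to strip a trailing "'s" and skip purely numeric tokens are assumptions made for this example; they are not part of the original post. It does not add the singular "cpu" entry marked (*) above.

```php
<?php
// Sample rows copied from the field table above (Field_id => Field_Data).
$fields = [
    101 => 'Intel i7',
    102 => '4 physical, 4 virtual',
    103 => 'Intel',
    104 => "The i7 is intel's next gen range of cpus.",
];

// Illustrative stop word list (assumption); a real one would be much longer.
$stopWords = ['the', 'is', 'of'];

/**
 * Build a simple inverted index: keyword => list of field ids.
 */
function buildIndex(array $fields, array $stopWords): array
{
    $index = [];
    foreach ($fields as $fieldId => $text) {
        // Lowercase, then split on anything that is not a letter, digit or apostrophe.
        $tokens = preg_split("/[^a-z0-9']+/", strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

        // Drop a possessive "'s" so that "intel's" is indexed as "intel".
        $tokens = array_map(function ($t) {
            return preg_replace("/'s$/", '', $t);
        }, $tokens);

        foreach (array_unique($tokens) as $token) {
            // Skip empty tokens, stop words and purely numeric tokens such as "4".
            if ($token === '' || in_array($token, $stopWords, true) || preg_match('/^\d+$/', $token)) {
                continue;
            }
            $index[$token][] = $fieldId;
        }
    }
    return $index;
}

$index = buildIndex($fields, $stopWords);
```

With the sample rows, `$index['intel']` holds the three field ids listed in the table. Storing one row per (keyword, field id) pair in MySQL would be the persistent equivalent of this array.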

So it somewhat looks all nice and fine, however, there are some issues which I'd like to sort out:

- Filtering out common words (as you perhaps noticed, "the", "is", "of" and "intel's" are missing from the list).
- With regard to "cpus" (plurals vs. singulars): would it be best to index one particular form (singular or plural), both, or the exact word (i.e. "cpus" is different from "cpu")?
- Continuing from the previous item, how can I determine a plural (different flavors: test=>tests, fish=>fish and leaf=>leaves)?
- I'm currently using MySQL and I'm very concerned about performance; we have 500+ categories and we haven't even launched the site.
- Let's say I wanted to use the search term "vendor:intel", where vendor specifies the field name (field_name). Do you think there would be a huge impact on the SQL server?
- Search throttling: I don't like this at all, but it's a possibility, and if you know of any workarounds, make yourself heard!
- There were other issues which I probably forgot about; if you spot any, you're welcome to yell at me ;-)
- I do not need the search engine to crawl links; in fact, I specifically want it not to crawl links.

(by the way, I'm not biased towards intel, it simply happens that I own an i7-based pc ;-) )

Answer 1:

This is in response to your original question, and your later answer/question.

I've used the Sphinx search engine before (quite a while ago, so I'm a bit rusty), and found it to be very good, even if the documentation is sometimes a bit lacking.

I'm sure there are other ways to do this, both with your own custom code, or with other search engines—Sphinx just happens to be the one I've used. I'm not suggesting that it will do everything you want, just the way you want, but I am reasonably certain that it will do most of it quite easily, and a lot faster than anything written in PHP/MySQL alone.

I recommend reading "Build a custom search engine with PHP" before digging into the Sphinx documentation. If you don't think it's suitable after reading that, fair enough.

In answer to your specific questions, I've put together some links from the documentation, together with some relevant quotes:

filtering out common words (as you perhaps noticed, "the" "is" "of" and "intel's" are missing from list)

11.2.8. stopwords

Stopwords are the words that will not be indexed. Typically you'd put most frequent words in the stopwords list because they do not add much value to search results but consume a lot of resources to process.

With regards to "cpus" (plurals vs singulars), would it be best to use a particular type (singular or plural), both or exact (ie, "cpus" is different "cpu")?

11.2.9. wordforms

Word forms are applied after tokenizing the incoming text by charset_table rules. They essentially let you replace one word with another. Normally, that would be used to bring different word forms to a single normal form (e.g. to normalize all the variants such as "walks", "walked", "walking" to the normal form "walk"). It can also be used to implement stemming exceptions, because stemming is not applied to words found in the forms list.

Continuing with previous item, how can I determine a plural (different flavors: test=>tests fish=>fish and leaf=>leaves)

Sphinx supports the Porter Stemming Algorithm:

The Porter stemming algorithm (or ‘Porter stemmer’) is a process for removing the commoner morphological and inflexional endings from words in English. Its main use is as part of a term normalisation process that is usually done when setting up Information Retrieval systems.
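To give a feel for what the stemmer does with plurals, here is a sketch of only the first rule group of the Porter algorithm (step 1a), written in PHP. The function name `porterStep1a` is made up for this example, and the full algorithm has several further steps that are not shown.

```php
<?php
/**
 * Step 1a of the Porter stemmer: plural endings only.
 * Rules, tried in order: SSES -> SS, IES -> I, SS -> SS, S -> (nothing).
 */
function porterStep1a(string $word): string
{
    $word = strtolower($word);

    if (substr($word, -4) === 'sses') {
        return substr($word, 0, -2);   // caresses -> caress
    }
    if (substr($word, -3) === 'ies') {
        return substr($word, 0, -2);   // ponies -> poni
    }
    if (substr($word, -2) === 'ss') {
        return $word;                  // caress -> caress
    }
    if (substr($word, -1) === 's') {
        return substr($word, 0, -1);   // cpus -> cpu
    }
    return $word;
}
```

Applied to the question's examples, this step maps "tests" to "test" and "cpus" to "cpu" and leaves "fish" alone, but it turns "leaves" into "leave" rather than "leaf". Irregular pairs like that are the kind of case a wordforms exception list is meant for.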

Let's say I wanted to use the search term "vendor:intel", where vendor specifies the field name (field_name), do you think there would be a huge impact on the sql server?

3.2. Attributes

A good example for attributes would be a forum posts table. Assume that only title and content fields need to be full-text searchable - but that sometimes it is also required to limit search to a certain author or a sub-forum (ie. search only those rows that have some specific values of author_id or forum_id columns in the SQL table); or to sort matches by post_date column; or to group matching posts by month of the post_date and calculate per-group match counts.

This can be achieved by specifying all the mentioned columns (excluding title and content, that are full-text fields) as attributes, indexing them, and then using API calls to setup filtering, sorting, and grouping.

You can also use the 5.3. Extended query syntax to search specific fields (as opposed to filtering results by attributes):

field search operator: @vendor intel

How does a search engine index a set of fields and bind the found phrases/keywords/etc with the particular field id?

8.6.1. Query

On success, Query() returns a result set that contains some of the found matches (as requested by SetLimits()) and additional general per-query statistics. The result set is a hash (PHP specific; other languages might utilize other structures instead of hash) with the following keys and values:

"matches":
Hash which maps found document IDs to another small hash containing document weight and attribute values (or an array of the similar small hashes if SetArrayResult() was enabled).

"total":
Total amount of matches retrieved on server (ie. to the server side result set) by this query. You can retrieve up to this amount of matches from server for this query text with current query settings.

"total_found":
Total amount of matching documents in index (that were found and processed on server).

"words":
Hash which maps query keywords (case-folded, stemmed, and otherwise processed) to a small hash with per-keyword statistics ("docs", "hits").

"error":
Query error message reported by searchd (string, human readable). Empty if there were no errors.

"warning":
Query warning message reported by searchd (string, human readable). Empty if there were no warnings.

Also see Listing 11 and Listing 13 from Build a custom search engine with PHP.
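If each field row is indexed as one document whose ID is its field_id, the keys of "matches" are the field ids you are after. The sketch below works on a hand-written array in the shape quoted above; the ids, weights and counts are made up, the inner key names "weight" and "attrs" are my recollection of the PHP API rather than something stated in the quote, and `matchedIds` is a hypothetical helper.

```php
<?php
// Hand-written stand-in for a Query() result; not real searchd output.
$result = [
    'matches' => [
        101 => ['weight' => 2, 'attrs' => []],
        104 => ['weight' => 1, 'attrs' => []],
    ],
    'total'       => 2,
    'total_found' => 2,
    'words'       => ['intel' => ['docs' => 2, 'hits' => 2]],
    'error'       => '',
    'warning'     => '',
];

/**
 * Return the matched document ids (here: field ids), or an empty list
 * when the query failed, reported an error, or matched nothing.
 */
function matchedIds($result): array
{
    if (!is_array($result) || !empty($result['error']) || empty($result['matches'])) {
        return [];
    }
    return array_keys($result['matches']);
}

$fieldIds = matchedIds($result);
```

The returned ids can then be used to fetch the full rows from MySQL by primary key.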

Answer 2:

Grab a list of stop words (non-keywords) from here; the author has even formatted them in PHP for you: http://armandbrahaj.blog.al/2009/04/14/list-of-english-stop-words/

Then simply do a preg_replace on the string you are indexing.

What I've done in the past is remove suffixes like 's', 'ed' etc. with a regex and use the same regex on the search string. It's not ideal though. This was for a basic website with only 200 pages.
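A minimal sketch of that approach might look like the following. The function name `normalizeText`, the short stop word list and the exact suffix pattern are assumptions for illustration; the point is that the same function is applied to both the indexed text and the search string.

```php
<?php
// Illustrative stop word list (assumption); use a fuller list in practice.
$stopWords = ['the', 'is', 'of', 'a', 'and'];

/**
 * Lowercase the text, drop stop words, and strip a few common suffixes.
 */
function normalizeText(string $text, array $stopWords): string
{
    $text = strtolower($text);

    // \b limits the match to whole words, so "is" does not hit "this".
    $quoted = array_map(function ($w) {
        return preg_quote($w, '/');
    }, $stopWords);
    $text = preg_replace('/\b(' . implode('|', $quoted) . ')\b/', ' ', $text);

    // Crude suffix stripping: remove "'s", "ed" or "s" at the end of a word,
    // but only when at least three letters come before the suffix.
    $text = preg_replace("/(?<=[a-z]{3})('s|ed|s)\\b/", '', $text);

    // Collapse the gaps left behind by the removals.
    return trim(preg_replace('/\s+/', ' ', $text));
}
```

As the answer says, this is not ideal: the suffix rule is blind to meaning, so a word like "this" would be cut to "thi" if it were not already on the stop word list.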

If you are concerned about performance you might want to consider using a search engine like Lucene (Solr) instead of a database. This will make indexing much easier. You don't want to reinvent the wheel here.

Answer 3:

filtering out common words (as you perhaps noticed, "the" "is" "of" and "intel's" are missing from list)

Find (or create) a list of common words and filter user input.

With regards to "cpus" (plurals vs singulars), would it be best to use a particular type (singular or plural), both or exact (ie, "cpus" is different "cpu")?

Depends. I would search for both if that's not a big burden; or for the singular form using the LIKE clause if possible.

Continuing with previous item, how can I determine a plural (different flavors: test=>tests fish=>fish and leaf=>leaves)

Create an Inflector method or class, e.g. Inflect::plural('fish') gives you 'fish'. There might be classes like these for the English language; look them up.
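A very small version of such a class could look like this. It is a hypothetical sketch, not an existing library: the word lists and suffix rules are illustrative and far from complete.

```php
<?php
/**
 * Hypothetical minimal inflector for English nouns.
 */
class Inflect
{
    // Words whose plural equals the singular.
    private static $uncountable = ['fish', 'sheep', 'series'];

    // Irregular forms that no suffix rule predicts (singular => plural).
    private static $irregular = ['leaf' => 'leaves', 'man' => 'men', 'child' => 'children'];

    public static function plural(string $word): string
    {
        $word = strtolower($word);
        if (in_array($word, self::$uncountable, true)) {
            return $word;
        }
        if (isset(self::$irregular[$word])) {
            return self::$irregular[$word];
        }
        if (preg_match('/(s|x|z|ch|sh)$/', $word)) {
            return $word . 'es';                    // box -> boxes
        }
        if (preg_match('/[^aeiou]y$/', $word)) {
            return substr($word, 0, -1) . 'ies';    // category -> categories
        }
        return $word . 's';                         // test -> tests
    }

    public static function singular(string $word): string
    {
        $word = strtolower($word);
        if (in_array($word, self::$uncountable, true)) {
            return $word;
        }
        $found = array_search($word, self::$irregular, true);
        if ($found !== false) {
            return $found;                          // leaves -> leaf
        }
        if (preg_match('/ies$/', $word)) {
            return substr($word, 0, -3) . 'y';      // categories -> category
        }
        if (preg_match('/(s|x|z|ch|sh)es$/', $word)) {
            return substr($word, 0, -2);            // boxes -> box
        }
        if (preg_match('/[^s]s$/', $word)) {
            return substr($word, 0, -1);            // cpus -> cpu
        }
        return $word;
    }
}
```

For indexing, running every keyword through `Inflect::singular()` at both index time and search time would make "cpus" and "cpu" land on the same entry.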

I'm currently using MySql and I'm very concerned with performance issues; we have 500+ categories and we didn't even launch the site

Having good schema and code design helps, but I can't really give you much advice on that one.

Let's say I wanted to use the search term "vendor:intel", where vendor specifies the field name (field_name), do you think there would be a huge impact on the sql server?

That would really help, since you'd be looking up a single column instead of multiple. Just be careful to filter user input and/or allow looking up only particular columns.
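One way to do that filtering is to split the query into field-restricted and free terms and accept only whitelisted field names. This is a sketch under assumptions: `parseQuery` is a made-up name and the whitelist simply reuses the field names from the question's example.

```php
<?php
// Field names a user may search by (assumed whitelist).
$allowedFields = ['name', 'cores', 'vendor', 'description'];

/**
 * Split a query such as "vendor:intel i7" into field-restricted and free terms.
 * Unknown prefixes are kept as plain text rather than treated as column names.
 */
function parseQuery(string $query, array $allowedFields): array
{
    $parsed = ['fields' => [], 'terms' => []];

    foreach (preg_split('/\s+/', trim($query), -1, PREG_SPLIT_NO_EMPTY) as $token) {
        $parts = explode(':', $token, 2);
        $field = strtolower($parts[0]);

        if (count($parts) === 2 && $parts[1] !== '' && in_array($field, $allowedFields, true)) {
            $parsed['fields'][$field][] = $parts[1];
        } else {
            $parsed['terms'][] = $token;
        }
    }
    return $parsed;
}

$parsed = parseQuery('vendor:intel i7', $allowedFields);
```

Here `$parsed['fields']` contains `'vendor' => ['intel']` and `$parsed['terms']` contains `'i7'`. The field name is safe to use because it came from the whitelist; the values should still be bound through prepared-statement placeholders rather than concatenated into SQL.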

Search throttling; I don't like this at all, but it's a possibility, and if you know of any workarounds, make yourself heard!

Not many options here. To help here and in performance, you should consider having some sort of caching.
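As a sketch of the caching idea: results are stored under a normalised form of the query and reused until they expire. The class name `SearchCache` is hypothetical, and the in-memory array is only for illustration, since it lives for a single PHP request; a real site would more likely keep the entries in a shared store such as Memcached.

```php
<?php
/**
 * Tiny query-result cache with a time-to-live (illustrative only).
 */
class SearchCache
{
    private $entries = [];
    private $ttl;

    public function __construct(int $ttlSeconds)
    {
        $this->ttl = $ttlSeconds;
    }

    /**
     * Return the cached result for $query, or run $search and remember its result.
     */
    public function get(string $query, callable $search)
    {
        // Normalise so that "Intel  i7" and "intel i7" share one entry.
        $key = md5(preg_replace('/\s+/', ' ', strtolower(trim($query))));
        $now = time();

        if (isset($this->entries[$key]) && $this->entries[$key]['expires'] > $now) {
            return $this->entries[$key]['value'];
        }

        $value = $search($query);
        $this->entries[$key] = ['value' => $value, 'expires' => $now + $this->ttl];
        return $value;
    }
}
```

Usage would be along the lines of `$cache->get($userQuery, $runSearch)`, where `$runSearch` is whatever callable actually hits the index. Repeated popular queries then cost one lookup instead of one search each.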

Answer 4:

I would heartily suggest you take a look at Solr. It's a Java-based, self-contained search and indexing system and probably has more benefits than a PHP solution.

Answer 5:

Search is tough to implement. I would recommend using a package if you're new to it.

Have you considered http://framework.zend.com/manual/en/zend.search.lucene.html ?

Answer 6:

Since many are suggesting to use an existing package, (and I want to make it harder for you than just suggesting a package ;-) ), let's presume I will use such a package (over in this answer thread).

How does a search engine index a set of fields and bind the found phrases/keywords/etc with the particular field id? That's not the question I want answered, at least not directly. My issue is, how easy is it to make the search engine work as I want? Given my above requirements, is this even possible/feasible?

From personal experience, I'd rather spend some time tweaking my own system than fixing someone else's code, which I'd have to spend far more time understanding first. Call me conservative, but I rarely stick with someone else's code/programs, and when I did, it was because of a desperate situation, and I usually ended up contributing to said project somehow.

Answer 7:

There's a PHP implementation of a Brill Part of Speech tagger on php/ir. This might provide a framework for identifying the words that should be discarded and those you want to index, while also identifying plurals (and the root singular). It's not perfect, though with a custom dictionary to handle technical terms it could prove useful for resolving your first three questions.

Source: https://stackoverflow.com/questions/3315910/php-word-index-performance-and-reasonable-results

Tags: php, mysql, performance, indexing, word