Compare strings of text between two tables in a database or locally

Posted by China☆狼群 on 2019-12-25 08:42:17

Question


Edit: SQL doesn't work for this. I just found out about Solr/Sphinx and it seems like the right tool for this problem, so if you know Solr or Sphinx I'm eager to hear from you.

Basically, I have a .tsv with patent info and a .csv with product names. I need to match each row of the patents column against the product names and extract the occurrences in a new .csv column.

You can scroll down and see the example at the end.

Original question:

SQL newbie here so bear with me :). I can't figure out how to do this:

My database:

mysql> SHOW TABLES;
+-----------------------+
| Tables_in_prodpatdb   |
+-----------------------+
| assignee              |
| patents               |
| patent_info           |
| products              |
+-----------------------+
mysql> DESCRIBE patents;
+-------------+-------------+------+-----+---------+-------+
| Field       | Type        | Null | Key | Default | Extra |
+-------------+-------------+------+-----+---------+-------+
| ...         |             |      |     |         |       |
| patent_id   | varchar(20) | YES  |     | NULL    |       |
| text        | text        | YES  |     | NULL    |       |
| ...         |             |      |     |         |       |
+-------------+-------------+------+-----+---------+-------+
mysql> DESCRIBE products;
+-------------+-------------+------+-----+---------+-------+
| Field       | Type        | Null | Key | Default | Extra |
+-------------+-------------+------+-----+---------+-------+
| name        | text        | YES  |     | NULL    |       |
+-------------+-------------+------+-----+---------+-------+

I have to work with the columns name and text, which look like this:

name
product1
product2
product3
...
~10M rows

text
long text description 1
long text description 2
long text description 3
...
~88M rows

I need to check patents.text row 1 and match it against the products.name column to find every product name in that row, then store those product names in a new table. Then check row 2 and repeat.

If a patents.text row contains a product name several times, only copy it to the new table once. If a row has no product names, just skip it. The output should be something like this:

Operation  Product
1          prod5, prod6
2          prod7
...

An example:

name
valve
a/c fan
farmed salmon
...

text
This patent deals with a new approach to air-conditioned fan. With some new valve the a/c fan is so much better. The new valve is great.
This patent has no product names in it.
This patent talks about farmed salmon.
...

Desired output:

Operation   Product
1           valve, a/c fan
2           farmed salmon
...

Answer 1:


You can use GROUP_CONCAT with an inner SELECT query, e.g.:

-- For each patent, collect every product name found in its text (case-insensitive)
SELECT p.text,
       (SELECT GROUP_CONCAT(name)
          FROM products
         WHERE LOCATE(LOWER(name), LOWER(p.text)) > 0) AS products
FROM patents p;
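
To see what this query returns on the sample data from the question, here is a small sketch that uses Python's built-in sqlite3 module as a stand-in for MySQL. The stand-in is an assumption made for convenience: SQLite has no LOCATE, so the sketch uses INSTR, which takes its arguments in the opposite order. The patent_id values and the ORDER BY are made up for the demo.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL schema (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patents (patent_id TEXT, "text" TEXT);
CREATE TABLE products (name TEXT);
""")

# Sample rows from the question; the patent_id values are hypothetical.
conn.executemany(
    "INSERT INTO patents VALUES (?, ?)",
    [
        ("P1", "This patent deals with a new approach to air-conditioned fan. "
               "With some new valve the a/c fan is so much better. "
               "The new valve is great."),
        ("P2", "This patent has no product names in it."),
        ("P3", "This patent talks about farmed salmon."),
    ],
)
conn.executemany(
    "INSERT INTO products VALUES (?)",
    [("valve",), ("a/c fan",), ("farmed salmon",)],
)

# Same shape as the MySQL query: a correlated subquery that concatenates
# every product name occurring as a substring of the patent text.
QUERY = """
SELECT p.patent_id,
       (SELECT GROUP_CONCAT(name)
          FROM products
         WHERE INSTR(LOWER(p."text"), LOWER(name)) > 0) AS products
  FROM patents p
 ORDER BY p.patent_id
"""

rows = conn.execute(QUERY).fetchall()
for patent_id, products in rows:
    print(patent_id, products)
```

Two things show up in the result: a patent with no matches still comes back, with NULL in the products column, so it has to be filtered out if such rows should be skipped; and the match is a plain substring test, so a short name such as "fan" would also match inside a longer word.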



Answer 2:


The only way I can see doing this with a reasonable performance is a full text search. I've seldom done these myself (maybe 3 times in 20+ years now); so I'll defer to someone else w/ more experience.

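I'm using https://dev.mysql.com/doc/refman/5.7/en/fulltext-search.html as a starting point.

Provided the full-text index has been created, the idea would be something like the query below. One caveat from the MySQL manual: the AGAINST search string must be constant during query evaluation, which rules out a table column such as p.name, so treat this as a sketch of the intent rather than a query that runs as written: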

SELECT pat.patent_id, GROUP_CONCAT(p.name)
FROM patents pat
CROSS JOIN products p
WHERE MATCH (pat.text)
        AGAINST (p.name IN NATURAL LANGUAGE MODE)
GROUP BY pat.patent_id;

Since we compare every product against every patent, we have to cross join, which gives 10M x 88M, or roughly 880 trillion, row combinations. That alone is a lot. The more reading I do on this, however, the more I realize we're dealing with unstructured data in an RDBMS. By its nature that's not an ideal fit, and there may be much more optimized methods to handle this outside of an RDBMS. Otherwise, we have to spend the time to structure the data in the RDBMS so it can make more effective use of the indexes (such as splitting the text into its own rows per word for indexing).

Lastly, do we really need to look for ALL products? The sheer size of the data involved on both sides means this is going to take time in a database that doesn't handle unstructured data well.
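
As a rough illustration of handling this outside the RDBMS by splitting the text into words, here is a minimal Python sketch. It is not from the original answers: the function names (tokenize, build_index, find_products, match_rows) are hypothetical, and it assumes product names can be matched as whole-word sequences after lowercasing and stripping surrounding punctuation. Product names are stored as word tuples in a dictionary, and each patent text is scanned once, looking up every run of consecutive words up to the length of the longest product name.

```python
def tokenize(text):
    """Lowercase, split on whitespace, and strip punctuation around each word."""
    words = (raw.strip(".,;:!?()[]\"'") for raw in text.lower().split())
    return [w for w in words if w]


def build_index(names):
    """Map each product name's word tuple to the name; also return the longest tuple length."""
    index = {}
    for name in names:
        key = tuple(tokenize(name))
        if key:
            index.setdefault(key, name)
    max_len = max((len(key) for key in index), default=0)
    return index, max_len


def find_products(text, index, max_len):
    """Return each product name found in text once, in order of first appearance."""
    tokens = tokenize(text)
    found = {}  # dicts keep insertion order, so this also deduplicates
    for start in range(len(tokens)):
        for size in range(1, min(max_len, len(tokens) - start) + 1):
            key = tuple(tokens[start:start + size])
            if key in index:
                found.setdefault(key, index[key])
    return list(found.values())


def match_rows(texts, names):
    """Return (row_number, product_names) pairs, skipping rows with no match."""
    index, max_len = build_index(names)
    results = []
    for row_number, text in enumerate(texts, start=1):
        hits = find_products(text, index, max_len)
        if hits:
            results.append((row_number, hits))
    return results


# Sample data taken from the question.
names = ["valve", "a/c fan", "farmed salmon"]
texts = [
    "This patent deals with a new approach to air-conditioned fan. "
    "With some new valve the a/c fan is so much better. The new valve is great.",
    "This patent has no product names in it.",
    "This patent talks about farmed salmon.",
]

matches = match_rows(texts, names)
for row_number, hits in matches:
    print(row_number, ", ".join(hits))
```

The cost per patent depends on the text length and the longest product name, not on the number of products, which is the point of indexing by word. The trade-offs are that the whole product list must fit in memory, and that exact word matching will miss variants such as plurals.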

Scratch the below, as it will not be able to handle the load effectively. I'm keeping it here for posterity.

I think concat() and group_concat() may do the trick.

We join where patents.text is LIKE the product name, generating multiple rows. GROUP_CONCAT then combines these rows into one record. I'm not sure where "Operation" comes from in your result.

SELECT pat.text, GROUP_CONCAT(p.name) AS Product
FROM patents pat
INNER JOIN products p
   ON pat.text LIKE CONCAT('%', p.name, '%')
GROUP BY pat.text;

However, don't expect this to be fast: we're doing a wildcard search with a % on both ends, so no index can be used.



Source: https://stackoverflow.com/questions/44681622/compare-strings-of-text-between-two-tables-in-a-database-or-locally
