n-grams from text in PostgreSQL

柔情痞子 提交于 2019-12-12 11:34:32

问题


I am looking to create n-grams from text column in PostgreSQL. I currently split(on white-space) data(sentences) in a text column to an array.

enter code hereselect regexp_split_to_array(sentenceData,E'\s+') from tableName

Once I have this array, how do I go about:

  • Creating a loop to find n-grams, and write each to a row in another table

Using unnest I can obtain all the elements of all the arrays on separate rows, and maybe I can then think of a way to get n-grams from a single column, but I'd loose the sentence boundaries which I wise to preserve.

Sample SQL code for PostgreSQL to emulate the above scenario

create table tableName(sentenceData  text);

INSERT INTO tableName(sentenceData) VALUES('This is a long sentence');

INSERT INTO tableName(sentenceData) VALUES('I am currently doing grammar, hitting this monster book btw!');

INSERT INTO tableName(sentenceData) VALUES('Just tonnes of grammar, problem is I bought it in TAIWAN, and so there aint any englihs, just chinese and japanese');

select regexp_split_to_array(sentenceData,E'\\s+')   from tableName;

select unnest(regexp_split_to_array(sentenceData,E'\\s+')) from tableName;

回答1:


Check out pg_trgm: "The pg_trgm module provides functions and operators for determining the similarity of text based on trigram matching, as well as index operator classes that support fast searching for similar strings."



来源:https://stackoverflow.com/questions/3045336/n-grams-from-text-in-postgresql

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!