How to extract n-gram word sequences from text in Postgres

大憨熊 提交于 2019-12-24 04:57:06

问题


I am hoping to use Postgres to extract sequences of words from Text. For example the whole word trigrams for the following sentence

"ed ut perspiciatis, unde omnis iste natus error sit voluptatem accusantium"

would be

  • "ed ut perspiciatis"
  • "ut perspiciatis unde"
  • "perspiciatis unde omnis" ...

I have been doing this with R but I am hoping Postgres would be able to handle it more efficiently.

I have seen a similar question asked here n-grams from text in PostgreSQL but I don't understand how to use pg_trgm to extract word sequences


回答1:


The function below assumes that a word consists of alphanumeric characters (any others are removed) and a space works as a separator.

create or replace function word_ngrams(str text, n int)
returns setof text language plpgsql as $$
declare
    i int;
    arr text[];
begin
    arr := regexp_split_to_array(str, '[^[:alnum:]]+');
    for i in 1 .. cardinality(arr)- n+ 1 loop
        return next array_to_string(arr[i : i+n-1], ' ');
    end loop;
end $$;

Find all three-word phrases:

select word_ngrams('ed ut perspiciatis, unde omnis iste natus error sit voluptatem accusantium', 3)

        word_ngrams         
----------------------------
 ed ut perspiciatis
 ut perspiciatis unde
 perspiciatis unde omnis
 unde omnis iste
 omnis iste natus
 iste natus error
 natus error sit
 error sit voluptatem
 sit voluptatem accusantium
(9 rows)

Find all six-word phrases:

select word_ngrams('ed ut perspiciatis, unde omnis iste natus error sit voluptatem accusantium', 6)

                 word_ngrams                 
---------------------------------------------
 ed ut perspiciatis unde omnis iste
 ut perspiciatis unde omnis iste natus
 perspiciatis unde omnis iste natus error
 unde omnis iste natus error sit
 omnis iste natus error sit voluptatem
 iste natus error sit voluptatem accusantium
(6 rows)


来源:https://stackoverflow.com/questions/51570272/how-to-extract-n-gram-word-sequences-from-text-in-postgres

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!