Can NLTK be used in a Postgres Python Stored Procedure

臣服心动 2021-02-04 16:32

Has anyone done this, or does anyone know if it's even possible, to use NLTK within a Postgres Python stored procedure or trigger?

1 Answer
  • 2021-02-04 16:58

    You can use pretty much any Python library in a PL/Python stored procedure or trigger.

    See the PL/Python documentation.

    Concepts

    The crucial point to understand is that PL/Python is CPython (in PostgreSQL up to and including 9.3, anyway); it uses exactly the same interpreter that the normal standalone Python does, it just loads it as a library into the PostgreSQL backend. With a few limitations (outlined below), if it works with CPython it works with PL/Python.

    If you have multiple Python interpreters installed on your system - versions, distributions, 32-bit vs 64-bit etc - you might need to make sure you're installing extensions and libraries into the right one when running distutils scripts, etc, but that's about it.
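
    For example, a quick way to check exactly which interpreter PL/Python is running (the function name here is just an illustration, not part of PL/Python) is to ask it from inside a function:

    CREATE OR REPLACE FUNCTION plpython_version_info() RETURNS text AS $$
    # Report which CPython installation this backend has loaded, so you
    # can be sure you install nltk into that interpreter's site-packages.
    import sys
    return "Python %s at %s" % (sys.version.split()[0], sys.prefix)
    $$ LANGUAGE plpythonu;

    SELECT plpython_version_info();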

    Since you can load any library available to the system Python there's no reason to think NLTK would be a problem unless you know it requires things like threading that aren't really recommended in a PostgreSQL backend. (Sure enough, I tried it and it "just worked", see below).

    One possible concern is that the startup overhead of something like NLTK might be quite large; you probably want to preload PL/Python in the postmaster and import the module in your setup code so it's ready when backends start. Understand that the postmaster is the parent process that all the other backends fork() from, so if the postmaster preloads something it's available to the backends with greatly reduced overhead. Test performance either way.
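
    As a rough way to see whether that import cost has already been paid in the current session (the function name is made up for this example), you can look in sys.modules:

    CREATE OR REPLACE FUNCTION nltk_is_loaded() RETURNS boolean AS $$
    # True once an earlier call in this session has imported nltk: CPython
    # caches modules in sys.modules for the life of the interpreter, which
    # here is the life of the database connection.
    import sys
    return 'nltk' in sys.modules
    $$ LANGUAGE plpythonu;

    Calling it before and after nltk_word_tokenize in the same session shows that only the first call per backend pays for the import.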

    Security

    Because you can load arbitrary C libraries via PL/Python and because the Python interpreter has no real security model, plpythonu is an "untrusted" language. Scripts have full and unrestricted access to the system as the postgres user and can fairly simply bypass access controls in PostgreSQL. For obvious security reasons this means that PL/Python functions and triggers may only be created by the superuser, though it's quite reasonable to GRANT normal users the ability to run carefully written functions that were installed by the superuser.
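
    For example (the app_user role name is invented for illustration, and the function is the nltk_word_tokenize from the test below), the superuser can install a function and then restrict who may call it; functions are executable by PUBLIC by default, so the REVOKE matters:

    -- As the superuser that created the function:
    REVOKE ALL ON FUNCTION nltk_word_tokenize(text) FROM PUBLIC;
    GRANT EXECUTE ON FUNCTION nltk_word_tokenize(text) TO app_user;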

    The upside is that you can do pretty much anything you can do in normal Python, keeping in mind that the Python interpreter's lifetime is that of the database connection (session). Threading isn't recommended, but most other things are fine.

    PL/Python functions must be written with careful input sanitization, must set search_path when running queries via the SPI, etc. This is discussed more in the manual.
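
    A minimal sketch of what that looks like (the documents table, its columns and the function name are invented for the example): pass user-supplied values as typed query parameters through plpy.prepare / plpy.execute rather than formatting them into the SQL string, and pin search_path on the function itself:

    CREATE OR REPLACE FUNCTION count_docs_containing(word text)
    RETURNS bigint AS $$
    # A prepared SPI plan with a typed parameter: the caller's input is
    # bound as a value, never interpolated into the SQL text itself.
    plan = plpy.prepare(
        "SELECT count(*) AS n FROM documents"
        " WHERE body ILIKE '%' || $1 || '%'",
        ["text"])
    return plpy.execute(plan, [word])[0]["n"]
    $$ LANGUAGE plpythonu
       SET search_path = public;  -- don't rely on the caller's search_path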

    Limitations

    Long-running or potentially problematic things like DNS lookups, HTTP connections to remote systems, SMTP mail delivery, etc should generally be done from a helper script using LISTEN and NOTIFY rather than an in-backend job in order to preserve PostgreSQL's performance and avoid hampering VACUUM with lots of long transactions. You can do these things in the backend, it just isn't a great idea.
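
    One common shape for that (the channel, table and function names below are illustrative) is a cheap trigger that merely notifies an external worker, which holds its own connection, LISTENs on the channel and does the slow network work outside the backend:

    CREATE OR REPLACE FUNCTION queue_doc_job() RETURNS trigger AS $$
    # Keep the in-backend part cheap: just tell an external worker that new
    # work exists. The worker LISTENs on 'doc_jobs' over its own connection
    # and does the slow HTTP/SMTP/DNS work outside the PostgreSQL backend.
    plpy.execute(plpy.prepare("SELECT pg_notify('doc_jobs', $1)", ["text"]),
                 [str(TD["new"]["id"])])
    $$ LANGUAGE plpythonu;

    CREATE TRIGGER documents_notify
        AFTER INSERT ON documents
        FOR EACH ROW EXECUTE PROCEDURE queue_doc_job();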

    You should avoid creating threads within the PostgreSQL backend.

    Don't attempt to load any Python library that'll load the libpq C library. This could cause all sorts of exciting problems with the backend. When talking to PostgreSQL from PL/Python use the SPI routines not a regular client library.
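
    For instance, querying the current database from inside a function goes through plpy (the function name is only an example); importing a client driver such as psycopg2 and connecting back to the server would drag libpq into the backend and open a second, independent session:

    CREATE OR REPLACE FUNCTION spi_backend_pid() RETURNS int AS $$
    # Runs inside the current backend through the SPI: no new connection,
    # no client library, same transaction as the caller.
    return plpy.execute("SELECT pg_backend_pid() AS pid")[0]["pid"]
    $$ LANGUAGE plpythonu;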

    Don't do very long-running things in the backend; you'll cause vacuum problems.

    Don't load anything that might load a different version of an already loaded native C library - say a different libcrypto, libssl, etc.

    Don't write directly to files in the PostgreSQL data directory, ever.

    PL/Python functions run as the postgres system user on the OS, so they don't have access to things like the user's home directory or files on the client side of the connection.

    Test result

    $ yum install python-nltk
    $ psql -U postgres regress
    
    regress=# CREATE LANGUAGE plpythonu;
    
    regress=# CREATE OR REPLACE FUNCTION nltk_word_tokenize(word text) RETURNS text[] AS $$
              import nltk
              return nltk.word_tokenize(word)
              $$ LANGUAGE plpythonu;
    
    regress=# SELECT nltk_word_tokenize('This is a test, it''s going to work fine');
                  nltk_word_tokenize               
    -----------------------------------------------
     {This,is,a,test,",",it,'s,going,to,work,fine}
    (1 row)
    

    So, as I said: try it. So long as the Python interpreter PostgreSQL is using for plpython has nltk and its dependencies installed, it will work fine.

    Note

    PL/Python is CPython, but I'd love to see a PyPy based alternative that can run untrusted code using PyPy's sandbox features.
