NLTK doesn't add $NLTK_DATA to its search path?

北恋 2020-11-29 12:53

Under Linux, I have set the environment variable $NLTK_DATA ('/home/user/data/nltk'), and the test below works as expected:

>>> from nltk.corpus import brown
>>> b         


        
2 Answers
  • 2020-11-29 13:30

    If you don't want to set $NLTK_DATA before running your scripts, you can add the directory to the search path from within the Python script:

    import nltk
    nltk.data.path.append('/home/alvas/some_path/nltk_data/')
    

    For example, let's move nltk_data to a non-standard path that NLTK won't find automatically:

    alvas@ubi:~$ ls nltk_data/
    chunkers  corpora  grammars  help  misc  models  stemmers  taggers  tokenizers
    alvas@ubi:~$ mkdir some_path
    alvas@ubi:~$ mv nltk_data/ some_path/
    alvas@ubi:~$ ls nltk_data/
    ls: cannot access nltk_data/: No such file or directory
    alvas@ubi:~$ ls some_path/nltk_data/
    chunkers  corpora  grammars  help  misc  models  stemmers  taggers  tokenizers
    

    Now, we use the nltk.data.path.append() hack:

    alvas@ubi:~$ python
    >>> import os
    >>> import nltk
    >>> nltk.data.path.append('/home/alvas/some_path/nltk_data/')
    >>> nltk.pos_tag('this is a foo bar'.split())
    [('this', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('foo', 'JJ'), ('bar', 'NN')]
    >>> nltk.data
    <module 'nltk.data' from '/usr/local/lib/python2.7/dist-packages/nltk/data.pyc'>
    >>> nltk.data.path
    ['/home/alvas/some_path/nltk_data/', '/home/alvas/nltk_data', '/usr/share/nltk_data', '/usr/local/share/nltk_data', '/usr/lib/nltk_data', '/usr/local/lib/nltk_data']
    >>> exit()
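
When a lookup still fails after changing the search path, it can help to see which entries actually exist on disk and which of them contain the resource category you need. The following is only a rough sketch using the standard library; `describe_search_path` is a hypothetical helper name, and the list passed in would normally be `nltk.data.path`.

```python
import os

def describe_search_path(paths, resource='taggers'):
    """Map each directory to 'missing', 'no <resource>', or 'ok'."""
    report = {}
    for p in paths:
        if not os.path.isdir(p):
            # The entry is on the search path but does not exist on disk.
            report[p] = 'missing'
        elif not os.path.isdir(os.path.join(p, resource)):
            # The directory exists but has no sub-directory for this category.
            report[p] = 'no ' + resource
        else:
            report[p] = 'ok'
    return report

# With NLTK installed, this would be called as:
#   import nltk
#   print(describe_search_path(nltk.data.path))
```

The `resource` argument corresponds to one of the top-level folders shown in the `ls nltk_data/` listing above, such as `corpora` or `taggers`.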
    

    Let's move it back and see whether it works:

    alvas@ubi:~$ ls nltk_data
    ls: cannot access nltk_data: No such file or directory
    alvas@ubi:~$ mv some_path/nltk_data/ .
    alvas@ubi:~$ python
    >>> import nltk
    >>> nltk.data.path
    ['/home/alvas/nltk_data', '/usr/share/nltk_data', '/usr/local/share/nltk_data', '/usr/lib/nltk_data', '/usr/local/lib/nltk_data']
    >>> nltk.pos_tag('this is a foo bar'.split())
    [('this', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('foo', 'JJ'), ('bar', 'NN')]
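
The environment variable and the in-script approach can also be combined, so that a script honours $NLTK_DATA when it is set and otherwise falls back to a hard-coded directory. This is a minimal sketch, assuming $NLTK_DATA holds one or more directories separated by `os.pathsep` (which is how I understand NLTK reads it); `dirs_from_env` and `add_to_search_path` are hypothetical helper names, not part of NLTK.

```python
import os

def dirs_from_env(environ=None, fallback=None):
    """Return the directories listed in $NLTK_DATA, or [fallback] if none are set."""
    environ = os.environ if environ is None else environ
    raw = environ.get('NLTK_DATA', '')
    dirs = [d for d in raw.split(os.pathsep) if d]
    if not dirs and fallback:
        dirs = [fallback]
    return dirs

def add_to_search_path(search_path, dirs):
    """Append each directory that is not already present; returns the same list."""
    for d in dirs:
        if d not in search_path:
            search_path.append(d)
    return search_path

# With NLTK installed, this would be used as:
#   import nltk
#   add_to_search_path(nltk.data.path,
#                      dirs_from_env(fallback='/home/alvas/some_path/nltk_data/'))
```

Because `add_to_search_path` skips entries that are already present, calling it more than once does not leave duplicates in the list.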
    

    If you really really want to find nltk_data automagically, use something like:

    import os
    import sys
    import time
    
    import nltk
    
    CACHE_FILE = 'where_is_nltk_data.txt'
    
    def find(name, path):
        # Walk the tree under `path` and return the first directory named `name`.
        # os.walk uses os.scandir internally on Python 3.5+, so the separate
        # scandir package is not needed.
        for root, dirs, files in os.walk(path):
            if os.path.basename(root) == name:
                return root
        return None
    
    def find_nltk_data():
        # Slow: scans the whole filesystem, then caches the result on disk.
        start = time.time()
        path_to_nltk_data = find('nltk_data', '/')
        print('Finding nltk_data took', time.time() - start, file=sys.stderr)
        print('nltk_data at', path_to_nltk_data, file=sys.stderr)
        if path_to_nltk_data is None:
            raise LookupError('no nltk_data directory found')
        with open(CACHE_FILE, 'w') as fout:
            fout.write(path_to_nltk_data)
        return path_to_nltk_data
    
    def magically_find_nltk_data():
        # Reuse the cached location if it is still valid; otherwise search again.
        path_to_nltk_data = None
        if os.path.exists(CACHE_FILE):
            with open(CACHE_FILE) as fin:
                path_to_nltk_data = fin.read().strip()
        if not path_to_nltk_data or not os.path.exists(path_to_nltk_data):
            path_to_nltk_data = find_nltk_data()
        nltk.data.path.append(path_to_nltk_data)
    
    
    magically_find_nltk_data()
    print(nltk.pos_tag('this is a foo bar'.split()))
    

    Let's call that Python script test.py:

    alvas@ubi:~$ ls nltk_data/
    chunkers  corpora  grammars  help  misc  models  stemmers  taggers  tokenizers
    alvas@ubi:~$ python test.py
    Finding nltk_data took 4.27330780029
    nltk_data at /home/alvas/nltk_data
    [('this', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('foo', 'JJ'), ('bar', 'NN')]
    alvas@ubi:~$ mv nltk_data/ some_path/
    alvas@ubi:~$ python test.py
    Finding nltk_data took 4.75850391388
    nltk_data at /home/alvas/some_path/nltk_data
    [('this', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('foo', 'JJ'), ('bar', 'NN')]
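
The full scan from `/` takes several seconds in the runs above. A cheaper variant, sketched here under the assumption that the data lives somewhere below one of a few known roots, limits both the roots and the search depth. `find_dir`, `roots` and `max_depth` are hypothetical names chosen for this illustration.

```python
import os

def find_dir(name, roots, max_depth=3):
    """Return the first directory called `name` at most `max_depth` levels below a root."""
    for root in roots:
        root = os.path.abspath(root)
        for current, dirs, _files in os.walk(root):
            if os.path.basename(current) == name:
                return current
            # Depth of `current` relative to the root (the root itself is depth 0).
            rel = os.path.relpath(current, root)
            depth = 0 if rel == os.curdir else rel.count(os.sep) + 1
            if depth >= max_depth:
                # Clearing `dirs` in place tells os.walk not to descend further.
                dirs[:] = []
    return None

# Possible use: find_dir('nltk_data', [os.path.expanduser('~'), '/usr/share'])
```

Roots that do not exist are skipped silently, because `os.walk` yields nothing for them, and the function returns `None` when no match is found.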
    
  • 2020-11-29 13:33

    If you want to install the NLTK data inside a conda environment, without specifying the data location in every script or exporting the environment variable, do the following:

    1. Activate your desired conda environment.
    2. Print sys.prefix within your conda environment and copy this path (let's say /home/dickens/envs/nltk_env).
    3. Run nltk.download() within the conda environment, select your desired packages, and set the download location to the path from above with /share/nltk_data appended. In our case, that is /home/dickens/envs/nltk_env/share/nltk_data.
    4. You are now good to go!
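
Steps 2 and 3 can also be scripted instead of using the interactive downloader. This is a minimal sketch, assuming that `nltk.download` accepts a `download_dir` argument and that NLTK searches `<sys.prefix>/share/nltk_data` by default (as far as I know it does, which would explain why no further configuration is needed afterwards). `env_data_dir` and `download_into_env` are hypothetical helper names.

```python
import os
import sys

def env_data_dir(prefix=sys.prefix):
    """Directory from step 3: <environment prefix>/share/nltk_data."""
    return os.path.join(prefix, 'share', 'nltk_data')

def download_into_env(package):
    """Download one NLTK package into the active environment's data directory."""
    import nltk  # imported lazily so env_data_dir() works without NLTK installed
    return nltk.download(package, download_dir=env_data_dir())

# Example (needs network access, run inside the activated environment):
#   download_into_env('brown')
```

Run it with the Python interpreter of the activated environment, so that `sys.prefix` points at that environment rather than at the base installation.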