read corpus of text files in spacy

Submitted by 荒凉一梦 on 2020-02-24 08:45:09

Question


All the examples I see for using spaCy read in a single, small text file. How does one load a corpus of text files into spaCy?

I can do this with textacy by pickling all the text in the corpus:

docs =  textacy.io.spacy.read_spacy_docs('E:/spacy/DICKENS/dick.pkl', lang='en')

for doc in docs:
    print(doc)

But I am not clear how to use this generator object (docs) for further analysis.

Also, I would rather use spacy, not textacy.

spaCy also fails to read in a single file that is large (~2,000,000 characters), because nlp() rejects any text longer than nlp.max_length, which defaults to 1,000,000 characters.

Any help is appreciated...

Ravi


Answer 1:


If you can convert your corpus into a data frame so that each row corresponds to a document, you can write a function that does what you want and then apply it row by row:

df['new_column'] = df['document'].apply(lambda x: your_function(x))
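For concreteness, here is a minimal sketch of that approach; the folder path and the your_function body are placeholders, not from the original answer:

import os
import pandas as pd
import spacy

nlp = spacy.load('en_core_web_sm')

# Build a data frame with one row per text file (placeholder path).
corpus_dir = 'path/to/the/corpus_folder'
rows = []
for entry in os.scandir(corpus_dir):
    if entry.is_file():
        with open(entry.path, 'r', encoding='utf-8') as f:
            rows.append({'filename': entry.name, 'document': f.read()})
df = pd.DataFrame(rows)

# Example "your_function": count the tokens in each document.
def your_function(text):
    return len(nlp(text))

df['new_column'] = df['document'].apply(your_function)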

Or, I am not sure if this is what you want, but you can try something like this:

import spacy
import os

nlp = spacy.load('en_core_web_lg')
docs = 'path\\to\\the\\corpus_folder'

def get_filename(path):
    # Collect the paths of all regular files in the corpus folder.
    return [i.path for i in os.scandir(path) if i.is_file()]

files = get_filename(docs)
for filepath in files:
    with open(filepath, 'r', encoding='utf-8') as file_to_read:
        some_text = file_to_read.read()
        print(os.path.basename(filepath))
        # Run the full pipeline on the file's text.
        print(nlp(some_text))
        # Tokenize only, and drop stop words.
        print([tok.text for tok in nlp.tokenizer(some_text) if not tok.is_stop])
        print('-' * 40)

And this is the output:

text1.txt
Read multiple files.
['Read', 'multiple', 'files', '.']
----------------------------------------
text2.txt
Read it, man.
['Read', ',', 'man', '.']
----------------------------------------

Note, however, that the files themselves are read with plain Python; spaCy only processes the text once it is already in memory.
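If the goal is to stay within spaCy rather than textacy (see the question above), spaCy's own DocBin can serialize a processed corpus and load it back later; a minimal sketch, with placeholder texts and output path:

import spacy
from spacy.tokens import DocBin

nlp = spacy.load('en_core_web_sm')

# Serialize the processed corpus (placeholder texts and path).
doc_bin = DocBin()
for text in ["Read multiple files.", "Read it, man."]:
    doc_bin.add(nlp(text))
doc_bin.to_disk('corpus.spacy')

# Load it back without re-running the pipeline.
loaded = DocBin().from_disk('corpus.spacy')
for doc in loaded.get_docs(nlp.vocab):
    print(doc)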




Answer 2:


So I finally got this working, and it shall be preserved here for posterity.

Start with a generator, here named path_iterator, which I'm currently too afraid to rename for fear of it breaking again:

def path_iterator(paths):
    for p in paths:
        print("yielding")
        # read(25) only takes the first 25 characters, for demonstration; use read() for the whole file
        yield p.open("r").read(25)

Get an iterator, generator, or list of paths:

from pathlib import Path

my_files = Path("/data/train").glob("*.txt")

This gets wrapped in our path_iterator function from above, and passed to nlp.pipe. In goes a generator, out comes a generator. The batch_size=5 matters here; with the default, much larger batch size, nlp.pipe would read a whole batch's worth of files (possibly all of them) before yielding anything:

docs = nlp.pipe(path_iterator(my_files), batch_size=5)

The important part, and reason why we're doing all this, is that until now nothing has happened. We're not waiting for a thousand files to be processed or anything. That happens only on demand, when you start reading from docs:

for d in docs:
    print("A document!")

You will see alternating blocks of five "yielding" lines and five "A document!" lines (five being our batch_size from above). It's an actual pipeline now, and data starts coming out very soon after you start it.
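As an illustration (not from the original answer), instead of just printing, the loop body can do any per-document work while the corpus streams through, for example tallying named entities as each batch arrives:

entity_counts = {}
for d in docs:
    # Each d is a spaCy Doc; tally its named entities as it streams in.
    for ent in d.ents:
        entity_counts[ent.label_] = entity_counts.get(ent.label_, 0) + 1
print(entity_counts)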

And while I'm currently running a version one minor tick too old for this, the coup de grace is multi-processing:

# For those with these new AMD CPUs with hundreds of cores
docs = nlp.pipe(path_iterator(my_files), batch_size=5, n_process=64)
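One more sketch, not part of the original answer: if you need to know which file each Doc came from, nlp.pipe accepts (text, context) pairs with as_tuples=True and yields (doc, context) pairs back out:

def path_and_text(paths):
    for p in paths:
        # Yield (text, context) pairs; the file path rides along as context.
        yield (p.read_text(encoding="utf-8"), str(p))

for doc, source_path in nlp.pipe(path_and_text(my_files), as_tuples=True, batch_size=5):
    print(source_path, len(doc))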


Source: https://stackoverflow.com/questions/52467936/read-corpus-of-text-files-in-spacy
