Question
All the examples that I see for using spacy just read in a single text file (that is small in size). How does one load a corpus of text files into spacy?
I can do this with textacy by pickling all the text in the corpus:
docs = textacy.io.spacy.read_spacy_docs('E:/spacy/DICKENS/dick.pkl', lang='en')
for doc in docs:
    print(doc)
But I am not clear as to how to use this generator object (docs) for further analysis.
Also, I would rather use spacy, not textacy.
spacy also fails to read in a single file that is large (~ 2000000 characters).
Any help is appreciated...
Ravi
Answer 1:
If you can convert your corpus into a DataFrame in which each row corresponds to a document, you can write a function that does what you want and apply it to each row:
df['new_column'] = df['document'].apply(lambda x: your_function(x))
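For illustration, here is a minimal sketch of that approach, assuming pandas is available and the corpus is a folder of plain-text files; the folder path, the column names, and the count_entities helper are only examples, not part of the original answer:
import os
import pandas as pd
import spacy

nlp = spacy.load('en_core_web_lg')
folder = 'path\\to\\the\\corpus_folder'  # hypothetical corpus location

# Build one row per document: filename plus raw text.
rows = []
for entry in os.scandir(folder):
    if entry.is_file():
        with open(entry.path, 'r', encoding='utf-8') as f:
            rows.append({'filename': entry.name, 'document': f.read()})
df = pd.DataFrame(rows)

# Example per-document analysis: count named entities in each text.
def count_entities(text):
    return len(nlp(text).ents)

df['new_column'] = df['document'].apply(lambda x: count_entities(x))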
Or, I am not sure if this is what you want, but you can try something like this:
import spacy
import os

nlp = spacy.load('en_core_web_lg')
docs = 'path\\to\\the\\corpus_folder'

def get_filename(path):
    # Collect the paths of all files directly inside the folder.
    return [i.path for i in os.scandir(path) if i.is_file()]

files = get_filename(docs)
for filepath in files:
    with open(filepath, 'r') as file_to_read:
        some_text = file_to_read.read()
        print(os.path.basename(filepath))
        print(nlp(some_text))
        print([tok.text for tok in nlp.tokenizer(some_text) if not tok.is_stop])
        print('-' * 40)
And this is the output:
text1.txt
Read multiple files.
['Read', 'multiple', 'files', '.']
----------------------------------------
text2.txt
Read it, man.
['Read', ',', 'man', '.']
----------------------------------------
Note, however, that the files themselves are read with Python's built-in open, not with spaCy; spaCy only processes the text once it has been read in.
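If the goal is further analysis rather than just printing, one variation (a sketch reusing files and nlp from the code above; the processed dict and the entity listing are only illustrative) is to keep the Doc objects around, keyed by filename:
# Collect the processed Doc objects for later analysis.
processed = {}
for filepath in files:
    with open(filepath, 'r') as file_to_read:
        processed[os.path.basename(filepath)] = nlp(file_to_read.read())

# Example follow-up analysis: named entities per document.
for name, doc in processed.items():
    print(name, [(ent.text, ent.label_) for ent in doc.ents])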
Answer 2:
So I finally got this working, and it shall be preserved here for posterity.
Start with a generator, here named path_iterator because I'm currently too afraid to change anything for fear of it breaking again:
from pathlib import Path

def path_iterator(paths):
    # Lazily read each file only when the consumer asks for it.
    for p in paths:
        print("yielding")
        yield p.open("r").read(25)  # read(25) takes only the first 25 characters; use read() for the full file
Get an iterator, generator, or list of paths:
my_paths = Path("/data/train").glob("*.txt")
This gets wrapped in our path_iterator function from above and passed to nlp.pipe. In goes a generator, out comes a generator. The batch_size=5 is required here, or it will fall back into the bad habit of first reading all the files:
docs = nlp.pipe(path_iterator(my_paths), batch_size=5)  # nlp is a loaded pipeline, e.g. spacy.load('en_core_web_lg')
The important part, and reason why we're doing all this, is that until now nothing has happened. We're not waiting for a thousand files to be processed or anything. That happens only on demand, when you start reading from docs:
for d in docs:
    print("A document!")
You will see alternating blocks of five (our batch_size, above) "yielding" and "A document!". It's an actual pipeline now, and data starts coming out very soon after starting it.
And while I'm currently running a version one minor tick too old for this, the coup de grâce is multi-processing:
# For those with these new AMD CPUs with hundreds of cores
doc = nlp.pipe(path_iterator(my_paths), batch_size=5, n_process=64)
Source: https://stackoverflow.com/questions/52467936/read-corpus-of-text-files-in-spacy