python-textprocessing

Can I use TfidfVectorizer in scikit-learn for a non-English language? Also, how do I read non-English text in Python?

Deadly submitted on 2021-01-29 05:22:57
Question: I have to read a text document that contains both English and non-English (specifically Malayalam) text in Python. Here is what I see:

>>> text_english = 'Today is a good day'
>>> text_non_english = 'ആരാണു സന്തോഷമാഗ്രഹിക്കാത്തത'

Extracting the first letter of the English string works:

>>> print(text_english[0])
'T'

but running the same on the non-English string gives garbage:

>>> print(text_non_english[0])
�

To get the first letter, I have to write the following:

>>> print(text_non_english[0:3])
ആ

Why does this happen? My aim is to extract the…
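The symptom (one character spanning indexes 0:3) means the string is being handled as UTF-8 bytes rather than decoded text: Malayalam code points occupy 3 bytes each in UTF-8, which is typical of Python 2 byte strings or of a file read in binary mode. A minimal sketch of the fix, assuming Python 3 and a UTF-8 encoded file (the file name is a placeholder); TfidfVectorizer itself is language-agnostic, since its default token pattern is Unicode-aware:

    # Decode bytes to str so indexing works per character, not per byte.
    with open('document.txt', encoding='utf-8') as f:  # 'document.txt' is a placeholder
        text = f.read()

    text_non_english = 'ആരാണു സന്തോഷമാഗ്രഹിക്കാത്തത'
    print(text_non_english[0])   # ആ -- a Python 3 str indexes code points

    # Starting from raw bytes, the same character really is 3 bytes wide:
    raw = text_non_english.encode('utf-8')
    print(len(raw[0:3]), raw[0:3].decode('utf-8'))  # 3 ആ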

scikit-learn implementation of tf-idf differs from manual implementation

為{幸葍}努か submitted on 2020-01-23 01:21:09
Question: I tried to calculate tf-idf values manually using the formula, but the result I got differs from what the scikit-learn implementation returns.

from sklearn.feature_extraction.text import TfidfVectorizer

tv = TfidfVectorizer()
a = "cat hat bat splat cat bat hat mat cat"
b = "cat mat cat sat"
tv.fit_transform([a, b]).toarray()
# array([[0.53333448, 0.56920781, 0.53333448, 0.18973594, 0.        , 0.26666724],
#        [0.        , 0.75726441, 0.        , 0.37863221, 0.53215436, 0.        ]])
tv.get_feature_names…
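The discrepancy almost always comes from scikit-learn's defaults: with smooth_idf=True it uses idf(t) = ln((1+n)/(1+df(t))) + 1 rather than the textbook ln(n/df(t)), and it L2-normalizes each row. A sketch that reproduces the array above from raw counts, using only standard NumPy/scikit-learn calls:

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["cat hat bat splat cat bat hat mat cat", "cat mat cat sat"]
    tf = CountVectorizer().fit_transform(docs).toarray().astype(float)

    n_docs = tf.shape[0]
    df = (tf > 0).sum(axis=0)                  # document frequency per term
    idf = np.log((1 + n_docs) / (1 + df)) + 1  # smooth_idf=True default
    tfidf = tf * idf
    tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)  # norm='l2' default
    print(tfidf)  # matches TfidfVectorizer's output above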

Word tokenizing from a list of words in Python?

天涯浪子 submitted on 2020-01-15 09:36:30
Question: My program has a list of words, and among them I need a few specific phrases to be tokenized as one word. My program splits a string into words, e.g.:

str = "hello my name is vishal, can you please help me with the red blood cells and platelet count. The white blood cell is a single word."

The output will be:

list = ['hello', 'my', 'name', 'is', 'vishal', 'can', 'you', 'please', 'help', 'me', 'with', 'the', 'red', 'blood', 'cells', 'and', 'platelet', 'count', 'the', 'white', 'blood', 'cell', 'is', 'a', 'single', 'word']

Now I…
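One way to keep multi-word terms together is NLTK's MWETokenizer, which re-merges listed n-grams after ordinary tokenization. A sketch assuming NLTK is installed (and its punkt data downloaded) and the phrase list is known up front:

    from nltk.tokenize import MWETokenizer, word_tokenize

    # nltk.download('punkt') may be needed once before word_tokenize works
    phrases = [('red', 'blood', 'cells'), ('white', 'blood', 'cell')]
    mwe = MWETokenizer(phrases, separator=' ')

    text = ("hello my name is vishal, can you please help me with the "
            "red blood cells and platelet count. The white blood cell is a single word.")
    tokens = mwe.tokenize(word_tokenize(text.lower()))
    # [..., 'with', 'the', 'red blood cells', 'and', ..., 'white blood cell', ...]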

NLTK Stanford POS tagger error: Java command failed

北战南征 submitted on 2019-12-30 04:22:08
Question: I'm trying to use the nltk.tag.stanford module for tagging a sentence (much like the wiki's example), but I keep getting the following error:

Traceback (most recent call last):
  File "test.py", line 28, in <module>
    print st.tag(word_tokenize('What is the airspeed of an unladen swallow ?'))
  File "/usr/local/lib/python2.7/dist-packages/nltk/tag/stanford.py", line 59, in tag
    return self.tag_sents([tokens])[0]
  File "/usr/local/lib/python2.7/dist-packages/nltk/tag/stanford.py", line 81, in tag_sents…
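"Java command failed" from this module usually means the JVM could not be launched as configured; common fixes are pointing NLTK at the right Java binary and giving the tagger more heap than the default. A sketch against the newer StanfordPOSTagger API, where every path is a placeholder for your own Java and Stanford tagger installation:

    import os
    from nltk.tag import StanfordPOSTagger
    from nltk.tokenize import word_tokenize

    os.environ['JAVAHOME'] = '/usr/bin/java'  # placeholder: your Java executable

    st = StanfordPOSTagger(
        'english-bidirectional-distsim.tagger',  # placeholder model path
        'stanford-postagger.jar',                # placeholder jar path
        java_options='-mx1024m')                 # larger JVM heap than the default

    print(st.tag(word_tokenize('What is the airspeed of an unladen swallow ?')))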

How to read a large number of text files from a directory using Python

假如想象 submitted on 2019-12-23 06:07:22
Question: I'm working on a project using Python (3.6) and Django (2) in which I need to read all the text files from a directory one by one. I have written the code, but it reads only 28 files from a folder that currently holds 30 text files (for testing purposes) and then returns an error. From views.py:

def get_txt_files(base_dir):
    for entry in os.scandir(base_dir):
        if entry.is_file() and entry.name.endswith(".txt"):
            # print(entry.path)
            yield entry.path, entry.name
        elif entry.is_dir():
            yield from get_txt…
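The usual culprit for "reads 28 of 30 files and then errors" is a file whose bytes are not valid in the default encoding, so the loop dies inside open()/read(). A sketch of the completed generator plus a tolerant read; the directory name is a placeholder, and the recursive call is my completion of the truncated snippet above:

    import os

    def get_txt_files(base_dir):
        # recursively yield (path, name) for every .txt file under base_dir
        for entry in os.scandir(base_dir):
            if entry.is_file() and entry.name.endswith(".txt"):
                yield entry.path, entry.name
            elif entry.is_dir():
                yield from get_txt_files(entry.path)

    for path, name in get_txt_files("documents"):  # "documents" is a placeholder
        # errors='replace' keeps one badly encoded file from aborting the whole loop
        with open(path, encoding="utf-8", errors="replace") as f:
            text = f.read()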

How to get subsets of a list from a list of indexes in Python

删除回忆录丶 submitted on 2019-12-11 12:53:37
Question: I have a list of strings, and I need to take subsets from it based on an index list, in a generic manner:

indx = [0, 5, 7]  # index list
a = ['a', 'b', 3, 4, 'd', 6, 7, 8]

I need the subsets generically: a[0:5] in the first iteration, a[5:7] in the second, a[7:] in the third. The code I have tried:

for i in indx:
    if len(indx) == i:
        print(a[i:])
    else:
        print(a[i:i+1])

Expected output:

a[0:5] == 'a', 'b', 3, 4, 'd'
a[5:7] == 6, 7
a[7:] == 8

Answer 1: You can try this:

indx.append(len(a))
print(*[a[i:j] for i, j in…
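The answer's idea generalizes cleanly: append len(a) as a final boundary, then zip adjacent boundaries into slice pairs. A completed version consistent with the truncated answer:

    indx = [0, 5, 7]
    a = ['a', 'b', 3, 4, 'd', 6, 7, 8]

    bounds = indx + [len(a)]                     # [0, 5, 7, 8]
    chunks = [a[i:j] for i, j in zip(bounds, bounds[1:])]
    print(chunks)  # [['a', 'b', 3, 4, 'd'], [6, 7], [8]]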

Add my own text inside nested braces

好久不见. submitted on 2019-12-11 05:28:51
Question: I have this source text, which contains HTML tags and PHP code at the same time:

<html>
<head>
<title><?php echo "title here"; ?></title>
</head>
<body>
<h1 <?php echo "class='big'" ?>>foo</h1>
</body>
</html>

and I need to place my own text (for example, MY_TEXT) right after the opening tag to get this result:

<html>
<head>
<title><?php echo "title here"; ?></title>
</head>
<body>
<h1 <?php echo "class='big'" ?>>MY_TEXTfoo</h1>
</body>
</html>

Thus I need to account for nested braces; if I use a regex it…
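A plain regex over the raw source trips on the angle brackets inside the PHP blocks, so one workable sketch masks each <?php ... ?> block with a placeholder, edits the now-plain HTML, and restores the blocks. MY_TEXT and the h1 target come from the question; the masking scheme is my own assumption:

    import re

    html = '''<h1 <?php echo "class='big'" ?>>foo</h1>'''

    # 1) mask PHP blocks so their contents cannot confuse the tag regex
    php_blocks = []
    def mask(match):
        php_blocks.append(match.group(0))
        return '\x00%d\x00' % (len(php_blocks) - 1)

    masked = re.sub(r'<\?php.*?\?>', mask, html, flags=re.S)

    # 2) insert MY_TEXT right after the opening <h1 ...> tag
    masked = re.sub(r'(<h1\b[^>]*>)', r'\1MY_TEXT', masked, count=1)

    # 3) restore the PHP blocks
    result = re.sub(r'\x00(\d+)\x00', lambda m: php_blocks[int(m.group(1))], masked)
    print(result)  # <h1 <?php echo "class='big'" ?>>MY_TEXTfoo</h1>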

Breaking a string into multiple lines according to character width (Python)

瘦欲@ submitted on 2019-12-02 08:53:45
Question: I am drawing text atop a base image via PIL. One of the requirements is for the text to overflow to the next line(s) if the combined width of all characters exceeds the width of the base image. Currently I'm using textwrap.wrap(text, width=16) to accomplish this, where width defines the number of characters to accommodate in one line. The text can be anything, since it's user generated, so the problem is that hard-coding width won't take into account width variability due to font type, font size…
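Instead of a fixed character count, you can measure the rendered width of each candidate line with the font itself. A greedy sketch assuming Pillow >= 8.0, where FreeTypeFont.getlength is available; the font file and size are placeholders:

    from PIL import ImageFont

    def wrap_by_pixels(text, font, max_width):
        # greedy word wrap driven by rendered pixel width, not character count
        lines, current = [], ''
        for word in text.split():
            candidate = (current + ' ' + word).strip()
            if font.getlength(candidate) <= max_width:
                current = candidate
            else:
                if current:
                    lines.append(current)
                current = word
        if current:
            lines.append(current)
        return lines

    font = ImageFont.truetype('DejaVuSans.ttf', 24)  # placeholder font file and size
    print(wrap_by_pixels('some long user generated caption text', font, max_width=300))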

How to count the number of words in a sentence, ignoring numbers, punctuation and whitespace?

ぃ、小莉子 submitted on 2019-11-27 11:45:45
Question: How would I go about counting the words in a sentence? I'm using Python. For example, I might have the string:

string = "I am having a very nice 23!@$ day. "

That would be 7 words. I'm having trouble with the random number of spaces before/after each word, as well as with numbers and symbols.

Answer: str.split() without any arguments splits on runs of whitespace characters:

>>> s = 'I am having a very nice day.'
>>> len(s.split())
7

From the linked documentation: if sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded…
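Note that str.split() alone still counts "23!@$" as a word (8 tokens for the original string), so to match the expected 7 you can count only alphabetic runs. A small sketch; the regex is one reasonable choice, not the only one:

    import re

    s = "I am having a very nice 23!@$ day. "
    print(len(s.split()))  # 8 -- '23!@$' is counted as a word

    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", s)
    print(len(words))      # 7 -- numbers and punctuation are ignored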
