text-processing

How to use os.walk to only list text files

南楼画角 submitted on 2019-12-23 01:11:46
Question: This question was similar in addressing hidden filetypes. I am struggling with a similar problem because I need to process only files that contain text, in folders that hold many different filetypes: pictures, text, music. I am using os.walk, which lists EVERYTHING, including files without an extension, such as Icon files. I am using Linux and would be satisfied to filter for only txt files. One way is to check the filename extension, and this post explains nicely how it's done. But this still leaves
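The extension check mentioned in the question can be sketched as follows. This is only a minimal illustration of that one approach, not a full answer to whatever the truncated question goes on to ask; the function name `iter_text_files` and its `extensions` parameter are hypothetical names chosen for this example.

```python
import os


def iter_text_files(root, extensions=(".txt",)):
    """Yield paths under `root` whose names end with one of `extensions`.

    `extensions` must be a tuple because str.endswith accepts a tuple of suffixes.
    """
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            # Compare case-insensitively so "NOTES.TXT" is matched as well.
            # Files without an extension (such as "Icon") never match.
            if name.lower().endswith(extensions):
                yield os.path.join(dirpath, name)


if __name__ == "__main__":
    for path in iter_text_files("."):
        print(path)
```

A filter by extension says nothing about the actual content of a file, so a text file without a `.txt` suffix is skipped by this sketch.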

How to use os.walk to only list text files

≡放荡痞女 submitted on 2019-12-23 01:11:42
Question: This question was similar in addressing hidden filetypes. I am struggling with a similar problem because I need to process only files that contain text, in folders that hold many different filetypes: pictures, text, music. I am using os.walk, which lists EVERYTHING, including files without an extension, such as Icon files. I am using Linux and would be satisfied to filter for only txt files. One way is to check the filename extension, and this post explains nicely how it's done. But this still leaves

How to extract a single function from a source file

こ雲淡風輕ζ submitted on 2019-12-22 08:48:07
Question: I'm working on a small academic research project on extremely long and complicated functions in the Linux kernel. I'm trying to figure out if there is a good reason to write functions that are 600 or 800 lines long. For that purpose, I would like to find a tool that can extract a function from a .c file, so I can run some automated tests on the function. For example, if I have the function cifs_parse_mount_options() within the file connect.c , I'm seeking a solution that would roughly work like: extract

Extracting the body text of an HTML document using PHP

∥☆過路亽.° submitted on 2019-12-22 08:34:05
Question: I know it's better to use DOM for this purpose, but let's try to extract the text in this way: <?php $html=<<<EOD <html> <head> </head> <body> <p>Some text</p> </body> </html> EOD; preg_match('/<body.*?>/', $html, $matches, PREG_OFFSET_CAPTURE); if (empty($matches)) exit; $matched_body_start_tag = $matches[0][0]; $index_of_body_start_tag = $matches[0][1]; $index_of_body_end_tag = strpos($html, '</body>'); $body = substr( $html, $index_of_body_start_tag + strlen($matched_body_start_tag), $index
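The same idea (locate the opening body tag with a regular expression, locate the closing tag, and slice out what lies between) can be sketched in Python. This is a transliteration of the approach in the snippet, not a completion of the truncated PHP; the function name `extract_body` is hypothetical, and, as the question itself notes, a DOM parser is the more robust choice.

```python
import re

# Same opening-tag pattern as in the snippet; it is deliberately simple and
# would also match tags that merely start with "<body".
BODY_OPEN = re.compile(r"<body.*?>", re.IGNORECASE | re.DOTALL)
BODY_CLOSE = re.compile(r"</body\s*>", re.IGNORECASE)


def extract_body(html):
    """Return the text between the opening and closing body tags, or None."""
    start = BODY_OPEN.search(html)
    if start is None:
        return None
    # Search for the closing tag only after the opening tag ends.
    end = BODY_CLOSE.search(html, start.end())
    if end is None:
        return None
    return html[start.end():end.start()]


if __name__ == "__main__":
    sample = "<html> <head> </head> <body> <p>Some text</p> </body> </html>"
    print(extract_body(sample).strip())
```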

Parse string into a tree structure?

假装没事ソ submitted on 2019-12-22 05:22:03
Question: I'm trying to figure out how to parse a string in this format into a tree-like data structure of arbitrary depth. "{{Hello big|Hi|Hey} {world|earth}|{Goodbye|farewell} {planet|rock|globe{.|!}}}" [[["Hello big" "Hi" "Hey"] ["world" "earth"]] [["Goodbye" "farewell"] ["planet" "rock" "globe" ["." "!"]]]] I've tried playing with some regular expressions for this (such as #"{([^{}]*)}" ), but everything I've tried seems to "flatten" the tree into a big list of lists. I could be approaching this
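Because the nesting depth is arbitrary, a small recursive-descent parser is a natural alternative to regular expressions. The sketch below is written in Python rather than the question's own language and encodes one interpretation of the example: an alternative made up only of nested groups is kept as its own list, while an alternative that contains plain text is spliced into the parent. That grouping rule is inferred from the single example in the question and is an assumption; `parse_tree`, `parse_group`, and `combine` are hypothetical names.

```python
def combine(alternatives):
    """Merge the '|'-separated alternatives of one group into a single list."""
    result = []
    for alt in alternatives:
        if alt and all(isinstance(item, list) for item in alt):
            result.append(alt)   # only nested groups: keep as one sub-list
        else:
            result.extend(alt)   # contains text: splice items into the parent
    return result


def parse_group(s, pos):
    """Parse one '{...}' group starting at s[pos]; return (tree, next_pos)."""
    if pos >= len(s) or s[pos] != "{":
        raise ValueError("expected '{' at position %d" % pos)
    pos += 1
    alternatives = [[]]
    text = []

    def flush():
        # Move any pending text (minus surrounding whitespace) into the tree.
        chunk = "".join(text).strip()
        if chunk:
            alternatives[-1].append(chunk)
        text.clear()

    while pos < len(s):
        ch = s[pos]
        if ch == "{":
            flush()
            child, pos = parse_group(s, pos)  # recurse into the nested group
            alternatives[-1].append(child)
        elif ch == "|":
            flush()
            alternatives.append([])
            pos += 1
        elif ch == "}":
            flush()
            return combine(alternatives), pos + 1
        else:
            text.append(ch)
            pos += 1
    raise ValueError("unbalanced braces")


def parse_tree(s):
    s = s.strip()
    tree, end = parse_group(s, 0)
    if end != len(s):
        raise ValueError("unexpected text after the closing brace")
    return tree
```

The recursion mirrors the nesting of the braces, which is what keeps the structure from being flattened into a single list of lists.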

Parse string into a tree structure?

て烟熏妆下的殇ゞ submitted on 2019-12-22 05:21:08
Question: I'm trying to figure out how to parse a string in this format into a tree-like data structure of arbitrary depth. "{{Hello big|Hi|Hey} {world|earth}|{Goodbye|farewell} {planet|rock|globe{.|!}}}" [[["Hello big" "Hi" "Hey"] ["world" "earth"]] [["Goodbye" "farewell"] ["planet" "rock" "globe" ["." "!"]]]] I've tried playing with some regular expressions for this (such as #"{([^{}]*)}" ), but everything I've tried seems to "flatten" the tree into a big list of lists. I could be approaching this

How do I join pairs of consecutive lines in a large file (1 million lines) using vim, sed, or another similar tool?

橙三吉。 submitted on 2019-12-22 03:18:05
Question: I need to move the contents of every second line up to the line above, such that line2's data is alongside line1's; either comma or space separated works.
Input:
line1
line2
line3
line4
Output:
line1 line2
line3 line4
I've been doing it in vim with a simple recording, but vim seems to crash when I tell it to do it 100 000 times... I'm thinking maybe sed would be a good alternative, but I'm not sure how to do what I want, or maybe there's a better option? Each line only contains 1 numerical value, I
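As one "other similar tool", a short streaming script can join the pairs without loading the whole file into memory. This is a minimal sketch; the function name `join_pairs` and the file names in the usage example are hypothetical.

```python
def join_pairs(lines, sep=" "):
    """Yield each pair of consecutive lines joined by `sep`.

    Works on any iterable of lines (for example an open file), so a
    million-line file is processed one pair at a time. A trailing unpaired
    line is yielded unchanged.
    """
    it = iter(lines)
    for first in it:
        second = next(it, None)  # None when the input has an odd line count
        first = first.rstrip("\n")
        if second is None:
            yield first
        else:
            yield first + sep + second.rstrip("\n")


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    with open("input.txt") as src, open("output.txt", "w") as dst:
        for joined in join_pairs(src, sep=" "):
            dst.write(joined + "\n")
```

Passing `sep=","` gives the comma-separated variant mentioned in the question.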

Unindent or linearize XML

╄→гoц情女王★ submitted on 2019-12-22 01:20:54
Question: I'm looking for a fast way to linearize an XML file in Java. I'm using a ~2GB file, so DOM is excluded. The Java target is 1.5.0.22. I have to generate from the XML a file composed of lines of 80 bytes + newline. I have to write this into a DB2 table that will be read by a COBOL program. In COBOL the size is important because the data are read as CHAR from the table; this implies that an empty row is 80 spaces. I read the file byte by byte (I must), but I can use an internal temporary buffer to store the probable sequence to

Parse log files programmatically in .NET

醉酒当歌 submitted on 2019-12-22 01:02:13
Question: We have a large number (read: 50,000) of relatively small (read: under 500K, typically under 50K) log files created using log4net from our client application. A typical log looks like: Start Painless log Framework:8.1.7.0 Application:8.1.7.0 2010-05-05 19:26:07,678 [Login ] INFO Application.App.OnShowLoginMessage(194) - Validating Credentials... 2010-05-05 19:26:08,686 [1 ] INFO Application.App.OnShowLoginMessage(194) - Checking for Application Updates... 2010-05-05 19:26:08,830 [1 ] INFO
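Entries shaped like the ones in the sample can be split into fields with a single regular expression. The sketch below uses Python rather than .NET and assumes the layout is exactly timestamp, bracketed thread, level, logger with a line number in parentheses, then " - " and the message, as seen in the snippet; the real log4net pattern of the application is not shown, so the expression and the name `parse_line` are assumptions.

```python
import re
from datetime import datetime

# Field layout inferred from the sample lines in the question.
LINE_RE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"\[(?P<thread>[^\]]*)\]\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"(?P<logger>\S+?)\((?P<line>\d+)\)\s+-\s+"
    r"(?P<message>.*)$"
)


def parse_line(line):
    """Return a dict of fields for one log entry, or None for other lines."""
    m = LINE_RE.match(line.rstrip("\n"))
    if m is None:
        return None  # header lines such as "Start Painless log" land here
    entry = m.groupdict()
    # "%f" reads the three millisecond digits as a fraction of a second.
    entry["timestamp"] = datetime.strptime(entry["timestamp"], "%Y-%m-%d %H:%M:%S,%f")
    entry["thread"] = entry["thread"].strip()
    entry["line"] = int(entry["line"])
    return entry
```

Returning None for non-matching lines lets a caller skip the header block or attach continuation lines to the previous entry, whichever suits the analysis.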

Count word frequencies in list-of-lists-of-words

与世无争的帅哥 submitted on 2019-12-21 12:41:23
Question: I have this large corpus data in a dataframe res (dataframe) text.1 1 <NA> 2 beren stuart vanuatu monday october venkatesh ramesh sandeep talanki nagaraj subject approve qlikview gpa access process form gpa access email requestor line manager access granted raj add user qlikview workgroup gpa access form requestors lim tek kon vanuatu address lini high port vila efate title relationship manager emerging corporates employee id lan id limtk bsbcc authorising manager beren stuart vanuatu read gpa
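The counting step named in the title can be sketched in Python, assuming the data has already been split into a list of word lists and that missing rows (the <NA> entry above) are represented as None. The question itself appears to work with an R dataframe, so this only illustrates the idea; `word_frequencies` is a hypothetical name.

```python
from collections import Counter
from itertools import chain


def word_frequencies(documents):
    """Count words across a list of word lists, skipping missing (None) rows."""
    present = (doc for doc in documents if doc is not None)
    # chain.from_iterable flattens the lists lazily, one word at a time.
    return Counter(chain.from_iterable(present))


if __name__ == "__main__":
    docs = [
        None,  # stands in for a missing row
        "gpa access process form gpa access email".split(),
        "gpa access form".split(),
    ]
    for word, count in word_frequencies(docs).most_common(3):
        print(word, count)
```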