text-parsing

How to get the first column of every line from a CSV file?

泪湿孤枕 submitted on 2019-11-28 21:07:41
How do I get the first column of every line in an input CSV file and output it to a new file? I am thinking of using awk but am not sure how.
Try this: awk -F"," '{print $1}' data.txt It splits each input line of the file data.txt into fields on the , character (as specified with -F) and prints the first field (column) to stdout.
Can be done: $ cut -d, -f1 data.txt
echo "a,b,c" | cut -d',' -f1 > newFile
Input: a,12,34 b,23,56 Code: awk -F "," '{print $1}' Input Format: awk -F <delimiter> '{print $<column_number>}' Input
This can be achieved using grep: $ grep -o '^[^,]\+' file.csv
Using
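The awk, cut and grep answers above all split on every comma, so a quoted field such as "x,y" would be cut in half. As a minimal sketch in Python (not one of the original answers), the standard csv module can extract the first column while respecting quotes. The function name first_column and the sample rows are made up for illustration.

```python
import csv
import io


def first_column(lines):
    """Yield the first field of each CSV record, skipping empty rows."""
    for row in csv.reader(lines):
        if row:
            yield row[0]


# Demo on in-memory text; for a real file, pass open(path, newline="") instead.
sample = io.StringIO('a,12,34\nb,23,56\n"x,y",1,2\n')
print(list(first_column(sample)))  # prints ['a', 'b', 'x,y']
```

To write the result to a new file, each yielded value could be written out line by line, which mirrors the `> newFile` redirection in the shell answers.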

Create Great Parser - Extract Relevant Text From HTML/Blogs

随声附和 submitted on 2019-11-28 17:08:02
Question: I'm trying to create a generalized HTML parser that works well on blog posts. I want to point my parser at a specific entry's URL and get back the clean text of the post itself. My basic approach (in Python) has been to use a combination of BeautifulSoup and urllib2, which is okay, but it assumes you know the proper tags for the blog entry. Does anyone have any better ideas? Here are some thoughts that maybe someone could expand upon, which I don't yet have enough knowledge to implement.

What is CoNLL data format?

独自空忆成欢 submitted on 2019-11-28 16:19:38
I am new to text mining. I am using an open-source jar (Mate Parser) which gives me output in CoNLL 2009 format after dependency parsing. I want to use the dependency parsing results for information extraction. I can understand some of the output, but I cannot make sense of the CoNLL data format as a whole. Can anyone help me understand the CoNLL data format? Any pointers would be appreciated.
There are many different CoNLL formats, since CoNLL is a different shared task each year. The format for CoNLL 2009 is described here. Each line represents a single word with a
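A rough sketch of reading such output in Python follows. It assumes the layout as I understand it from the CoNLL-2009 shared task description: one token per line, tab-separated fields, a blank line between sentences, and the column names listed in the code. The helper name read_sentences and the two sample rows are hypothetical and are not real Mate Parser output.

```python
# Column names assumed from the CoNLL-2009 shared task description.
CONLL09_COLUMNS = [
    "ID", "FORM", "LEMMA", "PLEMMA", "POS", "PPOS", "FEAT", "PFEAT",
    "HEAD", "PHEAD", "DEPREL", "PDEPREL", "FILLPRED", "PRED",
]


def read_sentences(lines):
    """Yield each sentence as a list of token dicts (one dict per line)."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():          # a blank line ends the current sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        fields = line.split("\t")
        token = dict(zip(CONLL09_COLUMNS, fields))
        # Any remaining columns hold one argument label per predicate.
        token["APREDS"] = fields[len(CONLL09_COLUMNS):]
        sentence.append(token)
    if sentence:
        yield sentence


# Hypothetical two-token sentence, for illustration only.
sample = (
    "1\tDogs\tdog\tdog\tNNS\tNNS\t_\t_\t2\t2\tSBJ\tSBJ\t_\t_\tA0\n"
    "2\tbark\tbark\tbark\tVBP\tVBP\t_\t_\t0\t0\tROOT\tROOT\tY\tbark.01\t_\n"
    "\n"
)
for sent in read_sentences(sample.splitlines()):
    for tok in sent:
        print(tok["ID"], tok["FORM"], "head:", tok["HEAD"], "rel:", tok["DEPREL"])
```

For information extraction, the HEAD and DEPREL columns are usually the starting point: HEAD gives the ID of the governing token (0 for the root) and DEPREL names the relation.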

Best way to get all digits from a string [duplicate]

依然范特西╮ submitted on 2019-11-28 08:59:59
This question already has an answer here: return only Digits 0-9 from a String (7 answers).
Is there any better way to take a string such as "(123) 455-2344" and get "1234552344" from it than doing this: var matches = Regex.Matches(input, @"[0-9]+", RegexOptions.Compiled); return String.Join(string.Empty, matches.Cast<Match>() .Select(x => x.Value).ToArray()); Perhaps a regex pattern that can do it in a single match? I couldn't seem to create one to achieve that though.
Do you need to use a Regex? return new String(input.Where(Char.IsDigit).ToArray());
Have you got something against Replace ?
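The Replace hint amounts to deleting every non-digit in one substitution, rather than collecting matches and joining them. Sketched here in Python rather than C#, with a made-up function name digits_only:

```python
import re


def digits_only(text):
    """Return only the ASCII digits 0-9 from text, in their original order."""
    return re.sub(r"[^0-9]", "", text)


print(digits_only("(123) 455-2344"))  # prints 1234552344
```

The character class is written as [^0-9] rather than \D so that only ASCII digits survive, which matches the question's "digits 0-9" wording.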

Tips for reading in a complex file - Python

若如初见. submitted on 2019-11-28 02:29:59
I have complex, variable text files that I want to read into Python, but I'm not sure what the best strategy would be. I'm not looking for you to code anything for me, just some tips about what modules would best suit my needs/tips etc. The files look something like: Program Username: X Laser: X Em: X exp 1 sample 1 Time: X Notes: X Read 1 X data Read 2 X data # unknown number of reads sample 2 Time: X Notes: X Read 1 X data ... # Unknown number of samples exp 2 sample 1 ... # Unknown number of experiments, samples and reads # The 4 spaces between certain words represent tabs To analyse this

Extracting “((Adj|Noun)+|((Adj|Noun)(Noun-Prep)?)(Adj|Noun))Noun” from Text (Justeson & Katz, 1995)

两盒软妹~` submitted on 2019-11-28 01:53:57
Question: I would like to ask whether it is possible to extract ((Adj|Noun)+|((Adj|Noun)(Noun-Prep)?)(Adj|Noun))Noun, as proposed by Justeson and Katz (1995), using the R package openNLP. That is, I would like to use this linguistic filter to extract candidate noun phrases. I do not fully understand its meaning. Could you explain it, or translate this representation into R? Many thanks. Maybe we can start the sample code from: library("openNLP") acq <- "This paper describes a novel optical

How to extract polynomial coefficients in Java?

南笙酒味 submitted on 2019-11-28 01:38:49
Taking the string -2x^2+3x^1+6 as an example, how do I extract -2, 3 and 6 from this equation stored in the string?
Not giving the exact answer but some hints: Use the replace method: replace all - with +- . Use the split method:
// after replace effect
String str = "+-2x^2+3x^1+6";
// split takes a regex, so the + must be escaped
String[] arr = str.split("\\+");
// arr will contain: {"", "-2x^2", "3x^1", "6"} (the leading empty string is there because str starts with +)
Now, each element can be split individually:
String str2 = arr[1]; // str2 = -2x^2
// split with x and get the value at index 0
String polynomial= "-2x^2+3x^1+6"; String[] parts = polynomial.split("x\\^\\d+\\+?"); for (String part : parts) { System
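As an alternative to the replace-and-split hints, a single regular expression can walk the terms directly. The sketch below is in Python rather than Java and assumes every term carries an explicit integer coefficient, as in the example string. A bare x^2 or a fractional coefficient would need a richer pattern. The names TERM and coefficients are illustrative.

```python
import re

# One term: signed integer coefficient, then an optional x with an optional ^exponent.
TERM = re.compile(r"([+-]?\d+)(x(?:\^(\d+))?)?")


def coefficients(poly):
    """Map exponent -> coefficient for a polynomial like '-2x^2+3x^1+6'."""
    result = {}
    for m in TERM.finditer(poly.replace(" ", "")):
        coef = int(m.group(1))
        if m.group(2) is None:        # no x: constant term
            exp = 0
        elif m.group(3) is None:      # x without ^n: exponent 1
            exp = 1
        else:
            exp = int(m.group(3))
        result[exp] = result.get(exp, 0) + coef
    return result


print(coefficients("-2x^2+3x^1+6"))  # prints {2: -2, 1: 3, 0: 6}
```

Keying the result by exponent, rather than returning a flat list, keeps the coefficients unambiguous when terms are missing or appear out of order.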

How to parse text into sentences

我的梦境 submitted on 2019-11-27 23:12:43
I'm trying to break up a paragraph into sentences. Here is my code so far: import java.util.*; public class StringSplit { public static void main(String args[]) throws Exception{ String testString = "The outcome of the negotiations is vital, because the current tax levels signed into law by President George W. Bush expire on Dec. 31. Unless Congress acts, tax rates on virtually all Americans who pay income taxes will rise on Jan. 1. That could affect economic growth and even holiday sales."; String[] sentences = testString.split("[\\.\\!\\?]"); for (int i=0;i<sentences.length;i++){ System.out
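Splitting on every ., ! or ? breaks the test string at "W.", "Dec." and "Jan.", none of which ends a sentence. One possible heuristic, sketched in Python rather than Java, is to split only where the punctuation is followed by whitespace and a capital letter, and to skip the split when the preceding word is a known abbreviation or a single initial. The abbreviation list here is a small made-up sample and the function name is hypothetical. A production system would more likely rely on a trained sentence tokenizer.

```python
import re

# Hypothetical, deliberately tiny abbreviation list for illustration.
ABBREVIATIONS = {"Dec.", "Jan.", "Mr.", "Dr."}


def split_sentences(text):
    """Split text into sentences using a punctuation-plus-capital heuristic."""
    sentences, start = [], 0
    for m in re.finditer(r"[.!?]\s+(?=[A-Z])", text):
        candidate = text[start:m.start() + 1]
        last_word = candidate.split()[-1]
        # Do not split after abbreviations or single-letter initials such as "W."
        if last_word in ABBREVIATIONS or re.fullmatch(r"[A-Z]\.", last_word):
            continue
        sentences.append(candidate.strip())
        start = m.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


test_string = (
    "The outcome of the negotiations is vital, because the current tax levels "
    "signed into law by President George W. Bush expire on Dec. 31. Unless "
    "Congress acts, tax rates on virtually all Americans who pay income taxes "
    "will rise on Jan. 1. That could affect economic growth and even holiday sales."
)
for sentence in split_sentences(test_string):
    print(sentence)
```

On the question's test string this yields three sentences. The heuristic still fails when an abbreviation really does end a sentence, which is one reason trained tokenizers exist.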

Splitting large text file by a delimiter in Python

纵然是瞬间 submitted on 2019-11-27 21:36:38
I imagine this is going to be a simple task, but I can't find exactly what I am looking for in previous StackOverflow questions, so here goes... I have large text files in a proprietary format that look something like this: :Entry - Name John Doe - Date 20/12/1979 :Entry -Name Jane Doe - Date 21/12/1979 And so forth. The text files range in size from 10kb to 100mb. I need to split this file by the :Entry delimiter. How could I process each file based on :Entry blocks?
You could use itertools.groupby to group lines that occur after :Entry into lists: import itertools as it filename='test.dat' with
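The itertools.groupby answer is cut off above, so the following is a different and simpler sketch rather than a reconstruction of it. It is a plain generator that streams the file line by line and yields one list of lines per :Entry block, so even a 100 MB file is never held in memory at once. It assumes each :Entry marker and each field sits on its own line (the sample in the question has lost its line breaks), and the name iter_entries is illustrative.

```python
def iter_entries(lines, marker=":Entry"):
    """Yield one list of lines per block; lines before the first marker are ignored."""
    block = None
    for line in lines:
        line = line.rstrip("\n")
        if line.strip() == marker:
            if block is not None:
                yield block
            block = []                # start collecting a new block
        elif block is not None:
            block.append(line)
    if block is not None:
        yield block


# Sample data from the question, with the assumed line breaks restored.
sample = [
    ":Entry\n", "- Name John Doe\n", "- Date 20/12/1979\n",
    ":Entry\n", "-Name Jane Doe\n", "- Date 21/12/1979\n",
]
for entry in iter_entries(sample):
    print(entry)
```

For a real file, passing the open file object (for example `with open(filename) as f: for entry in iter_entries(f): ...`) keeps the processing lazy.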

How should I detect which delimiter is used in a text file?

。_饼干妹妹 submitted on 2019-11-27 20:07:55
I need to be able to parse both CSV and TSV files. I can't rely on the users to know the difference, so I would like to avoid asking the user to select the type. Is there a simple way to detect which delimiter is in use? One way would be to read in every line and count both tabs and commas and find out which is most consistently used in every line. Of course, the data could include commas or tabs, so that may be easier said than done. Edit: Another fun aspect of this project is that I will also need to detect the schema of the file when I read it in because it could be one of many. This means
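The counting idea described in the question can be sketched as follows: for each candidate delimiter, count its occurrences per line and prefer the candidate whose nonzero count is shared by the most lines. This is only a heuristic under the stated assumptions. It ignores quoting, so a comma inside a quoted TSV field still counts as a comma, and the function name guess_delimiter is made up.

```python
from collections import Counter


def guess_delimiter(lines, candidates=(",", "\t")):
    """Return the candidate whose per-line count is most consistent, or None."""
    rows = [ln for ln in lines if ln.strip()]
    best, best_score = None, 0
    for delim in candidates:
        counts = Counter(ln.count(delim) for ln in rows)
        counts.pop(0, None)           # lines without the delimiter give no support
        if not counts:
            continue
        # Score: how many lines share the most common nonzero count.
        score = counts.most_common(1)[0][1]
        if score > best_score:
            best, best_score = delim, score
    return best


tsv_lines = ["name\tnote, free text\tage", "ann\thello\t30", "bob\thi\t25"]
print(repr(guess_delimiter(tsv_lines)))  # prints '\t'
```

Python's standard library offers a comparable heuristic in csv.Sniffer, whose sniff method accepts a sample string and an optional set of candidate delimiters.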