text-parsing

In Perl, how can I correctly parse tab/space delimited files with quoted strings?

筅森魡賤 submitted on 2019-12-07 08:01:10
Question: I need to parse tab/space delimited files with many columns in Perl. Some of the values are large strings enclosed in double quotes. These strings can contain any characters, including tabs and spaces. When I try to parse the lines with the split function, it splits these strings as well. How can I make Perl understand that a string within " " is a single column entry? A simple example is: 12 345546.67677 "Hello World!!!" -567.55656 0.5465767
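The tokenizing idea behind this question can be sketched as follows. This is an illustration in Python rather than Perl, and it is not taken from any answer to the question. The function name `split_quoted` is hypothetical, and the sketch assumes that quotes always wrap a whole field and that quoted strings contain no escaped double quotes.

```python
import re

# A field is either a double-quoted string (group 1) or a run of
# non-whitespace characters (group 2). Assumption: no escaped quotes.
FIELD = re.compile(r'"([^"]*)"|(\S+)')

def split_quoted(line):
    """Split a tab/space delimited line, keeping quoted strings whole."""
    fields = []
    for match in FIELD.finditer(line):
        quoted, bare = match.group(1), match.group(2)
        # group(1) is None when the unquoted alternative matched
        fields.append(quoted if quoted is not None else bare)
    return fields

line = '12 345546.67677 "Hello World!!!" -567.55656 0.5465767'
print(split_quoted(line))
# ['12', '345546.67677', 'Hello World!!!', '-567.55656', '0.5465767']
```

Matching fields instead of splitting on separators is what keeps the whitespace inside the quotes intact.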

How to read text files and create a data frame in R

非 Y 不嫁゛ submitted on 2019-12-06 06:52:26
This question was migrated from Cross Validated because it can be answered on Stack Overflow. Migrated 4 years ago. I need to read the txt file at https://raw.githubusercontent.com/fonnesbeck/Bios6301/master/datasets/addr.txt and convert it into an R data frame with the columns LastName, FirstName, streetno, streetname, city, state, and zip... I tried using sep to separate the fields, but failed... Expanding on my comments, here's another approach. You may need to tweak some of the code if your full data set has a wider range of patterns to account for. library(stringr) # For str_trim #

PDF Text Extraction Approach Using OCR [closed]

风流意气都作罢 submitted on 2019-12-06 02:15:37
Question: Has anybody attempted to extract text from a PDF using an OCR library and Java? What did you find to be the most reliable library for text extraction? Most of the approaches I've seen (tesseract, GOCR) are C libraries that would require some JNI code to be written. I'm familiar with pdfbox, which is now an

How can I extract/parse tabular data from a text file in Perl?

北城余情 submitted on 2019-12-06 01:06:06
Question: I am looking for something like HTML::TableExtract, not for HTML input but for plain text input that contains "tables" formatted with indentation and spacing. Data could look like this: Here is some header text. Column One Column Two Column Three a b a b c Some more text Another Table Another Column abdbdbdb aaaa Answer 1: I'm not aware of any packaged solution, but something not very flexible is fairly simple to do, assuming you can do two passes over the file: (the following is partially

Parse raw text with MaltParser in Java

梦想的初衷 submitted on 2019-12-05 21:05:40
I found that NLTK in Python does this via the *raw_parse* function, but I need to use Java. I found that cleartk has a MaltParser wrapper, but there is no documentation about it. I'm looking for a function or a project that first converts raw English text to a CoNLL file that MaltParser can use and then parses it with MaltParser. Any help is appreciated. There are examples that come with the MaltParser 1.7.2 distribution in the folder examples/apiexamples/srcex. However, these examples only show how to run MaltParser programmatically after tokenization and POS tagging have already been performed (and after the

How to understand and fix conflicts in PLY

最后都变了- submitted on 2019-12-05 20:12:48
I am working on a SystemVerilog parser and I am running into many PLY conflicts (both shift/reduce and reduce/reduce). I currently have 170+ conflicts, and my problem is that I don't really understand the parser.out file generated by PLY. Without understanding it properly there is little I can do, so my goal is to understand what PLY is reporting. All the PLY documentation is brief and not very explanatory... Here is one of my states, apparently the first where a conflict is found: state 24 (134) attribute_instance_optional_list -> attribute_instance_list . (136) attribute

Parse string into a tree structure?

独自空忆成欢 submitted on 2019-12-05 06:42:10
I'm trying to figure out how to parse a string in this format into a tree-like data structure of arbitrary depth. "{{Hello big|Hi|Hey} {world|earth}|{Goodbye|farewell} {planet|rock|globe{.|!}}}" [[["Hello big" "Hi" "Hey"] ["world" "earth"]] [["Goodbye" "farewell"] ["planet" "rock" "globe" ["." "!"]]]] I've tried playing with some regular expressions for this (such as #"{([^{}]*)}" ), but everything I've tried seems to "flatten" the tree into a big list of lists. I could be approaching this from the wrong angle, or maybe a regex just isn't the right tool for the job. Thanks for your help! Don't

Converting BibTeX files to HTML with Python (maybe pybtex?)

假如想象 submitted on 2019-12-05 05:19:27
Hi, I want to parse a BibTeX publications file, sort by specific fields (e.g. year), and filter certain content, to then put it on a website. I came across pybtex, which works as far as reading and parsing the BibTeX file, but it is basically undocumented and I can't figure out how to sort the entries. Is pybtex the way to go (how can I sort the entries), or are there better options? Thanks a lot! Found a solution; this sorts the entries in descending order using pybtex, so the newest publications go first: from pybtex.database.input import bibtex from operator import itemgetter, attrgetter

How to extract chunks from BIO chunked sentences? - python

限于喜欢 submitted on 2019-12-05 03:58:10
Given an input sentence that has BIO chunk tags: [('What', 'B-NP'), ('is', 'B-VP'), ('the', 'B-NP'), ('airspeed', 'I-NP'), ('of', 'B-PP'), ('an', 'B-NP'), ('unladen', 'I-NP'), ('swallow', 'I-NP'), ('?', 'O')] I need to extract the relevant phrases, e.g. if I want to extract 'NP', I need to extract the fragments of tuples that contain B-NP and I-NP. [out]: [('What', '0'), ('the airspeed', '2-3'), ('an unladen swallow', '5-6-7')] (Note: the numbers in the extracted tuples represent the token index.) I have tried extracting it using the following code: def extract_chunks(tagged
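The asker's own code is cut off in this excerpt, so the following is a minimal sketch of one way to get the output described above, not a reconstruction of their function. The parameter names `tagged_sent` and `chunk_type` are assumptions, and the sketch treats an I- tag with no preceding B- tag of the same type as outside any chunk.

```python
def extract_chunks(tagged_sent, chunk_type):
    """Return (phrase, 'i-j-k') pairs for every chunk of the given type."""
    chunks = []
    words, indices = [], []

    def flush():
        # Close the chunk currently being built, if there is one.
        if words:
            chunks.append((" ".join(words), "-".join(indices)))
            words.clear()
            indices.clear()

    for i, (word, tag) in enumerate(tagged_sent):
        if tag == "B-" + chunk_type:
            flush()  # a B- tag always starts a new chunk
            words.append(word)
            indices.append(str(i))
        elif tag == "I-" + chunk_type and words:
            words.append(word)  # continue the open chunk
            indices.append(str(i))
        else:
            flush()  # any other tag ends the open chunk
    flush()
    return chunks

sent = [('What', 'B-NP'), ('is', 'B-VP'), ('the', 'B-NP'), ('airspeed', 'I-NP'),
        ('of', 'B-PP'), ('an', 'B-NP'), ('unladen', 'I-NP'), ('swallow', 'I-NP'),
        ('?', 'O')]
print(extract_chunks(sent, 'NP'))
# [('What', '0'), ('the airspeed', '2-3'), ('an unladen swallow', '5-6-7')]
```

The final `flush()` after the loop matters: without it, a chunk that ends at the last token of the sentence would be dropped.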

PowerShell command to trim path if it ends with “\\”

狂风中的少年 submitted on 2019-12-05 01:29:05
I need to trim a path if it ends with \. For example, C:\Ravi\ should become C:\Ravi. If the path does not end with \, it must be left unchanged. I tried .EndsWith("\"), but it fails when I have \\ instead of \. Can this be done in PowerShell without resorting to conditionals? No need to overcomplicate: "C:\Ravi\".trim('\') Consider using TrimEnd instead (especially if you are working with UNC paths): "C:\Ravi\".TrimEnd('\') You mention needing to differentiate between paths ending in "\" and "\\" and possibly handling those differently. While you can use .Trim("\") or .TrimEnd("\") to