information-extraction

Parsing date from text using Ruby

旧城冷巷雨未停 posted on 2019-11-30 16:37:29
I'm trying to figure out how to extract dates from unstructured text using Ruby. For example, I'd like to parse the date out of this string: "Applications started after 12:00 A.M. Midnight (EST) February 1, 2010 will not be considered." Any suggestions?

Assuming you just want dates and not datetimes:

    require 'date'

    string = "Applications started after 12:00 A.M. Midnight (EST) February 1, 2010 will not be considered."
    # Match a full month name, a one- or two-digit day, and a four-digit year.
    r = /(January|February|March|April|May|June|July|August|September|October|November|December) (\d{1,2}), (\d{4})/
    if string[r]
      date = Date.parse(string[r])
      puts date
    end

Try

some ideas and direction of how to measure ranking, AP, MAP, recall for IR evaluation

笑着哭i posted on 2019-11-30 16:33:29
I have a question about how to evaluate whether information retrieval results are good or not, for example by calculating the rank of the relevant documents, recall, precision, AP, and MAP. Currently, the system can retrieve documents from the database once the user enters a query. The problem is that I do not know how to do the evaluation. I found a public data set, the "Cranfield collection" (dataset link), which contains:

1. documents
2. queries
3. relevance assessments

               DOCS    QRYS    SIZE*
    Cranfield  1,400   225     1.6

May I know how to do the evaluation using the "Cranfield collection" to calculate the relevant document rank,
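As a minimal sketch of the metrics named in the question (not a definitive implementation): treat the relevance assessments as a set of relevant document IDs per query, and the system output as a ranked list of document IDs per query. The method names, the `qrels`/`runs` layout, and the toy IDs below are assumptions made for illustration; parsing the actual Cranfield files is not shown. Average precision here is normalized by the total number of relevant documents for the query.

```ruby
require 'set'

# Precision: fraction of the retrieved documents that are relevant.
def precision(ranked, relevant)
  return 0.0 if ranked.empty?
  ranked.count { |doc| relevant.include?(doc) }.to_f / ranked.size
end

# Recall: fraction of the relevant documents that were retrieved.
def recall(ranked, relevant)
  return 0.0 if relevant.empty?
  ranked.count { |doc| relevant.include?(doc) }.to_f / relevant.size
end

# Average precision (AP): sum of precision@k at every rank k that holds a
# relevant document, divided by the number of relevant documents.
def average_precision(ranked, relevant)
  return 0.0 if relevant.empty?
  hits = 0
  sum = 0.0
  ranked.each_with_index do |doc, index|
    next unless relevant.include?(doc)
    hits += 1
    sum += hits.to_f / (index + 1)
  end
  sum / relevant.size
end

# Mean average precision (MAP): mean of AP over all queries in the run.
def mean_average_precision(runs, qrels)
  return 0.0 if runs.empty?
  total = runs.sum { |query_id, ranked| average_precision(ranked, qrels.fetch(query_id, Set.new)) }
  total / runs.size
end

# Hypothetical toy data: relevance judgments and ranked results per query.
qrels = { 'q1' => Set['d1', 'd3'], 'q2' => Set['d2'] }
runs  = { 'q1' => %w[d1 d2 d3 d4], 'q2' => %w[d4 d2] }

runs.each do |query_id, ranked|
  relevant = qrels[query_id]
  puts format('%s P=%.3f R=%.3f AP=%.3f', query_id,
              precision(ranked, relevant), recall(ranked, relevant),
              average_precision(ranked, relevant))
end
puts format('MAP=%.3f', mean_average_precision(runs, qrels))
```

With a real collection, `qrels` would be filled from the relevance assessments file and `runs` from the system's ranked output for each query.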

Parsing date from text using Ruby

戏子无情 posted on 2019-11-29 23:58:12
Question: I'm trying to figure out how to extract dates from unstructured text using Ruby. For example, I'd like to parse the date out of this string: "Applications started after 12:00 A.M. Midnight (EST) February 1, 2010 will not be considered." Any suggestions?

Answer 1: Assuming you just want dates and not datetimes:

    require 'date'

    string = "Applications started after 12:00 A.M. Midnight (EST) February 1, 2010 will not be considered."
    r = /(January|February|March|April|May|June|July|August|September

Training Tagger with Custom Tags in NLTK

Deadly posted on 2019-11-29 07:36:53
I have a document with tagged data in the format:

    Hi here's my [KEYWORD phone number], let me know when you wanna hangout: [PHONE 7802708523]. I live in a [PROP_TYPE condo] in [CITY New York] .

I want to train a model based on a set of this type of tagged documents, and then use my model to tag new documents. Is this possible in NLTK? I have looked at chunking and the NLTK-Trainer scripts, but these have a restricted set of tags and corpora, while my dataset has custom tags.

As @AleksandarSavkov wrote already, this is essentially a named entity recognition (NER) task, or more generally a chunking
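A common first step for this kind of chunking or NER training is to convert the bracketed annotations into token/label pairs in IOB form (B- for the first token of a span, I- for the rest, O for everything else). The following is only a sketch of that preprocessing step under stated assumptions: the `ANNOTATION` pattern, the `tokenize` helper, and the `to_iob` name are hypothetical, tags are assumed to be uppercase letters and underscores, and annotations are assumed not to nest. The training step itself is not shown.

```ruby
# Matches "[TAG some words]"; assumes tags use only A-Z and "_", no nesting.
ANNOTATION = /\[([A-Z_]+) ([^\]]+)\]/

# Naive tokenizer (an assumption): words with optional apostrophe part,
# and each punctuation character as its own token.
def tokenize(text)
  text.scan(/\w+(?:'\w+)?|[^\w\s]/)
end

# Returns an array of [token, label] pairs with IOB labels.
def to_iob(text)
  pairs = []
  pos = 0
  text.scan(ANNOTATION) do |tag, words|
    match = Regexp.last_match
    # Untagged text before this annotation gets the "O" (outside) label.
    tokenize(text[pos...match.begin(0)]).each { |token| pairs << [token, 'O'] }
    # First token of the span is B-TAG, the remaining tokens are I-TAG.
    tokenize(words).each_with_index do |token, index|
      pairs << [token, "#{index.zero? ? 'B' : 'I'}-#{tag}"]
    end
    pos = match.end(0)
  end
  # Untagged text after the last annotation.
  tokenize(text[pos..-1]).each { |token| pairs << [token, 'O'] }
  pairs
end

to_iob('I live in a [PROP_TYPE condo] in [CITY New York] .').each do |token, label|
  puts "#{token}\t#{label}"
end
```

Because the labels come straight from the bracket tags, any custom tag set works without changes to the conversion.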

Methods for extracting locations from text?

≡放荡痞女 posted on 2019-11-29 02:57:36
Question: What are the recommended methods for extracting locations from free text? What I can think of is to use regex rules like "words ... in location". But are there better approaches than this? I can also think of having a lookup hash table with the names of countries and cities, and then comparing every token extracted from the text against the hash table. Does anybody know of better approaches? Edit: I'm trying to extract locations from the text of tweets. So the issue of the high number of tweets might
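The lookup-table idea from the question can be sketched as follows. This is only an illustration under stated assumptions: the `GAZETTEER` contents are a tiny hypothetical list, the `extract_locations` name is made up, and tokenization is reduced to lowercase ASCII words. To handle multi-word names such as "New York", the sketch tries the longest candidate phrase first at each position.

```ruby
require 'set'

# Hypothetical, tiny gazetteer; a real one would be loaded from a place-name list.
GAZETTEER = Set['new york', 'paris', 'san francisco', 'egypt'].freeze

# Longest entry in words, so we know how many tokens to look ahead.
MAX_WORDS = GAZETTEER.map { |name| name.split.size }.max

# Returns the gazetteer entries found in the text, in order of appearance.
def extract_locations(text)
  tokens = text.downcase.scan(/[a-z]+/)
  found = []
  i = 0
  while i < tokens.size
    # Try the longest phrase starting at token i first, then shorter ones.
    length = MAX_WORDS.downto(1).find do |n|
      i + n <= tokens.size && GAZETTEER.include?(tokens[i, n].join(' '))
    end
    if length
      found << tokens[i, length].join(' ')
      i += length
    else
      i += 1
    end
  end
  found
end

p extract_locations('Flying from New York to Paris tomorrow!')
```

Each token costs at most `MAX_WORDS` set lookups, so the work per tweet grows linearly with its length; the approach cannot disambiguate names that are also ordinary words.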

What is CoNLL data format?

独自空忆成欢 posted on 2019-11-28 16:19:38
I am new to text mining. I am using an open-source jar (Mate Parser) which gives me output in the CoNLL 2009 format after dependency parsing. I want to use the dependency parsing results for information extraction. I can understand some of the output, but I cannot make sense of the CoNLL data format. Can anyone help me understand the CoNLL data format? Any kind of pointers would be appreciated.

There are many different CoNLL formats since CoNLL is a different shared task each year. The format for CoNLL 2009 is described here. Each line represents a single word with a
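To make the layout concrete, here is a minimal reading sketch, assuming the standard CoNLL 2009 column order (ID, FORM, LEMMA, PLEMMA, POS, PPOS, FEAT, PFEAT, HEAD, PHEAD, DEPREL, PDEPREL, FILLPRED, PRED, then one APRED column per predicate), tab-separated fields, and a blank line between sentences. The `parse_conll09` name and the two-token sample are made up for illustration and are not actual Mate Parser output.

```ruby
# The 14 fixed columns of CoNLL 2009; APRED columns follow them.
CONLL09_COLUMNS = %i[id form lemma plemma pos ppos feat pfeat
                     head phead deprel pdeprel fillpred pred].freeze

# Returns an array of sentences; each sentence is an array of token hashes.
def parse_conll09(text)
  sentences = text.split(/\n\s*\n/).map do |block|
    block.lines.map(&:chomp).reject(&:empty?).map do |line|
      fields = line.split("\t")
      token = CONLL09_COLUMNS.zip(fields).to_h
      # Everything after the fixed columns is one argument label per predicate.
      token[:apreds] = fields[CONLL09_COLUMNS.size..-1] || []
      token
    end
  end
  sentences.reject(&:empty?)
end

# Hypothetical two-token sentence with one predicate ("sleeps").
rows = [
  %w[1 Mary mary mary NNP NNP _ _ 2 2 SBJ SBJ _ _ A0],
  %w[2 sleeps sleep sleep VBZ VBZ _ _ 0 0 ROOT ROOT Y sleep.01 _]
]
sample = rows.map { |row| row.join("\t") }.join("\n") + "\n"

# Print each dependency edge as "word --relation--> head word".
parse_conll09(sample).each do |sentence|
  by_id = sentence.map { |token| [token[:id], token] }.to_h
  sentence.each do |token|
    head = token[:head] == '0' ? 'ROOT' : by_id.fetch(token[:head])[:form]
    puts "#{token[:form]} --#{token[:deprel]}--> #{head}"
  end
end
```

For information extraction, the HEAD and DEPREL columns give the dependency tree, while FILLPRED, PRED, and the APRED columns carry the semantic role annotation when the parser produces it.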

Advanced PDF Parsing Using Python (extracting text without tables, etc): What's the Best Library? [closed]

人盡茶涼 posted on 2019-11-27 16:44:45
I'm looking for a PDF library which will allow me to extract the text from a PDF document. I've looked at PyPDF, and this can extract the text from a PDF document very nicely. The problem with this is that if there are tables in the document, the text in the tables is extracted in-line with the rest of the document text. This can be problematic because it produces sections of text that aren't useful and look garbled (for instance, lots of numbers mashed together). I'm looking for something that's a bit more advanced. I'd like to extract the text from a PDF document, excluding any tables and

extract single string from HTML using Ruby/Mechanize (and Nokogiri)

ぃ、小莉子 posted on 2019-11-27 15:23:06
I am extracting data from a forum. My script based on is working fine. Now I need to extract the date and time (21 Dec 2009, 20:39) from a single post. I cannot get it to work. I used FireXPath to determine the XPath. Sample code:

    require 'rubygems'
    require 'mechanize'

    post_agent = WWW::Mechanize.new
    post_page = post_agent.get('http://www.vbulletin.org/forum/showthread.php?t=230708')
    puts post_page.parser.xpath('/html/body/div/div/div/div/div/table/tbody/tr/td/div[2]/text()').to_s.strip
    puts post_page.parser.at_xpath('/html/body/div/div/div/div/div/table/tbody/tr/td/div[2]/text()').to_s.strip
    puts post
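Independently of the XPath problem, once any text containing the post header has been obtained, the timestamp itself can be picked out and parsed with the standard library. The following is a sketch only: the `POST_TIMESTAMP` pattern and the `extract_post_time` name are assumptions, and the pattern covers just the "21 Dec 2009, 20:39" shape quoted in the question (English month abbreviations, 24-hour time).

```ruby
require 'date'

# Day, abbreviated month name, four-digit year, comma, HH:MM.
POST_TIMESTAMP = /\b(\d{1,2} [A-Z][a-z]{2} \d{4}), (\d{2}:\d{2})\b/

# Returns a DateTime for the first timestamp found, or nil if there is none.
def extract_post_time(text)
  match = text.match(POST_TIMESTAMP)
  return nil unless match
  DateTime.strptime("#{match[1]} #{match[2]}", '%d %b %Y %H:%M')
end

posted = extract_post_time("\n   21 Dec 2009, 20:39  \n")
puts posted.strftime('%Y-%m-%d %H:%M') if posted
```

Searching the extracted text with a pattern tolerates the surrounding whitespace that usually comes with text nodes, so the XPath does not have to isolate the timestamp exactly.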

How does Apple find dates, times and addresses in emails?

ⅰ亾dé卋堺 posted on 2019-11-27 09:56:14
In the iOS email client, when an email contains a date, time or location, the text becomes a hyperlink and it is possible to create an appointment or look at a map simply by tapping the link. It works not only for emails in English, but also in other languages. I love this feature and would like to understand how they do it. The naive way to do this would be to have many regular expressions and run them all. However, I think this is not going to scale very well and will work only for a specific language or date format, etc. I think that Apple must be using some concept of machine learning to extract

What is CoNLL data format?

折月煮酒 posted on 2019-11-27 09:50:14
Question: I am new to text mining. I am using an open-source jar (Mate Parser) which gives me output in the CoNLL 2009 format after dependency parsing. I want to use the dependency parsing results for information extraction. I can understand some of the output, but I cannot make sense of the CoNLL data format. Can anyone help me understand the CoNLL data format? Any kind of pointers would be appreciated.

Answer 1: There are many different CoNLL formats since CoNLL is a different shared