information-extraction

Media Information Extractor for Java [closed]

可紊 posted on 2019-11-27 07:28:27
I need a media information extraction library (pure Java or a JNI wrapper) that can handle common media formats. I primarily use it for video files, and I need at least the following information: video length (runtime), video bitrate, video frame rate, video format and codec, video size (width x height), audio channels, audio format, and audio bitrate and sampling rate. There are several libraries and tools around, but I couldn't find one for Java. A few days after asking this question, I found MediaInfo, which supplies dozens of technical and tag details about a video or audio file. There is a JNI wrapper for

Advanced PDF Parsing Using Python (extracting text without tables, etc): What's the Best Library? [closed]

試著忘記壹切 posted on 2019-11-26 18:44:36
Question: I'm looking for a PDF library that will let me extract the text from a PDF document. I've looked at PyPDF, which extracts the text from a PDF document very nicely. The problem is that if the document contains tables, the text in the tables is extracted in-line with the rest of the document text. This is problematic because it produces sections of text that aren't useful and look garbled (for instance, lots of numbers mashed together). I'm looking for something

extract single string from HTML using Ruby/Mechanize (and Nokogiri)

百般思念 posted on 2019-11-26 18:30:27
Question: I am extracting data from a forum. My script based on is working fine. Now I need to extract the date and time (21 Dec 2009, 20:39) from a single post, and I cannot get it to work. I used FireXPath to determine the XPath. Sample code:

    require 'rubygems'
    require 'mechanize'
    post_agent = WWW::Mechanize.new
    post_page = post_agent.get('http://www.vbulletin.org/forum/showthread.php?t=230708')
    puts post_page.parser.xpath('/html/body/div/div/div/div/div/table/tbody/tr/td/div[2]/text()').to_s.strip
    puts post_page

How does Apple find dates, times and addresses in emails?

余生颓废 posted on 2019-11-26 17:53:11
Question: In the iOS email client, when an email contains a date, time, or location, the text becomes a hyperlink, and it is possible to create an appointment or look at a map simply by tapping the link. It works not only for emails in English but for other languages as well. I love this feature and would like to understand how they do it. The naive way to do this would be to have many regular expressions and run them all. However, this is not going to scale very well and will work for only a specific
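
For illustration, the naive approach described in the question (a set of regular expressions, each run over the whole text) might look like the minimal Ruby sketch below. This is not how Apple does it. The names `PATTERNS` and `find_entities` are hypothetical, and the two patterns only cover English dates such as "21 Dec 2009" and 24-hour times, which shows exactly the limitation the question raises.

```ruby
# Naive approach: one regular expression per entity kind, run them all.
# The pattern set is illustrative only and far from complete.
PATTERNS = {
  date: /\b\d{1,2} (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{4}\b/,
  time: /\b(?:[01]?\d|2[0-3]):[0-5]\d\b/
}.freeze

# Returns an array of [kind, matched_text, offset] triples, sorted by offset.
def find_entities(text)
  hits = []
  PATTERNS.each do |kind, regex|
    text.scan(regex) do
      match = Regexp.last_match
      hits << [kind, match[0], match.begin(0)]
    end
  end
  hits.sort_by { |_kind, _value, offset| offset }
end

find_entities("Lunch on 21 Dec 2009 at 20:39?").each do |kind, value, offset|
  puts "#{kind} at #{offset}: #{value}"
end
```

Every additional language or date format needs another hand-written pattern, and every pattern is scanned across the full text, which is why this approach does not scale.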

Media Information Extractor for Java [closed]

江枫思渺然 posted on 2019-11-26 13:18:54
Question: I need a media information extraction library (pure Java or a JNI wrapper) that can handle common media formats. I primarily use it for video files, and I need at least the following information: video length (runtime), video bitrate, video frame rate, video format and codec, video size (width x height), audio channels, audio