How to extract highlighted text from a PDF in Rails

喜你入骨 submitted on 2020-07-23 06:17:46

Question


I am using pdf-reader to parse a PDF into objects and pluck out the Highlight and Underline annotations. The problem is that I cannot get the text out of those annotation objects. Other implementations I checked for reference read Contents from the highlight object, but when I do the same it returns nil, because the object has no such entry. Here is the code I use to get the highlighted objects from the PDF:

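  require 'pdf-reader'

  def read_pdf
    puts 'Running...'
    file = "/home/kshitiz/Downloads/test.pdf"

    doc = PDF::Reader.new(file)
    # Shared object table, used by lookup to resolve indirect references.
    $objects = doc.objects

    doc.pages.each do |page|
      annots = highlights_on_page(page)
      annots.each do |annot|
        # /Contents is an optional entry, so this prints nothing when it is absent.
        puts annot[:Contents]
      end
    end
  end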

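  # True for text markup annotations of subtype Highlight or Underline.
  def is_highlight?(object)
    object[:Type] == :Annot && [:Highlight, :Underline].include?(object[:Subtype])
  end

  # All annotation dictionaries attached to the page (empty if there are none).
  def annots_on_page(page)
    references = (page.attributes[:Annots] || [])
    lookup_all(references).flatten
  end

  # Resolve one reference or an array of references.
  def lookup_all(refs)
    refs = *refs
    refs.map { |ref| lookup(ref) }
  end

  # Resolve a single reference; arrays of references are resolved recursively.
  def lookup(ref)
    object = $objects[ref]
    return object unless object.is_a?(Array)
    lookup_all(object)
  end

  def highlights_on_page(page)
    all_annots = annots_on_page(page)
    all_annots.select { |a| is_highlight?(a) }
  end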

Can anyone help me resolve this bug?
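
For context on why Contents can be nil: in the PDF specification, Contents is an optional entry on an annotation and usually carries a note or comment, not the marked-up text. What a Highlight or Underline annotation does carry is QuadPoints, a flat array with 8 numbers (four x/y corners) per highlighted quadrilateral. One possible approach, sketched below, is to collect the text runs whose centre falls inside those quadrilaterals. This is only a sketch under assumptions: `TextRun`, `quad_boxes`, `inside?` and `highlighted_text` are hypothetical names, not part of pdf-reader, and how you obtain positioned text runs for a page (and whether QuadPoints needs to be dereferenced first) depends on your pdf-reader version and is not shown here.

```ruby
# Hypothetical positioned text run: lower-left corner plus size,
# in the same PDF user-space units as /QuadPoints.
TextRun = Struct.new(:text, :x, :y, :width, :height)

# Turn a flat /QuadPoints array (8 numbers per quadrilateral)
# into axis-aligned bounding boxes.
def quad_boxes(quad_points)
  quad_points.each_slice(8).map do |quad|
    xs = quad.each_slice(2).map(&:first)
    ys = quad.each_slice(2).map(&:last)
    { left: xs.min, right: xs.max, bottom: ys.min, top: ys.max }
  end
end

# A run counts as highlighted when its centre point lies inside the box.
def inside?(run, box)
  cx = run.x + run.width / 2.0
  cy = run.y + run.height / 2.0
  cx.between?(box[:left], box[:right]) && cy.between?(box[:bottom], box[:top])
end

# Join the text of every run covered by any quadrilateral of the annotation.
# Assumes runs are already in reading order.
def highlighted_text(annot, runs)
  boxes = quad_boxes(annot[:QuadPoints] || [])
  runs.select { |run| boxes.any? { |box| inside?(run, box) } }
      .map(&:text)
      .join(' ')
end
```

With something like this, the loop in read_pdf could call highlighted_text(annot, runs_for_page) instead of reading annot[:Contents], where runs_for_page is whatever list of positioned runs you manage to extract for that page. Testing the centre point rather than the whole run is a deliberate simplification: it tolerates highlights that are drawn slightly smaller than the glyph boxes, at the cost of being imprecise for rotated text.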

Source: https://stackoverflow.com/questions/62712684/how-to-extract-the-highlighted-text-from-the-pdf-in-rails
