Question
I am using pdf-reader to parse a PDF into objects and pick out the Highlight and Underline annotations. The problem is that I cannot get the text out of those annotation objects. Other implementations I looked at for reference read the Contents entry of the highlight annotation, but when I do the same it returns nil, because the object has no such entry. Here is the code I use to get the highlighted objects from the PDF:
require 'pdf-reader'

# Prints the Contents entry of every highlight/underline annotation.
def read_pdf
  puts 'Running...'
  file = "/home/kshitiz/Downloads/test.pdf"
  pdf_file_name = 'test'
  doc = PDF::Reader.new(file)
  $objects = doc.objects
  doc.pages.each do |page|
    annots = highlights_on_page(page)
    annots.each do |annot|
      puts annot[:Contents]
    end
  end
end

# True for annotation dictionaries whose subtype is Highlight or Underline.
def is_highlight?(object)
  object[:Type] == :Annot && [:Highlight, :Underline].include?(object[:Subtype])
end

# Resolves the page's Annots entry into a flat list of annotation objects.
def annots_on_page(page)
  references = (page.attributes[:Annots] || [])
  lookup_all(references).flatten
end

def lookup_all(refs)
  refs = *refs
  refs.map { |ref| lookup(ref) }
end

# Dereferences one entry; arrays of references are resolved recursively.
def lookup(ref)
  object = $objects[ref]
  return object unless object.is_a?(Array)
  lookup_all(object)
end

def highlights_on_page(page)
  all_annots = annots_on_page(page)
  all_annots.select { |a| is_highlight?(a) }
end
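One thing that may be relevant: in the PDF format, Contents is an optional entry on an annotation, so a viewer can save a highlight without storing the highlighted text there. Text markup annotations such as Highlight and Underline describe the marked region through their QuadPoints entry instead, eight numbers (four x/y corners) per highlighted quadrilateral. The sketch below is only an illustration of reading that entry. `quad_boxes` is a hypothetical helper, not part of pdf-reader, and it assumes the annotation is a hash with symbol keys whose QuadPoints value has already been resolved to a plain array of numbers.

```ruby
# Hypothetical helper (not part of pdf-reader): turns a /QuadPoints array
# into one bounding box per highlighted quadrilateral.
# Assumes annot[:QuadPoints] is already a plain array of numbers.
def quad_boxes(annot)
  points = annot[:QuadPoints] || []
  points.each_slice(8).map do |quad|
    next if quad.size < 8 # skip an incomplete trailing group

    xs = quad.values_at(0, 2, 4, 6) # x coordinates of the four corners
    ys = quad.values_at(1, 3, 5, 7) # y coordinates of the four corners
    { left: xs.min, right: xs.max, bottom: ys.min, top: ys.max }
  end.compact
end

# Made-up sample annotation, shaped like the hashes selected above.
annot = {
  Type: :Annot,
  Subtype: :Highlight,
  QuadPoints: [72.0, 700.0, 200.0, 700.0, 72.0, 688.0, 200.0, 688.0]
}

quad_boxes(annot).each { |box| puts box.inspect }
```

The boxes are in PDF page coordinates. Getting the actual text would still require matching them against the positions of the text on the page, which is not shown here.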
Can anyone help me work out how to get the highlighted text with this code?
Source: https://stackoverflow.com/questions/62712684/how-to-extract-the-highlighted-text-from-the-pdf-in-rails