Reading text from a binary file like PDF

别跟我提以往 2021-01-17 07:23

I have a problem reading a binary file in C++. Currently my code looks like this:

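FILE *s = fopen(source, "rb");  // open the file in binary mode
fseek(s, 0, SEEK_END);          // move to the end of the file
long size = ftell(s);           // ftell returns a long: the file size in bytes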


        
2 answers
  • 2021-01-17 07:43

    Only a few file formats, such as plain .TXT text files, can be "read" and "understood" directly. Most file formats, including almost every binary format, are exactly that: a format. That implies a certain structure inside the file. A .TXT file is the complete opposite: it has no structure at all, or rather, it is one big block of pure data.

    Open WordPad, Word, or any other at least somewhat capable text editor, write some text, and save it as RTF, DOC, ODT, or any other non-TXT format. Then save the same text as a TXT file too.

    Download a hex viewer or hex editor. Any one will do. Pick a free one; you don't need many features, just a display of the raw byte values in one column and the ASCII text in another. Almost any free hex viewer/editor can do that.

    Open both files and compare them. You will immediately see the difference.
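
    If you would rather not install a tool, the two-column view described above can be sketched in a few lines of C++. This is only a minimal illustration, not a replacement for a real hex editor; the function names (hex_dump_row, read_file, hex_dump_file) and the 16-bytes-per-row layout are my own choices.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Format one row of a hex dump: the offset, up to 16 bytes in hex, then the
// same bytes as ASCII (non-printable bytes are shown as '.').
std::string hex_dump_row(const std::vector<unsigned char>& data, std::size_t offset) {
    const std::size_t kBytesPerRow = 16;
    char buf[32];
    std::snprintf(buf, sizeof buf, "%08zx  ", offset);
    std::string row = buf;
    std::string ascii;
    for (std::size_t i = 0; i < kBytesPerRow; ++i) {
        if (offset + i < data.size()) {
            const unsigned char b = data[offset + i];
            std::snprintf(buf, sizeof buf, "%02x ", static_cast<unsigned>(b));
            row += buf;
            ascii += (b >= 0x20 && b < 0x7f) ? static_cast<char>(b) : '.';
        } else {
            row += "   ";  // pad a short final row so the ASCII column lines up
        }
    }
    return row + " " + ascii;
}

// Read a whole file in binary mode; returns an empty vector on failure.
std::vector<unsigned char> read_file(const char* path) {
    std::vector<unsigned char> data;
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return data;
    unsigned char chunk[4096];
    std::size_t n;
    while ((n = std::fread(chunk, 1, sizeof chunk, f)) > 0)
        data.insert(data.end(), chunk, chunk + n);
    std::fclose(f);
    return data;
}

// Print every row of the file to stdout.
void hex_dump_file(const char* path) {
    const std::vector<unsigned char> data = read_file(path);
    for (std::size_t off = 0; off < data.size(); off += 16)
        std::puts(hex_dump_row(data, off).c_str());
}
```

    Call hex_dump_file on the TXT file and then on the RTF/DOC/ODT version of the same text, and compare what shows up in the ASCII column.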

    Back to the PDF:

    A PDF can even contain graphics interleaved with the text. How would you expect that to work if the text were "just sitting in the file" as in a TXT? How would the image position, description, and data be embedded? A PDF can even contain scripts, if I remember correctly, similar to JavaScript. Executable ones. A PDF-type document can have buttons that do something. That is much more complicated than just text in a file.

    Binary files usually do not contain plain text that your eyes can read. The text is structured in blocks and wrapped in metadata about colors, text layout, paging, and so on, or even in special structures for document versioning, authoring, classification, (...). All of this has to be stored somewhere.

    Usually, binary files have sections. The first section is usually called the HEADER. It holds information such as format type, format version, file/block/data length, image resolution, and similar. All of it will most probably be kept in binary form: not the text "800x600", just the bytes "|00|00|03|20|00|00|02|58|", assuming 32-bit big-endian (0x00000320 = 800, 0x00000258 = 600). After you have read, decoded, and understood that description, you will know where the actual data starts, how the data blocks are laid out, and how to decode them and understand what they contain.
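
    To make the "800x600" example concrete, here is a rough sketch of decoding those eight bytes. The header layout (width in bytes 0..3, height in bytes 4..7) and the names ImageHeader, read_u32_be, and parse_header are made up for illustration; a real format defines its own fields, and you have to take them from that format's specification.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical header layout, used only for illustration:
// bytes 0..3 = width, bytes 4..7 = height, both 32-bit big-endian.
struct ImageHeader {
    std::uint32_t width;
    std::uint32_t height;
};

// Combine four bytes, most significant byte first, into one 32-bit value.
std::uint32_t read_u32_be(const unsigned char* p) {
    return (static_cast<std::uint32_t>(p[0]) << 24) |
           (static_cast<std::uint32_t>(p[1]) << 16) |
           (static_cast<std::uint32_t>(p[2]) << 8) |
            static_cast<std::uint32_t>(p[3]);
}

// Decode the header; returns false if fewer than 8 bytes are available.
bool parse_header(const std::vector<unsigned char>& bytes, ImageHeader& out) {
    if (bytes.size() < 8) return false;
    out.width = read_u32_be(bytes.data());
    out.height = read_u32_be(bytes.data() + 4);
    return true;
}
```

    Assembling the value byte by byte with shifts, instead of casting the buffer to an integer pointer, keeps the result independent of the byte order of the machine the code runs on.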

    edit:

    Once you understand the difference between text files and binary files, check out the absolute basics at http://en.wikipedia.org/wiki/Entropy_(information_theory). Then try playing with RLE (http://www.daniweb.com/software-development/cpp/code/216388/basic-rle-file-compression-routine) or Huffman (http://www.cprogramming.com/tutorial/computersciencetheory/huffman.html), just to start with something relatively simple. Then read more about Huffman codes, and then, well, you will be reasonably prepared for tasks like ZIP or LZH.
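
    To give a feel for how simple RLE can be, here is a minimal sketch of one possible variant: every run is stored as a (count, value) byte pair, with runs capped at 255. This is my own toy encoding, not the routine from the linked page and not what ZIP or PDF use; the names Bytes, rle_encode, and rle_decode are just for illustration.

```cpp
#include <cstddef>
#include <vector>

using Bytes = std::vector<unsigned char>;

// Encode the input as (count, value) pairs; a run longer than 255 bytes
// is split into several pairs because the count must fit in one byte.
Bytes rle_encode(const Bytes& in) {
    Bytes out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char value = in[i];
        std::size_t run = 1;
        while (i + run < in.size() && in[i + run] == value && run < 255) ++run;
        out.push_back(static_cast<unsigned char>(run));
        out.push_back(value);
        i += run;
    }
    return out;
}

// Decode (count, value) pairs; a trailing unpaired byte is ignored.
Bytes rle_decode(const Bytes& in) {
    Bytes out;
    for (std::size_t i = 0; i + 1 < in.size(); i += 2)
        out.insert(out.end(), static_cast<std::size_t>(in[i]), in[i + 1]);
    return out;
}
```

    Note that this variant doubles the size of data that has no repeated bytes, which is one reason real compressors combine several techniques.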

  • 2021-01-17 08:02

    To parse a PDF as text, use a PDF library such as gnupdf or poppler.
