Question
1. I need to convert a PDF file into a .txt file. My command seems to work, since I get the converted text on the screen, but somehow I am unable to direct the output into a text file.
public static string[] GetArgs(string inputPath, string outputPath)
{
    return new[] {
        "-q", "-dNODISPLAY", "-dSAFER",
        "-dDELAYBIND", "-dWRITESYSTEMDICT", "-dSIMPLE",
        "-c", "save", "-f",
        "ps2ascii.ps", inputPath, "-sDEVICE=txtwrite",
        String.Format("-sOutputFile={0}", outputPath),
        "-c", "quit"
    };
}
2. Is there a Unicode-specific .ps?
Update: Posting my complete code; maybe the error is somewhere else.
public static string[] GetArgs(string inputPath, string outputPath)
{
    return new[]
    {
        "-o c:/test.txt",
        "-dSIMPLE",
        "-sFONTPATH=c:/windows/fonts",
        "-dNODISPLAY",
        "-dDELAYBIND",
        "-dWRITESYSTEMDICT",
        "-f",
        "C:/Program Files/gs/gs9.05/lib/ps2ascii.ps",
        inputPath,
    };
}
[DllImport("gsdll64.dll", EntryPoint = "gsapi_new_instance")]
private static extern int CreateAPIInstance(out IntPtr pinstance, IntPtr caller_handle);
[DllImport("gsdll64.dll", EntryPoint = "gsapi_init_with_args")]
private static extern int InitAPI(IntPtr instance, int argc, string[] argv);
[DllImport("gsdll64.dll", EntryPoint = "gsapi_exit")]
private static extern int ExitAPI(IntPtr instance);
[DllImport("gsdll64.dll", EntryPoint = "gsapi_delete_instance")]
private static extern void DeleteAPIInstance(IntPtr instance);
private static object resourceLock = new object();
private static void Cleanup(IntPtr gsInstancePtr)
{
    ExitAPI(gsInstancePtr);
    DeleteAPIInstance(gsInstancePtr);
}
public static void ConvertPdfToText(string inputPath, string outputPath)
{
    CallAPI(GetArgs(inputPath, outputPath));
}
private static void CallAPI(string[] args)
{
    // Get a pointer to an instance of the Ghostscript API and run the API with the current arguments
    IntPtr gsInstancePtr;
    lock (resourceLock)
    {
        CreateAPIInstance(out gsInstancePtr, IntPtr.Zero);
        try
        {
            int result = InitAPI(gsInstancePtr, args.Length, args);
            if (result < 0)
            {
                throw new ExternalException("Ghostscript conversion error", result);
            }
        }
        finally
        {
            Cleanup(gsInstancePtr);
        }
    }
}
Answer 1:
2 questions, 2 answers:
1. To get output to a file, use
-sOutputFile=/path/to/file
on the command line, or add the line
"-sOutputFile=/where/it/should/go",
to your C# code (it can be the first argument, but it should come before your first "-c"). But first get rid of the other -sOutputFile stuff you already have in there... :-)
2. No, PostScript isn't aware of Unicode.
Update
(Remark: Extracting text from PDF reliably is (for various technical reasons) notoriously difficult. And it may not work at all, whichever tool you try...)
On the command line, the following two should work for recent releases of Ghostscript (the current version is v9.05). It would be your own job...
- ...to test which command works better for your use case, and
- ...to translate these into C# code.
1. Using the txtwrite device
gswin32c.exe ^
-o c:/path/to/output.txt ^
-dTextFormat=3 ^
-sDEVICE=txtwrite ^
input.pdf
Notes:
- You may want to use gswin64c.exe (if available) if your system is 64-bit.
- The -o syntax for the output works only with recent versions of Ghostscript.
- The -o syntax also implicitly sets the -dBATCH and -dNOPAUSE parameters.
- If your Ghostscript is too old and the -o shorthand doesn't work, replace it with -dBATCH -dNOPAUSE -sOutputFile=...
- Ghostscript can handle forward slashes inside path arguments even on Windows.
- The -dTextFormat is set to 3 by default anyway, so it is not required here. 'Legal' values for it are:
  - 0: This outputs XML-escaped Unicode along with info related to the format of the text (position, font name, point size, etc.). Intended for developers only.
  - 1: Same as 0, but will output blocks of text.
  - 2: This outputs Unicode (UCS2) text with a BOM (Byte Order Mark); tries to approximate the layout of the text in the original document.
  - 3: (default) Same as 2, but the text is encoded in UTF-8.
- The txtwrite device with this -dTextFormat modifier is a rather new asset of Ghostscript, so please report bugs if you find any.
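As a rough starting point for the C# translation, the txtwrite command above could be expressed as an argument array like the one below. This is a sketch, not tested against the DLL: the class name TxtWriteArgs and the "gs" placeholder are made up for illustration, and it assumes that gsapi_init_with_args treats argv[0] as the program name and skips it, the way a C main function would. It uses the long form -dBATCH -dNOPAUSE -sOutputFile=... instead of -o, and puts every switch in its own array element.

```csharp
using System;

public static class TxtWriteArgs
{
    // Builds an argument vector for the txtwrite device.
    // Assumption: argv[0] is treated as the program name and ignored
    // by gsapi_init_with_args, so a placeholder goes first.
    public static string[] Build(string inputPath, string outputPath)
    {
        return new[]
        {
            "gs",                          // placeholder for argv[0] (hypothetical name)
            "-dBATCH",                     // exit after processing the input
            "-dNOPAUSE",                   // do not prompt between pages
            "-sDEVICE=txtwrite",           // text extraction device
            "-dTextFormat=3",              // UTF-8 output (the default)
            "-sOutputFile=" + outputPath,  // one switch per array element
            inputPath
        };
    }
}
```

The resulting array could then be passed to the existing CallAPI method in place of the ps2ascii.ps arguments.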
2. Using ps2ascii.ps
gswin32c.exe ^
-sstdout=c:/path/to/output.txt ^
-dSIMPLE ^
-sFONTPATH=c:/windows/fonts ^
-dNODISPLAY ^
-dDELAYBIND ^
-dWRITESYSTEMDICT ^
-f /path/to/ps2ascii.ps ^
input.pdf
Notes:
- This is a completely different method from the txtwrite device one and cannot be mixed with it!
- ps2ascii.ps is a file, a PostScript program that Ghostscript invokes to extract the text. It is usually located in the /lib subdirectory of the Ghostscript install directory. Go and see if it is really there.
- -dSIMPLE may be replaced by -dCOMPLEX in order to print out extra info lines (current color, presence of an image, rectangular fills).
- -sstdout=... is required because the ps2ascii.ps PostScript program prints to stdout only and can't be told to write to a file. So -sstdout=... tells Ghostscript to redirect its stdout to a file.
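Translated to C#, the ps2ascii.ps command might look roughly like the following argument array. Again this is only a sketch under assumptions: the class name Ps2AsciiArgs and the "gs" placeholder are invented for illustration, argv[0] is assumed to be ignored by gsapi_init_with_args, and whether -sstdout behaves the same through the DLL as it does with the command-line executable has not been verified here.

```csharp
using System;

public static class Ps2AsciiArgs
{
    // Builds an argument vector for the ps2ascii.ps method.
    // Assumption: argv[0] is ignored by gsapi_init_with_args,
    // so a placeholder goes first.
    public static string[] Build(string ps2asciiPath, string inputPath, string outputPath)
    {
        return new[]
        {
            "gs",                           // placeholder for argv[0] (hypothetical name)
            "-sstdout=" + outputPath,       // redirect Ghostscript's stdout to a file
            "-dSIMPLE",
            "-sFONTPATH=c:/windows/fonts",
            "-dNODISPLAY",
            "-dDELAYBIND",
            "-dWRITESYSTEMDICT",
            "-f", ps2asciiPath,             // "-f" and the path are separate elements
            inputPath
        };
    }
}
```

Note that no -sDEVICE=txtwrite or -sOutputFile switch appears here, since the two methods are not meant to be mixed.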
3. Non-Ghostscript methods
Do not ignore other, non-Ghostscript methods that may be easier to work with. All of the following are cross-platform and should be available on Windows too:
- mudraw -t
  GPL licensed (or commercial, if you need). Command-line utility from MuPDF to extract text from PDF (MuPDF is developed by the same group of developers that do Ghostscript).
- pdftotext
  GPL licensed. Command-line utility from Poppler (which is a fork of XPDF, which also provides a pdftotext).
- podofotxtextract
  GPL licensed. Command-line utility based on the PoDoFo PDF processing library.
- TET
  The Text Extraction Toolkit from PDFlib.com (commercial, but may be gratis for personal use -- I didn't check recent news). Probably the most powerful text extraction tool of them all...
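If one of the command-line tools is used instead of the Ghostscript DLL, it could be launched from C# as an external process. The sketch below uses pdftotext as the example and assumes the executable is on the PATH and is called as pdftotext input.pdf output.txt; the class name PdfToTextRunner is made up for illustration.

```csharp
using System;
using System.Diagnostics;

public static class PdfToTextRunner
{
    // Describes a "pdftotext <input> <output>" invocation.
    // Assumption: pdftotext is installed and reachable via PATH.
    public static ProcessStartInfo BuildStartInfo(string inputPath, string outputPath)
    {
        return new ProcessStartInfo
        {
            FileName = "pdftotext",
            // Quote both paths so that spaces in them survive.
            Arguments = "\"" + inputPath + "\" \"" + outputPath + "\"",
            UseShellExecute = false,
            CreateNoWindow = true
        };
    }

    // Runs the tool, waits for it to finish and returns its exit code.
    public static int Run(string inputPath, string outputPath)
    {
        using (Process process = Process.Start(BuildStartInfo(inputPath, outputPath)))
        {
            process.WaitForExit();
            return process.ExitCode;
        }
    }
}
```

A non-zero return value from Run would indicate that the tool reported an error.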
Source: https://stackoverflow.com/questions/11754556/ghostscript-convert-a-pdf-and-output-in-a-textfile