Extract Embedded Image Object in RTF

说谎 2020-12-01 09:05

I have rtf documents that include an embedded object (an image). I need to extract this as an Image object (or any other usable format). I have che

3 Answers
  • 2020-12-01 09:16

    Here is a piece of code that can extract all objects ('Package' class objects) from an RTF stream:

        public static void ExtractPackageObjects(string filePath)
        {
            using (StreamReader sr = new StreamReader(filePath))
            {
                RtfReader reader = new RtfReader(sr);
                IEnumerator<RtfObject> enumerator = reader.Read().GetEnumerator();
                while(enumerator.MoveNext())
                {
                    if (enumerator.Current.Text == "object")
                    {
                        if (RtfReader.MoveToNextControlWord(enumerator, "objclass"))
                        {
                            string className = RtfReader.GetNextText(enumerator);
                            if (className == "Package")
                            {
                                if (RtfReader.MoveToNextControlWord(enumerator, "objdata"))
                                {
                                    byte[] data = RtfReader.GetNextTextAsByteArray(enumerator);
                                    using (MemoryStream packageData = new MemoryStream())
                                    {
                                        RtfReader.ExtractObjectData(new MemoryStream(data), packageData);
                                        packageData.Position = 0;
                                        PackagedObject po = PackagedObject.Extract(packageData);
                                        // Keep only the file name so the output lands in the current directory.
                                        File.WriteAllBytes(Path.GetFileName(po.DisplayName), po.Data);
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    

    And here are the utility classes this code uses. The first is a simple stream-based RTF parser that lets you reach the control words of interest.

    The second extracts data from a serialized Object Packager instance. Object Packager is an OLE 1.0 feature that is almost 20 years old, and its serialized binary format is not documented (to my knowledge), but it is understandable.

    This works fine on the sample you provided, but you may have to adapt it for other documents.

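    using System;
    using System.Collections.Generic;
    using System.Globalization;
    using System.IO;
    using System.Text;
    
    public class RtfReader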
    {
        public RtfReader(TextReader reader)
        {
            if (reader == null)
                throw new ArgumentNullException("reader");
    
            Reader = reader;
        }
    
        public TextReader Reader { get; private set; }
    
        public IEnumerable<RtfObject> Read()
        {
            StringBuilder controlWord = new StringBuilder();
            StringBuilder text = new StringBuilder();
            Stack<RtfParseState> stack = new Stack<RtfParseState>();
            RtfParseState state = RtfParseState.Group;
    
            do
            {
                int i = Reader.Read();
                if (i < 0)
                {
                    if (!string.IsNullOrWhiteSpace(controlWord.ToString()))
                        yield return new RtfControlWord(controlWord.ToString());
    
                    if (!string.IsNullOrWhiteSpace(text.ToString()))
                        yield return new RtfText(text.ToString());
    
                    yield break;
                }
    
                char c = (char)i;
    
                // noise chars
                if ((c == '\r') ||
                    (c == '\n'))
                    continue;
    
                switch (state)
                {
                    case RtfParseState.Group:
                        if (c == '{')
                        {
                            stack.Push(state);
                            break;
                        }
    
                        if (c == '\\')
                        {
                            state = RtfParseState.ControlWord;
                            break;
                        }
                        break;
    
                    case RtfParseState.ControlWord:
                        if (c == '\\')
                        {
                            // another controlWord
                            if (!string.IsNullOrWhiteSpace(controlWord.ToString()))
                            {
                                yield return new RtfControlWord(controlWord.ToString());
                                controlWord.Clear();
                            }
                            break;
                        }
    
                        if (c == '{')
                        {
                            // a new group
                            state = RtfParseState.Group;
                            if (!string.IsNullOrWhiteSpace(controlWord.ToString()))
                            {
                                yield return new RtfControlWord(controlWord.ToString());
                                controlWord.Clear();
                            }
                            break;
                        }
    
                        if (c == '}')
                        {
                            // close group
                            state = stack.Count > 0 ? stack.Pop() : RtfParseState.Group;
                            if (!string.IsNullOrWhiteSpace(controlWord.ToString()))
                            {
                                yield return new RtfControlWord(controlWord.ToString());
                                controlWord.Clear();
                            }
                            break;
                        }
    
                        if (!Char.IsLetterOrDigit(c))
                        {
                            state = RtfParseState.Text;
                            text.Append(c);
                            if (!string.IsNullOrWhiteSpace(controlWord.ToString()))
                            {
                                yield return new RtfControlWord(controlWord.ToString());
                                controlWord.Clear();
                            }
                            break;
                        }
    
                        controlWord.Append(c);
                        break;
    
                    case RtfParseState.Text:
                        if (c == '\\')
                        {
                            state = RtfParseState.EscapedText;
                            break;
                        }
    
                        if (c == '{')
                        {
                            if (!string.IsNullOrWhiteSpace(text.ToString()))
                            {
                                yield return new RtfText(text.ToString());
                                text.Clear();
                            }
    
                            // a new group
                            state = RtfParseState.Group;
                            break;
                        }
    
                        if (c == '}')
                        {
                            if (!string.IsNullOrWhiteSpace(text.ToString()))
                            {
                                yield return new RtfText(text.ToString());
                                text.Clear();
                            }
    
                            // close group
                            state = stack.Count > 0 ? stack.Pop() : RtfParseState.Group;
                            break;
                        }
                        text.Append(c);
                        break;
    
                    case RtfParseState.EscapedText:
                        if ((c == '\\') || (c == '}') || (c == '{'))
                        {
                            state = RtfParseState.Text;
                            text.Append(c);
                            break;
                        }
    
                        // ansi character escape
                        if (c == '\'')
                        {
                            text.Append(FromHexa((char)Reader.Read(), (char)Reader.Read()));
                            break;
                        }
    
                        if (!string.IsNullOrWhiteSpace(text.ToString()))
                        {
                            yield return new RtfText(text.ToString());
                            text.Clear();
                        }
    
                        // in fact, it's a normal controlWord
                        controlWord.Append(c);
                        state = RtfParseState.ControlWord;
                        break;
                }
            }
            while (true);
        }
    
        public static bool MoveToNextControlWord(IEnumerator<RtfObject> enumerator, string word)
        {
            if (enumerator == null)
                throw new ArgumentNullException("enumerator");
    
            while (enumerator.MoveNext())
            {
                if (enumerator.Current.Text == word)
                    return true;
            }
            return false;
        }
    
        public static string GetNextText(IEnumerator<RtfObject> enumerator)
        {
            if (enumerator == null)
                throw new ArgumentNullException("enumerator");
    
            while (enumerator.MoveNext())
            {
                RtfText text = enumerator.Current as RtfText;
                if (text != null)
                    return text.Text;
            }
            return null;
        }
    
        public static byte[] GetNextTextAsByteArray(IEnumerator<RtfObject> enumerator)
        {
            if (enumerator == null)
                throw new ArgumentNullException("enumerator");
    
            while (enumerator.MoveNext())
            {
                RtfText text = enumerator.Current as RtfText;
                if (text != null)
                {
                    List<byte> bytes = new List<byte>();
                    for (int i = 0; i < text.Text.Length; i += 2)
                    {
                        bytes.Add((byte)FromHexa(text.Text[i], text.Text[i + 1]));
                    }
                    return bytes.ToArray();
                }
            }
            return null;
        }
    
        // Extracts an EmbeddedObject/ObjectHeader from a stream.
        // See [MS-OLEDS]: Object Linking and Embedding (OLE) Data Structures for more information,
        // chapter 2.2: OLE1.0 Format Structures.
        public static void ExtractObjectData(Stream inputStream, Stream outputStream)
        {
            if (inputStream == null)
                throw new ArgumentNullException("inputStream");
    
            if (outputStream == null)
                throw new ArgumentNullException("outputStream");
    
            BinaryReader reader = new BinaryReader(inputStream);
            reader.ReadInt32(); // OLEVersion
            int formatId = reader.ReadInt32(); // FormatID
            if (formatId != 2) // see 2.2.4 Object Header. 2 means EmbeddedObject
                throw new NotSupportedException();
    
            ReadLengthPrefixedAnsiString(reader); // className
            ReadLengthPrefixedAnsiString(reader); // topicName
            ReadLengthPrefixedAnsiString(reader); // itemName
    
            int nativeDataSize = reader.ReadInt32();
            byte[] bytes = reader.ReadBytes(nativeDataSize);
            outputStream.Write(bytes, 0, bytes.Length);
        }
    
        // see chapter 2.1.4 LengthPrefixedAnsiString
        private static string ReadLengthPrefixedAnsiString(BinaryReader reader)
        {
            int length = reader.ReadInt32();
            if (length == 0)
                return string.Empty;
    
            byte[] bytes = reader.ReadBytes(length);
            return Encoding.Default.GetString(bytes, 0, length - 1);
        }
    
        private enum RtfParseState
        {
            ControlWord,
            Text,
            EscapedText,
            Group
        }
    
        private static char FromHexa(char hi, char lo)
        {
            return (char)byte.Parse(hi.ToString() + lo, NumberStyles.HexNumber);
        }
    }
    
    // Utility class to parse an OLE1.0 OLEOBJECT
    public class PackagedObject
    {
        private PackagedObject()
        {
        }
    
        public string DisplayName { get; private set; }
        public string IconFilePath { get; private set; }
        public int IconIndex { get; private set; }
        public string FilePath { get; private set; }
        public byte[] Data { get; private set; }
    
        private static string ReadAnsiString(BinaryReader reader)
        {
            StringBuilder sb = new StringBuilder();
            do
            {
                byte b = reader.ReadByte();
                if (b == 0)
                    return sb.ToString();
    
                sb.Append((char)b);
            }
            while (true);
        }
    
        public static PackagedObject Extract(Stream inputStream)
        {
            if (inputStream == null)
                throw new ArgumentNullException("inputStream");
    
            BinaryReader reader = new BinaryReader(inputStream);
            reader.ReadUInt16(); // sig
            PackagedObject po = new PackagedObject();
            po.DisplayName = ReadAnsiString(reader);
            po.IconFilePath = ReadAnsiString(reader);
            po.IconIndex = reader.ReadUInt16();
            int type = reader.ReadUInt16();
            if (type != 3) // 3 is file, 1 is link
                throw new NotSupportedException();
    
            reader.ReadInt32(); // nextsize
            po.FilePath = ReadAnsiString(reader);
            int dataSize = reader.ReadInt32();
            po.Data = reader.ReadBytes(dataSize);
            // note after that, there may be unicode + long path info
            return po;
        }
    }
    
    public class RtfObject
    {
        public RtfObject(string text)
        {
            if (text == null)
                throw new ArgumentNullException("text");
    
            Text = text.Trim();
        }
    
        public string Text { get; private set; }
    }
    
    public class RtfText : RtfObject
    {
        public RtfText(string text)
            : base(text)
        {
        }
    }
    
    public class RtfControlWord : RtfObject
    {
        public RtfControlWord(string name)
            : base(name)
        {
        }
    }
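
    To make the byte layout that ExtractObjectData walks through more concrete, here is a minimal sketch that builds a synthetic OLE 1.0 EmbeddedObject header in memory and reads it back in the same field order. The class and method names (Ole1HeaderSketch, Build, ReadNativeData) are hypothetical and not part of the answer's code; the OLEVersion value is arbitrary, since the reader ignores it.

    ```csharp
    using System;
    using System.IO;
    using System.Text;

    // Hypothetical helper, for illustration only.
    public static class Ole1HeaderSketch
    {
        // Writes a LengthPrefixedAnsiString: a 4-byte length that counts the
        // terminating NUL, then the characters, then the NUL itself.
        private static void WriteAnsi(BinaryWriter w, string s)
        {
            byte[] bytes = Encoding.ASCII.GetBytes(s);
            w.Write(bytes.Length + 1);
            w.Write(bytes);
            w.Write((byte)0);
        }

        // Mirrors ReadLengthPrefixedAnsiString from the answer (ASCII assumed here).
        private static string ReadAnsi(BinaryReader r)
        {
            int length = r.ReadInt32();
            if (length == 0)
                return string.Empty;

            byte[] bytes = r.ReadBytes(length);
            return Encoding.ASCII.GetString(bytes, 0, length - 1);
        }

        // Builds a synthetic EmbeddedObject in the field order the answer reads.
        public static byte[] Build(string className, byte[] nativeData)
        {
            using (MemoryStream ms = new MemoryStream())
            using (BinaryWriter w = new BinaryWriter(ms))
            {
                w.Write(0x00000501);        // OLEVersion (arbitrary, ignored on read)
                w.Write(2);                 // FormatID: 2 means EmbeddedObject
                WriteAnsi(w, className);    // className
                w.Write(0);                 // topicName: empty (length 0, no bytes)
                w.Write(0);                 // itemName: empty
                w.Write(nativeData.Length); // native data size
                w.Write(nativeData);        // native data
                w.Flush();
                return ms.ToArray();
            }
        }

        // Reads the header back and returns the native data payload.
        public static byte[] ReadNativeData(byte[] header, out string className)
        {
            using (BinaryReader r = new BinaryReader(new MemoryStream(header)))
            {
                r.ReadInt32();              // OLEVersion
                if (r.ReadInt32() != 2)     // FormatID
                    throw new NotSupportedException();

                className = ReadAnsi(r);
                ReadAnsi(r);                // topicName
                ReadAnsi(r);                // itemName
                int size = r.ReadInt32();
                return r.ReadBytes(size);
            }
        }
    }
    ```

    Round-tripping a small payload through Build and ReadNativeData is a quick way to check changes to the parser without needing a real RTF file.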
    
  • 2020-12-01 09:20

    OK, this should work for you. To demonstrate my solution, I created a WinForms project with a PictureBox whose paint event handler was mapped to the following function:

        private void rtfImage_Paint(object sender, PaintEventArgs e)
        {
            string rtfStr = System.IO.File.ReadAllText("MySampleFile.rtf");
            string imageDataHex = ExtractImgHex(rtfStr);
            byte[] imageBuffer = ToBinary(imageDataHex);
            // Image.FromStream requires the stream to stay open for the lifetime
            // of the image, so draw before either one is disposed.
            using (MemoryStream stream = new MemoryStream(imageBuffer))
            using (Image image = Image.FromStream(stream))
            {
                Rectangle rect = new Rectangle(0, 0, 100, 100);
                e.Graphics.DrawImage(image, rect);
            }
        }
    

    This code relies on the System.Drawing.Image.FromStream() method, along with two "helper" functions:

    A string extractor:

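        string ExtractImgHex(string s)
        {
            // I'm sure you could use regex here, but this works.
            // This assumes one picture per file; loops required otherwise
            int pictTagIdx = s.IndexOf("{\\pict\\");
            if (pictTagIdx < 0)
                throw new InvalidOperationException("No \\pict group found.");
            // The hex data starts after the first space following the \pict control words.
            int startIndex = s.IndexOf(" ", pictTagIdx) + 1;
            int endIndex = s.IndexOf("}", startIndex);
            if (startIndex == 0 || endIndex < 0)
                throw new InvalidOperationException("Malformed \\pict group.");
            return s.Substring(startIndex, endIndex - startIndex);
        }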
    

    ... and a binary converter:

        public static byte[] ToBinary(string imageDataHex)
        {
            //this function taken entirely from:
            // http://www.codeproject.com/Articles/27431/Writing-Your-Own-RTF-Converter
            if (imageDataHex == null)
            {
                throw new ArgumentNullException("imageDataHex");
            }
    
            int hexDigits = imageDataHex.Length;
            int dataSize = hexDigits / 2;
            byte[] imageDataBinary = new byte[dataSize];
    
            StringBuilder hex = new StringBuilder(2);
    
            int dataPos = 0;
            for (int i = 0; i < hexDigits; i++)
            {
                char c = imageDataHex[i];
                if (char.IsWhiteSpace(c))
                {
                    continue;
                }
                hex.Append(imageDataHex[i]);
                if (hex.Length == 2)
                {
                    imageDataBinary[dataPos] = byte.Parse(hex.ToString(), System.Globalization.NumberStyles.HexNumber);
                    dataPos++;
                    hex.Remove(0, 2);
                }
            }
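            // Whitespace in the input makes the initial size estimate too large; trim the unused tail.
            if (dataPos != dataSize)
            {
                Array.Resize(ref imageDataBinary, dataPos);
            }
            return imageDataBinary;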
        }
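
    ExtractImgHex above assumes one picture per file. For files with several pictures, a loop along the following lines might work. This is only a sketch: the names PictScanner and ExtractAll are hypothetical, and the pattern assumes each \pict group holds only control words followed by hex data. Groups that contain nested groups (some writers emit ones such as \*\picprop) will not match and would need a real parser.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Globalization;
    using System.Text.RegularExpressions;

    // Hypothetical helper, for illustration only.
    public static class PictScanner
    {
        // "{\pict", then any number of control words (letters plus an optional
        // signed number), then a run of hex digits and whitespace, then "}".
        private static readonly Regex PictGroup = new Regex(
            @"\{\\pict(?:\s*\\[a-zA-Z]+-?\d*)*([0-9a-fA-F\s]+)\}",
            RegexOptions.Compiled);

        // Returns the decoded bytes of every simple \pict group found in the RTF text.
        public static List<byte[]> ExtractAll(string rtf)
        {
            if (rtf == null)
                throw new ArgumentNullException("rtf");

            List<byte[]> images = new List<byte[]>();
            foreach (Match m in PictGroup.Matches(rtf))
            {
                // Drop spaces and line breaks that writers insert into the hex run.
                string hex = Regex.Replace(m.Groups[1].Value, @"\s+", "");
                if (hex.Length == 0 || hex.Length % 2 != 0)
                    continue; // skip empty or odd-length runs rather than guess

                byte[] data = new byte[hex.Length / 2];
                for (int i = 0; i < data.Length; i++)
                {
                    data[i] = byte.Parse(hex.Substring(i * 2, 2), NumberStyles.HexNumber);
                }
                images.Add(data);
            }
            return images;
        }
    }
    ```

    Each returned array can then be wrapped in a MemoryStream and passed to Image.FromStream, as in the paint handler above.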
    
  • 2020-12-01 09:26

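    The code below can extract all types of embedded objects (images, documents, mail messages, and so on) with their original file names and save them to a local path. The Document, Shape, and OleFormat types appear to come from the Aspose.Words library, although the answer does not say so explicitly.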

    string MyDir = @"E:\temp\";
    Document doc = new Document(MyDir + "Requirement#4.rtf");
    
    NodeCollection nodeColl = doc.GetChildNodes(NodeType.Shape, true);
    foreach (var node in nodeColl)
    {
        Shape shape1 = (Shape)node;
        if (shape1.OleFormat != null)
        {
            shape1.OleFormat.Save(MyDir + shape1.OleFormat.SuggestedFileName + shape1.OleFormat.SuggestedExtension);
        }
    }
