How can I read a Lync conversation file containing HTML?

↘锁芯ラ 提交于 2019-12-05 13:46:26

Your task would be easier if you could use an API or the SDK or even would have a description of the format you try to read. However the binary format looks not to be that complicated and with an hexviewer installed I got this far to get the html out of the example you provided.

To parse non-text files you fall-back to the BinaryReader and then use one of the Read methods to read the correct type from the bytestream. I used ReadByte and ReadInt32. Notice how in the description of the method is explained how many bytes are read. That becomes handy when you try to decipher your file.

    private string ParseHist(string file)
    {
        using (var f = File.Open(file, FileMode.Open))
        {
            using (var br = new BinaryReader(f))
            {
                // read 4 bytes as an int
                var first = br.ReadInt32();
                // read integer / zero ended byte arrays as string
                var lead = br.ReadInt32();
                // until we have 4 zero bytes
                while (lead != 0)
                {
                    var user = ParseString(br);
                    Trace.Write(lead);
                    Trace.Write(":");
                    Trace.Write(user.Length);
                    Trace.Write(":");
                    Trace.WriteLine(user);
                    lead = br.ReadInt32();
                    // weird special case
                    if (lead == 2)
                    {
                        lead = br.ReadInt32();
                    }
                }

                // at the start of the html block
                var htmllen = br.ReadInt32();
                Trace.WriteLine(htmllen);
                // parse the html
                var html = ParseString(br);
                Trace.Write(len);
                Trace.Write(":");
                Trace.Write(html.Length);
                Trace.Write(":");
                Trace.WriteLine(html);
                // other structures follow, left unparsed

                return html.ToString();
            }
        }
    }

    // a string seems to be ascii encoded and ends with a zero byte.
    private static string ParseString(BinaryReader br)
    {
        var ch = br.ReadByte();
        var sb = new StringBuilder();
        while (ch != 0)
        {
            sb.Append((char)ch);
            ch = br.ReadByte();
        }
        return sb.ToString();
    }

You could use the simple parsing logic in a winform application as follows:

    private void button1_Click(object sender, EventArgs e)
    {
        webBrowser1.DocumentText = ParseHist(@"5461EC8C-89E6-40D1-8525-774340083829-Copia.html");
    }

Keep in mind that this is not bullet proof or the recommended way but it should get you started. For files that don't parse well you'll need to go back to the hexviewer and work-out what other byte structures are new or different from what you already had. That is not something I intend to help you with, that is left as an exercise for you to figure out.

I don't know if it's the right way to answer this, but here's what I've managed to do so far:

        string file = @"C:\script_test\{1C0365BC-54C6-4D31-A1C1-586C4575F9EA}.hist";
                    string outText = "";
        //Encoding iso = Encoding.GetEncoding("ISO-8859-1");
        Encoding utf8 = Encoding.UTF8;
        StreamReader reader = new StreamReader(file, utf8);
        char[] text = reader.ReadToEnd().ToCharArray();
        //skip first n chars
        /*
        for (int i = 250; i < text.Length; i++)
        {
            outText += text[i];
        }
        */
        for (int i = 0; i < text.Length; i++)
        {
            //skips non printable characters
            if (!Char.IsControl(text[i]))
            {
                outText += text[i];
            }
        }
        string source = "";
        source = WebUtility.HtmlDecode(outText);
        HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
        htmlDoc.LoadHtml(source);

        string html = "<html><style>";
        foreach (HtmlNode node in htmlDoc.DocumentNode.SelectNodes("//style"))
        {
            html += node.InnerHtml+ Environment.NewLine;
        }
        html += "</style><body>";
        foreach (HtmlNode node in htmlDoc.DocumentNode.SelectNodes("//body"))
        {
            html += node.InnerHtml + Environment.NewLine;
        }
        html += "</body></html>";
        richTextBox1.Text += html+Environment.NewLine;

        webBrowser1.DocumentText = html;

The conversation displays correctly, both style and encoding.

So it's a start for me.

Thank you all for the support!

EDIT

Char.IsControl(char)

skips non printable characters :)

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!