How can I read a Lync conversation file containing HTML?

I'm having trouble reading a local file, into a string, in c#.

Here's what I came up with till now:

 string file = @"C:\script_test\{5461EC8C-89E6-40D1-8525-774340083829}.html";
 using (StreamReader reader = new StreamReader(file))
 {
      string line = "";
      while ((line = reader.ReadLine()) != null)
      {
           textBox1.Text += line.ToString();
      }
 }

And it's the only solution that seems to work.

I've tried some other suggested methods for reading a file, such as:

string file = @"C:\script_test\{5461EC8C-89E6-40D1-8525-774340083829}.html";
string html = File.ReadAllText(file).ToString();
textBox1.Text += html;

Yet it does not work as expected.

Here are the first few lines of the file i'm trying to read:

as you can see, it has some funky characters, honestly I don't know if that's the cause of this weird behavior.

But in the first case, the code seems to skip those lines, printing only "Document generated by Office Communicator..."

Your task would be easier if you could use an API or the SDK or even would have a description of the format you try to read. However the binary format looks not to be that complicated and with an hexviewer installed I got this far to get the html out of the example you provided.

To parse non-text files you fall-back to the BinaryReader and then use one of the Read methods to read the correct type from the bytestream. I used ReadByte and ReadInt32. Notice how in the description of the method is explained how many bytes are read. That becomes handy when you try to decipher your file.

    private string ParseHist(string file)
    {
        using (var f = File.Open(file, FileMode.Open))
        {
            using (var br = new BinaryReader(f))
            {
                // read 4 bytes as an int
                var first = br.ReadInt32();
                // read integer / zero ended byte arrays as string
                var lead = br.ReadInt32();
                // until we have 4 zero bytes
                while (lead != 0)
                {
                    var user = ParseString(br);
                    Trace.Write(lead);
                    Trace.Write(":");
                    Trace.Write(user.Length);
                    Trace.Write(":");
                    Trace.WriteLine(user);
                    lead = br.ReadInt32();
                    // weird special case
                    if (lead == 2)
                    {
                        lead = br.ReadInt32();
                    }
                }

                // at the start of the html block
                var htmllen = br.ReadInt32();
                Trace.WriteLine(htmllen);
                // parse the html
                var html = ParseString(br);
                Trace.Write(len);
                Trace.Write(":");
                Trace.Write(html.Length);
                Trace.Write(":");
                Trace.WriteLine(html);
                // other structures follow, left unparsed

                return html.ToString();
            }
        }
    }

    // a string seems to be ascii encoded and ends with a zero byte.
    private static string ParseString(BinaryReader br)
    {
        var ch = br.ReadByte();
        var sb = new StringBuilder();
        while (ch != 0)
        {
            sb.Append((char)ch);
            ch = br.ReadByte();
        }
        return sb.ToString();
    }

You could use the simple parsing logic in a winform application as follows:

    private void button1_Click(object sender, EventArgs e)
    {
        webBrowser1.DocumentText = ParseHist(@"5461EC8C-89E6-40D1-8525-774340083829-Copia.html");
    }

Keep in mind that this is not bullet proof or the recommended way but it should get you started. For files that don't parse well you'll need to go back to the hexviewer and work-out what other byte structures are new or different from what you already had. That is not something I intend to help you with, that is left as an exercise for you to figure out.

I don't know if it's the right way to answer this, but here's what I've managed to do so far:

        string file = @"C:\script_test\{1C0365BC-54C6-4D31-A1C1-586C4575F9EA}.hist";
                    string outText = "";
        //Encoding iso = Encoding.GetEncoding("ISO-8859-1");
        Encoding utf8 = Encoding.UTF8;
        StreamReader reader = new StreamReader(file, utf8);
        char[] text = reader.ReadToEnd().ToCharArray();
        //skip first n chars
        /*
        for (int i = 250; i < text.Length; i++)
        {
            outText += text[i];
        }
        */
        for (int i = 0; i < text.Length; i++)
        {
            //skips non printable characters
            if (!Char.IsControl(text[i]))
            {
                outText += text[i];
            }
        }
        string source = "";
        source = WebUtility.HtmlDecode(outText);
        HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
        htmlDoc.LoadHtml(source);

        string html = "<html><style>";
        foreach (HtmlNode node in htmlDoc.DocumentNode.SelectNodes("//style"))
        {
            html += node.InnerHtml+ Environment.NewLine;
        }
        html += "</style><body>";
        foreach (HtmlNode node in htmlDoc.DocumentNode.SelectNodes("//body"))
        {
            html += node.InnerHtml + Environment.NewLine;
        }
        html += "</body></html>";
        richTextBox1.Text += html+Environment.NewLine;

        webBrowser1.DocumentText = html;

The conversation displays correctly, both style and encoding.

So it's a start for me.

Thank you all for the support!

EDIT

Char.IsControl(char)

skips non printable characters :)

来源：https://stackoverflow.com/questions/31649545/how-can-i-read-a-lync-conversation-file-containing-html

标签

file

lync