Remove Duplicate Lines From Text File?

Backend · Open · 5 answers · 2092 views
Asked by 孤街浪徒 on 2020-12-09 05:58

Given an input file of text lines, I want duplicate lines to be identified and removed. Please show a simple snippet of C# that accomplishes this.

5 Answers
  • 2020-12-09 06:32

    For small files:

    string[] lines = File.ReadAllLines("filename.txt");
    File.WriteAllLines("filename.txt", lines.Distinct().ToArray());
    
  • 2020-12-09 06:33

I am new to .NET and have written something simpler; it may not be very efficient. Please feel free to share your thoughts.

    class Program
    {
        static void Main(string[] args)
        {
            string[] emp_names = File.ReadAllLines("D:\\Employee Names.txt");
            List<string> newemp1 = new List<string>();
    
            for (int i = 0; i < emp_names.Length; i++)
            {
                newemp1.Add(emp_names[i]);  //passing data to newemp1 from emp_names
            }
    
            for (int i = 0; i < emp_names.Length; i++)
            {
                List<string> temp = new List<string>();
                int duplicate_count = 0;
    
                for (int j = newemp1.Count - 1; j >= 0; j--)
                {
                    if (emp_names[i] != newemp1[j])  //checking for duplicate records
                        temp.Add(newemp1[j]);
                    else
                    {
                        duplicate_count++;
                        if (duplicate_count == 1)
                            temp.Add(emp_names[i]);
                    }
                }
                newemp1 = temp;
            }
            string[] newemp = newemp1.ToArray();  //assigning into a string array
            Array.Sort(newemp);
            File.WriteAllLines("D:\\Employee Names.txt", newemp); //now writing the data to a text file
            Console.ReadLine();
        }
    }
    
  • 2020-12-09 06:39

Here's a streaming approach that should incur less overhead than reading all unique strings into memory. Be aware that it keeps only each line's hash code, so two distinct lines whose hashes collide will be treated as duplicates and the second will be dropped; if that is unacceptable, store the lines themselves (as the last answer does).

        using (var sr = new StreamReader(File.OpenRead(@"C:\Temp\in.txt")))
        using (var sw = new StreamWriter(File.OpenWrite(@"C:\Temp\out.txt")))
        {
            // Only the hash codes are kept in memory, not the lines.
            var lines = new HashSet<int>();
            while (!sr.EndOfStream)
            {
                string line = sr.ReadLine();
                // Add returns false if the hash code was already present.
                if (lines.Add(line.GetHashCode()))
                    sw.WriteLine(line);
            }
        }
    
  • 2020-12-09 06:40

    For a long file (with non-consecutive duplicates) I'd copy the file line by line, building a hash → position lookup table as I went.

    As each line is copied, check its hash against the table; on a hash match, double-check that the line really is the same before skipping it, then move on to the next.

    Only worth it for fairly large files though.
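A minimal sketch of that idea (the file names are hypothetical, and line indices stand in for byte positions to keep the re-read simple):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class HashLookupDeDuper
{
    // Copies unique lines from inputPath to outputPath while keeping only a
    // hash -> line-index lookup in memory.  On a hash match, the earlier line
    // is re-read from the input file to rule out a mere hash collision.
    public static void DeDupe(string inputPath, string outputPath)
    {
        var seen = new Dictionary<int, List<int>>();

        using (StreamWriter writer = File.CreateText(outputPath))
        {
            int index = 0;
            foreach (string line in File.ReadLines(inputPath))
            {
                int hash = line.GetHashCode();
                bool duplicate = false;
                if (seen.TryGetValue(hash, out List<int> earlier))
                {
                    // Same hash seen before: compare the actual text.
                    foreach (int i in earlier)
                    {
                        if (File.ReadLines(inputPath).Skip(i).First() == line)
                        {
                            duplicate = true;
                            break;
                        }
                    }
                }
                if (!duplicate)
                {
                    if (earlier == null)
                        seen[hash] = earlier = new List<int>();
                    earlier.Add(index);
                    writer.WriteLine(line);
                }
                index++;
            }
        }
    }
}
```

A production version would record byte offsets and seek directly, rather than re-enumerating the input file on every collision check.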

  • 2020-12-09 06:48

    This should do (and will cope with large files).

    Note that it only removes duplicate consecutive lines, i.e.

    a
    b
    b
    c
    b
    d
    

    will end up as

    a
    b
    c
    b
    d
    

    If you want no duplicates anywhere, you'll need to keep a set of lines you've already seen.

    using System;
    using System.IO;
    
    class DeDuper
    {
        static void Main(string[] args)
        {
            if (args.Length != 2)
            {
                Console.WriteLine("Usage: DeDuper <input file> <output file>");
                return;
            }
            using (TextReader reader = File.OpenText(args[0]))
            using (TextWriter writer = File.CreateText(args[1]))
            {
                string currentLine;
                string lastLine = null;
    
                while ((currentLine = reader.ReadLine()) != null)
                {
                    if (currentLine != lastLine)
                    {
                        writer.WriteLine(currentLine);
                        lastLine = currentLine;
                    }
                }
            }
        }
    }
    

    Note that this assumes Encoding.UTF8, and that you want to use files. It's easy to generalize as a method though:

    static void CopyLinesRemovingConsecutiveDupes
        (TextReader reader, TextWriter writer)
    {
        string currentLine;
        string lastLine = null;
    
        while ((currentLine = reader.ReadLine()) != null)
        {
            if (currentLine != lastLine)
            {
                writer.WriteLine(currentLine);
                lastLine = currentLine;
            }
        }
    }
    

    (Note that that doesn't close anything - the caller should do that.)
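For instance, a caller could supply in-memory streams and handle disposal itself (the method is repeated here, made public, so the snippet stands alone; the sample input is illustrative):

```csharp
using System;
using System.IO;

class ConsecutiveDupeDemo
{
    // Same method as above, unchanged apart from being public.
    public static void CopyLinesRemovingConsecutiveDupes(TextReader reader, TextWriter writer)
    {
        string currentLine;
        string lastLine = null;

        while ((currentLine = reader.ReadLine()) != null)
        {
            if (currentLine != lastLine)
            {
                writer.WriteLine(currentLine);
                lastLine = currentLine;
            }
        }
    }

    static void Main()
    {
        // The caller owns both streams, so it is responsible for disposal.
        using (TextReader reader = new StringReader("a\nb\nb\nc\nb\nd"))
        using (TextWriter writer = new StringWriter())
        {
            CopyLinesRemovingConsecutiveDupes(reader, writer);
            Console.Write(writer.ToString());
        }
    }
}
```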

    Here's a version that will remove all duplicates, rather than just consecutive ones:

    // Needs using System.Collections.Generic; for HashSet<string>.
    static void CopyLinesRemovingAllDupes(TextReader reader, TextWriter writer)
    {
        string currentLine;
        HashSet<string> previousLines = new HashSet<string>();
    
        while ((currentLine = reader.ReadLine()) != null)
        {
            // Add returns true if it was actually added,
            // false if it was already there
            if (previousLines.Add(currentLine))
            {
                writer.WriteLine(currentLine);
            }
        }
    }
    