
Identical File Finder

Discussion in 'Programming & Webmastering' started by xbonez, Feb 12, 2012.

  1. xbonez New Member

    Joined:
    Nov 29, 2010
    Messages:
    1,182 (0.83/day)
    Thanks Received:
    315
    Location:
    Philly, PA (US)
    I needed to sort through a directory which contained about 500 or so files (many of which were identical in content), and pick out only the unique files from the directory. I ended up writing a short program in C# to do the task for me. Posting the executable and code below, should anyone ever require it.

    Summary

    The program iterates over all the files in the input directory and checks for uniqueness by calculating the MD5 hash of the file contents. All unique files are copied over to the output directory.

    Executable: http://dl.dropbox.com/u/1276196/DND/IdenticalFileFinder.exe

    Usage

    The program expects two command line arguments.
    The first argument is the input directory.
    The second argument is the output directory where the unique files will get copied.

    A trailing slash is optional.

    Source

    Code:
    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Security.Cryptography;
    using System.Text;
    
    namespace IdenticalFileFinder
    {
        class Program
        {
            static string baseDir = String.Empty;
            static string outputDir = String.Empty;
    
            static void Main(string[] args)
            {
                getArgs(args);
                normalizeDirectories();
    
                List<string> hashes = new List<string>();
                string[] files = Directory.GetFiles(baseDir);
    
                if (!Directory.Exists(outputDir))
                {
                    try
                    {
                        Directory.CreateDirectory(outputDir);
                    }
                    catch(Exception e)
                    {
                        Console.WriteLine(e.ToString());
                        Environment.Exit((int)ExitScenarios.READWRITEERROR);
                    }
                }
    
                // Copy a file to the output directory only the first time its content hash is seen.
                foreach (string fileName in files)
                {
                    string hash = GetMD5HashFromFile(fileName);
    
                    if (!hashes.Contains(hash))
                    {
                        string dest = outputDir + Path.GetFileName(fileName);
                        hashes.Add(hash);
                        try
                        {
                            File.Copy(fileName, dest);
                        }
                        catch (Exception e)
                        {
                            Console.WriteLine(e.ToString());
                            Environment.Exit((int)ExitScenarios.READWRITEERROR);
                        }
                    }
                }
                Console.WriteLine(String.Format("{0} unique files found in {1} total files.",hashes.Count,files.Length));
                Environment.Exit((int)ExitScenarios.SUCCESS);
            }
    
            public static void normalizeDirectories()
            {            
                if (baseDir[baseDir.Length - 1] != '\\')
                    baseDir += "\\";
                if (outputDir[outputDir.Length - 1] != '\\')
                    outputDir += "\\";
            }       
    
            public static void getArgs(string[] args)
            {
                if (args.Length == 2)
                {
                    baseDir = args[0];
                    outputDir = args[1];
                }
                else
                {
                    Console.WriteLine("Incorrect arguments passed in. First argument should be input directory and second should be output directory.");
                    Environment.Exit((int)ExitScenarios.ARGFAIL);
                }
    
                if (!Directory.Exists(baseDir)) 
                {
                    Console.WriteLine("Input directory not found.");
                    Environment.Exit((int)ExitScenarios.ARGFAIL);
                }
            }
    
            // Computes the MD5 hash of a file's contents and returns it as a lowercase hex string.
            static public string GetMD5HashFromFile(string fileName)
            {
                byte[] retVal;
                // Open read-only and dispose the stream and hasher even if ComputeHash throws.
                using (FileStream file = new FileStream(fileName, FileMode.Open, FileAccess.Read))
                using (MD5 md5 = new MD5CryptoServiceProvider())
                {
                    retVal = md5.ComputeHash(file);
                }

                // Convert the 16 hash bytes into a 32-character hex string.
                StringBuilder sb = new StringBuilder();
                for (int i = 0; i < retVal.Length; i++)
                {
                    sb.Append(retVal[i].ToString("x2"));
                }
                return sb.ToString();
            }
        }
    
        public enum ExitScenarios
        {
            SUCCESS,
            ARGFAIL,
            READWRITEERROR
        }
    }
    
     
    Last edited: Feb 12, 2012
  2. Kreij

    Kreij Senior Monkey Moderator Staff Member

    Joined:
    Feb 6, 2007
    Messages:
    13,881 (4.92/day)
    Thanks Received:
    5,616
    Location:
    Cheeseland (Wisconsin, USA)
    What if two different files generate the same hash? (I know it's unlikely, but it is possible.)
     
    xbonez says thanks.
  3. FordGT90Concept

    FordGT90Concept "I go fast!1!11!1!"

    Joined:
    Oct 13, 2008
    Messages:
    13,830 (6.26/day)
    Thanks Received:
    3,700
    Location:
    IA, USA
    Neat idea but...I would recommend doing a 1:1 compare...

    1. If filesize is equal, move on to step two; else, not identical.
    2. Read one chunk (4096 bytes) from file A and file B into separate buffers.
    3. Use a for loop to make sure all bytes are equal: fileAbuffer == fileBbuffer
    4. If an inequality is ever discovered, break the loop, close the file streams, and output a non-match. If they are equal, rinse and repeat until end of file is reached.

    Using this method, you could even have a percentage-match option where you count the differences and use the filesize as the denominator in creating the percent. If the result is not above a user-specified percentage (100% if not given), reject it (a rough sketch of this is below).


    The method described above would likely be faster than md5.ComputeHash(file) on large files because it could stop on the first byte.
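
    A rough sketch of that chunked compare with the percent-match option (the class name, method names, ReadBlock helper, and default threshold are illustrative, not from any posted code):

    Code:
    using System.IO;
    
    static class CompareSketch
    {
        // Chunked compare with an optional percent-match threshold.
        // requiredMatch = 1.0 means every byte must match; 0.95 tolerates 5% differing bytes.
        public static bool FilesMatch(string pathA, string pathB, double requiredMatch = 1.0)
        {
            FileInfo infoA = new FileInfo(pathA);
            FileInfo infoB = new FileInfo(pathB);
            if (infoA.Length != infoB.Length)
                return false;                               // step 1: sizes differ, not identical
            if (infoA.Length == 0)
                return true;                                // two empty files are identical
    
            long allowed = (long)((1.0 - requiredMatch) * infoA.Length);
            long differences = 0;
    
            byte[] bufA = new byte[4096];
            byte[] bufB = new byte[4096];
    
            using (FileStream fsA = File.OpenRead(pathA))
            using (FileStream fsB = File.OpenRead(pathB))
            {
                int read;
                while ((read = ReadBlock(fsA, bufA)) > 0)   // step 2: one chunk from each file
                {
                    ReadBlock(fsB, bufB);                   // same file size => same chunk length
                    for (int i = 0; i < read; i++)          // step 3: byte-for-byte compare
                    {
                        if (bufA[i] != bufB[i] && ++differences > allowed)
                            return false;                   // step 4: bail out as soon as a match is impossible
                    }
                }
            }
            return true;
        }
    
        // Stream.Read may return fewer bytes than requested, so loop until the buffer is full or EOF.
        static int ReadBlock(Stream s, byte[] buffer)
        {
            int total = 0, n;
            while (total < buffer.Length && (n = s.Read(buffer, total, buffer.Length - total)) > 0)
                total += n;
            return total;
        }
    }
    With requiredMatch left at 1.0 it is a plain identical-or-not check and stops on the first differing byte.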
     
    Last edited: Mar 20, 2012
    xbonez says thanks.
    Crunching for Team TPU
  4. xbonez New Member

    Joined:
    Nov 29, 2010
    Messages:
    1,182 (0.83/day)
    Thanks Received:
    315
    Location:
    Philly, PA (US)
    @Kreij: It'll probably crash and burn in the case of a hash collision...lol. I didn't account for that.

    @Ford: Those are some good suggestions. When I first wrote the program, I just needed to filter through the 500 files I had, so I threw together the first thing that came to mind. I might redo it (taking some of your suggestions) and package it into a WPF app.
     
  5. Dolph New Member

    Joined:
    Feb 8, 2012
    Messages:
    114 (0.11/day)
    Thanks Received:
    11
    I know this thread is a few weeks old, and correct me if I'm wrong, but

    Code:
         public static void normalizeDirectories()
            {            
                if (baseDir[baseDir.Length - 1] != '\\')
                    baseDir += "\\";
                if (outputDir[outputDir.Length - 1] != '\\')
                    outputDir += "\\";
            }       
    
    
    Shouldn't you be using a substring for that? Considering you're taking the last character of the string and comparing it to a two-character string, i.e. '\\', that will always be true. '\' != '\\' = true

    In the case of the user entering a path with '\\' or even '\', it will add another '\\'.

    EX: program.exe c:\Projects\ d:\

    Wouldn't that see it, and correct it to c:\Projects\\\ d:\\\ ?
     
  6. FordGT90Concept

    FordGT90Concept "I go fast!1!11!1!"

    Joined:
    Oct 13, 2008
    Messages:
    13,830 (6.26/day)
    Thanks Received:
    3,700
    Location:
    IA, USA
    It is correct. That's how C# escapes backslashes. It makes sure that both directories always end with a backslash. .NET's directory methods like to omit it.
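
    For anyone following along, a quick self-contained illustration of the escaping (not from the original program):

    Code:
    using System;
    
    class EscapeDemo
    {
        static void Main()
        {
            // '\\' is a char literal for a single backslash; in source, "\\" is a one-character string.
            string dir = "c:\\Projects\\";                    // at runtime: c:\Projects\
            Console.WriteLine(dir.Length);                    // 12
            Console.WriteLine(dir[dir.Length - 1] == '\\');   // True -- the same check normalizeDirectories() does
            Console.WriteLine(@"c:\Projects\" == dir);        // True -- verbatim string, no escaping needed
        }
    }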
     
    Crunching for Team TPU
  7. Dolph New Member

    Joined:
    Feb 8, 2012
    Messages:
    114 (0.11/day)
    Thanks Received:
    11
    Ahh, my apologies.
     
  8. Aquinus

    Aquinus Resident Wat-man

    Joined:
    Jan 28, 2012
    Messages:
    6,497 (6.45/day)
    Thanks Received:
    2,197
    Location:
    Concord, NH
    Where is the rest of the source? I might be blind, but I'm not seeing where the hash is generated. You could use a different hashing algorithm or a combination of two to mitigate collisions. Maybe an MD5+SHA1 or a SHA256. SHA256 will run fast on newer Sandy Bridge and practically any AMD chip (AMD seems to do encryption pretty well, even against SB.)

    Also, if the directory gets too big, the List may be inefficient to use. There are a lot of things you can do to speed this up, but I don't know how fast it is already. For *really* large directories, using fixed-size arrays stored in a 2D array, along with threading to process the even and odd items of one array, could speed it up, since not a whole lot of locking is required (just on the output "list").

    Good work though; it's been some time since I've written any C#.

    Have you stored any numbers to see how many files your application can do per second? You could use that to judge improvements or slowdowns in your application.
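
    On the MD5+SHA1 / SHA-256 idea above: swapping the algorithm in the OP's helper is nearly a drop-in change. A minimal sketch (the method name is mine, and it assumes the same usings as the OP's source plus System.Security.Cryptography):

    Code:
    // Same shape as GetMD5HashFromFile in the OP, but with SHA-256: a longer digest and much
    // stronger collision resistance, at the cost of a bit more CPU time per file.
    static public string GetSHA256HashFromFile(string fileName)
    {
        using (FileStream file = new FileStream(fileName, FileMode.Open, FileAccess.Read))
        using (SHA256 sha256 = new SHA256Managed())
        {
            byte[] hash = sha256.ComputeHash(file);
    
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < hash.Length; i++)
                sb.Append(hash[i].ToString("x2"));
            return sb.ToString();
        }
    }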
     
    Last edited: Mar 19, 2012
  9. Kreij

    Kreij Senior Monkey Moderator Staff Member

    Joined:
    Feb 6, 2007
    Messages:
    13,881 (4.92/day)
    Thanks Received:
    5,616
    Location:
    Cheeseland (Wisconsin, USA)
    The hash is being generated in the GetMD5HashFromFile() method.

    The odds of a hash collision are so infinitesimally small that it's not really worth testing for it.

    Lists (the List class) are extremely fast for sorting and searching. What would you change, Aquinus?
     
  10. Aquinus

    Aquinus Resident Wat-man

    Joined:
    Jan 28, 2012
    Messages:
    6,497 (6.45/day)
    Thanks Received:
    2,197
    Location:
    Concord, NH
    I would use a 2D array and define a new "row" of static size every time the previous row is filled. That way you're saving space compared to the List class and you're not resizing the array every time you exceed the maximum size. The overhead is small and you stick with just what you need. The List class has a lot of overhead to expand it beyond a standard array, where a regular array should do just as well.

    If the OP were to post the full source, I wouldn't mind trying to alter it. I would have to install Visual Studio again, and I wouldn't be able to get to it until I get back from work later today, which also depends on how fussy my daughter is when I get home.

    It is my understanding that the List is being used just to store values, not sort them, in this instance. I'm just saying it can be optimized if it takes a long time for large numbers of files.
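
    A minimal sketch of that fixed-size-rows idea (the class and member names are mine, purely illustrative and unbenchmarked):

    Code:
    using System.Collections.Generic;
    
    // Strings go into fixed-size "rows"; when the current row fills up, a new one is allocated.
    // Existing rows are never copied or resized, unlike a plain array (or an under-sized List<T>)
    // that has to reallocate and copy its backing store as it grows.
    class ChunkedStore
    {
        const int RowSize = 1024;
        readonly List<string[]> rows = new List<string[]>();
        int usedInLastRow = RowSize;   // forces a new row on the first Add
    
        public void Add(string value)
        {
            if (usedInLastRow == RowSize)
            {
                rows.Add(new string[RowSize]);
                usedInLastRow = 0;
            }
            rows[rows.Count - 1][usedInLastRow++] = value;
        }
    
        public bool Contains(string value)
        {
            for (int r = 0; r < rows.Count; r++)
            {
                int limit = (r == rows.Count - 1) ? usedInLastRow : RowSize;
                for (int i = 0; i < limit; i++)
                    if (rows[r][i] == value)
                        return true;
            }
            return false;
        }
    }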
     
  11. Kreij

    Kreij Senior Monkey Moderator Staff Member

    Joined:
    Feb 6, 2007
    Messages:
    13,881 (4.92/day)
    Thanks Received:
    5,616
    Location:
    Cheeseland (Wisconsin, USA)
    You are right, Aquinus, but it is implementation dependent. You know this, but I'll add the details for anyone else who might read this ...

    2D arrays are faster if the upper limit of the array is known beforehand (count the number of files prior to creating the array), so no array resizing has to occur.
    String[] myArray = new String[500];

    Generic List<> is also faster if the bounds are known:
    List<string> myList = new List<string>(500), as opposed to just initializing an empty List.
    The .NET implementation of List<T> actually uses an array behind the scenes.

    Using for loops on both Lists and Arrays is faster than using foreach, but again the bounds need to be known beforehand.

    Lists are a bit slower as they do bounds checking all the time, while in a predefined array no bounds checking is done (unless you code it yourself).
    Lists also have a bit larger memory footprint than arrays.

    The question then becomes whether the performance increase is enough to warrant the additional coding to implement arrays over generic Lists. This is completely dependent upon the implementation and what is occurring within the for loop. Complex calculations in the for loop may make the performance increase with arrays irrelevant (when each for iteration is analyzed).

    Personally, I usually use Lists as the code is cleaner, but then I am rarely using them for thousands or millions of elements, and the difference would be some number of milliseconds.
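
    To put the pre-sizing point in terms of the OP's program (a small illustrative sketch that borrows Directory.GetFiles, baseDir, and GetMD5HashFromFile from the source above; not benchmarked):

    Code:
    string[] files = Directory.GetFiles(baseDir);
    
    // Bounds known up front: neither container ever has to resize.
    string[] hashArray = new string[files.Length];
    List<string> hashList = new List<string>(files.Length);
    
    // Plain for loop over a known bound instead of foreach.
    for (int i = 0; i < files.Length; i++)
    {
        hashArray[i] = GetMD5HashFromFile(files[i]);
        hashList.Add(hashArray[i]);
    }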

    On a side note, it looks to me like all the code is in the OP. Am I overlooking something?
     
    Last edited: Mar 19, 2012
    Aquinus says thanks.
  12. Aquinus

    Aquinus Resident Wat-man

    Joined:
    Jan 28, 2012
    Messages:
    6,497 (6.45/day)
    Thanks Received:
    2,197
    Location:
    Concord, NH
    Great post. You're right, my stupid Mac just likes hiding the scroll bars on textareas. :banghead:
     
  13. FordGT90Concept

    FordGT90Concept "I go fast!1!11!1!"

    Joined:
    Oct 13, 2008
    Messages:
    13,830 (6.26/day)
    Thanks Received:
    3,700
    Location:
    IA, USA
    I used lists for animation tables having 32,000+ ushort values. It was quick. I used the default start size too (I think it's 64). Performance is getting less and less important when memory is so fast and plentiful. XD


    Yeah, the haystack (folders/files)! :roll:
     
    Crunching for Team TPU
  14. xbonez New Member

    Joined:
    Nov 29, 2010
    Messages:
    1,182 (0.83/day)
    Thanks Received:
    315
    Location:
    Philly, PA (US)
    I tend to stay away from micro-optimizations. They are almost always pointless unless you work in an extremely time-sensitive context.

    As for using another hashing algorithm: the SHA-2 family is a better hashing algorithm for security purposes, primarily because it is not as fast as MD5. This makes brute forcing or generating rainbow tables for SHA-2 hashes much slower. SHA-2, however, still has the possibility of hash collisions.

    There is no perfect hashing algorithm that guarantees no collisions over an arbitrarily large sample set; with a fixed-length output, collisions are unavoidable.

     
    Last edited: Mar 21, 2012
  15. Aquinus

    Aquinus Resident Wat-man

    Joined:
    Jan 28, 2012
    Messages:
    6,497 (6.45/day)
    Thanks Received:
    2,197
    Location:
    Concord, NH
    It depends on what kind of data you're working with and what language you're using. I mostly write PHP and a lot of SQL. The PHP usually runs fast enough unless you're going through a large array, like a huge SQL result set; on the other hand, I could have SQL that takes a long time to run because I might be joining 6 or 8 very large tables to each other. So whether optimization matters depends, which is why I asked how many files he was comparing: if you're doing 100,000+ files, the way the list is searched may be inefficient for what you're trying to do.

    But you're right, premature optimizations are usually a bad idea.

    Also, nice quote by Donald Knuth.
     
  16. djxinator

    djxinator New Member

    Joined:
    Mar 1, 2012
    Messages:
    40 (0.04/day)
    Thanks Received:
    10
    I know it's a bit off topic, and I have pretty much NO IDEA what's going on in here, but it's sparked up an old interest of mine.

    Is it feasible to self-teach C#, and if so, how long would it take to get a basic understanding and foundation for creating simple yet useful applications?
     
  17. Kreij

    Kreij Senior Monkey Moderator Staff Member

    Joined:
    Feb 6, 2007
    Messages:
    13,881 (4.92/day)
    Thanks Received:
    5,616
    Location:
    Cheeseland (Wisconsin, USA)
    Yes, I taught myself C# (and all the rest of the languages I know, for that matter).
    If you are familiar with OOP concepts and programming, I would estimate you would be reasonably comfortable with it in a couple of weeks.
     
  18. FordGT90Concept

    FordGT90Concept "I go fast!1!11!1!"

    Joined:
    Oct 13, 2008
    Messages:
    13,830 (6.26/day)
    Thanks Received:
    3,700
    Location:
    IA, USA
    The hard drive/solid state drive itself is the bottleneck, not the code. You could spend days improving the code, but the performance gains will be virtually nonexistent when 99.9% of the time is spent reading the files.


    The most efficient method of comparison (100% match) would be a hybrid:
    1) check filesize (file system look ups are fast and simple)
    2) hash (hashes spot check the file and take a snapshot of it)
    3) if there is a hash collision (match), do 1:1 compare to be 100% certain it is a match.

    This is slower than performing a single check but it is faster than always reading the entire file too.
     
    Last edited: Mar 21, 2012
    Crunching for Team TPU
  19. Kreij

    Kreij Senior Monkey Moderator Staff Member

    Joined:
    Feb 6, 2007
    Messages:
    13,881 (4.92/day)
    Thanks Received:
    5,616
    Location:
    Cheeseland (Wisconsin, USA)
    @Ford
    I don't think that step #3 is even needed. The probability of two MD5 hashes from differing files colliding is what, 2^-128?
    Even if you were hashing a million files, the odds of a hash collision on differing files are astronomically small.
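
    For anyone curious, the standard birthday-bound estimate backs this up (rough arithmetic, assuming MD5 digests of distinct files behave like random 128-bit values):

    Code:
    p(at least one collision among n files) ≈ n(n-1)/2 * 2^-128
    
    n = 1,000,000:  p ≈ 5.0e11 * 2.9e-39 ≈ 1.5e-27
    So even at a million files, the chance of any two distinct files sharing an MD5 hash is on the order of 10^-27.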
     
  20. FordGT90Concept

    FordGT90Concept "I go fast!1!11!1!"

    Joined:
    Oct 13, 2008
    Messages:
    13,830 (6.26/day)
    Thanks Received:
    3,700
    Location:
    IA, USA
    Actually, I just did a test using points 1-3 above. Times for each method:
    1) 0.0010000 seconds
    2) 9.4845425 seconds
    3) 8.9955145 seconds
    Result = Match

    Second run just to make sure it is no fluke:
    1) 0.0000000 seconds
    2) 10.6946117 seconds
    3) 8.1694673 seconds
    Result = Match

    Both files are 594,778,936 bytes (567 MiB). Original filename: C:\Program Files (x86)\Mass Effect 3\BIOGame\DLC\DLC_HEN_PR\CookedPCConsole\Default.sfar

    I'm going to change 1 byte and see if that impacts performance. I will have to test 2 and 3 separately because 2 won't fall through to 3 if the hash isn't a match. In order to give 2 the best possible chance, the byte I modify will be at the very end, so 3 must process the entire file to find it (when in actuality, it could stop on the first byte)...

    File: b
    Offset: 594778935
    Old Value: 0x7B
    New Value: 0x00

    1 & 2 test:
    1) 0.0000000 seconds
    2) 26.9975442 seconds
    Result = FailedFileHash

    1 & 3 test:
    1) 0.0010000 seconds
    3) 12.2507007 seconds
    Result = FailedFile

    That's no joke, it's 220% faster and 100% accurate every time!


    To prove that method 3 can return no match in an instant, I'm undoing the change above and adding the following:
    File: b
    Offset: 1048576 (1 MiB in)
    Old Value: 0xB0
    New Value: 0x00

    1 & 2 test:
    1) 0.0010001 seconds
    2) 24.3733941 seconds
    Result = FailedFileHash

    1 & 3 test:
    1) 0.0010001 seconds
    3) 0.0610035 seconds
    Result = FailedFile

    Hash. Got. Owned. 1:1 compare with break was 39,954% faster!


    Code:
    using System;
    using System.IO;
    using System.Security.Cryptography;
    
    namespace ConsoleApplication1
    {
        public enum CompareResult : sbyte
        {
            Match = 0,
            FailedFileSize = -1,
            FailedFileHash = -2,
            FailedFile = -4
        }
        class Program
        {
            static void Main(string[] args)
            {
                string a = @"C:\Users\Admin\Desktop\a";
                string b = @"C:\Users\Admin\Desktop\b";
    
                DateTime start = DateTime.Now;
                Console.WriteLine("       Result: " + DoCompare(a, b).ToString());
                Console.WriteLine("   Total time: " + (DateTime.Now - start).ToString());
                Console.ReadKey();
            }
    
            public static CompareResult DoCompare(string a, string b)
            {
                DateTime start = DateTime.Now;
                if (CompareFileSize(a, b))
                {
                    Console.WriteLine("FileSize time: " + (DateTime.Now - start).ToString());
    
                    start = DateTime.Now;
                    if (CompareFileHash(a, b))
                    {
                        Console.WriteLine("FileHash time: " + (DateTime.Now - start).ToString());
    
                        start = DateTime.Now;
                        if (CompareFile(a, b))
                        {
                            Console.WriteLine("    File time: " + (DateTime.Now - start).ToString());
                            return CompareResult.Match;
                        }
                        else
                        {
                            Console.WriteLine("    File time: " + (DateTime.Now - start).ToString());
                            return CompareResult.FailedFile;
                        }
                    }
                    else
                    {
                        Console.WriteLine("FileHash time: " + (DateTime.Now - start).ToString());
                        return CompareResult.FailedFileHash;
                    }
                }
                else
                {
                    Console.WriteLine("FileSize time: " + (DateTime.Now - start).ToString());
                    return CompareResult.FailedFileSize;
                }
            }
    
            public static bool CompareFileSize(string a, string b)
            {
                // Cheapest check first: files of different length cannot be identical.
                return new FileInfo(a).Length == new FileInfo(b).Length;
            }
            public static bool CompareFileHash(string a, string b)
            {
                MD5 md5 = new MD5CryptoServiceProvider();
    
                FileStream fs = new FileStream(a, FileMode.Open, FileAccess.Read);
                byte[] ah = md5.ComputeHash(fs);
                fs.Close();
    
                fs = new FileStream(b, FileMode.Open, FileAccess.Read);
                byte[] bh = md5.ComputeHash(fs);
                fs.Close();
    
                for (byte i = 0; i < ah.Length; i++)
                {
                    if (ah[i] != bh[i])
                        return false;
                }
                return true;
            }
            public static bool CompareFile(string a, string b)
            {
                const int BUFFER_SIZE = 4096;
                byte[] buffera = new byte[BUFFER_SIZE];
                byte[] bufferb = new byte[BUFFER_SIZE];
    
                FileStream fsa = new FileStream(a, FileMode.Open, FileAccess.Read);
                FileStream fsb = new FileStream(b, FileMode.Open, FileAccess.Read);
    
                int reada, readb;
    
                bool match = true;
                do
                {
                    reada = fsa.Read(buffera, 0, BUFFER_SIZE);
                    readb = fsb.Read(bufferb, 0, BUFFER_SIZE);
    
                    // The streams should return equally sized chunks since the files are the same
                    // length; if they ever don't, the buffers are misaligned, so treat it as a mismatch.
                    if (reada != readb)
                    {
                        match = false;
                        break;
                    }
    
                    // Only compare the bytes actually read in this chunk, not the whole buffer.
                    for (int i = 0; i < reada; i++)
                    {
                        if (buffera[i] != bufferb[i])
                        {
                            match = false;
                            break;
                        }
                    }
    
                    if (!match)
                        break;
                } while (reada != 0);
    
                fsa.Close();
                fsb.Close();
    
                return match;
            }
        }
    }
    Moral of the story: don't waste your time on hashes. FileSize -> 1:1 compare is best in every scenario.

    #2 above should be replaced with a custom function that spot-checks the file (e.g. every 1024 bytes) instead of using hashes. If the spot-check doesn't find a difference, move on to the 1:1.
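
    A quick sketch of what that spot-check pre-filter might look like (the name, signature, and 1024-byte stride are illustrative; it assumes the same usings as the code above and that the file sizes were already found equal in step 1):

    Code:
    // Samples one byte every `stride` bytes from each file and compares them.
    // It can only prove a mismatch; if no difference is found, fall through to CompareFile()
    // for the full 1:1 pass. The sampled bytes could also be cached per file to speed up
    // many-to-many comparisons.
    public static bool SpotCheck(string a, string b, int stride = 1024)
    {
        using (FileStream fsa = new FileStream(a, FileMode.Open, FileAccess.Read))
        using (FileStream fsb = new FileStream(b, FileMode.Open, FileAccess.Read))
        {
            for (long offset = 0; offset < fsa.Length; offset += stride)
            {
                fsa.Seek(offset, SeekOrigin.Begin);
                fsb.Seek(offset, SeekOrigin.Begin);
                if (fsa.ReadByte() != fsb.ReadByte())
                    return false;   // definite mismatch; skip the expensive compare
            }
        }
        return true;                // possibly identical; confirm with the 1:1 compare
    }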
     
    Last edited: Mar 21, 2012
    Kreij says thanks.
    Crunching for Team TPU
  21. Kreij

    Kreij Senior Monkey Moderator Staff Member

    Joined:
    Feb 6, 2007
    Messages:
    13,881 (4.92/day)
    Thanks Received:
    5,616
    Location:
    Cheeseland (Wisconsin, USA)
    Makes sense to me. Sequential reading of a pair of arrays is fast (and of course, it doesn't have to run a hash algorithm even if the whole file is read).

    In the last scenario, it would probably be even faster in practice, as the odds of two differing files being identical for the first 1 MiB are also low, so it would usually find the difference sooner.

    Now get in there and optimize it more to make it even faster. :D

    In a worst-case scenario (let's say 5 files with a differing last byte) you would have to do full file reads 10 times for the array method, but only 5 hash runs and then a hash List compare.
    Seems in this case they would be a lot closer.
     
    Last edited: Mar 21, 2012
  22. FordGT90Concept

    FordGT90Concept "I go fast!1!11!1!"

    Joined:
    Oct 13, 2008
    Messages:
    13,830 (6.26/day)
    Thanks Received:
    3,700
    Location:
    IA, USA
    Yeah, 90% of files would be different on offset 0, 9% at offset 3-31, and 1% in the rest of the file.


    I dunno what I could do to make it faster except replace my 2x250 GB HDDs with an SSD. :p <10 seconds for a >500 MiB file is pretty fast.


    True, which is why the spot-check function would be important. It could store the spot-check result for every file, making it much faster when comparing many files to many files.
     
    Crunching for Team TPU
  23. Kreij

    Kreij Senior Monkey Moderator Staff Member

    Joined:
    Feb 6, 2007
    Messages:
    13,881 (4.92/day)
    Thanks Received:
    5,616
    Location:
    Cheeseland (Wisconsin, USA)
    Multi-threading FTW !!!
    (Since disk access is the bottleneck, that would not really improve anything unless the files were on separate disks/controllers.)

    BTW ... I still use Nettool all of the time. What a spectacular little utility. :rockout:
    I'm using it today to help troubleshoot a workstation connection problem (logged into the work network from home through TeamViewer).
    It's my go-to utility for tier 1 network troubles.
    Tier 2 is a big hammer. :D
     
  24. FordGT90Concept

    FordGT90Concept "I go fast!1!11!1!"

    Joined:
    Oct 13, 2008
    Messages:
    13,830 (6.26/day)
    Thanks Received:
    3,700
    Location:
    IA, USA
    Multithreading only benefits memory/CPU-bound tasks. Like I said before, the bottleneck is the HDD/SSD/RAID. Multithreading wouldn't improve performance.


    Heh, I'm glad someone is using it. XD
     
    Crunching for Team TPU
  25. TheMailMan78

    TheMailMan78 Big Member

    Joined:
    Jun 3, 2007
    Messages:
    21,147 (7.81/day)
    Thanks Received:
    7,674
    I tried to teach myself, but I had no basic understanding of any of the concepts. It was like trying to paint without knowing what a paintbrush was.
     
