
Identical File Finder

xbonez

New Member
Joined
Nov 29, 2010
Messages
1,182 (0.24/day)
Location
Philly, PA (US)
System Name Winter
Processor AMD Phenom II x4 965 BE @ 4.0Ghz
Motherboard MSI 790FX-GD70
Cooling Corsair H50 Liquid Cooling
Memory 2 x 2Gb Gskill Ripjaws 1600Mhz (7-7-7-24@1.6V)
Video Card(s) Asus GTX 470 @ Stock (Zalman VF3000 cooler)
Storage 2 x Samsung Spinpoint F3 500GB (RAID 0)
Display(s) Hanns G 28" @ 1920x1200
Case Antec 1200
Audio Device(s) Onboard -- TosLink --> Z5500
Power Supply Corsair 850TX 850W PSU
Software Win 7 64-bit Ultimate
I needed to sort through a directory containing about 500 files (many of them identical in content) and pick out only the unique ones. I ended up writing a short program in C# to do the task for me. Posting the executable and code below, should anyone ever require it.

Summary

The program iterates over all the files in the input directory and checks for uniqueness by calculating the MD5 hash of the file contents. All unique files are copied over to the output directory.

Executable: http://dl.dropbox.com/u/1276196/DND/IdenticalFileFinder.exe

Usage

The program expects two command line arguments.
The first argument is the input directory.
The second argument is the output directory where the unique files will get copied.

A trailing slash is optional.
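For example (directory names hypothetical):

Code:
IdenticalFileFinder.exe C:\MessyFiles C:\UniqueFiles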

Source

Code:
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;
using System.Text;

namespace IdenticalFileFinder
{
    class Program
    {
        static string baseDir = String.Empty;
        static string outputDir = String.Empty;

        static void Main(string[] args)
        {
            getArgs(args);          // validate and store the two directory arguments
            normalizeDirectories(); // guarantee a trailing backslash on both paths

            List<string> hashes = new List<string>();    // MD5 hashes seen so far
            string[] files = Directory.GetFiles(baseDir);

            if (!Directory.Exists(outputDir))
            {
                try
                {
                    Directory.CreateDirectory(outputDir);
                }
                catch(Exception e)
                {
                    Console.WriteLine(e.ToString());
                    Environment.Exit((int)ExitScenarios.READWRITEERROR);
                }
            }

            foreach (string fileName in files)
            {
                string hash = GetMD5HashFromFile(fileName);

                // An unseen hash means unseen content: record it and copy the file.
                // (List.Contains is a linear scan, which is fine for ~500 files.)
                if (!hashes.Contains(hash))
                {
                    string dest = outputDir + Path.GetFileName(fileName);
                    hashes.Add(hash);
                    try
                    {
                        File.Copy(fileName, dest);
                    }
                    catch (Exception e)
                    {
                        Console.WriteLine(e.ToString());
                        Environment.Exit((int)ExitScenarios.READWRITEERROR);
                    }
                }
            }
            Console.WriteLine(String.Format("{0} unique files found in {1} total files.",hashes.Count,files.Length));
            Environment.Exit((int)ExitScenarios.SUCCESS);
        }

        public static void normalizeDirectories()
        {
            // Append a trailing backslash if it is missing, so file names can
            // simply be concatenated onto the directory paths later.
            if (baseDir[baseDir.Length - 1] != '\\')
                baseDir += "\\";
            if (outputDir[outputDir.Length - 1] != '\\')
                outputDir += "\\";
        }

        public static void getArgs(string[] args)
        {
            if (args.Length == 2)
            {
                baseDir = args[0];
                outputDir = args[1];
            }
            else
            {
                Console.WriteLine("Incorrect arguments passed in. First argument should be input directory and second should be output directory.");
                Environment.Exit((int)ExitScenarios.ARGFAIL);
            }

            if (!Directory.Exists(baseDir)) 
            {
                Console.WriteLine("Input directory not found.");
                Environment.Exit((int)ExitScenarios.ARGFAIL);
            }
        }

        static public string GetMD5HashFromFile(string fileName)
        {
            // Hash the file contents; the using blocks dispose the stream and
            // the MD5 instance even if an exception is thrown.
            byte[] retVal;
            using (FileStream file = new FileStream(fileName, FileMode.Open, FileAccess.Read))
            using (MD5 md5 = new MD5CryptoServiceProvider())
            {
                retVal = md5.ComputeHash(file);
            }

            // Convert the 16-byte digest to a lowercase hex string.
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < retVal.Length; i++)
            {
                sb.Append(retVal[i].ToString("x2"));
            }
            return sb.ToString();
        }
    }

    public enum ExitScenarios
    {
        SUCCESS,
        ARGFAIL,
        READWRITEERROR
    }
}
 

Kreij

Senior Monkey Moderator
Joined
Feb 6, 2007
Messages
13,817 (2.21/day)
Location
Cheeseland (Wisconsin, USA)
What if two different files generate the same hash? (I know it's unlikely, but it is possible.)
 

FordGT90Concept

"I go fast!1!11!1!"
Joined
Oct 13, 2008
Messages
26,259 (4.65/day)
Location
IA, USA
System Name BY-2021
Processor AMD Ryzen 7 5800X (65w eco profile)
Motherboard MSI B550 Gaming Plus
Cooling Scythe Mugen (rev 5)
Memory 2 x Kingston HyperX DDR4-3200 32 GiB
Video Card(s) AMD Radeon RX 7900 XT
Storage Samsung 980 Pro, Seagate Exos X20 TB 7200 RPM
Display(s) Nixeus NX-EDG274K (3840x2160@144 DP) + Samsung SyncMaster 906BW (1440x900@60 HDMI-DVI)
Case Coolermaster HAF 932 w/ USB 3.0 5.25" bay + USB 3.2 (A+C) 3.5" bay
Audio Device(s) Realtek ALC1150, Micca OriGen+
Power Supply Enermax Platimax 850w
Mouse Nixeus REVEL-X
Keyboard Tesoro Excalibur
Software Windows 10 Home 64-bit
Benchmark Scores Faster than the tortoise; slower than the hare.
Neat idea but... I would recommend doing a 1:1 compare...

1. If the filesize is equal, move on to step two; else, not identical.
2. Read one chunk (4096 bytes) from file A and file B into separate buffers.
3. Use a for loop to make sure all bytes are equal: fileAbuffer[i] == fileBbuffer[i]
4. If an inequality is ever discovered, break the loop, close the file streams, and output non-match. If they are equal, rinse and repeat until end of file is reached.

Using this method, you could even have a percentage-match option where you count the differences and use the filesize as the denominator in creating the percent. If not above a user-specified percentage (100% if not given), reject it — see the sketch below.


The method described above would likely be faster than md5.ComputeHash(file) on large files because it could stop at the first differing byte.
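Something like this for the percentage variant (untested sketch — the names and the 4096-byte chunk size are just placeholders):

Code:
using System.IO;

static class FuzzyCompare
{
    // Returns the fraction of matching bytes (1.0 = identical).
    // Counts every difference instead of breaking at the first one,
    // so a percentage threshold can be applied afterwards.
    public static double MatchRatio(string pathA, string pathB)
    {
        long size = new FileInfo(pathA).Length;
        if (size != new FileInfo(pathB).Length)
            return 0.0; // step 1: different sizes, not identical

        long differences = 0;
        byte[] bufA = new byte[4096], bufB = new byte[4096];
        using (FileStream fsA = File.OpenRead(pathA))
        using (FileStream fsB = File.OpenRead(pathB))
        {
            int read;
            while ((read = fsA.Read(bufA, 0, bufA.Length)) > 0)
            {
                fsB.Read(bufB, 0, read); // equal-size local files: same count back
                for (int i = 0; i < read; i++)
                    if (bufA[i] != bufB[i])
                        differences++;
            }
        }
        return size == 0 ? 1.0 : 1.0 - (double)differences / size;
    }
}
Reject the pair whenever MatchRatio() comes back below the user-specified percentage (1.0 if not given).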
 

xbonez

New Member
Joined
Nov 29, 2010
Messages
1,182 (0.24/day)
Location
Philly, PA (US)
System Name Winter
Processor AMD Phenom II x4 965 BE @ 4.0Ghz
Motherboard MSI 790FX-GD70
Cooling Corsair H50 Liquid Cooling
Memory 2 x 2Gb Gskill Ripjaws 1600Mhz (7-7-7-24@1.6V)
Video Card(s) Asus GTX 470 @ Stock (Zalman VF3000 cooler)
Storage 2 x Samsung Spinpoint F3 500GB (RAID 0)
Display(s) Hanns G 28" @ 1920x1200
Case Antec 1200
Audio Device(s) Onboard -- TosLink --> Z5500
Power Supply Corsair 850TX 850W PSU
Software Win 7 64-bit Ultimate
@Kreij: In case of a hash collision, it'll just silently treat a unique file as a duplicate and skip it... lol. I didn't cater for that.

@Ford: Those are some good suggestions. When I first wrote the program, I just needed to filter through the 500 files I had, and threw together the first thing that came to my mind, but I might decide to re-do it (taking some of your suggestions) and package it into a WPF app.
 

Dolph

New Member
Joined
Feb 8, 2012
Messages
114 (0.03/day)
I know this thread is a few weeks old, and correct me if I'm wrong, but

Code:
     public static void normalizeDirectories()
        {            
            if (baseDir[baseDir.Length - 1] != '\\')
                baseDir += "\\";
            if (outputDir[outputDir.Length - 1] != '\\')
                outputDir += "\\";
        }

Shouldn't you be using a substring for that? Considering you're taking the last character of the string and comparing it to what looks like a 2-character string, i.e. '\\', won't that always be true? '\' != '\\' = true

In the case of the user entering a path ending with '\\' or even '\', it will add another '\\'.

EX: program.exe c:\Projects\ d:\

Wouldn't that see it, and "correct" it to c:\Projects\\\ d:\\\ ?
 

FordGT90Concept

"I go fast!1!11!1!"
Joined
Oct 13, 2008
Messages
26,259 (4.65/day)
Location
IA, USA
System Name BY-2021
Processor AMD Ryzen 7 5800X (65w eco profile)
Motherboard MSI B550 Gaming Plus
Cooling Scythe Mugen (rev 5)
Memory 2 x Kingston HyperX DDR4-3200 32 GiB
Video Card(s) AMD Radeon RX 7900 XT
Storage Samsung 980 Pro, Seagate Exos X20 TB 7200 RPM
Display(s) Nixeus NX-EDG274K (3840x2160@144 DP) + Samsung SyncMaster 906BW (1440x900@60 HDMI-DVI)
Case Coolermaster HAF 932 w/ USB 3.0 5.25" bay + USB 3.2 (A+C) 3.5" bay
Audio Device(s) Realtek ALC1150, Micca OriGen+
Power Supply Enermax Platimax 850w
Mouse Nixeus REVEL-X
Keyboard Tesoro Excalibur
Software Windows 10 Home 64-bit
Benchmark Scores Faster than the tortoise; slower than the hare.
It is correct. '\\' is a char literal containing a single backslash; that's how C# escapes backslashes in source code. The method just makes sure that both directories always end with a backslash. .NET's directory methods like to omit it.
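Quick demo if you want to convince yourself:

Code:
char slash = '\\';        // a char literal holding ONE backslash
Console.WriteLine(slash); // prints: \
Console.WriteLine("c:\\Projects\\".Length); // 12 -- each \\ pair is one character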
 

Aquinus

Resident Wat-man
Joined
Jan 28, 2012
Messages
13,147 (2.96/day)
Location
Concord, NH, USA
System Name Apollo
Processor Intel Core i9 9880H
Motherboard Some proprietary Apple thing.
Memory 64GB DDR4-2667
Video Card(s) AMD Radeon Pro 5600M, 8GB HBM2
Storage 1TB Apple NVMe, 4TB External
Display(s) Laptop @ 3072x1920 + 2x LG 5k Ultrafine TB3 displays
Case MacBook Pro (16", 2019)
Audio Device(s) AirPods Pro, Sennheiser HD 380s w/ FIIO Alpen 2, or Logitech 2.1 Speakers
Power Supply 96w Power Adapter
Mouse Logitech MX Master 3
Keyboard Logitech G915, GL Clicky
Software MacOS 12.1
Where is the rest of the source? I might be blind, but I'm not seeing where the hash is generated. You could use a different hashing algorithm, or a combination of two, to mitigate collisions. Maybe MD5+SHA1, or SHA256. SHA256 will run fast on newer Sandy Bridge and practically any AMD chip (AMD seems to do crypto pretty well, even against SB).

Also, if the directory gets too big, searching the List may become inefficient. There are a lot of things you could do to speed this up, but I don't know how fast it is already. For *really* large directories, using fixed-size arrays stored in a 2D array, along with threading to process even and odd items from one array, could speed it up, since not a whole lot of locking is required (just on the output "list").
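Actually, for plain duplicate detection a HashSet<string> would be the simplest change — constant-time lookups, no sorting needed. Untested sketch of what I mean (names hypothetical):

Code:
// HashSet<string>.Add returns false when the item is already present,
// so the membership test and the insert collapse into one call.
HashSet<string> seen = new HashSet<string>();
foreach (string fileName in Directory.GetFiles(inputDir))
{
    string hash = ComputeFileHash(fileName); // whatever hash function is used
    if (seen.Add(hash)) // true only for the first file with this content
        File.Copy(fileName, Path.Combine(outputDir, Path.GetFileName(fileName)));
}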

Good work though, it's been some time since I've written any C#.

Have you recorded any numbers to see how many files your application can process per second? You could use that to judge improvements or slowdowns in your application.
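Even a Stopwatch around the main loop would do it (untested fragment; fileCount stands in for however many files were processed):

Code:
var timer = System.Diagnostics.Stopwatch.StartNew();
// ... hash and copy all the files ...
timer.Stop();
Console.WriteLine("{0:F1} files/sec", fileCount / timer.Elapsed.TotalSeconds);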
 

Kreij

Senior Monkey Moderator
Joined
Feb 6, 2007
Messages
13,817 (2.21/day)
Location
Cheeseland (Wisconsin, USA)
The hash is being generated in the GetMD5HashFromFile() method.

The odds of a hash collision are so infinitesimally small that it's not really worth testing for.

Lists (the List<T> class) are extremely fast for sorting and searching. What would you change, Aquinus?
 

Aquinus

Resident Wat-man
Joined
Jan 28, 2012
Messages
13,147 (2.96/day)
Location
Concord, NH, USA
System Name Apollo
Processor Intel Core i9 9880H
Motherboard Some proprietary Apple thing.
Memory 64GB DDR4-2667
Video Card(s) AMD Radeon Pro 5600M, 8GB HBM2
Storage 1TB Apple NVMe, 4TB External
Display(s) Laptop @ 3072x1920 + 2x LG 5k Ultrafine TB3 displays
Case MacBook Pro (16", 2019)
Audio Device(s) AirPods Pro, Sennheiser HD 380s w/ FIIO Alpen 2, or Logitech 2.1 Speakers
Power Supply 96w Power Adapter
Mouse Logitech MX Master 3
Keyboard Logitech G915, GL Clicky
Software MacOS 12.1
The hash is being generated in the GetMD5HashFromFile() method.

The odds of a hash collision are so infinitesimally small that it's not really worth testing for.

Lists (the List<T> class) are extremely fast for sorting and searching. What would you change, Aquinus?

I would use a 2D array and define a new "row" of static size every time the previous row is filled. That way you're saving the space the List class needs and you're not resizing one big array every time you exceed the maximum size. The overhead is small and you stick with just what you need. The List class has a lot of overhead to expand it beyond a standard array, where a regular array should do just as well (see the sketch below).

If the OP were to post the full source, I wouldn't mind trying to alter it. I would have to install Visual Studio again, and I wouldn't be able to get to it until I get back from work later today, which also depends on how fussy my daughter is when I get home.

It is my understanding that the List is being used just to store values in this instance, not sort them. I'm just saying it can be optimized if it takes a long time on large numbers of files.
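Roughly what I have in mind (untested sketch, row size arbitrary):

Code:
using System.Collections.Generic;

// Grows by fixed-size rows instead of resizing one big array.
class ChunkedStore
{
    const int RowSize = 1024;
    // The outer list only holds row references, so its resizes are cheap.
    readonly List<string[]> rows = new List<string[]>();
    int count = 0;

    public void Add(string hash)
    {
        if (count % RowSize == 0)
            rows.Add(new string[RowSize]); // previous row full: start a new one
        rows[count / RowSize][count % RowSize] = hash;
        count++;
    }

    public string Get(int i)
    {
        return rows[i / RowSize][i % RowSize];
    }
}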
 

Kreij

Senior Monkey Moderator
Joined
Feb 6, 2007
Messages
13,817 (2.21/day)
Location
Cheeseland (Wisconsin, USA)
You are right, Aquinus, but it is implementation dependent. You know this, but I'll add the details for anyone else who might read this ...

2D arrays are faster if the upper limit of the array is known beforehand (count the number of files prior to creating the array), so no array resizing has to occur.
String[] myArray = new String[500];

Generic List<> is also faster if the bounds are known:
List<string> myList = new List<string>(500), as opposed to just initializing an empty List.
The .NET implementation of List is actually using an array behind the scenes.

Using for loops on both Lists and arrays is faster than using foreach, but again the bounds need to be known beforehand.

Lists are a bit slower as they bounds-check on every access, while in a simple for loop over a predefined array the JIT can often elide the bounds checks.
Lists also have a bit larger memory footprint than arrays.

The question then becomes: is the performance increase enough to warrant the additional coding to implement arrays over generic Lists? This is completely dependent upon the implementation and what is occurring within the for loop. Complex calculations in the for loop may make the performance increase with arrays irrelevant (when each iteration is analyzed).

Personally, I usually use Lists as the code is cleaner, but then I am rarely using them for thousands or millions of elements, and the difference would be some number of milliseconds.
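To make the pre-sizing concrete (illustrative fragment; it borrows the OP's GetMD5HashFromFile and baseDir):

Code:
// Counting the files first lets both collections be sized exactly once.
string[] files = Directory.GetFiles(baseDir);
List<string> hashes = new List<string>(files.Length); // pre-sized List
string[] hashArray = new string[files.Length];        // plain-array alternative

for (int i = 0; i < files.Length; i++)                // for, with known bounds
    hashArray[i] = GetMD5HashFromFile(files[i]);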

On a side note, it looks to me like all the code is in the OP. Am I overlooking something?
 

Aquinus

Resident Wat-man
Joined
Jan 28, 2012
Messages
13,147 (2.96/day)
Location
Concord, NH, USA
System Name Apollo
Processor Intel Core i9 9880H
Motherboard Some proprietary Apple thing.
Memory 64GB DDR4-2667
Video Card(s) AMD Radeon Pro 5600M, 8GB HBM2
Storage 1TB Apple NVMe, 4TB External
Display(s) Laptop @ 3072x1920 + 2x LG 5k Ultrafine TB3 displays
Case MacBook Pro (16", 2019)
Audio Device(s) AirPods Pro, Sennheiser HD 380s w/ FIIO Alpen 2, or Logitech 2.1 Speakers
Power Supply 96w Power Adapter
Mouse Logitech MX Master 3
Keyboard Logitech G915, GL Clicky
Software MacOS 12.1
You are right, Aquinus, but it is implementation dependent. You know this, but I'll add the details for anyone else who might read this ...

2D arrays are faster if the upper limit of the array is known beforehand (count the number of files prior to creating the array), so no array resizing has to occur.
String[] myArray = new String[500];

Generic List<> is also faster if the bounds are known:
List<string> myList = new List<string>(500), as opposed to just initializing an empty List.
The .NET implementation of List is actually using an array behind the scenes.

Using for loops on both Lists and arrays is faster than using foreach, but again the bounds need to be known beforehand.

Lists are a bit slower as they bounds-check on every access, while in a simple for loop over a predefined array the JIT can often elide the bounds checks.
Lists also have a bit larger memory footprint than arrays.

The question then becomes: is the performance increase enough to warrant the additional coding to implement arrays over generic Lists? This is completely dependent upon the implementation and what is occurring within the for loop. Complex calculations in the for loop may make the performance increase with arrays irrelevant (when each iteration is analyzed).

Personally, I usually use Lists as the code is cleaner, but then I am rarely using them for thousands or millions of elements, and the difference would be some number of milliseconds.

On a side note, it looks to me like all the code is in the OP. Am I overlooking something?

Great post. You're right, my stupid mac just likes hiding the scroll bars on textareas. :banghead:
 

FordGT90Concept

"I go fast!1!11!1!"
Joined
Oct 13, 2008
Messages
26,259 (4.65/day)
Location
IA, USA
System Name BY-2021
Processor AMD Ryzen 7 5800X (65w eco profile)
Motherboard MSI B550 Gaming Plus
Cooling Scythe Mugen (rev 5)
Memory 2 x Kingston HyperX DDR4-3200 32 GiB
Video Card(s) AMD Radeon RX 7900 XT
Storage Samsung 980 Pro, Seagate Exos X20 TB 7200 RPM
Display(s) Nixeus NX-EDG274K (3840x2160@144 DP) + Samsung SyncMaster 906BW (1440x900@60 HDMI-DVI)
Case Coolermaster HAF 932 w/ USB 3.0 5.25" bay + USB 3.2 (A+C) 3.5" bay
Audio Device(s) Realtek ALC1150, Micca OriGen+
Power Supply Enermax Platimax 850w
Mouse Nixeus REVEL-X
Keyboard Tesoro Excalibur
Software Windows 10 Home 64-bit
Benchmark Scores Faster than the tortoise; slower than the hare.
Personally, I usually use Lists as the code is cleaner, but then I am rarely using them for thousands or millions of elements, and the difference would be some number of milliseconds.
I used lists for animation tables having 32,000+ ushort values. It was quick. I used the default start size too (I think it's 64). Performance is getting less and less important when memory is so fast and plentiful. XD


On a side note, it looks to me like all the code is in the OP. Am I overlooking something?
Yeah, the haystack (folders/files)! :roll:
 

xbonez

New Member
Joined
Nov 29, 2010
Messages
1,182 (0.24/day)
Location
Philly, PA (US)
System Name Winter
Processor AMD Phenom II x4 965 BE @ 4.0Ghz
Motherboard MSI 790FX-GD70
Cooling Corsair H50 Liquid Cooling
Memory 2 x 2Gb Gskill Ripjaws 1600Mhz (7-7-7-24@1.6V)
Video Card(s) Asus GTX 470 @ Stock (Zalman VF3000 cooler)
Storage 2 x Samsung Spinpoint F3 500GB (RAID 0)
Display(s) Hanns G 28" @ 1920x1200
Case Antec 1200
Audio Device(s) Onboard -- TosLink --> Z5500
Power Supply Corsair 850TX 850W PSU
Software Win 7 64-bit Ultimate
I tend to stay away from micro-optimizations. They are almost always pointless unless you work in an extremely time-sensitive context.

As for using another hashing algorithm: the SHA-2 family is a better hashing algorithm for security purposes, partly because it is not as fast as MD5 — that makes brute forcing or generating rainbow tables for SHA-2 hashes much slower. SHA-2, however, also has the possibility of hash collisions.

So far, there does not — and cannot — exist a perfect hashing algorithm that guarantees no collisions over an arbitrarily large sample set; a fixed-length digest has fewer possible values than possible inputs (the pigeonhole principle).

"More computing sins are committed in the name of efficiency (without necessarily achieving it) than for any other single reason — including blind stupidity." — W.A. Wulf

"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil." — Donald Knuth
 

Aquinus

Resident Wat-man
Joined
Jan 28, 2012
Messages
13,147 (2.96/day)
Location
Concord, NH, USA
System Name Apollo
Processor Intel Core i9 9880H
Motherboard Some proprietary Apple thing.
Memory 64GB DDR4-2667
Video Card(s) AMD Radeon Pro 5600M, 8GB HBM2
Storage 1TB Apple NVMe, 4TB External
Display(s) Laptop @ 3072x1920 + 2x LG 5k Ultrafine TB3 displays
Case MacBook Pro (16", 2019)
Audio Device(s) AirPods Pro, Sennheiser HD 380s w/ FIIO Alpen 2, or Logitech 2.1 Speakers
Power Supply 96w Power Adapter
Mouse Logitech MX Master 3
Keyboard Logitech G915, GL Clicky
Software MacOS 12.1
I tend to stay away from micro-optimizations. They are almost always pointless unless you work in an extremely time-sensitive context.

As for using another hashing algorithm: the SHA-2 family is a better hashing algorithm for security purposes, partly because it is not as fast as MD5 — that makes brute forcing or generating rainbow tables for SHA-2 hashes much slower. SHA-2, however, also has the possibility of hash collisions.

So far, there does not — and cannot — exist a perfect hashing algorithm that guarantees no collisions over an arbitrarily large sample set; a fixed-length digest has fewer possible values than possible inputs (the pigeonhole principle).

It depends on what kind of data you're working with and what language you're using. I mostly write PHP and a lot of SQL. The PHP usually runs fast enough unless you're going through a large array like a huge SQL result set, but on the other hand I could have SQL that takes a long time to run because I might be joining 6 or 8 very large tables to each other. So whether optimization matters depends, which is why I asked how many files he was comparing: if you're doing 100,000+ files, the way the list is searched may be inefficient for what you're trying to do.

But you're right, premature optimizations are usually a bad idea.

Also, nice quote by Donald Knuth.
 
Joined
Mar 1, 2012
Messages
41 (0.01/day)
System Name The Xinator
Processor i5 2500k @ 4.6Ghz
Motherboard ASUS P8Z68-V-PRO
Cooling Phanteks PH-TC14PE
Memory 8GB Corsair Vengeance White LP 1600Mhz 1.35v
Video Card(s) Gigabyte GTX780 Windforce OC (Rev 2.0)
Storage OCZ Vector 256GB
Display(s) Hazro HZ27WD(V2) 2560x1440
Case Antec 1200 (V1)
Audio Device(s) ASUS Xonar DGX
Power Supply Antec CP-850
Software Windows 7 Pro 64bit
I know it's a bit off topic, and I have pretty much NO IDEA what's going on in here, but it's sparked up an old interest of mine.

Is it feasible to self-teach C#, and if so, how long to get a basic understanding and foundation for creating simple yet useful applications?
 

Kreij

Senior Monkey Moderator
Joined
Feb 6, 2007
Messages
13,817 (2.21/day)
Location
Cheeseland (Wisconsin, USA)
Yes, I taught myself C# (and all the rest of languages I know for that matter).
If you are familiar with OOP concepts and programming I would estimate you would be reasonably comfortable with it in a couple of weeks.
 

FordGT90Concept

"I go fast!1!11!1!"
Joined
Oct 13, 2008
Messages
26,259 (4.65/day)
Location
IA, USA
System Name BY-2021
Processor AMD Ryzen 7 5800X (65w eco profile)
Motherboard MSI B550 Gaming Plus
Cooling Scythe Mugen (rev 5)
Memory 2 x Kingston HyperX DDR4-3200 32 GiB
Video Card(s) AMD Radeon RX 7900 XT
Storage Samsung 980 Pro, Seagate Exos X20 TB 7200 RPM
Display(s) Nixeus NX-EDG274K (3840x2160@144 DP) + Samsung SyncMaster 906BW (1440x900@60 HDMI-DVI)
Case Coolermaster HAF 932 w/ USB 3.0 5.25" bay + USB 3.2 (A+C) 3.5" bay
Audio Device(s) Realtek ALC1150, Micca OriGen+
Power Supply Enermax Platimax 850w
Mouse Nixeus REVEL-X
Keyboard Tesoro Excalibur
Software Windows 10 Home 64-bit
Benchmark Scores Faster than the tortoise; slower than the hare.
So whether optimization matters depends, which is why I asked how many files he was comparing: if you're doing 100,000+ files, the way the list is searched may be inefficient for what you're trying to do.
The hard drive/solid state drive itself is the bottleneck, not the code. You could spend days improving the code, but performance gains will be virtually nonexistent when 99.9% of the time is spent reading the files.


The most efficient method of comparison (100% match) would be a hybrid:
1) check filesize (file system lookups are fast and simple)
2) hash (a hash reads the file once and produces a compact fingerprint of it)
3) if there is a hash collision (match), do a 1:1 compare to be 100% certain it is a match.

This is slower than performing a single check, but it is faster than always reading the entire file, too.
 

Kreij

Senior Monkey Moderator
Joined
Feb 6, 2007
Messages
13,817 (2.21/day)
Location
Cheeseland (Wisconsin, USA)
@Ford
I don't think that step #3 is even needed. The probability of two MD5 hashes from differing files colliding is what, 2^-128?
Even if you were hashing a million files, the odds of a hash collision on differing files are astronomically small.
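Back-of-the-envelope, using the standard birthday approximation:

P(any collision among n files) ≈ n(n-1)/2 × 2^-128

For n = 1,000,000 that is roughly 5×10^11 / 3.4×10^38 ≈ 1.5×10^-27.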
 

FordGT90Concept

"I go fast!1!11!1!"
Joined
Oct 13, 2008
Messages
26,259 (4.65/day)
Location
IA, USA
System Name BY-2021
Processor AMD Ryzen 7 5800X (65w eco profile)
Motherboard MSI B550 Gaming Plus
Cooling Scythe Mugen (rev 5)
Memory 2 x Kingston HyperX DDR4-3200 32 GiB
Video Card(s) AMD Radeon RX 7900 XT
Storage Samsung 980 Pro, Seagate Exos X20 TB 7200 RPM
Display(s) Nixeus NX-EDG274K (3840x2160@144 DP) + Samsung SyncMaster 906BW (1440x900@60 HDMI-DVI)
Case Coolermaster HAF 932 w/ USB 3.0 5.25" bay + USB 3.2 (A+C) 3.5" bay
Audio Device(s) Realtek ALC1150, Micca OriGen+
Power Supply Enermax Platimax 850w
Mouse Nixeus REVEL-X
Keyboard Tesoro Excalibur
Software Windows 10 Home 64-bit
Benchmark Scores Faster than the tortoise; slower than the hare.
Actually, I just did a test using points 1-3 above. Times for each method:
1) 0.0010000 seconds
2) 9.4845425 seconds
3) 8.9955145 seconds
Result = Match

Second run just to make sure it is no fluke:
1) 0.0000000 seconds
2) 10.6946117 seconds
3) 8.1694673 seconds
Result = Match

Both files are 594,778,936 bytes (567 MiB). Original filename: C:\Program Files (x86)\Mass Effect 3\BIOGame\DLC\DLC_HEN_PR\CookedPCConsole\Default.sfar

I'm going to change 1 byte and see if that impacts performance. I will have to test 2 and 3 separately because 2 won't fall through to 3 if the hash isn't a match. In order to give 2 the best possible chance, the byte I modify will be at the very end, so 3 must process the entire file to find it (when in actuality, it could stop on the first differing byte)...

File: b
Offset: 594778935
Old Value: 0x7B
New Value: 0x00

1 & 2 test:
1) 0.0000000 seconds
2) 26.9975442 seconds
Result = FailedFileHash

1 & 3 test:
1) 0.0010000 seconds
3) 12.2507007 seconds
Result = FailedFile

That's no joke — it's about 2.2× as fast and 100% accurate every time!


To prove that method 3 can return no match in an instant, I'm undoing the change above and adding the following:
File: b
Offset: 1048576 (1 MiB in)
Old Value: 0xB0
New Value: 0x00

1 & 2 test:
1) 0.0010001 seconds
2) 24.3733941 seconds
Result = FailedFileHash

1 & 3 test:
1) 0.0010001 seconds
3) 0.0610035 seconds
Result = FailedFile

Hash. Got. Owned. The 1:1 compare with break was roughly 400× as fast!


Code:
using System;
using System.IO;
using System.Security.Cryptography;

namespace ConsoleApplication1
{
    public enum CompareResult : sbyte
    {
        Match = 0,
        FailedFileSize = -1,
        FailedFileHash = -2,
        FailedFile = -4
    }
    class Program
    {
        static void Main(string[] args)
        {
            string a = @"C:\Users\Admin\Desktop\a";
            string b = @"C:\Users\Admin\Desktop\b";

            DateTime start = DateTime.Now;
            Console.WriteLine("       Result: " + DoCompare(a, b).ToString());
            Console.WriteLine("   Total time: " + (DateTime.Now - start).ToString());
            Console.ReadKey();
        }

        public static CompareResult DoCompare(string a, string b)
        {
            DateTime start = DateTime.Now;
            if (CompareFileSize(a, b))
            {
                Console.WriteLine("FileSize time: " + (DateTime.Now - start).ToString());

                start = DateTime.Now;
                if (CompareFileHash(a, b))
                {
                    Console.WriteLine("FileHash time: " + (DateTime.Now - start).ToString());

                    start = DateTime.Now;
                    if (CompareFile(a, b))
                    {
                        Console.WriteLine("    File time: " + (DateTime.Now - start).ToString());
                        return CompareResult.Match;
                    }
                    else
                    {
                        Console.WriteLine("    File time: " + (DateTime.Now - start).ToString());
                        return CompareResult.FailedFile;
                    }
                }
                else
                {
                    Console.WriteLine("FileHash time: " + (DateTime.Now - start).ToString());
                    return CompareResult.FailedFileHash;
                }
            }
            else
            {
                Console.WriteLine("FileSize time: " + (DateTime.Now - start).ToString());
                return CompareResult.FailedFileSize;
            }
        }

        public static bool CompareFileSize(string a, string b)
        {
            if (new FileInfo(a).Length == new FileInfo(b).Length)
                return true;
            else
                return false;
        }
        public static bool CompareFileHash(string a, string b)
        {
            byte[] ah, bh;
            using (MD5 md5 = new MD5CryptoServiceProvider())
            {
                // Hash both files in full; the using blocks dispose the streams
                // and the MD5 instance even if an exception is thrown.
                using (FileStream fs = new FileStream(a, FileMode.Open, FileAccess.Read))
                    ah = md5.ComputeHash(fs);
                using (FileStream fs = new FileStream(b, FileMode.Open, FileAccess.Read))
                    bh = md5.ComputeHash(fs);
            }

            // MD5 digests are 16 bytes; any differing byte means no match.
            for (int i = 0; i < ah.Length; i++)
            {
                if (ah[i] != bh[i])
                    return false;
            }
            return true;
        }
        public static bool CompareFile(string a, string b)
        {
            const int BUFFER_SIZE = 4096;
            byte[] buffera = new byte[BUFFER_SIZE];
            byte[] bufferb = new byte[BUFFER_SIZE];

            FileStream fsa = new FileStream(a, FileMode.Open, FileAccess.Read);
            FileStream fsb = new FileStream(b, FileMode.Open, FileAccess.Read);

            int reada, readb;

            bool match = true;
            do
            {
                reada = fsa.Read(buffera, 0, BUFFER_SIZE);
                readb = fsb.Read(bufferb, 0, BUFFER_SIZE);

                // Same-size files on a local disk fill the buffer identically
                // except at the tail, so unequal read counts mean no match.
                if (reada != readb)
                {
                    match = false;
                    break;
                }

                // Only compare the bytes actually read this pass; comparing the
                // full buffer would touch stale data on the final short read.
                for (int i = 0; i < reada; i++)
                {
                    if (buffera[i] != bufferb[i])
                    {
                        match = false;
                        break;
                    }
                }

                if (!match)
                    break;
            } while ((reada != 0) && (readb != 0));

            fsa.Close();
            fsb.Close();

            return match;
        }
    }
}
Moral of the story: don't waste your time on hashes. FileSize -> 1:1 compare is best in every scenario.

#2 above should be replaced with a custom function that spot-checks the file (e.g. every 1024 bytes) instead of using hashes. If the spot-check doesn't find a difference, move on to the 1:1 — something like the sketch below.
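Quick sketch of that spot-check (untested; assumes the filesize check already passed, and the 1024-byte stride is arbitrary):

Code:
using System.IO;

static class SpotChecker
{
    // Probe one byte every 'stride' bytes instead of hashing the whole file.
    // A differing probe proves a mismatch instantly; if every probe agrees,
    // fall through to the full 1:1 compare to be 100% certain.
    public static bool SpotCheck(string a, string b, int stride)
    {
        using (FileStream fsa = File.OpenRead(a))
        using (FileStream fsb = File.OpenRead(b))
        {
            for (long pos = 0; pos < fsa.Length; pos += stride)
            {
                fsa.Seek(pos, SeekOrigin.Begin);
                fsb.Seek(pos, SeekOrigin.Begin);
                if (fsa.ReadByte() != fsb.ReadByte())
                    return false; // definite mismatch
            }
        }
        return true; // plausible match; confirm with the 1:1 compare
    }
}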
 

Kreij

Senior Monkey Moderator
Joined
Feb 6, 2007
Messages
13,817 (2.21/day)
Location
Cheeseland (Wisconsin, USA)
Makes sense to me. Sequential reading of a pair of arrays is fast (and of course, it doesn't have to run a cryptographic hash algorithm even if the whole file is read).

In the last scenario, it probably would be even faster, as two differing files are unlikely to be identical for the first 1 MiB either, so it would find the difference sooner.

Now get in there and optimize it more to make it even faster. :D

In a worst-case scenario (say, 5 files differing only in the last byte) you would have to do full file reads 10 times for the array method, but only 5 hash runs and then a hash List compare.
Seems in this case they would be a lot closer.
 

FordGT90Concept

"I go fast!1!11!1!"
Joined
Oct 13, 2008
Messages
26,259 (4.65/day)
Location
IA, USA
System Name BY-2021
Processor AMD Ryzen 7 5800X (65w eco profile)
Motherboard MSI B550 Gaming Plus
Cooling Scythe Mugen (rev 5)
Memory 2 x Kingston HyperX DDR4-3200 32 GiB
Video Card(s) AMD Radeon RX 7900 XT
Storage Samsung 980 Pro, Seagate Exos X20 TB 7200 RPM
Display(s) Nixeus NX-EDG274K (3840x2160@144 DP) + Samsung SyncMaster 906BW (1440x900@60 HDMI-DVI)
Case Coolermaster HAF 932 w/ USB 3.0 5.25" bay + USB 3.2 (A+C) 3.5" bay
Audio Device(s) Realtek ALC1150, Micca OriGen+
Power Supply Enermax Platimax 850w
Mouse Nixeus REVEL-X
Keyboard Tesoro Excalibur
Software Windows 10 Home 64-bit
Benchmark Scores Faster than the tortoise; slower than the hare.
Yeah, 90% of files would be different on offset 0, 9% at offset 3-31, and 1% in the rest of the file.


I dunno what I could do to make it faster except replace my 2x250 GB HDDs with an SSD. :p <10 seconds for a >500 MiB file is pretty fast.


In a worst-case scenario (say, 5 files differing only in the last byte) you would have to do full file reads 10 times for the array method, but only 5 hash runs and then a hash List compare.
Seems in this case they would be a lot closer.
True, which is why the spot-check function would be important. It could store the spot-check results for every file, making it much faster when comparing many files against many files.
 

Kreij

Senior Monkey Moderator
Joined
Feb 6, 2007
Messages
13,817 (2.21/day)
Location
Cheeseland (Wisconsin, USA)
Ford said:
I dunno what I could do to make it faster

Multi-threading FTW !!!
(Since disk access is the bottleneck, that would not really improve anything unless the files were on separate disks/controllers.)

BTW ... I still use Nettool all of the time. What a spectacular little utility. :rockout:
I'm using it today to help troubleshoot a workstation connection problem (logged into the work network from home through TeamViewer).
It's my go-to utility for tier 1 network troubles.
Tier 2 is a big hammer. :D
 

FordGT90Concept

"I go fast!1!11!1!"
Joined
Oct 13, 2008
Messages
26,259 (4.65/day)
Location
IA, USA
System Name BY-2021
Processor AMD Ryzen 7 5800X (65w eco profile)
Motherboard MSI B550 Gaming Plus
Cooling Scythe Mugen (rev 5)
Memory 2 x Kingston HyperX DDR4-3200 32 GiB
Video Card(s) AMD Radeon RX 7900 XT
Storage Samsung 980 Pro, Seagate Exos X20 TB 7200 RPM
Display(s) Nixeus NX-EDG274K (3840x2160@144 DP) + Samsung SyncMaster 906BW (1440x900@60 HDMI-DVI)
Case Coolermaster HAF 932 w/ USB 3.0 5.25" bay + USB 3.2 (A+C) 3.5" bay
Audio Device(s) Realtek ALC1150, Micca OriGen+
Power Supply Enermax Platimax 850w
Mouse Nixeus REVEL-X
Keyboard Tesoro Excalibur
Software Windows 10 Home 64-bit
Benchmark Scores Faster than the tortoise; slower than the hare.
Multithreading only benefits memory/CPU-bound tasks. Like I said before, the bottleneck is the HDD/SSD/RAID. Multithreading wouldn't improve performance.


Heh, I'm glad someone is using it. XD
 

TheMailMan78

Big Member
Joined
Jun 3, 2007
Messages
22,599 (3.68/day)
Location
'Merica. The Great SOUTH!
System Name TheMailbox 5.0 / The Mailbox 4.5
Processor RYZEN 1700X / Intel i7 2600k @ 4.2GHz
Motherboard Fatal1ty X370 Gaming K4 / Gigabyte Z77X-UP5 TH Intel LGA 1155
Cooling MasterLiquid PRO 280 / Scythe Katana 4
Memory ADATA RGB 16GB DDR4 2666 16-16-16-39 / G.SKILL Sniper Series 16GB DDR3 1866: 9-9-9-24
Video Card(s) MSI 1080 "Duke" with 8Gb of RAM. Boost Clock 1847 MHz / ASUS 780ti
Storage 256Gb M4 SSD / 128Gb Agelity 4 SSD , 500Gb WD (7200)
Display(s) LG 29" Class 21:9 UltraWide® IPS LED Monitor 2560 x 1080 / Dell 27"
Case Cooler Master MASTERBOX 5t / Cooler Master 922 HAF
Audio Device(s) Realtek ALC1220 Audio Codec / SupremeFX X-Fi with Bose Companion 2 speakers.
Power Supply Seasonic FOCUS Plus Series SSR-750PX 750W Platinum / SeaSonic X Series X650 Gold
Mouse SteelSeries Sensei (RAW) / Logitech G5
Keyboard Razer BlackWidow / Logitech (Unknown)
Software Windows 10 Pro (64-bit)
Benchmark Scores Benching is for bitches.
Yes, I taught myself C# (and all the rest of languages I know for that matter).
If you are familiar with OOP concepts and programming I would estimate you would be reasonably comfortable with it in a couple of weeks.

I tried to teach myself but I had no basic understanding of any of the concepts. It was like trying to paint without knowing what a paint brush was.
 