techPowerUp! Forums

Go Back   techPowerUp! Forums > Software > Programming & Webmastering

Reply
 
Thread Tools
Old Feb 12, 2012, 06:20 AM   #1
xbonez
1000 Posts
 
Join Date: Nov 2010
Location: Philly, PA (US)
Posts: 1,084 (1.19/day)
Thanks: 277
Thanked 323 Times in 197 Posts

System Specs

Identical File Finder

I needed to sort through a directory which contained about 500 or so files (many of which were identical in content), and pick out only the unique files from the directory. I ended up writing a short program in C# to do the task for me. Posting the executable and code below, should anyone ever require it.

Summary

The program iterates over all the files in the input directory and checks for uniqueness by calculating the MD5 hash of the file contents. All unique files are copied over to the output directory.

Executable: http://dl.dropbox.com/u/1276196/DND/...FileFinder.exe

Usage

The program expects two command line arguments.
The first argument is the input directory.
The second argument is the output directory where the unique files will get copied.

A trailing slash is optional.

Source

Code:
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;
using System.Text;

namespace IdenticalFileFinder
{
    class Program
    {
        static string baseDir = String.Empty;
        static string outputDir = String.Empty;

        static void Main(string[] args)
        {
            getArgs(args);
            normalizeDirectories();

            List<string> hashes = new List<string>();
            string[] files = Directory.GetFiles(baseDir);

            if (!Directory.Exists(outputDir))
            {
                try
                {
                    Directory.CreateDirectory(outputDir);
                }
                catch(Exception e)
                {
                    Console.WriteLine(e.ToString());
                    Environment.Exit((int)ExitScenarios.READWRITEERROR);
                }
            }

            foreach(string fileName in files)
            {
                string hash = GetMD5HashFromFile(fileName);

                if (!hashes.Contains(hash))
                {
                    string dest = outputDir + Path.GetFileName(fileName);
                    hashes.Add(hash);
                    try
                    {
                        File.Copy(fileName, dest);
                    }
                    catch (Exception e)
                    {
                        Console.WriteLine(e.ToString());
                        Environment.Exit((int)ExitScenarios.READWRITEERROR);
                    }
                }
            }
            Console.WriteLine(String.Format("{0} unique files found in {1} total files.",hashes.Count,files.Length));
            Environment.Exit((int)ExitScenarios.SUCCESS);
        }

        public static void normalizeDirectories()
        {            
            if (baseDir[baseDir.Length - 1] != '\\')
                baseDir += "\\";
            if (outputDir[outputDir.Length - 1] != '\\')
                outputDir += "\\";
        }       

        public static void getArgs(string[] args)
        {
            if (args.Length == 2)
            {
                baseDir = args[0];
                outputDir = args[1];
            }
            else
            {
                Console.WriteLine("Incorrect arguments passed in. First argument should be input directory and second should be output directory.");
                Environment.Exit((int)ExitScenarios.ARGFAIL);
            }

            if (!Directory.Exists(baseDir)) 
            {
                Console.WriteLine("Input directory not found.");
                Environment.Exit((int)ExitScenarios.ARGFAIL);
            }
        }

        static public string GetMD5HashFromFile(string fileName)
        {
            FileStream file = new FileStream(fileName, FileMode.Open);
            MD5 md5 = new MD5CryptoServiceProvider();
            byte[] retVal = md5.ComputeHash(file);
            file.Close();

            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < retVal.Length; i++)
            {
                sb.Append(retVal[i].ToString("x2"));
            }
            return sb.ToString();
        }
    }

    public enum ExitScenarios
    {
        SUCCESS,
        ARGFAIL,
        READWRITEERROR
    }
}

Last edited by xbonez; Feb 12, 2012 at 06:27 AM.
xbonez is offline  
Reply With Quote
The Following 4 Users Say Thank You to xbonez For This Useful Post:
Old Feb 17, 2012, 01:53 PM   #2
Kreij
Hardcore Monkey Moderator
 
Kreij's Avatar
 
Join Date: Feb 2007
Location: Cheeseland (Wisconsin, USA)
Posts: 12,133 (5.27/day)
Thanks: 591
Thanked 5,494 Times in 2,938 Posts

System Specs

What if two different files generate the same hash? (I know it's unlinkely, but it is possible).
__________________

Cloud (noun, singular): A dynamic arrangement of multiple potential single points of failure, with a user at one end and their data at the other.


Get more tech news on a wide variety of topics at NextPowerUp
Kreij is offline  
Reply With Quote
The Following User Says Thank You to Kreij For This Useful Post:
Old Feb 17, 2012, 02:49 PM   #3
FordGT90Concept
"I go fast!1!11!1!"
 
FordGT90Concept's Avatar
 
Join Date: Oct 2008
Location: IA, USA
Posts: 10,583 (6.28/day)
Thanks: 1,755
Thanked 2,601 Times in 1,963 Posts

System Specs

Neat idea but...I would recommend doing a 1:1 compare...

1. If filesize is equal, move on to step two; else, not identical.
2. Read one chunk (4096 bytes) from file A and file B into separate buffers.
3. Use a for loop to make sure all bytes are equal: fileAbuffer[i] == fileBbuffer[i]
4. If an inequality is ever discovered, break the loop, close the file streams, and output non-match. If they are equal, rince and repeat until End of File is reached.

Using this method, you could even have a percentage match option where you count the differences and use the filesize as the denominator in creating the percent. If not above a user specified percentage (100% if not given), reject it.


The method described above would likely be faster than md5.ComputeHash(file) on large files because it could stop on the first byte.
__________________
Golden Rule of Programming: Never assume.

try { SteamDownload(); }
catch (Steamception ex) { RageQuit(); }

Last edited by FordGT90Concept; Mar 20, 2012 at 11:19 PM.
FordGT90Concept is offline  
Crunching for Team TPU
Reply With Quote
The Following User Says Thank You to FordGT90Concept For This Useful Post:
Old Feb 20, 2012, 02:25 AM   #4
xbonez
1000 Posts
 
Join Date: Nov 2010
Location: Philly, PA (US)
Posts: 1,084 (1.19/day)
Thanks: 277
Thanked 323 Times in 197 Posts

System Specs

@Kreiji: It'll probably crash and burn in case of a hash collision..lol. I didn't cater for that.

@Ford: Those are some good suggestions. When I first wrote the program, I just needed to filter through the 500 files I had, and threw together the first thing that came to my mind, but I might decide to re-do it (taking some of your suggestions) and package it into a WPF app.
xbonez is offline  
Reply With Quote
Old Mar 9, 2012, 06:00 PM   #5
Dolph
75 Posts
 
Join Date: Feb 2012
Posts: 114 (0.24/day)
Thanks: 6
Thanked 11 Times in 11 Posts

I know this thread is a few weeks old, and correct me if im wrong but

Code:
     public static void normalizeDirectories()
        {            
            if (baseDir[baseDir.Length - 1] != '\\')
                baseDir += "\\";
            if (outputDir[outputDir.Length - 1] != '\\')
                outputDir += "\\";
        }
Shouldn't you be using a substring for that? Considering your taking the last character array of the string comparing it to a 2 character string ie '\\' then that will always be true. '\' != '\\' = true

In the case of the user entering in a pathing with the '\\' or even '\' it will add another '\\'

EX: program.exe c:\Projects\ d:\

woudln't that see it, and correct it to c:\Projects\\\ d:\\\ ?
Dolph is offline  
Reply With Quote
Old Mar 9, 2012, 08:23 PM   #6
FordGT90Concept
"I go fast!1!11!1!"
 
FordGT90Concept's Avatar
 
Join Date: Oct 2008
Location: IA, USA
Posts: 10,583 (6.28/day)
Thanks: 1,755
Thanked 2,601 Times in 1,963 Posts

System Specs

It is correct. That's how C# escapes backslashes. It makes sure that both directories always end with a backslash. .NET's directory methods like to omit it.
__________________
Golden Rule of Programming: Never assume.

try { SteamDownload(); }
catch (Steamception ex) { RageQuit(); }
FordGT90Concept is offline  
Crunching for Team TPU
Reply With Quote
Old Mar 10, 2012, 03:12 AM   #7
Dolph
75 Posts
 
Join Date: Feb 2012
Posts: 114 (0.24/day)
Thanks: 6
Thanked 11 Times in 11 Posts

Ahh, my appologies.
Dolph is offline  
Reply With Quote
Old Mar 19, 2012, 12:17 PM   #8
Aquinus
3500 Posts
 
Aquinus's Avatar
 
Join Date: Jan 2012
Location: Dover, New Hampshire, USA
Posts: 4,277 (8.84/day)
Thanks: 1,284
Thanked 1,336 Times in 992 Posts

System Specs

Where is the rest of the source? I might be blind, but I'm not seeing where the hash is generated. You could use a different hashing algorithm or a combination of two to mitigate collisions. Maybe an MD5+SHA1 or a SHA256. SHA256 will run fast on newer Sandy Bridge and practically any AMD chip (AMD seems to do encryption pretty well, even against SB.)

Also if the directory gets too big, the List interface may be inefficient to use. There are a lot of things you can do to speed this up, but I don't know how fast it is already. For *really* large directories, using set-sized arrays stored in 2d array along with threading to process even and odd items from one array could speed it up since not a whole lot of locking is required (just on the output "list".)

Good work though, its been some time since I've written some C#.

Have you stored any numbers to see how many files your application can do per second? You could use that to judge improvements or slowdowns in your application.

Last edited by Aquinus; Mar 19, 2012 at 01:40 PM.
Aquinus is offline  
Crunching for Team TPU
Reply With Quote
Old Mar 19, 2012, 12:24 PM   #9
Kreij
Hardcore Monkey Moderator
 
Kreij's Avatar
 
Join Date: Feb 2007
Location: Cheeseland (Wisconsin, USA)
Posts: 12,133 (5.27/day)
Thanks: 591
Thanked 5,494 Times in 2,938 Posts

System Specs

The hash is being generated in the GetMD5HashFromFile() method.

The odds of a hash collision are so infitesimally small it's not really worth testing for it.

Lists (the List class) is extremely fast for sorting and searching. What would you change Aquinus?
__________________

Cloud (noun, singular): A dynamic arrangement of multiple potential single points of failure, with a user at one end and their data at the other.


Get more tech news on a wide variety of topics at NextPowerUp
Kreij is offline  
Reply With Quote
Old Mar 19, 2012, 12:59 PM   #10
Aquinus
3500 Posts
 
Aquinus's Avatar
 
Join Date: Jan 2012
Location: Dover, New Hampshire, USA
Posts: 4,277 (8.84/day)
Thanks: 1,284
Thanked 1,336 Times in 992 Posts

System Specs

Quote:
Originally Posted by Kreij View Post
The hash is being generated in the GetMD5HashFromFile() method.

The odds of a hash collision are so infitesimally small it's not really worth testing for it.

Lists (the List class) is extremely fast for sorting and searching. What would you change Aquinus?
I would use a 2D array and define a new "row" of static size every time the previous row is filled. That way you're saving space in the list class and you're not resizing the array every time you exceed the maximum size. The overhead is small and you stick with just what you need. The List class has a lot of overhead to expand it beyond a standard array, where a regular array should do just as well.

If the op where to post the full source up, I wouldn't mind trying to alter it. I would have to install Visual Studio again and I wouldn't be able to get to it until I get back from work later today which is also dependent on how fussy the daughter is when I get home.

It is my understanding that the List is being used just to store values, not sort them in this instance. I'm just saying it can be optimized if it takes a long time for large sums of files.
Aquinus is offline  
Crunching for Team TPU
Reply With Quote
Old Mar 19, 2012, 01:58 PM   #11
Kreij
Hardcore Monkey Moderator
 
Kreij's Avatar
 
Join Date: Feb 2007
Location: Cheeseland (Wisconsin, USA)
Posts: 12,133 (5.27/day)
Thanks: 591
Thanked 5,494 Times in 2,938 Posts

System Specs

You are right, Aquinus, but it is implimentation dependant. You know this, but I'll add the details for anyone else who might read this ...

2D arrays are faster if the upper limit of the array is know before hand (count the number of files prior to creating the array), so no array resizing has to occur.
String[] myArray = new String[500];

Generic List<> is also faster if the bounds are known
List<string> myList = new List<string>(500) as opposed to just initializing an empty List.
The .net implementation of Lists is actually using an array behind the scenes.

Using for loops on both Lists and Arrays is faster than using foreach, but again the bounds needs to be know before hand.

Lists are a bit slower as they do bounds checking all the time, while in a predefined array no bounds checking is done (unless you code it yourself).
Lists also have a bit larger memory footprint than arrays.

The question then becomes is the performance increase enough to warrant the additional coding to impliment arrays over generic Lists. This is completely dependant upon the implimentation and what is occuring within the for loop. Complex calculations in the for loop may make the performance increase with arrays irrelevant (when each for iteration is analyzed).

Personally, I usually use Lists as the code is cleaner, but then I am rarely using them for thousands or millions of elements, and the difference would be some number of milliseconds.

On a side note, it looks to me like all the code is in the OP. Am I overlooking something?
__________________

Cloud (noun, singular): A dynamic arrangement of multiple potential single points of failure, with a user at one end and their data at the other.


Get more tech news on a wide variety of topics at NextPowerUp

Last edited by Kreij; Mar 19, 2012 at 02:04 PM.
Kreij is offline  
Reply With Quote
The Following User Says Thank You to Kreij For This Useful Post:
Old Mar 20, 2012, 12:57 PM   #12
Aquinus
3500 Posts
 
Aquinus's Avatar
 
Join Date: Jan 2012
Location: Dover, New Hampshire, USA
Posts: 4,277 (8.84/day)
Thanks: 1,284
Thanked 1,336 Times in 992 Posts

System Specs

Quote:
Originally Posted by Kreij View Post
You are right, Aquinus, but it is implimentation dependant. You know this, but I'll add the details for anyone else who might read this ...

2D arrays are faster if the upper limit of the array is know before hand (count the number of files prior to creating the array), so no array resizing has to occur.
String[] myArray = new String[500];

Generic List<> is also faster if the bounds are known
List<string> myList = new List<string>(500) as opposed to just initializing an empty List.
The .net implementation of Lists is actually using an array behind the scenes.

Using for loops on both Lists and Arrays is faster than using foreach, but again the bounds needs to be know before hand.

Lists are a bit slower as they do bounds checking all the time, while in a predefined array no bounds checking is done (unless you code it yourself).
Lists also have a bit larger memory footprint than arrays.

The question then becomes is the performance increase enough to warrant the additional coding to impliment arrays over generic Lists. This is completely dependant upon the implimentation and what is occuring within the for loop. Complex calculations in the for loop may make the performance increase with arrays irrelevant (when each for iteration is analyzed).

Personally, I usually use Lists as the code is cleaner, but then I am rarely using them for thousands or millions of elements, and the difference would be some number of milliseconds.

On a side note, it looks to me like all the code is in the OP. Am I overlooking something?
Great post. You're right, my stupid mac just likes hiding the scroll bars on textareas.
Aquinus is offline  
Crunching for Team TPU
Reply With Quote
Old Mar 20, 2012, 11:25 PM   #13
FordGT90Concept
"I go fast!1!11!1!"
 
FordGT90Concept's Avatar
 
Join Date: Oct 2008
Location: IA, USA
Posts: 10,583 (6.28/day)
Thanks: 1,755
Thanked 2,601 Times in 1,963 Posts

System Specs

Quote:
Originally Posted by Kreij View Post
Personally, I usually use Lists as the code is cleaner, but then I am rarely using them for thousands or millions of elements, and the difference would be some number of milliseconds.
I used lists for animation tables having 32,000+ ushort values. It was quick. I used the default start size too (I think it's 64). Performance is getting less and less important when memory is so fast and plentiful. XD


Quote:
Originally Posted by Kreij View Post
On a side note, it looks to me like all the code is in the OP. Am I overlooking something?
Yeah, the haystack (folders/files)!
__________________
Golden Rule of Programming: Never assume.

try { SteamDownload(); }
catch (Steamception ex) { RageQuit(); }
FordGT90Concept is offline  
Crunching for Team TPU
Reply With Quote
Old Mar 21, 2012, 03:42 AM   #14
xbonez
1000 Posts
 
Join Date: Nov 2010
Location: Philly, PA (US)
Posts: 1,084 (1.19/day)
Thanks: 277
Thanked 323 Times in 197 Posts

System Specs

I tend to stay away from micro-optimizations. They are almost always pointless unless you work in an extremely time sensitive context.

As for using another hashing algorithm: The SHA2 family is a better hashing algorithm for security purposes, primarily because it is not as fast as MD5. This makes brute forcing or generating rainbow tables for SHA2 hashes much slower. SHA2, however, also has the possibility of hash collisions.

So far, there does not exist a perfect hashing algorithm that guarantees no collisions over an infinitely large sample set.

Quote:
"More computing sins are committed in the name of efficiency (without necessarily achieving it) than for any other single reason — including blind stupidity." — W.A. Wulf
Quote:
"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil." — Donald Knuth

Last edited by xbonez; Mar 21, 2012 at 03:59 AM.
xbonez is offline  
Reply With Quote
Old Mar 21, 2012, 12:13 PM   #15
Aquinus
3500 Posts
 
Aquinus's Avatar
 
Join Date: Jan 2012
Location: Dover, New Hampshire, USA
Posts: 4,277 (8.84/day)
Thanks: 1,284
Thanked 1,336 Times in 992 Posts

System Specs

Quote:
Originally Posted by xbonez View Post
I tend to stay away from micro-optimizations. They are almost always pointless unless you work in an extremely time sensitive context.

As for using another hashing algorithm: The SHA2 family is a better hashing algorithm for security purposes, primarily because it is not as fast as MD5. This makes brute forcing or generating rainbow tables for SHA2 hashes much slower. SHA2, however, also has the possibility of hash collisions.

So far, there does not exist a perfect hashing algorithm that guarantees no collisions over an infinitely large sample set.
It depends on what kind of data you're working with and what language you're using. I mostly write PHP and a lot of SQL. Now the PHP usually runs fast enough unless you're going through a large array like a huge sql result set, but on the other hand I could have SQL that take a long time to run because I might be joining 6 or 8 very large tables to each other. So optimizations will depends, which is why I asked how many files he was comparing because if you're doing 100,000+ files the method that lists search may be inefficient for what you're trying to do.

But you're right, premature optimizations are usually a bad idea.

Also, nice quote by Donald Knuth.
Aquinus is offline  
Crunching for Team TPU
Reply With Quote
Old Mar 21, 2012, 12:35 PM   #16
djxinator
25 Posts
 
djxinator's Avatar
 
Join Date: Mar 2012
Posts: 35 (0.08/day)
Thanks: 0
Thanked 10 Times in 6 Posts

System Specs

I know its a bit off topic, and I have pretty much NO IDEA whats going on in here but its sparked up an old interest of mine.

Is it feasible to self-teach C#, and if so, how long to get a basic understanding and foundation for creating simple yet useful applications?
djxinator is offline  
Reply With Quote
Old Mar 21, 2012, 12:48 PM   #17
Kreij
Hardcore Monkey Moderator
 
Kreij's Avatar
 
Join Date: Feb 2007
Location: Cheeseland (Wisconsin, USA)
Posts: 12,133 (5.27/day)
Thanks: 591
Thanked 5,494 Times in 2,938 Posts

System Specs

Yes, I taught myself C# (and all the rest of languages I know for that matter).
If you are familiar with OOP concepts and programming I would estimate you would be reasonably comfortable with it in a couple of weeks.
__________________

Cloud (noun, singular): A dynamic arrangement of multiple potential single points of failure, with a user at one end and their data at the other.


Get more tech news on a wide variety of topics at NextPowerUp
Kreij is offline  
Reply With Quote
Old Mar 21, 2012, 01:13 PM   #18
FordGT90Concept
"I go fast!1!11!1!"
 
FordGT90Concept's Avatar
 
Join Date: Oct 2008
Location: IA, USA
Posts: 10,583 (6.28/day)
Thanks: 1,755
Thanked 2,601 Times in 1,963 Posts

System Specs

Quote:
Originally Posted by Aquinus View Post
So optimizations will depends, which is why I asked how many files he was comparing because if you're doing 100,000+ files the method that lists search may be inefficient for what you're trying to do.
The hard drive/solid state drive itself is the bottleneck, not the code. You could spend days improving the code but performance gains will be virtually nonexistant when 99.9% of the time spent reading the files.


The most efficient method of comparison (100% match) would be a hybrid:
1) check filesize (file system look ups are fast and simple)
2) hash (hashes spot check the file and take a snapshot of it)
3) if there is a hash collision (match), do 1:1 compare to be 100% certain it is a match.

This is slower than performing a single check but it is faster than always reading the entire file too.
__________________
Golden Rule of Programming: Never assume.

try { SteamDownload(); }
catch (Steamception ex) { RageQuit(); }

Last edited by FordGT90Concept; Mar 21, 2012 at 01:19 PM.
FordGT90Concept is offline  
Crunching for Team TPU
Reply With Quote
Old Mar 21, 2012, 01:53 PM   #19
Kreij
Hardcore Monkey Moderator
 
Kreij's Avatar
 
Join Date: Feb 2007
Location: Cheeseland (Wisconsin, USA)
Posts: 12,133 (5.27/day)
Thanks: 591
Thanked 5,494 Times in 2,938 Posts

System Specs

@Ford
I don't think that step #3 is even needed. The probabiliy of two MD5 hashes from differing files colliding is what, 2^-128 ?
Even if your were hashing a million files the odds of a hash collision on differing files is astronomically small.
__________________

Cloud (noun, singular): A dynamic arrangement of multiple potential single points of failure, with a user at one end and their data at the other.


Get more tech news on a wide variety of topics at NextPowerUp
Kreij is offline  
Reply With Quote
Old Mar 21, 2012, 02:19 PM   #20
FordGT90Concept
"I go fast!1!11!1!"
 
FordGT90Concept's Avatar
 
Join Date: Oct 2008
Location: IA, USA
Posts: 10,583 (6.28/day)
Thanks: 1,755
Thanked 2,601 Times in 1,963 Posts

System Specs

Actually, i just did a test using points 1-3 above. Times for each method:
1) 0.0010000 seconds
2) 9.4845425 seconds
3) 8.9955145 seconds
Result = Match

Second run just to make sure it is no flook:
1) 0.0000000 seconds
2) 10.6946117 seconds
3) 8.1694673 seconds
Result = Match

Both files are 594,778,936 bytes (567 MiB). Original filename: C:\Program Files (x86)\Mass Effect 3\BIOGame\DLC\DLC_HEN_PR\CookedPCConsole\Default.s far

I'm going to change 1 byte and see if that impacts performance. I will have to test 2 and 3 separately because 2 won't fall to 3 if the hash isn't a match. In order to give 2 the best possible chance, the byte I modify will be at the very end so 3 must process the entire file to find it (when in actuallity, it could stop on the first byte)...

File: b
Offset: 594778935
Old Value: 0x7B
New Value: 0x00

1 & 2 test:
1) 0.0000000 seconds
2) 26.9975442 seconds
Result = FailedFileHash

1 & 3 test:
1) 0.0010000 seconds
3) 12.2507007 seconds
Result = FailedFile

That's no joke, it's 220% faster and 100% accurate every time!


To prove that method 3 can return no match in an instant, I'm undoing the change above and adding the following:
File: b
Offset: 1048576 (1 MiB in)
Old Value: 0xB0
New Value: 0x00

1 & 2 test:
1) 0.0010001 seconds
2) 24.3733941 seconds
Result = FailedFileHash

1 & 3 test:
1) 0.0010001 seconds
3) 0.0610035 seconds
Result = FailedFile

Hash. Got. Owned. 1:1 compare with break was 39,954% faster!


Code:
using System;
using System.IO;
using System.Security.Cryptography;

namespace ConsoleApplication1
{
    public enum CompareResult : sbyte
    {
        Match = 0,
        FailedFileSize = -1,
        FailedFileHash = -2,
        FailedFile = -4
    }
    class Program
    {
        static void Main(string[] args)
        {
            string a = @"C:\Users\Admin\Desktop\a";
            string b = @"C:\Users\Admin\Desktop\b";

            DateTime start = DateTime.Now;
            Console.WriteLine("       Result: " + DoCompare(a, b).ToString());
            Console.WriteLine("   Total time: " + (DateTime.Now - start).ToString());
            Console.ReadKey();
        }

        public static CompareResult DoCompare(string a, string b)
        {
            DateTime start = DateTime.Now;
            if (CompareFileSize(a, b))
            {
                Console.WriteLine("FileSize time: " + (DateTime.Now - start).ToString());

                start = DateTime.Now;
                if (CompareFileHash(a, b))
                {
                    Console.WriteLine("FileHash time: " + (DateTime.Now - start).ToString());

                    start = DateTime.Now;
                    if (CompareFile(a, b))
                    {
                        Console.WriteLine("    File time: " + (DateTime.Now - start).ToString());
                        return CompareResult.Match;
                    }
                    else
                    {
                        Console.WriteLine("    File time: " + (DateTime.Now - start).ToString());
                        return CompareResult.FailedFile;
                    }
                }
                else
                {
                    Console.WriteLine("FileHash time: " + (DateTime.Now - start).ToString());
                    return CompareResult.FailedFileHash;
                }
            }
            else
            {
                Console.WriteLine("FileSize time: " + (DateTime.Now - start).ToString());
                return CompareResult.FailedFileSize;
            }
        }

        public static bool CompareFileSize(string a, string b)
        {
            if (new FileInfo(a).Length == new FileInfo(b).Length)
                return true;
            else
                return false;
        }
        public static bool CompareFileHash(string a, string b)
        {
            MD5 md5 = new MD5CryptoServiceProvider();

            FileStream fs = new FileStream(a, FileMode.Open, FileAccess.Read);
            byte[] ah = md5.ComputeHash(fs);
            fs.Close();

            fs = new FileStream(b, FileMode.Open, FileAccess.Read);
            byte[] bh = md5.ComputeHash(fs);
            fs.Close();

            for (byte i = 0; i < ah.Length; i++)
            {
                if (ah[i] != bh[i])
                    return false;
            }
            return true;
        }
        public static bool CompareFile(string a, string b)
        {
            const int BUFFER_SIZE = 4096;
            byte[] buffera = new byte[BUFFER_SIZE];
            byte[] bufferb = new byte[BUFFER_SIZE];

            FileStream fsa = new FileStream(a, FileMode.Open, FileAccess.Read);
            FileStream fsb = new FileStream(b, FileMode.Open, FileAccess.Read);

            long reada, readb;

            bool match = true;
            do
            {
                reada = fsa.Read(buffera, 0, BUFFER_SIZE);
                readb = fsb.Read(bufferb, 0, BUFFER_SIZE);

                for (short i = 0; i < BUFFER_SIZE; i++)
                {
                    if (buffera[i] != bufferb[i])
                    {
                        match = false;
                        break;
                    }
                }

                if (!match)
                    break;
            } while ((reada != 0) && (readb != 0));

            fsa.Close();
            fsb.Close();

            return match;
        }
    }
}
Morale of the story: don't waste your time on hashes. FileSize -> 1:1 compare is best in every scenario.

#2 above should be replaced with a custom function that spot-checks the file (e.g. every 1024 bytes) instead of using hashes. If the spot-check doesn't find a difference, move on to the 1:1.
__________________
Golden Rule of Programming: Never assume.

try { SteamDownload(); }
catch (Steamception ex) { RageQuit(); }

Last edited by FordGT90Concept; Mar 21, 2012 at 02:32 PM.
FordGT90Concept is offline  
Crunching for Team TPU
Reply With Quote
The Following User Says Thank You to FordGT90Concept For This Useful Post:
Old Mar 21, 2012, 02:41 PM   #21
Kreij
Hardcore Monkey Moderator
 
Kreij's Avatar
 
Join Date: Feb 2007
Location: Cheeseland (Wisconsin, USA)
Posts: 12,133 (5.27/day)
Thanks: 591
Thanked 5,494 Times in 2,938 Posts

System Specs

Makes sense to me. Sequential reading of a pair of arrays is fast (and of course, doesn't have to do an encryption algorithm even if the whole file is read).

In the last scenario, it probably would be even faster as the odds of two differing files being identical for the 1MiB is unlikely also and it would find the difference sooner.

Now get in there and optimize it more to make it even faster.

In a worse case scenario (let's say 5 files with a differing last byte) you would have to do full file reads 10 times for the array method, but only 5 hash runs and then do a hash List compare.
Seems in this case they would be a lot closer.
__________________

Cloud (noun, singular): A dynamic arrangement of multiple potential single points of failure, with a user at one end and their data at the other.


Get more tech news on a wide variety of topics at NextPowerUp

Last edited by Kreij; Mar 21, 2012 at 02:51 PM.
Kreij is offline  
Reply With Quote
Old Mar 21, 2012, 02:50 PM   #22
FordGT90Concept
"I go fast!1!11!1!"
 
FordGT90Concept's Avatar
 
Join Date: Oct 2008
Location: IA, USA
Posts: 10,583 (6.28/day)
Thanks: 1,755
Thanked 2,601 Times in 1,963 Posts

System Specs

Yeah, 90% of files would be different on offset 0, 9% at offset 3-31, and 1% in the rest of the file.


I dunno what I could do to make it faster except replace my 2x250 GB HDDs for an SSD. <10 seconds for a >500 MiB file is pretty fast.


Quote:
Originally Posted by Kreij View Post
In a worse case scenario (let's say 5 files with a differing last byte) you would have to do full file reads 10 times for the array method, but only 5 hash runs and then do a hash List compare.
Seems in this case they would be a lot closer.
True, which is why the spot-check function would be important. It could store the spot-check for every file making it much faster when comparing many files to many files.
__________________
Golden Rule of Programming: Never assume.

try { SteamDownload(); }
catch (Steamception ex) { RageQuit(); }
FordGT90Concept is offline  
Crunching for Team TPU
Reply With Quote
Old Mar 21, 2012, 05:57 PM   #23
Kreij
Hardcore Monkey Moderator
 
Kreij's Avatar
 
Join Date: Feb 2007
Location: Cheeseland (Wisconsin, USA)
Posts: 12,133 (5.27/day)
Thanks: 591
Thanked 5,494 Times in 2,938 Posts

System Specs

Quote:
Originally Posted by Ford
I dunno what I could do to make it faster
Multi-threading FTW !!!
(since disk access if the bottleneck that would not really improve anything unless the files were on seperate disks/controllers.)

BTW ... I still use Nettool all of the time. What a spectacular little utility.
I'm using it today to help troubleshoot a workstation connections problem (logged into work network from home through TeamViewer).
It's my goto utility for tier 1 network troubles.
Tier 2 is a big hammer.
__________________

Cloud (noun, singular): A dynamic arrangement of multiple potential single points of failure, with a user at one end and their data at the other.


Get more tech news on a wide variety of topics at NextPowerUp
Kreij is offline  
Reply With Quote
Old Mar 21, 2012, 06:13 PM   #24
FordGT90Concept
"I go fast!1!11!1!"
 
FordGT90Concept's Avatar
 
Join Date: Oct 2008
Location: IA, USA
Posts: 10,583 (6.28/day)
Thanks: 1,755
Thanked 2,601 Times in 1,963 Posts

System Specs

Multithreading only benefits memory/CPU tasks. Like I said before, the bottleneck is the HDD/SDD/RAID. Multithreading wouldn't improve performance.


Heh, I'm glad someone is using it. XD
__________________
Golden Rule of Programming: Never assume.

try { SteamDownload(); }
catch (Steamception ex) { RageQuit(); }
FordGT90Concept is offline  
Crunching for Team TPU
Reply With Quote
Old Apr 18, 2012, 07:19 PM   #25
TheMailMan78
Banstick Dummy
 
TheMailMan78's Avatar
 
Join Date: Jun 2007
Location: Crystal River, FL
Posts: 15,112 (6.92/day)
Thanks: 1,337
Thanked 6,834 Times in 3,741 Posts

System Specs

Quote:
Originally Posted by Kreij View Post
Yes, I taught myself C# (and all the rest of languages I know for that matter).
If you are familiar with OOP concepts and programming I would estimate you would be reasonably comfortable with it in a couple of weeks.
I tried to teach myself but I had no basic understanding of any of the concepts. It was like trying to paint without knowing what a paint brush was.
TheMailMan78 is offline  
Reply With Quote
Reply


Currently Active Users Viewing This Thread: 1 (0 members and 1 guests)
 
Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off

Forum Jump

Similar Threads
Thread Thread Starter Forum Replies Last Post
Driver Finder Solaris17 Programming & Webmastering 168 Dec 31, 2011 02:38 AM
2 identical LCDs, problems Fitseries3 General Hardware 24 Sep 29, 2010 03:46 PM
[WTB] 2x Identical Sata HDDs Tau Buy/Sell/Trade/Giveaway Forum 0 May 6, 2009 11:06 AM
Core i7 Processors Now Listed in Intel sSpec Finder AuDioFreaK39 General Hardware 0 Nov 16, 2008 08:08 AM
Ati tools artifact finder.... bios ATITool 2 Nov 9, 2004 06:16 AM


All times are GMT. The time now is 06:14 AM.


Powered by vBulletin® Version 3.8.6
Copyright ©2000 - 2013, Jelsoft Enterprises Ltd.
no new posts