![]() |
|
|
#1 |
![]() Join Date: Nov 2010
Location: Philly, PA (US)
Posts: 1,084 (1.19/day)
Thanks: 277
Thanked 323 Times in 197 Posts
|
Identical File Finder
I needed to sort through a directory which contained about 500 or so files (many of which were identical in content), and pick out only the unique files from the directory. I ended up writing a short program in C# to do the task for me. Posting the executable and code below, should anyone ever require it.
Summary The program iterates over all the files in the input directory and checks for uniqueness by calculating the MD5 hash of the file contents. All unique files are copied over to the output directory. Executable: http://dl.dropbox.com/u/1276196/DND/...FileFinder.exe Usage The program expects two command line arguments. The first argument is the input directory. The second argument is the output directory where the unique files will get copied. A trailing slash is optional. Source Code:
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;
using System.Text;
namespace IdenticalFileFinder
{
class Program
{
static string baseDir = String.Empty;
static string outputDir = String.Empty;
static void Main(string[] args)
{
getArgs(args);
normalizeDirectories();
List<string> hashes = new List<string>();
string[] files = Directory.GetFiles(baseDir);
if (!Directory.Exists(outputDir))
{
try
{
Directory.CreateDirectory(outputDir);
}
catch(Exception e)
{
Console.WriteLine(e.ToString());
Environment.Exit((int)ExitScenarios.READWRITEERROR);
}
}
foreach(string fileName in files)
{
string hash = GetMD5HashFromFile(fileName);
if (!hashes.Contains(hash))
{
string dest = outputDir + Path.GetFileName(fileName);
hashes.Add(hash);
try
{
File.Copy(fileName, dest);
}
catch (Exception e)
{
Console.WriteLine(e.ToString());
Environment.Exit((int)ExitScenarios.READWRITEERROR);
}
}
}
Console.WriteLine(String.Format("{0} unique files found in {1} total files.",hashes.Count,files.Length));
Environment.Exit((int)ExitScenarios.SUCCESS);
}
public static void normalizeDirectories()
{
if (baseDir[baseDir.Length - 1] != '\\')
baseDir += "\\";
if (outputDir[outputDir.Length - 1] != '\\')
outputDir += "\\";
}
public static void getArgs(string[] args)
{
if (args.Length == 2)
{
baseDir = args[0];
outputDir = args[1];
}
else
{
Console.WriteLine("Incorrect arguments passed in. First argument should be input directory and second should be output directory.");
Environment.Exit((int)ExitScenarios.ARGFAIL);
}
if (!Directory.Exists(baseDir))
{
Console.WriteLine("Input directory not found.");
Environment.Exit((int)ExitScenarios.ARGFAIL);
}
}
static public string GetMD5HashFromFile(string fileName)
{
FileStream file = new FileStream(fileName, FileMode.Open);
MD5 md5 = new MD5CryptoServiceProvider();
byte[] retVal = md5.ComputeHash(file);
file.Close();
StringBuilder sb = new StringBuilder();
for (int i = 0; i < retVal.Length; i++)
{
sb.Append(retVal[i].ToString("x2"));
}
return sb.ToString();
}
}
public enum ExitScenarios
{
SUCCESS,
ARGFAIL,
READWRITEERROR
}
}
Last edited by xbonez; Feb 12, 2012 at 06:27 AM. |
|
|
|
| The Following 4 Users Say Thank You to xbonez For This Useful Post: |
|
|
#2 |
|
Hardcore Monkey Moderator
Join Date: Feb 2007
Location: Cheeseland (Wisconsin, USA)
Posts: 12,133 (5.27/day)
Thanks: 591
Thanked 5,494 Times in 2,938 Posts
|
What if two different files generate the same hash? (I know it's unlinkely, but it is possible).
__________________
Cloud (noun, singular): A dynamic arrangement of multiple potential single points of failure, with a user at one end and their data at the other. Get more tech news on a wide variety of topics at NextPowerUp
|
|
|
|
| The Following User Says Thank You to Kreij For This Useful Post: |
|
|
#3 |
|
"I go fast!1!11!1!"
Join Date: Oct 2008
Location: IA, USA
Posts: 10,583 (6.28/day)
Thanks: 1,755
Thanked 2,601 Times in 1,963 Posts
|
Neat idea but...I would recommend doing a 1:1 compare...
1. If filesize is equal, move on to step two; else, not identical. 2. Read one chunk (4096 bytes) from file A and file B into separate buffers. 3. Use a for loop to make sure all bytes are equal: fileAbuffer[i] == fileBbuffer[i] 4. If an inequality is ever discovered, break the loop, close the file streams, and output non-match. If they are equal, rince and repeat until End of File is reached. Using this method, you could even have a percentage match option where you count the differences and use the filesize as the denominator in creating the percent. If not above a user specified percentage (100% if not given), reject it. The method described above would likely be faster than md5.ComputeHash(file) on large files because it could stop on the first byte.
__________________
Golden Rule of Programming: Never assume. try { SteamDownload(); } catch (Steamception ex) { RageQuit(); } Last edited by FordGT90Concept; Mar 20, 2012 at 11:19 PM. |
|
|
|
| The Following User Says Thank You to FordGT90Concept For This Useful Post: |
|
|
#4 |
![]() Join Date: Nov 2010
Location: Philly, PA (US)
Posts: 1,084 (1.19/day)
Thanks: 277
Thanked 323 Times in 197 Posts
|
@Kreiji: It'll probably crash and burn in case of a hash collision..lol. I didn't cater for that.
@Ford: Those are some good suggestions. When I first wrote the program, I just needed to filter through the 500 files I had, and threw together the first thing that came to my mind, but I might decide to re-do it (taking some of your suggestions) and package it into a WPF app. |
|
|
|
|
|
#5 |
![]() Join Date: Feb 2012
Posts: 114 (0.24/day)
Thanks: 6
Thanked 11 Times in 11 Posts
|
I know this thread is a few weeks old, and correct me if im wrong but
Code:
public static void normalizeDirectories()
{
if (baseDir[baseDir.Length - 1] != '\\')
baseDir += "\\";
if (outputDir[outputDir.Length - 1] != '\\')
outputDir += "\\";
}
In the case of the user entering in a pathing with the '\\' or even '\' it will add another '\\' EX: program.exe c:\Projects\ d:\ woudln't that see it, and correct it to c:\Projects\\\ d:\\\ ? |
|
|
|
|
|
#6 |
|
"I go fast!1!11!1!"
Join Date: Oct 2008
Location: IA, USA
Posts: 10,583 (6.28/day)
Thanks: 1,755
Thanked 2,601 Times in 1,963 Posts
|
It is correct. That's how C# escapes backslashes. It makes sure that both directories always end with a backslash. .NET's directory methods like to omit it.
__________________
Golden Rule of Programming: Never assume. try { SteamDownload(); } catch (Steamception ex) { RageQuit(); } |
|
|
|
|
|
#7 |
![]() Join Date: Feb 2012
Posts: 114 (0.24/day)
Thanks: 6
Thanked 11 Times in 11 Posts
|
Ahh, my appologies.
|
|
|
|
|
|
#8 |
![]() Join Date: Jan 2012
Location: Dover, New Hampshire, USA
Posts: 4,277 (8.84/day)
Thanks: 1,284
Thanked 1,336 Times in 992 Posts
|
Where is the rest of the source? I might be blind, but I'm not seeing where the hash is generated. You could use a different hashing algorithm or a combination of two to mitigate collisions. Maybe an MD5+SHA1 or a SHA256. SHA256 will run fast on newer Sandy Bridge and practically any AMD chip (AMD seems to do encryption pretty well, even against SB.)
Also if the directory gets too big, the List interface may be inefficient to use. There are a lot of things you can do to speed this up, but I don't know how fast it is already. For *really* large directories, using set-sized arrays stored in 2d array along with threading to process even and odd items from one array could speed it up since not a whole lot of locking is required (just on the output "list".) Good work though, its been some time since I've written some C#. Have you stored any numbers to see how many files your application can do per second? You could use that to judge improvements or slowdowns in your application. Last edited by Aquinus; Mar 19, 2012 at 01:40 PM. |
|
|
|
|
|
#9 |
|
Hardcore Monkey Moderator
Join Date: Feb 2007
Location: Cheeseland (Wisconsin, USA)
Posts: 12,133 (5.27/day)
Thanks: 591
Thanked 5,494 Times in 2,938 Posts
|
The hash is being generated in the GetMD5HashFromFile() method.
The odds of a hash collision are so infitesimally small it's not really worth testing for it. Lists (the List class) is extremely fast for sorting and searching. What would you change Aquinus?
__________________
Cloud (noun, singular): A dynamic arrangement of multiple potential single points of failure, with a user at one end and their data at the other. Get more tech news on a wide variety of topics at NextPowerUp
|
|
|
|
|
|
#10 | |
![]() Join Date: Jan 2012
Location: Dover, New Hampshire, USA
Posts: 4,277 (8.84/day)
Thanks: 1,284
Thanked 1,336 Times in 992 Posts
|
Quote:
If the op where to post the full source up, I wouldn't mind trying to alter it. I would have to install Visual Studio again and I wouldn't be able to get to it until I get back from work later today which is also dependent on how fussy the daughter is when I get home. It is my understanding that the List is being used just to store values, not sort them in this instance. I'm just saying it can be optimized if it takes a long time for large sums of files. |
|
|
|
|
|
|
#11 |
|
Hardcore Monkey Moderator
Join Date: Feb 2007
Location: Cheeseland (Wisconsin, USA)
Posts: 12,133 (5.27/day)
Thanks: 591
Thanked 5,494 Times in 2,938 Posts
|
You are right, Aquinus, but it is implimentation dependant. You know this, but I'll add the details for anyone else who might read this ...
2D arrays are faster if the upper limit of the array is know before hand (count the number of files prior to creating the array), so no array resizing has to occur. String[] myArray = new String[500]; Generic List<> is also faster if the bounds are known List<string> myList = new List<string>(500) as opposed to just initializing an empty List. The .net implementation of Lists is actually using an array behind the scenes. Using for loops on both Lists and Arrays is faster than using foreach, but again the bounds needs to be know before hand. Lists are a bit slower as they do bounds checking all the time, while in a predefined array no bounds checking is done (unless you code it yourself). Lists also have a bit larger memory footprint than arrays. The question then becomes is the performance increase enough to warrant the additional coding to impliment arrays over generic Lists. This is completely dependant upon the implimentation and what is occuring within the for loop. Complex calculations in the for loop may make the performance increase with arrays irrelevant (when each for iteration is analyzed). Personally, I usually use Lists as the code is cleaner, but then I am rarely using them for thousands or millions of elements, and the difference would be some number of milliseconds. On a side note, it looks to me like all the code is in the OP. Am I overlooking something?
__________________
Cloud (noun, singular): A dynamic arrangement of multiple potential single points of failure, with a user at one end and their data at the other. Get more tech news on a wide variety of topics at NextPowerUp
Last edited by Kreij; Mar 19, 2012 at 02:04 PM. |
|
|
|
| The Following User Says Thank You to Kreij For This Useful Post: |
|
|
#12 | |
![]() Join Date: Jan 2012
Location: Dover, New Hampshire, USA
Posts: 4,277 (8.84/day)
Thanks: 1,284
Thanked 1,336 Times in 992 Posts
|
Quote:
|
|
|
|
|
|
|
#13 | ||
|
"I go fast!1!11!1!"
Join Date: Oct 2008
Location: IA, USA
Posts: 10,583 (6.28/day)
Thanks: 1,755
Thanked 2,601 Times in 1,963 Posts
|
Quote:
Quote:
__________________
Golden Rule of Programming: Never assume. try { SteamDownload(); } catch (Steamception ex) { RageQuit(); } |
||
|
|
|
|
|
#14 | ||
![]() Join Date: Nov 2010
Location: Philly, PA (US)
Posts: 1,084 (1.19/day)
Thanks: 277
Thanked 323 Times in 197 Posts
|
I tend to stay away from micro-optimizations. They are almost always pointless unless you work in an extremely time sensitive context.
As for using another hashing algorithm: The SHA2 family is a better hashing algorithm for security purposes, primarily because it is not as fast as MD5. This makes brute forcing or generating rainbow tables for SHA2 hashes much slower. SHA2, however, also has the possibility of hash collisions. So far, there does not exist a perfect hashing algorithm that guarantees no collisions over an infinitely large sample set. Quote:
Quote:
Last edited by xbonez; Mar 21, 2012 at 03:59 AM. |
||
|
|
|
|
|
#15 | |
![]() Join Date: Jan 2012
Location: Dover, New Hampshire, USA
Posts: 4,277 (8.84/day)
Thanks: 1,284
Thanked 1,336 Times in 992 Posts
|
Quote:
But you're right, premature optimizations are usually a bad idea. Also, nice quote by Donald Knuth. |
|
|
|
|
|
|
#16 |
![]() |
I know its a bit off topic, and I have pretty much NO IDEA whats going on in here but its sparked up an old interest of mine.
Is it feasible to self-teach C#, and if so, how long to get a basic understanding and foundation for creating simple yet useful applications? |
|
|
|
|
|
#17 |
|
Hardcore Monkey Moderator
Join Date: Feb 2007
Location: Cheeseland (Wisconsin, USA)
Posts: 12,133 (5.27/day)
Thanks: 591
Thanked 5,494 Times in 2,938 Posts
|
Yes, I taught myself C# (and all the rest of languages I know for that matter).
If you are familiar with OOP concepts and programming I would estimate you would be reasonably comfortable with it in a couple of weeks.
__________________
Cloud (noun, singular): A dynamic arrangement of multiple potential single points of failure, with a user at one end and their data at the other. Get more tech news on a wide variety of topics at NextPowerUp
|
|
|
|
|
|
#18 | |
|
"I go fast!1!11!1!"
Join Date: Oct 2008
Location: IA, USA
Posts: 10,583 (6.28/day)
Thanks: 1,755
Thanked 2,601 Times in 1,963 Posts
|
Quote:
The most efficient method of comparison (100% match) would be a hybrid: 1) check filesize (file system look ups are fast and simple) 2) hash (hashes spot check the file and take a snapshot of it) 3) if there is a hash collision (match), do 1:1 compare to be 100% certain it is a match. This is slower than performing a single check but it is faster than always reading the entire file too.
__________________
Golden Rule of Programming: Never assume. try { SteamDownload(); } catch (Steamception ex) { RageQuit(); } Last edited by FordGT90Concept; Mar 21, 2012 at 01:19 PM. |
|
|
|
|
|
|
#19 |
|
Hardcore Monkey Moderator
Join Date: Feb 2007
Location: Cheeseland (Wisconsin, USA)
Posts: 12,133 (5.27/day)
Thanks: 591
Thanked 5,494 Times in 2,938 Posts
|
@Ford
I don't think that step #3 is even needed. The probabiliy of two MD5 hashes from differing files colliding is what, 2^-128 ? Even if your were hashing a million files the odds of a hash collision on differing files is astronomically small.
__________________
Cloud (noun, singular): A dynamic arrangement of multiple potential single points of failure, with a user at one end and their data at the other. Get more tech news on a wide variety of topics at NextPowerUp
|
|
|
|
|
|
#20 |
|
"I go fast!1!11!1!"
Join Date: Oct 2008
Location: IA, USA
Posts: 10,583 (6.28/day)
Thanks: 1,755
Thanked 2,601 Times in 1,963 Posts
|
Actually, i just did a test using points 1-3 above. Times for each method:
1) 0.0010000 seconds 2) 9.4845425 seconds 3) 8.9955145 seconds Result = Match Second run just to make sure it is no flook: 1) 0.0000000 seconds 2) 10.6946117 seconds 3) 8.1694673 seconds Result = Match Both files are 594,778,936 bytes (567 MiB). Original filename: C:\Program Files (x86)\Mass Effect 3\BIOGame\DLC\DLC_HEN_PR\CookedPCConsole\Default.s far I'm going to change 1 byte and see if that impacts performance. I will have to test 2 and 3 separately because 2 won't fall to 3 if the hash isn't a match. In order to give 2 the best possible chance, the byte I modify will be at the very end so 3 must process the entire file to find it (when in actuallity, it could stop on the first byte)... File: b Offset: 594778935 Old Value: 0x7B New Value: 0x00 1 & 2 test: 1) 0.0000000 seconds 2) 26.9975442 seconds Result = FailedFileHash 1 & 3 test: 1) 0.0010000 seconds 3) 12.2507007 seconds Result = FailedFile That's no joke, it's 220% faster and 100% accurate every time! To prove that method 3 can return no match in an instant, I'm undoing the change above and adding the following: File: b Offset: 1048576 (1 MiB in) Old Value: 0xB0 New Value: 0x00 1 & 2 test: 1) 0.0010001 seconds 2) 24.3733941 seconds Result = FailedFileHash 1 & 3 test: 1) 0.0010001 seconds 3) 0.0610035 seconds Result = FailedFile Hash. Got. Owned. 1:1 compare with break was 39,954% faster! Code:
using System;
using System.IO;
using System.Security.Cryptography;
namespace ConsoleApplication1
{
public enum CompareResult : sbyte
{
Match = 0,
FailedFileSize = -1,
FailedFileHash = -2,
FailedFile = -4
}
class Program
{
static void Main(string[] args)
{
string a = @"C:\Users\Admin\Desktop\a";
string b = @"C:\Users\Admin\Desktop\b";
DateTime start = DateTime.Now;
Console.WriteLine(" Result: " + DoCompare(a, b).ToString());
Console.WriteLine(" Total time: " + (DateTime.Now - start).ToString());
Console.ReadKey();
}
public static CompareResult DoCompare(string a, string b)
{
DateTime start = DateTime.Now;
if (CompareFileSize(a, b))
{
Console.WriteLine("FileSize time: " + (DateTime.Now - start).ToString());
start = DateTime.Now;
if (CompareFileHash(a, b))
{
Console.WriteLine("FileHash time: " + (DateTime.Now - start).ToString());
start = DateTime.Now;
if (CompareFile(a, b))
{
Console.WriteLine(" File time: " + (DateTime.Now - start).ToString());
return CompareResult.Match;
}
else
{
Console.WriteLine(" File time: " + (DateTime.Now - start).ToString());
return CompareResult.FailedFile;
}
}
else
{
Console.WriteLine("FileHash time: " + (DateTime.Now - start).ToString());
return CompareResult.FailedFileHash;
}
}
else
{
Console.WriteLine("FileSize time: " + (DateTime.Now - start).ToString());
return CompareResult.FailedFileSize;
}
}
public static bool CompareFileSize(string a, string b)
{
if (new FileInfo(a).Length == new FileInfo(b).Length)
return true;
else
return false;
}
public static bool CompareFileHash(string a, string b)
{
MD5 md5 = new MD5CryptoServiceProvider();
FileStream fs = new FileStream(a, FileMode.Open, FileAccess.Read);
byte[] ah = md5.ComputeHash(fs);
fs.Close();
fs = new FileStream(b, FileMode.Open, FileAccess.Read);
byte[] bh = md5.ComputeHash(fs);
fs.Close();
for (byte i = 0; i < ah.Length; i++)
{
if (ah[i] != bh[i])
return false;
}
return true;
}
public static bool CompareFile(string a, string b)
{
const int BUFFER_SIZE = 4096;
byte[] buffera = new byte[BUFFER_SIZE];
byte[] bufferb = new byte[BUFFER_SIZE];
FileStream fsa = new FileStream(a, FileMode.Open, FileAccess.Read);
FileStream fsb = new FileStream(b, FileMode.Open, FileAccess.Read);
long reada, readb;
bool match = true;
do
{
reada = fsa.Read(buffera, 0, BUFFER_SIZE);
readb = fsb.Read(bufferb, 0, BUFFER_SIZE);
for (short i = 0; i < BUFFER_SIZE; i++)
{
if (buffera[i] != bufferb[i])
{
match = false;
break;
}
}
if (!match)
break;
} while ((reada != 0) && (readb != 0));
fsa.Close();
fsb.Close();
return match;
}
}
}
#2 above should be replaced with a custom function that spot-checks the file (e.g. every 1024 bytes) instead of using hashes. If the spot-check doesn't find a difference, move on to the 1:1.
__________________
Golden Rule of Programming: Never assume. try { SteamDownload(); } catch (Steamception ex) { RageQuit(); } Last edited by FordGT90Concept; Mar 21, 2012 at 02:32 PM. |
|
|
|
| The Following User Says Thank You to FordGT90Concept For This Useful Post: |
|
|
#21 |
|
Hardcore Monkey Moderator
Join Date: Feb 2007
Location: Cheeseland (Wisconsin, USA)
Posts: 12,133 (5.27/day)
Thanks: 591
Thanked 5,494 Times in 2,938 Posts
|
Makes sense to me. Sequential reading of a pair of arrays is fast (and of course, doesn't have to do an encryption algorithm even if the whole file is read).
In the last scenario, it probably would be even faster as the odds of two differing files being identical for the 1MiB is unlikely also and it would find the difference sooner. Now get in there and optimize it more to make it even faster. ![]() In a worse case scenario (let's say 5 files with a differing last byte) you would have to do full file reads 10 times for the array method, but only 5 hash runs and then do a hash List compare. Seems in this case they would be a lot closer.
__________________
Cloud (noun, singular): A dynamic arrangement of multiple potential single points of failure, with a user at one end and their data at the other. Get more tech news on a wide variety of topics at NextPowerUp
Last edited by Kreij; Mar 21, 2012 at 02:51 PM. |
|
|
|
|
|
#22 |
|
"I go fast!1!11!1!"
Join Date: Oct 2008
Location: IA, USA
Posts: 10,583 (6.28/day)
Thanks: 1,755
Thanked 2,601 Times in 1,963 Posts
|
Yeah, 90% of files would be different on offset 0, 9% at offset 3-31, and 1% in the rest of the file.
I dunno what I could do to make it faster except replace my 2x250 GB HDDs for an SSD. <10 seconds for a >500 MiB file is pretty fast.True, which is why the spot-check function would be important. It could store the spot-check for every file making it much faster when comparing many files to many files.
__________________
Golden Rule of Programming: Never assume. try { SteamDownload(); } catch (Steamception ex) { RageQuit(); } |
|
|
|
|
|
#23 | |
|
Hardcore Monkey Moderator
Join Date: Feb 2007
Location: Cheeseland (Wisconsin, USA)
Posts: 12,133 (5.27/day)
Thanks: 591
Thanked 5,494 Times in 2,938 Posts
|
Quote:
(since disk access if the bottleneck that would not really improve anything unless the files were on seperate disks/controllers.) BTW ... I still use Nettool all of the time. What a spectacular little utility. ![]() I'm using it today to help troubleshoot a workstation connections problem (logged into work network from home through TeamViewer). It's my goto utility for tier 1 network troubles. Tier 2 is a big hammer.
__________________
Cloud (noun, singular): A dynamic arrangement of multiple potential single points of failure, with a user at one end and their data at the other. Get more tech news on a wide variety of topics at NextPowerUp
|
|
|
|
|
|
|
#24 |
|
"I go fast!1!11!1!"
Join Date: Oct 2008
Location: IA, USA
Posts: 10,583 (6.28/day)
Thanks: 1,755
Thanked 2,601 Times in 1,963 Posts
|
Multithreading only benefits memory/CPU tasks. Like I said before, the bottleneck is the HDD/SDD/RAID. Multithreading wouldn't improve performance.
Heh, I'm glad someone is using it. XD
__________________
Golden Rule of Programming: Never assume. try { SteamDownload(); } catch (Steamception ex) { RageQuit(); } |
|
|
|
|
|
#25 |
|
Banstick Dummy
Join Date: Jun 2007
Location: Crystal River, FL
Posts: 15,112 (6.92/day)
Thanks: 1,337
Thanked 6,834 Times in 3,741 Posts
|
I tried to teach myself but I had no basic understanding of any of the concepts. It was like trying to paint without knowing what a paint brush was.
|
|
|
|
![]() |
| Currently Active Users Viewing This Thread: 1 (0 members and 1 guests) | |
| Thread Tools | |
|
|
Similar Threads
|
||||
| Thread | Thread Starter | Forum | Replies | Last Post |
| Driver Finder | Solaris17 | Programming & Webmastering | 168 | Dec 31, 2011 02:38 AM |
| 2 identical LCDs, problems | Fitseries3 | General Hardware | 24 | Sep 29, 2010 03:46 PM |
| [WTB] 2x Identical Sata HDDs | Tau | Buy/Sell/Trade/Giveaway Forum | 0 | May 6, 2009 11:06 AM |
| Core i7 Processors Now Listed in Intel sSpec Finder | AuDioFreaK39 | General Hardware | 0 | Nov 16, 2008 08:08 AM |
| Ati tools artifact finder.... | bios | ATITool | 2 | Nov 9, 2004 06:16 AM |