Continuing slowly towards some file management tooling

Writing in C#, I’m pushing the pieces together to build a cryptographic-hash-based duplicate file locator. I’m expecting to back this with MongoDB to avoid re-scanning everything each time the tool is run and to persist information about offline file systems.

WPF Frame

I’m currently framing this as a WPF application and expect that the initial version will simply present duplicates in a list with a keep/delete option for each file. I expect to move removed files to a ‘trash’ folder under unique names in order to avoid symlink- or hard-link-related issues. I can add code to properly identify and navigate such things, but for the moment I’m keeping things simple. Getting to the relevant Win32 APIs from C# is often a bit painful, so in the long run I’m thinking I’ll use C++/CLI to hop over.
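The ‘trash’ move can be sketched in a few lines. This is a minimal illustration, not the actual implementation — the `TrashMover` name, the GUID-suffix naming scheme, and the trash directory are all assumptions:

```csharp
using System;
using System.IO;

static class TrashMover
{
    // Move a file into a 'trash' folder under a unique name so that two
    // removed files that happen to share a name never collide. Appending
    // a GUID is one simple way to guarantee uniqueness.
    public static string MoveToTrash(string filePath, string trashDir)
    {
        Directory.CreateDirectory(trashDir);
        string uniqueName = $"{Path.GetFileName(filePath)}.{Guid.NewGuid():N}";
        string destination = Path.Combine(trashDir, uniqueName);
        File.Move(filePath, destination);
        return destination;
    }
}
```

Keeping the original file name as a prefix preserves enough context to restore a file by hand if the user changes their mind.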

Overall Structure

Still a work in progress.

I’m building components that own worker threads, start processing on construction, and provide status and payload through accessors. Change-of-state notifications are raised as events that carry a reference to the originating object; any additional information needed is expected to be extracted by the recipient of the notification.
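In skeleton form, the pattern looks something like the following. This is a minimal sketch under my own assumptions — the class and member names are illustrative, and a real component would expose richer status and payload accessors:

```csharp
using System;
using System.Threading;

sealed class WorkerComponent
{
    readonly Thread _worker;
    volatile bool _completed;

    // Change-of-state notification; 'sender' is the component itself,
    // so recipients can pull whatever extra data they need from it.
    public event EventHandler? StateChanged;

    public WorkerComponent(Action work)
    {
        _worker = new Thread(() =>
        {
            work();
            _completed = true;
            StateChanged?.Invoke(this, EventArgs.Empty);
        })
        { IsBackground = true };
        _worker.Start();   // processing begins on construction
    }

    // Status exposed through an accessor rather than pushed in the event.
    public bool Completed => _completed;

    public void Join() => _worker.Join();
}
```

Passing the work in as a delegate avoids the classic pitfall of starting a thread from a base-class constructor before a derived class has finished initializing.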

I’ve currently built one object that enumerates files in a directory tree and another that enumerates the storage volumes available on the system. The storage volumes object uses P/Invoke to acquire the volume serial number from each volume, as this is the best ‘unique identifier’ available at the volume level. At some point I’ll likely add code to expose the unique file identifier (the Windows equivalent of an inode number) that can be extracted for individual files.
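For reference, the serial-number lookup boils down to a single Win32 call, `GetVolumeInformation`. The declaration below follows the standard published signature; the wrapper class and method names around it are my own illustrative choices, not the post’s actual code:

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

static class VolumeInfo
{
    // Standard P/Invoke declaration for GetVolumeInformationW, which
    // returns (among other things) the volume serial number.
    [DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
    static extern bool GetVolumeInformation(
        string rootPathName,
        StringBuilder volumeNameBuffer,
        int volumeNameSize,
        out uint volumeSerialNumber,
        out uint maximumComponentLength,
        out uint fileSystemFlags,
        StringBuilder fileSystemNameBuffer,
        int fileSystemNameSize);

    public static uint? GetSerialNumber(string rootPath)   // e.g. @"C:\"
    {
        var volumeName = new StringBuilder(261);
        var fileSystemName = new StringBuilder(261);
        if (GetVolumeInformation(rootPath, volumeName, volumeName.Capacity,
                out uint serial, out _, out _,
                fileSystemName, fileSystemName.Capacity))
            return serial;
        return null;   // call failed; details via Marshal.GetLastWin32Error()
    }
}
```

The per-file identifier mentioned above comes from a sibling API, `GetFileInformationByHandle`, whose `BY_HANDLE_FILE_INFORMATION` structure carries the file index pair.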

Next Steps

Next I need to create an object that receives the lists of files produced by the file finder, retrieves file information, and generates hashes. I’ll probably build it to simply create the hashes first.
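The hashing itself is straightforward. A minimal sketch, assuming SHA-256 as the digest (the post doesn’t specify which cryptographic hash is used):

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

static class FileHasher
{
    // Stream the file through the hash rather than reading it fully into
    // memory; files with identical digests are duplicate candidates.
    public static string HashFile(string path)
    {
        using var sha = SHA256.Create();
        using var stream = File.OpenRead(path);
        byte[] digest = sha.ComputeHash(stream);
        return Convert.ToHexString(digest);   // .NET 5+; use BitConverter on older runtimes
    }
}
```

Grouping files by size first and only hashing groups with more than one member would avoid hashing files that cannot possibly have a duplicate.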

Once I have the hashing and information retrieval running, I’ll add in MongoDB support.

The first step will be to populate MongoDB with the relevant information. The second step will be to check for the file path/name in MongoDB first and only hash the file if its last-modified time and size don’t match those stored. I expect this should dramatically speed up processing once a given area has been scanned.
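The check-before-hash logic could look something like this with the MongoDB C# driver. The document shape, field names, and helper names are all assumptions for illustration:

```csharp
using System;
using System.IO;
using MongoDB.Driver;

// Assumed document shape; the full path doubles as the _id.
class FileRecord
{
    public string Id { get; set; } = "";
    public long Size { get; set; }
    public DateTime LastWriteUtc { get; set; }
    public string Hash { get; set; } = "";
}

static class HashCache
{
    // Reuse the stored digest when size and last-modified time match;
    // otherwise re-hash and upsert the refreshed record.
    public static string GetHash(IMongoCollection<FileRecord> files,
                                 string path,
                                 Func<string, string> hashFile)
    {
        var info = new FileInfo(path);
        var existing = files.Find(r => r.Id == path).FirstOrDefault();
        if (existing != null &&
            existing.Size == info.Length &&
            existing.LastWriteUtc == info.LastWriteTimeUtc)
            return existing.Hash;

        var record = new FileRecord
        {
            Id = path,
            Size = info.Length,
            LastWriteUtc = info.LastWriteTimeUtc,
            Hash = hashFile(path)
        };
        files.ReplaceOne(r => r.Id == path, record,
                         new ReplaceOptions { IsUpsert = true });
        return record.Hash;
    }
}
```

One caveat worth noting: MongoDB stores dates at millisecond precision, so the timestamp comparison should be truncated accordingly before persisting, or nearly every file will look modified on the second pass.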

I’ll also want to persist volume information to MongoDB to support offline volumes and detect changes in the target system. Initially I think this will be write-mostly. In the longer term I’d like to detect changes (volumes with the same VSN but different total capacity or volume label) and at least present that information to the user.
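The change-detection rule described above — same VSN, different capacity or label — reduces to a simple record comparison. A sketch with illustrative names:

```csharp
using System;
using System.Collections.Generic;

// Assumed shape of the persisted per-volume record.
class VolumeRecord
{
    public uint SerialNumber { get; set; }
    public string Label { get; set; } = "";
    public long TotalCapacity { get; set; }
}

static class VolumeAudit
{
    // Compare a freshly scanned volume against the stored record that
    // shares its serial number and report any drift for the user.
    public static IEnumerable<string> Changes(VolumeRecord stored,
                                              VolumeRecord observed)
    {
        if (stored.SerialNumber != observed.SerialNumber)
            throw new ArgumentException("Records must share a serial number.");
        if (stored.Label != observed.Label)
            yield return $"Label changed: '{stored.Label}' -> '{observed.Label}'";
        if (stored.TotalCapacity != observed.TotalCapacity)
            yield return $"Capacity changed: {stored.TotalCapacity} -> {observed.TotalCapacity}";
    }
}
```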

