Experience – Tech & Software Development

Smart Hasher is an open-source command-line tool for calculating file hashes, with many convenient features.

It is available on GitHub: https://github.com/sergtk/smart_hasher

I have just finished implementing many features that I wanted and could not find in other tools.

The story of this project

Before starting this project, I needed to calculate hashes for many large files to be confident that they were not corrupted. These files were stored in the cloud.

So I tried some native Windows tools, found some third-party ones, and started using them. Many of them support only MD5, which is now considered obsolete.
I preferred SHA-1, so Smart Hasher supports it. I also added support for some other popular hash algorithms. The full list is: md5, sha1, sha224, sha256, sha384, sha512. All of them are supported out of the box by Python's hashlib library.
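As an illustration of how hashlib covers these algorithms, here is a minimal sketch that hashes a file in chunks, so a multi-gigabyte file never has to fit in memory. The function name `file_hash` and the chunk size are my assumptions for this example, not Smart Hasher's actual code.

```python
import hashlib

# Algorithm names from the list above; hashlib.new() accepts all of them.
ALGORITHMS = ("md5", "sha1", "sha224", "sha256", "sha384", "sha512")


def file_hash(path, algorithm="sha1", chunk_size=1024 * 1024):
    """Return the hex digest of a file, reading it in fixed-size chunks.

    Hypothetical helper for illustration only.
    """
    if algorithm not in ALGORITHMS:
        raise ValueError(f"unsupported algorithm: {algorithm}")
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns b"" at end of file.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()
```

Calling `file_hash("archive.zip", "sha256")` would return the digest as a hexadecimal string; the file name here is just a placeholder.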

Here is what happened when I tried to calculate hashes for files in the cloud. Some of the 4 GB files took about 10-20 minutes, while others took much longer. I could not figure out what was going on. Over a whole night, hashes were calculated for only a few files. Something similar happened over the next several days.
Sometimes the tools calculated a hash quickly, but not always. The speed was very unstable.
Moreover, other sites, such as YouTube, were working fast. When I took my notebook to another network, hash calculation worked fine. Still, it was difficult to understand the situation.

So I decided to write my own tool to calculate the hashes, one that would let me see progress easily and conveniently.
I didn’t find any hash calculation tool with such features. I could have used other tools to diagnose the network speed, but it was not clear which would be faster: using other tools, or writing my own, which I could change as I wanted without being restricted by a fixed feature set.

I implemented the tool with a feature that shows the speed for the whole current file and for the last several seconds.
With this data, I started emailing cloud support and calling my ISP.
Eventually I reached a sysadmin at my ISP. With speed numbers in hand, it was quite easy to get past the call center “guards”. He told me that there was a strange bug in their software that occurs with large amounts of data. All he needed to do was close my session, and everything worked fast again on my side.
To close the session without contacting support, he suggested shutting down the Wi-Fi router for half an hour.
Strange story.
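
The speed display mentioned above (for the whole file and for the last several seconds) can be sketched roughly as follows. The class `SpeedMeter`, its parameters, and the sampling scheme are assumptions made for this example; the real tool may do it differently.

```python
import time
from collections import deque


class SpeedMeter:
    """Tracks bytes per second for the whole file and for a recent window.

    Hypothetical class for illustration only.
    """

    def __init__(self, window_seconds=5.0, clock=time.monotonic):
        self.window_seconds = window_seconds
        self.clock = clock
        self.start = clock()
        self.total_bytes = 0
        # Each sample is (timestamp, total bytes read at that moment).
        self.samples = deque([(self.start, 0)])

    def add(self, num_bytes):
        """Record that num_bytes more bytes have been read."""
        now = self.clock()
        self.total_bytes += num_bytes
        self.samples.append((now, self.total_bytes))
        # Keep as baseline the newest sample that is at least one window old.
        while len(self.samples) > 1 and now - self.samples[1][0] >= self.window_seconds:
            self.samples.popleft()

    def overall_speed(self):
        elapsed = self.clock() - self.start
        return self.total_bytes / elapsed if elapsed > 0 else 0.0

    def recent_speed(self):
        base_time, base_bytes = self.samples[0]
        elapsed = self.clock() - base_time
        return (self.total_bytes - base_bytes) / elapsed if elapsed > 0 else 0.0
```

The idea is to call `add(len(chunk))` after every chunk read and print both speeds. A large gap between the overall and the recent value is exactly the kind of number that is useful to show to support.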

Another convenient feature for large amounts of data is the ability to resume calculation after an interruption.
This is implemented simply by skipping the files whose hash has already been calculated.
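
A minimal sketch of this skipping logic, assuming that each input file gets a hash file next to it with a suffix such as `.sha1`. Both the suffix and the function name `files_needing_hash` are my assumptions for illustration; see USAGE.md for how the tool really names its files.

```python
import os


def files_needing_hash(paths, hash_suffix=".sha1"):
    """Return the input files that do not yet have a hash file beside them.

    Hypothetical helper; the naming scheme is an assumption.
    """
    return [p for p in paths if not os.path.exists(p + hash_suffix)]
```

After an interruption, running the same command again would then only process the files returned by this function.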

Another issue with existing tools is that a network connection is sometimes interrupted for a short period of time. In that case it is good to retry reading data from the file after a short pause. Smart Hasher supports this.
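
The retry idea can be sketched like this. The function `read_with_retry`, the number of retries, and the pause length are assumptions chosen for the example, not the tool's real defaults.

```python
import time


def read_with_retry(f, size, retries=3, pause_seconds=5.0, sleep=time.sleep):
    """Read up to `size` bytes from file object `f`, retrying on OSError.

    Hypothetical helper for illustration only.
    """
    position = f.tell()  # remember where this read starts
    for attempt in range(retries + 1):
        try:
            return f.read(size)
        except OSError:
            if attempt == retries:
                raise  # give up after the last allowed retry
            sleep(pause_seconds)
            f.seek(position)  # go back to where the failed read started
```

Seeking back before each retry matters for hashing: a partially completed read must not cause bytes to be skipped or hashed twice.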

By default, one hash file is created for every input file, as many other tools do. But this is not always convenient.
For example, if you have a lot of user files, you don't want to clutter the directory with a lot of hash files. So I implemented a feature to store all hashes in a single file. After some time, say a year, the hashes can be recalculated. The new and old hash files can then be compared to find differences and to see how the user files have changed.
To simplify finding differences, file names are sorted in the hash file.
To find renames, there is an option to sort the hash file by hash value, because a file's hash does not change if the file is just renamed or moved.
These features are very convenient for checking the integrity of valuable data, e.g. photo archives.
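
To show why the two sort orders help, here is a small sketch that works on lists of (file name, hash) pairs. The data layout and the function names are assumptions for illustration; the real hash file format is described in the project documentation. For simplicity, the sketch assumes each hash appears only once in the old list.

```python
def sort_by_name(entries):
    """Sort (name, hash) pairs by file name, which makes diffs easy to read."""
    return sorted(entries, key=lambda e: e[0])


def sort_by_hash(entries):
    """Sort (name, hash) pairs by hash, so renamed files keep their position."""
    return sorted(entries, key=lambda e: (e[1], e[0]))


def find_renames(old_entries, new_entries):
    """Return (old_name, new_name) pairs whose hash is equal but name differs."""
    old_by_hash = {h: name for name, h in old_entries}
    new_names = {name for name, _ in new_entries}
    renames = []
    for name, h in new_entries:
        old_name = old_by_hash.get(h)
        # A rename: same content, different name, and the old name is gone.
        if old_name is not None and old_name != name and old_name not in new_names:
            renames.append((old_name, name))
    return sorted(renames)
```

With both lists sorted by hash, a renamed file stays on the same line in the old and new lists, so an ordinary text diff shows only the changed name.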

To make the hash file easy to parse programmatically, I also implemented saving the data as JSON with Python's json library. But actually, this was more for practice with Python; I haven't used this feature yet.
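
For example, saving hashes as JSON could look roughly like this. The layout (a flat object mapping file names to hashes) and the function name are assumptions for this sketch; the actual JSON structure produced by Smart Hasher may differ.

```python
import json


def hashes_to_json(hashes):
    """Serialize a {file name: hash} dict to JSON text with sorted keys.

    Hypothetical helper; the layout is an assumption.
    """
    return json.dumps(hashes, indent=2, sort_keys=True)
```

Any program can then read the result back with `json.loads`, without writing a custom parser for the hash file.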

There are other features that I haven't described here. You can find a description of them in the file USAGE.md.
