This is part 2 of the FAQ “What is a Digital Fingerprint?” If you haven’t read part 1 yet, head over and review that now. In part 1, we explained how “direct” digital fingerprinting works.
Derived Fingerprints
A more common approach to digital fingerprinting is to derive a fingerprint from data. The key to deriving a fingerprint is an operation called a cryptographic hash function (CHF). We’ll look at the technical details in a bit, but first, let’s look at what a CHF does.
A CHF is an algorithm that takes data of arbitrary size, performs some math on it, and outputs a fixed-length string.
Example
For example, the composer Johannes Brahms said of Mozart:
If we cannot write with the beauty of Mozart, let us at least try to write with his purity.
Running this through a CHF yields the following:
b2ab311f2590fb0c2bc535f5b98bc884
What if Brahms had been writing about Bach instead?
If we cannot write with the beauty of Bach, let us at least try to write with his purity.
The CHF gives us a completely different value:
32d7c3399faca812bf3cc7184ec01092
One more example. What if Brahms forgot to capitalize Bach? What would the CHF yield?
If we cannot write with the beauty of bach, let us at least try to write with his purity.
We get an entirely different value yet again!
7157afbf2565aec432199a6ed858df49
The cryptographic hash function reads in some text and outputs a seemingly “random” string of characters that is always the same length and, for all practical purposes, unique to the original text.
This is a digital fingerprint!
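The three quotes above can be fingerprinted with a few lines of code. This is a minimal sketch, not the method used to produce the values shown: the FAQ does not name its hash function, so the sketch assumes SHA-256 from Python's standard hashlib module, and `fingerprint` is a hypothetical helper name. SHA-256 digests are 64 hexadecimal characters long, so the printed values will not match the 32-character values above.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Return the SHA-256 digest of text as a hexadecimal string."""
    # Assumption: SHA-256 stands in for the unnamed hash function in the FAQ.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


quotes = [
    "If we cannot write with the beauty of Mozart, let us at least try to write with his purity.",
    "If we cannot write with the beauty of Bach, let us at least try to write with his purity.",
    "If we cannot write with the beauty of bach, let us at least try to write with his purity.",
]

for quote in quotes:
    # Each digest has the same length regardless of the input's length.
    print(fingerprint(quote))
```

Running the same quote through `fingerprint` twice returns the same string, while each of the three quotes produces a different one.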
Cryptographic Hashes
Those not familiar with data security may be surprised to learn that these kinds of hashes, or fingerprints, are used all the time in computing. In information security, cryptographic hashes are used for integrity checking and authentication, to prove that data has not been tampered with, and to determine the uniqueness of data. They are implemented pervasively in communication protocols and machine-to-machine data exchange. They also serve as a primitive in encryption schemes, where they are used to verify the integrity of decrypted data.
Let’s say I wanted to transfer a file to you that was absolutely critical, and we wanted to ensure it was not tampered with during transit. I calculate the hash of the file, then I send you the file and, by some other channel, the hash. Upon receiving the file, you calculate its hash yourself. Since the hash is, for practical purposes, a unique fingerprint of the file, our hashes will match if the file is exactly the same as when I sent it. If the hashes don’t match, we have a major red flag: somehow, somewhere, the file was modified.
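The file-transfer check described above might look like the following sketch. It assumes SHA-256 and uses two hypothetical helpers, `file_fingerprint` and `is_unmodified`. A throwaway temporary file stands in for the critical file, and appending bytes to it simulates tampering in transit.

```python
import hashlib
import os
import tempfile


def file_fingerprint(path: str, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unmodified(path: str, expected_hex: str) -> bool:
    """Compare the file's current hash with the hash received separately."""
    return file_fingerprint(path) == expected_hex.strip().lower()


# A throwaway file stands in for the critical file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"quarterly results: final")
    path = tmp.name

sent_hash = file_fingerprint(path)        # the sender computes this
intact = is_unmodified(path, sent_hash)   # the receiver's check on arrival

with open(path, "ab") as handle:          # simulate tampering in transit
    handle.write(b" (edited)")
tampered_ok = is_unmodified(path, sent_hash)

os.remove(path)
print(intact, tampered_ok)
```

The check is only as trustworthy as the channel that carried the hash: anyone able to alter both the file and the hash could make them match again.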
There are several characteristics of an effective cryptographic hash function:
- It is deterministic. For every message input, the same hash will always be produced.
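- It is non-reversible. It is computationally infeasible to reconstruct the original message from a hash.
- It is collision-resistant. Because the output has a fixed length, different messages can in theory share a hash value, but it should be computationally infeasible to find two that do.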
- Small input changes result in large output changes. As in our example above, even a single character change resulted in a completely different hash (the avalanche effect).
Hashing functions are said to be effective to the extent that they ensure these characteristics.
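Two of these characteristics, determinism and the avalanche effect, are easy to observe directly. The sketch below assumes SHA-256 and a hypothetical helper, `differing_bits`, that counts how many of the 256 output bits differ between the hashes of two inputs. For a well-behaved hash function, roughly half of the bits are expected to change even when the inputs differ by one character.

```python
import hashlib


def differing_bits(a: str, b: str) -> int:
    """Count the bit positions where the SHA-256 digests of a and b differ."""
    digest_a = hashlib.sha256(a.encode("utf-8")).digest()
    digest_b = hashlib.sha256(b.encode("utf-8")).digest()
    # XOR each pair of bytes and count the 1 bits in the result.
    return sum(bin(x ^ y).count("1") for x, y in zip(digest_a, digest_b))


upper = "If we cannot write with the beauty of Bach, let us at least try to write with his purity."
lower = "If we cannot write with the beauty of bach, let us at least try to write with his purity."

same = differing_bits(upper, upper)     # deterministic: identical input, identical hash
changed = differing_bits(upper, lower)  # one character differs in the input

print(f"identical input: {same} bits differ")
print(f"one character changed: {changed} of 256 bits differ")
```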
There are many different CHFs, each meeting the above conditions to a different degree:
- MD5
- SHA-1
- Bcrypt (a password-hashing function, rather than a general-purpose hash)
- SHA-2
- SHA-3
- BLAKE2
Vulnerabilities
All of the many hashing functions implement the above key characteristics in different ways and to different extents. MD5, for example, designed by Ron Rivest in 1991, was for many years the gold standard for hashing, until security experts were able to demonstrate that MD5 was not collision-resistant. In 2008, researchers were able to create a rogue CA certificate using vulnerabilities in MD5, and as additional weaknesses were found, MD5 became obsolete for critical use cases like PKI and digital signatures.
More advanced hashing algorithms, such as SHA-1, build on the basic mechanisms of MD5 yet produce a longer hash value (160 bits, as opposed to MD5’s 128 bits). While the computational requirements to break SHA-1 are significantly greater than for MD5, security researchers have steadily demonstrated weaknesses in it, and as of now (2019), SHA-1 is not recommended for production use. As of 2017, all major web browsers have ceased accepting SHA-1 SSL certificates.
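The output sizes mentioned above can be checked with Python's standard hashlib module. This is a small sketch with a hypothetical helper, `digest_bits`. The algorithm names are the ones hashlib uses, and the SHA-2, SHA-3, and BLAKE2 variants chosen here are only examples of those families. Some locked-down Python builds disable MD5, in which case `hashlib.new("md5")` raises an error.

```python
import hashlib


def digest_bits(name: str) -> int:
    """Return the output size, in bits, of the named hashlib algorithm."""
    return hashlib.new(name).digest_size * 8


for name in ("md5", "sha1", "sha256", "sha3_256", "blake2b"):
    print(f"{name}: {digest_bits(name)} bits")
```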
Cryptographers and security researchers continue to find advanced mathematical vulnerabilities in hashing algorithms, but many attacks ultimately come down to the computational power required to carry them out. For example, Bruce Schneier used Jesse Walker’s breakdown of the computational cost of a collision attack against SHA-1, and estimated that finding a collision would cost about $2.77M using commodity cloud compute servers.
Another vulnerability comes from how long hashes have been in use: people have already computed and indexed the hash values of every dictionary word, as well as of massive password and phrase dictionaries. You can use md5hashing.net to look up a hash, and it will return the original word or phrase if it knows about it.
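A precomputed lookup of this kind can be sketched in a few lines. The example assumes unsalted MD5 hashes and a tiny illustrative word list. The names `md5_hex` and `build_lookup` are hypothetical, and real lookup services index vastly larger dictionaries.

```python
import hashlib


def md5_hex(text: str) -> str:
    """Return the MD5 digest of text as a hexadecimal string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def build_lookup(words):
    """Precompute a hash -> word table for every word in the list."""
    return {md5_hex(word): word for word in words}


# A tiny stand-in for a real dictionary of words and common passwords.
wordlist = ["password", "letmein", "mozart", "bach"]
table = build_lookup(wordlist)

known = table.get(md5_hex("mozart"))                           # word is in the table
unknown = table.get(md5_hex("a long phrase nobody indexed"))   # not in the table

print(known, unknown)
```

Note that nothing here reverses the hash function. The lookup succeeds only because the input was guessed and hashed ahead of time, which is why short or common inputs are the ones at risk.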
Name That Tune
By 2007, Google had a major problem on its hands. YouTube, which Google acquired in 2006, became the target of mountains of lawsuits by companies including Viacom, Mediaset, Premier League, and others, claiming that YouTube had not adequately prevented the uploading (and thus, sharing) of copyrighted material. Viacom alone demanded over $1 billion in damages.
To curb this, YouTube introduced a system for automatically detecting uploaded copyrighted content. Using proprietary algorithms, YouTube catalogs copyrighted content and has amassed a huge database of content “IDs” that represent each copyrighted sound or video. It has, essentially, created digital fingerprints of copyrighted material, all stored in a massive system that enables content providers to register their copyrighted works. When a YouTube video makes use of registered material, the content provider can choose what happens: block the use entirely, inject ads into the video with the proceeds going back to the creator, or take other actions.
Google has spent over $100 million developing this system of digital fingerprints, and has provided billions of dollars back to copyright holders through its use.
The “Workhorse” of Cryptography
Bruce Schneier calls hash functions the “workhorses of modern cryptography,” but we could expand this greatly. Digital fingerprints, in all their forms, are the workhorses of identity, authentication, and trust across applications of all kinds, and continue to be a critical part of computing infrastructure.
Newly emerging use cases, such as the Internet of Things, will ensure that digital fingerprints continue to gain in use. As such, it’s more important than ever that they maintain their integrity and cannot be compromised. Cryptographers will continue to break, and improve, digital fingerprint technology and the mathematics behind it.