Blockchain: Data Integrity with Decentralized Cryptography

Expanding ML data: maintaining security & privacy

More data is generated every day, and the companies gathering your data aren’t exactly slowing down. It’s so damn valuable, you see.

Data is “the new oil” of the Information Age, and companies are willing to pay for it: corporations spent about $19 billion in 2018 purchasing and harvesting personal data of potential users. They wouldn’t be doing this (for more than a year, anyway) if it didn’t make them money.

I’ve written about how I figure automation will usurp plenty of jobs in the future, but there’s still plenty of time on the road ahead to experience a mildly dystopian future where individual data is over-collected, stolen, corrupted and generally used against us.

Because we seem to have quickly charged through the Information Age to the Disinformation Age. Anyone can write news, anyone can spread ‘alleged’ rumors. How do you know what’s true?

You trust the talking heads on the telescreen to give you the right take. Trust the people up top to tell you what really happened.
That works fine, right up until it doesn’t.

Decentralized ‘Truth’

What really happens in crime and politics (I could have just written ‘politics’) is a complex topic. But for more statistically workable data (crunchable numbers and labels and dates), a solution may appear from a surprising direction.

Enter Blockchain, a framework for distributed-trust public ledger transactions with cryptographically protected currency. There are thousands of articles on why Bitcoin threw the financial world for a loop: it breaks all the long-revered Wall Street Commandments that have run the world for over a century.

That’s why I like it. If you engrave every transaction a quarter has ever been used in on the coin itself, you can’t pretend to sell the same quarter to two different parties.

This is sort of how blockchain works, but a bunch of different, unconnected people do the engraving by using powerful computers to solve computationally expensive math puzzles, and they get some coins for their trouble. Strange times require strange solutions.
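Here is a minimal sketch of that "engraving" step, assuming a toy proof-of-work scheme: keep trying nonces until the block's SHA-256 hash starts with a few zeros. The block layout, the `mine` helper and the difficulty of 3 are all made up for illustration; real networks use far harder targets and richer block formats.

```python
import hashlib
import json


def block_hash(block: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the block."""
    encoded = json.dumps(block, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def mine(prev_hash: str, transactions: list, difficulty: int = 3) -> dict:
    """Try nonces until the block's hash starts with `difficulty` zeros."""
    block = {"prev_hash": prev_hash, "transactions": transactions, "nonce": 0}
    target = "0" * difficulty
    while not block_hash(block).startswith(target):
        block["nonce"] += 1
    return block


# Each block records the hash of the one before it, like engravings on the coin.
genesis = mine("0" * 64, ["alice pays bob 1 quarter"])
second = mine(block_hash(genesis), ["bob pays carol 1 quarter"])
```

Finding a valid nonce takes thousands of guesses even at this toy difficulty, but checking one takes a single hash. That asymmetry is what the miners get paid for.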

Instead of trusting a centralized authority not to swindle everyone it can, blockchain-based finance distributes the authority and distributes the incentive. Anyone who wants to falsify the public ledger would have to control more than half of the network’s computing power, and there’s an awful lot of it now, not to mention further security development with newer cryptocurrencies (Chainlink, etc.).
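To see why falsifying the ledger is so hard, here is a rough sketch of a hash-linked chain where one old entry gets rewritten. The `build_chain` and `first_broken_link` helpers are hypothetical names for this example, and there is no mining or networking here, just the linking.

```python
import hashlib
import json


def digest(record: dict) -> str:
    """SHA-256 over a canonical JSON encoding of one block."""
    return hashlib.sha256(json.dumps(record, sort_keys=True).encode("utf-8")).hexdigest()


def build_chain(entries: list) -> list:
    """Link each entry to the hash of the block before it."""
    chain, prev = [], "0" * 64
    for entry in entries:
        block = {"prev_hash": prev, "entry": entry}
        chain.append(block)
        prev = digest(block)
    return chain


def first_broken_link(chain: list) -> int:
    """Return the index of the first block whose prev_hash does not match, or -1."""
    prev = "0" * 64
    for i, block in enumerate(chain):
        if block["prev_hash"] != prev:
            return i
        prev = digest(block)
    return -1


ledger = build_chain(["A pays B 5", "B pays C 2", "C pays A 1"])
intact = first_broken_link(ledger)     # -1: every link checks out

ledger[0]["entry"] = "A pays B 500"    # rewrite history in the first block
broken_at = first_broken_link(ledger)  # 1: the next block no longer links up
```

To hide the edit, a forger would have to redo every later block as well, and on a proof-of-work chain do it faster than everyone else combined. Hence the "more than half" requirement.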

The one thing you can always trust is that the average individual will take a 100% chance to get $5 instead of a 0.0000002% chance to get $1,000. This is why mining verification works: you can safely rely on the human impulse to get something instead of nothing for the same amount of effort.
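The arithmetic behind that choice is a plain expected-value comparison, using the illustrative figures above (they are not real mining payouts):

```python
# Expected payout = probability of winning * prize.
sure_thing = 1.0 * 5.00                  # 100% chance of $5
long_shot = (0.0000002 / 100) * 1000.00  # 0.0000002% chance of $1,000

ratio = sure_thing / long_shot           # how many times better the sure thing is
```

The long shot is worth about $0.000002 on average, so the sure $5 wins by a factor of roughly 2.5 million.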

So if we can trust blockchain, what can we trust it to do for us?

Data Security through Distributed Cryptography

About 42 million patient health records were breached in 2019. This includes ransomware attacks, where hackers gain access to hospital databases and threaten to steal, encrypt or delete vital information. This is potentially disastrous to patient health (allergies, medication and surgery histories) as well as hospital operations.

Computerized Tomography scans provide detailed looks into cross-sections of the body

A paper came out earlier this month (Feldman et al.) in the Journal of Oral Surgery, Medicine, Pathology, and Radiology which proposes a blockchain-based solution for patient health data security. They split ~92,000 CT scans into two equally sized folders: one using a normal storage procedure, the other by converting the data to a blockchain format.

The conversion mechanism is quite fascinating: each CT scan becomes a cryptographic hash after syncing the original data with a DDSBlockchain folder and connecting the whole thing to a Hyperledger-based semi-private ledger.
As an added bonus, the hashed data takes up far less space. The resulting blockchain data was 1.22MB, while the original-format scans were 5.36GB. A cryptographic hash is one-way, though, so I presume you’d still need access to the original data behind the chain to get back to a medically useful format.
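The size difference makes sense once you remember that a cryptographic hash has a fixed length no matter how big the input is. A small sketch, assuming SHA-256 (the article doesn't say which hash function the paper's pipeline uses) and using zero-filled byte strings as stand-ins for scan files:

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of a file's raw bytes."""
    return hashlib.sha256(data).hexdigest()


small_scan = b"\x00" * 1_000       # stand-in for a 1 KB file
large_scan = b"\x00" * 5_000_000   # stand-in for a 5 MB file

small_digest = fingerprint(small_scan)
large_digest = fingerprint(large_scan)

# Changing a single byte produces a different digest.
tampered = bytearray(large_scan)
tampered[0] = 1
tampered_digest = fingerprint(bytes(tampered))
```

Both digests are 64 hex characters long, and the tampered copy no longer matches. Re-hashing a stored scan and comparing it with the ledger entry is how you would check that the scan hasn't been altered.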

This is essentially a strange and promising mix between decentralized, secure cloud backups & compression algorithms. They did some fancy F-testing to ensure the data integrity and upload speed of both methods matched, and found that the blockchain transition didn’t breach HIPAA regulations.

So if a hospital encrypts & stores its patient records on a ledger in this manner, in the case of a ransomware attack, it’s no longer at risk of permanently losing valuable data. The hackers only manage to steal a bunch of hashed jargon, which they aren’t able to delete (distributed storage) or publicly expose (it’s a hash!).

Privately Analyzing Private Data

Staying with healthcare, data science has the potential to revolutionize medical procedures and improve lives.

But the data needed to train next-generation models is private, and to a certain degree it has to be. Hospitals can’t go around handing patient data to everyone who emails them saying “I’ve installed scikit-learn, please give me your private records”.

So how do you make more data available for machine learning while staying HIPAA-compliant? There are data marketplaces, but you have to trust a middleman organization. The answer is to feed the data to your model without looking at the data, so to speak. Federated learning achieves this in both centralized (bad!) and decentralized (better!) ways.
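As a rough illustration of the idea, here is a toy federated-averaging loop for a one-parameter model (y = w * x). It shows the centralized flavor, where a coordinator averages the updates. The three "hospital" datasets and every function name are invented for this sketch.

```python
def local_update(w: float, data: list, lr: float = 0.01, steps: int = 20) -> float:
    """Run a few gradient-descent steps on one site's private (x, y) pairs."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_round(w_global: float, sites: list) -> float:
    """Each site trains locally; only the updated weight leaves the site."""
    updates = [(local_update(w_global, data), len(data)) for data in sites]
    total = sum(n for _, n in updates)
    # Average the local weights, weighted by how many records each site holds.
    return sum(w * n for w, n in updates) / total


# Hypothetical private datasets, all roughly following y = 3x.
hospital_a = [(1.0, 3.1), (2.0, 5.9), (3.0, 9.2)]
hospital_b = [(1.5, 4.4), (2.5, 7.6)]
hospital_c = [(0.5, 1.6), (4.0, 11.9), (5.0, 15.1), (1.0, 2.9)]

w = 0.0
for _ in range(10):
    w = federated_round(w, [hospital_a, hospital_b, hospital_c])
```

The coordinator ends up with a weight close to 3 without ever seeing a single patient record; only model parameters travel. A decentralized variant would swap the coordinator for peer-to-peer exchange of those parameters.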

However, I’ve got my eye on Ocean Protocol’s “Compute-to-Data” framework, which uses blockchain as the base of a decentralized, ‘blind’ connector marketplace.
This pipeline market ledger can funnel private data to your ML model without anyone laying eyes on sensitive information. Your neural network sees the numbers, but it doesn’t have eyes, so no worries.

CtD can also make use of “Differential Privacy”, a slick way of anonymizing individuals in datasets, which takes the burden of training/computation off data owners and onto model builders (or their preferred cloud computing solution).
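For a feel of how differential privacy anonymizes, here is a minimal sketch of the classic Laplace mechanism applied to a counting query. This is a textbook technique, not Ocean's actual implementation; the patient records, the `private_count` helper and the epsilon value are assumptions made for the example.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """The difference of two exponentials with mean `scale` is Laplace(0, scale)."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(records: list, predicate, epsilon: float, rng: random.Random) -> float:
    """Noisy count; a counting query has sensitivity 1, so the noise scale is 1 / epsilon."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1 / epsilon, rng)


rng = random.Random(0)  # seeded only to make the sketch repeatable
patients = [{"age": a, "diabetic": a % 3 == 0} for a in range(20, 80)]

noisy = private_count(patients, lambda p: p["diabetic"], epsilon=0.5, rng=rng)
```

The answer stays close to the true count, but adding or removing any one patient shifts it by less than the noise, so nobody can tell from the result whether a particular person is in the dataset. A smaller epsilon means more noise and stronger privacy.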

Its founder, Trent McConaghy, describes the tech as turning data from a potential liability to a useful resource, and shifting companies from “Don’t be evil” to “Can’t be evil”.

Moving Beyond “Analog Trust”

Our business and financial systems have quickly evolved to the realm of abstraction. Money isn’t a gold coin in your hand, information isn’t a sentence on paper; it’s transistors in a computer chip, vibrations in radio waves, colorful pixels on a screen.

But our idea of trust is still the same one we’ve evolved with. We trust someone’s word, we form contracts with people or companies or governments. We have a notion of another party ‘telling the truth’ or intentionally deceiving us.

For a while now, we’ve been smashing these two systems together and ignoring or bearing the resulting friction. But since we’re not going to move our daily operations out of the tech-cloud and back to earth, our concept of trust needs to ascend to abstraction to join the rest of our information.

Blockchain, as of now, is the most promising way forward to create a robust standard of computationally-verifiable “trust”. And the underlying assumption — put your faith in the common man’s base instinct to get something rather than nothing — is much more reassuring than telling yourself that some faceless corporation is really looking out for your well-being.

As the digital landscape grows and data blooms further, I would recommend investigating any sort of digital ledger as a solution for privacy and security.

Written by

data scientist, machine learning engineer. passionate about ecology, biotech and AI. https://www.linkedin.com/in/mark-s-cleverley/
