“The saddest aspect of life right now is that science gathers knowledge faster than society gathers wisdom.”
– Isaac Asimov, “Isaac Asimov’s Book of Science and Nature Questions”, 1988
The Storage Dilemma
The picture to the right shows a 5 MB hard drive being shipped out by IBM in 1956. Five megabytes. My first work desktop computer (an Ericsson PC) in 1985 had a 20 megabyte hard drive. After a few months I had filled it, and the Ericsson Information Systems technician could not believe his eyes: at the time, 20 megabytes was an enormous amount of storage.
Today the MacBook Pro I write this article on has 500 gigabytes of flash storage, which is 102,400 times larger than the 5 MB drive in the picture, and it has 22.4 GB of free space. This means my laptop holds 477.6 GB of data. If all this data were text, it would amount to about 31 million A4-size pages, or roughly 1,800 copies of the Encyclopaedia Britannica (which runs to about 17,000 pages). I could not possibly read that much text in my lifetime, and fortunately the data on my hard drive is a mixture of movies, pictures, databases, software programs and work data. 16 GB of the data are Microsoft Word documents, which comes to approximately 247,000 pages, a mighty amount of text to write or read.
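The back-of-the-envelope numbers above are easy to check in a few lines of Python. This is only a sketch: it assumes 1 GB = 1,024 MB and takes the article's own page estimates as given.

```python
# Back-of-the-envelope check of the storage figures above.
# Assumes 1 GB = 1,024 MB; the page counts are the article's
# own estimates, not measured values.

OLD_DRIVE_MB = 5              # IBM's 1956 drive, in megabytes
LAPTOP_GB = 500               # MacBook Pro flash storage
FREE_GB = 22.4

growth = LAPTOP_GB * 1024 / OLD_DRIVE_MB   # how many times larger
used_gb = LAPTOP_GB - FREE_GB              # data actually stored

A4_PAGES = 31_000_000         # estimated pages if all data were text
BRITANNICA_PAGES = 17_000     # pages in one full encyclopaedia
britannicas = A4_PAGES / BRITANNICA_PAGES

print(growth)                 # 102400.0
print(round(used_gb, 1))      # 477.6
print(round(britannicas))     # 1824
```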
People have been storing information since the Stone Age, ever since they began writing and putting art on tablets and walls. With the invention of paper and ink, the “density of information” increased significantly, packing far more information into a tighter space, first as scrolls and eventually as bound books, which we still use today.
The invention of printing did not substantially increase the density of information, though it greatly contributed to its dissemination by making information easier to copy. In the 20th century, the benchmark for a sizable chunk of information became the above-mentioned Encyclopaedia Britannica.
The Rosetta Stone
I made my first visit to the British Museum in London in 1985. One of my memories of that visit was the Rosetta Stone, a 1,700-pound piece of rock discovered by Napoleon Bonaparte’s soldiers in the sands of Egypt in 1799. This stone became the key to modern understanding of ancient Egyptian hieroglyphs.
On the Rosetta Stone the ancient Egyptians had systematically engraved the same text in three scripts: Ancient Greek, Demotic, and hieroglyphic. A full translation of the Greek version was completed in 1803, but it took another 20 years before linguists worked out the details of the hieroglyphs.
The use and understanding of hieroglyphs had died out in the early centuries A.D. A few Arab historians attempted to decipher them in the 9th and 10th centuries, but without success. European historians tried again in the 16th, 17th, and 18th centuries, still without success.
A French scholar ultimately cracked the code. In 1814 Jean-François Champollion identified the phonetic characters spelling the name of Cleopatra in two inscriptions on a famous obelisk. One inscription was in Greek, the other in hieroglyphs. Champollion then turned his attention to the Rosetta Stone and eventually found corresponding spellings for the pharaonic names Ramses and Thutmose. In 1822, he announced a full translation of the hieroglyphs.
Champollion eventually constructed a hieroglyphic dictionary and a grammar of Ancient Egyptian writings. By cracking the code and creating a dictionary of hieroglyphs he allowed us to decipher other inscriptions. What he did, essentially, was to utilise “big data” to decipher by association. Now in the 21st century, information has gone from scarce to superabundant, and similar approaches to gaining new knowledge from big data have become one of the new holy grails of data science.
The Growth of Big Data
Today the world contains an unimaginably vast amount of digital information which is getting vaster ever more rapidly. This makes it possible to do many things that previously could not be done: spot business trends, prevent diseases, combat crime and so on. Managed well, the data can be used to unlock new sources of economic value, provide fresh insights into science and hold authorities to account.
But the amount of data is also creating a host of new problems. Despite the abundance of tools to capture, process and share all this information (sensors, computers, mobile phones and the like), the amount of data already exceeds the globally available storage space (see chart on the right from The Economist). Moreover, ensuring data security and protecting privacy is becoming harder as the information multiplies and is shared ever more widely around the world.
The business of information management, helping organisations to make sense of their proliferating data, is growing by leaps and bounds. In recent years Oracle, IBM, Microsoft and SAP have between them spent more than $15 billion buying software firms specialising in data management and analytics, and specialised educational programmes like the Data Science Tech Institute in Sophia-Antipolis, France, are being established. The data science industry is estimated to be worth more than $100 billion and is growing by almost 10% a year, roughly twice as fast as the software business as a whole.
There are many reasons for the information explosion. The most obvious one is technology. As the capabilities of digital devices soar and storage prices fall, sensors and gadgets are digitising lots of information that was previously unavailable. Multinationals like Ericsson are building new business lines on this. Also, many more people have access to far more powerful tools. For example, there are 4.6 billion mobile-phone subscriptions worldwide (though many people have more than one, so the world’s 6.8 billion people are not quite as well supplied as these figures suggest), and between 1 billion and 2 billion people use the internet.
Moreover, there are now many more people who interact with information. Between 1990 and 2005 more than 1 billion people worldwide entered the middle class. As they get richer they become more literate, which fuels information growth. The results are showing up in politics and economics and legislation as well.
The amount of digital information increases tenfold every five years. Moore’s law, which the computer industry now takes for granted, says that the processing power and storage capacity of computer chips double, or their prices halve, roughly every 18 months. Software is getting better too. Researchers are also mining social media sites for useful leading economic indicators.
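The two growth figures quoted above are in fact consistent with each other: a doubling every 18 months compounds to roughly a tenfold increase over five years, as a quick calculation shows.

```python
# Doubling every 18 months compounds to roughly 10x over five
# years, matching the "tenfold every five years" figure above.

DOUBLING_MONTHS = 18
YEARS = 5

doublings = YEARS * 12 / DOUBLING_MONTHS   # ~3.33 doublings in 5 years
growth = 2 ** doublings

print(round(growth, 2))                    # 10.08, i.e. about tenfold
```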
Unlike the traditional approach of making predictions from sample-survey data, which currently drives econometric forecasts, these newly available data reflect the real-time behaviour of economic actors, revealing previously undetectable shifts in the economy. For example, data on job searches and job postings could be used to predict employment for the following month.
Properly used, new data sources have the potential to revolutionize economic forecasts. In the past, predictions have had to extrapolate from a few unreliable data points. In the age of Big Data, the challenge will lie in carefully filtering and analysing large amounts of information. It will not be enough simply to gather data; in order to yield meaningful predictions, the data must be placed in an analytical framework.
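As a concrete illustration of such an analytical framework, here is a minimal sketch of the job-postings idea: an ordinary least squares fit of next month’s employment against this month’s job-posting count. All the numbers are purely hypothetical, made up for illustration.

```python
# Sketch: predict next month's employment from this month's job
# postings with a one-variable least squares fit.
# All data below are hypothetical, for illustration only.

postings = [120, 135, 150, 160, 175]              # postings, months 1-5
employment_next = [1012, 1024, 1041, 1049, 1066]  # employment, months 2-6

n = len(postings)
mean_x = sum(postings) / n
mean_y = sum(employment_next) / n

# Ordinary least squares: slope = cov(x, y) / var(x)
slope = (sum((x - mean_x) * (y - mean_y)
             for x, y in zip(postings, employment_next))
         / sum((x - mean_x) ** 2 for x in postings))
intercept = mean_y - slope * mean_x

# Forecast employment for a month with 190 postings
forecast = intercept + slope * 190
print(round(forecast))    # 1080
```

The point is not the arithmetic but the framework: the model turns a real-time signal (postings) into a forecast of a slower official statistic (employment), which is exactly the filtering-and-analysis step the paragraph above describes.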
Asimov and the First and Second Foundation
In the 1940s, the science fiction author and scientist Isaac Asimov wrote three books often referred to as The Foundation Trilogy. I first read them in 1974, when I was ten years old. The story begins on Trantor, the capital planet of the 12,000-year-old Galactic Empire. Though it has endured for so long and appears outwardly to be strong and stable, the empire is corrupt and exhausted and has been declining for centuries.
The only one who realizes this is Hari Seldon, a mathematician who has created the science of Psychohistory by which it is possible to predict future events by extrapolating from historic trends. He has set up a project which is increasingly harassed by Imperial officials.
Utilising big data analysis, Seldon predicts that Trantor will be destroyed within 300 years as the climax to the fall of the Galactic Empire, leading to a 30,000-year period of anarchy before a new civilisation is established. The purpose of his project is to influence events so that the bridging period will be only 1,000 years and not 30,000. This will be done, he says, by his team producing and disseminating an Encyclopaedia Galactica which will contain all human knowledge.
The Imperial commission is satisfied that Seldon’s project is not a threat to the Empire but wants to silence him. He and his team are exiled to Terminus, a small planet on the periphery of the galaxy, to work on the encyclopaedia.
The novels follow the unfolding of Seldon’s plan. For the first book and a half all goes well. Then the plot takes a twist as the plan goes off course through the impact of the highly improbable. In the present time of nascent big data science, it is worthwhile to keep in mind that there are always black swans around.
The Credo for Black Swans
I will never forget having breakfast with Myron Scholes in 1997, when he came to Stockholm to collect his Nobel Prize in Economics and told me and my colleagues about his new venture, Long-Term Capital Management, the super-advanced hedge fund set up by John Meriwether, Scholes and some other Masters of the Universe, and intended to be fail-safe in immunised fixed-income arbitrage.
One year later the fund collapsed in the global fallout of the 1998 Russian financial crisis. Incidentally, I was in St Petersburg at the time, experiencing the crisis first hand.
In Imperial Rome the corona civica, the “civic crown”, was usually held above the head of a Roman general during a triumph, with the individual holding the crown charged to continually repeat “memento mori”: “Remember, you are mortal”. This credo should be remembered by the financial industry’s Masters of the Universe, as well as by big data scientists.
Even though big data and better analytical methods help us make remarkably accurate predictions, we should all beware the inevitable black swans, as events like Harrisburg, Chernobyl and Fukushima, as well as slowly degenerating processes such as rising debt ratios, pollution and environmental change, and ozone layer depletion, show us, on both small and large scales. Having knowledge to the edge of certainty is not enough, as the highly improbable is, after all, possible. Alongside more advanced methods we also need to grow up and learn to handle complexity, and the risks of the hubris of certainty, better.
Having read this article and absorbed how the chain of ideas connects information density and data volumes with disruption and the inevitable risk of hubris, I suspect you feel the need for a stiff gin and tonic… If so, here is a challenging relief to put these thoughts in perspective.