What exactly is dark data?
a. The information you hold on individuals that is subject to data protection rules.
b. Documents, emails, images, and videos the organization has amassed over time.
c. Information you have acquired but which has no formal record regarding its source.
Answer: (b). It sounds sinister, but dark data, as defined by Gartner, simply refers to “the information assets organizations collect, process and store during regular business activities, but generally fail to use for other purposes, such as analytics, business relationships and direct monetizing.”
What share of data goes unused and becomes dark?
a. 10%
b. 30%
c. 95%
Answer: (c). Staggeringly, organizations use less than 5% of the data they collect, while the rest sits untapped in vast data lakes. Given that data centers now account for 2-3% of global greenhouse gas emissions, storing so much unused data presents a huge challenge to meeting carbon-reduction goals.
Why does data “go dark”?
a. Because someone decided it would be a good idea to store all the information that passed through the organization in a vast data warehouse but didn’t label it appropriately.
b. Because people who could exploit its potential don’t know it’s there or don’t understand its value.
c. Because it’s stored in different departments, leading to data silos, fragmentation, isolation, and a general lack of visibility.
Answer: All of the above – and much more. Many factors contribute to the build-up of dark data. Other significant causes include a lack of robust data governance, failure to integrate legacy systems, changing business priorities, organizational data illiteracy, and the accumulation of ROT (data that is redundant, obsolete, or trivial).
Managing your dark data: three recommendations for business leaders
- Evaluate your AI use cases not just for their potential business value, but also for their environmental impact. Consider the carbon footprint of your AI applications and opt for alternatives that use less data (data minimalism) and less energy.
- Adopt a structured approach to categorize your data and implement stringent policies for its lifecycle management, including archival and deletion protocols.
- Rethink how you store your data and, most importantly, how you design and maintain your data pipelines (the processes that capture raw data from various sources and move it to data storage, such as a data lake or data warehouse, for analysis).
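To make the second recommendation more concrete, here is a minimal sketch of what a rule-based lifecycle policy might look like. Everything in it is an assumption for illustration: the `Asset` fields, the `lifecycle_action` function, and the one-year and three-year thresholds are hypothetical, and real retention periods depend on your legal, regulatory, and business requirements.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical thresholds, chosen only for illustration.
ARCHIVE_AFTER = timedelta(days=365)      # idle for a year: move to cheaper storage
DELETE_AFTER = timedelta(days=365 * 3)   # idle for three years: candidate for deletion


@dataclass
class Asset:
    name: str
    last_accessed: date
    is_duplicate: bool = False  # True if this is a redundant copy of another asset


def lifecycle_action(asset: Asset, today: date) -> str:
    """Return 'delete', 'archive', or 'keep' for a single data asset."""
    idle = today - asset.last_accessed
    # Redundant or long-idle data is treated as ROT and flagged for deletion.
    if asset.is_duplicate or idle >= DELETE_AFTER:
        return "delete"
    if idle >= ARCHIVE_AFTER:
        return "archive"
    return "keep"


today = date(2024, 1, 1)
assets = [
    Asset("q3_report.pdf", date(2023, 11, 1)),
    Asset("q3_report_copy.pdf", date(2023, 11, 1), is_duplicate=True),
    Asset("old_logs.csv", date(2022, 6, 1)),
    Asset("legacy_export.zip", date(2019, 1, 1)),
]

# Build a review plan: asset name -> proposed action.
plan = {a.name: lifecycle_action(a, today) for a in assets}
for name, action in plan.items():
    print(f"{name}: {action}")
```

In practice, a plan like this would more likely feed a human review step than trigger deletion automatically, since some data must be retained regardless of how rarely it is accessed.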