Dark data is information that an organization collects but does not use, and that therefore usually goes unanalyzed. Gartner’s IT Glossary defines the term as “information assets organizations collect, process and store during regular business activities, but generally fail to use for other purposes.”
Organizations often gather large volumes of data. Much of it, though related to the organization’s business goals, is a by-product of data generation or an outdated version of raw data, such as profiles of ex-employees, web logs, email correspondence, and old financial records. Dark data is also known as “dusky data.” Organizations generally store all of their unstructured data in repositories, even when they have no plans for it. As a result, IDC estimates that 90% of unstructured data is never analyzed. Dark data must nevertheless be stored properly, because it may contain sensitive information whose exposure can lead to data breaches and regulatory compliance issues and can harm the organization.
Dark data is often just untapped data that could be valuable but is not recovered, because of a lack of resources and skilled analysts and because of the sheer volume of dark data that exists. With big data and AI tools, machine learning, and data mining techniques, valuable insights can now be extracted from dark data and turned into optimized data. Software such as robotic process automation (RPA) automates and streamlines operations. Dark data also needs to be reviewed regularly and kept organized within its repository.
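Regular upkeep of a repository could begin with a simple inventory of files that nobody has touched for a long time. The following is a minimal Python sketch, not a definitive method. It assumes the repository is an ordinary directory tree and treats last-modified time as a rough proxy for "unused"; the function name `find_stale_files` and the 365-day threshold in the usage note are hypothetical choices made for illustration.

```python
import os
import time
from pathlib import Path


def find_stale_files(root, max_age_days, now=None):
    """Return (path, age_in_days) pairs for files not modified within max_age_days.

    Assumption: modification time stands in for "last used". A real audit
    would likely also consult access logs or catalog metadata.
    """
    now = time.time() if now is None else now
    cutoff_seconds = max_age_days * 86400
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            age_seconds = now - path.stat().st_mtime
            if age_seconds > cutoff_seconds:
                stale.append((str(path), age_seconds / 86400))
    # Oldest first, so the most neglected files are reviewed first.
    stale.sort(key=lambda item: item[1], reverse=True)
    return stale
```

Calling `find_stale_files("/path/to/repository", 365)` would list every file under that (placeholder) directory that has not been modified in more than a year. The result is only a candidate list for review: a file's age says nothing about whether it holds sensitive information or is subject to retention rules.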
Data analytics is traditionally linked to structured data, but dark data analytics is the process of unearthing untapped data to find hidden opportunities.
Dark data analytics sifts through three categories of information: traditional unstructured data that organizations have already stored (e.g., emails, documents); non-traditional unstructured data, usually media assets that cannot be processed through big data methods; and the huge volumes of data found in the deep web, which are curated from a variety of sources (government agencies, third-party domains, etc.).