AI-assisted data management

Krishna Thiagarajan,

Senior Vice President, Data and Analytics Service Line

Published: May 17, 2024

Data is a fundamental part of business decision-making, and accurate outcomes require quality inputs. To engineers in pursuit of a solid data bedrock, many of the available technologies can seem inadequate for the job. This has been true of Hadoop, blockchain, and appliances such as IBM Netezza and Greenplum.

Initially, the combination of cloud data and mesh architectures seemed promising. However, new and developing artificial intelligence (AI) technologies precipitated a new discourse, and the quest for a strong data bedrock continues. In the following discussion, AI is synonymous with machine learning (ML).

In November 2022, AWS announced a zero-ETL (extract, transform, and load) future. Soon, everyone was discussing the application of AI across the data management life cycle, and in particular AI’s ability to automate aspects of data engineering (e.g., data integration, data modeling, data quality, master data management [MDM], metadata management, and consumption).

It was an interesting concept. Initially, engineers hoped for quality AI predictions and prescriptions that resulted from trustworthy data; now, to improve those predictions, they’re examining the technology itself. Given AI’s near-ubiquitous presence in industries worldwide, it’s not surprising that engineers would eventually employ AI to accelerate and automate the data management life cycle. The next question becomes: Are people aware of such an intervention, and do they have an appetite for it?

“We want 80% of our decisions to be data-driven.” “We want 85% of our activities to be automated using AI/ML.” “We want to explore generative AI to disrupt the way we do business today.” These are loud and clear statements that we hear from the Fortune 100 players today. Demand exists for AI’s application within the data engineering and management life cycle. Generative AI can assume a role in all aspects of data engineering.

Let’s examine data ingestion, for instance. ETL tools abound in the marketplace today, and each has specific features that make it the strongest option in certain situations. However, it is important to note that ETL can account for as much as 35-40% of the cost of a data warehouse or data framework implementation. Bringing this cost down by even 10% will result in substantial savings in the overall cost of plumbing, which is effectively a sunk cost. Existing tools and techniques can take a source-to-target mapping document and accurately generate the necessary ETL mappings/graphs automatically. These tools also consider other metadata to fine-tune the ETL logic they build. Using AI in ETL, we can build adaptive features that cater to changes in data structures and logic. With AI, we can also reduce the time spent on root cause analysis (RCA) and subsequent mapping changes.
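To make the idea of generating ETL logic from a source-to-target mapping document concrete, here is a minimal sketch in Python. It is an illustration only, not how any particular ETL tool works: the `ColumnMapping` class, the `generate_insert_select` function, and the table and column names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnMapping:
    """One row of a hypothetical source-to-target mapping document."""
    source_column: str
    target_column: str
    transform: str = "{col}"  # SQL expression template; {col} is the source column


def generate_insert_select(source_table, target_table, mappings):
    """Render an INSERT ... SELECT statement from the mapping rows."""
    if not mappings:
        raise ValueError("at least one column mapping is required")
    target_cols = ", ".join(m.target_column for m in mappings)
    select_exprs = ",\n    ".join(
        f"{m.transform.format(col=m.source_column)} AS {m.target_column}"
        for m in mappings
    )
    return (
        f"INSERT INTO {target_table} ({target_cols})\n"
        f"SELECT\n    {select_exprs}\n"
        f"FROM {source_table};"
    )


# Hypothetical mapping rows, as they might appear in a mapping spreadsheet.
mappings = [
    ColumnMapping("cust_id", "customer_id"),
    ColumnMapping("cust_nm", "customer_name", "TRIM(UPPER({col}))"),
    ColumnMapping("dob", "birth_date", "CAST({col} AS DATE)"),
]
sql = generate_insert_select("staging.customers", "dw.dim_customer", mappings)
print(sql)
```

Because the mapping rows are plain data, a change in the source structure becomes a change to one row rather than to hand-written ETL code, which is the property an adaptive, AI-assisted tool would build on.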

Data modeling also offers ample opportunity for automation. By considering the source structures and target reports, an engineer can automatically build the entity relationship diagram (ER diagram) and, eventually, the logical and physical data models. This process accelerates the building of a data foundation and can reduce the time required by 15%. Using generative AI, we can scan global repositories for models best suited to banking, insurance, or other domains. MDM is an extension of this process. The hybrid approach, one of the best practices in data modeling, is a mixture of top-down and bottom-up approaches. Along with the use of best-of-breed and domain-specific models, AI tools naturally combine both approaches to provide a state-of-the-art model for a customer’s unique needs.
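As a rough illustration of deriving relationships from source structures, the sketch below guesses foreign keys with a simple naming heuristic: a column that shares its name with another table's primary key is treated as a candidate reference. The heuristic, the `infer_relationships` function, and the sample tables are assumptions made for this example; a real modeling tool would weigh far more metadata.

```python
def infer_relationships(tables):
    """Return (child_table, column, parent_table) for each candidate foreign key.

    `tables` maps a table name to {"primary_key": str, "columns": [str, ...]}.
    """
    # Map each primary-key column name to the table that owns it.
    pk_owner = {spec["primary_key"]: name for name, spec in tables.items()}
    relationships = []
    for child, spec in tables.items():
        for column in spec["columns"]:
            parent = pk_owner.get(column)
            if parent is not None and parent != child:
                relationships.append((child, column, parent))
    return relationships


# Hypothetical source structures.
tables = {
    "customers": {"primary_key": "customer_id",
                  "columns": ["customer_id", "name"]},
    "orders": {"primary_key": "order_id",
               "columns": ["order_id", "customer_id", "order_date"]},
    "order_items": {"primary_key": "order_item_id",
                    "columns": ["order_item_id", "order_id", "sku"]},
}
edges = infer_relationships(tables)
for child, column, parent in edges:
    print(f"{child}.{column} -> {parent}")
```

The resulting edges are the raw material for an ER diagram; an engineer would still review them before promoting them to a logical model.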

Considerable potential exists within data governance. Today, use cases include an AI solution that scans data policy documentation and automatically configures the governance tool to implement the policies. Huge potential also exists in evaluating the data in the clean storage area to verify adherence to policy requirements. For the same purpose, AI technologies might also audit activities. Auditing can ultimately improve observability and the ethical usage of AI, ensuring that data serves its intended purpose.
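A small sketch can show what verifying stored data against policy requirements might look like once the policies have been turned into machine-readable rules. The `PolicyRule` class, the `find_violations` function, the rule names, and the sample rows are all hypothetical; extracting such rules from policy documents is the part an AI solution would take on.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class PolicyRule:
    """Hypothetical rule: values in `column` must fully match `allowed_pattern`."""
    name: str
    column: str
    allowed_pattern: str


def find_violations(rows, rules):
    """Return (row_index, rule_name, value) for every value that breaks a rule."""
    compiled = [(rule, re.compile(rule.allowed_pattern)) for rule in rules]
    violations = []
    for index, row in enumerate(rows):
        for rule, pattern in compiled:
            value = row.get(rule.column)
            # A missing value counts as a violation in this sketch.
            if value is None or not pattern.fullmatch(str(value)):
                violations.append((index, rule.name, value))
    return violations


rules = [
    PolicyRule("ssn_must_be_masked", "ssn", r"\*{3}-\*{2}-\d{4}"),
    PolicyRule("country_is_iso2", "country", r"[A-Z]{2}"),
]
rows = [
    {"ssn": "***-**-1234", "country": "US"},
    {"ssn": "123-45-6789", "country": "US"},   # unmasked identifier
    {"ssn": "***-**-9876", "country": "usa"},  # wrong country format
]
violations = find_violations(rows, rules)
for violation in violations:
    print(violation)
```

The list of violations doubles as an audit record of which rule failed on which row.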

Data quality is an incredibly important consideration for people exploring AI tools. Topical concerns include the use of statistical processes to check the quality of data, the use of fuzzy logic to match and merge data, and the use of advanced AI techniques for data cleansing. AI techniques can also support encryption needs within data masking, a process that alters sensitive values without sacrificing the statistical relevance of the data.
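The two simplest techniques mentioned above can be sketched with Python's standard library alone. The first function is a statistical check that flags outliers using the median-based modified z-score; the second uses approximate string matching from `difflib` as a stand-in for fuzzy matching of records. The function names, thresholds, and sample data are assumptions chosen for illustration.

```python
import statistics
from difflib import SequenceMatcher


def flag_outliers(values, threshold=3.5):
    """Return indexes of values whose modified z-score exceeds the threshold."""
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        return []  # no spread to measure against
    return [
        i for i, v in enumerate(values)
        if abs(0.6745 * (v - median) / mad) > threshold
    ]


def similarity(a, b):
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def probable_duplicates(names, threshold=0.85):
    """Return pairs of names that are similar enough to review as duplicates."""
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if similarity(first, second) >= threshold:
                pairs.append((first, second))
    return pairs


# Hypothetical sample data.
readings = [102.0, 98.5, 101.2, 99.8, 100.4, 1000.0]
customers = ["John Smith", "Jon Smith", "Jane Doe"]

print(flag_outliers(readings))
print(probable_duplicates(customers))
```

The median-based score is used here because a single extreme value inflates the ordinary standard deviation enough to hide itself in small samples.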

Also relevant is the generation of synthetic data to test the data value chain. Synthetic data enables remote operations that prioritize data sovereignty.
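As a minimal sketch of the idea, the example below fits a normal distribution to one numeric column and draws synthetic values from it, so that test data keeps roughly the same mean and spread without copying any real record. The normal-distribution assumption, the `synthesize_column` function, and the sample values are illustrative only; production-grade synthetic data generators model far richer structure.

```python
import random
import statistics


def synthesize_column(real_values, count, seed=0):
    """Draw `count` synthetic values from a normal distribution fitted to real_values."""
    if len(real_values) < 2:
        raise ValueError("need at least two real values to estimate spread")
    mean = statistics.mean(real_values)
    spread = statistics.stdev(real_values)
    rng = random.Random(seed)  # seeded so that test runs are repeatable
    return [rng.gauss(mean, spread) for _ in range(count)]


# Hypothetical real column, for example transaction amounts.
real_amounts = [42.0, 55.5, 48.3, 61.2, 50.1, 39.9, 57.4, 45.6]
synthetic_amounts = synthesize_column(real_amounts, count=5000, seed=7)

print("real mean:", round(statistics.mean(real_amounts), 2))
print("synthetic mean:", round(statistics.mean(synthetic_amounts), 2))
```

Only the fitted parameters need to leave the source environment, which is what makes this approach attractive where data sovereignty matters.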

In short, AI techniques within the data engineering value chain can remove existing barriers around data sharing and reduce operating costs by 50%. More and more customers are leveraging AI within their ecosystems and hoping to reap the benefits, which include lower operating costs and greater data accuracy.
