Perspectives

How’s Big Data Looking for 2017?

Arvind Purushothaman
Practice Head and Senior Director – Information Management & Analytics

Get ready for more real-time and non-relational or unstructured data, along with a growing need for self-service tools, data governance, and data quality measurement.

The year 2016 was an important one in the world of big data. What used to be hype became the norm as more businesses realized that data, in all forms and sizes, and the infrastructure around it are critical to making the best possible decisions. In 2017, we will see continued growth of systems that support massive volumes of non-relational or unstructured data, as well as a move toward processing more data in real time. Data governance and data quality will grow in importance as organizations bring additional data sources into decision making. These systems will evolve and mature to operate well within enterprise IT systems and standards, enabling both business users and data scientists to fully realize the value of big data and move on to advanced analytics. The top big data trends for the year ahead include the following.

Hadoop projects mature! Enterprises continue their move from Hadoop proofs of concept to production: In a recent survey of 2,200 Hadoop customers, only three percent of respondents anticipated doing less with Hadoop in the next 12 months. Seventy-six percent of those who already use Hadoop plan on doing more within the next three months, and almost half of the companies that haven't deployed Hadoop say they will within the next 12 months. As further evidence of Hadoop becoming a core part of the enterprise IT landscape, we'll see investment grow in the components that surround enterprise systems, such as security.

End users, meet big data; options expand to access data in Hadoop: With Hadoop gaining more traction in the enterprise, we see growing demand from end users for the same fast data access they've come to expect from traditional data warehouses. To meet that demand, we see growing adoption of technologies that let business users query data directly in Hadoop, further blurring the line between traditional BI concepts and the world of big data. However, this is one area where current technology does not completely meet end users' querying needs, and we expect vendors to make significant improvements in the coming year.
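
To make this concrete, here is a minimal sketch of the pattern using Spark SQL, one SQL-on-Hadoop engine in this space. The HDFS path and column names are illustrative assumptions, not details from any particular deployment.

```python
# A minimal sketch, assuming data already lands in HDFS as Parquet; the path
# and columns are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hadoop-direct-query").getOrCreate()

# Read Parquet files straight out of Hadoop -- no ETL into a warehouse first.
orders = spark.read.parquet("hdfs:///data/sales/orders")
orders.createOrReplaceTempView("orders")

# Analysts (or a BI tool) can now issue familiar warehouse-style SQL.
top_regions = spark.sql("""
    SELECT region, SUM(amount) AS revenue
    FROM orders
    GROUP BY region
    ORDER BY revenue DESC
    LIMIT 10
""")
top_regions.show()
```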

Options for end users to prepare and discover all forms of data grow: Self-service data preparation tools are exploding in popularity. This is partly due to the shift toward business-user-driven data discovery tools that considerably reduce the time required to analyze data. Business users also want to reduce the time and complexity of preparing data for analysis, which is especially important in the world of big data, where they deal with a wide variety of data types and formats.
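
As a rough illustration of what such tools automate under the hood, here is a hypothetical preparation step in pandas; the file and column names are invented for the example.

```python
# A minimal sketch of the steps self-service prep tools handle: type coercion,
# deduplication, and missing-value treatment. All names are hypothetical.
import pandas as pd

raw = pd.read_csv("customer_export.csv")  # assumed raw extract

prepared = (
    raw.drop_duplicates(subset="customer_id")           # remove repeat rows
       .assign(signup_date=lambda df: pd.to_datetime(
           df["signup_date"], errors="coerce"))         # coerce bad dates
       .dropna(subset=["customer_id"])                  # drop unusable rows
       .fillna({"segment": "unknown"})                  # default a category
)
prepared.to_parquet("customers_prepared.parquet")       # analysis-ready file
```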

Data warehouse growth is heating up in the cloud: The death of the data warehouse has been overhyped for some time now, but it's no secret that growth in this segment of the market has been slowing. We now see a major shift of this technology to the cloud, where Amazon led the way with an on-demand cloud data warehouse. Analysts state that 90 percent of companies that have adopted Hadoop will also keep their data warehouses, and with these new cloud offerings, those customers can dynamically scale the warehouse's storage and compute up or down relative to the larger volumes of information stored in their Hadoop data lake.

The buzzwords converge as IoT, the cloud, and big data come together: The technology is still in its early days, but data from Internet of Things devices will become one of the killer apps for the cloud and a driver of petabyte-scale data explosion. For this reason, we see leading cloud and data companies such as Google, Amazon Web Services, IBM, and Microsoft bringing Internet of Things services to life, where device data can move seamlessly into their cloud-based analytics engines.

Data governance and data quality continue to gain prominence: Organizations that are leveraging data for competitive advantage have realized that data quality is key to making the right decisions. Given this, we see organizations investing in data governance initiatives: establishing a center of excellence (CoE) with data stewards; mastering key data elements such as customer and product; profiling data in data lakes as it lands, so that business users can create rules and set alerts; focusing on end-to-end data lineage; and managing business and technical metadata better. Some of this is necessitated by regulation, especially in banking and financial services; the rest stems from a need to have more faith in the data itself.
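
To show the profile-and-alert idea in the simplest terms, here is a hypothetical Python sketch; the rules and thresholds are invented, and a real governance platform would manage them centrally.

```python
# A hypothetical sketch: profile a batch as it lands in the lake and alert
# when a steward-defined quality rule is violated.
import pandas as pd

def profile(df: pd.DataFrame) -> dict:
    """Basic per-column quality metrics: null percentage and cardinality."""
    return {col: {"null_pct": df[col].isna().mean() * 100,
                  "distinct": df[col].nunique()}
            for col in df.columns}

# Illustrative rules a data steward might register: max allowed % of nulls.
RULES = {"customer_id": 0.0, "email": 5.0}

def check_rules(metrics: dict, rules: dict) -> list:
    """Return an alert message for every rule the new batch violates."""
    return [f"{col}: {metrics[col]['null_pct']:.1f}% nulls exceeds {limit}%"
            for col, limit in rules.items()
            if metrics[col]["null_pct"] > limit]

batch = pd.read_parquet("lake/landing/customers.parquet")  # assumed path
for alert in check_rules(profile(batch), RULES):
    print("ALERT:", alert)
```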

Organizations have too much of the same data: In a traditional data warehousing architecture, where data moves from the source to staging, then to a third-normal-form (3NF) data warehouse, and finally to a reporting-oriented data mart, there is too much duplication (sometimes for good reason) and too much lag between the time data is created and the time it is consumed. Until recently, the only alternative was reporting directly from the source, which spawned reporting marts. With advances in data virtualization technology, it is now possible to create a logical view of data that resides across multiple source systems and data warehouse-style databases without pushing the data into an integrated storage structure. This reduces duplication, improves business agility, and helps keep costs down.
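
The sketch below illustrates the idea using Spark SQL as a simple federation layer; dedicated data virtualization products go much further, and the connection details and schemas here are purely hypothetical.

```python
# A hedged sketch of a logical view over two systems with Spark SQL: a
# warehouse table read over JDBC joined with files in a data lake, with
# nothing copied into a new integrated store. Assumes the Postgres JDBC
# driver is on the classpath; all names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("logical-view").getOrCreate()

# Source 1: a dimension table in the existing relational warehouse.
customers = (spark.read.format("jdbc")
             .option("url", "jdbc:postgresql://warehouse:5432/dw")
             .option("dbtable", "dim_customer")
             .option("user", "reader")
             .option("password", "secret")
             .load())

# Source 2: raw clickstream events sitting in the data lake.
events = spark.read.parquet("hdfs:///lake/clickstream")

customers.createOrReplaceTempView("customers")
events.createOrReplaceTempView("events")

# One logical view spans both systems; data is pulled together at query time.
activity = spark.sql("""
    SELECT c.customer_id, c.segment, COUNT(*) AS clicks
    FROM customers c
    JOIN events e ON c.customer_id = e.customer_id
    GROUP BY c.customer_id, c.segment
""")
activity.show()
```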

Advanced analytics goes mainstream: Analytics has been largely restricted to descriptive analytics, with only some organizations, or certain functions within an organization, leveraging predictive analytics. Quant-oriented industries have been using advanced analytics for many years, and with the advent of big data technologies, their traditional models have become more sophisticated, incorporating additional variables. The main obstacle to moving from descriptive to predictive along this continuum has been the inability to scale computing power and storage at the rate at which data is exploding.
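
As a toy example of that shift, the sketch below fits a simple churn classifier with scikit-learn instead of merely reporting last quarter's churn; the feature table and column names are hypothetical.

```python
# A minimal sketch of the descriptive-to-predictive step: train a model to
# predict churn rather than only report it. All data and names are invented.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

df = pd.read_parquet("customer_features.parquet")   # assumed feature table
X = df[["tenure_months", "monthly_spend", "support_tickets"]]
y = df["churned"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

# Evaluate discrimination on held-out data.
print("Test AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```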

We also expect unsupervised learning and advanced machine learning techniques, such as deep learning in the field of neural networks, to become more mainstream in 2017 because of their ability to automate feature selection, extraction, and engineering. Cognitive APIs such as IBM Watson's will also see higher adoption, especially as they become more industry specific.
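
For a flavor of what automated feature extraction means in practice, here is a minimal autoencoder sketch in Keras; random numbers stand in for a real dataset, and the layer sizes are arbitrary.

```python
# A hedged sketch of unsupervised feature learning with a tiny autoencoder:
# the network compresses 50 raw inputs into 8 learned features with no
# hand-crafted feature engineering. Data and dimensions are placeholders.
import numpy as np
from tensorflow import keras

X = np.random.rand(1000, 50).astype("float32")   # stand-in feature matrix

inputs = keras.Input(shape=(50,))
encoded = keras.layers.Dense(8, activation="relu")(inputs)      # bottleneck
decoded = keras.layers.Dense(50, activation="sigmoid")(encoded) # reconstruct

autoencoder = keras.Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=5, batch_size=32, verbose=0)  # unsupervised fit

# The 8-dimensional code is the automatically learned representation.
encoder = keras.Model(inputs, encoded)
features = encoder.predict(X)
```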