Creating a single data lake to serve a newly merged Dell Inc. and EMC Corp. is a bit like harnessing the tectonic shifts in the Earth’s crust that form the more traditional lakes some of us would rather be fishing on.
Both companies—united last fall as Dell Technologies, the world’s largest privately held technology company—have relied on somewhat different technologies to perform critical Big Data analytics that are key to their success. Critical data for each company was housed in multiple legacy systems and platforms. The challenge was how to bring everything together in a central repository—i.e. a data lake.
As soon as the groundbreaking merger took place last fall, a newly merged Big Data team, for which I serve as lead architect, began working to develop a world-class data ecosystem that would provide the right data, in the right place, in the right format and at the right time to solve current challenges and position the company for digital transformation.
The data lake has not only allowed IT to open up Big Data to a broader community of internal business users, it is now helping us channel unprecedented amounts of information to Dell EMC customers as well.
Using data lake technology, for example, IT and our Dell EMC business groups forged a groundbreaking partnership to allow customers to leverage Big Data to monitor and proactively manage their IT environments. We created a tool called MyService360, an online solution that gives Dell EMC Support customers and partners easier and faster access to near real-time service information. It features a personalized dashboard that provides customers with a 360-degree view of their environment and customer service experience.
Launched last May, MyService360 only scratches the surface of the potential value expected to spring from leveraging Big Data in the data lake. Having all the data in a centralized location provides easy access and gives developers and data scientists the opportunity to gain data insights that would be extremely difficult to achieve without the data lake. Those insights can then be used to create metrics that we can share to empower our customers.
Whether companies refer to results, outcomes, ROI, or case studies, Big Data and data science are finally moving beyond the hype and proving to deliver dividends over time. Several new Big Data technologies and predictive tools have been launched to meet the growing demand within business and technology groups to harness the constant growth of both structured and unstructured data within and outside of the enterprise. But such technologies and tools won’t be effective unless you define the problem to be addressed.
Most data science initiatives start with a proof of concept (PoC) or, in some cases, with a proof of value (PoV) if the foundational concept is clearly established. Developing a pipeline of PoCs through working sessions with data scientists, business subject matter experts (SMEs), data experts, and leaders can be extremely helpful. Following this, prioritize PoCs by stack-ranking each of them based on business value and ease of implementation, which factors in availability of data, granularity, and quality.
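The stack-ranking step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the scoring model our team uses: the 1-to-5 scales, the equal weighting between business value and ease of implementation, the averaging of the three data factors, and the PoC names and scores are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PoC:
    """A candidate proof of concept, scored on assumed 1 (low) to 5 (high) scales."""
    name: str
    business_value: int
    data_availability: int
    data_granularity: int
    data_quality: int

    @property
    def ease_of_implementation(self) -> float:
        # Assumption: ease is the plain average of the three data factors.
        return (self.data_availability
                + self.data_granularity
                + self.data_quality) / 3


def stack_rank(pocs, value_weight=0.5):
    """Return PoCs ordered from highest to lowest combined score.

    value_weight is a hypothetical knob: the share of the score that comes
    from business value; the remainder comes from ease of implementation.
    """
    def score(poc):
        return (value_weight * poc.business_value
                + (1 - value_weight) * poc.ease_of_implementation)

    return sorted(pocs, key=score, reverse=True)


# Made-up candidates, for illustration only.
candidates = [
    PoC("poc-a", business_value=3, data_availability=5,
        data_granularity=3, data_quality=4),
    PoC("poc-b", business_value=5, data_availability=4,
        data_granularity=4, data_quality=3),
    PoC("poc-c", business_value=4, data_availability=2,
        data_granularity=2, data_quality=2),
]

for rank, poc in enumerate(stack_rank(candidates), start=1):
    print(rank, poc.name)
```

In practice, the scores would come out of the working sessions with data scientists, SMEs, and leaders, and the weighting would be agreed with the business rather than fixed in code.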
As organizations unleash the power of the data lake by providing business broader access to more and more data, they are facing a growing IT dilemma: how to keep improperly governed or poor-quality data from polluting the data lake.
While IT’s traditional approach to managing data governance and quality has been quite effective over the years, the magnitude of data in today’s data lake is much larger than traditional data warehouse levels. Traditional tools and tactics are being overwhelmed by Big Data in the lake.
There are, however, strategies that organizations can use to reshape data governance and quality standards in the Big Data world. While our tactics and tools are still evolving, I will share some of the efforts we are developing at EMC IT to keep our data lake clean.
From using analytics to predict how our storage arrays will perform in the field, to engineering product configurations to best meet customers’ future needs, EMC is just beginning to tap into the gold mine of intelligence waiting to be extracted from our new data lake.
In fact, we are currently working on dozens of business use cases that are projected to drive millions in revenue opportunities. And we are just scratching the surface. There’s a lot more data available, more to be harvested, and more analytics to be built out as data scientists and business users hit their stride in exploring a new era of data-driven innovation at EMC.
As I noted in my earlier blog (The Analytics Journey Leading to the Business Data Lake), EMC IT embarked more than two years ago on creating a data lake to transition from traditional business intelligence to advanced analytics. A key focus of this effort was to address the fact that data scientists and business users seeking to leverage our growing amount of data were stifled by the need for such projects to go through IT, which was a costly and slow process that discouraged innovation.
We now have the foundation and tools in place to use data and analytics to create sustainable, long-term competitive differentiation. To get here, we worked closely with EMC affiliate Pivotal Software, Inc. to mature together and leverage the multi-tenancy capabilities of their Big Data Suite.
With the expanding volume of information in the digital universe and the increasing number of disk drives required to store that information, disk drive reliability prediction is imperative for EMC and EMC customers.
Figure 1: An illustration of the information expansion in recent years and expected growth
Disk drive reliability analysis, which is a general term for the monitoring and “learning” process of disk drive prior-to-failure patterns, is a highly explored domain both in academia and in the industry. The Holy Grail for any data storage company is to be able to accurately predict drive failures based on measurable performance metrics.
Naturally, improving the logistics of drive replacements is worth big money for the business. In addition, predicting that a drive will fail long enough in advance can facilitate product maintenance, operation and reliability, dramatically improving Total Customer Experience (TCE). In the last few months, EMC’s Data Science as a Service (DSaaS) team has been developing a solution capable of predicting the imminent failures of specific drives installed at customer sites.
From taking charge of healthcare choices to customizing product purchases, today’s consumers are increasingly using self-service, social, and mobile digital capabilities. EMC’s new MyService360 now brings that same personalized, proactive service to our Online Support customers.
Powered by EMC’s data lake solution, MyService360 (launched at EMC World 2016 on May 2) gives EMC Support customers easier and faster access to real-time information at their fingertips. Using its easy-to-read visuals and powerful analytics, customers can view analysis of code levels, health, and risk scoring on their installed EMC products, service activity views by site, incident management, and more.
It takes many different best-of-breed technologies to effectively harvest “game-changing” analytics value from the data lake. Getting the right architecture to navigate your data lake requires a deep understanding of both the needs of Big Data and the available technologies in order to match analytics use cases with the appropriate platforms to get results.
Do you need to analyze large amounts of data fast or process many queries simultaneously? Is the data you are using organized in columns and rows, customer records perhaps? Or are you searching document files?
Let’s look at the basics of data lake architecture, some of the technologies and tools you should consider, and how EMC IT is approaching this crucial process.
In the expanding world of Big Data, there is more and more information out there that can help your organization target the right customers with the most effective messages for the right products and services at the right time. EMC IT is using data lake technology to help our Marketing and Sales teams gain unprecedented insights into our customer behaviors, needs and sentiments to drive effective marketing.
At the center of this effort is our Marketing Science Lab, which provides advanced analytics support for Marketing using a shared Marketing and Sales workspace in the data lake. The Lab collaborates with Sales on shared data and models to deliver 360-degree views of customer behaviors by analyzing a vast array of data from internal and, increasingly, external sources.
With the digital universe expected to swell to 44 zettabytes of data by 2020, today’s enterprises need a central data repository that can process increasing volumes of all types of data faster to let business users make better, real-time decisions. In short, they need a stronger backbone; they need the data lake!
Not only do traditional databases constrain real-time and shared data analytics due to their siloed nature, they also lack the technology to accommodate the skyrocketing level and types of data being created at an increasing rate. After all, according to IDC research, the growing number of smart devices that analyze everything from home heating systems to consumer information will mean that within four years there will be some 7 billion connected people using an estimated 30 billion devices.