Creating a single data lake to serve a newly merged Dell Inc. and EMC Corp. is a bit like harnessing the tectonic shifts in the Earth’s crust that form the more traditional lakes some of us would rather be fishing on.

Both companies—united last fall as Dell Technologies, the world’s largest privately held technology company—have relied on somewhat different technologies to perform critical Big Data analytics that are key to their success. Critical data for each company was housed in multiple legacy systems and platforms. The challenge was how to bring everything together in a central repository—i.e. a data lake.

As soon as the groundbreaking merger took place last fall, a newly merged Big Data team, for which I serve as lead architect, began working to develop a world-class data ecosystem that would provide the right data, in the right place, in the right format and at the right time to solve current challenges and position the company for digital transformation.

Seven months later, we have built the foundation for the Dell Data Lake and stood up considerable functionality. We are continuing to integrate data from legacy systems and harness its value to enable Dell to act like a single company.

While our data lake formation continues, here are some insights on our data lake journey so far.

Making the data connection

A first step toward integrated analytics was to design what the Dell Data Lake would look like. Our starting point was two Big Data platforms that were similar but not identical. Dell relied on Cloudera’s Apache Hadoop-based software and Teradata, a massively parallel processing (MPP) database, for its analytics, transformed data sets and operational reporting. In contrast, EMC had stood up its Big Data platform to analyze raw and transformed data sets using the MPP platform Greenplum and Hadoop-based Hortonworks software for processing large data sets.

We started by getting access to each other’s data platforms. We had to overcome typical barriers between two large companies trying to work together, such as separate firewalls and firewall rules on each side, IP address conflicts and network routing. Once we could reach each other’s data, we had to address the fact that we had two different sets of enterprise applications, like ERP and manufacturing systems, feeding two different Big Data solutions. Each solution was performing operational reporting and analytics on its own data sets. To do meaningful analytics, you can’t have processes feeding into separate systems and then try to merge the results into one. You have to run the analytics on one integrated data set or you get inaccurate, skewed or misleading results.
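To make the point about merged results concrete, here is a minimal sketch in Python. The customer names and the two function names are invented for illustration and are not taken from Dell's or EMC's systems. It shows how a metric computed separately on each platform and then added together can differ from the same metric computed on one combined data set.

```python
# Hypothetical customer lists; real data would come from each company's platform.
dell_customers = {"acme", "globex", "initech"}
emc_customers = {"globex", "initech", "umbrella"}


def merged_result_count(a, b):
    """Count unique customers per system, then add the two results together."""
    return len(a) + len(b)


def single_dataset_count(a, b):
    """Combine the raw data first, then count unique customers once."""
    return len(a | b)


# Customers shared by both companies are counted twice in the first approach.
print(merged_result_count(dell_customers, emc_customers))   # 6
print(single_dataset_count(dell_customers, emc_customers))  # 4
```

The gap between the two numbers comes entirely from customers that appear in both systems, which is the kind of skew that only shows up once the raw data sits in one place.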

So the challenge was how to get two sets of source systems that were already writing to two Big Data solutions to write to a third Big Data solution, and then integrate those data sets into a common data lake.

Since analytics requires raw data, we first had to ingest data from Dell’s legacy applications and from EMC’s legacy applications, such as the two ERP systems, into the data lake. And while the ERP systems held similar data, they used two different schemas, or database blueprints. That meant that once we had moved the data, we had to map the data in one schema to the data in the other so we could integrate both into a common data model and do reporting and analytics on a single data set.
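The schema-mapping step can be pictured with a small Python sketch. Every field name, mapping table and function below is a hypothetical stand-in, since the actual ERP schemas and the common data model are not described here. It only illustrates the general idea of renaming fields from two source schemas into one shared shape.

```python
# Hypothetical field names; the real ERP schemas are not described in this article.
DELL_TO_COMMON = {
    "cust_id": "customer_id",
    "cust_nm": "customer_name",
    "ord_amt": "order_amount",
}
EMC_TO_COMMON = {
    "CustomerNumber": "customer_id",
    "CustomerName": "customer_name",
    "OrderTotal": "order_amount",
}


def to_common(record, mapping, source):
    """Rename source fields to the common model and tag the record's origin."""
    missing = [field for field in mapping if field not in record]
    if missing:
        raise KeyError(f"{source} record is missing fields: {missing}")
    common = {target: record[field] for field, target in mapping.items()}
    common["source_system"] = source
    return common


def integrate(dell_rows, emc_rows):
    """Map rows from both schemas into one list that shares a single schema."""
    rows = [to_common(r, DELL_TO_COMMON, "dell_erp") for r in dell_rows]
    rows += [to_common(r, EMC_TO_COMMON, "emc_erp") for r in emc_rows]
    return rows


combined = integrate(
    [{"cust_id": 1, "cust_nm": "Acme", "ord_amt": 100.0}],
    [{"CustomerNumber": 2, "CustomerName": "Globex", "OrderTotal": 250.0}],
)
for row in combined:
    print(row)
```

Keeping a source_system tag on each row is one possible design choice: it lets reports run on the single data set while still tracing any record back to the ERP system it came from.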

This clearly is a very long process. We have been working on this integration since last fall and have only scratched the surface of fully stocking the Dell Data Lake. Our priorities thus far have been Sales, Customer and Service data, all of which will give us better insights into our customers and help us function as a more united company.

Figure: Data Lake Architecture

Thinking about the future

Among the key challenges we continue to face is deciding which architecture and tools to use for the Dell Data Lake. The question is how to blend two Big Data solutions, utilizing the best parts of both, to make one. Enterprise architects, delivery organization architects and administrators have discussed this issue at length.

Ultimately, the challenge is that you have two distinct groups of people that have done things somewhat differently and the goal is to get them to forget about the past and think about the future.

While it isn’t easy getting everyone to agree, we have made good progress and built a solid foundation. I expect the Dell Data Lake to be fully built and functional within two years. Even so, while the data lake already functions well and we have a plan to complete it, we will still be building and changing it two years from now. Technology, insights and requirements are constantly changing and giving us newer and better ways to get value out of the data, so the lake will continually evolve with them.

Check out Darryl Smith’s Technology Breakout Session at the 2017 Dell World in Las Vegas on Dell IT’s Big Data Journey with Big Possibilities on May 9th from 8:30 to 9:30 a.m. and May 10th from 12 to 1 p.m.

Darryl Smith


Chief Data Platform Architect, Distinguished Engineer, Dell IT

