5 Must-See Docker Big Data Use Cases That Show Docker's Processing Power

Big Data is one of the biggest IT trends of recent years. The vast majority of CIOs are collecting and managing more business information than they did two years ago.

CIOs and IT Operations have a common goal: prepping their IT infrastructure to manage the data deluge and growing revenue by making better use of the data they collect.

They also share some common frustrations. Often the right systems are not in place to gather the information they need, and many struggle to give their business managers access to pertinent information.

That doesn’t mean all is lost, however. Arming your organization with the appropriate technology, staff, and systems/processes needed to optimize information for true business intelligence can help manage the data deluge. Apply the following approaches to increase your chances of a successful outcome.

1) Use Docker To Avoid Dependency Hell (Obviously)

If you ask developers which set of data tools they want to use, guess what’s going to happen? Each of them is going to ask for their own set, and every tool brings its own dependencies, which must then be distributed to each machine in the cluster.

You may think this situation is manageable, but get enough developers on the same cluster and it doesn’t take long for one tool’s requirements to break another’s. You’re in dependency hell, my friend.

In this situation, you have two choices – get your entire development team to standardize on a common toolset (good luck with that!), or use Docker. Docker packages each tool, along with all of its dependencies, in a self-contained image. This means that different jobs can use different versions of the same tool on the same machine without a conflict.
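Here is a minimal sketch of what that looks like in practice. It only builds the `docker run` command lines for two jobs that pin different versions of the same tool; the image name `example/spark`, its tags, and the script names are hypothetical placeholders, not real published images.

```python
# Sketch: two jobs pin different versions of the same tool via image tags.
# The image "example/spark", its tags, and the script names are hypothetical.

def docker_run_command(image, tag, job_args):
    """Build the argv for running one job in its own throwaway container."""
    # --rm removes the container once the job exits.
    return ["docker", "run", "--rm", f"{image}:{tag}", *job_args]


legacy_job = docker_run_command("example/spark", "1.0", ["spark-submit", "legacy_report.py"])
new_job = docker_run_command("example/spark", "1.2", ["spark-submit", "new_model.py"])

# Each command carries its own tool version, so neither job touches the host's packages.
for command in (legacy_job, new_job):
    print(" ".join(command))
```

On a host with Docker installed, each list could be handed to `subprocess.run` as-is. The point is that the version lives in the image tag, not on the cluster node.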

This frees up your DevOps team to pick the best tool for each data processing job, or to set up entirely new systems, without first checking what is already installed on the cluster.

2) Reduce Reliance On MapReduce Experts With Pachyderm

For sysadmins with a large amount of data to analyze, the go-to method has been to run MapReduce jobs on Hadoop. This usually requires programmers who specialize in writing MapReduce jobs, or hiring a third party such as Cloudera.

As a result, Big Data initiatives need a lot of internal coordination, and they demand resources that are out of reach even for large enterprises that don’t have that kind of expertise on tap.

What if you could process Big Data without incurring the complexity of Hadoop and MapReduce? That’s Pachyderm.

Pachyderm is a tool that lets programmers implement an HTTP server inside a Docker container, then use Pachyderm to distribute the job. This has the potential to let sysadmins run large-scale MapReduce-style jobs quickly and easily to make product-level decisions, without knowing anything about MapReduce.
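To make the "HTTP server inside a container" idea concrete, here is a rough sketch of a word-count job exposed over HTTP using only the Python standard library. To be clear, this is not Pachyderm's API: the handler name, the port, and the JSON response shape are all assumptions made for illustration.

```python
# Sketch of an analysis step exposed as an HTTP endpoint, the kind of server
# that could be packaged in a Docker image. NOT Pachyderm's actual API:
# JobHandler, make_server, port 8080, and the JSON shape are assumptions.
import json
from collections import Counter
from http.server import BaseHTTPRequestHandler, HTTPServer


def count_words(text):
    """The analysis itself: count occurrences of each whitespace-separated word."""
    return dict(Counter(text.split()))


class JobHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the chunk of input data sent in the request body.
        length = int(self.headers.get("Content-Length", 0))
        text = self.rfile.read(length).decode("utf-8")
        body = json.dumps(count_words(text)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def make_server(port=8080):
    """Create the server; call serve_forever() on the result to start it."""
    # Bind to all interfaces so the port can be published from a container.
    return HTTPServer(("0.0.0.0", port), JobHandler)


print(count_words("to be or not to be"))
```

Calling `make_server().serve_forever()` as the container's entry point would start the service. The programmer only writes ordinary request-handling code; distributing the data across many such containers is left to the framework.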

Pachyderm has the ambition of replacing Hadoop entirely – whether it achieves that remains to be seen, but it certainly looks like it will be a significant player in the next generation of data processing.

3) Run Scheduled Analytics Using Containers With Chronos

You already know that containers are a great way of deploying services at scale, isolating services that run on the same host, and improving utilization.

Did you know that you can also use Docker for batch processing? The latest release of the Chronos job scheduler for Mesos allows you to launch Docker containers into a Mesos cluster. This provides developers and sysadmins with the ability to run scheduled analytics jobs using containers.
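As a rough illustration, a scheduled Docker job is described to Chronos as a JSON document. The sketch below builds one in Python; the field names follow the job format in the Chronos documentation as best I recall it, so treat them as assumptions and check them against your Chronos version. The job name, image, and command are hypothetical.

```python
# Sketch of a Chronos-style job definition for a Docker container.
# Field names are assumed from the Chronos docs; verify against your version.
# The job name, image name, and command are hypothetical.
import json

nightly_job = {
    "name": "nightly-analytics",
    # ISO 8601 repeating interval: repeat forever, from the start time, once a day.
    "schedule": "R/2015-01-01T02:00:00Z/P1D",
    "container": {
        "type": "DOCKER",
        "image": "example/analytics:latest",
    },
    "cpus": "0.5",
    "mem": "512",
    "command": "python /app/run_report.py",
}

payload = json.dumps(nightly_job, indent=2)
print(payload)
```

The resulting JSON would then be submitted to the Chronos REST API; the exact endpoint depends on the release you are running.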

Chronos 2.3.3 allows you to schedule Docker containers to run ETL, batch, and analytics applications without manual setup on your cluster nodes. One of the neat features of Chronos is that it also builds a dependency graph between scheduled jobs, so a dependent job only runs if the job it depends on succeeded.
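The dependency behaviour is easy to picture with a small model. The sketch below does not talk to Chronos at all; it only mimics the "run a job only if its parents succeeded" rule in plain Python, and the job names are hypothetical.

```python
# Sketch of dependency-ordered execution. This models the idea only; it does
# not call Chronos. The job names are hypothetical.
from graphlib import TopologicalSorter  # Python 3.9+


def run_pipeline(parents, tasks):
    """Run jobs in dependency order, skipping any job whose parent did not succeed.

    parents: job name -> set of job names it depends on
    tasks:   job name -> zero-argument callable returning True on success
    """
    status = {}
    for job in TopologicalSorter(parents).static_order():
        if all(status.get(parent) == "succeeded" for parent in parents.get(job, ())):
            status[job] = "succeeded" if tasks[job]() else "failed"
        else:
            status[job] = "skipped"
    return status


parents = {"extract": set(), "transform": {"extract"}, "report": {"transform"}}
tasks = {
    "extract": lambda: True,
    "transform": lambda: False,  # simulate a failing ETL step
    "report": lambda: True,
}
print(run_pipeline(parents, tasks))
```

Here `extract` succeeds, `transform` fails, and `report` is skipped because the job it depends on did not succeed.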

Chronos and Marathon combine nicely to provide orchestration for a container infrastructure: Marathon keeps long-running services up, while Chronos handles scheduled and batch jobs.

4) Provision A Big Data Dev Environment Using Ferry

Ferry allows you to create big data clusters on your local machine (and AWS). The beauty of Ferry is that it allows you to define a big data stack using YAML, and then share it with other developers using a Dockerfile.

Setting up a Hadoop cluster is as simple as:

backend:
   - storage:
        personality: "hadoop"
        instances: 2
        layers:
           - "hive"
connectors:
   - personality: "hadoop-client"

Get started by typing

ferry start hadoop

This will create a two-node Hadoop cluster and a single Linux client. This can be customized at runtime or defined using a Dockerfile.

Ferry is great for developers who want to get up and running with a big data environment using a test AWS box, developers who need a local big data dev environment, or users who want to share Big Data applications.

Running Ferry on AWS also has several advantages over something like Elastic MapReduce, such as not tying you to a single cluster of a single type (such as Hadoop).

5) Run Big Data As Microservices With Coho

When we talk to enterprise customers about Big Data processing, there are one or two recurring themes. For example, in healthcare, there are frequent workflows where new data triggers a new action.

Take transcoding for example. When a new image is pushed to the storage system, a transcoding workflow will take place, reading the data back to a client machine or VM, transcoding it, and then writing the results back to storage. This can mean that the data has to cross the network three times!

In Big Data environments, data might be pushed out to a separate HDFS-based analytics system, only to be pushed back to the enterprise system when the job has been run.

Coho has worked on a storage-integrated tool that lets developers and DevOps teams think of workflows as operations on data, and embed those workflows in the storage system itself.
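To see why moving the workflow into storage matters, take the transcoding example from earlier. The back-of-the-envelope sketch below compares how much data crosses the network in each case; the sizes are made-up illustration values, not measurements, and the function is hypothetical.

```python
# Back-of-the-envelope sketch; the sizes are made-up illustration values.

def network_mb(input_mb, output_mb, in_storage):
    """Megabytes crossing the network for one transcode."""
    ingest = input_mb  # the new image is pushed to storage either way
    if in_storage:
        return ingest  # the transcode runs next to the data
    read_back = input_mb    # storage -> client machine or VM
    write_back = output_mb  # client -> storage
    return ingest + read_back + write_back


image_mb = 100
transcoded_mb = 40
external = network_mb(image_mb, transcoded_mb, in_storage=False)
embedded = network_mb(image_mb, transcoded_mb, in_storage=True)
print(external, embedded)
```

With these numbers, the external workflow moves 240 MB over the network per image, while the storage-embedded one moves only the original 100 MB.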

The resulting extensions can then run efficiently and transparently at scale as the system grows. In theory, this allows presentation layers to be built on top of existing data, the system to be extended with audit and compliance functionality, and complex, environment-based access controls to be built.

It’s just the start for Docker and Big Data processing – as the recent funding of Pachyderm has demonstrated.

How are you using Docker to process Big Data?