Scaling Hadoop up to the Cloud
The advent of Hadoop for processing Big Data many years ago removed the need to scale hardware to specialized and impractical dimensions, in terms of both specification and cost. Hadoop distributes processing and storage across smaller units of hardware and allows new hardware to be added as required. These smaller units can be cheap, non-specialized commodity servers. This makes working with Big Data more attractive from the point of view of the hardware investment required over time.
The on-premises cost illusion. Even commodity hardware has issues of its own, ones that may not be visible at first. In an environment where the demand for Hadoop occurs at regular intervals, and usage is either more or less constant or on a rising trend, the popular decision is to keep adding new hardware as needed, and to replace old hardware as it reaches end-of-life.
But this may not be the usual pattern of Hadoop usage everywhere. There may be times when Hadoop usage needs to increase for short durations, and new hardware is added in response. But what happens when usage comes down for long periods of time, or when usage is relatively infrequent? Even if the hardware is relatively inexpensive, chances are that it still piles up. It needs physical rack space and cabling, and while it sits idle the capital and maintenance spent on it are wasted. On top of that, it has to be replaced when it becomes obsolete, even if it was only occasionally used.
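The effect of idle time on owned hardware can be made concrete with a little arithmetic. The following is a minimal sketch, not a costing model: the function name, its parameters and the figures in the loop are all hypothetical and chosen only for illustration. It spreads the lifetime cost of one node over the hours in which the node is actually used.

```python
def cost_per_used_node_hour(purchase_cost, annual_upkeep, lifetime_years, utilization):
    """Effective cost of one hour of real use for an owned node.

    All inputs are illustrative assumptions:
      purchase_cost   one-time price of the node
      annual_upkeep   yearly power, rack space, cabling and maintenance
      lifetime_years  years until the node is replaced as obsolete
      utilization     fraction of its lifetime the node does useful work (0..1]
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in the range (0, 1]")
    total_cost = purchase_cost + annual_upkeep * lifetime_years
    used_hours = lifetime_years * 365 * 24 * utilization
    return total_cost / used_hours


# Hypothetical node: the same hardware gets more expensive per useful
# hour as utilization drops, because the total cost does not shrink.
for utilization in (0.9, 0.5, 0.1):
    hourly = cost_per_used_node_hour(4000, 1000, 4, utilization)
    print(f"utilization {utilization:.0%}: {hourly:.2f} per used node-hour")
```

Because the total cost is fixed, the cost per useful hour is inversely proportional to utilization: a node that is busy 10% of the time costs ten times as much per useful hour as one that is busy all the time.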
Evaluating cloud solutions. This is where the cloud providers come in with the business case for their offerings. By moving Hadoop processing and storage to the cloud, the problem of accumulating hardware in-house seems to go away. If only it were that easy to decide, though. One of the first and most obvious questions that arises is whether to build a private cloud, go to a public cloud, or use a hybrid approach.
The answer lies in the pattern of usage. If the size of demand is predictable, a private cloud makes sense. But again, this holds only where utilization is fairly high and constant; otherwise the money spent on unused cloud storage and processing is wasted. In many cases demand will be neither completely predictable nor constant, and the public cloud may then make better sense, at least until one factors in network performance and the cost of data transfers between machines on the cloud, and between the cloud and on-premises platforms. If data has to move frequently between in-house storage and Hadoop in a public cloud, there may be risks of network delays, latency or service disruption, as well as hard-to-predict data transfer costs. The choice between private and public cloud therefore has to weigh the business criticality of the workload and the cost of latency (or an outage) against the cost of cloud utilization. Since it is Big Data that is involved, the time taken to upload the data may itself be significant, which is a factor that may not arise with "small" data.
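To get a feel for how significant the upload time can be, a rough estimate is enough. The sketch below assumes decimal units (1 TB = 10^12 bytes) and a hypothetical `efficiency` factor standing in for protocol overhead and link contention; the function name and the example figures are illustrative, not measurements.

```python
def upload_hours(data_terabytes, link_megabits_per_second, efficiency=0.8):
    """Rough time in hours to push a dataset over a network link.

    efficiency is an assumed fraction of the nominal link speed that is
    actually achieved; 0.8 is a placeholder, not a measured value.
    """
    if link_megabits_per_second <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link speed must be positive and efficiency in (0, 1]")
    data_bits = data_terabytes * 1e12 * 8
    effective_bits_per_second = link_megabits_per_second * 1e6 * efficiency
    return data_bits / effective_bits_per_second / 3600


# Hypothetical comparison of a "small" and a "big" dataset on the same link.
for terabytes in (0.1, 10, 100):
    hours = upload_hours(terabytes, link_megabits_per_second=1000)
    print(f"{terabytes} TB over a 1 Gbit/s link: about {hours:.1f} hours")
```

Even with a perfectly efficient 1 Gbit/s link (`efficiency=1.0`), 10 TB works out to roughly 22 hours, so repeated transfers of that size between in-house storage and the cloud quickly become a planning concern.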
Even the frequency of data movement between in-house storage and the cloud may not always be predictable. In certain scenarios, data that is processed once may not be needed again for the foreseeable future, in which case it can be backed up outside of the cloud. But if there is a good possibility that it will be recalled within a short period (i.e., less than the archive window), it makes sense to retain it on the cloud rather than move it back in-house. This, again, points to a larger utilization of cloud capacity, and the related cost.
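The retain-or-move-back trade-off can be sketched as a comparison of two expected costs. This is a deliberately simplified model under stated assumptions: every price, the recall probability and the function name are hypothetical, and real provider pricing has more components than the three used here.

```python
def retain_vs_move_costs(data_tb, months, storage_per_tb_month,
                         egress_per_tb, reupload_per_tb, recall_probability):
    """Return (retain_cost, move_cost) for one dataset.

    retain_cost: keep the data on the cloud for the whole window.
    move_cost:   pay once to move it in-house, plus the expected cost of
                 bringing it back if it is recalled within the window.
    All prices and the probability are illustrative assumptions.
    """
    if not 0 <= recall_probability <= 1:
        raise ValueError("recall_probability must be between 0 and 1")
    retain_cost = data_tb * months * storage_per_tb_month
    move_cost = data_tb * egress_per_tb + recall_probability * data_tb * reupload_per_tb
    return retain_cost, move_cost


# Hypothetical case: a likely recall within a short window favors retaining.
retain, move = retain_vs_move_costs(
    data_tb=10, months=1, storage_per_tb_month=20,
    egress_per_tb=90, reupload_per_tb=50, recall_probability=0.8,
)
print("retain on cloud" if retain <= move else "move back in-house")
```

With the same hypothetical prices, a long window and a low recall probability tip the comparison the other way, which matches the reasoning above: data unlikely to be needed again is better backed up outside the cloud.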
There's more. Cloud services have come a long way in just a few years, though, so once the decision to move to the cloud has been made, there are several advantages to it: the major providers cover a variety of use cases with special-purpose services that abstract away the problems of physical space and hardware variety, and that ease setup and maintenance. Last but not least, there's the issue of data security. Cloud providers do offer security for different parts of the cloud stack, and how it works with Hadoop implementations needs to be looked at closely so that all angles are covered.
Moving Hadoop to the cloud to take advantage of its benefits is viable and can make sense. However, as with moving any other enterprise system to the cloud, there are always considerations to be taken into account, and some of them are unique to Hadoop.