You should not try to use Hadoop as a “drop-in” replacement of your current (R)DBMS. That said it is still possible to utilize the power of cluster computing while circumventing its weaknesses when it comes to ad-hoc or real-time queries. We use Hadoop as an on-line system tightly integrated with our application and use it for both, long-running analytical queries and ad-hoc style queries.
In the mindset of a “traditional” database engineer one of the biggest concerns about Hadoop, or MapReduce in conjunction with a distributed file system in general, is the lack of indexes. Set aside that the debate “(R)DBMS vs MapReduce” is most of the time superfluous and sometimes almost leads to religious debates, the absence of a thing like an index is one the biggest hurdles you face when migrating data from a traditional DBMS.
Even though you will love the ability to view your data in any way you want without caring about its structure, at some point you feel that it is not right to always scan you 45TB of log files. (Even though it is soooo easy…).
Brute force is easy. Brute force is bad.
When we began migrating all those TBs of log-style data from our huge MySQL installations to Hadoop we did a lot of testing. We tested everything from Hadoop and MapReduce settings to different MapReduce abstractions like PIG, Cascading, Hive and others. There was this huge mass of data grinning at us and waited to be analysed in multiple ways, from “online” real-time access to “offline” decision making analysis. Due to our multiple views on the same data we came to this conclusion quite quickly: “Brute force is easy. Brute force is bad.” Yes, we can optimize our Hadoop installations and we can choose the really best query mechanism (actually we ended up writing our own), but it will not make things noticeably faster if you continue scanning all of our data all of the time.
Partitions are (sometimes) the better indexes
So, why are you using indexes (in the context of data retrieval)? I know why we did and do. It is all about primary key lookup and data clustering. Say you have the following table (MySQL):
CREATE TABLE ORDER (
id INT
NOT NULL,
product INT
NOT NULL,
customer INT
NOT NULL,
amount FLOAT
NOT NULL,
orderDate DATE
NOT NULL,
PRIMARY KEY(id),
INDEX idx_product_customer(product, customer),
INDEX idx_customer(customer)
)
Just a simple order log with an unique identifier (id) and a single associated product and customer. Since we want to view our data from different perspectives we added two additional indexes on product and customer. (In this example we need two indexes because MySQL can only use the leftmost prefix of an index.)
Dumping the whole table as a single CSV-file into your Hadoop cluster would mean that you always have to use what (R)DBMS call a “full table scan”. It would be pretty much the same like removing all indexes from your MySQL-table. Try to search for all products a customer ordered without the index idx_product_customer. (In fact Hadoop would perform this full table scan an order of magnitude faster.) But it would be ridiculous to remove all indexes from your table. But that is actually what you did when you exported the whole table into a flat-file!
What you should do, and what we did with great success, is to split up your flat-file CSV and arrange the data so that you can decide beforehand which part of the data needs to be accessed. So let’s split up the data and simulate all of the indexes (besides the primary key, more on that later on). A file-system-layout could look like this:
orders/
product_A/
customer_1.csv
customer_2.csv
product_B/
customer_1.csv
customer_3.csv
So when searching all orders customer_1 placed, we just use this file-pattern orders/*/customer_1.csv. Remember: HDFS and MapReduce’s inputs (like FileInputFormat) support globbing.
Now we actually simulated indexes by partitioning the data!
From here on you can go into more detail depending on your data structure. As an example you could add the date- and id-range to the file name like this:
orders/product_A/customer_1.2009-06-04.2009-06-05.1000.2000.csv
orders/product_A/customer_1.2009-06-06.2009-06-07.5000.7000.csv
This comes handy if you keep adding data to your cluster.
To make thinks even easier you could write your own InputFormat that encapsulates the building of the paths that match your query.
The small file problem
Since Hadoop has been designed to work on quite huge blocks of data it is not efficient when using a lot of small files. To prevent the creation of millions of very small files take a closer look at your data. Say your average customer places 50 orders. It would be a waste of resources to store multiple files for a single customer, each filling only a few KBs. A possible solution: group customers together.
orders/
product_A/
customer_1_to_1000.csv
customer_1001_to_2000.csv
You have to find the right balance between file-size and access pattern.
Some final words
To make it clear: Even though we have found a way to partition our data we have not gained the same flexibility as we have in any descent (R)DBMS (with enough of disk space, processing power and - most of all - RAM!). Querying for all orders made by a single customer may still take 0.01s in a (R)DBMS vs. 10s (or more) in Hadoop.
Never try to simply replace your (R)DBMS with Hadoop! Eventually you will end up writing a blog post saying that MapReduce and Hadoop are hopelessly worse than your favourite (R)DBMS. Hadoop is not a database!
Real-time lookups
You can still accomplish real-time lookup performance using Hadoop. One thing you could do is to take a look at HBase, a Google BigTable implementation.
Some times it is enough to use MapFiles which are simply a huge disk-based Hashtable.
In our application we implemented primary key and secondary keys directly on CSV files using MapFiles and distribute the lookup and local in-memory-caches over several machines. To speed things up even more we use a memcached cluster. (Eventually we will release all of this along with our MapReduce-abstraction as open-source once we feel it is mature and stable enough.)
One way or the other: Data redundancy will most likely become your best friend in these situations.
Regardless the techniques you are actually using you still have to think about your data in another way. You always have to when moving from a traditional (R)DBMS to any other kind of data storage and retrieval system!