After following the scale up path (i.e. buy bigger boxes) for a while now, we definitely need to scale out as things start to become “unhandy” and simply HUGE. (See small things are better)
We are running an OLTP application with about 200 million transaction per month. Currently we have some very large tables with more than 1,5 billion rows and the total database size is about 600 GB - all InnoDB.
What we want to implement is application based sharding, i.e. splitting up the big tables and distribute them among many smaller servers. Furthermore we need to implement some sort of archiving mechanism since the db size is growing very fast. Putting old data into compressed myisam tables seems to be a good solution here.
So, how to implement this?
These solutions come into mind immediately:
- Using MySQL Cluster (this is not really application wise sharding but I mention it for completeness)
- Using MySQL Proxy with transparent query rewriting and load balancing
- Implement it into a JDBC driver
- Implement it into the application data access layer (DAOs)
Each solution comes with a list of pros and cons.
MySQL Cluster
Using the cluster would bring scalability right away but as we do not have any experience with this and had a lot of issues with the other storage engines (first crashes with MyISAM, then crashes with InnoDB, both being unsolved for a while even though the MySQL Enterprise Support did a great job!) I feel quite reluctant switching to a new storage engine. The hardware requirements would be too high too, so I don’t think that this could be the best solution.
Application Based Sharding
For all the other 3 types of partitioning I would use the following mechanism and meta tables.
Say this is a table I would like to split (Please note that this is just an example with some added redundancy (productCategory) to ease partitioning by product category.):
CREATE TABLE orders
(
orderId BIGINT
(20) NOT NULL,
productCategory VARCHAR(50) NOT NULL,
productId BIGINT(20) NOT NULL,
date DATETIME NOT NULL,
PRIMARY KEY (orderId)
);
The column productCategory would be the partition key since there are only about 20 to 100 product categories used and the data is spread almost evenly. The real scenario would be different but this is a close and easy to grasp match.
The partition meta data table could look like this:
CREATE TABLE partition
(
sourceTable VARCHAR
(255) NOT NULL,
# would be "orders" in our example
partitionKey VARCHAR
(255) NOT NULL,
# would be the value of "productCategory"
dbHost VARCHAR
(100) NOT NULL,
dbPort VARCHAR
(100) NOT NULL,
dbName VARCHAR
(100) NOT NULL,
dbUser VARCHAR
(100) NOT NULL,
dbPass VARCHAR
(100) NOT NULL,
dbTable VARCHAR
(100) NOT NULL,
PRIMARY KEY (sourceTable, partitionKey)
);
This way I could assign a different database connection to each product category or a set of categories. Furthermore we can have multiple tables of the same “type” within one database because the table name is configured, too (dbTable). So we could have the tables order_books, orders_hardware, orders_cds on server A and orders_dvds, orders_shirts on server B.
Limitations
This type of partitioning forces all queries to include a partition key (i.e. a productCategory) and IN clauses on the partition key are not possible or at least very hard to implement. Furthermore a full table scan (i.e. selecting rows based on the orderId column) requires to scan all tables. (This could be optimized if you have a correlation between orderId and productCategory in some way which is the case in our scenario - more on that later.).
MySQL Proxy and JDBC Driver
Yesterday I played a bit with the MySQL Proxy and yes it is hell of an awesome thing! I did some sort of query rewriting and load balancing which was easy at first hand.
The approach for both, MySQL Proxy and JDBC driver, would be the same: Rewrite the queries coming from the application by analyzing them and by using hints put into the SQL by the application (where analyzing does not help).
Assume the following query is issued to look up all orders of a product category:
SELECT * FROM orders WHERE productCategory=‘Books’;
After analyzing the query it is clear that we have to lookup the database configuration for the category ‘Books’, so we could rewrite that query (and send it to a different database connection if needed) like this:
SELECT * FROM orders_books WHERE productCategory=‘Books’;
We could even strip off the WHERE clause if there are only books stored in the table “orders_books”.
That was easy. What about more complex queries like (Attention: Bad database design ahead! It is just an example ;)):
SELECT
orders.*,
books.*
FROM
orders,
books
WHERE
orders.productName = books.name
We know that the partition key has to be ‘Books’. But how can our query analyzer know that? It can’t. So there might be queries that are valid but cannot be analyzed. One way is to add a where condition AND orders.productCategory = 'Books', but that might not always be possible or wanted.
Another option would be to add hints like this:
SELECT
orders.*,
books.*
FROM
orders /* HINT: productCategory=’Books’ */,
books
WHERE
orders.productName = books.name
I think implementing it as a MySQL Proxy LUA script or a JDBC driver would not make a big difference on the approach. I would prefer MySQL Proxy a lot though.
The pros of this approach are very clear:
- It is a transparent solution. Only a few parts of the application need to be changed in best case.
- It could be easily released as open source since it is not bound to any type of application - a huge plus IMHO!
- It keeps the application code simple.
The cons:
- Implementation is very challenging. Analyzing the queries would be the toughest part.
- Harder to test against the application since it is a runtime dependency. Would have to come up with a good integration test scenario.
- Performance could be an issue since the queries need to be parsed.
- Application would need to be changed anyway since we cannot issue some sorts of queries anymore. This could be small or a huge impact depending on the application.
Application Data Access Layer (DAO)
This would be the most straightforward way of implementing partitioning. Since we have total control over our application we could adopt the mechanism in any way to meet our needs. The basic concept would be to use a PartitionFactory like this:
Within this solution we have to know which partition to use beforehand. Thus every DAO method has to know where to go. This could be done by adding the partition key to every method or rewriting methods to first find out which partition to use.
The path to this solution is clear and very straightforward, I think.
The pros:
- Easy to implement since we know what the application wants and does.
- Fast. No need to rewrite queries.
- Easier to grasp for (database) developers since they see the dependencies right in the code.
The cons:
- Proprietary solution unless embedded into other frameworks like Spring, Hibernate or iBatis.
- Might blowup the application code making it harder to understand.
- A lot of the application needs to be changed.
Summary
We are still in the process of evaluating every solution. Right now I would love to go for the MySQL Proxy solution be it just for its aesthetic image
We need to do a bit more prototyping to see if this would really be a good solution.
Any comment, discussion or hint is highly appreciated!
More to follow soon…