MySQL Connector/J randomly hanging at com.mysql.jdbc.util.ReadAheadInputStream.fill

February 2nd, 2010 java, mysql

In the past months we struggled with large SELECT queries just get stuck at:


java.net.SocketInputStream.socketRead0(Native Method)
java.net.SocketInputStream.read(SocketInputStream.java:129)
com.mysql.jdbc.util.ReadAheadInputStream.fill(ReadAheadInputStream.java:113)
com.mysql.jdbc.util.ReadAheadInputStream.readFromUnderlyingStreamIfNecessary(ReadAheadInputStream.java:160)
com.mysql.jdbc.util.ReadAheadInputStream.read(ReadAheadInputStream.java:188)
- locked com.mysql.jdbc.util.ReadAheadInputStream@cb9a81c
com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:2494)
com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2949)
com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2938)
com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3481)
com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:1959)
com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2109)
com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2642)
- locked java.lang.Object@70cbccca
com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2571)
com.mysql.jdbc.StatementImpl.execute(StatementImpl.java:782)
- locked java.lang.Object@70cbccca
com.mysql.jdbc.StatementImpl.execute(StatementImpl.java:625)
org.apache.commons.dbcp.DelegatingStatement.execute(DelegatingStatement.java:260)
org.apache.commons.dbcp.DelegatingStatement.execute(DelegatingStatement.java:260)

Whenever this happened we just restarted the Tomcat server and everything was fine again for some days or weeks. But today it struck us very hard so we finally took the time to hunt this down. It seems to be related to this bug report. Some comments suggested to use SQL_NO_CACHE with your queries.

A lot of people (including me) suggest to disable the MySQL query cache since it may cause severe problems. To disable the query cache at server startup, set the query_cache_size system variable to 0.

This is what we usually do, but one of our servers had query cache turned on. Disabling it solved this problem.

Improve performance on small hadoop clusters

November 30th, 2009 hadoop

Hadoop is designed to run on huge clusters containing several hundred machines. But some people just don’t need such a big cluster and are able to use the benefits of HDFS and MapReduce on a smaller scale.

We managed to improve performance of our 10-node-test-cluster by almost 100% by adjusting the heartbeat intervals. Namenode and jobtracker use heartbeats to communicate with their workers (datanodes and tasktrackers).
We concentrate on jobtracker heartbeats. To reliably manage huge cluster the minimum interval is 3 seconds. Every 10 nodes the interval is increased by a second. If you have lots of fast running map- or reduce-tasks this implies a noticeable overhead.

What we did was to patch Hadoop and lower the minimum heartbeat interval to as low as 500ms and the increment to 10ms per node. This way we got our MapReduce-jobs run almost twice as fast. If you want to try it, you could take a look at our github branch (view commit). Please note that the git-branch contains our adopted version of Hadoop, so use it only for testing purposes.

There is a fix (HADOOP-5784) in the upcoming version 0.21 which allows you to lower the heartbeat increment per node.

“Internet slow” on Ubuntu Karmic Koala (9.10)

November 8th, 2009 linux

“Internet slow” means actually “DNS slow”. After upgrading to Ubuntu 9.10 I experienced a strange and very annoying lag in DNS resolution. Running dig in a shell worked like a charm. But Firefox, Synaptic and everything else was hanging at DNS resolution.

To make a long story short (you probably read a lot of forum threads about this): Our Karmic Koala uses IPv6 for DNS queries and only if this fails it falls back to IPv4. A lot of home routers do not support IPv6 DNS queries. DOH!

Resolutions:

1. Firefox only: Disable IPv6 support by typing “about:config” into your location bar, then search for ipv6 and disable it by clicking on the line.

2. Disable IPv6 entirely: If you do not need IPv6-Support (I don’t) you could disable it completely and everything is up to speed again. How do I do this?

collectd + drraw.cgi - zoom into your graphs like you used to with cacti

September 16th, 2009 collectd, linux

I fell in love with collectd and drraw.cgi (a front-end to collectd). This combination is great: Fast, simple and yet sufficient.

But there was one thing I missed in drraw that I loved in cacti: Zooming. (This is how it looks like in cacti)

So I went on and hacked it into drraw.cgi using jQuery. This is how it looks like:

drraw_zoom.png

Download the patch (6kb).

A note: For simplicity’s sake I just included the jQuery lib hosted at Google APIs. If this is a problem for you, just download a copy, put it on your webserver and adjust the line in the patched drraw.cgi.

Have fun with it! Comments and improvements are always welcome!

Linux: Executables on a Samba/CIFS Share

August 30th, 2009 linux

Just a quick note: Don’t mount a cifs share with flag directio if you want to execute binaries that reside on that share.

Otherwise you will get the following error:

<command>: cannot execute binary file

Took me 2 hours to find out.

Simulating indexes in Hadoop

June 6th, 2009 hadoop, mysql

You should not try to use Hadoop as a “drop-in” replacement of your current (R)DBMS. That said it is still possible to utilize the power of cluster computing while circumventing its weaknesses when it comes to ad-hoc or real-time queries. We use Hadoop as an on-line system tightly integrated with our application and use it for both, long-running analytical queries and ad-hoc style queries.

In the mindset of a “traditional” database engineer one of the biggest concerns about Hadoop, or MapReduce in conjunction with a distributed file system in general, is the lack of indexes. Set aside that the debate “(R)DBMS vs MapReduce” is most of the time superfluous and sometimes almost leads to religious debates, the absence of a thing like an index is one the biggest hurdles you face when migrating data from a traditional DBMS.
Even though you will love the ability to view your data in any way you want without caring about its structure, at some point you feel that it is not right to always scan you 45TB of log files. (Even though it is soooo easy…).

Brute force is easy. Brute force is bad.

When we began migrating all those TBs of log-style data from our huge MySQL installations to Hadoop we did a lot of testing. We tested everything from Hadoop and MapReduce settings to different MapReduce abstractions like PIG, Cascading, Hive and others. There was this huge mass of data grinning at us and waited to be analysed in multiple ways, from “online” real-time access to “offline” decision making analysis. Due to our multiple views on the same data we came to this conclusion quite quickly: “Brute force is easy. Brute force is bad.” Yes, we can optimize our Hadoop installations and we can choose the really best query mechanism (actually we ended up writing our own), but it will not make things noticeably faster if you continue scanning all of our data all of the time.

Partitions are (sometimes) the better indexes

So, why are you using indexes (in the context of data retrieval)? I know why we did and do. It is all about primary key lookup and data clustering. Say you have the following table (MySQL):

CREATE TABLE ORDER (
    id INT NOT NULL,
    product INT NOT NULL,
    customer INT NOT NULL,
    amount FLOAT NOT NULL,
    orderDate DATE NOT NULL,

    PRIMARY KEY(id),
    INDEX idx_product_customer(product, customer),
    INDEX idx_customer(customer)
)

Just a simple order log with an unique identifier (id) and a single associated product and customer. Since we want to view our data from different perspectives we added two additional indexes on product and customer. (In this example we need two indexes because MySQL can only use the leftmost prefix of an index.)
Dumping the whole table as a single CSV-file into your Hadoop cluster would mean that you always have to use what (R)DBMS call a “full table scan”. It would be pretty much the same like removing all indexes from your MySQL-table. Try to search for all products a customer ordered without the index idx_product_customer. (In fact Hadoop would perform this full table scan an order of magnitude faster.) But it would be ridiculous to remove all indexes from your table. But that is actually what you did when you exported the whole table into a flat-file!
What you should do, and what we did with great success, is to split up your flat-file CSV and arrange the data so that you can decide beforehand which part of the data needs to be accessed. So let’s split up the data and simulate all of the indexes (besides the primary key, more on that later on). A file-system-layout could look like this:

orders/
    product_A/
        customer_1.csv
        customer_2.csv
    product_B/
        customer_1.csv
        customer_3.csv

So when searching all orders customer_1 placed, we just use this file-pattern orders/*/customer_1.csv. Remember: HDFS and MapReduce’s inputs (like FileInputFormat) support globbing.

Now we actually simulated indexes by partitioning the data!

From here on you can go into more detail depending on your data structure. As an example you could add the date- and id-range to the file name like this:

orders/product_A/customer_1.2009-06-04.2009-06-05.1000.2000.csv
orders/product_A/customer_1.2009-06-06.2009-06-07.5000.7000.csv

This comes handy if you keep adding data to your cluster.
To make thinks even easier you could write your own InputFormat that encapsulates the building of the paths that match your query.

The small file problem

Since Hadoop has been designed to work on quite huge blocks of data it is not efficient when using a lot of small files. To prevent the creation of millions of very small files take a closer look at your data. Say your average customer places 50 orders. It would be a waste of resources to store multiple files for a single customer, each filling only a few KBs. A possible solution: group customers together.

orders/
    product_A/
        customer_1_to_1000.csv
        customer_1001_to_2000.csv

You have to find the right balance between file-size and access pattern.

Some final words

To make it clear: Even though we have found a way to partition our data we have not gained the same flexibility as we have in any descent (R)DBMS (with enough of disk space, processing power and - most of all - RAM!). Querying for all orders made by a single customer may still take 0.01s in a (R)DBMS vs. 10s (or more) in Hadoop.

Never try to simply replace your (R)DBMS with Hadoop! Eventually you will end up writing a blog post saying that MapReduce and Hadoop are hopelessly worse than your favourite (R)DBMS. Hadoop is not a database!

Real-time lookups

You can still accomplish real-time lookup performance using Hadoop. One thing you could do is to take a look at HBase, a Google BigTable implementation.
Some times it is enough to use MapFiles which are simply a huge disk-based Hashtable.
In our application we implemented primary key and secondary keys directly on CSV files using MapFiles and distribute the lookup and local in-memory-caches over several machines. To speed things up even more we use a memcached cluster. (Eventually we will release all of this along with our MapReduce-abstraction as open-source once we feel it is mature and stable enough.)
One way or the other: Data redundancy will most likely become your best friend in these situations.

Regardless the techniques you are actually using you still have to think about your data in another way. You always have to when moving from a traditional (R)DBMS to any other kind of data storage and retrieval system!

Increasing Performance of Hadoop-Unit-Tests

March 30th, 2009 hadoop

Adding a lot of unit tests for our application that uses Hadoop and its Map-Reduce-Engine significantly increased integration build time. Hadoop comes with a LocalJobRunner which is used by default so you do not have to set up a complete cluster in order to run some Unit-Tests. This is great! But the problem is: it still produces a lot of overhead. Ramp-up and tear-down of a job might still take up to a few seconds. Having some hundreds of Map-Reduce-Jobs in your unit-test-base will definitely drag you away from the ideal “10 minute integration build feedback” you are always striving to get. ;)

I cannot provide a complete solution to this “problem” (hey, it is still great to be able to run Map-Reduce-Jobs locally!), but the following configuration parameters cut the execution times of our tests in halves:

<property>
    <name>io.sort.record.percent</name>
    <value>0.01</value>
</property>
<property>
    <name>io.sort.mb</name>
    <value>1</value>
</property>
<property>
    <name>min.num.spills.for.combine</name>
    <value>0</value>
</property>

These settings are only feasible for small jobs with little input, of course.

I’m always glad to hear of better solutions to decrease the overhead even more!

Hadoop and Linux kernel 2.6.27 - epoll limits

January 22nd, 2009 hadoop

Yesterday we faced a strange problem. A newly set up Hadoop cluster got unstable after a few minutes. Logs reported a lot of exceptions like:

java.io.IOException: Too many open files
at sun.nio.ch.EPollArrayWrapper.epollCreate(Native Method)
at sun.nio.ch.EPollArrayWrapper.(EPollArrayWrapper.java:68)
at sun.nio.ch.EPollSelectorImpl.(EPollSelectorImpl.java:52)
at sun.nio.ch.EPollSelectorProvider.openSelector(EPollSelectorProvider.java:18)
at sun.nio.ch.Util.getTemporarySelector(Util.java:123)
at sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:92)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:281)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:102)
at java.lang.Thread.run(Thread.java:619)

or

DataXceiver
java.io.EOFException
at java.io.DataInputStream.readShort(DataInputStream.java:298)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:78)
at java.lang.Thread.run(Thread.java:619)

and others. We double-checked ulimit -n and it reported 32768 on all datanodes as expected. lsof -u hadoop | wc -l was as low as 2000, so “Too many open files”-exceptions seemed strange.

A day and several installation routines later we figured out that the available epoll resources were not sufficient any more. Java JDK 1.6 uses epoll to implement non-blocking-IO. With kernel 2.6.27 resource limits have been introduced and the default on openSuSE is 128 - way too low.

Increasing the limit with echo 1024 > /proc/sys/fs/epoll/max_user_instances fixed the cluster immediately. To make this setting boot safe add the following line to /etc/sysctl.conf:

fs.epoll.max_user_instances = 1024

Current project status - HSCALE

December 22nd, 2008 hscale, mysql, mysql-proxy

A lot of people keep asking me about the status of HSCALE so I thought it is best to just write about it.
Since I am waiting for the next GPL release of MySQL Proxy I concentrate on other things, both project-related and not project-related.

Continuous integration and test strategy enhancements

First of all there is a CI server up for almost 3 months now. Check out teamcity.hscale.org (Login as guest user). I also introduced a lint-process to discover bugs in my LUA code which arise mostly by mistyping variable names. By the way: You should definitely have a lint-like process in place if you are doing LUA programming or other languages of that type. It helps a lot.

What’s in svn trunk?

Multiple backends

The current status is this: Spreading across multiple backends is fully implemented and there are a lot of tests for that. Check out the latest code at http://svn.hscale.org/trunk. The tests running against multiple backends are “cheating” a little bit: Since the proxy still does not allow for ad-hoc allocation of backend connections I just create a lot of “dummy” connections to the proxy and use the approach of connection pooling the way the proxy does it. See an example. Note: This will not work in production!

Dictionary partition lookup and auto-partitioning

Besides the modulus partition lookup, which is only used in tests, there is now the dictionary partition lookup which allows for explicit partition definition. Detailed documentation can be found in the project wiki. It works as described here.
Another feature implemented is auto-partitioning. Depending on the partitioning function used it is possible to create new partitions on the fly. This way you can automatically spread the load across your MySQL servers. Another benefit is that you have a fine grained partition set-up right from the start which makes re-partitioning a lot easier afterwards.
Please take a look at the code and especially the tests to see how it works.

Parallel setup of HSCALE servers

In order to eliminate the SPOF (single point of failure) and possible bottleneck that directing all traffic through a single proxy would imply, HSCALE is designed to run in parallel. This means you can set up multiple proxies running HSCALE in parallel. This way you can easily implement fail-over scenarios using heartbeat or your favourite high availability solution and spread the load.
Why should running HSCALE in parallel be a problem in the first place? The easy answer: Because partition information is cached within each instance. As soon as the partition information changes your HSCALE instances would run out of sync using different partition information which would be disastrous to your data integrity. To avoid this HSCALE (more precise the dictionary partition lookup) works in two different modes: “NORMAL” and “FORCE”. If the mode is set to “NORMAL” then partition information is cached internally and only refreshed in a configurable time interval (configuration parameter “reloadInterval”). Whenever a change is made to the partition set up the mode is changed to “FORCE” which forces all HSCALE instances to re-fetch the partition information prior executing any query. After a configurable amount of time the system switches back to “NORMAL”. This is only the big picture. The implementation itself is a bit more complicated.
This approach is simple and robust because no other components are involved (like message queues) and it is guaranteed by design that no HSCALE instance is able to run with a wrong partition mapping. Changes made to the partition information involve a little overhead since all HSCALE instances will reload the partition mapping quite often. But taken into account that the partition information does not change that often (100 times a day would be huge!) it is an affordable price to pay for data integrity.

What’s next

Currently HSCALE is almost feature complete. The thing that’s missing - and making HSCALE production ready - is a MySQL Proxy version with different backend handling (Jan, please do not hate me!). Currently we (our company) do not have the resources digging into MySQL Proxy ourself and HSCALE is currently(!) not top-priority. So we will just wait and see.

Part of the problems we intended to solve with HSCALE are now moved to a Hadoop cluster since we have huge masses of log-style, read-only data which has to be analysed in multiple dimensions. HSCALE will gain focus right after that or as soon as a “suitable” GPL version of MySQL Proxy comes out.

If you are in need for a production ready, proxy-based sharding solution, please take a look at Spock Proxy. They use a different approach - they actually forked MySQL Proxy and implemented everything into it, thus no LUA is used - but the idea behind it is basically the same. They also offer some features HSCALE will not offer in the near future like handling of auto_increment columns across partitions. Some features are not there mostly because of the different design approach like arbitrary partitioning functions (they only offer range-based partitioning which is ok for many scenarios) or query hinting.

Speaking of Hadoop cluster - an idea that is ringing in my head for a while is to implement a MySQL Proxy LUA script that enables running (basic) queries against a cluster. It would be a little fun project. ;)

SHOW STATUS considered harmful

September 11th, 2008 mysql

First of all, I know this is a known problem, but it struck me so hard, I just had to write about it!

As Peter Zaitev points out calling SHOW STATUS might have a huge performance impact.

We recently replaced one of our servers with a DELL R900 with 96GB RAM. Having a disk-bound workload and > 1TB worth of data in InnoDB we expected a noticeable performance gain compared to the former server with 32GB. The new server even has better RAID and HDD.

But that was not the case. Things got even worse! A lot of queries “hang”, server load peaked at almost 7 and we saw a lot of cpu activity. Just before I started a deep analysis of what is going on inside I spotted a SHOW GLOBAL STATUS which ran every second. DOH!

Where did it came from? An administrator was running MySQL Administrator and used the Health chart to monitor some variables. So it periodically sent SHOW GLOBAL STATUS to the server. That resulted in a lot of queries waiting for the buffer pool (look at Peter’s post and the comments to understand why). And things get worse with bigger InnoDB buffer pool (this particular mysql instance uses 70GB!).

MySQL Administrator (the tool, not the person ;) ) shut down - everything just looks great now!