Taming the elephant

Learning Hadoop

Hadoop distributed cache error


Some Hadoop job errors are really hard to debug. I spent most of today trying to figure out why a job was failing unpredictably: tasks were failing during setup only on some occasions, but frequently enough to matter.

You may see something like this in your logs:

13/01/30 19:10:13 ERROR fs.Inode: 2055.68894.1308820 <some file> Returning bad page to cache page: (2055.68894.1308820 0, id: 62, state Invalid)

We narrowed the problem down from the logs to the non-availability of a distributed cache file. The exception came from the Hadoop distributor's proprietary code, so it was hard to pin down the exact reason, but the word “cache” was the key, and the failures were sporadic. Debugging this with a colleague and looking at the failure patterns, we concluded that the failures happened when two jobs ran close to each other. So there must be competing writes/deletes making the cache inconsistent (this issue may only occur with this specific distributor).

To prove this hypothesis I wrote two scripts: one deleting and recreating a file on one node, and the other reading the same file on another node. And guess what: the reader hit the same exception.

# Writer (node 1): repeatedly delete and re-upload the file
for x in `seq 1 100`; do hadoop fs -rm tmp/hive-exec-0.9.0.jar; hadoop fs -copyFromLocal hive-exec-0.9.0.jar tmp/ ; done

# Reader (node 2): repeatedly read the same file
for x in `seq 1 100`; do hadoop fs -cat tmp/hive-exec-0.9.0.jar > /dev/null ; done

To be fair to the Hadoop distributor:

  • You won’t be deleting and recreating files in the Hadoop filesystem for fun.
  • You wouldn’t normally modify distributed cache files.
  • And yes, this kind of inconsistency is expected in a distributed file system (in fact, kudos for detecting it and failing the task). But it should have resolved itself on the next task attempts.

But we do have a use case where we actually recreate the cache files on every run. Making the cache file names unique per run should solve this. And hey, people really are abusing Hadoop by using it for things it was not designed to do.
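Here is a minimal sketch of the unique-name idea. The `RUN_ID` scheme and the file names are assumptions for illustration, not what we actually shipped:

```shell
# Hypothetical sketch: give each run its own cache file name so
# concurrent jobs never delete or overwrite each other's copy.
RUN_ID="$(date +%s)-$$"                     # per-run unique suffix (assumed scheme: epoch + PID)
CACHE_JAR="hive-exec-0.9.0-${RUN_ID}.jar"   # unique name for this run

# Upload under the unique name instead of the shared one, e.g.:
# hadoop fs -copyFromLocal hive-exec-0.9.0.jar tmp/${CACHE_JAR}
echo "would upload cache file as tmp/${CACHE_JAR}"
```

Each run then registers its own copy in the distributed cache, so a concurrent job's delete/recreate cycle can never invalidate the pages another job is reading. The old per-run files still need periodic cleanup, which is the price of this approach.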


Written by rawatra

January 31, 2013 at 12:38 am

Posted in bigdata, hadoop
