Taming the elephant

Learning Had00p

SQL for Hadoop


A listing of all the SQL solutions built on top of Hadoop (or, in some cases, just attaching themselves to the Hadoop name).

SQL is what’s next for Hadoop: Here’s who’s doing it — Tech News and Analysis.

The key to SQL on Hadoop is integrating with HDFS to take advantage of data locality, together with a good query optimizer. As of now, systems based on HBase or on in-memory caches seem to be the best bet.

I have recently worked on a Teradata offload to Hadoop, and the current attempts at “SQL on Hadoop” hint at the model Teradata uses. The only *big* difference is the cost, though. SQL on Hadoop solutions could mature and replace such costly systems.

Challenge:

  • Keep the storage format open, or else you are just creating yet another MPP database (a rough sketch of what an open format looks like is below).
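To make that point concrete, here is a minimal sketch, assuming a Hive setup: an external table defined over plain delimited files in HDFS, so the same files stay readable by MapReduce, Pig, or anything else. The table name, columns, and path are made up for illustration.

hive -e "
  CREATE EXTERNAL TABLE page_views (user_id STRING, url STRING, ts BIGINT)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  LOCATION '/data/page_views';
  SELECT url, COUNT(*) AS hits FROM page_views GROUP BY url;
"

Nothing about the files themselves is tied to the SQL layer; drop the table and the data is still sitting in HDFS in a format every other tool understands.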

Written by rawatra

February 23, 2013 at 5:33 pm

Posted in bigdata, hadoop


Hadoop distributed cache error


Some Hadoop job errors are really hard to debug. I spent most of the day today trying to figure out why a job was failing unpredictably. Tasks were failing during setup on some occasions (but frequently enough to matter).

You may see something like this in your logs.

13/01/30 19:10:13 ERROR fs.Inode: 2055.68894.1308820 <some file> Returning bad page to cache page: (2055.68894.1308820 0, id: 62, state Invalid)

We narrowed the problem down from the logs to the unavailability of a distributed cache file. The exception came from the Hadoop distributor’s proprietary code, so it was hard to figure out the exact reason. But the word “cache” was the key, and the failures were sporadic. I debugged this with a colleague (looking at the failure patterns) and came to the conclusion that the failures happened when two jobs ran close to each other. So there must have been competing writes/deletes making the cache inconsistent (this issue may only occur with this specific distributor).

To prove this hypothesis I wrote two scripts: one deleting and re-creating a file on one node, and the other reading the same file on another node. And guess what, the reader hit the same exception.

# writer (run on node 1): delete and re-create the file in a loop
for x in `seq 1 100`; do hadoop fs -rm tmp/hive-exec-0.9.0.jar; hadoop fs -copyFromLocal hive-exec-0.9.0.jar tmp/ ; done

# reader (run on node 2): read the same file in a loop
for x in `seq 1 100`; do hadoop fs -cat tmp/hive-exec-0.9.0.jar > /dev/null ; done

To be fair to the Hadoop distributor,

  • You won’t be deleting and creating files in the Hadoop filesystem for fun.
  • You wouldn’t normally modify the distributed cache files.
  • And yes, this is expected behaviour in a distributed file system [in fact, kudos for detecting the inconsistency and failing the task]. But it should have resolved itself on the next task attempts.

But we do have a use case where we actually recreate the cache files on every run. Making the cache file names unique per run should solve this. And hey, people really are abusing Hadoop by using it for things it was never meant to do.
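For what it’s worth, a rough sketch of that workaround, with made-up paths, jar name, and driver class (and assuming the driver goes through ToolRunner so the generic -files option is picked up):

# give each run its own copy of the cache file so concurrent jobs never collide
RUN_ID=$(date +%s)_$$
hadoop fs -copyFromLocal hive-exec-0.9.0.jar tmp/hive-exec-0.9.0_${RUN_ID}.jar
# point the job at the per-run file instead of the shared one
hadoop jar myjob.jar MyDriver -files hdfs:///user/me/tmp/hive-exec-0.9.0_${RUN_ID}.jar input output
# clean up once the run is finished
hadoop fs -rm tmp/hive-exec-0.9.0_${RUN_ID}.jar

The same idea works with the DistributedCache API directly; the point is just that no two concurrent runs ever share a cache file name.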

Written by rawatra

January 31, 2013 at 12:38 am

Posted in bigdata, hadoop

Hadoop version mystery


Finally, some insight into the various releases of Hadoop. The versioning had become so complicated that developers had little choice but to use “Hadoop distributions” from vendors like Cloudera. Hopefully this will improve in 2012.

http://www.cloudera.com/blog/2012/01/an-update-on-apache-hadoop-1-0/

I had been trying to piece this information together from JIRA, SVN logs, release notes, etc. Some of it is available on the Hadoop project’s news page, though.

But 0.23 is still an alpha release. Loads of features have been added on trunk, and I am not sure anyone is using a trunk-based release yet.

Written by rawatra

January 22, 2012 at 5:43 pm

Posted in Uncategorized