The Hadoop FAQ for Oracle DBAs

Oracle DBAs, get answers to many of your most common questions about getting started with Hadoop.
As a former Oracle DBA, I get a lot of questions (most welcome!) from current DBAs in the Oracle ecosystem who are interested in Apache Hadoop. Here are a few of the more frequently asked questions, along with my most common replies.
How much does the IT industry value Oracle DBA professionals who have switched to Hadoop administration, or added it to their skill set?
Right now, a lot. There are not many experienced Hadoop professionals around (yet)!
In many of my customer engagements, I work with the DBA team there to migrate parts of their data warehouse from Teradata or Netezza to Hadoop. They don’t realize it at the time, but while working with me to write Apache Sqoop export jobs, Apache Oozie workflows, Apache Hive ETL actions, and Cloudera Impala reports, they are learning Hadoop. A few months later, I’m gone, but a new team of Hadoop experts who used to be DBAs is left in place.
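To make that concrete, here is a sketch of the kind of Sqoop job those teams end up writing. Every hostname, credential, and table name below is a placeholder, and you'd tune the details for your own schema:

    # Import an Oracle table into Hive, prompting for the password (-P)
    sqoop import \
      --connect jdbc:oracle:thin:@//dbhost.example.com:1521/ORCL \
      --username SCOTT \
      -P \
      --table SALES \
      --hive-import \
      --num-mappers 4

Notice how naturally DBA instincts apply here: the first question a good DBA asks is what those four parallel mappers will do to the load on the source database.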
My solutions architect team at Cloudera also hires ex-DBAs as solutions consultants or system engineers. We view DBA experience as invaluable for those roles.
What do you look for when hiring people with no Hadoop experience?
I strongly believe that DBAs have the skills to become excellent Hadoop experts, but not just any DBAs. Here are some of the characteristics I look for:

  • Comfort with the command line. Point-and-click DBAs and ETL developers need not apply.
  • Experience with Linux. Hadoop runs on Linux, so that’s where much of the troubleshooting will happen. You need to be very comfortable with the Linux OS, filesystem, tools, and command line. You should understand OS concepts around memory management, CPU scheduling, and I/O.
  • Knowledge of networks. The OSI layers, what ssh is really doing, name resolution, and a basic understanding of switching.
  • Good SQL skills. You know SQL and you are creative in your use of it. Experience with data warehouse basics such as partitioning and parallelism is a huge plus. ETL experience is a plus. Tuning skills are a plus.
  • Programming skills. Not necessarily Java (see below). But can you write a bash script? Perl? Python? Can you solve a few simple problems in pseudo-code? If you can’t code at all, that’s a problem. (For a taste of what I mean, see the sketch after this list.)
  • Troubleshooting skills. This is huge, as Hadoop is far less mature than Oracle. You’ll need to Google error messages like a pro, but also be creative and knowledgeable about where to look when Google isn’t helpful.
  • For senior positions, we look for systems and architecture skills too. Prepare to explain how you’ll design a flight-scheduling system or something similar.
  • And since our team is customer facing, communication skills are a must. Do you listen? Can you explain a complex technical point? How do you react when I challenge your opinion?
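To give a taste of the scripting and troubleshooting bar I have in mind, here is the sort of throwaway one-liner I’d expect a candidate to produce without much head-scratching (the log path is just an example):

    # Summarize the most common error lines across the Hadoop daemon logs
    grep -hi 'ERROR' /var/log/hadoop/*.log | sort | uniq -c | sort -rn | head

Nothing fancy, but if this looks intimidating, a Hadoop operations role will be a struggle.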
Is that maybe too much to ask? Possibly. But I can’t think of anything I could remove and still expect success with our team.
How do I start learning Hadoop?
The first task we give new employees is to set up a five-node cluster in the AWS cloud. That’s a good place to start. Neither Cloudera Manager nor Apache Whirr is allowed; they make things too easy.
The next step is to load data into your cluster and analyze it.
The tutorials here show how to load Twitter data using Apache Flume and analyze it using Hive.
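To give a flavor of where those tutorials end up, here is a minimal, made-up example of the analysis step: once files land in HDFS, you point an external Hive table at them and query with plain SQL (the path and columns are illustrative):

    -- Map a directory of tab-delimited files in HDFS to a Hive table
    CREATE EXTERNAL TABLE page_views (
      view_time TIMESTAMP,
      user_id   BIGINT,
      url       STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/user/hive/warehouse/page_views';

    -- A familiar-looking aggregate, executed as a distributed job
    SELECT url, COUNT(*) AS views
    FROM page_views
    GROUP BY url
    ORDER BY views DESC
    LIMIT 10;

If you can read that without blinking, you already have a head start on Hive.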
Also, Cloudera’s QuickStart VM (download here) includes TPC-H data and queries. You can run your own TPC-H benchmarks in the VM.
There are also some good books to help you get started. My favorite is Eric Sammer’s Hadoop Operations – it’s concise and practical, and I think DBAs will find it very useful. The chapter on troubleshooting is very entertaining. Other books that DBAs will find useful are Hadoop: The Definitive Guide, Programming Hive, and Apache Sqoop Cookbook (all of which are authored or co-authored by Clouderans).

Do I need to know Java?
Yes and no :)
You don’t need to be a master Java programmer. I’m not, and many of my colleagues are not. Some never write Java code at all.
You do need to be comfortable reading Java stack traces and error messages. You’ll see many of those. You’ll also need to understand basic concepts like jars and classpath.
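For example, you should be able to predict what these two commands do, and recognize that a ClassNotFoundException usually means the classpath didn’t include the jar you thought it did (the jar and class names here are placeholders):

    # Print the classpath the hadoop command will use
    hadoop classpath

    # Run the main class of a job packaged in a jar
    hadoop jar my-job.jar com.example.MyJob /input /output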
Being able to read Java source code is useful. Hadoop is open source, and digging into the code often helps you understand why something works the way it does.
Even though mastery isn’t required, the ability to write Java is often useful. For example, Hive UDFs are typically written in Java (and it’s easier to do than you think).
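As an illustration, here is roughly what a minimal Hive UDF looks like. The package and class names are invented, but the pattern (extend UDF, implement evaluate) is the standard one:

    package com.example.hive; // hypothetical package

    import org.apache.hadoop.hive.ql.exec.UDF;
    import org.apache.hadoop.io.Text;

    // A null-safe UDF that lower-cases a string column.
    public final class LowerUdf extends UDF {
      public Text evaluate(final Text input) {
        if (input == null) {
          return null;
        }
        return new Text(input.toString().toLowerCase());
      }
    }

Once compiled into a jar, it plugs into a Hive session with ADD JAR and CREATE TEMPORARY FUNCTION, and from then on it behaves like any built-in function in your queries.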

Conclusion

If you’re an Oracle DBA interested in learning Hadoop (or working for Cloudera), this post should get you started.
