Monday, August 22, 2011

Hadoop, HBase, HipHop and other H-software

Introduction
I'm working on a large-scale data project and am in the research phase right now.  I only have a few high-level requirements:
  1. Scalability to multiple petabytes
  2. Not tied to a single programming language
  3. All open source software  
Part of the research is developing a front end to display the data, but this is a small portion.  I wanted to document what I was doing in the hope that it may help others.  I have been doing a lot of research, playing around with configurations and hitting my share of frustrating moments.  I have been helped by so many different postings and blogs that I felt I needed to give something back as well.

Here is my initial setup that I have begun to configure and work with:
  1. Fedora Linux
  2. Drupal for the frontend
  3. MySQL (for Drupal and some other metadata)
  4. Apache webserver
  5. Apache Solr
  6. Hadoop 0.20
  7. HBase NoSQL database (for primary storage)
  8. Apache Nutch webcrawler
  9. HipHop PHP (for web services that will really benefit from native code)

Architecture
I'm using 4 existing machines and can set up virtual machines as needed, but will try to keep the number of servers as small as possible to ease implementation and testing.  Drupal and MySQL were installed on a single machine.  Hadoop is installed and working on the 4 machines and 3 VMs, so I have 7 nodes for parallel tasking.  The 4 servers are configured as the data nodes.

Sub-projects
I have a long list of things to do, along with things I have already done, in the bullets below.  I plan on writing a post about each of these as I get time; if I get specific requests about any of them, those may get preference:

Completed
  • Integrate Drupal to Solr Search
  • Integrate Apache Nutch to Solr for combined results in Drupal
  • Build API interface with HipHop PHP for Drupal to HBase and HDFS integration
  • Write test Drupal module for block, node page altering
  • Google maps integration with Drupal
  • Mobile phone detection and theme switching with Drupal
  • Learn PHP (had to for working with Drupal)
  • Many more ...

To Do Items
  • Build API interface with HipHop PHP for Drupal to HBase and HDFS integration
  • Deploy HipHop PHP to a chroot environment
  • Federated search
  • Apache Solr clustering for redundancy and performance (if possible)
  • HBase cluster (ZooKeeper quorum, Master, RegionServers)
  • XML to XSLT for XML data revisions in HBase
  • Apache/Drupal cluster with support for SSL sessions
  • Map / Reduce tuning and programming testing
  • Nagios/Ganglia for monitoring Hadoop
  • Geocode searching Drupal/Solr
That's enough for the first post.  My first task today is getting HipHop running in a chroot environment and getting the config file the way I want it.