I'm working on a large-scale data project and am currently in the research phase. I have only a few high-level requirements:
- Scalability to multi-petabyte
- Not tied to a single programming language
- All open source software
Here is my initial setup that I have begun to configure and work with:
- Fedora Linux
- Drupal for the frontend
- MySQL (for Drupal and some other metadata)
- Apache web server
- Apache Solr
- Hadoop 0.20
- HBase NoSQL database (for primary storage)
- Apache Nutch web crawler
- HipHop PHP (for web services that will really benefit from native code)
I'm using 4 existing machines and can set up virtual machines as needed, but I will try to keep the number of servers as small as possible to ease implementation and testing. Drupal and MySQL are installed on a single machine. Hadoop is installed and working on the 4 machines and 3 VMs, so I have 7 nodes for parallel tasks. The 4 physical servers are configured as the data nodes.
Sub-projects
I have a long list of things to do, along with things I have already done, in the bullets below. I plan on writing a blog post about each of these as I get time; any that get specific requests might get preference:
Completed
- Integrate Drupal to Solr Search
- Integrate Apache Nutch to Solr for combined results in Drupal
- Build API interface with HipHop PHP for Drupal to HBase and HDFS integration
- Write test Drupal module for block and node page altering
- Google Maps integration with Drupal
- Mobile phone detection and theme switching with Drupal
- Learn PHP (had to for working with Drupal)
- Many more ...
To do
- Deploy HipHop PHP to a chroot environment
- Federated search
- Apache Solr clustering for redundancy and performance (if possible)
- HBase cluster (ZooKeeper, Master, RegionServers)
- XML to XSLT for XML data revisions in HBase
- Apache/Drupal cluster with support for SSL sessions
- MapReduce tuning and programming tests
- Nagios/Ganglia for monitoring Hadoop
- Geocode searching Drupal/Solr