Wednesday, September 7, 2011

HipHop PHP in Chroot

HipHop PHP to C++
If you are unfamiliar with HipHop PHP, it is a tool written by Facebook developers that converts PHP into optimized C++ code and compiles the result into a single binary, which provides a basic, fast web server with your code embedded:
https://github.com/facebook/hiphop-php/wiki/

The wiki covers building it and how it works.  My goal, once I had it compiled and working, was to lock it down by running it in a chroot environment while still giving it access to the network and to data sources like HBase, HDFS and MySQL.

I was able to successfully get the HipHop PHP binary (HTTP server) running in a chroot environment.  It wasn't much different from any other chroot; a quick Google search will cover the basics of setting one up.

The biggest hassle in this job is the HipHop binary itself, since it is dynamically linked against so many libraries.  The command you want is 'ldd hiphopbinary', which lists every shared library the executable needs in order to run.

Once you have the list, copy all of these libraries into your new chroot environment, preserving the same directory structure so HipHop can find them.  I fed the output of ldd through a quick Perl script to generate a copy script.
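I used Perl; here is an equivalent sketch in plain shell (the function name and the awk filter are my own, not from the original script) that reads ldd output and copies each library into the chroot, preserving its directory path:

```shell
# copy_needed_libs: read `ldd` output on stdin and copy every library it
# names into the chroot directory given as $1, preserving the original
# directory structure (GNU cp --parents recreates /lib64, /usr/lib64, ...).
copy_needed_libs() {
    chroot_dir="$1"
    # ldd lines look like "libz.so.1 => /lib64/libz.so.1 (0x...)" or
    # "/lib64/ld-linux-x86-64.so.2 (0x...)"; keep only the absolute paths.
    awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^\//) print $i }' |
    while read -r lib; do
        cp --parents "$lib" "$chroot_dir"
    done
}

# usage, once the chroot directory exists:
#   ldd /opt/hiphop/chroot/sbin/httpd | copy_needed_libs /opt/hiphop/chroot
```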

Create the Directory Structure
I created a root directory at /opt/hiphop/chroot, then created the following directories inside it:
/usr/lib64 - this is a 64bit machine so some libraries go here
/usr/lib64/mysql - some mysql specific libraries
/usr/lib64/php - some php libraries
/usr/lib64/php/modules
/dev - special files directory...read the following paragraph
/var/www/html - some static content was needed so it is put here
/var/run - the .pid file needs a home
/var/log - logs go here
/lib64 - other libraries are searched in this path
/tmp - junk folder
/bin - only /bin/sh is here
/opt/hiphop/local/lib - building HipHop requires separately compiled libraries, which I put here
/etc - OS config files
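The whole skeleton above can be created in one pass.  This sketch uses a CHROOT variable (defaulting to a /tmp path so it can run unprivileged; point it at /opt/hiphop/chroot for the real setup):

```shell
# create the chroot directory skeleton in one command; set CHROOT to
# /opt/hiphop/chroot for the real thing (/tmp default so this runs unprivileged)
CHROOT="${CHROOT:-/tmp/hiphop-chroot}"
mkdir -p "$CHROOT/usr/lib64/mysql" \
         "$CHROOT/usr/lib64/php/modules" \
         "$CHROOT/dev" \
         "$CHROOT/var/www/html" \
         "$CHROOT/var/run" \
         "$CHROOT/var/log" \
         "$CHROOT/lib64" \
         "$CHROOT/tmp" \
         "$CHROOT/bin" \
         "$CHROOT/opt/hiphop/local/lib" \
         "$CHROOT/etc"
```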

Making special files
Depending on how thorough you want to be, or on other special circumstances, your program may need access to device files such as /dev/null, /dev/urandom, etc.  To create these you need the mknod binary.  For example, to create /dev/urandom, go to the dev directory of your chroot environment:
cd /opt/hiphop/chroot/dev
mknod urandom c 1 9
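The other device nodes that show up in my final chroot listing can be created the same way.  A short sketch (the major/minor numbers are the standard Linux ones; mknod for character devices requires root):

```shell
# create the device nodes inside the chroot's /dev
cd /opt/hiphop/chroot/dev
mknod null    c 1 3   # /dev/null
mknod random  c 1 8   # /dev/random
mknod urandom c 1 9   # /dev/urandom
chmod 666 null random urandom   # world read/write, like the real /dev
```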

If you need to create other special files, I found this page with a shell script for a RAM filesystem that creates them:
http://linuxfirmwarekit.googlecode.com/svn/trunk/initramfs/dev.sh

Get the web server running !!!
You can start the HipHop server using the following basic command:
sbin/httpd -u httpd -m daemon -p 8080 2>/var/log/http_error.log >/var/log/access.log

Change sbin/httpd to wherever you put the binary.  Once I was sure that was working properly, I put the command into a startup script called sbin/start_httpd, replacing sbin/httpd with /sbin/httpd.  This is because the script will be executed under the chroot context, where / actually refers to the chroot environment.
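The post does not show the script itself, so the contents below are my guess at a minimal wrapper; the point to notice is that every path in it resolves inside the chroot:

```shell
#!/bin/sh
# /sbin/start_httpd as seen inside the chroot (on the real filesystem this
# is /opt/hiphop/chroot/sbin/start_httpd); all paths resolve under the
# chroot's new root "/"
/sbin/httpd -u httpd -m daemon -p 8080 \
    2>/var/log/http_error.log >/var/log/access.log
```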

I then took the regular Apache httpd startup script and modified it slightly.  I have included the relevant start() portion below.  The final command is:
$CHROOTBIN --userspec=99:99 $CHROOTHOME $startcmd

This translates to:
/usr/sbin/chroot --userspec=99:99 /opt/hiphop/chroot /sbin/start_httpd

/usr/sbin/chroot - the chroot system binary
--userspec=99:99 - the uid:gid I want the process to run under (obviously not root)
/opt/hiphop/chroot - the directory that chroot makes the new root ("/") for the command it is about to execute
/sbin/start_httpd - the startup script that actually executes the HipHop binary.  On the real Linux filesystem the file sits at /opt/hiphop/chroot/sbin/start_httpd, but we are chrooted, so the process only knows about its new home.

init.d/chroot_httpd
...
prog=chroot_httpd_api
pidfile=${PIDFILE-/var/run/httpd/chroot_httpd_api.pid}
lockfile=${LOCKFILE-/var/lock/subsys/chroot_httpd_api}
CHROOTBIN=/usr/sbin/chroot
CHROOTHOME=/opt/hiphop/chroot
startcmd=/sbin/start_httpd
RETVAL=0

start() {
        echo -n $"Starting $prog: "
        $CHROOTBIN --userspec=99:99 $CHROOTHOME $startcmd
        RETVAL=$?
        echo
        [ $RETVAL = 0 ] && touch ${lockfile}
        return $RETVAL
}
...

You can skip the rest if you like, but I ran a find on my chroot environment to show what the final product looks like, directory structure and files included.  I removed the shared library files (.so) from the list because there were about 60 of them.  Make sure you edit your etc/passwd, etc/shadow, etc/hosts and other files to remove any sensitive data; the hosts file may also need entries for hostnames such as your MySQL database server.
/opt/hiphop/chroot/
/opt/hiphop/chroot/usr
/opt/hiphop/chroot/usr/lib64
/opt/hiphop/chroot/usr/lib64/..many .so libraries here
/opt/hiphop/chroot/usr/lib64/mysql
/opt/hiphop/chroot/usr/lib64/mysql/libmysqlclient_r.so.16
/opt/hiphop/chroot/usr/lib64/php
/opt/hiphop/chroot/usr/lib64/php/modules
/opt/hiphop/chroot/usr/lib64/php/modules/apc.so
/opt/hiphop/chroot/dev
/opt/hiphop/chroot/dev/urandom
/opt/hiphop/chroot/dev/null
/opt/hiphop/chroot/dev/random
/opt/hiphop/chroot/var
/opt/hiphop/chroot/var/www
/opt/hiphop/chroot/var/www/html
/opt/hiphop/chroot/var/www/html/favicon.ico
/opt/hiphop/chroot/var/run
/opt/hiphop/chroot/var/log
/opt/hiphop/chroot/var/log/access_log
/opt/hiphop/chroot/var/log/access.log
/opt/hiphop/chroot/var/log/admin_log
/opt/hiphop/chroot/var/log/http_error.log
/opt/hiphop/chroot/var/log/error_log
/opt/hiphop/chroot/sbin
/opt/hiphop/chroot/sbin/httpd
/opt/hiphop/chroot/sbin/start_httpd
/opt/hiphop/chroot/lib64
/opt/hiphop/chroot/lib64/...many more .so libraries here
/opt/hiphop/chroot/tmp
/opt/hiphop/chroot/www.pid
/opt/hiphop/chroot/bin
/opt/hiphop/chroot/bin/sh
/opt/hiphop/chroot/opt
/opt/hiphop/chroot/opt/hiphop
/opt/hiphop/chroot/opt/hiphop/local
/opt/hiphop/chroot/opt/hiphop/local/lib < this was the local folder with libraries built for hiphop
/opt/hiphop/chroot/etc
/opt/hiphop/chroot/etc/run.conf
/opt/hiphop/chroot/etc/nsswitch.conf
/opt/hiphop/chroot/etc/log.conf
/opt/hiphop/chroot/etc/shadow
/opt/hiphop/chroot/etc/hosts
/opt/hiphop/chroot/etc/passwd
/opt/hiphop/chroot/etc/resolv.conf
/opt/hiphop/chroot/etc/httpd.conf
/opt/hiphop/chroot/etc/group

Drupal and Solr Integration

The basic Apache Solr integration is rather simple.  Download the Solr Integration module from:
http://drupal.org/project/apachesolr

Install and enable the module.  Follow the included readme directions, since you need to copy the schema.xml, protwords.txt, and solrconfig.xml files so Solr can talk the same "language" as Drupal.
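As a rough sketch of that readme step (the module and Solr paths below are typical examples, not from the module docs; adjust them to your install):

```shell
# copy the Drupal-aware Solr config files shipped with the apachesolr module
# into Solr's conf directory (example paths -- adjust to your layout)
MODULE=/var/www/html/sites/all/modules/apachesolr
SOLR_CONF=/opt/solr/example/solr/conf
for f in schema.xml solrconfig.xml protwords.txt; do
    cp "$MODULE/$f" "$SOLR_CONF/$f"
done
# restart Solr afterwards so it reloads the new schema
```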

Configuring Solr Integration Module
Once you have installed Solr, enabled the module, and cron has run, some data should be in Solr.  Configuring the module will make your search results much more useful.  I'm giving the steps for Drupal 7; they are a little different for Drupal 6.

Go to Configuration > Search and Metadata > Apache Solr Search
  • Edit the search environment to point to your Solr server
  • Under Behavior on empty search, pick the last radio item, which shows the first page of all results.  This is best for testing purposes so you can see results without filtering on terms.
  • Also enable the spellchecker, since it may help "find" results
 Go to the Search Index tab
  • If it doesn't say 100% of your site has been indexed, click run cron to send some documents.  Also check the number of documents in the index to see whether anything has been sent
The Enabled filters tab is where you define which content types will be indexed

You can also edit the content bias and search field tabs to change result weighting once you get the basic search running.

Finally go to Configuration > Search Settings
  • Make sure Apache Solr Search is enabled under modules
  • Make sure Apache Solr Search is set to be the default engine
  • Save your settings

One final tweak is to enable the blocks you want to see, so go to Structure > Blocks
  • Scroll down until you see the Disabled section
  • Pick some of the Apache Solr items there and assign them to your sidebar
You should see items such as:
  • Apache Solr environment: localhost server : Current Search
  • Apache Solr Core: Sorting
Yours may look a little different depending on which other modules are installed, but they should be very similar.

Additional research (homework) to do on your own: take a look at other modules that can help extend Solr:
  • Apache Solr Attachments
  • Apache Solr Autocomplete
  • Facet API
  • Search API
  • Search API Solr


Apache Nutch Integration
Another interesting idea is to add Nutch:
http://nutch.apache.org/

Nutch can crawl websites and import their data into Solr.  This part gets a little trickier because Nutch has its own schema it wants you to put into schema.xml, so you end up merging the two schemas to get a final result.  I ended up using only a few of the suggested schema changes and added the following line to schema.xml:
<copyField source="id" dest="path"/>

This told Solr to copy the id field to a new field called path (which Drupal looks for to display search results).


I then edited the solrindex-mapping.xml file for Nutch as follows:

<mapping>
        <fields>
                <!--field dest="site" source="site"/-->
                <field dest="type" source="site"/>
                <field dest="title" source="title"/>
                <field dest="host" source="host"/>
                <field dest="segment" source="segment"/>
                <field dest="boost" source="boost"/>
                <field dest="digest" source="digest"/>
                <field dest="tstamp" source="tstamp"/>
                <field dest="id" source="url"/>
                <field dest="body" source="content"/>
                <copyField source="url" dest="url"/>
        </fields>
        <uniqueKey>id</uniqueKey>
</mapping>
 
This created search documents in the Solr index in exactly the format the Apache Solr module was looking for.  Also note that I mapped the "site" field to "type", so when you turn on facets in Apache Solr, the Drupal content types get faceted alongside the URL of the site you are crawling.  This was a personal choice; you can categorize any other way.
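For reference, the end-to-end Nutch run looked roughly like this in the Nutch 1.x era (the seed directory, depth, topN, and Solr URL are example values, not from my exact setup):

```shell
# crawl the seed URLs and push the parsed pages straight into Solr
# (urls/ holds plain-text seed lists; tune -depth and -topN to taste)
bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 50
```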


I realize this is a mashed-up entry pulling in several different directions, but I wanted to give a little flavor of all the ways you can integrate Solr into Drupal with data from multiple sources.  For additional reading, here is a link to many related projects I found on Drupal:

http://drupal.org/node/343467#other-documentation

In the next entry I want to talk about how I used Solr's Data Import Handler to index data from RSS feeds/blogs, MySQL and XML files.

Monday, August 22, 2011

Hadoop, HBase, HipHop and other H-software

Introduction
I'm working on a large scale data project and am in the research phase right now.  I only have a few high level requirements:
  1. Scalability to multi-petabyte 
  2. Not tied to a single programming language
  3. All open source software  
Part of the research is developing a front end to display the data, but this is a small portion.  I wanted to document what I was doing in the hopes it may help others.  I have been doing a lot of research, playing around with configurations and finding my share of frustrating moments.  I have been helped by so many different postings and blogs, I felt I needed to give back something as well.

Here is my initial setup that I have begun to configure and work with:
  1. Fedora Linux
  2. Drupal for the frontend
  3. MySQL (for Drupal and some other metadata)
  4. Apache webserver
  5. Apache Solr
  6. Hadoop 0.20
  7. HBase NoSQL database (for primary storage)
  8. Apache Nutch webcrawler
  9. Hiphop PHP (for web services that will really benefit from native code)
Architecture
I'm using 4 existing machines and can set up virtual machines as needed, but will try to keep the number of servers small to ease implementation and testing.  Drupal and MySQL are installed on a single machine.  Hadoop is installed and working on the 4 machines plus 3 VMs, giving me 7 nodes for parallel tasking.  The 4 physical servers are configured as the data nodes.

Sub-projects
I have a long list of things to do, along with things I have already done, in the bullets below.  I plan on writing a blog post about each of these as I get time; if I get specific requests, those might get preference:

Completed
  • Integrate Drupal to Solr Search
  • Integrate Apache Nutch to Solr for combined results in Drupal
  • Build API interface with Hiphop PHP for Drupal to HBase and HDFS integration
  • Write test Drupal module for block, node page altering
  • Google maps integration with Drupal
  • Mobile phone detection and theme switching with Drupal
  • Learn PHP (had to for working with Drupal)
  • Many more ...
To Do Items
  • Build API interface with Hiphop PHP for Drupal to HBase and HDFS integration
  • Deploy Hiphop PHP to a chroot environment
  • Federated search
  • Apache Solr clustering for redundancy and performance (if possible)
  • HBase cluster (zookeepers, Master, Regional)
  • XML to XSLT for XML data revisions in HBase
  • Apache/Drupal cluster with support for SSL sessions
  • Map / Reduce tuning and programming testing
  • Nagios/Ganglia for monitoring Hadoop
  • Geocode searching Drupal/Solr
That's enough for the first post.  My first task today is getting HipHop running in a chroot environment and getting the config file the way I want it.