Documentation¶
Big Data Development Environment using Docker¶
Ferry helps you create big data clusters on your local machine. Define your big data stack using YAML and share your application with Dockerfiles. Ferry supports Hadoop, Cassandra, Spark, GlusterFS, and Open MPI.
Here’s an example Hadoop cluster:
backend:
   - storage:
        personality: "hadoop"
        instances: 2
        layers:
           - "hive"
connectors:
   - personality: "hadoop-client"
Then get started by typing ferry start hadoop. This will automatically create a two-node Hadoop cluster and a single Linux client. You can customize the Linux client at runtime or define your own using a Dockerfile.
Ferry is useful for:
- Data scientists who want to experiment with and learn about big data technologies
- Developers who need a locally accessible big data development environment
- Users who want to share big data applications quickly and safely
Ferry provides several useful commands for your applications:
- Start and stop services
- View status and create snapshots
- SSH into clients
- Copy over log files to a host directory
For example, let’s inspect all the running services:
$ ferry ps
UUID Storage Compute Connectors Status Base Time
---- ------- --------- ---------- ------- ------- ----
sa-2 se-6 [u'se-7'] se-8 removed hadoop --
sa-1 se-3 [u'se-4'] se-5 stopped openmpi --
sa-0 se-0 [u'se-1'] se-2 running cassandra --
Ferry is under active development, so follow us on Twitter to keep up to date.
If you’re interested in collaborating or have any questions, feel free to send an email to info@opencore.io.
Installation¶
Client Install¶
Ferry currently relies on a relatively new version of Linux (Ubuntu 13.10 and 14.04 have both been tested). While other distros will probably work, these instructions are not guaranteed to work in those environments. If you decide to go ahead and try anyway, please let us know how it went and file any issues on GitHub!
OS X¶
The easiest way to get started on OS X is via Vagrant. In theory you could create an Ubuntu VM yourself and install everything (via the Ubuntu instructions), but using Vagrant is much easier.
Assuming you’re running Vagrant, type the following into your prompt:
$ vagrant box add opencore/ferry https://s3.amazonaws.com/opencore/ferry.box
$ vagrant init opencore/ferry
$ vagrant up
This will create your Vagrant box and initialize everything. Please note that the Ferry box is about 3 GB, so the download will take a while (the box contains all the Docker images pre-built). After the Vagrant box is up and running, ssh into it:
$ vagrant ssh
Now you can get started. Please note that this Vagrant box does not contain very much besides the basic Ferry installation, so you’ll probably want to install your favorite text editor, etc.
Linux (Requirements)¶
Before installing Ferry, you’ll need to have Docker installed. Here are the commands for a relatively new version of Ubuntu:
$ sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 36A1D7869245C8950F966E92D8576A8BA88D21E9
$ sudo sh -c "echo deb http://get.docker.io/ubuntu docker main > /etc/apt/sources.list.d/docker.list"
$ sudo apt-get update
$ sudo apt-get install lxc lxc-docker
Note that Ferry requires the use of LXC. Since newer versions of Docker use libcontainer by default, we need to include lxc in the installation as well. If you’d like a more in-depth explanation of what’s going on, visit the Docker homepage for more detailed instructions.
After installing Docker, you’ll want to create a new group called docker to simplify interacting with Docker and Ferry as a non-root user.
$ sudo groupadd docker
$ sudo usermod -a -G docker $USER
You may need to logout and log back in for the group changes to take effect.
You’ll also need to install pip. On an Ubuntu machine, type:
$ sudo apt-get install python-pip
Linux (Preferred)¶
The preferred way of running Ferry on your Linux box is to use our installation script ferry-dust. You can obtain this script via pip. Just type the following:
$ sudo pip install -U ferry
Afterwards, type the following to get the images installed:
$ export FERRY_DIR=/var/lib/ferry
$ ferry-dust install
Note that you can set FERRY_DIR to any directory that you’d like. This simply tells Ferry where to store all the Ferry images. The install command will pull all the images and may take quite a while.
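As a quick sketch of that, here is one way to point FERRY_DIR at a custom location before running ferry-dust install (the $HOME-based path below is a hypothetical example, not a requirement):

```shell
# FERRY_DIR can be any directory you like; Ferry stores all its images there.
# The path below is a hypothetical example -- substitute your own.
export FERRY_DIR="$HOME/ferry-images"
mkdir -p "$FERRY_DIR"
echo "Ferry will store its images in $FERRY_DIR"
```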
After the install completes, you can start using Ferry. To enter a console, type:
$ ferry-dust start
Linux (Manual)¶
These instructions are for installing Ferry manually (without using ferry-dust). While the instructions aren’t long, please be warned this process is a bit more fragile. Also, if you are upgrading from a prior installation, head over here for a more in-depth explanation.
First you’ll need to install Ferry via pip.
$ sudo pip install -U ferry
After installing Ferry, we’ll need to install the Ferry images (containing Hadoop, Spark, etc.).
$ sudo ferry install
By default, Ferry uses a default set of public/private keys so that you can interact with the connectors afterwards. You can instruct Ferry to use your own keys by supplying a directory, like this: ferry -k $KEY_DIR install. The build process may take a while, so sit back and relax.
Running Ferry¶
Once Ferry is completely installed, you should be able to start the Ferry server and start writing your application. First you’ll need to start the server.
$ sudo ferry server
$ ferry info
Congratulations! Now you’ll want to head over to the Getting Started documents to figure out how to write a big data application. Currently Ferry supports the following backends:
- Hadoop (version 2.3.0) with Hive (version 0.12)
- Cassandra (version 2.0.5)
- Titan graph database (0.3.1)
- Gluster Filesystem (version 3.4)
- Open MPI (version 1.7.3)
When you’re all done writing your application, you can stop the Ferry servers by typing:
$ sudo ferry quit
OpenStack Server Installation¶
This documentation is meant for system administrators and data engineers who are interested in installing Ferry in their own OpenStack private cloud. If you’re an end-user, you probably want to read this.
Installing Ferry on your OpenStack cluster is relatively straightforward, and simply requires creating a “Ferry Server” image.
Quick Installation¶
- Instantiate an Ubuntu 14.04 image
- Create a /ferry/master directory and export “FERRY_DIR=/ferry/master”
- Install Ferry and all Ferry images
- Save the image as “Ferry Server (small)”
Ubuntu 14.04¶
The Ferry server must be based on Ubuntu 14.04. If your OpenStack cluster does not have an Ubuntu 14.04 image already, you can download an image from the Ubuntu website:
Afterwards, you can import the image into Glance by typing:
$ glance image-create --name "Ubuntu 14.04 (amd64)" --disk-format=raw --container-format=bare --file=./trusty-server-cloudimg-amd64-disk1.img
After importing the image, boot up a new Ubuntu instance. You’ll want to use an instance type with at least a 10GB root directory, since we’ll be installing all the Ferry images.
Installing Ferry¶
After instantiating the Ubuntu instance, you’ll need to install Ferry. The easiest way is to use our automated installation script:
$ git clone https://github.com/opencore/ferry.git
$ sudo ferry/install/ferry-install
However you can also install Ferry manually.
- Detailed Ferry instructions
One point of note: you’ll need to first create and set an alternative Ferry data directory.
$ mkdir -p /ferry/master && export FERRY_DIR=/ferry/master
The installation process can take quite a while (it will download approximately 5GB of Docker images).
To verify if Ferry has been installed, you can just type:
$ sudo ferry server
...
using backend local
$ sudo ferry ps
UUID Storage Compute Connectors Status Base Time
---- ------- ------- ---------- ------ ---- ----
OpenStack Heat¶
Ferry uses the Heat orchestration engine to launch clusters on OpenStack. If your OpenStack cluster already has Heat installed, then you can skip this step. Otherwise, Ferry will use its own “stand-alone” Heat server. To use the stand-alone Heat server, you’ll need to download the Ferry Heat image.
$ sudo ferry server
$ sudo ferry pull image://ferry/heatserver
Save Image¶
Now that you have Ferry installed, go ahead and stop Ferry.
$ sudo ferry quit
Afterwards, create a snapshot of the instance. You can name the snapshot whatever you want, but users will need this name later when configuring the client. Something like “Ferry Server” should do.
Next Steps¶
Once the Ferry image is created, users should be able to start using Ferry to create big data clusters. The “Ferry Server” image can be used either as a client or server.
- Configuring the client instructions
- HP Cloud client instructions
OpenStack Client¶
This documentation is meant for Ferry end users. Before we can get started, however, the Ferry server image will need to be installed in the OpenStack cluster. Normally a system administrator would do that part.
- Installing the Ferry image instructions
Once that’s done, we can launch our Ferry client.
Launch Summary¶
- Launch a new Ferry client using the Ferry server image
- Set OpenStack credentials
- Create a new Ferry configuration file
- Start the Ferry server as root via sudo ferry server
Launching the Client¶
This documentation assumes that the Ferry client is launched from a VM running in the OpenStack cluster. This isn’t strictly necessary however. If you already have Ferry installed on your local machine, you can skip this step and begin setting up your OpenStack credentials.
1. Start by launching a new instance using the Ferry server image. If you don’t know the name of the image, ask your system administrator (it’s probably named something like “Ferry Server”).
2. Use an instance type with more than 10GB of root storage.
3. Associate a floating IP with the instance so that you can ssh into it later.
The Ferry image is based on an Ubuntu 14.04 image, so after the instance is fully launched, log in using the ubuntu account.
$ ssh -i myprivatekey.pem ubuntu@172.x.x.y
The next step is to configure your OpenStack credentials.
Setting OpenStack credentials¶
In order to use the OpenStack backend, you’ll need to set your OpenStack credentials. Right now Ferry requires that you run everything as root. So switch to the root user (or equivalently use sudo). Afterwards, set the following environment variables:
- OS_USERNAME
- OS_PASSWORD
- OS_TENANT_ID
- OS_TENANT_NAME
- OS_REGION_NAME
- OS_AUTH_URL
These values can be readily found using the OpenStack Horizon web interface.
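As a sketch, the exports might look like the following (every value below is a placeholder; copy the real ones from Horizon):

```shell
# Placeholder OpenStack credentials -- replace each value with your own.
export OS_USERNAME=myuser
export OS_PASSWORD=mysecret
export OS_TENANT_ID=0123456789abcdef
export OS_TENANT_NAME=myproject
export OS_REGION_NAME=RegionOne
export OS_AUTH_URL=https://identity.example.com:5000/v2.0/
```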
Configuring the Client¶
After setting your OpenStack environment variables, you need to create a new Ferry configuration file. Create a new file called /root/.ferry-config.yaml and use the following example to populate it.
system:
  provider: openstack
  network: eth0
  backend: ferry.fabric.cloud/CloudFabric
  mode: ferry.fabric.openstack.singlelauncher/SingleLauncher
  proxy: false
openstack:
  params:
    dc: homedc
    zone: ZONE
  deploy:
    image: Ferry Server
    personality: standard.small
    default-user: ubuntu
    ssh: ferry-keys
    ssh-user: ferry
  homedc:
    region: REGION
    keystone: https://<IDENTITYSERVICE>.com
    neutron: https://<NETWORKSERVICE>.com
    nova: https://<COMPUTESERVICE>.com
    swift: https://<STORAGESERVICE>.com
    cinder: https://<DISKSERVICE>.com
    heat: https://<ORCHSERVICE>.com
    extnet: 123-123-123
    network: 123-123-123
    router: 123-123-123
First let’s fill in the openstack.params.zone and openstack.homedc.region values.
- openstack.params.zone : the default availability zone
- openstack.homedc.region : the default region
Next we need to supply the OpenStack service endpoints.
- openstack.homedc.keystone : location of the identity service
- openstack.homedc.neutron : location of the network service
- openstack.homedc.nova : location of the compute service
- openstack.homedc.swift : location of the storage service
- openstack.homedc.cinder : location of the block storage service
- openstack.homedc.heat : location of the orchestration service (optional)
Now under openstack and homedc, there are three fields called extnet, network, and router. To fill in these values, you can use the ferry-install os-info command. Just type that in and you should see something like this:
$ ferry-install os-info
====US West====
Networks:
+--------------------------------------+----------------+---------------------------------+
| id | name | subnets |
+--------------------------------------+----------------+---------------------------------+
| 11111111-2222-3333-4444-555555555555 | Ext-Net | 1111111111111-2222-3333-444444 |
| 11111111-2222-3333-4444-555555555555 | myuser-network | 1111111111111-2222-3333-444444 |
+--------------------------------------+----------------+---------------------------------+
Routers:
+--------------------------------------+---------------+----------------------------------+
| id | name | external_gateway_info |
+--------------------------------------+---------------+----------------------------------+
| 11111111-2222-3333-4444-555555555555 | myuser-router | {"network_id": "11111111-2222-3 |
+--------------------------------------+---------------+----------------------------------+
Just copy the IDs of Ext-Net, myuser-network, and myuser-router into the respective extnet, network, and router fields.
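Using the dummy values from the sample output above, the relevant portion of /root/.ferry-config.yaml would then look like this:

```yaml
openstack:
  homedc:
    extnet: 11111111-2222-3333-4444-555555555555
    network: 11111111-2222-3333-4444-555555555555
    router: 11111111-2222-3333-4444-555555555555
```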
Next you need to configure your ssh key.
- openstack.deploy.ssh : name of the ssh key you’d like to use for VM creation
On your client, you’ll need to place a copy of the private key in the /ferry/keys/ directory.
Finally, here is the list of optional values that you can set.
- system.proxy : set to true if you’re running your client in the OpenStack cluster.
- openstack.deploy.personality : the default personality to use. Highly recommended to use an image with more than 2 virtual CPUs.
Running Examples¶
After you’ve created your configuration file, you should start the Ferry server:
$ sudo ferry server
It’ll take a few seconds, but you’ll eventually see output that indicates that you’re using the OpenStack backend.
$ sudo ferry server
...
using heat server http://10.1.0.3:8004/v1/42396664178112
using backend cloud ver:0.1
Afterwards, you should be able to start a new application stack.
$ sudo ferry start hadoop
Starting the Hadoop stack can take 10 minutes or longer. If you login to your Horizon web interface, you should be able to see the VMs being instantiated. You can also check the status via Ferry:
$ sudo ferry ps
UUID Storage Compute Connectors Status Base Time
---- ------- ------- ---------- ------ ---- ----
sa-bfa98eda [] [' '] [] building hadoop
$ sudo ferry ps
UUID Storage Compute Connectors Status Base Time
---- ------- ------- ---------- ------ ---- ----
sa-bfa98eda [u'se-60c89300'] [' '] [u'se-0b841c69'] running hadoop
Once the stack is in the running state, log in to the Hadoop client:
$ sudo ferry ssh sa-bfa98eda
Afterwards, run a simple Hadoop job:
$ /service/runscripts/test/test01.sh hive
That’s it! Once you’re done, you can stop and delete the entire Hadoop cluster:
$ sudo ferry stop sa-bfa98eda
$ sudo ferry rm sa-bfa98eda
HP Cloud¶
Ferry images are pre-installed on HP Cloud, so launching new big data clusters on HP Cloud is relatively straightforward.
Launch Summary¶
- Launch a new Ferry client using the “Ferry Server (small)” image
- Set OpenStack credentials
- Create a new Ferry configuration file
- Start the Ferry server as root via sudo ferry server
Launching the Client¶
This documentation assumes that the Ferry client is launched from a VM running in the HP Cloud. This isn’t strictly necessary however. If you already have Ferry installed on your local machine, you can skip this step and begin setting up your OpenStack credentials.
If you’re launching a new Ferry client, OpenCore provides pre-built images for HP Cloud.
- Start by launching a new instance using the “Ferry Server (small)” image.
- Use at least the “standard.small” instance size
- Associate a floating IP with the instance so that you can ssh into it later.
The Ferry image is based on an Ubuntu 14.04 image, so after the instance is fully launched, log in using the ubuntu account.
$ ssh -i myprivatekey.pem ubuntu@172.x.x.y
The next step is to configure your OpenStack credentials.
Setting OpenStack credentials¶
In order to use the OpenStack backend, you’ll need to set your OpenStack credentials. Right now Ferry requires that you run everything as root. So switch to the root user (or equivalently use sudo). Afterwards, set the following environment variables:
- OS_USERNAME
- OS_PASSWORD
- OS_TENANT_ID
- OS_TENANT_NAME
- OS_REGION_NAME
- OS_AUTH_URL
These values can be readily found using the HP Cloud web interface. OS_TENANT_ID and OS_TENANT_NAME can be found under “Identity, Projects”. OS_REGION_NAME should be set to either region-a.geo-1 (for US West) or region-b.geo-1 (for US East). Finally, you can find the OS_AUTH_URL under “Project, Access & Security, API Access”. Specifically, you’ll want the URL for the “Identity” service.
Configuring the Client¶
After setting your OpenStack environment variables, you need to create a new Ferry configuration file. If you’re running your client using the HP Cloud image, just open the file /root/.ferry-config.yaml. Otherwise, go ahead and create the file now.
You want your configuration to look like this:
system:
  provider: hp
  network: eth0
  backend: ferry.fabric.cloud/CloudFabric
  mode: ferry.fabric.openstack.singlelauncher/SingleLauncher
  proxy: false
hp:
  params:
    dc: uswest
    zone: az2
  deploy:
    image: Ferry Server (small)
    personality: standard.small
    default-user: ubuntu
    ssh: ferry-keys
    ssh-user: ferry
  uswest:
    region: region-a.geo-1
    keystone: https://region-a.geo-1.identity.hpcloudsvc.com:35357/v2.0/
    neutron: https://region-a.geo-1.network.hpcloudsvc.com
    nova: https://region-a.geo-1.compute.hpcloudsvc.com/v2/10089763026941
    swift: https://region-a.geo-1.images.hpcloudsvc.com:443/v1.0
    cinder: https://region-a.geo-1.block.hpcloudsvc.com/v1/10089763026941
    extnet: 122c72de-0924-4b9f-8cf3-b18d5d3d292c
    network: 123-123-123
    router: 123-123-123
First let’s fill in the hp.params.zone and hp.uswest.region values.
- hp.params.zone : the default availability zone (az1, az2, az3)
- hp.uswest.region : the default region (region-a.geo-1, region-b.geo-1)
Now under hp and uswest, there are two fields called network and router. To fill in these values, you can use the ferry-install os-info command. Just type that in and you should see something like this:
$ ferry-install os-info
====US West====
Networks:
+--------------------------------------+----------------+---------------------------------+
| id | name | subnets |
+--------------------------------------+----------------+---------------------------------+
| 122c72de-0924-4b9f-8cf3-b18d5d3d292c | Ext-Net | c2ca2626-97db-429a-bb20-1ea42e1 |
| 11111111-2222-3333-4444-555555555555 | myuser-network | 1111111111111-2222-3333-4444444 |
+--------------------------------------+----------------+---------------------------------+
Routers:
+--------------------------------------+---------------+----------------------------------+
| id | name | external_gateway_info |
+--------------------------------------+---------------+----------------------------------+
| 11111111-2222-3333-4444-555555555555 | myuser-router | {"network_id": "122c72de-0924-4b |
+--------------------------------------+---------------+----------------------------------+
Just copy the IDs of myuser-network and myuser-router into the network and router fields.
Next you need to configure your ssh key.
- hp.deploy.ssh : name of the ssh key you’d like to use for VM creation
On your client, you’ll need to place a copy of the private key in the /ferry/keys/ directory.
Finally, here is the list of optional values that you can set.
- system.proxy : set to true if you’re running your client in the OpenStack cluster.
- hp.deploy.personality : the default personality to use. Highly recommended to use standard.small or larger
Running Examples¶
After you’ve created your configuration file, you should start the Ferry server:
$ sudo ferry server
It’ll take a few seconds, but you’ll eventually see output that indicates that you’re using the OpenStack backend.
$ sudo ferry server
...
using heat server http://10.1.0.3:8004/v1/42396664178112
using backend cloud ver:0.1
Afterwards, you should be able to start a new application stack.
$ sudo ferry start hadoop
Starting the Hadoop stack can take 10 minutes or longer. If you login to your HP Cloud web interface, you should be able to see the VMs being instantiated. You can also check the status via Ferry:
$ sudo ferry ps
UUID Storage Compute Connectors Status Base Time
---- ------- ------- ---------- ------ ---- ----
sa-bfa98eda [] [' '] [] building hadoop
$ sudo ferry ps
UUID Storage Compute Connectors Status Base Time
---- ------- ------- ---------- ------ ---- ----
sa-bfa98eda [u'se-60c89300'] [' '] [u'se-0b841c69'] running hadoop
Once the stack is in the running state, log in to the Hadoop client:
$ sudo ferry ssh sa-bfa98eda
Afterwards, run a simple Hadoop job:
$ /service/runscripts/test/test01.sh hive
That’s it! Once you’re done, you can stop and delete the entire Hadoop cluster:
$ sudo ferry stop sa-bfa98eda
$ sudo ferry rm sa-bfa98eda
Rackspace¶
Coming soon!
Amazon Web Services¶
Ferry comes with built-in support for AWS, and lets you launch, run, and manage big data clusters on Amazon EC2. This means that you can quickly spin up a new Hadoop, Spark, Cassandra, or Open MPI cluster on AWS with just a few simple commands.
Ferry on AWS offers several advantages over tools such as Elastic MapReduce.
- Greater control over storage. You can instruct Ferry to use either ephemeral or elastic block storage.
- Greater control over the network. You can launch instances in either a public or private subnet (via VPC).
- Ability to mix-and-match components such as Oozie and Hue.
- Ability to manage multiple clusters of different types (Hadoop, Spark, and Cassandra all from a single control interface).
Before You Start¶
This documentation assumes that you have access to an Amazon Web Services account. If you don’t, go ahead and create one now. You’ll also probably want to create a new key pair for Ferry. While you can use an existing key pair, that is considered poor practice.
Ferry launches new instances into private subnets within a VPC. While more secure than legacy EC2, this does mean that you’ll need to have a VPC set up for Ferry to use. Most likely your AWS account already has this set up. If not, just navigate to Services -> VPC on your AWS homepage. Click on the “Your VPCs” menu to browse available VPCs. If you do have a VPC set up, you’ll want to remember the “VPC ID” (we’ll use it during the configuration stage).
Otherwise, if you don’t have any VPCs set up, just click on “Create VPC”. Amazon doesn’t charge you for creating a VPC, so don’t be afraid of messing up. After your VPC is set up, just note the “VPC ID”. For those who need more help, here’s a friendly VPC tutorial: http://docs.aws.amazon.com/AmazonVPC/latest/GettingStartedGuide/Wizard.html
Once the VPC is set up, Ferry will automatically handle the creation of the various subnets (unless you provide your own subnet information).
In summary, you will need:
- An active Amazon Web Services account
- A keypair used for communicating with Ferry EC2 instances
- A working VPC
Launch Summary¶
- Create new Ferry client VM
- Create a new Ferry configuration file
- Start the Ferry server as root via sudo ferry server
- Launch new clusters via sudo ferry start hadoop
Launching¶
The very first step is to get a functioning Ferry installation. The quickest way to do that is to use our public client image. Search for “Ferry” under “Community Images”. The image is currently available in the US East and US West (N. California) regions.
Please note that for the AWS backend to work properly, the client has to be running in the same VPC in which the Ferry instances will run. Otherwise, the client won’t be able to communicate with your instances when they’re launched in a private subnet.
After spinning up the client VM, ssh into it:
$ ssh -i MY_AWS_KEY.pem ubuntu@MY_IP_ADDRESS
Please note that as of right now, Ferry requires root access. Type the following:
$ sudo su
$ export FERRY_DIR=/ferry/master
That last command tells Ferry where to find all the Ferry images. If you’re using your own client installation (instead of our AWS image), you can probably skip that part.
Now it’s time to create the configuration file.
Configuration¶
In order to tell Ferry to use the AWS backend, you’ll need to create a Ferry configuration file.
Create a new configuration file ~/.ferry-config.yaml. If you have a pre-existing configuration, you can just modify that one instead.
You want your configuration to look like this:
system:
  provider: aws
  backend: ferry.fabric.cloud/CloudFabric
  mode: ferry.fabric.aws.awslauncher/AWSLauncher
  proxy: false
web:
  workers: 1
  bind: 0.0.0.0
  port: 4000
aws:
  params:
    dc: us-east-1
    zone: us-east-1b
    volume: ebs:8
  deploy:
    image: ami-20ef5d48
    personality: $EC2_TYPE
    vpc: $VPC_ID
    manage_subnet: $SUBNET_ID
    data_subnet: $SUBNET_ID
    default-user: ubuntu
    ssh: $EC2_KEYPAIR
    ssh-user: ferry
    public: false
    user: $API_USER
    access: $API_ACCESS
    secret: $API_SECRET
The most important parameters are:
- $EC2_TYPE: This is the instance type for all the VMs created by Ferry. The minimum size supported is t2.small, although you’ll want something larger for production environments.
- $EC2_KEYPAIR: This is the key pair that Ferry will use to communicate with the VMs. You must place the private key in /ferry/keys/ so that Ferry can find it.
- $API_USER: Your EC2 user handle.
- $API_ACCESS: Your EC2 access token. You can find these credentials from the AWS homepage by clicking Account, Security Credentials, Access Credentials.
- $API_SECRET: Your EC2 secret key. You can find it in the same place as the access token.
Storage¶
You can specify the storage capabilities of the VMs via the volume parameter. The syntax for modifying this parameter is:
- [ebs,ephemeral]:(size)
For example, to use 32GB EBS data volumes, set the value to: ebs:32. To use the instance store, just set the value to ephemeral. You can’t specify the ephemeral block size since that is determined by your instance type.
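Putting both forms in context, the volume parameter sits under aws.params in the configuration file (the 32GB size below is illustrative):

```yaml
aws:
  params:
    # 32GB EBS data volumes:
    volume: ebs:32
    # Or, to use the instance store (size fixed by the instance type):
    # volume: ephemeral
```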
Networking¶
You can specify the networking configuration via the following parameters:
- vpc: (Mandatory) Replace this with your VPC ID.
- manage_subnet: (Optional) If you specify a subnet ID, connectors will be launched into that subnet. Otherwise a new public subnet will be created.
- data_subnet: (Optional) If you specify a subnet ID, backend nodes will be launched into that subnet. Otherwise a new data subnet will be created.
- public: (Optional) If set to true, then the data subnet will be public. Otherwise, the data subnet will be private. The default value is false.
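Putting these together, a deployment that reuses existing subnets and keeps the data subnet private might look like this (the VPC and subnet IDs below are placeholders):

```yaml
aws:
  deploy:
    vpc: vpc-0a1b2c3d                # mandatory: your VPC ID
    manage_subnet: subnet-11111111   # optional: subnet for connectors
    data_subnet: subnet-22222222     # optional: subnet for backend nodes
    public: false                    # keep the data subnet private (the default)
```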
Region and AMI¶
Finally, you can specify the EC2 region via the following parameters:
- dc: The EC2 region to use.
- zone: The availability zone to use.
Depending on which EC2 region you specify, you’ll need to change the AMI.
Region    | AMI
----------|-------------
us-east-1 | ami-20ef5d48
us-west-1 | ami-bb919afe
Please note that only us-east-1 and us-west-1 are officially supported. Please file a GitHub issue for additional region support.
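For example, to target US West you would change the region, zone, and AMI together, using the matching AMI from the table above (the zone shown is an assumption; pick any zone available in your account):

```yaml
aws:
  params:
    dc: us-west-1
    zone: us-west-1b
  deploy:
    image: ami-bb919afe
```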
Running Examples¶
After you’ve created your configuration file, you should start the Ferry server:
$ sudo ferry server
It’ll take a few seconds, but you’ll eventually see output that indicates that you’re using the AWS backend.
$ sudo ferry server
...
using heat server http://10.1.0.3:8004/v1/42396664178112
using backend cloud ver:0.1
Afterwards, you should be able to start a new application stack.
$ sudo ferry start hadoop
Starting the Hadoop stack can take 10 minutes or longer. If you login to your AWS CloudFormation interface, you should be able to see the VMs being instantiated. You can also check the status via Ferry:
$ sudo ferry ps
UUID Storage Compute Connectors Status Base Time
---- ------- ------- ---------- ------ ---- ----
sa-bfa98eda [] [' '] [] building hadoop
$ sudo ferry ps
UUID Storage Compute Connectors Status Base Time
---- ------- ------- ---------- ------ ---- ----
sa-bfa98eda [u'se-60c89300'] [' '] [u'se-0b841c69'] running hadoop
Once the stack is in the running state, log in to the Hadoop client:
$ sudo ferry ssh sa-bfa98eda
Afterwards, run a simple Hadoop job:
$ /service/runscripts/test/test01.sh hive
Terminating the Cluster¶
If you want to stop your cluster, just type:
$ sudo ferry stop sa-bfa98eda
You can restart the same cluster by typing:
$ sudo ferry start sa-bfa98eda
Once you’re finished, you can delete the entire cluster by typing:
$ sudo ferry rm sa-bfa98eda
This will remove all the resources associated with the cluster. Be warned, however, that doing so will delete all the data associated with the cluster!
Future Features¶
There are a few features that aren’t quite implemented yet.
- Spot instance support. All instances are currently allocated in an on-demand manner.
- Heterogeneous instance types. At the moment, all instances use the same instance type.
- Resizing clusters. Once a cluster is created, the size of the cluster is fixed.
If any of these features are particularly important to you, please consider contributing.
Getting Started¶
Quick start¶
Let’s get a basic Hadoop cluster up and running using Ferry. First you’ll need to install Docker. If you’re running the latest version of Ubuntu, it’s fairly straightforward.
$ sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 36A1D7869245C8950F966E92D8576A8BA88D21E9
$ sudo sh -c "echo deb http://get.docker.io/ubuntu docker main > /etc/apt/sources.list.d/docker.list"
$ sudo apt-get update
$ sudo apt-get install lxc-docker-0.8.1
Please note that you’ll need to install Docker version 0.8.1. This will install additional libraries that Ferry needs.
You’ll also want to create a new group called docker to simplify interacting with Docker and Ferry as a non-root user. There are more detailed instructions on the Docker homepage.
Next you’ll want to install Ferry.
$ sudo pip install -U ferry
Now you’ll want to build the various Ferry images. Just type:
$ sudo ferry install
This may take tens of minutes, so sit back and relax. After all the images are built, verify the installation by typing:
$ sudo ferry server
$ ferry info
Congratulations!
You can examine the pre-installed Ferry applications by typing:
$ ferry ls
App Author Version Description
--- ------ ------- -----------
cassandra James Horey 0.2.0 Cassandra stack...
hadoop James Horey 0.2.0 Hadoop stack...
openmpi James Horey 0.2.0 Open MPI over Gluster...
spark James Horey 0.2.0 Spark over Hadoop...
yarn James Horey 0.2.0 Hadoop YARN over Glus...
Say you’re interested in the Hadoop application; just type the following to create a new Hadoop cluster:
$ ferry start hadoop
At this point, you probably want to head over to the Hadoop tutorial to understand how to interact with your new cluster.
When you’re all done writing your application, you can stop the Ferry servers by typing:
$ sudo ferry quit
Advanced Topics¶
Creating a Dockerfile¶
All connectors in Ferry can be customized and saved during runtime. However, sometimes it’s nice to specify all these runtime dependencies outside the connector. This would let us share applications easily with others and help us organize our own applications. Fortunately for us, we can use special files called Dockerfiles to help us do this. Here’s an example Dockerfile.
FROM ferry/cassandra-client
NAME james/cassandra-examples
RUN apt-get --yes install python-pip python-dev
RUN pip install cassandra-driver
RUN git clone https://github.com/opencore/cassandra-examples.git /home/ferry/cassandra-examples
Let’s take a look line by line. The first states that
FROM ferry/cassandra-client
That just means that our client is based on the official Ferry cassandra client. This ensures that all the right drivers, etc. will be automatically installed. The cassandra client, in turn, is based on an Ubuntu 12.04 image, so you can use convenient tools like apt-get in the Dockerfile. The next line:
NAME james/cassandra-examples
specifies the official name of this Dockerfile (for those that have created Dockerfiles before, you may recognize that the NAME command is not an officially supported Docker command). Now the next few lines tell Ferry what to install:
RUN apt-get --yes install python-pip python-dev
RUN pip install cassandra-driver
In our case, we just want to use the Python Cassandra driver. Finally, we’re going to download a copy of some convenient Cassandra examples from GitHub.
RUN git clone https://github.com/opencore/cassandra-examples.git /home/ferry/cassandra-examples
Once you have the Dockerfile specified, you can use it in a Ferry application by specifying its name in the connector section. Here’s an example:
backend:
  - storage:
      personality: "cassandra"
      instances: 2
connectors:
  - personality: "james/cassandra-examples"
Let’s assume that your yaml file is called cassandra_examples.yml and stored in the directory ~/my_app. In order to start your new custom application, just type the following:
$ ferry start ~/my_app/cassandra_examples.yml -b ~/my_app
The -b flag tells Ferry where to find your Dockerfile. Without that flag, it won’t be able to compile the james/cassandra-examples image. After you type that command the first time, you can omit the -b flag (since the compiled image will reside in your local Ferry repository). Once you log into your connector, you’ll find the Cassandra example applications like this:
$ su ferry
$ ls /home/ferry/cassandra-examples
sensorapp
twissandra
kairosdb
Creating a Dockerfile for your application is a convenient way to store and share your application. By providing the Dockerfile (along with any files that are included in the Dockerfile), any user can run the same application in Ferry.
Port forwarding¶
If your connector exposes a web service, you can find the IP address of your connector using the inspect command. This IP can then be used to access your connector as long as you’re on the same host (the IP is not exposed to the outside world). However, if you want to expose this web service to the outside world, you can use port forwarding. This concept is very similar to the native Docker port feature. Simply add the ports argument to your YAML stack file like below:
backend:
  - storage:
      personality: "cassandra"
      instances: 2
connectors:
  - personality: "james/cassandra-examples"
    ports: ["7888:8000"]
Here we’re specifying both the exposed port on the host (7888) and the internal port used by your web service (8000). If you use a single value (“8000”), Ferry will simply choose a random port to expose on the host. You can find the exposed port value via the inspect command. For example, here’s a shortened version of what you should see:
{
  "connectors": [
    {
      "containers": [
        {
          "internal_ip": "10.1.0.5",
          "ports": {
            "8000": [
              {
                "HostIp": "0.0.0.0",
                "HostPort": "7888"
              }
            ]
          }
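If you need the forwarded port programmatically, you can parse the inspect output. Here’s a minimal sketch, assuming the JSON shape shown above (the sample values and the `host_port` helper are just for illustration):

```python
import json

# A fragment shaped like the `ferry inspect` output above (values illustrative).
inspect_output = """
{
  "connectors": [
    {
      "containers": [
        {
          "internal_ip": "10.1.0.5",
          "ports": {
            "8000": [
              {"HostIp": "0.0.0.0", "HostPort": "7888"}
            ]
          }
        }
      ]
    }
  ]
}
"""

def host_port(doc, internal_port):
    """Return the host port mapped to `internal_port`, or None if unmapped."""
    for connector in doc.get("connectors", []):
        for container in connector.get("containers", []):
            bindings = container.get("ports", {}).get(str(internal_port), [])
            if bindings:
                return bindings[0]["HostPort"]
    return None

doc = json.loads(inspect_output)
print(host_port(doc, 8000))  # -> 7888
```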
Pushing and Pulling Applications¶
Ferry lets you easily share your application with others.
Once you’re done developing your app, you can upload your application to the Ferry servers to share it with others. For now, you’ll need an account on Docker.io to upload your images and another on Ferry to store your application stack. Once you have these accounts, you’ll need to create an authorization file.
$ cat ~/.ferry.docker
docker:
user: <DOCKER LOGIN>
password: <DOCKER PASSWORD>
email: <DOCKER EMAIL>
server: https://index.docker.io/v1/
ferry:
user: <FERRY LOGIN>
key: <FERRY KEY>
server: http://apps.opencore.io
registry: docker
Note that currently all Ferry applications are public. It doesn’t necessarily mean that all your customizations are publicly accessible, just that anybody can download your application. Private applications are something that we plan on supporting in the near future.
To upload the application, just perform a push command and point it to your application stack.
$ ferry push app:///home/ferry/mortar.yml
opencore/mortar
The final name of the application is your Ferry user name appended with the name of the application file.
Pulling Applications¶
To download an application, you’ll want to perform a pull command. Ferry lets you download Docker images manually or an entire Ferry application. Here we’re going to pull the Mortar Recommendation System application.
$ sudo ferry pull app://opencore/mortar
$ ferry inspect opencore/mortar
backend:
  - storage:
      personality: "hadoop"
      instances: ">=1"
      layers:
        - "hive"
connectors:
  - personality: "ferry/mortar-recsys"
This will pull all the Docker images the application needs and will register the application on your local machine. If you just want to download an image, just type:
$ sudo ferry pull image://opencore/mortar-recsys
Once you downloaded the application, you can start it like any other application:
$ ferry start opencore/mortar
Getting started with Cassandra¶
Cassandra is a highly scalable, “wide-column” store used to store large amounts of semi-structured data. It is often used for applications that insert lots of streaming data (i.e., sensors, web metrics, etc.), and where high availability is at a premium.
The first thing to do is define our stack in a file (let’s call it cassandra.yaml). The file should look something like this:
backend:
  - storage:
      personality: "cassandra"
      instances: 2
      layers:
        - "titan"
      args:
        db: "users"
connectors:
  - personality: "cassandra-client"
    args:
      db: "users"
There are two main sections: the backend and connectors. In this example, we’re defining a single storage backend with two Cassandra instances and specifying the name of the default database (users). We’re also creating a single Cassandra client that connects to that database. The client is just a Linux instance that is automatically configured to connect to Cassandra.
Running an example¶
Now that we’ve defined our stack, let’s start it up. Just type the following in your terminal:
$ ferry start cassandra
sa-0
$ ferry ps
UUID Storage Compute Connectors Status Base Time
---- ------- --------- ---------- ------- ------- ----
sa-0 se-0 [u'se-1'] se-2 running cassandra --
The entire process should take about 20 seconds. Before we continue, let’s take a step back to examine what just happened. After typing start, ferry created the following Docker containers:
- Two Cassandra data nodes
- A Linux client
Now that the environment is created, let’s interact with it by connecting to the Linux client. Just type ferry ssh sa-0 in your terminal. From there you can check your backend connection and install whatever you need.
Now let’s check what environment variables have been created. Remember this is all being run from the connector.
$ env | grep BACKEND
BACKEND_STORAGE_TYPE=cassandra
BACKEND_STORAGE_IP=10.1.0.3
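Your own code in the connector can read these variables to locate the backend. Here’s a small sketch (the fallback values are just for illustration when running outside a connector; 9042 is Cassandra’s default native-protocol port):

```python
import os

# Read the Ferry-provided backend settings, with illustrative fallbacks
# for running outside a connector.
storage_type = os.environ.get("BACKEND_STORAGE_TYPE", "cassandra")
storage_ip = os.environ.get("BACKEND_STORAGE_IP", "127.0.0.1")

# Cassandra's native protocol listens on 9042 by default.
contact_point = (storage_ip, 9042)
print(f"connecting to {storage_type} at {contact_point[0]}:{contact_point[1]}")
```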
Now if you’re really impatient to get a Cassandra application working, just type the following into the terminal:
$ /service/runscripts/test/test01.sh cql
It will take a few seconds to complete, but you should see some output that comes from executing the application. If you want to know what you just did, take a peek at the /service/runscripts/test/test01.sh file.
Now let’s interact with Cassandra manually and create a simple database. You can interact with Cassandra using cql, a language similar to SQL.
CREATE KEYSPACE mykeyspace WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };
USE mykeyspace;
CREATE TABLE users (
user_id int PRIMARY KEY,
fname text,
lname text
);
INSERT INTO users (user_id, fname, lname) VALUES (1745, 'john', 'smith');
SELECT * FROM users WHERE lname = 'smith';
All this does is create a simple users table and insert some fake data into it. Let’s save this CQL script into a file myscript.db. Now you can run this example by typing:
$ /service/bin/cqlsh -f myscript.db
Events and customization¶
Each connector is a complete Linux (Ubuntu) environment that can be completely configured. In fact, the connector is just a normal Docker container with a few extra scripts and packages pre-configured. That means you can install additional packages or include new code. Afterwards, it’s easy to save the entire state.
Connectors are customized using scripts that reside under /service/runscripts. You should see a set of directories, one for each type of event that Ferry produces. For example, the start directory contains scripts that are executed when the connector is first started. Likewise, there are events for:
- start: triggered when the connector is first started
- restart: triggered when the connector is restarted
- stop: triggered when the connector is stopped
- test: triggered when the connector is asked to perform a test
If you look in the test directory, you’ll find some example programs that you can execute. You can add your own scripts to these directories, and they’ll be executed in alphanumeric order.
Saving everything¶
Once you’ve installed all your packages and customized the runscripts, you’ll probably want to save your progress. You can do this by typing:
$ ferry snapshot sa-0
sn-sa-0-81a67d8e-b75b-4919-9a65-50554d183b83
$ ferry snapshots
UUID Base Date
-------------------------------------------- ------ --------------------
sn-sa-4-81a67d8e-b75b-4919-9a65-50554d183b83 cassandra 02/5/2014 (02:02 PM)
$ ferry start sn-sa-0-81a67d8e-b75b-4919-9a65-50554d183b83
sa-1
This will produce a snapshot that you can restart later. You can create as many snapshots as you want.
More resources¶
The Cassandra data model can take some getting used to. Once you do, you’ll find that Cassandra is relatively straightforward to use. Here are some additional resources that can help get you started.
Getting started with Hadoop¶
Hadoop is a popular big data platform that includes both storage (HDFS) and compute (YARN). In this example, we’ll create a small 2 node Hadoop cluster, and a single Linux client.
The first thing to do is to define our big data stack. You can either use YAML or JSON. Let’s call our new application stack hadoop.yaml. The file should look something like this:
backend:
  - storage:
      personality: "hadoop"
      instances: 2
      layers:
        - "hive"
connectors:
  - personality: "hadoop-client"
There are two main sections: the backend and connectors. In this example, we’re defining a single storage backend. This storage backend is going to run two instances of hadoop and also install hive, a SQL compatibility layer for Hadoop. The backend may also optionally include a compute section (for example, additional yarn instances). However, in this example we won’t need one, since Hadoop automatically comes with its own compute capabilities.
Connectors are basically Linux clients that are able to connect to the backend. You’ll want at least one to simplify interacting with Hadoop. You can also place your application-specific code in the connector (for example, your web server). In this example, we’ll use the built-in hadoop-client.
Running an example¶
Now that we’ve defined our stack, let’s start it up. Don’t forget that you need the Ferry server to be up and running (via sudo ferry server). Afterwards type the following in your terminal:
$ ferry start hadoop
sa-0
$ ferry ps
UUID Storage Compute Connectors Status Base Time
---- ------- --------- ---------- ------- ------- ----
sa-0 se-0 [u'se-1'] se-2 running hadoop --
Here hadoop can be replaced with the path to your own stack file; otherwise Ferry will use the default Hadoop stack. The entire process should take less than a minute.
Before we continue, let’s take a step back to examine what just happened. After typing start, ferry created the following Docker containers:
- Two Hadoop data/yarn nodes
- Hadoop namenode
- Hadoop YARN resource manager
- A Hive metadata server
- A Linux client
For those that already run Docker for other reasons, don’t worry: Ferry uses a separate Docker daemon so that your environment is left unaffected.
Now that the environment is created, let’s interact with it by connecting to the Linux client. Just type ferry ssh sa-0 in your terminal. From there you can check your backend connection and install whatever you need.
Now let’s check what environment variables have been created. Remember this is all being run from the connector.
$ env | grep BACKEND
BACKEND_STORAGE_TYPE=hadoop
BACKEND_STORAGE_IP=10.1.0.3
Now if you’re really impatient to get a Hadoop application working, just type the following into the terminal:
$ /service/runscripts/test/test01.sh hive
It will take a minute or two to complete, but you should see a bunch of output that comes from executing the application. If you want to know what you just did, take a peek at the /service/runscripts/test/test01.sh file.
Now let’s manually run some Hadoop jobs to confirm that everything is working. We’re going to download a dataset from the internet. It’s very important that we run everything as the ferry user (as opposed to root); otherwise you may see strange errors associated with permissions. So switch over to the ferry user by typing:
$ su ferry
$ source /etc/profile
That last command just sets the PATH environment variable so that you can find the hadoop and hive commands. To confirm, if you type the following, you should see the full path of the hive command. Of course, you can also just type in the full path if you prefer.
$ which hive
/service/packages/hive/bin/hive
Now that the PATH is set, we’re going to copy that dataset into the Hadoop filesystem. This is a necessary pre-condition to actually running any Hadoop jobs that operate over the data.
$ wget http://files.grouplens.org/datasets/movielens/ml-100k/u.data -P /tmp/movielens/
$ hdfs dfs -mkdir -p /data/movielens
$ hdfs dfs -copyFromLocal /tmp/movielens/u.data /data/movielens
Now we’re going to create the Hive tables. This will let us use SQL to interact with the data. To save our progress, let’s create a file createtable.sql to store all of our SQL. The file should contain something like this:
CREATE TABLE movielens_users (
  userid INT,
  movieid INT,
  rating INT,
  unixtime STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

LOAD DATA INPATH '/data/movielens/u.data'
OVERWRITE INTO TABLE movielens_users;
Hive lets you create tables using different formats. Here we’re using the “Textfile” format to initially load the data. Afterwards, you can load the data into alternative formats such as “RCfile” for better performance.
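The u.data file is tab-delimited, one rating per line, in the same column order as the table above. Here’s a quick sketch of how one row maps onto that schema (the sample line is illustrative):

```python
# Parse one tab-delimited MovieLens rating line into the
# (userid, movieid, rating, unixtime) schema defined above.
line = "196\t242\t3\t881250949"  # illustrative sample row

userid, movieid, rating, unixtime = line.split("\t")
row = {
    "userid": int(userid),
    "movieid": int(movieid),
    "rating": int(rating),
    "unixtime": unixtime,  # kept as a STRING, matching the Hive table
}
print(row)
```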
After creating our SQL file, we can execute the query by typing:
$ hive -f createtable.sql
This should execute several MapReduce jobs (you’ll see a bunch of output to the screen). After it’s done loading, we can query this table. Let’s do this interactively:
$ hive
$ hive> SELECT COUNT(userid) FROM movielens_users WHERE userid < 10;
...
Job 0: Map: 1 Reduce: 1 Cumulative CPU: 4.55 sec HDFS Read: 387448 HDFS Write: 5 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 550 msec
OK
1282
You’ll see way more output, but the last few lines should look like this.
Compiling a new application¶
Running a custom MapReduce program is pretty straightforward. First we compile, then we package the results in a jar file, and then invoke the hadoop command. Here’s an example:
$ javac -classpath $HADOOP_HOME -d Wordcount/ Wordcount.java
$ jar -cvf Wordcount.jar -C Wordcount/ .
$ hadoop jar Wordcount.jar org.opencore.Wordcount test/ testout/
If you want to find a copy of the Wordcount.java file, look in the file hadoop-mapreduce-examples-2.2.0-sources.jar. Jar files are just zip files, so you can unzip it and find what you need.
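If you just want to see the logic that a Wordcount program implements, here’s a plain-Python sketch of the same map (split lines into words) and reduce (sum counts per word) steps, with no Hadoop required:

```python
from collections import Counter

def wordcount(lines):
    """Map each line to its words, then reduce by summing per-word counts,
    mirroring the classic Hadoop Wordcount example."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())  # map step: emit each word
    return dict(counts)              # reduce step: summed counts

lines = ["hello world", "hello hadoop"]
print(wordcount(lines))  # {'hello': 2, 'world': 1, 'hadoop': 1}
```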
Events and customization¶
Each connector is a complete Linux (Ubuntu) environment that can be completely configured. In fact, the connector is just a normal Docker container with a few extra scripts and packages pre-configured. That means you can install additional packages or include new code. Afterwards, it’s easy to save the entire state.
Connectors are customized using scripts that reside under /service/runscripts. You should see a set of directories, one for each type of event that Ferry produces. For example, the start directory contains scripts that are executed when the connector is first started. Likewise, there are events for:
- start: triggered when the connector is first started
- restart: triggered when the connector is restarted
- stop: triggered when the connector is stopped
- test: triggered when the connector is asked to perform a test
If you look in the test directory, you’ll find some example programs that you can execute. You can add your own scripts to these directories, and they’ll be executed in alphanumeric order.
Saving everything¶
Once you’ve installed all your packages and customized the runscripts, you’ll probably want to save your progress. You can do this by typing:
$ ferry snapshot sa-0
sn-sa-0-81a67d8e-b75b-4919-9a65-50554d183b83
$ ferry snapshots
UUID Base Date
-------------------------------------------- ------ --------------------
sn-sa-4-81a67d8e-b75b-4919-9a65-50554d183b83 hadoop 02/5/2014 (02:02 PM)
$ ferry start sn-sa-0-81a67d8e-b75b-4919-9a65-50554d183b83
sa-1
This will produce a snapshot that you can restart later. You can create as many snapshots as you want.
More resources¶
Most of these examples can also be found in the hadoop-client connector. Just navigate to /service/runscripts/test and you’ll find a couple scripts that basically do what we just documented.
Hadoop is fairly complicated with many moving pieces and libraries. Hopefully ferry will make it easier for you to get started. Once you’re comfortable with these examples, here are some additional resources to learn more.
Getting started with MongoDB¶
MongoDB is a popular document store that makes it super simple store, retrieve, and analyze JSON-based datasets. It is often used as a transactional store for web applications. Although it’s not strictly a “big data” tool, it is an important element in many applications.
To get started, let’s define our stack in a file (let’s call it mongo.yaml). The file should look something like this:
backend:
  - storage:
      personality: "mongodb"
      instances: 1
connectors:
  - personality: "ferry/mongodb-client"
Although MongoDB does support sharding for scalability, Ferry currently does not support this (as of version 0.2.2). Consequently, you probably want just a single instance of MongoDB.
Running an example¶
Once you have the application defined, go ahead and start it. Once it’s started login to your client to see what environment variables have been populated by MongoDB.
$ env | grep BACKEND
BACKEND_STORAGE_TYPE=mongodb
BACKEND_STORAGE_MONGO=10.1.0.1
BACKEND_STORAGE_MONGO_PASS=eec68b55-9819-4461-bb55-f1494cfe364e
BACKEND_STORAGE_MONGO_USER=9b845bdf-657c-4402-ac45-c08a3c91e3d8
Note that MongoDB by default generates a random username and password for authentication. You’ll need to pass those values into your client before accessing MongoDB.
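From a client script, you’d typically assemble these variables into a connection URI before opening a session. A minimal sketch (the fallback values are placeholders for running outside a connector; 27017 is MongoDB’s default port):

```python
import os

# Build a MongoDB connection URI from the Ferry-provided variables.
# The defaults below are illustrative placeholders only.
host = os.environ.get("BACKEND_STORAGE_MONGO", "127.0.0.1")
user = os.environ.get("BACKEND_STORAGE_MONGO_USER", "user")
password = os.environ.get("BACKEND_STORAGE_MONGO_PASS", "pass")

# 27017 is the MongoDB default port.
uri = f"mongodb://{user}:{password}@{host}:27017/"
print(uri)
```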
Getting started with Open MPI¶
MPI is a popular parallel programming tool that abstracts various communication patterns and makes it relatively simple to coordinate code running across many machines. Unlike platforms such as Hadoop, MPI relies on a separate shared filesystem. In our case, we’ll use GlusterFS, a distributed filesystem from Redhat.
The first thing to do is define our stack in a file (let’s call it openmpi.yaml). The file should look something like this:
backend:
  - storage:
      personality: "gluster"
      instances: 2
    compute:
      - personality: "mpi"
        instances: 2
connectors:
  - personality: "mpi-client"
    name: "control-0"
There are two main sections: the backend and connectors. In this example, we’re defining a single storage backend and a single compute backend, running two instances each of gluster and mpi.
We’ll also instantiate an MPI connector. The client will automatically mount the Gluster volume and contain all the necessary configuration to launch new MPI jobs. By default the Gluster volume is mounted under /service/data. Of course you can remount the directory to wherever you like. Once you’ve started your application and logged into your client, type mount to see the mount configuration.
Note that we’ve assigned a name to our client (control-0). This is an optional user-defined value. It helps if you have multiple clients and you want a simple way to ssh into a specific client. That capability is illustrated in the next section.
Running an example¶
Now that we’ve defined our stack, let’s start it up. Just type the following in your terminal:
$ ferry start openmpi
sa-0
$ ferry ps
UUID Storage Compute Connectors Status Base Time
---- ------- --------- ---------- ------- ------- ----
sa-0 se-0 [u'se-1'] se-2 running openmpi --
The entire process should take about 20 seconds. Before we continue, let’s take a step back to examine what just happened. After typing start, ferry created the following Docker containers:
- Two Gluster data nodes (sometimes called a “brick”)
- Two Open MPI compute nodes
- A Linux client
Now that the environment is created, let’s interact with it by connecting to the Linux client. Just type ferry ssh sa-0 in your terminal. By default, the ssh command will log you into the first client. If you have multiple clients and you’ve assigned them names, you can specify the client by typing ferry ssh sa-0 control-0 (where control-0 is the name you’ve defined for that client).
Once you’re logged in, let’s check what environment variables have been created. Remember this is all being run from the connector.
$ env | grep BACKEND
BACKEND_STORAGE_TYPE=gluster
BACKEND_STORAGE_IP=10.1.0.3
BACKEND_COMPUTE_TYPE=openmpi
BACKEND_COMPUTE_IP=10.1.0.5
Notice there are two sets of environment variables, one for the storage and the other for the compute.
Now if you’re really impatient to get an Open MPI application working, just type the following into the terminal:
$ /service/runscripts/test/test01.sh
It will take a few seconds to complete, but you should see some output that comes from executing the application. If you want to know what you just did, take a peek at the /service/runscripts/test/test01.sh file.
Ok, now let’s actually compile some code and run it. Here’s a super simple hello world example:
#include <mpi.h>
#include <iostream>

int main(int argc, char **argv)
{
    int numprocs, rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        std::cout << "master (" << rank << "/" << numprocs << ")\n";
    } else {
        std::cout << "slave (" << rank << "/" << numprocs << ")\n";
    }

    MPI_Finalize();
    return 0;
}
All it does is initialize MPI, determine whether each process is the master or a slave, and print some information to the console. We can compile and run this example by typing the following in a terminal:
$ su ferry
$ mpic++ -W -Wall /service/examples/helloworld.cpp -o /service/data/binaries/helloworld.o
$ mpirun -np 4 --hostfile /usr/local/etc/instances /service/data/binaries/helloworld.o
Note that we must pass the instances file to mpirun. This file contains the set of Open MPI hosts that can execute the code.
Although this example does not read or write to shared storage, everything under /service/data is shared across all the Open MPI nodes and the Linux client.
A YARN example¶
In addition to Open MPI, you can also create a YARN compute cluster that uses GlusterFS for storage. YARN is the next-generation Hadoop compute layer that enables more flexibility compared to the old MapReduce API. The configuration file will look something like this:
{
  "backend": [
    {
      "storage": {
        "personality": "gluster",
        "instances": 2
      },
      "compute": [
        {
          "personality": "yarn",
          "instances": 2
        }
      ]
    }
  ],
  "connectors": [
    {"personality": "hadoop-client"}
  ]
}
Note that under compute, we’ve replaced the mpi section with a yarn section. After starting this stack, you should be able to run normal Hadoop and Hive applications. You can find some examples under /service/runscripts/test.
Events and customization¶
Each connector is a complete Linux (Ubuntu) environment that can be completely configured. In fact, the connector is just a normal Docker container with a few extra scripts and packages pre-configured. That means you can install additional packages or include new code. Afterwards, it’s easy to save the entire state.
Connectors are customized using scripts that reside under /service/runscripts. You should see a set of directories, one for each type of event that Ferry produces. For example, the start directory contains scripts that are executed when the connector is first started. Likewise, there are events for:
- start: triggered when the connector is first started
- restart: triggered when the connector is restarted
- stop: triggered when the connector is stopped
- test: triggered when the connector is asked to perform a test
If you look in the test directory, you’ll find some example programs that you can execute. You can add your own scripts to these directories, and they’ll be executed in alphanumeric order.
Saving everything¶
Once you’ve installed all your packages and customized the runscripts, you’ll probably want to save your progress. You can do this by typing:
$ ferry snapshot sa-0
sn-sa-0-81a67d8e-b75b-4919-9a65-50554d183b83
$ ferry snapshots
UUID Base Date
-------------------------------------------- ------- --------------------
sn-sa-4-81a67d8e-b75b-4919-9a65-50554d183b83 openmpi 02/5/2014 (02:02 PM)
$ ferry start sn-sa-0-81a67d8e-b75b-4919-9a65-50554d183b83
sa-1
This will produce a snapshot that you can restart later. You can create as many snapshots as you want.
More resources¶
MPI is relatively complex compared to other more recent frameworks such as Hadoop, but is very useful for applications that require complex coordination. Here are some additional resources you can use to learn more.
Getting started with Spark¶
Apache Spark is a new parallel, in-memory processing framework from U.C. Berkeley’s AMPLab. Spark has two main advantages relative to other similar frameworks. First, because of its in-memory design, Spark is able to run certain computations much faster, for example machine learning algorithms. Second, Spark has a very clean API. Spark is written in Scala, a functional language for the JVM, but also has strong support for Python. That said, Spark remains compatible with Hadoop and can easily read and write data from HDFS.
The first thing to do is to define our application stack. Let’s call our new application stack spark.yaml. The file should look something like this:
backend:
  - storage:
      personality: "hadoop"
      instances: 2
    compute:
      - personality: "spark"
        instances: 2
connectors:
  - personality: "ferry/spark-client"
Note that Spark relies on Hadoop for the actual data storage. Ferry runs Spark in “stand-alone” mode (deeper YARN integration will come in the future). So to create the actual Spark cluster, we’ll need to specify the compute section. Finally, we’ll need at least one spark-client so that we can launch jobs and interact with the rest of the cluster.
Running an example¶
Now that we’ve defined our stack, let’s start it up. Don’t forget that you need the Ferry server to be up and running (via sudo ferry server). Afterwards type ferry start spark into your terminal. spark should be replaced with the path to your specific file. Otherwise it will use a default Spark stack. The entire process should take less than a minute.
Before we continue, let’s take a step back to examine what just happened. After typing start, ferry created the following Docker containers:
- Two Hadoop data nodes
- Hadoop namenode
- Hadoop YARN resource manager
- Two Spark compute nodes
- A Linux client
Now that the environment is created, let’s interact with it by connecting to the Linux client. Just type ferry ssh sa-0 (where sa-0 is replaced with your application ID). Once you’re logged in, you should be able to run all the examples. Remember, the connector is just a Docker container. That means you can completely customize the environment, including installing packages and modifying configuration files.
Python Examples¶
Now we should be able to run Spark jobs. If you’re really impatient, you can run some Python examples by typing:
$ /service/runscripts/test/test01.sh load
$ /service/runscripts/test/test01.sh python regression.py
This downloads some data and runs a linear regression example over that data. You can check out more examples in the directory /service/examples/python. Here’s a Python example showing how to perform collaborative filtering (a popular method for recommendations).
import sys

from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS
from numpy import array

if __name__ == "__main__":
    data_file = '/spark/data/als.data'
    sc = SparkContext(sys.argv[1], "Collaborative Filtering")
    data = sc.textFile(data_file)
    ratings = data.map(lambda line: array([float(x) for x in line.split(',')]))

    # Build the recommendation model using Alternating Least Squares
    model = ALS.train(ratings, 1, 20)

    # Evaluate the model on training data
    testdata = ratings.map(lambda p: (int(p[0]), int(p[1])))
    predictions = model.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2]))
    ratesAndPreds = ratings.map(lambda r: ((r[0], r[1]), r[2])).join(predictions)
    MSE = ratesAndPreds.map(lambda r: (r[1][0] - r[1][1])**2).reduce(lambda x, y: x + y) / ratesAndPreds.count()
    print("Mean Squared Error = " + str(MSE))
As you can see, the source is fairly short for what it does. Spark includes the MLlib machine-learning library, which simplifies creating advanced data mining algorithms. If you specified more than a single node for your Spark cluster, this example will run (virtually) in parallel.
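The mean squared error computed at the end is just the average squared gap between actual and predicted ratings. In plain Python, with made-up numbers:

```python
# Mean squared error over (actual, predicted) rating pairs, the same
# metric the Spark example computes. The sample data is made up.
pairs = [(5.0, 4.5), (3.0, 3.5), (1.0, 1.0)]

mse = sum((actual - pred) ** 2 for actual, pred in pairs) / len(pairs)
print(mse)  # (0.25 + 0.25 + 0.0) / 3
```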
If you want to run your own Python application, just type the following (as the ferry user):
$ $SPARK_HOME/bin/pyspark my_spark_app.py spark://$BACKEND_COMPUTE_MASTER:7077
More resources¶
Once you’re done running the built-in examples, check out these additional resources to learn more.
Multiple Storage and Compute¶
Although many applications consist of a single storage layer, you can actually combine multiple storage layers. This lets you construct complex applications that use the storage in different ways. For example, here we’re using both Hadoop (HDFS) and MongoDB, a popular document store.
backend:
  - storage:
      personality: "hadoop"
      instances: ">=1"
      layers:
        - "hive"
  - storage:
      personality: "mongodb"
      instances: 1
connectors:
  - personality: "ferry/hadoop-client"
Hadoop is great at storing the archival data for analytic purposes, while MongoDB may be used for transactional data. Once you’re logged into the client, you’ll find environment variables that refer to both sets of storage.
One word of warning: the built-in client for Hadoop doesn’t have MongoDB drivers installed, so you’ll want to install those before anything else.
You can also combine multiple compute layers. For example, we might create a data science application that uses both Hadoop/YARN and OpenMPI.
backend:
  - storage:
      personality: "gluster"
      instances: 1
    compute:
      - personality: "yarn"
        instances: 1
        layers:
          - "hive"
      - personality: "openmpi"
        instances: 1
connectors:
  - personality: "ferry-user/data-science"
In this example, we’re supplying our own custom client that has both YARN and Open MPI drivers installed.
Reference¶
Command-Line¶
Run ferry --help to get a list of available commands.
deploy¶
Deploy a service in an operational setting
$ ferry deploy sn-fea3x --mode=local --conf=opconf
Deploying an application pushes your connector images to the cloud and enables other users to interact with your application. Deployments support multiple modes and configuration options. Use ferry deploy --help to view these options.
This feature is experimental
help¶
Print out help information. Help for specific commands can be invoked by typing ferry CMD --help.
inspect¶
Print out detailed information about an application (either running or installed)
For example:
$ ferry inspect sa-0
This command prints out detailed information regarding the service, including the list of all the docker containers that make up the service. Note that sa-0 is the unique ID of a running service.
install¶
Build or rebuild all the necessary Docker images
For example:
$ sudo ferry install -k $MY_KEY_DIRECTORY
Images can be built with a custom public/private key pair by specifying the -k parameter.
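For example, a fresh key pair could be generated with ssh-keygen before running the install. A minimal sketch; the directory layout and key file name are illustrative assumptions:

```shell
# Generate a throwaway RSA key pair to pass to `ferry install -k`.
# The directory and file names are assumptions, not Ferry requirements.
MY_KEY_DIRECTORY=$(mktemp -d)
ssh-keygen -q -t rsa -b 2048 -N "" -f "$MY_KEY_DIRECTORY/id_rsa"
ls "$MY_KEY_DIRECTORY"
# Then: sudo ferry install -k $MY_KEY_DIRECTORY
```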
logs¶
Copy over the logs to the host directory
For example:
$ ferry logs sa-0 $LOGDIR
Note that sa-0 is the unique ID of a running service, and LOGDIR is a directory on the host where the logs should be copied.
ls¶
List installed applications
$ ferry ls
App Author Version Description
--- ------ ------- -----------
cassandra James Horey 0.2.0 Cassandra stack...
hadoop James Horey 0.2.0 Hadoop stack...
openmpi James Horey 0.2.0 Open MPI over Gluster...
spark James Horey 0.2.0 Spark over Hadoop...
yarn James Horey 0.2.0 Hadoop YARN over Glus...
ps¶
List running applications
By default this command will only print out running applications. You can print out stopped and terminated applications by typing: ferry ps -a.
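The tabular output is easy to script against. For instance, here is a small shell sketch that pulls out the UUIDs of stopped applications. A captured copy of the earlier sample output is embedded so the sketch is self-contained; in practice you would pipe `ferry ps -a` straight into `awk`:

```shell
# Extract the UUIDs of stopped applications from `ferry ps -a` output.
# Normally: ferry ps -a | awk 'NR > 2 && $5 == "stopped" { print $1 }'
stopped=$(awk 'NR > 2 && $5 == "stopped" { print $1 }' <<'EOF'
UUID Storage Compute   Connectors Status  Base      Time
---- ------- --------- ---------- ------- -------   ----
sa-2 se-6    [u'se-7'] se-8       removed hadoop    --
sa-1 se-3    [u'se-4'] se-5       stopped openmpi   --
sa-0 se-0    [u'se-1'] se-2       running cassandra --
EOF
)
echo "$stopped"   # prints "sa-1"
```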
pull¶
Download either individual Docker images or complete Ferry applications
For example:
$ ferry pull app://<user>/<app>
$ ferry pull image://<user>/<image>
If you download an application, all the necessary images will download automatically.
push¶
Upload either individual Docker images or complete Ferry applications
For example:
$ ferry push app:///home/ferry/myapp.yml
$ ferry push image://<user>/<image>
If you upload an application, all the necessary images will be uploaded automatically to the Docker registry specified in the Ferry authorization file.
rm¶
Remove a stopped service
For example:
$ ferry rm sa-0
Note that sa-0 refers to a stopped application. This removes all data associated with the stack, including connector information. It is highly recommended to snapshot the state before removing an application. After removal, the application may still appear in the ps list for a short time.
server¶
Start the ferry daemon
$ sudo ferry server
The ferry server controls all interaction with the actual service and must be running to do anything.
ssh¶
SSH into a running connector
For example:
$ ferry ssh sa-0 client-0
Note that sa-0 refers to the unique service ID and client-0 refers to the user-defined connector name. If the connector name is not supplied, ferry will attempt to connect to the first available connector.
start¶
Start or restart an application
For example:
$ ferry start openmpi
$ ferry start sa-0
$ ferry start sn-aee3f...
The application may be new, a stopped application, or a snapshot.
stop¶
Stop, but do not delete, a running application
For example:
$ ferry stop sa-0
$ ferry ps
UUID Storage Compute Connectors Status Base Time
---- ------- --------- ---------- ------- ------- ----
sa-0 se-0 [u'se-1'] se-2 stopped hadoop --
Note that sa-0 is the unique ID of the running service. After the service is stopped, the service can be restarted. All state in the connectors is preserved across start/restart events.
snapshot¶
Take a snapshot of an application
For example:
$ ferry snapshot sa-0
Note that sa-0 refers to either a running or stopped service. A snapshot saves all the connector state associated with a running service. The user can create multiple snapshots.
snapshots¶
List all the available snapshots
For example:
$ ferry snapshots
UUID Base Date
-------------------------------------------- ------ --------------------
sn-sa-4-81a67d8e-b75b-4919-9a65-50554d183b83 hadoop 02/5/2014 (02:02 PM)
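Since a snapshot UUID can be passed back to ferry start (see above), it can be handy to grab it from this listing. A small sketch, again run against a captured copy of the sample output so it is self-contained; in practice you would pipe `ferry snapshots` into `awk`:

```shell
# Pull the snapshot UUID out of `ferry snapshots` output.
# Normally: ferry snapshots | awk 'NR > 2 { print $1 }'
snap=$(awk 'NR > 2 { print $1 }' <<'EOF'
UUID                                         Base   Date
-------------------------------------------- ------ --------------------
sn-sa-4-81a67d8e-b75b-4919-9a65-50554d183b83 hadoop 02/5/2014 (02:02 PM)
EOF
)
echo "$snap"
# Then restore it with: ferry start "$snap"
```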
quit¶
Stop the Ferry servers
This will gracefully shut down the servers controlling Ferry. This command must be executed via sudo.