Giraph - Giraph Input/Output with Gora

This project has retired. For details please refer to its Attic page.
Giraph Input/Output with Gora
Last Published: 2020-08-11
Version: 1.4.0-SNAPSHOT
Overview
The Apache Gora project is an open source framework which provides an in-memory
data model and persistence for big data. Gora supports persisting to column
stores, key-value stores, document stores and RDBMSs, and
analyzing the data with extensive Apache Hadoop MapReduce support.
The main motivation for integrating these two Apache projects is to turn
Gora-supported NoSQL data stores into Giraph-processable graphs, and to give
Giraph the ability to store its results in different data stores, letting
users focus on the processing itself.
Gora works by defining the data model, i.e. how our data is going to be
stored, using a JSON-like schema inspired by Apache Avro, and by doing the
physical mapping to the data store using an XML file.
The former helps us generate the data beans which will be read from or written
into different data stores, and the latter helps us define which data
bean should go where.

In this way, Giraph will be able to read/write data using three files:
- The generated data beans representing our data model.
- The XML mapping file representing our physical mapping.
- A file called gora.properties containing configurations related to which data store Gora will use.
[Figure: diagram showing how the Giraph and Gora integration works.]
Generating DataBeans
So the first thing we have to do is define our data model using a JSON-like schema. Here is
a schema resembling graphs stored inside Apache HBase through Gora. The following shows a schema
for a vertex:
{"type": "record",
 "name": "Vertex",
 "namespace": "org.apache.giraph.gora.generated",
 "fields" : [
   {"name": "vertexId", "type": "long"},
   {"name": "value", "type": "float"},
   {"name": "edges",
    "type": {
      "type":"array", "items": {
        "name": "Edge",
        "type": "record",
        "namespace": "org.apache.giraph.gora.generated",
        "fields": [
          {"name": "vertexId", "type": "long"},
          {"name": "edgeValue", "type": "float"}
        ]
      }
    }
   }
 ]
}
And this other schema shows what a schema for an edge should look like:
{"type": "record",
 "name": "GEdge",
 "namespace": "org.apache.giraph.gora.generated",
 "fields" : [
   {"name": "edgeId", "type": "string"},
   {"name": "edgeWeight", "type": "float"},
   {"name": "vertexInId", "type": "string"},
   {"name": "vertexOutId", "type": "string"},
   {"name": "label", "type": "string"}
 ]
}
Now we are ready to generate our data beans. To do this, we need to use gora-core.jar, which
comes with Giraph. The gora-compiler takes three parameters:
- REQUIRED - individual avsc file to be compiled or a directory path containing avsc files
- REQUIRED - output directory for generated Java files
<-license id> - the preferred license header to add to the
Executing the Gora compiler with the commands below creates the generated data beans
in the given output path.
java -jar gora-core-0.4-SNAPSHOT.jar org.apache.gora.compiler.GoraCompiler.class vertex.avsc gora-app/src/main/java/
java -jar gora-core-0.4-SNAPSHOT.jar org.apache.gora.compiler.GoraCompiler.class edge.avsc gora-app/src/main/java/
This will result in a Java class which will look similar to this:
/**
 * Class for defining a Giraph-Vertex.
 */
@SuppressWarnings("all")
public class GVertex extends PersistentBase {
  /**
   * Schema used for the class.
   */
  public static final Schema OBJ_SCHEMA = Schema.parse(
      "{\"type\":\"record\",\"name\":\"Vertex\"," +
      "\"namespace\":\"org.apache.giraph.gora.generated\"," +
      "\"fields\":[{\"name\":\"vertexId\",\"type\":\"string\"}," +
      "{\"name\":\"value\",\"type\":\"float\"},{\"name\":\"edges\"," +
      "\"type\":{\"type\":\"map\",\"values\":\"string\"}}]}");

  /**
   * Vertex Id
   */
  private Utf8 vertexId;

  /**
   * Gets vertexId
   * @return Utf8 vertexId
   */
  public Utf8 getVertexId() {
    return (Utf8) get(0);
  }

  /**
   * Sets vertexId
   * @param value vertexId
   */
  public void setVertexId(Utf8 value) {
    put(0, value);
  }
  . . .
}
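The accessors in the generated class delegate to positional get(0) and put(0, value) calls. A minimal, self-contained sketch of that pattern is given here. IndexedRecord and VertexRecord are hypothetical stand-ins written for illustration (they are not Gora classes), and a plain String replaces Utf8 so that the example needs no external libraries.

```java
public class IndexedBeanSketch {
  /** Hypothetical stand-in for a base class that stores fields by position. */
  static class IndexedRecord {
    private final Object[] fields;

    IndexedRecord(int fieldCount) {
      this.fields = new Object[fieldCount];
    }

    Object get(int index) {
      return fields[index];
    }

    void put(int index, Object value) {
      fields[index] = value;
    }
  }

  /** Hypothetical bean: field 0 is the vertex id, field 1 is the vertex value. */
  static class VertexRecord extends IndexedRecord {
    VertexRecord() {
      super(2);
    }

    String getVertexId() {
      return (String) get(0);
    }

    void setVertexId(String value) {
      put(0, value);
    }

    Float getValue() {
      return (Float) get(1);
    }

    void setValue(Float value) {
      put(1, value);
    }
  }

  public static void main(String[] args) {
    VertexRecord vertex = new VertexRecord();
    vertex.setVertexId("1");
    vertex.setValue(0.5f);
    System.out.println(vertex.getVertexId() + " -> " + vertex.getValue());
  }
}
```

The typed getters and setters are thin wrappers; the field order is what ties each accessor to a field of the schema.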
Once this logical data modeling is done, the physical mapping between these generated
classes and the actual data repositories has to be made. Gora does this by using an
XML "mapping file".
A gora-hbase-mapping.xml file holds the necessary information to map our data
model into HBase tables. Within the table tags, the necessary column families
are defined. Moreover, within the class tags, the actual generated Java bean
is mapped onto the column families. Inside these, each field should be mapped
to its respective column family and to the HBase qualifier to be used for
storing this field.
This mapping file can contain as many mappings as the generated data beans our
application uses, i.e. we can define more table tags with their own class tags
and fields.

A more complex file can be found inside the giraph-gora/conf folder.
Preparation
Once the data beans have been generated, the gora.properties file has to be
created. This file specifies which data store is going to be used with
Gora, and also contains extra information about that data store. An example of
such a file can be found inside the giraph-gora/conf folder. Following
our example, since we decided to use Apache HBase, gora.properties
should contain the corresponding configuration, as shown below:
# FOR HBASE DATASTORE
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
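As a quick sanity check before launching a job, the properties file can be parsed and the configured data store read back. This is only a sketch using the standard java.util.Properties class; the class and method names (GoraPropertiesCheck, defaultDataStore) are made up for illustration and are not part of Gora or Giraph.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

public class GoraPropertiesCheck {
  /** Property key taken from the gora.properties example above. */
  static final String DEFAULT_STORE_KEY = "gora.datastore.default";

  /** Parses properties text and returns the configured default data store class name. */
  static String defaultDataStore(String propertiesText) {
    Properties props = new Properties();
    try {
      props.load(new StringReader(propertiesText));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    String store = props.getProperty(DEFAULT_STORE_KEY);
    if (store == null || store.trim().isEmpty()) {
      throw new IllegalStateException(DEFAULT_STORE_KEY + " is not set");
    }
    return store.trim();
  }

  public static void main(String[] args) {
    String text = "# FOR HBASE DATASTORE\n"
        + "gora.datastore.default=org.apache.gora.hbase.store.HBaseStore\n";
    System.out.println(defaultDataStore(text));
  }
}
```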
Then, to be able to use the Gora API, the user needs to prepare the Gora environment.
This means nothing more than having set up one of the data stores Gora supports,
having generated the data beans, and having set up the gora.properties file. A more
detailed yet simple tutorial can be found here.
The data definition files should be available in the classpath when the
Giraph job is run. All configuration files needed for each specific data
store should also be made available across the cluster. For example, if we were
to use HBase along with Giraph and Gora, then the hbase-site.xml file should be passed
along as well. There are several ways to make these files available, and one common
way is the -files option. This option would look similar to this:
-files ../conf/gora.properties,../conf/gora-hbase-mapping.xml,../conf/hbase-site.xml
Gora also needs to be told which serialization types it will use. These serialization
types could be configured across the cluster, but if that is not desired, they can be
passed using the -D option of Hadoop. This option would look similar to this:
-Dio.serializations=org.apache.hadoop.io.serializer.WritableSerialization,org.apache.hadoop.io.serializer.JavaSerialization
Configuration Options
Now that the data beans have been generated and the Gora environment is ready,
the user needs to know the configuration options for this API in order to
specify them. These configurations are as follows:
label | type | description
giraph.gora.datastore.class | String | Gora DataStore class to read data from - required.
giraph.gora.key.class | String | Gora Key class to query the datastore - required.
giraph.gora.persistent.class | String | Gora Persistent class to read objects from Gora - required.
giraph.gora.start.key | String | Gora start key to query the datastore.
giraph.gora.end.key | String | Gora end key to query the datastore.
giraph.gora.keys.factory.class | String | Keys factory to convert strings into desired keys - required.
giraph.gora.output.datastore.class | String | Gora DataStore class to write data to - required.
giraph.gora.output.key.class | String | Gora Key class to write to datastore - required.
giraph.gora.output.persistent.class | String | Gora Persistent class to write to Gora - required.
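Since several of these options are marked as required, it can help to check them before submitting a job. The sketch below is an illustration only: it works on a plain map of option names to values, and the class and method names (GoraOptionsCheck, missing) are hypothetical, not part of Giraph. The option names themselves are the ones listed above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GoraOptionsCheck {
  /** Options marked "required" above for reading from Gora. */
  static final List<String> REQUIRED_INPUT = Arrays.asList(
      "giraph.gora.datastore.class",
      "giraph.gora.key.class",
      "giraph.gora.persistent.class",
      "giraph.gora.keys.factory.class");

  /** Options marked "required" above for writing to Gora. */
  static final List<String> REQUIRED_OUTPUT = Arrays.asList(
      "giraph.gora.output.datastore.class",
      "giraph.gora.output.key.class",
      "giraph.gora.output.persistent.class");

  /** Returns the required option names that are absent or blank, in list order. */
  static List<String> missing(Map<String, String> options, List<String> required) {
    List<String> absent = new ArrayList<>();
    for (String key : required) {
      String value = options.get(key);
      if (value == null || value.trim().isEmpty()) {
        absent.add(key);
      }
    }
    return absent;
  }

  public static void main(String[] args) {
    Map<String, String> options = new HashMap<>();
    options.put("giraph.gora.datastore.class", "org.apache.gora.hbase.store.HBaseStore");
    options.put("giraph.gora.key.class", "java.lang.String");
    System.out.println("Missing input options: " + missing(options, REQUIRED_INPUT));
    System.out.println("Missing output options: " + missing(options, REQUIRED_OUTPUT));
  }
}
```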
Input/Output Example
To make use of the Giraph input API available for Gora, it is required to extend the
classes GoraVertexInputFormat or GoraEdgeInputFormat. In the first class, the only
method that has to be implemented is transformVertex, which transforms a Gora object
into a Giraph Vertex object. Likewise, for the second class the methods that have to
be implemented are transformEdge, which converts a Gora edge object into a Giraph
Edge object, and getCurrentSourceId. There are two examples of such implementations,
GoraGVertexVertexInputFormat and GoraGEdgeEdgeInputFormat. One other class that has
to be implemented here is the KeyFactory, because this class is used to transform the
keys passed as strings through the options into the actual Gora key objects used to
query the data store. The default one assumes your key type is a String.
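The role of the key factory can be sketched without any Giraph or Gora dependency. In the example below, StringKeyFactory, StringKeys and LongKeys are hypothetical names introduced for illustration, and the real KeyFactory class may have a different shape. The sketch only shows the idea of turning the string values of giraph.gora.start.key and giraph.gora.end.key into typed key objects.

```java
public class KeyFactorySketch {
  /** Hypothetical stand-in for a key factory: turns an option string into a key object. */
  interface StringKeyFactory {
    Object buildKey(String keyString);
  }

  /** Mirrors the default behaviour described above: the key type is a String. */
  static final class StringKeys implements StringKeyFactory {
    @Override
    public Object buildKey(String keyString) {
      return keyString;
    }
  }

  /** Example of a custom factory for a data store keyed by long values. */
  static final class LongKeys implements StringKeyFactory {
    @Override
    public Object buildKey(String keyString) {
      return Long.valueOf(keyString.trim());
    }
  }

  public static void main(String[] args) {
    // The start and end keys arrive as plain strings through the options.
    StringKeyFactory factory = new LongKeys();
    Object start = factory.buildKey("0");
    Object end = factory.buildKey("10");
    System.out.println(start.getClass().getSimpleName() + ": " + start + " .. " + end);
  }
}
```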
On the other hand, to make use of the Giraph output API available for Gora,
it is required to extend the classes GoraVertexOutputFormat or GoraEdgeOutputFormat.
In the first class, the methods that have to be implemented are getGoraVertex, which
transforms a Giraph Vertex object into a Gora object, and getGoraKey, which determines
the key that will represent such a vertex. Likewise, for the Edge output class the
methods that have to be implemented are getGoraEdge, which converts a Giraph Edge
object into a Gora edge object, and getGoraKey, which determines the key that will
represent such an edge. There are two examples of such implementations,
GoraGVertexVertexOutputFormat and GoraGEdgeEdgeOutputFormat.
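The two output hooks for vertices amount to a pair of conversions: one from a vertex to a bean, and one from a vertex to its key. The sketch below illustrates this with plain stand-in types; SimpleVertex and VertexBean are hypothetical and only loosely mirror the string id and string-valued edge map of the generated class shown earlier. The real methods operate on Giraph Vertex objects and Gora persistent beans.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class VertexOutputSketch {
  /** Stand-in for a Giraph vertex: id, value and outgoing edges (target id to weight). */
  static final class SimpleVertex {
    final long id;
    final float value;
    final Map<Long, Float> edges = new LinkedHashMap<>();

    SimpleVertex(long id, float value) {
      this.id = id;
      this.value = value;
    }
  }

  /** Stand-in for a generated bean with a string id and a string-valued edge map. */
  static final class VertexBean {
    String vertexId;
    float value;
    final Map<String, String> edges = new LinkedHashMap<>();
  }

  /** Plays the role of getGoraKey: chooses the key under which the vertex is stored. */
  static String toGoraKey(SimpleVertex vertex) {
    return Long.toString(vertex.id);
  }

  /** Plays the role of getGoraVertex: copies the vertex data into a bean. */
  static VertexBean toGoraVertex(SimpleVertex vertex) {
    VertexBean bean = new VertexBean();
    bean.vertexId = Long.toString(vertex.id);
    bean.value = vertex.value;
    for (Map.Entry<Long, Float> edge : vertex.edges.entrySet()) {
      bean.edges.put(Long.toString(edge.getKey()), Float.toString(edge.getValue()));
    }
    return bean;
  }

  public static void main(String[] args) {
    SimpleVertex vertex = new SimpleVertex(1L, 0.0f);
    vertex.edges.put(2L, 1.5f);
    VertexBean bean = toGoraVertex(vertex);
    System.out.println(toGoraKey(vertex) + " -> " + bean.edges);
  }
}
```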
An example command showing how to put together all these classes and configurations
is shown below. This command computes shortest paths on the graph database shown
previously.
export GIRAPH_CORE_JAR=$GIRAPH_CORE_TARGET_DIR/giraph-$GIRAPH_VERSION-for-$HADOOP_VERSION-jar-with-dependencies.jar
export GIRAPH_EXAMPLES_JAR=$GIRAPH_EXAMPLES_TARGET_DIR/giraph-examples-$GIRAPH_VERSION-for-$HADOOP_VERSION-jar-with-dependencies.jar
export GIRAPH_GORA_JAR=$GIRAPH_GORA_TARGET_DIR/giraph-gora-$GIRAPH_VERSION-SNAPSHOT-jar-with-dependencies.jar
export GORA_HBASE_JAR=$GORA_HBASE_TARGET_DIR/gora-hbase-$GORA_VERSION.jar
export HBASE_JAR=$GORA_DIR/gora-hbase/lib/hbase-0.90.4.jar
export HADOOP_CLASSPATH=$GIRAPH_CORE_JAR:$GIRAPH_EXAMPLES_JAR:$GIRAPH_GORA_JAR:$GORA_HBASE_JAR
hadoop jar $GIRAPH_EXAMPLES_JAR org.apache.giraph.GiraphRunner \
  -files ../conf/gora.properties,../conf/gora-hbase-mapping.xml,../conf/hbase-site.xml \
  -Dio.serializations=org.apache.hadoop.io.serializer.WritableSerialization,org.apache.hadoop.io.serializer.JavaSerialization \
  -Dgiraph.gora.datastore.class=org.apache.gora.hbase.store.HBaseStore \
  -Dgiraph.gora.key.class=java.lang.String \
  -Dgiraph.gora.persistent.class=org.apache.giraph.io.gora.generated.GEdge \
  -Dgiraph.gora.start.key=0 \
  -Dgiraph.gora.end.key=10 \
  -Dgiraph.gora.keys.factory.class=org.apache.giraph.io.gora.utils.KeyFactory \
  -Dgiraph.gora.output.datastore.class=org.apache.gora.hbase.store.HBaseStore \
  -Dgiraph.gora.output.key.class=java.lang.String \
  -Dgiraph.gora.output.persistent.class=org.apache.giraph.io.gora.generated.GEdgeResult \
  -libjars $GIRAPH_GORA_JAR,$GORA_HBASE_JAR,$HBASE_JAR \
  org.apache.giraph.examples.SimpleShortestPathsComputation \
  -eif org.apache.giraph.io.gora.GoraGEdgeEdgeInputFormat \
  -eof org.apache.giraph.io.gora.GoraGEdgeEdgeOutputFormat \
  -w 1
