Sunday 16 October 2016

Big Data Computational Models


Most distributed processing engines come with one or more of three different computational models, namely batch, micro-batch, and continuous flow operator.
  • Batch model: The batch model processes data at rest, taking a large amount of data at once, processing it, and then writing the output to a file system or data store.
  • Micro-batch model: Micro-batching combines aspects of both the batch and continuous flow operator models. In this model, data is gathered over a particular time interval and then processed. Micro-batching is essentially a "collect and then process" kind of computational model.
  • Continuous flow operator model: This model processes each record as it arrives, without any delay for collecting data before processing it.
To better understand the micro-batch and continuous flow models, assume you want to supply water to a tank on the roof of a building. There are two ways to do this: you can first collect the water in a basement tank and pump it up to the rooftop tank from there, or you can supply the water directly to the rooftop tank through a pipeline. That is essentially the basic difference between the micro-batch and continuous flow operator models.
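The same idea in a few lines of plain Scala (no framework; the record source and batch size are just placeholders for this sketch): micro-batching collects a group of records and then processes them together, while the continuous model handles every record the moment it arrives.

 // Illustrative only: the iterator stands in for an incoming stream of records.

 // Micro-batch: collect records into a group, then process the whole group at once.
 def microBatch(records: Iterator[String], batchSize: Int): Unit =
   records.grouped(batchSize).foreach { batch =>
     batch.foreach(record => println(s"processed $record"))  // process the collected batch
   }

 // Continuous flow: process each record immediately as it arrives, no collection step.
 def continuous(records: Iterator[String]): Unit =
   records.foreach(record => println(s"processed $record"))

 microBatch(Iterator("a", "b", "c", "d"), batchSize = 2)
 continuous(Iterator("a", "b", "c", "d"))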

Apache Spark vs Apache Flink

In the last few years, real-time processing has become one of the hot topics in the market and has gained a lot of value. The emerging tools that provide support for real-time processing include Apache Storm, Apache Spark and Apache Flink. In this blog I am going to compare different features of Spark and Flink.

Feature-by-feature comparison of Apache Spark and Flink:

Exactly-once semantics:
Both Spark Streaming and Flink provide an exactly-once guarantee, which means that every record is processed exactly once, eliminating duplicate processing of data.
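Both engines rely on checkpointing as part of this guarantee, and it has to be switched on explicitly. A minimal sketch (the interval and checkpoint path are placeholders):

 import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
 import org.apache.spark.SparkConf
 import org.apache.spark.streaming.{Seconds, StreamingContext}

 // Flink: periodic distributed snapshots of operator state (exactly-once mode is the default).
 val flinkEnv = StreamExecutionEnvironment.getExecutionEnvironment
 flinkEnv.enableCheckpointing(5000)          // snapshot every 5 seconds

 // Spark Streaming: checkpoint metadata and state to a reliable store such as HDFS.
 val ssc = new StreamingContext(new SparkConf().setAppName("demo").setMaster("local[2]"), Seconds(5))
 ssc.checkpoint("hdfs:///tmp/checkpoints")   // placeholder path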

High Throughput and Fault Tolerance overhead:
Both Spark Streaming and Flink provide very high throughput compared to other processing systems like Storm. The overhead of fault tolerance is also low in both processing engines.

Computational Model:
Where Spark Streaming and Flink differ is in their computational model: Spark has adopted the micro-batching model, whereas Flink has adopted a continuous flow, operator-based streaming model.
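The difference is visible right in the APIs: a Spark StreamingContext is constructed with a batch interval, while a Flink job has no such notion and records simply flow through the operator pipeline. A word-count sketch of both (host, port and interval are placeholders):

 // Spark Streaming: the 2-second interval defines the micro-batches.
 import org.apache.spark.SparkConf
 import org.apache.spark.streaming.{Seconds, StreamingContext}

 val ssc = new StreamingContext(new SparkConf().setAppName("wc").setMaster("local[2]"), Seconds(2))
 val sparkCounts = ssc.socketTextStream("localhost", 9999)
   .flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

 // Flink: each record is processed as it arrives; there is no batch interval at all.
 import org.apache.flink.streaming.api.scala._

 val flinkEnv = StreamExecutionEnvironment.getExecutionEnvironment
 val flinkCounts = flinkEnv.socketTextStream("localhost", 9999)
   .flatMap(_.split(" ")).map((_, 1)).keyBy(0).sum(1)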

Data Windowing/Batching: 
Spark supports only time-based window criteria, whereas Flink supports windows over time, record counts, sessions, data-driven windows, or any custom user-defined window criteria.
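For instance, Flink's keyed streams expose these window types directly. A sketch against the 1.x DataStream API (method names have shifted slightly across versions, and the source is a placeholder):

 import org.apache.flink.streaming.api.scala._
 import org.apache.flink.streaming.api.windowing.time.Time
 import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows

 val env = StreamExecutionEnvironment.getExecutionEnvironment
 val words = env.socketTextStream("localhost", 9999)   // placeholder source
   .flatMap(_.split(" ")).map((_, 1)).keyBy(0)

 words.timeWindow(Time.seconds(10)).sum(1)                              // time-based window
 words.countWindow(100).sum(1)                                          // count-based window
 words.window(EventTimeSessionWindows.withGap(Time.minutes(1))).sum(1)  // session window

Spark Streaming's windowing, by contrast, is limited to the time-based window and reduceByKeyAndWindow operations on DStreams.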

Stream Splitting: 
Flink has a direct API for splitting an input data stream into multiple streams, whereas Spark has no equivalent built-in operation.
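In the DataStream API of that generation this is the split/select pair (later Flink releases favour side outputs for the same purpose); a rough sketch:

 import org.apache.flink.streaming.api.scala._

 val env = StreamExecutionEnvironment.getExecutionEnvironment
 val numbers: DataStream[Int] = env.fromElements(1, 2, 3, 4, 5)

 // Tag each record with one or more output names, then pick sub-streams by name.
 val splitStream = numbers.split(n => if (n % 2 == 0) Seq("even") else Seq("odd"))
 val evens = splitStream.select("even")
 val odds  = splitStream.select("odd")

In Spark the usual workaround is to filter the same DStream once per desired sub-stream.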
 
Complex Event Processing:
Flink comes with a complex event processing (CEP) API, along with support for event time and out-of-order events, while Spark does not have these.
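A rough sketch of what this looks like with the flink-cep Scala API (the TempEvent type, thresholds and sample data are made up for the example, and the exact condition signatures vary between Flink versions):

 import org.apache.flink.streaming.api.TimeCharacteristic
 import org.apache.flink.streaming.api.scala._
 import org.apache.flink.streaming.api.windowing.time.Time
 import org.apache.flink.cep.scala.CEP
 import org.apache.flink.cep.scala.pattern.Pattern

 case class TempEvent(sensor: String, temp: Double)

 val env = StreamExecutionEnvironment.getExecutionEnvironment
 env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime) // a real job would also assign timestamps/watermarks

 val temps: DataStream[TempEvent] =
   env.fromElements(TempEvent("s1", 110.0), TempEvent("s1", 160.0)) // placeholder data

 // "warning followed by critical within 10 seconds", evaluated per sensor.
 val pattern = Pattern.begin[TempEvent]("warning").where(_.temp > 100)
   .next("critical").where(_.temp > 150)
   .within(Time.seconds(10))

 val matches = CEP.pattern(temps.keyBy(_.sensor), pattern)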

Memory Management: 
Spark provides configurable memory management, while Flink provides automatic memory management. Spark has also moved towards automated memory management (unified memory management) in version 1.6.0.
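For reference, Spark 1.6's unified memory manager is still tuned through configuration (the values below are simply that release's defaults), while Flink sizes and manages its operator memory on its own:

 import org.apache.spark.SparkConf

 val conf = new SparkConf()
   .setAppName("memory-demo")
   .set("spark.memory.fraction", "0.75")       // share of the heap for execution + storage (1.6 default)
   .set("spark.memory.storageFraction", "0.5") // portion of that share protected for cached data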

Saturday 3 January 2015

Setup SolrCloud


This post explains how to install a Solr cluster. The installation steps below install and run the Solr cluster. After the cluster is installed, you can create indexes, aliases and fields by providing the solrconfig.xml and schema.xml files.

Perform the below steps on all nodes where you want to install Solr.

 1. Download the Solr bundle from http://archive.apache.org/dist/lucene/solr/4.7.2/solr-4.7.2.zip using the below command:

 wget http://archive.apache.org/dist/lucene/solr/4.7.2/solr-4.7.2.zip

2. Unzip the downloaded bundle:

 unzip solr-4.7.2.zip

3. Create a solr home directory to store the solr data:

 mkdir -p solr/home/

4. Move to the solr home directory and create a file named “solr.xml”:

 cd solr/home/

Create the solr.xml file and add the following content to it.

 <?xml version="1.0" encoding="UTF-8" ?>

 <solr>
  <!-- Values in this file will be supplied from command prompt while starting the solr -->

  <solrcloud>
   <str name="host">${host:}</str>
   <int name="hostPort">${jetty.port:}</int>
   <str name="hostContext">${hostContext:solr}</str>
   <int name="zkClientTimeout">${zkClientTimeout:30000}</int>
   <bool name="genericCoreNodeNames">${genericCoreNodeNames:true}</bool>
  </solrcloud>

  <shardHandlerFactory name="shardHandlerFactory" class="HttpShardHandlerFactory">
   <int name="socketTimeout">${socketTimeout:0}</int>
   <int name="connTimeout">${connTimeout:0}</int>
  </shardHandlerFactory>
 </solr>
5. Now move to the example directory available inside the extracted solr folder:

 cd solr-4.7.2/example/

6. Before starting Solr, you should have a running ZooKeeper cluster; Solr needs the ZooKeeper hosts and ports to start. Start Solr using the below command:

nohup java -Xms128M -Xmx1024M -Dsolr.solr.home=solr/home/ -Djetty.port=8983 -DzkHost=zkHost1:2181,zkHost2:2181,zkHost3:2181 -DzkClientTimeout=10000 -DsocketTimeout=1000 -DconnTimeout=1000 -DhostContext=/solr -jar start.jar &
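Once the nodes are up, you can talk to the cluster through SolrJ's CloudSolrServer, which discovers shards and leaders via ZooKeeper. A minimal indexing sketch (the collection name and fields are placeholders, and the collection must already exist):

 import org.apache.solr.client.solrj.impl.CloudSolrServer
 import org.apache.solr.common.SolrInputDocument

 // Connect through the same ZooKeeper ensemble that Solr was started with.
 val server = new CloudSolrServer("zkHost1:2181,zkHost2:2181,zkHost3:2181")
 server.setDefaultCollection("collection1")   // placeholder collection name

 val doc = new SolrInputDocument()
 doc.addField("id", "1")
 doc.addField("name", "hello solrcloud")
 server.add(doc)
 server.commit()
 server.shutdown()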

Saturday 20 December 2014

LZO Compression in Hadoop and HBase


LZO's licence (GPL) is incompatible with Hadoop's (Apache), so LZO must be installed separately in the cluster to enable LZO compression in Hadoop and HBase. LZO is a splittable compression format and provides high compression and decompression speeds.
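Once the codec is installed and registered (step 7 below), a MapReduce job can ask for LZO-compressed output. A hypothetical sketch against the Hadoop 2.x API (the job name is made up, and the job itself is not fully configured here):

 import org.apache.hadoop.conf.Configuration
 import org.apache.hadoop.mapreduce.Job
 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
 import com.hadoop.compression.lzo.LzopCodec

 // Request .lzo output files from the job's output format.
 val job = Job.getInstance(new Configuration(), "lzo-output-demo")
 FileOutputFormat.setCompressOutput(job, true)
 FileOutputFormat.setOutputCompressorClass(job, classOf[LzopCodec])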
Perform the below steps to enable LZO compression in Hadoop and HBase:
1.       Install the LZO development packages:

 sudo yum install lzo lzo-devel

2.       Download the latest LZO release using the below command:

 wget https://github.com/twitter/hadoop-lzo/archive/release-0.4.17.zip

3.       Unzip the downloaded bundle:

 unzip release-0.4.17.zip

4.       Change the current directory to the extracted folder:

cd hadoop-lzo-release-0.4.17

5.       Run the below command to generate the native libraries:

 ant compile-native

6.       Copy the generated jar and native libraries to the Hadoop and HBase lib directories:

 cp build/hadoop-lzo-0.4.17.jar $HADOOP_HOME/lib/
 cp build/hadoop-lzo-0.4.17.jar $HBASE_HOME/lib/
 cp build/hadoop-lzo-0.4.17/lib/native/Linux-amd64-64/* $HADOOP_HOME/lib/native/
 cp build/hadoop-lzo-0.4.17/lib/native/Linux-amd64-64/* $HBASE_HOME/lib/native/

7.       Add the following properties to the core-site.xml file of Hadoop:


 <property>
   <name>io.compression.codecs</name>
   <value>
     org.apache.hadoop.io.compress.DefaultCodec,
     org.apache.hadoop.io.compress.GzipCodec,
     org.apache.hadoop.io.compress.BZip2Codec,
     org.apache.hadoop.io.compress.DeflateCodec,
     org.apache.hadoop.io.compress.SnappyCodec,
     org.apache.hadoop.io.compress.Lz4Codec,
     com.hadoop.compression.lzo.LzoCodec,
     com.hadoop.compression.lzo.LzopCodec
   </value>
 </property>
 <property>
   <name>io.compression.codec.lzo.class</name>
   <value>com.hadoop.compression.lzo.LzoCodec</value>
 </property>


8.       Sync the Hadoop and HBase home directories to all nodes of the Hadoop and HBase cluster.

 rsync -a $HADOOP_HOME/ node1:$HADOOP_HOME/
 rsync -a $HADOOP_HOME/ node2:$HADOOP_HOME/
 rsync -a $HBASE_HOME/ node1:$HBASE_HOME/
 rsync -a $HBASE_HOME/ node2:$HBASE_HOME/

9.       Add the HADOOP_OPTS variable to the .bashrc file on all Hadoop nodes:

 export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native:$HADOOP_HOME/lib/"

10.   Add the HBASE_OPTS variable to the .bashrc file on all HBase nodes:

 export HBASE_OPTS="-Djava.library.path=$HBASE_HOME/lib/native/:$HBASE_HOME/lib/"

11.   Verify the LZO compression in Hadoop:

a.       Create an LZO-compressed file using the lzop utility. The below command will create a compressed file from the LICENSE.txt file, which is available inside the HADOOP_HOME directory.

lzop LICENSE.txt

b.      Copy the generated LICENSE.txt.lzo file to the / (root) HDFS path using the below command.

bin/hadoop fs -copyFromLocal LICENSE.txt.lzo /

c.       Index the LICENSE.txt.lzo file in HDFS using the below command. The index is what allows MapReduce jobs to split the LZO file across multiple mappers.

bin/hadoop jar lib/hadoop-lzo-0.4.17.jar com.hadoop.compression.lzo.LzoIndexer /LICENSE.txt.lzo

Once you execute the above command, you will see the below output on the console. You can also verify the creation of the index file in the Hadoop UI's HDFS browser.

14/12/20 14:04:05 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
14/12/20 14:04:05 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev c461d77a0feec38a2bba31c4380ad60084c09205]
Java HotSpot(TM) 64-Bit Server VM warning: You have loaded library /data/repo/hadoop-2.4.1/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
14/12/20 14:04:06 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/12/20 14:04:08 INFO lzo.LzoIndexer: [INDEX] LZO Indexing file /LICENSE.txt.lzo, size 0.00 GB...
14/12/20 14:04:08 INFO Configuration.deprecation: hadoop.native.lib is deprecated. Instead, use io.native.lib.available
14/12/20 14:04:09 INFO lzo.LzoIndexer: Completed LZO Indexing in 0.61 seconds (0.01 MB/s).  Index size is 0.01 KB.

12.   Verify the LZO Compression in HBase:

You can verify LZO compression in HBase by creating a table that uses LZO compression from the HBase shell.
a.       Create a table with LZO compression using the below command:

create 't1', { NAME => 'f1', COMPRESSION => 'lzo' }

b.      Verify the compression type of the table using the below describe command:

describe 't1'

Once you execute the above command, you will see the below console output. The LZO compression for the table can also be verified in the HBase UI.

 DESCRIPTION                                                                     ENABLED
  't1', {NAME => 'f1', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW',      true
  REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'LZO',
  MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'false',
  BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
 1 row(s) in 0.8250 seconds