HBase Schema

“billions of rows * millions of columns * thousands of versions = terabytes or petabytes of storage” (The HBase project)

Apache HBase is an open source implementation of Google’s Bigtable. It is built atop Apache Hadoop and is tightly integrated with it. It is a good choice for applications requiring fast random access to very large amounts of data.

HBase stores data in distributed, sorted, multidimensional, persistent maps called tables. The table terminology makes it easier for people coming from the relational data management world to picture how data is organized in HBase. HBase is designed to manage tables with billions of rows and millions of columns.

The HBase data model consists of tables containing rows. Within each row, columns are grouped into column families. This is where the similarities between HBase and relational databases end. Now we will explain what is under the HBase table/rows/column families/columns…

Installing Scala in RHEL / CentOS

Scala 2.12 requires the Java runtime, version 1.8 or later. Once Java is installed and configured, we can download the Scala distribution on RHEL or CentOS using this command

wget http://www.scala-lang.org/files/archive/scala-2.12.1.tgz

Once the download is done, we will extract the distribution to /usr/lib

sudo tar -xf scala-2.12.1.tgz -C /usr/lib

Let's create a symbolic link to the Scala directory

sudo ln -s /usr/lib/scala-2.12.1 /usr/lib/scala

Now we will add the Scala bin directory to PATH

export PATH=$PATH:/usr/lib/scala/bin
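
Note that this export only affects the current shell session. One way to make it persistent is to add the same line to a shell startup file. The following is a minimal sketch, assuming bash is the login shell and reads ~/.bashrc; the helper name append_once is hypothetical and not part of any tool.

```shell
# Hypothetical helper: append a line to a file only if that exact line
# is not already present, so running it twice does not duplicate the entry.
append_once() {
    line="$1"
    file="$2"
    # -x matches whole lines, -F treats the pattern as a fixed string
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Example call (assumption: ~/.bashrc is sourced by new shells):
# append_once 'export PATH=$PATH:/usr/lib/scala/bin' "$HOME/.bashrc"
```

After uncommenting and running the example call, new terminal sessions should pick up the Scala bin directory automatically.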

That's all we have to do. Now we can check our Scala installation using the command

scala -version

It should print the following in the terminal

Scala code runner version 2.12.1 -- Copyright 2002-2016, LAMP/EPFL and Lightbend, Inc.

Save data to Cassandra tables using Apache Spark

Apache Cassandra is an open source distributed database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. Cassandra offers robust support for clusters spanning multiple datacenters, with asynchronous masterless replication allowing low latency operations for all clients.

Apache Spark is an open-source cluster computing framework originally developed in the AMPLab at UC Berkeley. In contrast to Hadoop’s two-stage disk-based MapReduce paradigm, Spark’s in-memory primitives provide performance up to 100 times faster for certain applications. By allowing user programs to load data into a cluster’s memory and query it repeatedly, Spark is well-suited to machine learning algorithms.

How to use existing HBase table in Apache Phoenix

Apache Phoenix is an open source relational database layer on top of a NoSQL store such as Apache HBase. Phoenix provides a JDBC driver that hides the intricacies of the NoSQL store, enabling users to create, delete, and alter SQL tables, views, indexes, and sequences; upsert and delete rows singly and in bulk; and query data through SQL.

Installing Apache Solr

Apache Solr:

Apache Solr is an open source search platform written in Java and powered by Apache Lucene. Solr is a standalone search server with a REST-like API. We index documents in it via JSON, XML, CSV or binary over HTTP. We query it via HTTP GET and receive JSON, XML, CSV or binary results.

HBase shell commands

HBase is free, open-source software from the Apache Software Foundation. It is a cross-platform technology, so we can run it on Linux, Windows or OS X machines, and it can also be hosted on Amazon Web Services and Microsoft Azure.

HBase is a NoSQL database that can run on a single machine or on a cluster of servers. Unlike other big data technologies, which are batch-oriented, HBase provides real-time data access. HBase tables can store billions of rows and millions of columns. HBase has a few key concepts, such as row key structure, column families, and regions.

Read data from Cassandra tables using Apache Spark

Apache Cassandra is an open source distributed database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. Cassandra offers robust support for clusters spanning multiple datacenters, with asynchronous masterless replication allowing low latency operations for all clients.

Apache Spark is an open-source cluster computing framework originally developed in the AMPLab at UC Berkeley. In contrast to Hadoop’s two-stage disk-based MapReduce paradigm, Spark’s in-memory primitives provide performance up to 100 times faster for certain applications. By allowing user programs to load data into a cluster’s memory and query it repeatedly, Spark is well-suited to machine learning algorithms.

Read records from HBase table using Java

The hbase-client jar is used to connect to HBase from Java and is available in the Maven repository. The following dependency can be added to our pom.xml

<dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-client</artifactId>
    <version>1.1.0.1</version>
</dependency>

Once we have added the dependency, we need to create a Configuration object that specifies core-site.xml and hbase-site.xml as resources.

Configuring Apache Phoenix in CDH 5.x using Cloudera Manager

Apache Phoenix is an open source relational database layer on top of a NoSQL store such as Apache HBase. Phoenix provides a JDBC driver that hides the intricacies of the NoSQL store, enabling users to create, delete, and alter SQL tables, views, indexes, and sequences; upsert and delete rows singly and in bulk; and query data through SQL.

Create MySQL Events / Schedulers

A MySQL event is an operation that runs at a specified or scheduled time. Events were added in MySQL 5.1.6. The MySQL event scheduler is a process that runs in the background and looks for events to execute. Before we create or schedule an event in MySQL, we first need to verify whether the scheduler is enabled and, if it is not, turn it on.