Sunday, August 5, 2012

Big (unstructured) Data, Grid Computing, Hadoop, HDFS & BigTable


Big Data => it really is very big data; imagine you have petabytes of data to be processed.

You have an app server and a DB. If your app server needs to process a petabyte of data, it must retrieve and copy the data from the DB to the app server first, then process it. This is traditional computing: move the data close to the processing logic.

If it is small data, that is fine and the right way to do it. But if the data volume is big, moving data from the DB to the app server could take days. Just imagine how long it takes to copy petabytes of data files from one server to another.
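To put a rough number on it: a petabyte is about 10^15 bytes. Even over a dedicated 1 Gbps link (roughly 125 MB/s), copying one petabyte takes about 10^15 / 1.25×10^8 ≈ 8,000,000 seconds, which is more than 90 days. The link speed here is just an assumption for illustration, but the conclusion holds at any realistic speed.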

This is where grid computing comes in: move the logic close to the data rather than moving the data to the processing logic.

If we move the logic to the data, first of all we save the time of moving the data around. People may argue that you then need to move the result back. Sure, but most of the time we process a lot of data to produce one small result; otherwise there is no point in processing it at all, only copying it.

Work done by one man is less productive than work done by many. The best strategy is to spend some time teaching others to do the same work, then have them teach others recursively while everyone works. The effect is 2*2*2*... times faster than doing it alone.

So we divide & distribute the data to different machines, then teach & monitor them (the Hadoop map & reduce model, which enables applications to work with thousands of computationally independent computers and petabytes of data) as they process concurrently. This is where the distributed file system (HDFS) comes into the picture.
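To make the map & reduce model concrete, here is a minimal sketch of the classic word count against the standard Hadoop Java MapReduce API. It is my own illustration of the model, not code from any particular tutorial: the mapper runs where the data blocks live and emits (word, 1) pairs, and the reducer sums them up.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: runs on the node where each data block lives and emits
// (word, 1) for every word it sees -- the logic goes to the data.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE); // ship a tiny pair, not the raw data
            }
        }
    }
}

// Reducer: gathers the counts for each word across all mappers and sums them.
class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();
        }
        context.write(word, new IntWritable(sum)); // one small result per word
    }
}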

Instead of a distributed file system, it would be nicer if we could put the data into a distributed DB, as that would be easier to manage.

So, what DB can we use? Big Data means unstructured data, and it could include text strings, documents of all types, audio and video files, metadata, web pages, email messages, social media feeds, form data, and so on. Big Data may well have different data types within the same set that do not share the same structure. The consistent trait of these varied data types is that the data schema isn't known or defined when the data is captured and stored. Rather, a data model is often applied at the time the data is used.

Then BigTable-style databases (Cassandra / HBase / MongoDB) come into the picture. They are more of a name-value pair + timestamp (NoSQL) DB: each value may have different fields, like a JSON string, can be a nested name-value pair, and is flexible enough to cater for any data format. For example:

{
    "_id": ObjectId("4efa8d2b7d284dad101e4bc9"),
    "Last Name": "DUMONT",
    "First Name": "Jean",
    "Date of Birth": "01-22-1963"
},
{
    "_id": ObjectId("4efa8d2b7d284dad101e4bc7"),
    "Last Name": "PELLERIN",
    "First Name": "Franck",
    "Date of Birth": "09-19-1983",
    "Address": "1 chemin des Loges",
    "City": "VERSAILLES"
}

Fortunately, Google's BigTable Paper clearly explains what BigTable actually is. Here is the first sentence of the "Data Model" section:

A Bigtable is a sparse, distributed, persistent multidimensional sorted map.

The paper goes on to explain:

The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes.
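That one sentence is easier to grasp in code. Here is a toy model of the data model only (ignoring "distributed" and "persistent"), written as nested sorted maps in Java. This is my own sketch for illustration, not actual Bigtable code:

import java.util.NavigableMap;
import java.util.TreeMap;

// Toy model of the Bigtable data model: a sorted map indexed by
// row key -> column key -> timestamp -> uninterpreted bytes.
// "Sparse" falls out naturally: absent cells simply have no entry.
public class ToyBigtable {

    private final NavigableMap<String,
            NavigableMap<String, NavigableMap<Long, byte[]>>> rows = new TreeMap<>();

    public void put(String rowKey, String columnKey, long timestamp, byte[] value) {
        rows.computeIfAbsent(rowKey, r -> new TreeMap<>())
            .computeIfAbsent(columnKey, c -> new TreeMap<>())
            .put(timestamp, value);
    }

    // Latest version of a cell: the highest timestamp wins.
    public byte[] getLatest(String rowKey, String columnKey) {
        NavigableMap<String, NavigableMap<Long, byte[]>> row = rows.get(rowKey);
        if (row == null) return null;
        NavigableMap<Long, byte[]> versions = row.get(columnKey);
        if (versions == null || versions.isEmpty()) return null;
        return versions.lastEntry().getValue();
    }
}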

HBase

HBase is a type of "NoSQL" database. "NoSQL" is a general term meaning that the database isn't an RDBMS that supports SQL as its primary access language, but there are many types of NoSQL databases: BerkeleyDB is an example of a local NoSQL database, whereas HBase is very much a distributed database. Technically speaking, HBase is really more of a "Data Store" than a "Data Base" because it lacks many of the features you find in an RDBMS, such as typed columns, secondary indexes, triggers, and advanced query languages.

When Should You Use HBase?

HBase isn't suitable for every problem.

First, make sure you have enough data. If you have hundreds of millions or billions of rows, then HBase is a good candidate. If you only have a few thousand/million rows, then a traditional RDBMS might be a better choice, because all of your data might wind up on a single node (or two) while the rest of the cluster sits idle.

Second, make sure you can live without all the extra features that an RDBMS provides (e.g., typed columns, secondary indexes, transactions, advanced query languages, etc.). An application built against an RDBMS cannot be "ported" to HBase by simply changing a JDBC driver, for example. Consider moving from an RDBMS to HBase as a complete redesign as opposed to a port.

Third, make sure you have enough hardware. Even HDFS doesn't do well with anything less than 5 DataNodes (due to things such as HDFS block replication, which has a default of 3), plus a NameNode.

HBase can run quite well stand-alone on a laptop - but this should be considered a development configuration only.

What Is The Difference Between HBase and Hadoop/HDFS?

HDFS is a distributed file system that is well suited to the storage of large files. Its documentation states that it is not, however, a general-purpose file system and does not provide fast individual record lookups in files. HBase, on the other hand, is built on top of HDFS and provides fast record lookups (and updates) for large tables. This can sometimes be a point of conceptual confusion. HBase internally puts your data in indexed "StoreFiles" that exist on HDFS for high-speed lookups.
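For example, a single-record write and lookup by row key looks like this with the HBase Java client. The table name, column family, and row key are made up for illustration, and the exact client classes vary between HBase versions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseLookup {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Write one cell: row key -> column family:qualifier -> value.
            Put put = new Put(Bytes.toBytes("user#1001"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
                    Bytes.toBytes("Jean DUMONT"));
            table.put(put);

            // Fast single-record lookup by row key -- this is what
            // HDFS alone does not give you.
            Get get = new Get(Bytes.toBytes("user#1001"));
            Result result = table.get(get);
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}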

What are the real-life big data use cases where we can use Hadoop or similar technologies, and where do we start?

Check out Cloudera and the use cases they recommend:

E-tailing

  • Recommendation engines — increase average order size by recommending complementary products based on predictive analysis for cross-selling.
  • Cross-channel analytics — sales attribution, average order value, lifetime value (e.g., how many in-store purchases resulted from a particular recommendation, advertisement or promotion).
  • Event analytics — what series of steps (golden path) led to a desired outcome (e.g., purchase, registration).

Financial Services

  • Compliance and regulatory reporting.
  • Risk analysis and management.
  • Fraud detection and security analytics.
  • CRM and customer loyalty programs.
  • Credit scoring and analysis.
  • Trade surveillance.

Government

  • Fraud detection and cybersecurity.
  • Compliance and regulatory analysis.
  • Energy consumption and carbon footprint management.

Health & Life Sciences

  • Campaign and sales program optimization.
  • Brand management.
  • Patient care quality and program analysis.
  • Supply-chain management.
  • Drug discovery and development analysis.

Retail/CPG

  • Merchandizing and market basket analysis.
  • Campaign management and customer loyalty programs.
  • Supply-chain management and analytics.
  • Event- and behavior-based targeting.
  • Market and consumer segmentations.

Telecommunications

  • Revenue assurance and price optimization.
  • Customer churn prevention.
  • Campaign management and customer loyalty.
  • Call Detail Record (CDR) analysis.
  • Network performance and optimization.

Web & Digital Media Services

  • Large-scale clickstream analytics.
  • Ad targeting, analysis, forecasting and optimization.
  • Abuse and click-fraud prevention.
  • Social graph analysis and profile segmentation.
  • Campaign management and loyalty programs.

Thursday, August 2, 2012

AppDynamics? What, Why & How


If you ask me what "AppDynamics" is, in my own words it is a real-time web application profiler and monitoring tool. The reason I describe it that way is that these are the features I used most in the free version.


Actually, AppDynamics calls itself an Application Performance Management (APM) product and has more features than I blog about here. This post is just a jump-start based on my personal experience with AppDynamics Lite (vs. the commercial version) during development.


The two most important reasons that made me try it out:

- Easy to install
My team is so busy with development that there is no time to set up an individual application profiler in each developer's IDE.

With a few minutes' effort, I set one up on our DEV server and let all my developers use it with no wait.

- Easy to use 
A very intuitive GUI that lets you drill down all the way to JDBC and SQL queries, with a detailed call stack (call graph) and the time spent at each step.

No training required; just follow the color code, then double-click and click, click until you find the root cause in minutes.

So, when should you use an application profiler instead of AppDynamics? Here is a short answer:


When to use a profiler:

  • You need to troubleshoot high CPU usage and high memory usage.
  • Your environment is Development or QA.
  • 15-20% overhead is okay.

When to use AppDynamics Lite:

  • You need to troubleshoot slow response times, slow SQL, high error rates, and stalls.
  • Your environment is Production or a performance load test.
  • You can't afford more than 2% overhead.


Now, let me walk you through how to start & use AppDynamics.


- Download AppDynamics and install it in 4 steps


- Open a web browser and browse through the slow web pages or the processes you intend to monitor and profile


- Open a web browser and log in to AppDynamics (be patient if you don't see any data; AppDynamics needs a short while to build and load the analytics)


^ The landing page shows a list of recent transactions (web requests) and highlights the response time in a color code based on the thresholds (slow, very slow and stalling), which you can change at will.


- Double-click a page name (URL) shown in red until you see the full call graph & the time spent at each step.


   * First, it drills down to all samplings of the same transaction (web request).
   * Choose one of them and double-click to drill down to the call graph.
   ^ You can filter the graph by method time spent and class type (servlet, POJO, Spring bean, EJB, etc.).

   * Click and click ... until you see something you are familiar with.
   ^ Take note of the color code; don't confuse the total time spent (ORANGE) with the method's self time (BLUE).

When recent transactions are not enough, we need to check out long transactions from the past 30 minutes or 1 hour. Never mind, go and visit the Snapshot page.

   * Click on the Snapshot tab.
   ^ Too much info on the page? Never mind, enable the filter to view only the transactions you care about.

   * Double-click a long transaction to pull up the call graph.
   ^ There is a JDBC call in this transaction.

   * Right-click a slow method to further expand the graph.

   * Click on JDBC to check out the list of executed SQL & the time spent on each.

   * Scroll up & down and check which SQL statements they are and how much of the total time they take.

   * Save the call graph for offline or offsite study by other developers.

   * Save it to local disk and send it to developers for further study.
   ^ This file also includes the list of executed SQL and the timing of each.

That is all; it is a sweet & easy experience. In fact, I have put it behind an Apache server so that our offsite team can access it and check out the root cause of long transactions.
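For reference, fronting it with Apache takes only a couple of mod_proxy directives. The host name, port, and path below are placeholders for wherever your AppDynamics Lite viewer actually runs:

# httpd.conf -- requires mod_proxy and mod_proxy_http to be loaded
ProxyPass        /appdynamics/  http://devserver:8990/
ProxyPassReverse /appdynamics/  http://devserver:8990/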
