Big Data => It really is very big data. Imagine you have petabytes of data to process.
You have an app server and a DB. If your app server needs to process a petabyte of data, it first has to retrieve and copy the data from the DB to the app server, and only then process it. This is traditional computing: move the data close to the processing logic.
If the data is small, that is fine and it is the right way to do it. But if the data volume is big, moving the data from the DB to the app server could take days. Just imagine how long it takes to copy petabytes of data files from one server to another.
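To put a rough number on that, here is a back-of-the-envelope sketch in Python. The link speeds and the decimal definition of a petabyte (10^15 bytes) are assumptions picked for illustration, and the estimate assumes the link runs at its full nominal rate the whole time, which real transfers rarely manage.

```python
# Back-of-the-envelope estimate of how long a bulk copy takes.
# Assumptions (not from the original text): 1 PB = 10**15 bytes,
# and the link sustains its full nominal rate for the whole transfer.

PETABYTE = 10**15  # bytes, decimal definition


def transfer_days(num_bytes: int, link_bits_per_second: float) -> float:
    """Return the time in days needed to move num_bytes over the link."""
    seconds = num_bytes * 8 / link_bits_per_second
    return seconds / 86_400  # 86,400 seconds in a day


if __name__ == "__main__":
    for label, rate in [("1 Gbit/s", 1e9), ("10 Gbit/s", 1e10)]:
        print(f"1 PB over {label}: {transfer_days(PETABYTE, rate):.1f} days")
```

Under these assumptions a single petabyte needs roughly 93 days on a 1 Gbit/s link and still about 9 days on a 10 Gbit/s link, before any processing has started.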
This is where grid computing comes in: it moves the logic close to the data rather than moving the data to the processing logic.
If we move the logic to the data, first of all you save the time of moving data around. People may argue that you then need to move the result back. Sure, but most of the time we process a lot of data to produce one small result; otherwise there would be no point in processing it, we would just be copying it.
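A toy sketch of the idea, in plain Python: each entry in the dictionary stands in for a machine that holds its own slice of the data. The node names, the numbers, and the helper `run_on_each_node` are all made up for this illustration; a real grid would ship the function over the network instead of calling it in-process.

```python
# Toy illustration of "move the logic to the data".
# Each dict entry stands in for a machine holding its own slice of data;
# node names and values are invented for this example.
from typing import Callable, Dict, List

nodes: Dict[str, List[int]] = {
    "node-a": [3, 8, 15, 4],
    "node-b": [42, 7, 1],
    "node-c": [9, 9, 23, 5, 6],
}


def run_on_each_node(logic: Callable[[List[int]], int]) -> Dict[str, int]:
    """Send the function to every node; collect one small result per node."""
    return {name: logic(local_data) for name, local_data in nodes.items()}


def total() -> int:
    """Combine the per-node partial sums into the final answer."""
    partial_sums = run_on_each_node(sum)
    return sum(partial_sums.values())


if __name__ == "__main__":
    print(run_on_each_node(sum))  # one number travels back per node
    print(total())
```

Here twelve values stay where they are and only three partial sums travel back, which is the trade the paragraph above describes.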
One person doing the work alone is less productive than many people doing it together. The best strategy is to spend some time teaching others to do the same work, then have them teach others in turn, recursively, and get everyone working. Each round of teaching doubles the number of workers (2*2*2*...), so the work gets done far faster than doing it alone.
So we divide and distribute the data to different machines, then teach and monitor them so that they process it concurrently. This is the job of Hadoop and the map & reduce model, which enable applications to work with thousands of computationally independent computers and petabytes of data. The data has to live somewhere on those machines, and this is where the distributed file system (HDFS) comes into the picture.
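The map & reduce model can be imitated in a few lines of ordinary Python. This is only an in-process sketch of the three steps (map, group by key, reduce) using the classic word-count example; it does not use the Hadoop API, and the function names are my own.

```python
# In-process imitation of the map & reduce model (word count).
# This is NOT the Hadoop API; all function names are illustrative.
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(line: str) -> List[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as the framework does between phases."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(splits: List[List[str]]) -> Dict[str, int]:
    """Run map over every split, then shuffle and reduce the results."""
    mapped: List[Tuple[str, int]] = []
    for split in splits:  # on a cluster, each split runs on a different machine
        for line in split:
            mapped.extend(map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    print(word_count([["big data big"], ["data moves"]]))
```

Because every call to `map_phase` only looks at its own line, the splits can be processed on different machines without any coordination until the grouping step.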
Instead of a distributed file system, it would be nicer if we could put the data into a distributed DB, as it would be easier to manage.
So, what DB can we use? Big Data is often unstructured data, and it could include text strings, documents of all types, audio and video files, metadata, web pages, email messages, social media feeds, form data, and so on. Big Data may well have different data types within the same set that do not share the same structure. The consistent trait of these varied data types is that the data schema isn’t known or defined when the data is captured and stored. Rather, a data model is often applied at the time the data is used.
This is where BigTable-style stores (Cassandra / HBase) and document stores (MongoDB) come into the picture. These NoSQL DBs are more of a name-value pair + timestamp model. Each value may have different fields, like a JSON string, and it can be a nested name-value pair, which makes it flexible enough to cater for any data format. For example, here are two MongoDB-style documents with different fields:
{
"_id": ObjectId("4efa8d2b7d284dad101e4bc9"),
"Last Name": "DUMONT",
"First Name": "Jean",
"Date of Birth": "01-22-1963"
},
{
"_id": ObjectId("4efa8d2b7d284dad101e4bc7"),
"Last Name": "PELLERIN",
"First Name": "Franck",
"Date of Birth": "09-19-1983",
"Address": "1 chemin des Loges",
"City": "VERSAILLES"
}
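The two records above have different fields, yet both can be read through one model that is applied only when the data is used. The sketch below shows that idea in plain Python; the `Person` type and the `to_person` helper are hypothetical, the `_id` field is left out because `ObjectId` is a MongoDB type, and the date format is assumed to be month-day-year.

```python
# Schema-on-read sketch: the stored records keep whatever fields they have,
# and a model is applied only at the moment the data is used.
# `Person` and `to_person` are hypothetical names for this illustration.
from datetime import date, datetime
from typing import Any, Dict, NamedTuple, Optional


class Person(NamedTuple):
    full_name: str
    born: date
    city: Optional[str]  # not every record carries a city


def to_person(doc: Dict[str, Any]) -> Person:
    """Apply the model at read time, tolerating missing optional fields."""
    return Person(
        full_name=f'{doc["First Name"]} {doc["Last Name"]}',
        # Assumption: dates are stored as month-day-year strings.
        born=datetime.strptime(doc["Date of Birth"], "%m-%d-%Y").date(),
        city=doc.get("City"),
    )


documents = [
    {"Last Name": "DUMONT", "First Name": "Jean", "Date of Birth": "01-22-1963"},
    {"Last Name": "PELLERIN", "First Name": "Franck",
     "Date of Birth": "09-19-1983",
     "Address": "1 chemin des Loges", "City": "VERSAILLES"},
]

people = [to_person(d) for d in documents]

if __name__ == "__main__":
    for person in people:
        print(person)
```

A different application could read the same records through a different model, for instance one that cares about `Address` and ignores the date of birth.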
Fortunately, Google's BigTable paper clearly explains what BigTable actually is. Here is the first sentence of the "Data Model" section:
"A Bigtable is a sparse, distributed, persistent multidimensional sorted map."
It goes on to explain that:
"The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes."
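That definition maps quite directly onto a small data structure. The following is a single-machine sketch of the "sparse, multidimensional sorted map" part only (nothing here is distributed or persistent), and `TinyTable` with its method names is my own invention, not any real BigTable or HBase API. Returning newer versions of a cell first is a choice made for this sketch.

```python
# Single-machine sketch of a map indexed by (row key, column key, timestamp).
# `TinyTable` is a made-up name; it is not a BigTable or HBase API.
from typing import Dict, List, Optional, Tuple

Key = Tuple[str, str, int]  # (row key, column key, timestamp)


class TinyTable:
    def __init__(self) -> None:
        # Sparse: only cells that were actually written take up space.
        self._cells: Dict[Key, bytes] = {}

    def put(self, row: str, column: str, timestamp: int, value: bytes) -> None:
        """Store an uninterpreted byte string under the three-part key."""
        self._cells[(row, column, timestamp)] = value

    def get(self, row: str, column: str) -> Optional[bytes]:
        """Return the most recent version of a cell, or None if absent."""
        versions = [
            (ts, value)
            for (r, c, ts), value in self._cells.items()
            if r == row and c == column
        ]
        if not versions:
            return None
        return max(versions, key=lambda version: version[0])[1]

    def scan(self) -> List[Tuple[Key, bytes]]:
        """Return all cells sorted by row, then column, newest version first."""
        return sorted(
            self._cells.items(),
            key=lambda item: (item[0][0], item[0][1], -item[0][2]),
        )


if __name__ == "__main__":
    table = TinyTable()
    table.put("com.example.www", "contents", 1, b"<html>old</html>")
    table.put("com.example.www", "contents", 2, b"<html>new</html>")
    table.put("com.example.blog", "title", 1, b"Hello")
    print(table.get("com.example.www", "contents"))
    for key, value in table.scan():
        print(key, value)
```

Rows do not need to share columns, which is what makes the map sparse, and the table never looks inside the bytes it stores.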
HBase
HBase is a type of "NoSQL" database. "NoSQL" is a general term meaning that the database isn't an RDBMS which supports SQL as its primary access language, but there are many types of NoSQL databases: BerkeleyDB is an example of a local NoSQL database, whereas HBase is very much a distributed database. Technically speaking, HBase is really more a "Data Store" than "Data Base" because it lacks many of the features you find in an RDBMS, such as typed columns, secondary indexes, triggers, and advanced query languages, etc.
When Should You Use HBase?
HBase isn't suitable for every problem.
First, make sure you have enough data. If you have hundreds of millions or billions of rows, then HBase is a good candidate. If you only have a few thousand/million rows, then using a traditional RDBMS might be a better choice due to the fact that all of your data might wind up on a single node (or two) and the rest of the cluster may be sitting idle.
Second, make sure you can live without all the extra features that an RDBMS provides (e.g., typed columns, secondary indexes, transactions, advanced query languages, etc.) An application built against an RDBMS cannot be "ported" to HBase by simply changing a JDBC driver, for example. Consider moving from an RDBMS to HBase as a complete redesign as opposed to a port.
Third, make sure you have enough hardware. Even HDFS doesn't do well with anything less than 5 DataNodes (due to things such as HDFS block replication which has a default of 3), plus a NameNode.
HBase can run quite well stand-alone on a laptop - but this should be considered a development configuration only.
What Is The Difference Between HBase and Hadoop/HDFS?
HDFS is a distributed file system that is well suited for the storage of large files. Its documentation states that it is not, however, a general purpose file system, and does not provide fast individual record lookups in files. HBase, on the other hand, is built on top of HDFS and provides fast record lookups (and updates) for large tables. This can sometimes be a point of conceptual confusion. HBase internally puts your data in indexed "StoreFiles" that exist on HDFS for high-speed lookups.
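The difference between scanning a plain file and looking a record up through an index can be shown with a conceptual sketch. This is not HBase's actual StoreFile format or API; the record layout and function names are assumptions, and the only point is that sorted, indexed data can be searched without reading every record.

```python
# Conceptual contrast: scanning a flat file vs. looking up in sorted,
# indexed data. NOT the real StoreFile format; names are illustrative.
import bisect
from typing import List, Optional, Tuple

Record = Tuple[str, str]  # (row key, value)


def scan_lookup(records: List[Record], key: str) -> Optional[str]:
    """Flat-file style: read records one by one until the key turns up."""
    for row_key, value in records:
        if row_key == key:
            return value
    return None


def indexed_lookup(
    sorted_records: List[Record], sorted_keys: List[str], key: str
) -> Optional[str]:
    """Indexed style: binary search over the sorted row keys."""
    position = bisect.bisect_left(sorted_keys, key)
    if position < len(sorted_keys) and sorted_keys[position] == key:
        return sorted_records[position][1]
    return None


if __name__ == "__main__":
    records = [(f"row{i:06d}", f"value{i}") for i in range(100_000)]
    keys = [row_key for row_key, _ in records]  # already in sorted order
    print(scan_lookup(records, "row099999"))           # touches every record
    print(indexed_lookup(records, keys, "row099999"))  # about 17 comparisons
```

Both calls return the same value; the scan walks through all 100,000 records to reach the last one, while the binary search needs about 17 comparisons.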
What are the real-life Big Data use cases where we can use Hadoop or similar technologies, and where do we start?
Check out Cloudera and the use cases they recommend:
E-tailing
- Recommendation engines — increase average order size by recommending complementary products based on predictive analysis for cross-selling.
- Cross-channel analytics — sales attribution, average order value, lifetime value (e.g., how many in-store purchases resulted from a particular recommendation, advertisement or promotion).
- Event analytics — what series of steps (golden path) led to a desired outcome (e.g., purchase, registration).
Financial Services
- Compliance and regulatory reporting.
- Risk analysis and management.
- Fraud detection and security analytics.
- CRM and customer loyalty programs.
- Credit scoring and analysis.
- Trade surveillance.
Government
- Fraud detection and cybersecurity.
- Compliance and regulatory analysis.
- Energy consumption and carbon footprint management.
Health & Life Sciences
- Campaign and sales program optimization.
- Brand management.
- Patient care quality and program analysis.
- Supply-chain management.
- Drug discovery and development analysis.
Retail/CPG
- Merchandizing and market basket analysis.
- Campaign management and customer loyalty programs.
- Supply-chain management and analytics.
- Event- and behavior-based targeting.
- Market and consumer segmentations.
Telecommunications
- Revenue assurance and price optimization.
- Customer churn prevention.
- Campaign management and customer loyalty.
- Call Detail Record (CDR) analysis.
- Network performance and optimization.
Web & Digital Media Services
- Large-scale clickstream analytics.
- Ad targeting, analysis, forecasting and optimization.
- Abuse and click-fraud prevention.
- Social graph analysis and profile segmentation.
- Campaign management and loyalty programs.