HBase Getting Started 6 - Plain Talk on MySQL (RDBMS) vs. HBase
Jun 07, 2016, 4:26 PM
My Rambling 1:

No new technology is a silver bullet, a magic box that cures every ill with a single wipe. A product is not automatically miraculous and dazzling just because it uses Spring or NoSQL; claims like that are basically nonsense. Among products of the same type, whatever technology you pick, the end goal is the same. A new technique may let you sidestep the problem in front of you today, but new problems will follow. Looking back, you might well get a higher return by putting more thought into the system you already have and making incremental improvements.
Traditional databases store data in blocks: put simply, the more columns a table has, the more space each row occupies, so a query may have to cross block boundaries and slow down. In a large system it is entirely possible for one table to have a hundred-plus columns and hundreds of millions of rows, which becomes a query bottleneck. It is common knowledge that the number of records in a table has an enormous impact on query performance, so you will probably reach for table and database sharding to spread the load, and that in turn raises new problems of its own: distributed transactions, globally unique ID generation, cross-database queries, and so on. Still thorny.

If you break away from this row-oriented storage model and adopt a column-oriented one, things can improve for large-scale data. Because the selection rules in queries are defined over columns, the whole database is effectively self-indexing. Column storage clusters each column's data together; columns can be added dynamically, and an empty column stores nothing, saving space. Clustering each column's data also greatly reduces how much must be read: a query touches exactly the columns it names. Nor do you need to think about sharding. HBase automatically splits the stored data and supports highly concurrent reads and writes, so massive data stores scale much more easily.
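The row-versus-column contrast above can be sketched in a few lines of Python. This is a toy model with made-up data, not how any real engine is implemented (real systems work on on-disk blocks), but it shows the two properties the paragraph claims: per-column clustering and free sparseness.

```python
# Toy contrast between row-oriented and column-oriented storage.
# Data and names are illustrative; real engines work on disk blocks.

rows = [
    {"id": 1, "title": "doc1", "url": "http://a"},   # no "author" cell
    {"id": 2, "title": "doc2", "author": "H.E."},    # no "url" cell
]

def to_column_store(records):
    """Cluster values by column; absent (empty) cells are simply not stored."""
    columns = {}
    for rec in records:
        for col, val in rec.items():
            columns.setdefault(col, {})[rec["id"]] = val
    return columns

columns = to_column_store(rows)

# Reading one column touches only that column's clustered data,
# instead of scanning every full row as a row store would.
print(columns["title"])     # {1: 'doc1', 2: 'doc2'}
print(len(columns["url"]))  # 1: the sparse column stores a single cell, not a NULL per row
```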
Java's HashMap is a key/value structure, and you can likewise view HBase's data model as a key/value system. An HBase region is delimited by table name and row range. Within a region, each column family is managed by an object called an HStore, and each HStore consists of one or more MapFiles (a Hadoop file type); the MapFile concept is similar to Google's SSTable. HBase has two main concepts, the row key and the column family, plus the cell qualifier and timestamp tuple. Access control and disk/memory usage statistics are all handled at the column-family level. Column families are defined up front as part of the data model, and each column family can contain many columns, distinguished by qualifier. Each cell in HBase can hold multiple versions of the same piece of data, distinguished by a unique timestamp, with the newest version first.
Aside: HBase partitions a table horizontally into N regions; each region is split by column family into Stores, and each Store consists of an in-memory MemStore plus HFiles persisted to disk.
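The cell model described above, where a value is addressed by (row key, "family:qualifier") and keeps timestamped versions newest-first, can be sketched like this. Class and method names here are illustrative, not the HBase API:

```python
# Minimal sketch of HBase's logical model: a cell is addressed by
# (row key, "family:qualifier") and keeps multiple timestamped versions,
# returned newest-first. Not the real HBase client API.

class ToyTable:
    def __init__(self):
        # (row key, "family:qualifier") -> [(timestamp, value), ...], newest first
        self.cells = {}

    def put(self, row, column, value, ts):
        versions = self.cells.setdefault((row, column), [])
        versions.append((ts, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)  # newest version first

    def get(self, row, column, max_versions=1):
        return self.cells.get((row, column), [])[:max_versions]

t = ToyTable()
t.put("1", "info:title", "old title", ts=100)
t.put("1", "info:title", "new title", ts=200)
print(t.get("1", "info:title"))                  # [(200, 'new title')]
print(t.get("1", "info:title", max_versions=2))  # newest first, then the old version
```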
I may not have expressed the above clearly enough, so let's look at a practical scenario: migrating blog data that used to live in MySQL over to HBase.
The existing table structure in MySQL:
The table structure after migration to HBase:
The original system had two tables, blogtable and comment; after moving to HBase there is only one blogtable table. In a traditional RDBMS the columns of blogtable are fixed: the schema defines attributes such as Author, Title, URL, and text, and once the system is live you cannot add columns dynamically. With a column-oriented store such as HBase, we can define the blogtable table with an info column family, and split the data into info:title, info:author, info:url, and so on. If you later want another attribute, it is as easy as adding another info:xxx column.
You can think of the row key as the primary key of a row in a traditional RDBMS. HBase does not support conditional queries or ORDER BY, so the row key must be designed around your system's query patterns. Records in HBase are stored sorted by row key, which is what makes lookups so fast.
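The sorted-by-row-key point is worth seeing concretely. In HBase, rows are ordered by the raw bytes of the row key (lexicographically), so a STARTROW scan is just a binary search plus a sequential walk. The sketch below uses toy data; it also shows the classic gotcha that '10' sorts before '2', which is why numeric ids are commonly zero-padded when designing row keys:

```python
import bisect

# Rows are kept sorted by row key; a range scan locates the start
# and walks forward. Toy data, illustrative only.
row_keys = sorted(["1", "10", "2", "20", "3"])

def scan(keys, startrow=None, stoprow=None):
    """Return the sorted slice [startrow, stoprow), like an HBase range scan."""
    lo = bisect.bisect_left(keys, startrow) if startrow is not None else 0
    hi = bisect.bisect_left(keys, stoprow) if stoprow is not None else len(keys)
    return keys[lo:hi]

print(row_keys)             # ['1', '10', '2', '20', '3'] (note: '10' sorts before '2')
print(scan(row_keys, "2"))  # ['2', '20', '3'] (everything >= '2')
```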
The concrete steps are as follows:
============================ Create the blogtable table ============================
create 'blogtable', 'info','text','comment_title','comment_author','comment_text'
============================ Insert summary info ============================
put 'blogtable', '1', 'info:title', 'this is doc title'
put 'blogtable', '1', 'info:author', 'javabloger'
put 'blogtable', '1', 'info:url', 'http://www.javabloger.com/index.php'
put 'blogtable', '2', 'info:title', 'this is doc title2'
put 'blogtable', '2', 'info:author', 'H.E.'
put 'blogtable', '2', 'info:url', 'http://www.javabloger.com/index.html'
============================ Insert body text ============================
put 'blogtable', '1', 'text:', 'what is this doc context ?'
put 'blogtable', '2', 'text:', 'what is this doc context2?'
============================ Insert comment info ============================
put 'blogtable', '1', 'comment_title:', 'this is doc comment_title '
put 'blogtable', '1', 'comment_author:', 'javabloger'
put 'blogtable', '1', 'comment_text:', 'this is nice doc'
put 'blogtable', '2', 'comment_title:', 'this is blog comment_title '
put 'blogtable', '2', 'comment_author:', 'H.E.'
put 'blogtable', '2', 'comment_text:', 'this is nice blog'
HBase data can be read via a get on a single row key, a scan over a row-key range, or a full table scan, roughly as follows.
Note: HBase does not support WHERE conditions or ORDER BY; it queries only by row key, but conditional filtering is available through the API HBase provides.
For example: http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/ColumnPrefixFilter.html
scan 'blogtable', {COLUMNS => ['text:','info:title']}    --> list article bodies and titles
scan 'blogtable', {COLUMNS => 'info:url', STARTROW => '2'}    --> range scan: list article URLs starting from row key '2'
get 'blogtable','1'    --> list all data for the article with id 1
get 'blogtable','1', {COLUMN => 'info'}    --> list the info (header) data for article id 1
get 'blogtable','1', {COLUMN => 'text'}    --> list the text (body) content for article id 1
get 'blogtable','1', {COLUMN => ['text','info:author']}    --> list the body and author for article id 1
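The filters mentioned above act as predicates applied to cells while a scan runs server-side. As a toy illustration of the ColumnPrefixFilter idea (the real filter is the Java class org.apache.hadoop.hbase.filter.ColumnPrefixFilter; this Python sketch only models its effect):

```python
# Toy model of ColumnPrefixFilter: keep only cells whose qualifier
# (the part after "family:") starts with a given prefix.

def column_prefix_filter(row_cells, prefix):
    """Filter a row's cells by qualifier prefix."""
    return {
        col: val
        for col, val in row_cells.items()
        if col.split(":", 1)[-1].startswith(prefix)
    }

row = {
    "info:title": "this is doc title",
    "info:author": "javabloger",
    "info:url": "http://www.javabloger.com/index.php",
}
print(column_prefix_filter(row, "a"))  # {'info:author': 'javabloger'}
```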
My Rambling 2:

Some people ask: among Java web servers, is Tomcat faster or GlassFish? Among small databases, is MySQL or MS-SQL more efficient? To me the key is the scenario and how you use the product (technology). I have gradually come to believe you need a deep understanding of the product and the technology itself; a new technology is not automatically the best choice. Ask yourself: could Tomcat on its default settings match the performance of the GlassFish we run in production? I also don't believe GlassFish v2 and GlassFish v3 differ significantly under default configuration. Only by understanding a product deeply can you get its best performance out of it, rather than trusting vendor advertising and gut impressions and believing on faith in one product's superiority.
My Rambling 3:

New technologies like NoSQL really can solve problems we used to face, but they do not fit every scenario. When adopting a new product, match it to what your product actually needs: let requirements drive the investment in new technology, rather than chasing fashion. Whether your product is solid has nothing to do with which new technology you used; users care about speed and stability, not about whether you use NoSQL. Google, conversely, handles enormous data volumes while delivering astonishing speed and accuracy to users worldwide, and that is why people turn around and wonder how Google does it. So go by your own needs, and never force yourself into some new technology.
My Rambling 4:

In a word: what you use is not the key; the key is how you use it!
Related articles:
Lily - distributed search built on HBase
A MySQL-to-Hive/HBase migration tool
HBase Getting Started 5 (Cluster) - Load distribution and failover
Hive Getting Started 3 - Integrating Hive with HBase
HBase Getting Started 4
HBase Getting Started 3
HBase Getting Started 2 - Java examples for operating HBase
HBase Getting Started
JABase - a distributed instant-messaging (IM) system built on HBase storage
–end–
Original source: HBase Getting Started 6 - Plain Talk on MySQL (RDBMS) vs. HBase. Thanks to the original author for sharing.
