How to import local data into Hive_Big Data

4 answers

@CcCc

#2 · 2020-09-22 23:26

1. Import data from the local file system into a Hive table

(1) Prepare the data (/home/sopdm/test.dat):

1,wyp,25,13188888888

2,test,30,13899999999

3,zs,34,89931412
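Before loading, it can help to confirm that every line has the four comma-separated fields the table expects, because loading a text file does not validate rows against the table schema; a short row simply shows up later with NULL columns. A minimal sketch in plain shell, assuming a scratch path under /tmp and a helper named check_fields (both hypothetical, not part of the original answer):

```shell
#!/bin/sh
# Hypothetical scratch location; the answer above uses /home/sopdm/test.dat.
DATA_FILE="${DATA_FILE:-/tmp/test.dat}"

# Write the three sample rows from the answer.
cat > "$DATA_FILE" <<'EOF'
1,wyp,25,13188888888
2,test,30,13899999999
3,zs,34,89931412
EOF

# check_fields FILE N: print how many lines have a comma-separated
# field count different from N (0 means the file matches the layout).
check_fields() {
    awk -F',' -v n="$2" 'NF != n { bad++ } END { print bad + 0 }' "$1"
}

echo "malformed lines: $(check_fields "$DATA_FILE" 4)"
```

If the count is not 0, the file is worth fixing before it goes into the table.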

(2) Create the table first

use sopdm;

drop table if exists sopdm.wyp;

create table if not exists sopdm.wyp(id int,name string,age int,tel string)

row format delimited

fields terminated by ','

stored as textfile;

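(3) Load the data from the local file system into the Hive table

load data local inpath '/home/sopdm/test.dat' into table sopdm.wyp;

(4) You can inspect the data directory of table wyp with the following command, run from the Hive CLI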

dfs -ls /user/sopdm/hive/warehouse/sopdm.db/wyp;

2. Import data from HDFS into a Hive table

(1) First create an input directory in HDFS to hold the files

hadoop fs -mkdir input;

or

hadoop fs -mkdir /user/sopdm/input;

(2) Upload the local file to HDFS, renaming it to test_hdfs.dat

hadoop fs -put /home/sopdm/test.dat /user/sopdm/input/test_hdfs.dat;

(3) View the file (from the Hive CLI)

dfs -cat /user/sopdm/input/test_hdfs.dat;

(4) Load the contents into the Hive table

-- To copy local data into Hive, use: load data local…

-- To move HDFS data into Hive (must be on the same cluster), use: load data…

load data inpath '/user/sopdm/input/test_hdfs.dat' into table sopdm.wyp;

3. Import data from another Hive table into a Hive table

create table if not exists sopdm.wyp2(id int,name string,tel string)

row format delimited

fields terminated by ','

stored as textfile;

-- overwrite replaces the existing data; into appends

insert into table sopdm.wyp2

select id,name,tel from sopdm.wyp;

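-- Multi-table insert

-- Efficient approach: the source table is scanned once and feeds several inserts (for example into multiple partitions)

-- Note: both clauses below overwrite the same table sopdm.wyp2; normally each INSERT clause targets a different table or partition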

from sopdm.wyp w

insert overwrite table sopdm.wyp2

select w.id,w.name,w.tel where w.age=25

insert overwrite table sopdm.wyp2

select w.id,w.name,w.tel where w.age=27;

4. Create a Hive table and populate it from a query at the same time

create table sopdm.wyp3

as select id,name,tel,age from sopdm.wyp where age=25;

5. Import data from a relational database into a Hive table with Sqoop

(Not covered here.)

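#3 · 2020-09-23 08:36

I. Ways to import data into Hive

First, here are the import and export methods covered below, along with the Hive tables and data files used in the examples.

Import:

Local file into a Hive table;
Hive table into a Hive table;
HDFS file into a Hive table;
From another table while creating a table;
MySQL database into a Hive table via Sqoop; for examples, see the articles "Importing and exporting between MySQL and Hive with Sqoop" and "Periodically syncing Hive data from the big data platform to Oracle"

Export:

Hive table to the local file system;
Hive table to HDFS;
Hive table to a MySQL database via Sqoop;

Hive tables:

Create testA: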

CREATE TABLE testA (  
    id INT,  
    name string,  
    area string  
) PARTITIONED BY (create_time string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;

Create testB:

CREATE TABLE testB (  
    id INT,  
    name string,  
    area string,  
    code string  
) PARTITIONED BY (create_time string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;

Data file (sourceA.txt):

1,fish1,SZ  
2,fish2,SH  
3,fish3,HZ  
4,fish4,QD  
5,fish5,SR

Data file (sourceB.txt):

1,zy1,SZ,1001
2,zy2,SH,1002
3,zy3,HZ,1003
4,zy4,QD,1004
5,zy5,SR,1005

(1) Local file into a Hive table

hive> LOAD DATA LOCAL INPATH '/home/hadoop/sourceA.txt' INTO TABLE testA PARTITION(create_time='2015-07-08');  
Copying data from file:/home/hadoop/sourceA.txt  
Copying file: file:/home/hadoop/sourceA.txt  
Loading data to table default.testa partition (create_time=2015-07-08)  
Partition default.testa{create_time=2015-07-08} stats: [numFiles=1, numRows=0, totalSize=58, rawDataSize=0]
OK
Time taken: 0.237 seconds  
hive> LOAD DATA LOCAL INPATH '/home/hadoop/sourceB.txt' INTO TABLE testB PARTITION(create_time='2015-07-09');  
Copying data from file:/home/hadoop/sourceB.txt  
Copying file: file:/home/hadoop/sourceB.txt  
Loading data to table default.testb partition (create_time=2015-07-09)  
Partition default.testb{create_time=2015-07-09} stats: [numFiles=1, numRows=0, totalSize=73, rawDataSize=0]
OK
Time taken: 0.212 seconds  
hive> select * from testA;  
OK  
1   fish1   SZ  2015-07-08
2   fish2   SH  2015-07-08
3   fish3   HZ  2015-07-08
4   fish4   QD  2015-07-08
5   fish5   SR  2015-07-08
Time taken: 0.029 seconds, Fetched: 5 row(s)
hive> select * from testB;  
OK  
1   zy1 SZ  1001    2015-07-09
2   zy2 SH  1002    2015-07-09
3   zy3 HZ  1003    2015-07-09
4   zy4 QD  1004    2015-07-09
5   zy5 SR  1005    2015-07-09
Time taken: 0.047 seconds, Fetched: 5 row(s)

(2) Hive table into a Hive table

Load data from testB into table testA

hive> INSERT INTO TABLE testA PARTITION(create_time='2015-07-11') select id, name, area from testB where id = 1;
...(omitted)
OK
Time taken: 14.744 seconds
hive> INSERT INTO TABLE testA PARTITION(create_time) select id, name, area, code from testB where id = 2;
...(omitted)
OK
Time taken: 19.852 seconds
hive> select * from testA;
OK
2   zy2 SH  1002
1   fish1   SZ  2015-07-08
2   fish2   SH  2015-07-08
3   fish3   HZ  2015-07-08
4   fish4   QD  2015-07-08
5   fish5   SR  2015-07-08
1   zy1 SZ  2015-07-11
Time taken: 0.032 seconds, Fetched: 7 row(s)

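Notes:

1. The row of testB with id=1 is inserted into testA under the static partition create_time='2015-07-11'.

2. The row of testB with id=2 is inserted into testA using a dynamic partition: create_time takes the value of the last selected column, code, which is 1002 for that row.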
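The second statement relies on dynamic partitioning. Depending on the Hive version and cluster configuration, this may need to be enabled for the session first, and strict mode rejects an insert in which every partition column is dynamic. A hedged sketch that writes the statements to a script file; the file path is hypothetical, while hive.exec.dynamic.partition and hive.exec.dynamic.partition.mode are standard Hive settings:

```shell
#!/bin/sh
# Hypothetical script path, chosen for this example only.
HQL_FILE="${HQL_FILE:-/tmp/dynamic_partition_insert.hql}"

cat > "$HQL_FILE" <<'EOF'
-- Allow dynamic partitions, including inserts with no static partition column.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- The last selected column (code) supplies the create_time partition value.
INSERT INTO TABLE testA PARTITION(create_time)
SELECT id, name, area, code FROM testB WHERE id = 2;

-- List the partitions to see which one was created.
SHOW PARTITIONS testA;
EOF

# Print the script so it can be reviewed before it is run.
cat "$HQL_FILE"
```

The file could then be run with hive -f "$HQL_FILE", the same -f mechanism this answer uses for exports.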

(3) HDFS file into a Hive table

Upload sourceA.txt and sourceB.txt to HDFS, at the paths /home/hadoop/sourceA.txt and /home/hadoop/sourceB.txt respectively.

hive> LOAD DATA INPATH '/home/hadoop/sourceA.txt' INTO TABLE testA PARTITION(create_time='2015-07-08');
...(omitted)
OK
Time taken: 0.237 seconds
hive> LOAD DATA INPATH '/home/hadoop/sourceB.txt' INTO TABLE testB PARTITION(create_time='2015-07-09');
...(omitted)
OK
Time taken: 0.212 seconds
hive> select * from testA;
OK
1   fish1   SZ  2015-07-08
2   fish2   SH  2015-07-08
3   fish3   HZ  2015-07-08
4   fish4   QD  2015-07-08
5   fish5   SR  2015-07-08
Time taken: 0.029 seconds, Fetched: 5 row(s)
hive> select * from testB;
OK
1   zy1 SZ  1001    2015-07-09
2   zy2 SH  1002    2015-07-09
3   zy3 HZ  1003    2015-07-09
4   zy4 QD  1004    2015-07-09
5   zy5 SR  1005    2015-07-09
Time taken: 0.047 seconds, Fetched: 5 row(s)

/home/hadoop/sourceA.txt is loaded into table testA

/home/hadoop/sourceB.txt is loaded into table testB

(4) From another table while creating a table

hive> create table testC as select name, code from testB;  
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1449746265797_0106, Tracking URL = http://hadoopcluster79:8088/proxy/application_1449746265797_0106/  
Kill Command = /home/hadoop/apache/hadoop-2.4.1/bin/hadoop job  -kill job_1449746265797_0106  
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0  
2015-12-24 16:40:17,981 Stage-1 map = 0%,  reduce = 0%  
2015-12-24 16:40:23,115 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.11 sec  
MapReduce Total cumulative CPU time: 1 seconds 110 msec  
Ended Job = job_1449746265797_0106  
Stage-4 is selected by condition resolver.  
Stage-3 is filtered out by condition resolver.  
Stage-5 is filtered out by condition resolver.  
Moving data to: hdfs://hadoop2cluster/tmp/hive-root/hive_2015-12-24_16-40-09_983_6048680148773453194-1/-ext-10001  
Moving data to: hdfs://hadoop2cluster/home/hadoop/hivedata/warehouse/testc  
Table default.testc stats: [numFiles=1, numRows=0, totalSize=45, rawDataSize=0]  
MapReduce Jobs Launched:   
Job 0: Map: 1   Cumulative CPU: 1.11 sec   HDFS Read: 297 HDFS Write: 45 SUCCESS  
Total MapReduce CPU Time Spent: 1 seconds 110 msec  
OK  
Time taken: 14.292 seconds  
hive> desc testC;  
OK  
name                    string                                        
code                    string                                        
Time taken: 0.032 seconds, Fetched: 2 row(s)

II. Ways to export data from Hive

(1) Export to the local file system

hive> INSERT OVERWRITE LOCAL DIRECTORY '/home/hadoop/output' ROW FORMAT DELIMITED FIELDS TERMINATED by ',' select * from testA;  
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1451024007879_0001, Tracking URL = http://hadoopcluster79:8088/proxy/application_1451024007879_0001/  
Kill Command = /home/hadoop/apache/hadoop-2.4.1/bin/hadoop job  -kill job_1451024007879_0001  
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0  
2015-12-25 17:04:30,447 Stage-1 map = 0%,  reduce = 0%  
2015-12-25 17:04:35,616 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.16 sec  
MapReduce Total cumulative CPU time: 1 seconds 160 msec  
Ended Job = job_1451024007879_0001  
Copying data to local directory /home/hadoop/output  
Copying data to local directory /home/hadoop/output  
MapReduce Jobs Launched:   
Job 0: Map: 1   Cumulative CPU: 1.16 sec   HDFS Read: 305 HDFS Write: 110 SUCCESS  
Total MapReduce CPU Time Spent: 1 seconds 160 msec  
OK  
Time taken: 16.701 seconds

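Check the result:

[hadoop@hadoopcluster78 output]$ cat /home/hadoop/output/000000_0
1,fish1,SZ,2015-07-08
2,fish2,SH,2015-07-08
3,fish3,HZ,2015-07-08
4,fish4,QD,2015-07-08
5,fish5,SR,2015-07-08

INSERT OVERWRITE LOCAL DIRECTORY exports the data of Hive table testA to the local directory /home/hadoop/output. The HQL statement launches a MapReduce job to do the work, and /home/hadoop/output is in effect that job's output path; the result is stored in a file named 000000_0.

(2) Export to HDFS

Exporting to HDFS is similar to exporting to a local file: just drop LOCAL from the HQL statement.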

hive> INSERT OVERWRITE DIRECTORY '/home/hadoop/output' select * from testA;   
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1451024007879_0002, Tracking URL = http://hadoopcluster79:8088/proxy/application_1451024007879_0002/  
Kill Command = /home/hadoop/apache/hadoop-2.4.1/bin/hadoop job  -kill job_1451024007879_0002  
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0  
2015-12-25 17:08:51,034 Stage-1 map = 0%,  reduce = 0%  
2015-12-25 17:08:59,313 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.4 sec  
MapReduce Total cumulative CPU time: 1 seconds 400 msec  
Ended Job = job_1451024007879_0002  
Stage-3 is selected by condition resolver.  
Stage-2 is filtered out by condition resolver.  
Stage-4 is filtered out by condition resolver.  
Moving data to: hdfs://hadoop2cluster/home/hadoop/hivedata/hive-hadoop/hive_2015-12-25_17-08-43_733_1768532778392261937-1/-ext-10000  
Moving data to: /home/hadoop/output  
MapReduce Jobs Launched:   
Job 0: Map: 1   Cumulative CPU: 1.4 sec   HDFS Read: 305 HDFS Write: 110 SUCCESS  
Total MapReduce CPU Time Spent: 1 seconds 400 msec  
OK  
Time taken: 16.667 seconds

Check the HDFS output file:

[hadoop@hadoopcluster78 bin]$ ./hadoop fs -cat /home/hadoop/output/000000_0
1fish1SZ2015-07-08
2fish2SH2015-07-08
3fish3HZ2015-07-08
4fish4QD2015-07-08
5fish5SR2015-07-08
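The fields in this HDFS export look fused together because the statement gave no ROW FORMAT, so Hive wrote its default field delimiter, the non-printing character \001 (Ctrl-A). A small sketch of turning such text into comma-separated lines with tr; the sample line is built with printf to stand in for the real file, and to_csv is a hypothetical helper name:

```shell
#!/bin/sh
# to_csv: read Hive default-delimited text (\001 between fields) on stdin
# and write it out with commas instead.
to_csv() {
    tr '\001' ','
}

# Stand-in for one line of /home/hadoop/output/000000_0.
printf '1\001fish1\001SZ\0012015-07-08\n' | to_csv
```

On the cluster the same helper would sit at the end of a pipe, for example ./hadoop fs -cat /home/hadoop/output/000000_0 | to_csv. This assumes no field value itself contains a comma.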

Other

You can also export data with hive's -e and -f options.

The -e option is followed by an SQL statement; the path after >> is the output file.

[hadoop@hadoopcluster78 bin]$ ./hive -e "select * from testA" >> /home/hadoop/output/testA.txt  
15/12/25 17:15:07 WARN conf.HiveConf: DEPRECATED: hive.metastore.ds.retry.* no longer has any effect.  Use hive.hmshandler.retry.* instead  
  
Logging initialized using configuration in file:/home/hadoop/apache/hive-0.13.1/conf/hive-log4j.properties  
OK  
Time taken: 1.128 seconds, Fetched: 5 row(s)  
[hadoop@hadoopcluster78 bin]$ cat /home/hadoop/output/testA.txt   
1   fish1   SZ  2015-07-08
2   fish2   SH  2015-07-08
3   fish3   HZ  2015-07-08
4   fish4   QD  2015-07-08
5   fish5   SR  2015-07-08

The -f option is followed by a file containing the SQL statements; the path after >> is the output file.

SQL file:

[hadoop@hadoopcluster78 bin]$ cat /home/hadoop/output/sql.sql   
select * from testA

Run it with the -f option:

[hadoop@hadoopcluster78 bin]$ ./hive -f /home/hadoop/output/sql.sql >> /home/hadoop/output/testB.txt  
15/12/25 17:20:52 WARN conf.HiveConf: DEPRECATED: hive.metastore.ds.retry.* no longer has any effect.  Use hive.hmshandler.retry.* instead  
  
Logging initialized using configuration in file:/home/hadoop/apache/hive-0.13.1/conf/hive-log4j.properties  
OK  
Time taken: 1.1 seconds, Fetched: 5 row(s)

Check the result:

[hadoop@hadoopcluster78 bin]$ cat /home/hadoop/output/testB.txt
1   fish1   SZ  2015-07-08
2   fish2   SH  2015-07-08
3   fish3   HZ  2015-07-08
4   fish4   QD  2015-07-08
5   fish5   SR  2015-07-08
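Two details of the -e and -f approach are easy to miss: >> appends to the output file, so repeated runs accumulate duplicate rows, and the CLI separates columns with tabs rather than commas. A sketch of post-processing such output into CSV; sample_rows stands in for a real hive -e call, and the function names and output path are hypothetical. It assumes no field contains a tab or a comma:

```shell
#!/bin/sh
# tsv_to_csv: convert tab-separated CLI output on stdin to comma-separated.
tsv_to_csv() {
    tr '\t' ','
}

OUT="${OUT:-/tmp/testA.csv}"   # hypothetical output path

# Stand-in for: hive -e "select * from testA"
sample_rows() {
    printf '1\tfish1\tSZ\t2015-07-08\n2\tfish2\tSH\t2015-07-08\n'
}

# '>' truncates the file first, so running the export twice
# still leaves exactly one copy of the rows.
sample_rows | tsv_to_csv > "$OUT"
sample_rows | tsv_to_csv > "$OUT"
cat "$OUT"
```

With >> in place of >, the second run would have left four lines in the file instead of two.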

苏橙C

#4 · 2020-09-23 10:23

1. Import data from the local file system into a Hive table

(1) Prepare the data (/home/sopdm/test.dat):

1,wyp,25,13188888888

2,test,30,13899999999

3,zs,34,89931412

(2) Create the table first

use sopdm;

drop table if exists sopdm.wyp;

create table if not exists sopdm.wyp(id int,name string,age int,tel string)

row format delimited

fields terminated by ','

stored as textfile;

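(3) Load the data from the local file system into the Hive table

load data local inpath '/home/sopdm/test.dat' into table sopdm.wyp;

(4) You can inspect the data directory of table wyp with the following command, run from the Hive CLI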

dfs -ls /user/sopdm/hive/warehouse/sopdm.db/wyp;

魏魏姐

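#5 · 2020-09-23 11:02

How do you check the data after a Sqoop import into Hive? The number of parallel import processes defaults to 4. If the data set is small, it is worth setting it to 1, which makes the import faster. In general, Sqoop uses the primary key to split the data evenly. For parallel imports you can also configure the split column and related options; see the official documentation for details.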
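As a rough illustration of the two settings mentioned here, the sketch below only prints a Sqoop command line rather than running it. The host, database, user and table names are placeholders invented for the example; --num-mappers, --split-by, --hive-import and --hive-table are standard Sqoop import options, and sqoop_import_cmd is a hypothetical helper name:

```shell
#!/bin/sh
# sqoop_import_cmd MAPPERS [SPLIT_COLUMN]: print a sqoop import command line.
# Host, database, user and table names below are placeholders.
sqoop_import_cmd() {
    cmd="sqoop import --connect jdbc:mysql://dbhost:3306/testdb"
    cmd="$cmd --username etl_user -P --table wyp"
    cmd="$cmd --hive-import --hive-table sopdm.wyp"
    cmd="$cmd --num-mappers $1"
    # A split column only matters when more than one mapper is used.
    if [ -n "${2:-}" ] && [ "$1" -gt 1 ]; then
        cmd="$cmd --split-by $2"
    fi
    echo "$cmd"
}

# Small table: a single mapper, no splitting.
sqoop_import_cmd 1
# Larger table: the default parallelism of 4, split on the id column.
sqoop_import_cmd 4 id
```

Printing the command first makes it easy to review the connection details before pasting it into a shell on a machine where Sqoop is installed.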