add s3 writer #1

Open
wants to merge 2 commits into
base: master
7 changes: 7 additions & 0 deletions package.xml
@@ -203,6 +203,13 @@
            </includes>
            <outputDirectory>datax</outputDirectory>
        </fileSet>
        <fileSet>
            <directory>s3writer/target/datax/</directory>
            <includes>
                <include>**/*.*</include>
            </includes>
            <outputDirectory>datax</outputDirectory>
        </fileSet>
        <fileSet>
            <directory>ftpwriter/target/datax/</directory>
            <includes>
@@ -35,4 +35,13 @@ public class Key {

    // writer file type suffix, like .txt .csv
    public static final String SUFFIX = "suffix";

    public static final String S3_BUCKET = "s3Bucket";

    public static final String S3_ACCESS_KEY = "s3AccessKey";

    public static final String S3_SECRET_KEY = "s3SecretKey";

    public static final String S3_ENDPOINT = "s3Endpoint";

}
1 change: 1 addition & 0 deletions pom.xml
@@ -82,6 +82,7 @@
        <module>rdbmswriter</module>
        <module>hbase11xwriter</module>
        <module>hbase094xwriter</module>
        <module>s3writer</module>

        <!-- some support module -->
        <module>plugin-rdbms-util</module>
207 changes: 207 additions & 0 deletions s3writer/doc/s3writer.md
@@ -0,0 +1,207 @@
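# DataX S3Writer Guide


------------

## 1 Quick Introduction

S3Writer writes one or more CSV-like table files to S3.

**The content written to an S3 file is a logical two-dimensional table, for example text in CSV format.**


## 2 Features and Limitations

S3Writer converts data from the DataX protocol into S3 TXT files. S3 itself stores unstructured data, so S3Writer follows these conventions:

1. Only TXT files are supported, and the schema of each TXT file must be a two-dimensional table.

2. CSV-like files with a custom delimiter are supported.

3. Text compression is supported; the available formats are gzip and bzip2.

4. Multi-threaded writing is supported; each thread writes a different sub-file.

5. File rolling, where the writer switches to a new file once the current one exceeds a size or row-count threshold. [Not yet supported]

What S3Writer cannot do:

1. Concurrent writes to a single file are not supported.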
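
Because a single file cannot be written concurrently, each writer thread needs its own object key. The following is a minimal sketch of how such keys could be derived, not the plugin's actual code: the class name `SubFileNameSketch`, the `__` separator, and the UUID-based suffix are assumptions made for illustration.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.UUID;

final class SubFileNameSketch {

    /**
     * Builds one distinct object key per writer task, shaped as
     * path/fileName__randomSuffix. The "__" separator and the UUID suffix
     * are illustrative assumptions, not the plugin's confirmed naming scheme.
     */
    static Set<String> objectKeys(String path, String fileName, int taskCount) {
        Set<String> keys = new LinkedHashSet<String>();
        // A set guarantees that no two tasks ever share a key.
        while (keys.size() < taskCount) {
            String suffix = UUID.randomUUID().toString().replace("-", "");
            keys.add(path + "/" + fileName + "__" + suffix);
        }
        return keys;
    }
}
```

With `"channel": 10` in the job settings, `objectKeys("xxx/xxx", "yyy", 10)` would yield ten distinct keys, one per thread.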


## 3 Function Description


### 3.1 Sample Configuration

```json
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "column": ["*"],
                        "connection": [
                            {
                                "jdbcUrl": ["jdbc:mysql://xxx:3306/xxx"],
                                "table": ["yyy"]
                            }
                        ],
                        "password": "root",
                        "username": "root",
                        "where": ""
                    }
                },
                "writer": {
                    "name": "s3writer",
                    "parameter": {
                        "s3Bucket": "xxx",
                        "s3AccessKey": "xxx",
                        "s3SecretKey": "xxx+",
                        "s3Endpoint": "s3.cn-north-1.amazonaws.com.cn",

                        "dateFormat": "",
                        "fieldDelimiter": ",",
                        "fileName": "yyy",
                        "path": "xxx/xxx",
                        "writeMode": "truncate"
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": 10
            }
        }
    }
}
```
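
The writer `parameter` block in the sample mixes S3 connection settings with file settings. Below is a rough sketch of a pre-flight check for missing values, using a plain `Map` as a stand-in for the DataX configuration object. The class name `WriterParameterCheckSketch` is hypothetical, and treating the four `s3*` keys as required is an assumption; this document only marks `path`, `fileName`, and `writeMode` as required.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

final class WriterParameterCheckSketch {

    // path, fileName and writeMode are documented as required;
    // the s3* keys are ASSUMED to be required as well.
    static final List<String> REQUIRED = Arrays.asList(
            "path", "fileName", "writeMode",
            "s3Bucket", "s3AccessKey", "s3SecretKey", "s3Endpoint");

    /** Returns the required keys that are absent or blank in the writer parameters. */
    static List<String> missingKeys(Map<String, String> parameter) {
        List<String> missing = new ArrayList<String>();
        for (String key : REQUIRED) {
            String value = parameter.get(key);
            if (value == null || value.trim().isEmpty()) {
                missing.add(key);
            }
        }
        return missing;
    }
}
```

An empty result means every key in `REQUIRED` has a non-blank value; a real plugin would more likely raise a DataX error for the first missing key.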

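### 3.2 Parameter Description

* **path**

    * Description: path in the S3 file system. S3Writer writes multiple files under this path. <br />

    * Required: yes <br />

    * Default: none <br />

* **fileName**

    * Description: name of the file S3Writer writes. A random suffix is appended to this name to form the actual file name written by each thread. <br />

    * Required: yes <br />

    * Default: none <br />

* **writeMode**

    * Description: how S3Writer cleans up existing data before writing: <br />

        * truncate: before writing, delete every file under the directory whose name starts with fileName.
        * append: no cleanup before writing; DataX S3Writer writes using fileName directly and guarantees that file names do not conflict.
        * nonConflict: if the directory already contains a file whose name starts with fileName, fail with an error.

    * Required: yes <br />

    * Default: none <br />

* **fieldDelimiter**

    * Description: field delimiter used when writing. <br />

    * Required: no <br />

    * Default: , <br />

* **compress**

    * Description: text compression type. Leaving it unset means no compression. Supported compression types are gzip and bzip2. <br />

    * Required: no <br />

    * Default: no compression <br />

* **encoding**

    * Description: encoding of the written file. <br />

    * Required: no <br />

    * Default: utf-8 <br />


* **nullFormat**

    * Description: a text file has no standard string for null (the null pointer), so DataX provides nullFormat to define which string represents null. <br />

         For example, with nullFormat="\N", source data equal to "\N" is treated by DataX as a null field.

    * Required: no <br />

    * Default: \N <br />

* **dateFormat**

    * Description: format used when date values are serialized to the file, for example "dateFormat": "yyyy-MM-dd". <br />

    * Required: no <br />

    * Default: none <br />

* **fileFormat**

    * Description: output file format, either csv (http://zh.wikipedia.org/wiki/%E9%80%97%E5%8F%B7%E5%88%86%E9%9A%94%E5%80%BC) or text. csv is strict CSV: if a value contains the column delimiter, it is escaped following CSV escaping rules, with the double quote (") as the escape character. text simply joins values with the column delimiter and does not escape values that contain it. <br />

    * Required: no <br />

    * Default: text <br />

* **header**

    * Description: header row written to the txt output, for example ['id', 'name', 'age']. <br />

    * Required: no <br />

    * Default: none <br />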
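
The three `writeMode` values differ only in what happens to objects that already share the `fileName` prefix. A minimal sketch of that decision follows, assuming the existing keys have already been listed from the bucket; `WriteModeSketch` and `keysToDelete` are hypothetical names, not part of the plugin.

```java
import java.util.ArrayList;
import java.util.List;

final class WriteModeSketch {

    /**
     * Decides which existing object keys must be removed before writing.
     * existingKeys stands in for a bucket listing (an assumption of this sketch);
     * prefix is the path joined with fileName.
     */
    static List<String> keysToDelete(String writeMode, String prefix, List<String> existingKeys) {
        List<String> matching = new ArrayList<String>();
        for (String key : existingKeys) {
            if (key.startsWith(prefix)) {
                matching.add(key);
            }
        }
        if ("truncate".equals(writeMode)) {
            // Clear every object that starts with the fileName prefix.
            return matching;
        }
        if ("append".equals(writeMode)) {
            // Leave existing objects untouched.
            return new ArrayList<String>();
        }
        if ("nonConflict".equals(writeMode)) {
            // Refuse to write when a prefixed object already exists.
            if (!matching.isEmpty()) {
                throw new IllegalStateException("objects already exist with prefix " + prefix);
            }
            return new ArrayList<String>();
        }
        throw new IllegalArgumentException("unsupported writeMode: " + writeMode);
    }
}
```

Keeping the decision separate from the S3 calls means it can be checked without a bucket.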

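### 3.3 Type Conversion


S3 files carry no data types of their own; the types below are defined by DataX S3Writer:

| DataX internal type | S3 file data type |
| -------- | ----- |
| Long | Long |
| Double | Double |
| String | String |
| Boolean | Boolean |
| Date | Date |

Where:

* S3 file Long means the integer is written as its string representation in the S3 file text, for example "19901219".
* S3 file Double means the Double is written as its string representation in the S3 file text, for example "3.1415".
* S3 file Boolean means the Boolean is written as its string representation in the S3 file text, for example "true" or "false". It is case-insensitive.
* S3 file Date means the Date is written as its string representation in the S3 file text, for example "2014-12-31". The format of a Date can be specified.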
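
Since every type ends up as text, writing a record amounts to turning each column into a string and joining the results. The sketch below illustrates this together with the `dateFormat`, `nullFormat`, and `fileFormat` parameters. It is an illustration under assumptions, not the plugin's code: `RecordLineSketch` is a hypothetical class, writing `nullFormat` for null columns is assumed, and a non-empty `dateFormat` is required.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.List;

final class RecordLineSketch {

    /** Renders one record as a line of text; each column becomes its string form. */
    static String toLine(List<Object> columns, char delimiter, String nullFormat,
                         String dateFormat, boolean csv) {
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < columns.size(); i++) {
            if (i > 0) {
                line.append(delimiter);
            }
            String text = toText(columns.get(i), nullFormat, dateFormat);
            // csv escapes values containing the delimiter; text writes them as-is.
            line.append(csv ? escapeCsv(text, delimiter) : text);
        }
        return line.toString();
    }

    static String toText(Object value, String nullFormat, String dateFormat) {
        if (value == null) {
            return nullFormat; // ASSUMPTION: null columns are written as nullFormat
        }
        if (value instanceof Date) {
            return new SimpleDateFormat(dateFormat).format((Date) value);
        }
        return String.valueOf(value); // Long, Double, Boolean, String
    }

    static String escapeCsv(String text, char delimiter) {
        boolean needsQuotes = text.indexOf(delimiter) >= 0 || text.indexOf('"') >= 0
                || text.indexOf('\n') >= 0 || text.indexOf('\r') >= 0;
        if (!needsQuotes) {
            return text;
        }
        // Wrap in double quotes and double any embedded quote.
        return '"' + text.replace("\"", "\"\"") + '"';
    }
}
```

The only difference between the two formats shows up for a value such as `a,b`: csv wraps it in double quotes, while text leaves the embedded delimiter in place.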


## 4 Performance Report


## 5 Constraints and Limitations


## 6 FAQ



88 changes: 88 additions & 0 deletions s3writer/pom.xml
@@ -0,0 +1,88 @@
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <parent>
        <groupId>com.alibaba.datax</groupId>
        <artifactId>datax-all</artifactId>
        <version>0.0.1-SNAPSHOT</version>
    </parent>

    <artifactId>s3writer</artifactId>
    <name>s3writer</name>
    <description>S3Writer provides local TEXT writing; recommended for development and test environments.</description>
    <packaging>jar</packaging>

    <dependencies>
        <dependency>
            <groupId>com.alibaba.datax</groupId>
            <artifactId>datax-common</artifactId>
            <version>${datax-project-version}</version>
            <exclusions>
                <exclusion>
                    <artifactId>slf4j-log4j12</artifactId>
                    <groupId>org.slf4j</groupId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>com.alibaba.datax</groupId>
            <artifactId>plugin-unstructured-storage-util</artifactId>
            <version>${datax-project-version}</version>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-api</artifactId>
        </dependency>
        <dependency>
            <groupId>ch.qos.logback</groupId>
            <artifactId>logback-classic</artifactId>
        </dependency>
        <dependency>
            <groupId>com.google.guava</groupId>
            <artifactId>guava</artifactId>
            <version>16.0.1</version>
Review comment (Member): This version seems a bit too low.
        </dependency>
        <dependency>
            <groupId>com.amazonaws</groupId>
            <artifactId>aws-java-sdk-core</artifactId>
            <version>1.11.52</version>
        </dependency>
        <dependency>
            <groupId>com.amazonaws</groupId>
            <artifactId>aws-java-sdk-s3</artifactId>
            <version>1.11.52</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <!-- compiler plugin -->
            <plugin>
                <artifactId>maven-compiler-plugin</artifactId>
                <configuration>
                    <source>1.6</source>
                    <target>1.6</target>
                    <encoding>${project-sourceEncoding}</encoding>
                </configuration>
            </plugin>
            <plugin>
                <artifactId>maven-assembly-plugin</artifactId>
                <configuration>
                    <descriptors>
                        <descriptor>src/main/assembly/package.xml</descriptor>
                    </descriptors>
                    <finalName>datax</finalName>
                </configuration>
                <executions>
                    <execution>
                        <id>dwzip</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
35 changes: 35 additions & 0 deletions s3writer/src/main/assembly/package.xml
@@ -0,0 +1,35 @@
<assembly
        xmlns="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.0"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.0 http://maven.apache.org/xsd/assembly-1.1.0.xsd">
    <id></id>
    <formats>
        <format>dir</format>
    </formats>
    <includeBaseDirectory>false</includeBaseDirectory>
    <fileSets>
        <fileSet>
            <directory>src/main/resources</directory>
            <includes>
                <include>plugin.json</include>
                <include>plugin_job_template.json</include>
            </includes>
            <outputDirectory>plugin/writer/s3writer</outputDirectory>
        </fileSet>
        <fileSet>
            <directory>target/</directory>
            <includes>
                <include>s3writer-0.0.1-SNAPSHOT.jar</include>
            </includes>
            <outputDirectory>plugin/writer/s3writer</outputDirectory>
        </fileSet>
    </fileSets>

    <dependencySets>
        <dependencySet>
            <useProjectArtifact>false</useProjectArtifact>
            <outputDirectory>plugin/writer/s3writer/libs</outputDirectory>
            <scope>runtime</scope>
        </dependencySet>
    </dependencySets>
</assembly>
@@ -0,0 +1,6 @@
package com.alibaba.datax.plugin.writer.s3writer;

public class Key {
    // must have
    public static final String PATH = "path";
}