
tt crud import

Ilya Grishnov Igorevich edited this page Jun 24, 2022 · 2 revisions

Welcome to the tt crud import wiki, which describes the current functionality and limitations of this implementation.

  • tt crud - module for interacting with the CRUD module of Tarantool.
  • tt crud import - subcommand for importing data into Tarantool via CRUD.

Current type restrictions

The following types of target space are supported:

  • boolean (the values [true,t] in input data fields are interpreted as true if the space format field requires a boolean; [false,f] are interpreted as false)
  • string
  • integer
  • unsigned
  • double (unstable, see issue)
  • number
  • decimal

Before a record is imported into a space, its field values are converted according to the format of that space. If a field in the target space has a type that is not listed above, an attempt will be made to insert the value as a string.

At the moment, the thousands separators [`] (backtick) and [ ] (space) are supported by default for numeric values, so -1`000`000 and -1 000 000 both become -1000000. Exponential notation for numeric values in the input data is also accepted (for example, -2.7e+20). The decimal separator is [.].
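The normalization described above can be sketched as follows. This is an illustrative Python sketch of the behavior, not the tool's actual code (the tool is implemented in Go):

```python
# Sketch of numeric normalization: strip the supported thousands
# separators (backtick and space) before parsing the value.
def normalize_numeric(raw: str) -> str:
    return raw.replace("`", "").replace(" ", "")

print(normalize_numeric("-1`000`000"))   # -1000000
print(normalize_numeric("-1 000 000"))   # -1000000
print(float("-2.7e+20"))                 # exponential notation parses as-is
```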

For the CSV format, only the standard delimiter ([,]), quote (["]), and escape (a doubled quote character) characters are currently supported, as in RFC 4180.
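The RFC 4180 quoting rules can be illustrated with Python's csv module, whose default dialect follows the same conventions (a field containing the delimiter is quoted, and a quote inside a field is escaped by doubling it). This is only an illustration, not the tool's code:

```python
import csv
import io

# A quoted field containing the delimiter, and a doubled quote inside a field.
line = '7,700,"Flanders, Ned","likes ""donuts"""\n'
row = next(csv.reader(io.StringIO(line)))
print(row)  # ['7', '700', 'Flanders, Ned', 'likes "donuts"']
```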

By default, a null value is expressed by two consecutive delimiter characters, so valOfField1,,valOfField3 in the input file will be interpreted as valOfField1,null,valOfField3 at the insertion stage.
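The default null interpretation can be sketched like this (an illustrative Python sketch of the rule, not the tool's implementation):

```python
import csv
import io

# Two consecutive delimiters produce an empty field,
# which is interpreted as null at insertion time.
line = "valOfField1,,valOfField3"
row = next(csv.reader(io.StringIO(line)))
parsed = [None if field == "" else field for field in row]
print(parsed)  # ['valOfField1', None, 'valOfField3']
```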

There is no generation of unique keys for inserted data. There is no support for data types such as uuid or timestamp.

CLI description

USAGE

tt crud import URI FILE SPACE [flags]

  • URI - address of the router.
  • FILE - file with input data for import.
  • SPACE - name of the target space to import the data into.

FLAGS

  • --batch-size uint32

Crud batch size during import. (default 100)

  • --error string

Name of the file with rows that were not imported. Overwrites an existing file. (default "error")

  • --format string

Format of input data. Currently only "csv" is supported. (default "csv")

  • --header

Treats the first line as a header, not as data.

  • -h, --help

Help for import.

  • --log string

Name of the log file with information about errors that occurred. Appends to an existing file. (default "import")

  • --match string

Use the correspondence between the header fields in the input file and the target space fields. Currently requires the --header option. If the space format contains fields that are not present in the header, an attempt will be made to insert null into them. Fields present in the header but not in the space format are ignored. Currently only the value "header" is supported for this option; setting a manual match, like "spaceId=csvFoo,spaceName=csvBar,...", is not yet possible.

  • --null string

Sets the value to be interpreted as NULL when importing. By default, this is an empty value. CSV example: in field1val,,field3val, the empty second field will be taken as NULL.

  • --onerror string

If an error occurs, either skips the problematic line and continues, or stops the work. Allowed values: "stop", "skip". (default "stop")

  • --operation string

Operation to perform: "insert" or "replace". (default "insert")

  • -p, --password string

Connection password.

  • --progress

Takes into account the progress file from the last launch. The file stores the positions of lines that could not be imported during the last launch, as well as the stop position from that launch. As a result, an attempt is made to re-insert the lines at the stored positions, and then work continues from the stop position. On each launch, the content of the progress.json file is completely overwritten.

  • --success string

Name of the file with rows that were imported. Overwrites an existing file. (default "success")

  • -u, --username string

Connection username.
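The --match=header mapping rules can be sketched as follows. This is an illustrative Python sketch only (the space format and values below are made up for the example; the tool itself is implemented in Go):

```python
# --match=header sketch: header fields are matched to space-format fields
# by name; space fields missing from the header get null, and header fields
# missing from the space format are ignored.
space_format = ["id", "bucket_id", "name", "surname"]  # example format
header = ["id", "name", "comment"]  # "comment" is not in the space format
row = {"id": "7", "name": "Ned", "comment": "ignored"}

tuple_for_insert = [row.get(field) for field in space_format]
print(tuple_for_insert)  # ['7', None, 'Ned', None]
```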

Demonstration and examples

Output format

$ ./tt crud import localhost:3301 ./Employees.csv mytestspace --username=guest --header --match=header --onerror=skip --operation=replace

   • Running crud import:
    
Crud init complete:	[true]
Target space exist:	[true]
In case of error:	[skip]
PID of this process:	[49148]

WARNING: Process is not sensitive to SIGINT (ctrl+c), use kill -9 49148

[ read/ignored : 30000/0 | parsed ok/err : 30000/0 | import ok/err : 30000/0 ]

	IMPORT SUMMARY

	total read:		 30000
	ignored (--progress):	 0
	parsed success:		 30000
	parsed error:		 0
	import success:		 30000
	import error:		 0
	speed (rec per sec):	 6746

	import logs file:	 import.log
	failed recs file:	 error.csv
	imported recs file:	 success.csv
	import progress file:	progress.json

Demonstration

For demonstration, you can use the playground.lua file from the crud repository. Please note that crud import requires a crud version with batching support (crud.insert_many(...), crud.replace_many(...), etc.) and vshard (vshard is a playground.lua requirement, not a crud import requirement).

Start playground:

$ git clone https://github.com/tarantool/crud.git
$ cd crud
$ tarantoolctl rocks make
$ ./doc/playground.lua

Check playground via tt connect:

$ tt connect localhost:3301
# localhost:3301> crud.select('developers')
---
- metadata: [{'name': 'id', 'type': 'unsigned'}, {'name': 'bucket_id', 'type': 'unsigned'},
    {'name': 'name', 'type': 'string'}, {'type': 'string', 'name': 'surname', 'is_nullable': true},
    {'type': 'number', 'name': 'age', 'is_nullable': false}]
  rows:
  - [1, 477, 'Alexey', 'Adams', 20]
  - [2, 401, 'Sergey', 'Allred', 21]
  - [3, 2804, 'Pavel', 'Adams', 27]
  - [4, 1161, 'Mikhail', 'Liston', 51]
  - [5, 1172, 'Dmitry', 'Jacobi', 16]
  - [6, 1064, 'Alexey', 'Sidorov', 31]
- null
...

Input file (the 2nd record has a duplicate key problem; the 1st and 3rd are OK):

$ cat developers.csv
id,bucket_id,name,surname,age
7,700,"Ned",Flanders,35
7,900,Marge,Simpson,33
8,800,Homer,"Simpson",40

Run import:

$ tt crud import localhost:3301 ./developers.csv developers --username=guest --header --match=header --onerror=stop --batch-size=1

   • Running crud import:
    
Crud init complete:	[true]
Target space exist:	[true]
In case of error:	[stop]
PID of this process:	[57709]

WARNING: Process is not sensitive to SIGINT (ctrl+c), use kill -9 57709

[ read/ignored : 1/0 | parsed ok/err : 1/0 | import ok/err : 1/0 ]

Onerror: An error has occurred, the import has been stopped. See the details in the log, error, success and progress files.


	IMPORT SUMMARY

	total read:		 2
	ignored (--progress):	 0
	parsed success:		 2
	parsed error:		 0
	import success:		 1
	import error:		 1
	speed (rec per sec):	 89

	import logs file:	 import.log
	failed recs file:	 error.csv
	imported recs file:	 success.csv
	import progress file:	progress.json

Logs file:

$ cat import.log

			Crud import start work timestamp: 2022-06-20 11:53:09.333438 +0300 MSK m=+0.030275632
			File description: this file contains logs information about occurred errors
			Note: errors are indicated relative to the space format, not relative to the header

...
timestamp: 2022-06-20 11:53:09.341663 +0300 MSK m=+0.038500590
line position: 3
problem record: 7,900,Marge,Simpson,33
error description: CallError: Failed for 33cba7a0-8420-452d-b8b6-0f96fd63f4fa: Function returned an error: Duplicate key exists in unique index "primary_index" in space "developers" with old tuple - [7, 700, "Ned", "Flanders", 35] and new tuple - [7, 900, "Marge", "Simpson", 33]
...

Success file:

$ cat success.csv
id,bucket_id,name,surname,age
7,700,"Ned",Flanders,35

Error file:

$ cat error.csv
id,bucket_id,name,surname,age
7,900,Marge,Simpson,33

Progress file:

$ cat progress.json
{
    "startTimestamp":"2022-06-20 11:53:09.319323 +0300 MSK m=+0.016160728",
    "lastDumpTimestamp":"2022-06-20 11:53:09.341703 +0300 MSK m=+0.038540474",
    "endOfFileReached":false,
    "lastPosition":3,
    "retryPositions":[3]
}
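The way a --progress rerun uses this file can be sketched as follows. This is an illustrative Python sketch of the semantics (re-try the positions in retryPositions, then continue past lastPosition), not the tool's actual Go code:

```python
import json

# progress.json fields relevant to a rerun (values from the example above).
progress = json.loads("""
{
    "endOfFileReached": false,
    "lastPosition": 3,
    "retryPositions": [3]
}
""")

def should_import(line_position: int) -> bool:
    # Retry previously failed lines, then continue from the stop position.
    return (line_position in progress["retryPositions"]
            or line_position > progress["lastPosition"])

# Data lines sit at positions 2-4 (position 1 is the header line).
print([pos for pos in (2, 3, 4) if should_import(pos)])  # [3, 4]
```

This matches the rerun summary below: position 2 is ignored, while positions 3 and 4 are imported.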

Let's correct the input file so that the 2nd record no longer has a duplicate key problem:

$ cat developers.csv
id,bucket_id,name,surname,age
7,700,"Ned",Flanders,35
9,900,Marge,Simpson,33
8,800,Homer,"Simpson",40

Now let's continue working from the stop point, first retrying the problem lines before it. Use the --progress option for this.

$ tt crud import localhost:3301 ./developers.csv developers --username=guest --header --match=header --onerror=stop --batch-size=1 --progress

   • Running crud import:
    
Crud init complete:	[true]
Target space exist:	[true]
In case of error:	[stop]
PID of this process:	[60611]

WARNING: Process is not sensitive to SIGINT (ctrl+c), use kill -9 60611

[ read/ignored : 3/1 | parsed ok/err : 3/0 | import ok/err : 2/0 ]

	IMPORT SUMMARY

	total read:		 3
	ignored (--progress):	 1
	parsed success:		 3
	parsed error:		 0
	import success:		 2
	import error:		 0
	speed (rec per sec):	 142

	import logs file:	 import.log
	failed recs file:	 error.csv
	imported recs file:	 success.csv
	import progress file:	progress.json

Now error.csv contains no records. You can verify that all the records (see id = 7, 8, 9) were imported via tt connect:

# localhost:3301> crud.select('developers')
---
- metadata: [{'name': 'id', 'type': 'unsigned'}, {'name': 'bucket_id', 'type': 'unsigned'},
    {'name': 'name', 'type': 'string'}, {'type': 'string', 'name': 'surname', 'is_nullable': true},
    {'type': 'number', 'name': 'age', 'is_nullable': false}]
  rows:
  - [1, 477, 'Alexey', 'Adams', 20]
  - [2, 401, 'Sergey', 'Allred', 21]
  - [3, 2804, 'Pavel', 'Adams', 27]
  - [4, 1161, 'Mikhail', 'Liston', 51]
  - [5, 1172, 'Dmitry', 'Jacobi', 16]
  - [6, 1064, 'Alexey', 'Sidorov', 31]
  - [7, 700, 'Ned', 'Flanders', 35]
  - [8, 800, 'Homer', 'Simpson', 40]
  - [9, 900, 'Marge', 'Simpson', 33]
- null
...