-
Notifications
You must be signed in to change notification settings - Fork 28
databricks/tpch-dbgen
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
# @(#)README 2.4.0 Table of Contents =================== 0. What is this document? 1. What is DBGEN? 2. What will DBGEN create? 3. How is DBGEN built? 4. Command Line Options for DBGEN 5. Building Large Data Sets with DBGEN 6. DBGEN limitations and compliant usage 7. Sample DBGEN executions 8. What is QGEN? 9. What will QGEN create? 10. How is QGEN built? 11. Command Line Options for QGEN 12. Query Template Syntax 13. Sample QGEN executions and Query Templates 14. Environment variable 15. Version Numbering in DBGEN and QGEN 16. Validated Platforms 0. What is this document? This is the general README file for DBGEN and QGEN, the data- base population and executable query text generation programs used in the TPC-H benchmark. It covers the proper use of DBGEN and QGEN. For information on porting the utility to your particular platform see Porting.Notes. 1. What is DBGEN? DBGEN is a database population program for use with the TPC-H benchmark. It is written in ANSI 'C' for portability, and has been successfully ported to over a dozen different systems. While the TPC-H specification allow an implementor to use any utility to populate the benchmark database, the resultant population must exactly match the output of DBGEN. The source code has been provided to make the process of building a compliant database population as simple as possible. 2. What will DBGEN create? Without any command line options, DBGEN will generate 8 separate ascii files. Each file will contain pipe-delimited load data for one of the tables defined in the TPC-H database schema. The default tables will contain the load data required for a scale factor 1 database. By default the file will be created in the current directory and be named <table>.tbl. As an example, customer.tbl will contain the load data for the customer table. When invoked with the '-U' flag, DBGEN will create the data sets to be used in the update functions and the SQL syntax required to delete the data sets. The update files will be created in the same directory as the load data files and will be named "u_<table>.set". The delete syntax will be written to "delete.set". For instance, the data set to be used in the third query set to update the lineitem table will be named "u_lineitem.tbl.3", and the SQL to remove those rows will be found in "delete.3". The size of the update files can be controlled with the '-r' flag. 3. How is DBGEN built? Create an appropriate makefile, using makefile.suite as a basis, and type make. Refer to Porting.Notes for more details and for suggested compile time options. 4. Command Line Options for DBGEN DBGEN's output is controlled by a combination of command line options and environment variables. Command line options are assumed to be single letter flags preceded by a minus sign. They may be followed by an optional argument. option argument default action ------ -------- ------- ------ -h Display a usage summary -f none Force. Existing data files will be overwritten. -F none yes Flat file output. -D none Direct database load. ld_XXXX() routines must be defined in load_stub.c -s <scale> 1 Scale of the database population. Scale 1.0 represents ~1 GB of data -T <table> Generate the data for a particular table ONLY. Arguments: p -- part/partuspp, c -- customer, s -- supplier, o -- orders/lineitem, n -- nation, r -- region, l -- code (same as n and r), O -- orders, L -- lineitem, P -- part, S -- partsupp -O d Generate SQL for delete function instead of key ranges -O f Allow over-ride of default output file names -O h Generate headers in flat ascii files. hd_XXX routines must be defined in load_stub.c -O m Flat files generate fixed length records -O r Generate key ranges for the UF2 update function -O v Verify data set without generating it. -r <percentage> 10 Scale each udpate file to the given percentage (expressed in basis points) of the data set -v none Verbose. Progress messages are displayed as data is generated. -n <name> Use database <name> for in-line load -C <children> Use <children> separate processes to generate data -S <n> Generate the <n>th part of a multi-part load or update set -U <updates> Create a specified number of data sets in flat files for the update/delete functions -i <n> Split the inserted rows in an refresh pair between <n> files -d <n> Split the deleted rows in an refresh pair between <n> files 5. DBGEN limitations and compliant usage DBGEN is meant to be a robust population generator for use with the TPC-H benchmark. It is hoped that DBGEN will make it easier to experiment with and become proficient in the execution of TPC decision support benchmarks. As a result, it includes a number of command line options which are not, strictly speaking, necessary to generate a compliant data set for a TPC-D run. In addition, some command line options will accept arguments which result in the generation of NON-COMPLIANT data sets. Options which should be used with care include: -s -- scale factor. TPC-H runs are only compliant when run against SF's of 1, 10, 100, 300, 1000, 3000, 10000, 30000, 100000 -r -- refresh percentage. TPC-H runs are only compliant when run with -r 10, the default. 6. Sample DBGEN executions DBGEN has been built to allow as much flexibility as possible, but is fundementally intended to generate two things: a database population against which the queries in TPC-H can be run, and the updates that are used during the update functions in TPC-H. Here are some sample uses of DBGEN. 1. To generate the database population for the qualification database dbgen -s 1 2. To generate the lineitem table only, for a scale factor 10 database, and over-write any existing flat files: dbgen -s 10 -f -T L 4. To geterate a 100GB data set in 1GB pieces, generate only the part and partsupplier tables, and include some progress reports along the way: dbgen -s 100 -S 1 -C 100 -T p -v (to generate the first 1GB file) dbgen -s 100 -S 2 -C 100 -T p -v (to generate the second 1GB file) (and so on, incrementing the argument to -S each time) 5. To generate the update files needed for a 4 stream run of the throughput test at 100 GB, using an existing set of seed files from an 8 process load: dbgen -s 100 -U 4 -C 8 7. What is QGEN? QGEN is a query generation program for use with the TPC-H benchmark. It is written in ANSI 'C' for portability, and has been successfully ported to over a dozen different systems. While the benchmark specifications allow an implementor to use any utility to create the benchmark query sets, QGEN has been provided to make the process of building a benchmark implementation as simple as possible. 8. What will QGEN create? QGEN is a filter, triggered by :'s. It does line-at-a-time reads of its input (more on that later), scanning for :foo, where foo determines the substitution that occurs. Including: :<int> replace with the appropriate value for parameter <int> :b replace with START_TRAN (from tpcd.h) :c replace with SET_DBASE (from tpcd.h) :n<int> replace with SET_ROWCOUNT(<int>) (from tpcd.h) :o replace with SET_OUTPUT (from tpcd.h) :q replace with query number :s replace with stream number :x replace with GEN_QUERY_PLAN (from tpcd.h) Qgen takes an assortment of command line options, controlling which of these options should be active during the translation from template to EQT, and a list of query "names". It then translates the template found in $DSS_QUERY/<name>.sql and puts the result of stdout. Here is a sample query template: { Sccsid: @(#)1.sql 9.1.1.1 1/25/95 10:51:56 } :n 0 :o select l_returnflag, l_linestatus, sum(l_quantity) as sum_qty, sum(l_extendedprice) as sum_base_price, sum(l_extendedprice * (1 - l_discount)) as sum_disc_price, sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge, avg(l_quantity) as avg_qty, avg(l_extendedprice) as avg_price, avg(l_discount) as avg_disc, count(*) as count_order from lineitem where l_shipdate <= date '1998-12-01' - interval :1 day group by l_returnflag, l_linestatus order by l_returnflag, l_linestatus; And here is what is generated: $ qgen -d 1 {return 0 rows} select l_returnflag, l_linestatus, sum(l_quantity) as sum_qty, sum(l_extendedprice) as sum_base_price, sum(l_extendedprice * (1 - l_discount)) as sum_disc_price, sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge, avg(l_quantity) as avg_qty, avg(l_extendedprice) as avg_price, avg(l_discount) as avg_disc, count(*) as count_order from lineitem where l_shipdate <= date('1998-12-01') - interval (90) day to day group by l_returnflag, l_linestatus order by l_returnflag, l_linestatus; See "Query Template Syntax" below for more detail on converting your prefered query phrasing for use with QGEN. 9. How is QGEN built? QGEN is built by the same makefile that creates DBGEN. If the makefile is successfully creating DBGEN, no further compilation modifications should be necessary. You may need to modify some of the options which allow QGEN to integrate with your preferred query tool. Refer to Porting.Notes for more detail. 10. Command Line Options for QGEN Like DBGEN, QGEN is controlled by a combination of command line options and environment variables (See "Environment Variables", below for more detail). Command line options are assumed to be single letter flags preceded by a minus sign. They may be followed by an optional argument. option argument default action ------ -------- ------- ------ -c none Retain comments in translation of template to EQT -d none Default. Use the parameter substitutions required for query validation -h Display a usage summary -i <file> Use contents of <file> to init a query stream -l <file> Save query parameters to <file> -n <name> Use database <name> for queries -N Always use default rowcount, and ignore :n directives -o <path> Save query n's output in <path>/n.<stream> Uses -p option, and uses :o tag -p <stream> Use the query permutation defined for stream <stream>. If this option is omited, EQT will be generated for the queries named on the command line. -r <n> Seed the rnadom number generator with <n> -s <n> Set scale to <n> for parameter substitutions. -t <file> Use contents of <file> to complete a query stream -T none Use time table format for date substitution -v none Verbose. Progress messages are displayed as data is generated. -x none Generate a query plan as part of query execution. 11. Query Template Syntax QGEN is a simple ASCII text filter, meant to translate query generalized query syntax("query template") into the executable query text(EQT) re- quired by the benchmarks. It provides a number of shorthands and syntactic extensions that allow the automatic generation of query parameters and some control over the operation of the benchmark implementation. QGEN first strips all comments from the query template, recognizing both {comment} and --comment styles. Next it traverses the query template one line at a time, locating required substitution points, called parameter tags. The values substituted for a given tag are summarized below. QGEN does not support nested substitutions. That is, if the text substituted for tag itself contains a valid tag the second tag will not be expanded. Tag Converted To Based on === ============ ======== :c database <dbname>;(1) -n from the command line :x set explain on;(1) -x from the command line :<number> paremeter <number> :s stream number :o output to outpath/qnum.stream;(1) -o from command line, -s from command line :b BEGIN WORK;(1) -a from comand line :e COMMIT WORK(1) -a from command line :q query number :n <number> sets rowcount to be returned to <number>, unless -N appears on the command line Notes: (1) This is Informix-specific syntax. Refer to Porting.Notes for tailoring the generated text to your database environment. 12. Sample QGEN executions and Query Templates QGEN translates generic query templates into valid SQL. In addition, it allows conditional inclusion of the commands necessary to connect to a database, produce diagnostic output, etc. Here are some sample of QGEN usage, and the way that command line parameters and the query templates interact to produce valid SQL. Template, in $DSS_QUERY/1.sql: :c :o select count(*) from foo; :x select count(*) from lineitem where l_orderdate < ':1'; 1. "qgen 1", would produce: select count(*) from foo; select count(*) from lineitem where l_orderdate < '1997-01-01'; Assuming that 1 January 1997 was a valid substitution for parameter 1. 2. "qgen -d -c dss1 1, would produce: database dss1; select count(*) from foo; select count(*) from lineitem where l_orderdate < '1995-07-18'; Assuming that 18 July 1995 was the default substitution for parameter 1, and using Informix syntax. 3. "qgen -d -c dss1 -x -o somepath 1, would produce: database dss1; output to "somepath/1.0" select count(*) from foo; set explain on; select count(*) from lineitem where l_orderdate < '1995-07-18'; Assuming that 18 July 1995 was the default substitution for parameter 1, and using Informix syntax. 13. Environment Variables Enviroment variables are used to control features of DBGEN and QGEN which are unlikely to change from one execution to another. Variable Default Action ------- ------- ------ DSS_PATH . Directory in which to build flat files DSS_CONFIG . Directory in which to find configuration files DSS_DIST dists.dss Name of distribution definition file DSS_QUERY . Directory in which to find query templates 14. Version Numbering in DBGEN and QGEN DBGEN and QGEN use a common version numbering algorithm. Each executable is stamped with a version number which is displayed in the usage messages available with the '-h' option. A version number is of the form: V.R.P.M | | | | | | | | | | | | | | | -- modification: alphabetic, incremented for any trivial changes | | | to the source (e.g, porting ifdef's) | | ---- patch level: numeric, incremented for any minor bug fix | | (e.g, qgen parameter range) | ------- release: numeric, incremented for each minor revision of the | specification |-------- version: numeric, incremented for each major revision of the specification An implementation of TPC-H is valid only if it conforms to the following version usage rules: -- The Version of DBGEN and QGEN must match the integer portion of the current specification revision 15. The current revisions are: DBGEN: 2.4.0 QGEN: 2.4.0 16. Validated Platforms The following platforms have been validated to produce the reference data set for TPC-H 2.4.0 Processor Operating System (version) Compiler (version) Compiler Flags ---------------------------------------------------------------------------- POWER5 AIX 64-bit (5.3) C for AIX Compiler, v7 -q64 (no -g) IA-64 HPUX 64-bit () icc Linux 32-bit () gcc
About
Patched version of dbgen
Resources
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published