Skip to content

Latest commit

 

History

History
1015 lines (878 loc) · 31.4 KB

extended-search-in-cassandra.md

File metadata and controls

1015 lines (878 loc) · 31.4 KB

Extended Search in Cassandra

Cassandra index functionality has been extended to provide near real time search such as ElasticSearch or Solr, including full text search capabilities and free multivariable search.

It is also fully compatible with Apache Spark and Apache Hadoop, allowing you to filter data at database level. This speeds up jobs reducing the amount of data to be collected and processed.

Indexing is achieved through a Lucene based implementation of Cassandra secondary indexes, where each node of the cluster indexes its own data. Stratio Cassandra is one of the core modules on which Stratio's BigData platform (SDS) is based.

Table of Contents

Overview

Lucene search technology integration into Cassandra provides:

  • Big data full text search
  • Relevance scoring and sorting
  • General top-k queries
  • Complex boolean queries (and, or, not)
  • Near real-time search
  • Custom analyzers
  • CQL3 support
  • Wide rows support
  • Partition and cluster composite keys support
  • Support for indexing columns part of primary key
  • Third-party drivers compatibility
  • Spark compatibility
  • Hadoop compatibility

Not yet supported:

  • Thrift API
  • Legacy compact storage option
  • Indexing counter columns
  • Columns with TTL
  • CQL user defined types
  • Static columns

Index Creation

###Syntax

CREATE CUSTOM INDEX (IF NOT EXISTS)? <index_name>
                                  ON <table_name> ( <magic_column> )
                               USING 'com.stratio.cassandra.index.RowIndex'
                        WITH OPTIONS = <options>

where:

  • <magic_column> is the name of a text column that does not contain any data and will be used to show the scoring for each resulting row of a query,
  • <options> is a JSON object:
<options> := { ('refresh_seconds'      : '<int_value>',)?
               ('ram_buffer_mb'        : '<int_value>',)?
               ('max_merge_mb'         : '<int_value>',)?
               ('max_cached_mb'        : '<int_value>',)?
               ('indexing_threads'     : '<int_value>',)?
               ('indexing_queues_size' : '<int_value>',)?
               'schema'                : '<schema_definition>'};

Options, except “schema”, take a positive integer value enclosed in single quotes:

  • refresh_seconds: number of seconds before refreshing the index (between writers and readers). Defaults to ’60’.
  • ram_buffer_mb: size of the write buffer. Its content will be committed to disk when full. Defaults to ’64’.
  • max_merge_mb: defaults to ’5’.
  • max_cached_mb: defaults to ’30’.
  • indexing_threads: number of asynchronous indexing threads. ’0’ means synchronous indexing. Defaults to ’0’.
  • indexing_queues_size: max number of queued documents per asynchronous indexing thread. Defaults to ’50’.
  • schema: see below
<schema_definition> := {
    (analyzers : { <analyzer_definition> (, <analyzer_definition>)* } ,)?
    (default_analyzer : "<analyzer_name>",)?
    fields : { <field_definition> (, <field_definition>)* }
}

Where default_analyzer defaults to ‘org.apache.lucene.analysis.standard.StandardAnalyzer’.

<analyzer_definition> := <analyzer_name> : {
    type : "<analyzer_type>" (, <option> : "<value>")*
}

Analyzer definition options depend on the analyzer type. Details and default values are listed in the table below.

Analyzer type Option Value type Default value
classpath class string null
snowball language string null
stopwords string null
<field_definition> := <column_name> : {
    type : "<field_type>" (, <option> : "<value>")*
}

Field definition options depend on the field type. Details and default values are listed in the table below.

Field type Option Value type Default value
bigdec integer_digits positive integer 32
decimal_digits positive integer 32
bigint digits positive integer 32
date pattern date format (string) yyyy/MM/dd HH:mm:ss.SSS
double, float, integer, long boost float 0.1f
text analyzer class name (string) default_analyzer of the schema

Note that Cassandra allows one custom index per table. On the other hand, Cassandra does not allow a modify operation on indexes. To modify an index it needs to be deleted first and created again.

###Example

This code below and the one for creating the corresponding keyspace and table is available in a CQL script that can be sourced from the Cassandra shell: test-users-create.cql.

CREATE CUSTOM INDEX IF NOT EXISTS users_index
ON test.users (stratio_col)
USING 'com.stratio.cassandra.index.RowIndex'
WITH OPTIONS = {
    'refresh_seconds'      : '1',
    'ram_buffer_mb'        : '64',
    'max_merge_mb'         : '5',
    'max_cached_mb'        : '30',
    'indexing_threads'     : '4',
    'indexing_queues_size' : '50',
    'schema' : '{
        analyzers : {
              my_custom_analyzer : {
                  type:"snowball",
                  language:"Spanish",
                  stopwords : "el,la,lo,loas,las,a,ante,bajo,cabe,con,contra"}
        },
        default_analyzer : "english",
        fields : {
            name   : {type     : "string"},
            gender : {type     : "string"},
            animal : {type     : "string"},
            age    : {type     : "integer"},
            food   : {type     : "string"},
            number : {type     : "integer"},
            bool   : {type     : "boolean"},
            date   : {type     : "date",
                      pattern  : "yyyy/MM/dd"},
            mapz   : {type     : "string"},
            setz   : {type     : "string"},
            listz  : {type     : "string"},
            phrase : {type     : "text",
                      analyzer : "my_custom_analyzer"}
        }
    }'
};

Queries

###Syntax:

SELECT ( <fields> | * )
FROM <table_name>
WHERE <magic_column> = '{ (   query  : <query>  )?
                          ( , filter : <filter> )?
                          ( , sort   : <sort>   )?
                        }';

where <query> and <filter> are a JSON object:

<query> := { type : <type> (, <option> : ( <value> | <value_list> ) )+ }

and <sort> is another JSON object:

    <sort> := { fields : <sort_field> (, <sort_field> )* }
    <sort_field> := { field : <field> (, reverse : <reverse> )? }

When searching by <query>, results are returned sorted by descending relevance without pagination. The results will be located in the column ‘stratio_relevance’.

Filter types and options are the same as the query ones. The difference with queries is that filters have no effect on scoring.

Sort option is used to specify the order in which the indexed rows will be traversed. When sorting is used, the query scoring is delayed.

If no query or sorting options are specified then the results are returned in the Cassandra’s natural order, which is defined by the partitioner and the column name comparator.

Types of query and their options are summarized in the table below. Details for each of them are available in individual sections and the examples can be downloaded as a CQL script: extended-search-examples.cql.

In addition to the options described in the table, all query types have a “boost” option that acts as a weight on the resulting score.

Query type Supported Field type Options
Boolean subqueries
  • must: a list of conditions.
  • should: a list of conditions.
  • not: a list of conditions.
Contains All
  • field: the field name.
  • values: the matched field values.
Fuzzy bytes
inet
string
text
  • field: the field name.
  • value: the field value.
  • max_edits (default = 2): a integer value between 0 and 2 (the Levenshtein automaton maximum supported distance).
  • prefix_length (default = 0): integer representing the length of common non-fuzzy prefix.
  • max_expansions (default = 50): an integer for the maximum number of terms to match.
  • transpositions (default = true): if transpositions should be treated as a primitive edit operation (Damerau-Levenshtein distance). When false, comparisons will implement the classic Levenshtein distance.
Match All
  • field: the field name.
  • value: the field value.
Phrase bytes
inet
text
  • field: the field name.
  • values: list of values.
  • slop (default = 0): number of other words permitted between words.
Prefix bytes
inet
string
text
  • field: fieldname.
  • value: fieldvalue.
Range All
  • field: field name.
  • lower (default = $-\infty$ for number): lower bound of the range.
  • include_lower (default = false): if the left value is included in the results (>=)
  • upper (default = $+\infty$ for number): upper bound of the range.
  • include_upper (default = false): if the right value is included in the results (<=).
Regexp bytes
inet
string
text
  • field: fieldname.
  • value: regular expression.
Wildcard bytes
inet
string
text
  • field: field name.
  • value: wildcard expression.

###Boolean query

Syntax:

SELECT ( <fields> | * )
FROM <table>
WHERE <magic_column> = '{ query : {
                           type     : "boolean",
                           ( must   : [(query,)?] , )?
                           ( should : [(query,)?] , )?
                           ( not    : [(query,)?] , )? } }';

where:

  • must: represents the conjunction of queries: query1 AND query2 AND … AND queryn
  • should: represents the disjunction of queries: query1 OR query12 OR … OR queryn
  • not: represents the negation of the disjunction of queries: NOT(query1 OR query2 OR … OR queryn)

Since "not" will be applied to the results of a "must" or "should" condition, it can not be used in isolation.

Example 1: will return rows where name ends with “a” AND food starts with “tu”

SELECT * FROM test.users
WHERE stratio_col = '{query : {
                        type : "boolean",
                        must : [{type : "wildcard", field : "name", value : "*a"},
                                {type : "wildcard", field : "food", value : "tu*"}]}}';

Example 2: will return rows where food starts with “tu” but name does not end with “a”

SELECT * FROM test.users
WHERE stratio_col = '{query : {
                        type : "boolean",
                        not  : [{type : "wildcard", field : "name", value : "*a"}],
                        must : [{type : "wildcard", field : "food", value : "tu*"}]}}';

Example 3: will return rows where name ends with “a” or food starts with “tu”

SELECT * FROM test.users
WHERE stratio_col = '{query : {
                        type   : "boolean",
                        should : [{type : "wildcard", field : "name", value : "*a"},
                                  {type : "wildcard", field : "food", value : "tu*"}]}}';

###Contains query

Syntax:

SELECT ( <fields> | * )
FROM <table>
WHERE <magic_column> = '{ query : {
                            type  : "contains",
                            field : <fieldname> ,
                            values : <value_list> }}';

Example 1: will return rows where name matches “Alicia” or “mancha”

SELECT * FROM test.users
WHERE stratio_col = '{query : {
                        type   : "contains",
                        field  : "name",
                        values : ["Alicia","mancha"] }}';

Example 2: will return rows where date matches “2014/01/01″, “2014/01/02″ or “2014/01/03″

SELECT * FROM test.users
WHERE stratio_col = '{query : {
                        type   : "contains",
                        field  : "date",
                        values : ["2014/01/01", "2014/01/02", "2014/01/03"] }}';

###Fuzzy query

Syntax:

SELECT ( <fields> | * )
FROM <table>
WHERE <magic_column> = '{ query : {
                            type  : "fuzzy",
                            field : <fieldname> ,
                            value : <value>
                            (, max_edits     : <max_edits> )?
                            (, prefix_length : <prefix_length> )?
                            (, max_expansions: <max_expansion> )?
                            (, transpositions: <transposition> )?
                          }}';

where:

  • max_edits (default = 2): a integer value between 0 and 2. Will return rows which distance from <value> to <field> content has a distance of at most <max_edits>. Distance will be interpreted according to the value of “transpositions”.
  • prefix_length (default = 0): an integer value being the length of the common non-fuzzy prefix
  • max_expansions (default = 50): an integer for the maximum number of terms to match
  • transpositions (default = true): if transpositions should be treated as a primitive edit operation (Damerau-Levenshtein distance). When false, comparisons will implement the classic Levenshtein distance.

Example 1: will return any rows where “phrase” contains a word that differs in one edit operation from “puma”, such as “pumas”.

SELECT * FROM test.users
WHERE stratio_col = '{query : { type      : "fuzzy",
                                field     : "phrase",
                                value     : "puma",
                                max_edits : 1 }}';

Example 2: same as example 1 but will limit the results to rows where phrase contains a word that starts with “pu”.

SELECT * FROM test.users
WHERE stratio_col = '{query : { type          : "fuzzy",
                                field         : "phrase",
                                value         : "puma",
                                max_edits     : 1,
                                prefix_length : 2 }}';

###Match query

Syntax:

SELECT ( <fields> | * )
FROM <table>
WHERE <magic_column> = '{ query : {
                            type  : "match",
                            field : <fieldname> ,
                            value : <value> }}';

Example 1: will return rows where name matches “Alicia”

SELECT * FROM test.users
WHERE stratio_col = '{query : {
                        type  : "match",
                        field : "name",
                        value : "Alicia" }}';

Example 2: will return rows where phrase contains “mancha”

SELECT * FROM test.users
WHERE stratio_col = '{query : {
                        type  : "match",
                        field : "phrase",
                        value : "mancha" }}';

Example 3: will return rows where date matches “2014/01/01″

SELECT * FROM test.users
WHERE stratio_col = '{query : {
                        type  : "match",
                        field : "date",
                        value : "2014/01/01" }}';

###Phrase query

Syntax:

SELECT ( <fields> | * )
FROM <table>
WHERE <magic_column> = '{ query : {
                            type  :"phrase",
                            field : <fieldname> ,
                            values : <value_list>
                            (, slop : <slop> )?
                        }}';

where:

  • values: an ordered list of values.
  • slop (default = 0): number of words permitted between words.

Example 1: will return rows where “phrase” contains the word “camisa” followed by the word “manchada”.

SELECT * FROM test.users
WHERE stratio_col = '{query : {
                        type   : "phrase",
                        field  : "phrase",
                        values : ["camisa", "manchada"] }}';

Example 2: will return rows where “phrase” contains the word “mancha” followed by the word “camisa” having 0 to 2 words in between.

SELECT * FROM test.users
WHERE stratio_col = '{query : {
                        type   : "phrase",
                        field  : "phrase",
                        values : ["mancha", "camisa"],
                        slop   : 2 }}';

###Prefix query

Syntax:

SELECT ( <fields> | * )
FROM <table>
WHERE <magic_column> = '{ query : {
                            type  : "prefix",
                            field : <fieldname> ,
                            value : <value> }}';

Example: will return rows where “phrase” contains a word starting with “lu”. If the column is indexed as “text” and uses an analyzer, words ignored by the analyzer will not be retrieved.

SELECT * FROM test.users
WHERE stratio_col = '{query : {
                        type          : "prefix",
                        field         : "phrase",
                        value         : "lu" }}';

###Range query

Syntax:

SELECT * FROM test.users
WHERE stratio_col = '{query : {
                        type    : "range",
                        field   : <fieldname>
                        (, lower : <min> , include_lower : <min_included> )?
                        (, upper : <max> , include_upper : <max_included> )?
                     }}';

where:

  • lower: lower bound of the range.
  • include_lower (default = false): if the lower bound is included (left-closed range).
  • upper: upper bound of the range.
  • include_upper (default = false): if the upper bound is included (right-closed range).

Lower and upper will default to $-/+\infty$ for number. In the case of byte and string like data (bytes, inet, string, text), all values from lower up to upper will be returned if both are specified. If only “lower” is specified, all rows with values from “lower” will be returned. If only “upper” is specified then all rows with field values up to “upper” will be returned. If both are omitted than all rows will be returned.

Example 1: will return rows where age is in [1, ∞)

SELECT * FROM test.users
WHERE stratio_col = '{query : {
                        type          : "range",
                        field         : "age",
                        lower         : 1,
                        include_lower : true }}';

Example 2: will return rows where age is in (-∞, 0]

SELECT * FROM test.users
WHERE stratio_col = '{query : {
                        type          : "range",
                        field         : "age",
                        upper         : 0,
                        include_upper : true }}';

Example 3: will return rows where age is in [-1, 1]

SELECT * FROM test.users
WHERE stratio_col = '{query : {
                        type          : "range",
                        field         : "age",
                        lower         : -1,
                        upper         : 1,
                        include_lower : true,
                        include_upper : true }}';

Example 4: will return rows where date is in [2014/01/01, 2014/01/02]

SELECT * FROM test.users
WHERE stratio_col = '{query : {
                        type          : "range",
                        field         : "date",
                        lower         : "2014/01/01",
                        upper         : "2014/01/02",
                        include_lower : true,
                        include_upper : true }}';

###Regexp query

Syntax:

SELECT * FROM test.users
WHERE stratio_col = '{query : {
                        type  : "regexp",
                        field : <fieldname>,
                        value : <regexp>
                     }}';

where:

Example: will return rows where name contains a word that starts with “p” and a vowel repeated twice (e.g. “pape”).

SELECT * FROM test.users
WHERE stratio_col = '{query : {
                        type  : "regexp",
                        field : "name",
                        value : "[J][aeiou]{2}.*" }}';

###Wildcard query

Syntax:

SELECT * FROM test.users
WHERE stratio_col = '{query : {
                        type    : "wildcard" ,
                        field   : <fieldname> ,
                        value   : <wildcard_exp>
                     }}';

where:

  • value: a wildcard expression. Supported wildcards are *, which matches any character sequence (including the empty one), and ?, which matches any single character. ” is the escape character.

Example: will return rows where food starts with or is “tu”.

SELECT * FROM test.users
WHERE stratio_col = '{query : {
                        type  : "wildcard",
                        field : "food",
                        value : "tu*" }}';

Spark and Hadoop Integration

Spark and Hadoop integrations are fully supported because Lucene queries can be combined with token range queries and pagination, which are the basis of MapReduce frameworks support.

###Token Range Queries

The token function allows computing the token for a given partition key. The primary key of the example table “users” is ((name, gender), animal, age) where (name, gender) is the partition key. When combining the token function and a Lucene-based filter in a where clause, the filter on tokens is applied first and then the condition of the filter clause.

Example: will retrieve rows which tokens are greater than (‘Alicia’, ‘female’) and then test them against the match condition.

SELECT name,gender
  FROM test.users
 WHERE stratio_col='{filter : {type : "match", field : "food", value : "chips"}}'
   AND token(name, gender) > token('Alicia', 'female');

###Pagination

Pagination over filtered results is fully supported. You can retrieve the rows starting from a certain key. For example, if the primary key is (userid, createdAt), you can query:

SELECT *
  FROM tweets
  WHERE stratio_col = ‘{ filter : {type:”match",  field:”text", value:”cassandra”} }’
    AND userid = 3543534
    AND createdAt > 2011-02-03 04:05+0000
  LIMIT 5000;

Datatypes Mapping

###CQL to Field type

CQL type Description Field type Query types
ascii US-ASCII character string string/text All
bigint 64-bit signed long long boolean
contains
match
range
blob Arbitrary bytes (no validation), expressed as hexadecimal bytes All
boolean true or false boolean All
counter Distributed counter value (64-bit long) not supported
decimal Variable-precision decimal bigdec All
double 64-bit IEEE-754 floating point double boolean
contains
match
range
float 32-bit IEEE-754 floating point float boolean
contains
match
range
inet IP address string in IPv4 or IPv6 format inet All
int 32-bit signed integer integer boolean
contains
match
range
list<T> A collection of one or more ordered elements Type of list elements see element type
map<K,V> A JSON-style array of literals: { literal : literal, literal : literal … } Type of values see element type
set<T> A collection of one or more elements Type of set elements see element type
text UTF-8 encoded string string/text All
timestamp Date plus time, encoded as 8 bytes since epoch date boolean
contains
match
range
uuid Type 1 or type 4 UUID uuid All
timeuuid Type 1 UUID only (CQL3) uuid All
varchar UTF-8 encoded string string/text All
varint Arbitrary-precision integer bigint All

###Field type to CQL

field type CQL type Supported in Query types
bigdec decimal All
bigint varint All
boolean boolean All
bytes blob All
date timestamp boolean
match
range
double double boolean
match
range
float float boolean
match
range
inet inet All
integer int boolean
match
range
long bigint boolean
match
range
string/text ascii
text
varchar
All
uuid uuid
timeuuid
All