Extended Search in Cassandra

Cassandra index functionality has been extended to provide near real time search such as ElasticSearch or Solr, including full text search capabilities and free multivariable search.

It is also fully compatible with Apache Spark and Apache Hadoop, allowing you to filter data at database level. This speeds up jobs reducing the amount of data to be collected and processed.

Indexing is achieved through a Lucene based implementation of Cassandra secondary indexes, where each node of the cluster indexes its own data. Stratio Cassandra is one of the core modules on which Stratio's BigData platform (SDS) is based.

Overview

Lucene search technology integration into Cassandra provides:

Big data full text search
Relevance scoring and sorting
General top-k queries
Complex boolean queries (and, or, not)
Near real-time search
Custom analyzers
CQL3 support
Wide rows support
Partition and cluster composite keys support
Support for indexing columns part of primary key
Third-party drivers compatibility
Spark compatibility
Hadoop compatibility

Not yet supported:

Thrift API
Legacy compact storage option
Indexing counter columns
Columns with TTL
CQL user defined types
Static columns

Index Creation

###Syntax

CREATE CUSTOM INDEX (IF NOT EXISTS)? <index_name>
                                  ON <table_name> ( <magic_column> )
                               USING 'com.stratio.cassandra.index.RowIndex'
                        WITH OPTIONS = <options>

where:

<magic_column> is the name of a text column that does not contain any data and will be used to show the scoring for each resulting row of a query,
<options> is a JSON object:

<options> := { ('refresh_seconds'      : '<int_value>',)?
               ('ram_buffer_mb'        : '<int_value>',)?
               ('max_merge_mb'         : '<int_value>',)?
               ('max_cached_mb'        : '<int_value>',)?
               ('indexing_threads'     : '<int_value>',)?
               ('indexing_queues_size' : '<int_value>',)?
               'schema'                : '<schema_definition>'};

Options, except “schema”, take a positive integer value enclosed in single quotes:

refresh_seconds: number of seconds before refreshing the index (between writers and readers). Defaults to ’60’.
ram_buffer_mb: size of the write buffer. Its content will be committed to disk when full. Defaults to ’64’.
max_merge_mb: defaults to ’5’.
max_cached_mb: defaults to ’30’.
indexing_threads: number of asynchronous indexing threads. ’0’ means synchronous indexing. Defaults to ’0’.
indexing_queues_size: max number of queued documents per asynchronous indexing thread. Defaults to ’50’.
schema: see below

<schema_definition> := {
    (analyzers : { <analyzer_definition> (, <analyzer_definition>)* } ,)?
    (default_analyzer : "<analyzer_name>",)?
    fields : { <field_definition> (, <field_definition>)* }
}

Where default_analyzer defaults to ‘org.apache.lucene.analysis.standard.StandardAnalyzer’.

<analyzer_definition> := <analyzer_name> : {
    type : "<analyzer_type>" (, <option> : "<value>")*
}

Analyzer definition options depend on the analyzer type. Details and default values are listed in the table below.

Analyzer type	Option	Value type	Default value
classpath	class	string	null
snowball	language	string	null
	stopwords	string	null

<field_definition> := <column_name> : {
    type : "<field_type>" (, <option> : "<value>")*
}

Field definition options depend on the field type. Details and default values are listed in the table below.

Field type	Option	Value type	Default value
bigdec	integer_digits	positive integer	32
	decimal_digits	positive integer	32
bigint	digits	positive integer	32
date	pattern	date format (string)	yyyy/MM/dd HH:mm:ss.SSS
double, float, integer, long	boost	float	0.1f
text	analyzer	class name (string)	default_analyzer of the schema

Note that Cassandra allows one custom index per table. On the other hand, Cassandra does not allow a modify operation on indexes. To modify an index it needs to be deleted first and created again.

###Example

This code below and the one for creating the corresponding keyspace and table is available in a CQL script that can be sourced from the Cassandra shell: test-users-create.cql.

CREATE CUSTOM INDEX IF NOT EXISTS users_index
ON test.users (stratio_col)
USING 'com.stratio.cassandra.index.RowIndex'
WITH OPTIONS = {
    'refresh_seconds'      : '1',
    'ram_buffer_mb'        : '64',
    'max_merge_mb'         : '5',
    'max_cached_mb'        : '30',
    'indexing_threads'     : '4',
    'indexing_queues_size' : '50',
    'schema' : '{
        analyzers : {
              my_custom_analyzer : {
                  type:"snowball",
                  language:"Spanish",
                  stopwords : "el,la,lo,loas,las,a,ante,bajo,cabe,con,contra"}
        },
        default_analyzer : "english",
        fields : {
            name   : {type     : "string"},
            gender : {type     : "string"},
            animal : {type     : "string"},
            age    : {type     : "integer"},
            food   : {type     : "string"},
            number : {type     : "integer"},
            bool   : {type     : "boolean"},
            date   : {type     : "date",
                      pattern  : "yyyy/MM/dd"},
            mapz   : {type     : "string"},
            setz   : {type     : "string"},
            listz  : {type     : "string"},
            phrase : {type     : "text",
                      analyzer : "my_custom_analyzer"}
        }
    }'
};

Queries

###Syntax:

SELECT ( <fields> | * )
FROM <table_name>
WHERE <magic_column> = '{ (   query  : <query>  )?
                          ( , filter : <filter> )?
                          ( , sort   : <sort>   )?
                        }';

where <query> and <filter> are a JSON object:

<query> := { type : <type> (, <option> : ( <value> | <value_list> ) )+ }

and <sort> is another JSON object:

    <sort> := { fields : <sort_field> (, <sort_field> )* }
    <sort_field> := { field : <field> (, reverse : <reverse> )? }

When searching by <query>, results are returned sorted by descending relevance without pagination. The results will be located in the column ‘stratio_relevance’.

Filter types and options are the same as the query ones. The difference with queries is that filters have no effect on scoring.

Sort option is used to specify the order in which the indexed rows will be traversed. When sorting is used, the query scoring is delayed.

If no query or sorting options are specified then the results are returned in the Cassandra’s natural order, which is defined by the partitioner and the column name comparator.

Types of query and their options are summarized in the table below. Details for each of them are available in individual sections and the examples can be downloaded as a CQL script: extended-search-examples.cql.

In addition to the options described in the table, all query types have a “boost” option that acts as a weight on the resulting score.

Query type	Supported Field type	Options
Boolean	subqueries	must: a list of conditions. should: a list of conditions. not: a list of conditions.
Contains	All	field: the field name. values: the matched field values.
Fuzzy	bytes inet string text	field: the field name. value: the field value. max_edits (default = 2): a integer value between 0 and 2 (the Levenshtein automaton maximum supported distance). prefix_length (default = 0): integer representing the length of common non-fuzzy prefix. max_expansions (default = 50): an integer for the maximum number of terms to match. transpositions (default = true): if transpositions should be treated as a primitive edit operation (Damerau-Levenshtein distance). When false, comparisons will implement the classic Levenshtein distance.
Match	All	field: the field name. value: the field value.
Phrase	bytes inet text	field: the field name. values: list of values. slop (default = 0): number of other words permitted between words.
Prefix	bytes inet string text	field: fieldname. value: fieldvalue.
Range	All	field: field name. lower (default = $-\infty$ for number): lower bound of the range. include_lower (default = false): if the left value is included in the results (>=) upper (default = $+\infty$ for number): upper bound of the range. include_upper (default = false): if the right value is included in the results (<=).
Regexp	bytes inet string text	field: fieldname. value: regular expression.
Wildcard	bytes inet string text	field: field name. value: wildcard expression.

###Boolean query

Syntax:

SELECT ( <fields> | * )
FROM <table>
WHERE <magic_column> = '{ query : {
                           type     : "boolean",
                           ( must   : [(query,)?] , )?
                           ( should : [(query,)?] , )?
                           ( not    : [(query,)?] , )? } }';

where:

must: represents the conjunction of queries: query₁ AND query₂ AND … AND query_n
should: represents the disjunction of queries: query₁ OR query₁₂ OR … OR query_n
not: represents the negation of the disjunction of queries: NOT(query₁ OR query₂ OR … OR query_n)

Since "not" will be applied to the results of a "must" or "should" condition, it can not be used in isolation.

Example 1: will return rows where name ends with “a” AND food starts with “tu”

SELECT * FROM test.users
WHERE stratio_col = '{query : {
                        type : "boolean",
                        must : [{type : "wildcard", field : "name", value : "*a"},
                                {type : "wildcard", field : "food", value : "tu*"}]}}';

Example 2: will return rows where food starts with “tu” but name does not end with “a”

SELECT * FROM test.users
WHERE stratio_col = '{query : {
                        type : "boolean",
                        not  : [{type : "wildcard", field : "name", value : "*a"}],
                        must : [{type : "wildcard", field : "food", value : "tu*"}]}}';

Example 3: will return rows where name ends with “a” or food starts with “tu”

SELECT * FROM test.users
WHERE stratio_col = '{query : {
                        type   : "boolean",
                        should : [{type : "wildcard", field : "name", value : "*a"},
                                  {type : "wildcard", field : "food", value : "tu*"}]}}';

###Contains query

Syntax:

SELECT ( <fields> | * )
FROM <table>
WHERE <magic_column> = '{ query : {
                            type  : "contains",
                            field : <fieldname> ,
                            values : <value_list> }}';

Example 1: will return rows where name matches “Alicia” or “mancha”

SELECT * FROM test.users
WHERE stratio_col = '{query : {
                        type   : "contains",
                        field  : "name",
                        values : ["Alicia","mancha"] }}';

Example 2: will return rows where date matches “2014/01/01″, “2014/01/02″ or “2014/01/03″

SELECT * FROM test.users
WHERE stratio_col = '{query : {
                        type   : "contains",
                        field  : "date",
                        values : ["2014/01/01", "2014/01/02", "2014/01/03"] }}';

###Fuzzy query

Syntax:

SELECT ( <fields> | * )
FROM <table>
WHERE <magic_column> = '{ query : {
                            type  : "fuzzy",
                            field : <fieldname> ,
                            value : <value>
                            (, max_edits     : <max_edits> )?
                            (, prefix_length : <prefix_length> )?
                            (, max_expansions: <max_expansion> )?
                            (, transpositions: <transposition> )?
                          }}';

where:

max_edits (default = 2): a integer value between 0 and 2. Will return rows which distance from <value> to <field> content has a distance of at most <max_edits>. Distance will be interpreted according to the value of “transpositions”.
prefix_length (default = 0): an integer value being the length of the common non-fuzzy prefix
max_expansions (default = 50): an integer for the maximum number of terms to match
transpositions (default = true): if transpositions should be treated as a primitive edit operation (Damerau-Levenshtein distance). When false, comparisons will implement the classic Levenshtein distance.

Example 1: will return any rows where “phrase” contains a word that differs in one edit operation from “puma”, such as “pumas”.

SELECT * FROM test.users
WHERE stratio_col = '{query : { type      : "fuzzy",
                                field     : "phrase",
                                value     : "puma",
                                max_edits : 1 }}';

Example 2: same as example 1 but will limit the results to rows where phrase contains a word that starts with “pu”.

SELECT * FROM test.users
WHERE stratio_col = '{query : { type          : "fuzzy",
                                field         : "phrase",
                                value         : "puma",
                                max_edits     : 1,
                                prefix_length : 2 }}';

###Match query

Syntax:

SELECT ( <fields> | * )
FROM <table>
WHERE <magic_column> = '{ query : {
                            type  : "match",
                            field : <fieldname> ,
                            value : <value> }}';

Example 1: will return rows where name matches “Alicia”

SELECT * FROM test.users
WHERE stratio_col = '{query : {
                        type  : "match",
                        field : "name",
                        value : "Alicia" }}';

Example 2: will return rows where phrase contains “mancha”

SELECT * FROM test.users
WHERE stratio_col = '{query : {
                        type  : "match",
                        field : "phrase",
                        value : "mancha" }}';

Example 3: will return rows where date matches “2014/01/01″

SELECT * FROM test.users
WHERE stratio_col = '{query : {
                        type  : "match",
                        field : "date",
                        value : "2014/01/01" }}';

###Phrase query

Syntax:

SELECT ( <fields> | * )
FROM <table>
WHERE <magic_column> = '{ query : {
                            type  :"phrase",
                            field : <fieldname> ,
                            values : <value_list>
                            (, slop : <slop> )?
                        }}';

where:

values: an ordered list of values.
slop (default = 0): number of words permitted between words.

Example 1: will return rows where “phrase” contains the word “camisa” followed by the word “manchada”.

SELECT * FROM test.users
WHERE stratio_col = '{query : {
                        type   : "phrase",
                        field  : "phrase",
                        values : ["camisa", "manchada"] }}';

Example 2: will return rows where “phrase” contains the word “mancha” followed by the word “camisa” having 0 to 2 words in between.

SELECT * FROM test.users
WHERE stratio_col = '{query : {
                        type   : "phrase",
                        field  : "phrase",
                        values : ["mancha", "camisa"],
                        slop   : 2 }}';

###Prefix query

Syntax:

SELECT ( <fields> | * )
FROM <table>
WHERE <magic_column> = '{ query : {
                            type  : "prefix",
                            field : <fieldname> ,
                            value : <value> }}';

Example: will return rows where “phrase” contains a word starting with “lu”. If the column is indexed as “text” and uses an analyzer, words ignored by the analyzer will not be retrieved.

SELECT * FROM test.users
WHERE stratio_col = '{query : {
                        type          : "prefix",
                        field         : "phrase",
                        value         : "lu" }}';

###Range query

Syntax:

SELECT * FROM test.users
WHERE stratio_col = '{query : {
                        type    : "range",
                        field   : <fieldname>
                        (, lower : <min> , include_lower : <min_included> )?
                        (, upper : <max> , include_upper : <max_included> )?
                     }}';

where:

lower: lower bound of the range.
include_lower (default = false): if the lower bound is included (left-closed range).
upper: upper bound of the range.
include_upper (default = false): if the upper bound is included (right-closed range).

Lower and upper will default to $-/+\infty$ for number. In the case of byte and string like data (bytes, inet, string, text), all values from lower up to upper will be returned if both are specified. If only “lower” is specified, all rows with values from “lower” will be returned. If only “upper” is specified then all rows with field values up to “upper” will be returned. If both are omitted than all rows will be returned.

Example 1: will return rows where age is in [1, ∞)

SELECT * FROM test.users
WHERE stratio_col = '{query : {
                        type          : "range",
                        field         : "age",
                        lower         : 1,
                        include_lower : true }}';

Example 2: will return rows where age is in (-∞, 0]

SELECT * FROM test.users
WHERE stratio_col = '{query : {
                        type          : "range",
                        field         : "age",
                        upper         : 0,
                        include_upper : true }}';

Example 3: will return rows where age is in [-1, 1]

SELECT * FROM test.users
WHERE stratio_col = '{query : {
                        type          : "range",
                        field         : "age",
                        lower         : -1,
                        upper         : 1,
                        include_lower : true,
                        include_upper : true }}';

Example 4: will return rows where date is in [2014/01/01, 2014/01/02]

SELECT * FROM test.users
WHERE stratio_col = '{query : {
                        type          : "range",
                        field         : "date",
                        lower         : "2014/01/01",
                        upper         : "2014/01/02",
                        include_lower : true,
                        include_upper : true }}';

###Regexp query

Syntax:

SELECT * FROM test.users
WHERE stratio_col = '{query : {
                        type  : "regexp",
                        field : <fieldname>,
                        value : <regexp>
                     }}';

where:

value: a regular expression. See org.apache.lucene.util.automaton.RegExp for syntax reference.

Example: will return rows where name contains a word that starts with “p” and a vowel repeated twice (e.g. “pape”).

SELECT * FROM test.users
WHERE stratio_col = '{query : {
                        type  : "regexp",
                        field : "name",
                        value : "[J][aeiou]{2}.*" }}';

###Wildcard query

Syntax:

SELECT * FROM test.users
WHERE stratio_col = '{query : {
                        type    : "wildcard" ,
                        field   : <fieldname> ,
                        value   : <wildcard_exp>
                     }}';

where:

value: a wildcard expression. Supported wildcards are *, which matches any character sequence (including the empty one), and ?, which matches any single character. ” is the escape character.

Example: will return rows where food starts with or is “tu”.

SELECT * FROM test.users
WHERE stratio_col = '{query : {
                        type  : "wildcard",
                        field : "food",
                        value : "tu*" }}';

Spark and Hadoop Integration

Spark and Hadoop integrations are fully supported because Lucene queries can be combined with token range queries and pagination, which are the basis of MapReduce frameworks support.

###Token Range Queries

The token function allows computing the token for a given partition key. The primary key of the example table “users” is ((name, gender), animal, age) where (name, gender) is the partition key. When combining the token function and a Lucene-based filter in a where clause, the filter on tokens is applied first and then the condition of the filter clause.

Example: will retrieve rows which tokens are greater than (‘Alicia’, ‘female’) and then test them against the match condition.

SELECT name,gender
  FROM test.users
 WHERE stratio_col='{filter : {type : "match", field : "food", value : "chips"}}'
   AND token(name, gender) > token('Alicia', 'female');

###Pagination

Pagination over filtered results is fully supported. You can retrieve the rows starting from a certain key. For example, if the primary key is (userid, createdAt), you can query:

SELECT *
  FROM tweets
  WHERE stratio_col = ‘{ filter : {type:”match",  field:”text", value:”cassandra”} }’
    AND userid = 3543534
    AND createdAt > 2011-02-03 04:05+0000
  LIMIT 5000;

Datatypes Mapping

###CQL to Field type

CQL type	Description	Field type	Query types
ascii	US-ASCII character string	string/text	All
bigint	64-bit signed long	long	boolean contains match range
blob	Arbitrary bytes (no validation), expressed as hexadecimal	bytes	All
boolean	true or false	boolean	All
counter	Distributed counter value (64-bit long)	not supported
decimal	Variable-precision decimal	bigdec	All
double	64-bit IEEE-754 floating point	double	boolean contains match range
float	32-bit IEEE-754 floating point	float	boolean contains match range
inet	IP address string in IPv4 or IPv6 format	inet	All
int	32-bit signed integer	integer	boolean contains match range
list<T>	A collection of one or more ordered elements	Type of list elements	see element type
map<K,V>	A JSON-style array of literals: { literal : literal, literal : literal … }	Type of values	see element type
set<T>	A collection of one or more elements	Type of set elements	see element type
text	UTF-8 encoded string	string/text	All
timestamp	Date plus time, encoded as 8 bytes since epoch	date	boolean contains match range
uuid	Type 1 or type 4 UUID	uuid	All
timeuuid	Type 1 UUID only (CQL3)	uuid	All
varchar	UTF-8 encoded string	string/text	All
varint	Arbitrary-precision integer	bigint	All

###Field type to CQL

field type	CQL type	Supported in Query types
bigdec	decimal	All
bigint	varint	All
boolean	boolean	All
bytes	blob	All
date	timestamp	boolean match range
double	double	boolean match range
float	float	boolean match range
inet	inet	All
integer	int	boolean match range
long	bigint	boolean match range
string/text	ascii text varchar	All
uuid	uuid timeuuid	All

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

extended-search-in-cassandra.md

extended-search-in-cassandra.md

Extended Search in Cassandra

Table of Contents

Overview

Index Creation

Queries

Spark and Hadoop Integration

Datatypes Mapping

Files

extended-search-in-cassandra.md

Latest commit

History

extended-search-in-cassandra.md

File metadata and controls

Extended Search in Cassandra

Table of Contents

Overview

Index Creation

Queries

Spark and Hadoop Integration

Datatypes Mapping