Cassandra index functionality has been extended to provide near real time search such as ElasticSearch or Solr, including full text search capabilities and free multivariable search.
It is also fully compatible with Apache Spark and Apache Hadoop, allowing you to filter data at database level. This speeds up jobs reducing the amount of data to be collected and processed.
Indexing is achieved through a Lucene based implementation of Cassandra secondary indexes, where each node of the cluster indexes its own data. Stratio Cassandra is one of the core modules on which Stratio's BigData platform (SDS) is based.
Lucene search technology integration into Cassandra provides:
- Big data full text search
- Relevance scoring and sorting
- General top-k queries
- Complex boolean queries (and, or, not)
- Near real-time search
- Custom analyzers
- CQL3 support
- Wide rows support
- Partition and cluster composite keys support
- Support for indexing columns part of primary key
- Third-party drivers compatibility
- Spark compatibility
- Hadoop compatibility
Not yet supported:
- Thrift API
- Legacy compact storage option
- Indexing
counter
columns - Columns with TTL
- CQL user defined types
- Static columns
###Syntax
CREATE CUSTOM INDEX (IF NOT EXISTS)? <index_name>
ON <table_name> ( <magic_column> )
USING 'com.stratio.cassandra.index.RowIndex'
WITH OPTIONS = <options>
where:
- <magic_column> is the name of a text column that does not contain any data and will be used to show the scoring for each resulting row of a query,
- <options> is a JSON object:
<options> := { ('refresh_seconds' : '<int_value>',)?
('ram_buffer_mb' : '<int_value>',)?
('max_merge_mb' : '<int_value>',)?
('max_cached_mb' : '<int_value>',)?
('indexing_threads' : '<int_value>',)?
('indexing_queues_size' : '<int_value>',)?
'schema' : '<schema_definition>'};
Options, except “schema”, take a positive integer value enclosed in single quotes:
- refresh_seconds: number of seconds before refreshing the index (between writers and readers). Defaults to ’60’.
- ram_buffer_mb: size of the write buffer. Its content will be committed to disk when full. Defaults to ’64’.
- max_merge_mb: defaults to ’5’.
- max_cached_mb: defaults to ’30’.
- indexing_threads: number of asynchronous indexing threads. ’0’ means synchronous indexing. Defaults to ’0’.
- indexing_queues_size: max number of queued documents per asynchronous indexing thread. Defaults to ’50’.
- schema: see below
<schema_definition> := {
(analyzers : { <analyzer_definition> (, <analyzer_definition>)* } ,)?
(default_analyzer : "<analyzer_name>",)?
fields : { <field_definition> (, <field_definition>)* }
}
Where default_analyzer defaults to ‘org.apache.lucene.analysis.standard.StandardAnalyzer’.
<analyzer_definition> := <analyzer_name> : {
type : "<analyzer_type>" (, <option> : "<value>")*
}
Analyzer definition options depend on the analyzer type. Details and default values are listed in the table below.
Analyzer type | Option | Value type | Default value |
---|---|---|---|
classpath | class | string | null |
snowball | language | string | null |
stopwords | string | null |
<field_definition> := <column_name> : {
type : "<field_type>" (, <option> : "<value>")*
}
Field definition options depend on the field type. Details and default values are listed in the table below.
Field type | Option | Value type | Default value |
---|---|---|---|
bigdec | integer_digits | positive integer | 32 |
decimal_digits | positive integer | 32 | |
bigint | digits | positive integer | 32 |
date | pattern | date format (string) | yyyy/MM/dd HH:mm:ss.SSS |
double, float, integer, long | boost | float | 0.1f |
text | analyzer | class name (string) | default_analyzer of the schema |
Note that Cassandra allows one custom index per table. On the other hand, Cassandra does not allow a modify operation on indexes. To modify an index it needs to be deleted first and created again.
###Example
This code below and the one for creating the corresponding keyspace and table is available in a CQL script that can be sourced from the Cassandra shell: test-users-create.cql.
CREATE CUSTOM INDEX IF NOT EXISTS users_index
ON test.users (stratio_col)
USING 'com.stratio.cassandra.index.RowIndex'
WITH OPTIONS = {
'refresh_seconds' : '1',
'ram_buffer_mb' : '64',
'max_merge_mb' : '5',
'max_cached_mb' : '30',
'indexing_threads' : '4',
'indexing_queues_size' : '50',
'schema' : '{
analyzers : {
my_custom_analyzer : {
type:"snowball",
language:"Spanish",
stopwords : "el,la,lo,loas,las,a,ante,bajo,cabe,con,contra"}
},
default_analyzer : "english",
fields : {
name : {type : "string"},
gender : {type : "string"},
animal : {type : "string"},
age : {type : "integer"},
food : {type : "string"},
number : {type : "integer"},
bool : {type : "boolean"},
date : {type : "date",
pattern : "yyyy/MM/dd"},
mapz : {type : "string"},
setz : {type : "string"},
listz : {type : "string"},
phrase : {type : "text",
analyzer : "my_custom_analyzer"}
}
}'
};
###Syntax:
SELECT ( <fields> | * )
FROM <table_name>
WHERE <magic_column> = '{ ( query : <query> )?
( , filter : <filter> )?
( , sort : <sort> )?
}';
where <query> and <filter> are a JSON object:
<query> := { type : <type> (, <option> : ( <value> | <value_list> ) )+ }
and <sort> is another JSON object:
<sort> := { fields : <sort_field> (, <sort_field> )* }
<sort_field> := { field : <field> (, reverse : <reverse> )? }
When searching by <query>, results are returned sorted by descending relevance without pagination. The results will be located in the column ‘stratio_relevance’.
Filter types and options are the same as the query ones. The difference with queries is that filters have no effect on scoring.
Sort option is used to specify the order in which the indexed rows will be traversed. When sorting is used, the query scoring is delayed.
If no query or sorting options are specified then the results are returned in the Cassandra’s natural order, which is defined by the partitioner and the column name comparator.
Types of query and their options are summarized in the table below. Details for each of them are available in individual sections and the examples can be downloaded as a CQL script: extended-search-examples.cql.
In addition to the options described in the table, all query types have a “boost” option that acts as a weight on the resulting score.
Query type | Supported Field type | Options |
---|---|---|
Boolean | subqueries |
|
Contains | All |
|
Fuzzy | bytes inet string text |
|
Match | All |
|
Phrase | bytes inet text |
|
Prefix | bytes inet string text |
|
Range | All |
|
Regexp | bytes inet string text |
|
Wildcard | bytes inet string text |
|
###Boolean query
Syntax:
SELECT ( <fields> | * )
FROM <table>
WHERE <magic_column> = '{ query : {
type : "boolean",
( must : [(query,)?] , )?
( should : [(query,)?] , )?
( not : [(query,)?] , )? } }';
where:
- must: represents the conjunction of queries: query1 AND query2 AND … AND queryn
- should: represents the disjunction of queries: query1 OR query12 OR … OR queryn
- not: represents the negation of the disjunction of queries: NOT(query1 OR query2 OR … OR queryn)
Since "not" will be applied to the results of a "must" or "should" condition, it can not be used in isolation.
Example 1: will return rows where name ends with “a” AND food starts with “tu”
SELECT * FROM test.users
WHERE stratio_col = '{query : {
type : "boolean",
must : [{type : "wildcard", field : "name", value : "*a"},
{type : "wildcard", field : "food", value : "tu*"}]}}';
Example 2: will return rows where food starts with “tu” but name does not end with “a”
SELECT * FROM test.users
WHERE stratio_col = '{query : {
type : "boolean",
not : [{type : "wildcard", field : "name", value : "*a"}],
must : [{type : "wildcard", field : "food", value : "tu*"}]}}';
Example 3: will return rows where name ends with “a” or food starts with “tu”
SELECT * FROM test.users
WHERE stratio_col = '{query : {
type : "boolean",
should : [{type : "wildcard", field : "name", value : "*a"},
{type : "wildcard", field : "food", value : "tu*"}]}}';
###Contains query
Syntax:
SELECT ( <fields> | * )
FROM <table>
WHERE <magic_column> = '{ query : {
type : "contains",
field : <fieldname> ,
values : <value_list> }}';
Example 1: will return rows where name matches “Alicia” or “mancha”
SELECT * FROM test.users
WHERE stratio_col = '{query : {
type : "contains",
field : "name",
values : ["Alicia","mancha"] }}';
Example 2: will return rows where date matches “2014/01/01″, “2014/01/02″ or “2014/01/03″
SELECT * FROM test.users
WHERE stratio_col = '{query : {
type : "contains",
field : "date",
values : ["2014/01/01", "2014/01/02", "2014/01/03"] }}';
###Fuzzy query
Syntax:
SELECT ( <fields> | * )
FROM <table>
WHERE <magic_column> = '{ query : {
type : "fuzzy",
field : <fieldname> ,
value : <value>
(, max_edits : <max_edits> )?
(, prefix_length : <prefix_length> )?
(, max_expansions: <max_expansion> )?
(, transpositions: <transposition> )?
}}';
where:
- max_edits (default = 2): a integer value between 0 and 2. Will return rows which distance from <value> to <field> content has a distance of at most <max_edits>. Distance will be interpreted according to the value of “transpositions”.
- prefix_length (default = 0): an integer value being the length of the common non-fuzzy prefix
- max_expansions (default = 50): an integer for the maximum number of terms to match
- transpositions (default = true): if transpositions should be treated as a primitive edit operation (Damerau-Levenshtein distance). When false, comparisons will implement the classic Levenshtein distance.
Example 1: will return any rows where “phrase” contains a word that differs in one edit operation from “puma”, such as “pumas”.
SELECT * FROM test.users
WHERE stratio_col = '{query : { type : "fuzzy",
field : "phrase",
value : "puma",
max_edits : 1 }}';
Example 2: same as example 1 but will limit the results to rows where phrase contains a word that starts with “pu”.
SELECT * FROM test.users
WHERE stratio_col = '{query : { type : "fuzzy",
field : "phrase",
value : "puma",
max_edits : 1,
prefix_length : 2 }}';
###Match query
Syntax:
SELECT ( <fields> | * )
FROM <table>
WHERE <magic_column> = '{ query : {
type : "match",
field : <fieldname> ,
value : <value> }}';
Example 1: will return rows where name matches “Alicia”
SELECT * FROM test.users
WHERE stratio_col = '{query : {
type : "match",
field : "name",
value : "Alicia" }}';
Example 2: will return rows where phrase contains “mancha”
SELECT * FROM test.users
WHERE stratio_col = '{query : {
type : "match",
field : "phrase",
value : "mancha" }}';
Example 3: will return rows where date matches “2014/01/01″
SELECT * FROM test.users
WHERE stratio_col = '{query : {
type : "match",
field : "date",
value : "2014/01/01" }}';
###Phrase query
Syntax:
SELECT ( <fields> | * )
FROM <table>
WHERE <magic_column> = '{ query : {
type :"phrase",
field : <fieldname> ,
values : <value_list>
(, slop : <slop> )?
}}';
where:
- values: an ordered list of values.
- slop (default = 0): number of words permitted between words.
Example 1: will return rows where “phrase” contains the word “camisa” followed by the word “manchada”.
SELECT * FROM test.users
WHERE stratio_col = '{query : {
type : "phrase",
field : "phrase",
values : ["camisa", "manchada"] }}';
Example 2: will return rows where “phrase” contains the word “mancha” followed by the word “camisa” having 0 to 2 words in between.
SELECT * FROM test.users
WHERE stratio_col = '{query : {
type : "phrase",
field : "phrase",
values : ["mancha", "camisa"],
slop : 2 }}';
###Prefix query
Syntax:
SELECT ( <fields> | * )
FROM <table>
WHERE <magic_column> = '{ query : {
type : "prefix",
field : <fieldname> ,
value : <value> }}';
Example: will return rows where “phrase” contains a word starting with “lu”. If the column is indexed as “text” and uses an analyzer, words ignored by the analyzer will not be retrieved.
SELECT * FROM test.users
WHERE stratio_col = '{query : {
type : "prefix",
field : "phrase",
value : "lu" }}';
###Range query
Syntax:
SELECT * FROM test.users
WHERE stratio_col = '{query : {
type : "range",
field : <fieldname>
(, lower : <min> , include_lower : <min_included> )?
(, upper : <max> , include_upper : <max_included> )?
}}';
where:
- lower: lower bound of the range.
- include_lower (default = false): if the lower bound is included (left-closed range).
- upper: upper bound of the range.
- include_upper (default = false): if the upper bound is included (right-closed range).
Lower and upper will default to
Example 1: will return rows where age is in [1, ∞)
SELECT * FROM test.users
WHERE stratio_col = '{query : {
type : "range",
field : "age",
lower : 1,
include_lower : true }}';
Example 2: will return rows where age is in (-∞, 0]
SELECT * FROM test.users
WHERE stratio_col = '{query : {
type : "range",
field : "age",
upper : 0,
include_upper : true }}';
Example 3: will return rows where age is in [-1, 1]
SELECT * FROM test.users
WHERE stratio_col = '{query : {
type : "range",
field : "age",
lower : -1,
upper : 1,
include_lower : true,
include_upper : true }}';
Example 4: will return rows where date is in [2014/01/01, 2014/01/02]
SELECT * FROM test.users
WHERE stratio_col = '{query : {
type : "range",
field : "date",
lower : "2014/01/01",
upper : "2014/01/02",
include_lower : true,
include_upper : true }}';
###Regexp query
Syntax:
SELECT * FROM test.users
WHERE stratio_col = '{query : {
type : "regexp",
field : <fieldname>,
value : <regexp>
}}';
where:
- value: a regular expression. See org.apache.lucene.util.automaton.RegExp for syntax reference.
Example: will return rows where name contains a word that starts with “p” and a vowel repeated twice (e.g. “pape”).
SELECT * FROM test.users
WHERE stratio_col = '{query : {
type : "regexp",
field : "name",
value : "[J][aeiou]{2}.*" }}';
###Wildcard query
Syntax:
SELECT * FROM test.users
WHERE stratio_col = '{query : {
type : "wildcard" ,
field : <fieldname> ,
value : <wildcard_exp>
}}';
where:
- value: a wildcard expression. Supported wildcards are *, which matches any character sequence (including the empty one), and ?, which matches any single character. ” is the escape character.
Example: will return rows where food starts with or is “tu”.
SELECT * FROM test.users
WHERE stratio_col = '{query : {
type : "wildcard",
field : "food",
value : "tu*" }}';
Spark and Hadoop integrations are fully supported because Lucene queries can be combined with token range queries and pagination, which are the basis of MapReduce frameworks support.
###Token Range Queries
The token function allows computing the token for a given partition key. The primary key of the example table “users” is ((name, gender), animal, age) where (name, gender) is the partition key. When combining the token function and a Lucene-based filter in a where clause, the filter on tokens is applied first and then the condition of the filter clause.
Example: will retrieve rows which tokens are greater than (‘Alicia’, ‘female’) and then test them against the match condition.
SELECT name,gender
FROM test.users
WHERE stratio_col='{filter : {type : "match", field : "food", value : "chips"}}'
AND token(name, gender) > token('Alicia', 'female');
###Pagination
Pagination over filtered results is fully supported. You can retrieve the rows starting from a certain key. For example, if the primary key is (userid, createdAt), you can query:
SELECT *
FROM tweets
WHERE stratio_col = ‘{ filter : {type:”match", field:”text", value:”cassandra”} }’
AND userid = 3543534
AND createdAt > 2011-02-03 04:05+0000
LIMIT 5000;
###CQL to Field type
CQL type | Description | Field type | Query types |
---|---|---|---|
ascii | US-ASCII character string | string/text | All |
bigint | 64-bit signed long | long | boolean contains match range |
blob | Arbitrary bytes (no validation), expressed as hexadecimal | bytes | All |
boolean | true or false | boolean | All |
counter | Distributed counter value (64-bit long) | not supported | |
decimal | Variable-precision decimal | bigdec | All |
double | 64-bit IEEE-754 floating point | double | boolean contains match range |
float | 32-bit IEEE-754 floating point | float | boolean contains match range |
inet | IP address string in IPv4 or IPv6 format | inet | All |
int | 32-bit signed integer | integer | boolean contains match range |
list<T> | A collection of one or more ordered elements | Type of list elements | see element type |
map<K,V> | A JSON-style array of literals: { literal : literal, literal : literal … } | Type of values | see element type |
set<T> | A collection of one or more elements | Type of set elements | see element type |
text | UTF-8 encoded string | string/text | All |
timestamp | Date plus time, encoded as 8 bytes since epoch | date | boolean contains match range |
uuid | Type 1 or type 4 UUID | uuid | All |
timeuuid | Type 1 UUID only (CQL3) | uuid | All |
varchar | UTF-8 encoded string | string/text | All |
varint | Arbitrary-precision integer | bigint | All |
###Field type to CQL
field type | CQL type | Supported in Query types |
---|---|---|
bigdec | decimal | All |
bigint | varint | All |
boolean | boolean | All |
bytes | blob | All |
date | timestamp | boolean match range |
double | double | boolean match range |
float | float | boolean match range |
inet | inet | All |
integer | int | boolean match range |
long | bigint | boolean match range |
string/text | ascii text varchar |
All |
uuid | uuid timeuuid |
All |