- Use the Data is Plural list of datasets as a dataset itself
- https://docs.google.com/spreadsheets/d/1wZhPLMCHKJvwOkP4juclhjFgqIY8fQFMemwKL2c64vk/edit#gid=0
- Save as TSV (tab-separated values)
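- One way to grab it from the command line (a sketch, assuming the sheet is publicly readable; export?format=tsv is Google Sheets' standard export endpoint):
curl -L -o dip-data.tsv "https://docs.google.com/spreadsheets/d/1wZhPLMCHKJvwOkP4juclhjFgqIY8fQFMemwKL2c64vk/export?format=tsv&gid=0"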
(The sequence below is similar to, but different from, the one in the presentation)
- Start server in its own directory
- Create directory (e.g. sroot)
- Copy solr.xml into that directory from <solr install>/server/solr/
bin/solr start -s .../sroot
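- The whole sequence as a shell sketch (the paths here are illustrative placeholders for your own Solr install and sroot locations):
mkdir /path/to/sroot
cp /path/to/solr-install/server/solr/solr.xml /path/to/sroot/
bin/solr start -s /path/to/sroot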
- Create core
bin/solr create -c dip -d .../configsets/minimal
- Try indexing downloaded tsv file
- Based on https://lucene.apache.org/solr/guide/7_4/post-tool.html#indexing-csv and https://lucene.apache.org/solr/guide/7_4/uploading-data-with-index-handlers.html#csv-formatted-index-updates
bin/post -c dip -params "separator=%09" -type "text/csv" .../dip-data.tsv
- Fails as there is no id field.
- Let's use rowid as the unique id (the rowid=id parameter)
bin/post -c dip -params "separator=%09&rowid=id" -type "text/csv" .../dip-data.tsv
- Check in Solr Admin UI (Query screen)
- 605 records, seven fields (edition, position, headline, text, links, hattips, id)
- Search for data (in the q field): 441 records
- Search for DATA: same count, searches are case-insensitive
- Let's quickly check how many entries we get by issue:
- http://localhost:8983/solr/dip/select?facet.field=edition&facet=on&q=*:*&rows=0
- Very consistent: 5 per issue
- What about by year (with facet queries):
http://localhost:8983/solr/dip/select?facet=on&q=*:*&rows=0&facet.query={!prefix%20f=edition}2014&facet.query={!prefix%20f=edition}2015&facet.query={!prefix%20f=edition}2016&facet.query={!prefix%20f=edition}2017&facet.query={!prefix%20f=edition}2018
- OK, we can tell we really ramped up in 2016 and then dropped in 2017. We'll see what this year is like.
- First, let's check the links field's definition (Admin UI/Schema); notice how it maps to the dynamic field *
- If we look in the schema's definition file, it is not there explicitly - that's the benefit of the current approach. Schemaless would be a different method, as it detects types, but it has its own challenges that we (as committers) are still trying to figure out. I like this one more.
- So, let's create an explicit definition, which is possible via the API/Admin UI because we are in managed schema mode and are not locked down.
- In the Admin UI, delete the links field and recreate it as stored/indexed/multivalued.
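- The recreate half also has a rough Schema API equivalent, POSTed to /solr/dip/schema (a sketch; the string field type is an assumption, the presentation uses the Admin UI):
{"add-field": {"name":"links", "type":"string", "indexed":true, "stored":true, "multiValued":true}}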
- Normally, we could just reindex, but because the field uses docValues and going from single-valued to multivalued is a challenge, we need to delete the index first.
- Command line delete:
bin/post -c dip -d $"<delete><query>*:*</query></delete>"
- Or we can do it in the Admin UI/Documents screen, by putting this delete message in as a Solr Command (raw XML or JSON). It can also be done as JSON:
{"delete": { "query":"*:*" } }
- So, now we can index the TSV file again, adding instructions to split the links field on space (by adding the f.links.split and f.links.separator parameters)
bin/post -c dip -params "separator=%09&rowid=id&f.links.split=true&f.links.separator=%20" -type "text/csv" .../dip-data.tsv
- Now we can rerun the basic query and see the links are split
- Let's now find the records that have 3 or more links. It is surprisingly hard because we are - effectively - searching the dictionary, not the records themselves. So, "shape the data for search" - precalculate.
- Introducing Update Request Processors - they run once the data format is normalized but before we hit the schema.
- There is a lot these URPs can do - if one can find them in the Reference Guide... (Well-Configured Solr Instance/Configuring solrconfig.xml/Update Request Processors)
- We want CountFieldValuesUpdateProcessorFactory, but by itself it will replace the value; so we want to clone the field first with CloneFieldUpdateProcessorFactory
- We cannot do this in the Admin UI yet, so we have two options:
- Update solrconfig.xml by hand (in non-SolrCloud setup) and reload the core;
- Use the Config API, which creates a configoverlay.json file. Not everything that can be done by hand in solrconfig.xml can be done with the Config API yet, but most things can be.
- We are going to use the Config API, in the Documents screen, by setting Request-Handler to "/config" and Document Type to "Solr Command":
{
  "add-updateprocessor": {
    "name": "cloneLinksToCount",
    "class": "solr.CloneFieldUpdateProcessorFactory",
    "source": "links",
    "dest": "linksCount"
  },
  "add-updateprocessor": {
    "name": "countLinks",
    "class": "solr.CountFieldValuesUpdateProcessorFactory",
    "fieldName": "linksCount"
  }
}
Notice that we have a key-name duplication; that's one of Solr's little extras to allow fitting repeating constructs into JSON. It is not very good, but JSON is somewhat limited compared to XML.
- Let's check that we see the definition in the configoverlay.json file in the Files screen of the Admin UI
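- It can also be fetched over HTTP; the Config API exposes the overlay directly (assuming the default host/port):
curl "http://localhost:8983/solr/dip/config/overlay"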
- We are also going to explicitly define the linksCount field. We do not have to if we are OK with it as a string, but it would be nicer to actually recognize it is an integer, so we need to define the field explicitly, starting with its type.
- In the Documents screen, set Request-Handler to /schema and send the command:
{"add-field-type" : { "name":"pint", "class":"solr.IntPointField", "docValues":"true" }}
- In the Admin UI, create the new linksCount field of pint type with the default set to 0. No need to delete records; we can just reindex.
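- That Admin UI step corresponds to a Schema API call roughly like this (a sketch; properties other than the type and default are assumptions):
{"add-field": {"name":"linksCount", "type":"pint", "indexed":true, "stored":true, "default":"0"}}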
- So, let's reindex, including the processor parameter that will invoke those URPs:
bin/post -c dip -params "separator=%09&rowid=id&f.links.split=true&f.links.separator=%20&processor=cloneLinksToCount,countLinks" -type "text/csv" .../dip-data.tsv
- And finally the query to find records with 3 or more links:
http://localhost:8983/solr/dip/select?q=linksCount:[3 TO *]&sort=linksCount asc
- The indexing parameters are getting long; we can store them server-side as a named paramset with the Request Parameters API. POST to http://localhost:8983/solr/dip/config/params with the raw JSON body of:
{
  "set": {
    "DIP_INDEX": {
      "separator": "\t",
      "rowid": "id",
      "f.links.split": "true",
      "f.links.separator": " ",
      "processor": "cloneLinksToCount,countLinks"
    }
  }
}
- The post command can now be:
bin/post -c dip -params "useParams=DIP_INDEX" -type "text/csv" .../dip-data.tsv
- Use JSON Facet API
- Could use the json.facet parameter, but it is a bit ugly for multi-line JSON
- Let's use the POST body as a full request
- POST to http://localhost:8983/solr/dip/select with the raw JSON body of:
{
  "params": {
    "q": "*:*",
    "rows": 5
  },
  "facet": {
    "avgLinks": "avg(linksCount)",
    "maxLinks": "max(linksCount)",
    "minLinks": "min(linksCount)"
  }
}
- We should see 5 entries and then a facets block with an average of 3.4..., a maximum of 13, and a minimum of 1.
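- One way to send such a body is curl (a sketch; any HTTP client that can POST JSON works):
curl -X POST -H "Content-Type: application/json" "http://localhost:8983/solr/dip/select" -d '{"params":{"q":"*:*","rows":5},"facet":{"avgLinks":"avg(linksCount)","maxLinks":"max(linksCount)","minLinks":"min(linksCount)"}}'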
- POST to the same URL with the raw JSON body of:
{
  "params": {
    "q": "*:*",
    "rows": 1
  },
  "facet": {
    "by_position": {
      "type": "terms",
      "field": "position",
      "facet": {
        "avgLinks": "avg(linksCount)",
        "maxLinks": "max(linksCount)"
      }
    }
  }
}
- We should get 1 record and then a facet breakdown by position, and for each position get avgLinks and maxLinks. Clearly, the lower (earlier) positions have a tendency toward more links here.
- Our goal was to find another dataset; let's search for "cat"
- We get some (4) "cat" matches, but not all of them are about felines
- Because we search (as defined in solrconfig.xml) the _text_ field (notice the underscores), and that (in managed-schema) is a copyField of * to _text_
- Let's search against just headline and text; they are text_basic per the dynamicField definition [http://localhost:8983/solr/dip/select?q=headline:cat text:cat]
- That works, but it is getting quite long; can we just specify the fields we are searching?
- Yes, as the query is processed by one (or multiple) query parsers and the 'df' is for the default one
- eDisMax gives us a lot more options, including fields to search with boosts
defType=edismax &q=cat &qf=headline^10 text^5 _text_
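- As a full URL that would look roughly like this (assuming the default host/port, with the spaces in qf encoded as +):
http://localhost:8983/solr/dip/select?defType=edismax&q=cat&qf=headline^10+text^5+_text_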
- http://splainer.io/ (from OpenSourceConnections) is great to check what is going on there, can be run against your own local instance
- So, that's eDisMax; there are another 25 or so specialized parsers, which you can use standalone or in combination
- What about cats?
- It gives a different result because we are not doing any language processing. Not because we cannot, but because we do not know what you want.
- Compare text_basic with (default schema's) text_en, text_en_splitting, text_en_split_tight
- And that's just English, Solr has many more types and you can compose your own, just have a look at
- text_fa - demonstrating Arabic- and Persian-specific handling, and CharFilters
- My own definition that allows searching Thai with English phonetics, using the bundled IBM ICU4J library: thai_english
- My own definition to search phone numbers by suffix and associated presentation: phone
- The pipeline can be different for index and search; it can have CharFilters, a Tokenizer, and TokenFilters - see my Solr Start resource website
- So, let's tell Solr that we know text and headline are English (text_en)
- Can do it via the API, but - for a non-Cloud instance - the config files are local and we can hack a bit and do copy/paste
- Copy field type definition
- from server/solr/configsets/_default/conf/managed-schema
- to .../server/dip/conf/managed-schema (not original configset)
- anywhere in the file, because Solr does not care about the order in managed-schema (it does in solrconfig.xml)
- Copy associated resource files as well:
- lang/stopwords_en.txt
- protwords.txt
- synonyms.txt
- Create explicit field definitions for text and headline using that field type
<field name="text" type="text_en" indexed="true" stored="true"/>
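- and, following the same pattern, for headline:
<field name="headline" type="text_en" indexed="true" stored="true"/>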
- Reload the core in Admin UI
- Reindex
- Rerun our searches (in splainer) and see that we now find plurals and non-plurals
- Also Admin UI's Analysis console allows to see how index and/or query pipeline deal with the text
- So, to recap, we picked a cute dataset to play with and went through:
- Getting Solr running, including with its own location
- Creating a new core, with custom configuration
- Working with a TSV file, adding automatically-generated IDs
- Modifying schema and configuration using API/Admin UI
- Pre-processing records with Update Request Processors
- Indexing, basic searching, and deleting records
- Basic and advanced facets, including statistics
- Solr is an index, not a database - so much follows from that
- Work backwards from search requirements
- Be ready to reindex
- Be ready to de-normalize, duplicate, and double-handle