dip-README.md
Get the dataset

(The sequence below is similar to, but not identical to, the one in the presentation.)

Setup Solr core

  1. Start server in its own directory
    1. Create directory (e.g. sroot)
    2. Copy solr.xml into that directory from <solr install>/server/solr/
    3. bin/solr start -s .../sroot
  2. Create core
    1. bin/solr create -c dip -d .../configsets/minimal
  3. Try indexing downloaded tsv file
    1. Based on https://lucene.apache.org/solr/guide/7_4/post-tool.html#indexing-csv and https://lucene.apache.org/solr/guide/7_4/uploading-data-with-index-handlers.html#csv-formatted-index-updates
    2. bin/post -c dip -params "separator=%09" -type "text/csv" .../dip-data.tsv
    3. Fails as there is no id field.
    4. Let's use rowid as unique id (rowid=id)
    5. bin/post -c dip -params "separator=%09&rowid=id" -type "text/csv" .../dip-data.tsv
  4. Check in Solr Admin UI (Query screen)
    1. 605 records, seven fields (edition, position, headline, text, links, hattips, id)
    2. Search for data (in q field): 441 records
    3. Search for DATA - same result; searches are case-insensitive
  5. Let's quickly check how many entries we get by issue:
    1. http://localhost:8983/solr/dip/select?facet.field=edition&facet=on&q=*:*&rows=0
    2. Very consistent: 5 per issue
    3. What about by year (with facet query):
      http://localhost:8983/solr/dip/select?facet=on&q=*:*&rows=0
      &facet.query={!prefix%20f=edition}2014
      &facet.query={!prefix%20f=edition}2015
      &facet.query={!prefix%20f=edition}2016
      &facet.query={!prefix%20f=edition}2017
      &facet.query={!prefix%20f=edition}2018
      
    4. OK, we can tell we really ramped up in 2016 and then dropped in 2017. We'll see what this year is like. (The whole setup-and-indexing sequence is consolidated in the shell sketch below.)
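
For reference, here is the whole setup-and-first-index sequence above as one hedged shell sketch. The path variables (SOLR_INSTALL, SROOT, MINIMAL_CONF, DATA) are placeholders introduced here, not from the original write-up; adjust them to your layout.

    # Placeholder paths (not from the original write-up) - adjust to your layout
    SOLR_INSTALL=/path/to/solr-7.x            # unpacked Solr distribution
    SROOT=/path/to/sroot                      # the dedicated Solr home created above
    MINIMAL_CONF=/path/to/configsets/minimal  # the 'minimal' configset
    DATA=/path/to/dip-data.tsv                # the downloaded dataset

    cd "$SOLR_INSTALL"

    # 1. Solr home with its own solr.xml
    mkdir -p "$SROOT"
    cp server/solr/solr.xml "$SROOT/"
    bin/solr start -s "$SROOT"

    # 2. create the core with the minimal configset
    bin/solr create -c dip -d "$MINIMAL_CONF"

    # 3. index the TSV, generating document ids from the row number
    bin/post -c dip -params "separator=%09&rowid=id" -type "text/csv" "$DATA"

    # 4. quick checks: total count, then counts per edition
    curl "http://localhost:8983/solr/dip/select?q=*:*&rows=0"
    curl "http://localhost:8983/solr/dip/select?q=*:*&rows=0&facet=on&facet.field=edition"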

Iterate on schema design

Make links multivalued, splitting on spaces

  1. First, let's check the definition (Admin UI/Schema); notice how it maps to the dynamic field *

  2. If we look in the schema's definition file, it is not there explicitly - that's the benefit of the current approach. Schemaless would be a different method, as it detects types, but it has its own challenges that we (as committers) are still trying to figure out. I like this one more.

  3. So, let's create an explicit definition, which is possible via the API/Admin UI because we are in managed-schema mode and are not locked down.

  4. In the Admin UI, delete the links field and recreate it as stored/indexed/multivalued (a Schema API equivalent is sketched after this list).

  5. Normally we could just reindex, but because the field uses docValues and switching from single-valued to multivalued is a problem, we need to delete the existing index first.

  6. Command line delete:

    bin/post -c dip -d "<delete><query>*:*</query></delete>"

  7. Or we can do it in the Admin UI's Documents screen by submitting this delete message as a Solr Command (raw XML or JSON). The JSON equivalent is:

    {"delete": { "query":"*:*" } }

  8. So now we can index the TSV file again, adding instructions to split the links field on spaces (the f.links.split and f.links.separator parameters):

bin/post -c dip -params "separator=%09&rowid=id&f.links.split=true&f.links.separator=%20" -type "text/csv" .../dip-data.tsv

  9. Now we can rerun the basic query and see that the links are split.
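
A possible Schema API equivalent of the Admin UI step above, as a sketch: the string field type is an assumption (match it to whatever the minimal configset actually uses), and the trailing data path stays elided as in the original commands.

    # Give 'links' an explicit multivalued definition via the Schema API.
    # ('string' as the type is an assumption - pick whatever matches your configset.)
    curl -X POST -H 'Content-type:application/json' \
      http://localhost:8983/solr/dip/schema -d '{
        "add-field": { "name":"links", "type":"string",
                       "indexed":true, "stored":true, "multiValued":true }
      }'

    # Wipe the old single-valued documents, then reindex with links split on spaces
    bin/post -c dip -d "<delete><query>*:*</query></delete>"
    bin/post -c dip -params "separator=%09&rowid=id&f.links.split=true&f.links.separator=%20" \
      -type "text/csv" .../dip-data.tsv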

Find records with more than 3 links

  1. Let's now find the records that have more than 3 links. This is surprisingly hard, because we are - effectively - searching the dictionary, not the records themselves. So, "shape the data for search": precalculate.
  2. Introducing Update Request Processors (URPs) - they run once the data format is normalized, but before we hit the schema.
  3. There is a lot these URPs can do - if one can find them in the Reference Guide (The Well-Configured Solr Instance/Configuring solrconfig.xml/Update Request Processors).
  4. We want CountFieldValuesUpdateProcessorFactory, but by itself it would replace the field's values; so we clone the field first with CloneFieldUpdateProcessorFactory.
  5. We cannot do this in the Admin UI yet, so we have two options:
    1. Update solrconfig.xml by hand (in a non-SolrCloud setup) and reload the core;
    2. Use the Config API, which creates a configoverlay.json file. Not everything that can be done by hand in solrconfig.xml can be done with the Config API yet, but most things can be.
  6. We are going to use the Config API, in the Documents screen, by setting Request-Handler to "/config" and Document Type to "Solr Command" (a curl version of the same call is sketched after this list):
{
    "add-updateprocessor": {
            "name": "cloneLinksToCount",
            "class": "solr.CloneFieldUpdateProcessorFactory",
            "source": "links",
            "dest":"linksCount"
    },
    "add-updateprocessor": {
        "name": "countLinks",
        "class": "solr.CountFieldValuesUpdateProcessorFactory",
        "fieldName": "linksCount"
    }
}
Notice that we have a key-name duplication; that's one of Solr's little extras that allows fitting repeating constructs into JSON. It is not pretty, but JSON is somewhat limited compared to XML.
  7. Let's check that we can see the definition in the configoverlay.json file in the Files screen of the Admin UI

  8. We are also going to explicitly define the linksCount field. We do not have to if we are OK with it as a string, but it would be nicer to actually recognize it as an integer, so we need to define the field (and its type) first.

    1. In the Documents screen, set Request-Handler to /schema and submit the command:
    {"add-field-type" : {
            "name":"pint",
            "class":"solr.IntPointField",
            "docValues":"true"
    }}
    
    2. In the Admin UI, create the new linksCount field using that int (pint) type, with the default set to 0. No need to delete records; we can just reindex.
  9. So, let's reindex, including the processor parameter that will invoke those URPs:

    bin/post -c dip -params "separator=%09&rowid=id&f.links.split=true&f.links.separator=%20&processor=cloneLinksToCount,countLinks" -type "text/csv" .../dip-data.tsv
    
  10. And finally the query to find records with at least 3 links ([3 TO *] is an inclusive range): http://localhost:8983/solr/dip/select?q=linksCount:[3 TO *]&sort=linksCount asc
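
The same Config API call can be made from the command line. A sketch with curl, using the exact JSON body above, plus a check of the resulting overlay and a URL-encoded form of the final query:

    # Register the two URPs through the Config API (same body as above)
    curl -X POST -H 'Content-type:application/json' \
      http://localhost:8983/solr/dip/config -d '{
        "add-updateprocessor": {
          "name": "cloneLinksToCount",
          "class": "solr.CloneFieldUpdateProcessorFactory",
          "source": "links",
          "dest": "linksCount"
        },
        "add-updateprocessor": {
          "name": "countLinks",
          "class": "solr.CountFieldValuesUpdateProcessorFactory",
          "fieldName": "linksCount"
        }
      }'

    # Confirm the commands landed in configoverlay.json
    curl "http://localhost:8983/solr/dip/config/overlay"

    # The final query, URL-encoded for the shell; [3 TO *] is inclusive (3 or more links)
    curl "http://localhost:8983/solr/dip/select?q=linksCount:%5B3%20TO%20*%5D&sort=linksCount%20asc"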

Lock update parameters using Request Parameter API

  1. POST to http://localhost:8983/solr/dip/config/params with the raw JSON body of:
{
  "set":{
    "DIP_INDEX":{
      "separator":"\t",
      "rowid":"id",
      "f.links.split": "true",
      "f.links.separator": " ",
      "processor":"cloneLinksToCount,countLinks"
}}}
  2. The post command can now be (a curl version of both steps is sketched below): bin/post -c dip -params "useParams=DIP_INDEX" -type "text/csv" .../dip-data.tsv
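
A possible curl form of the same two steps, as a sketch (the data path stays elided as in the original):

    # Store the indexing parameters under the name DIP_INDEX
    curl -X POST -H 'Content-type:application/json' \
      http://localhost:8983/solr/dip/config/params -d '{
        "set": {
          "DIP_INDEX": {
            "separator": "\t",
            "rowid": "id",
            "f.links.split": "true",
            "f.links.separator": " ",
            "processor": "cloneLinksToCount,countLinks"
          }
        }
      }'

    # Reindex, referring to the stored parameter set
    bin/post -c dip -params "useParams=DIP_INDEX" -type "text/csv" .../dip-data.tsv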

Calculate average number of links (global and per position)

  1. Use the JSON Facet API
  2. We could use the json.facet parameter, but it is a bit ugly for multi-line JSON
  3. Let's use a POST body as the full request

Average, min and max number of links, globally

  1. POST to http://localhost:8983/solr/dip/select with the raw JSON body of:
{
    "params":{
        "q":"*:*",
        "rows":5
    },
    "facet":{
        "avgLinks": "avg(linksCount)",
        "maxLinks": "max(linksCount)",
        "minLinks": "min(linksCount)"
    }
}
  2. We should see 5 entries and then a facets block with an average of 3.4..., a maximum of 13, and a minimum of 1. (A curl version of this request is sketched below.)
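
One way to send that body from the command line; the same pattern works for the nested per-position request in the next section:

    # JSON Request API: query params and facet functions in one POST body
    curl -X POST -H 'Content-type:application/json' \
      http://localhost:8983/solr/dip/select -d '{
        "params": { "q": "*:*", "rows": 5 },
        "facet": {
          "avgLinks": "avg(linksCount)",
          "maxLinks": "max(linksCount)",
          "minLinks": "min(linksCount)"
        }
      }'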

Average and max number of links, broken down by the position

  1. POST to the same URL with the raw JSON body of:
{
    params:{
        q:"*:*",
        rows:1
    },
    facet:{
        by_position: {
            type: terms,
            field: position,

            facet: {
                avgLinks: "avg(linksCount)",
                maxLinks: "max(linksCount)"
            }
        }
    }
}
  2. We should get 1 record and then a facet breakdown by position, with avgLinks and maxLinks for each. Clearly, the lower (earlier) positions tend to have more links here.

Let's find another dataset to explore

  1. Our goal was to find another dataset, so let's search for "cat"
  2. We get some (4) hits for "cat", but not all of them are about felines
  3. That is because we search (as defined in solrconfig.xml) the _text_ field (notice the underscores), which is (in managed-schema) a copyField of * to _text_
  4. Let's search against just headline and text; they are text_basic via the dynamicField definition: http://localhost:8983/solr/dip/select?q=headline:cat text:cat
  5. That works, but it is getting quite long; can we specify the fields we are searching up front?
    1. Yes - the query is processed by one (or multiple) query parsers, and the 'df' parameter only sets the default field for the default one
    2. eDisMax gives us a lot more options, including fields to search with boosts
    defType=edismax
    &q=cat
    &qf=headline^10 text^5 _text_
    
    3. http://splainer.io/ (from OpenSourceConnections) is great for checking what is going on there, and it can be run against your own local instance
  6. So, that's eDisMax; Solr has another 25 or so specialized parsers which you can use standalone or in combination
  7. What about cats?
    1. This gives a different result because we are not doing any language processing - not because we cannot, but because we do not know what you want.
    2. Compare text_basic with (the default schema's) text_en, text_en_splitting, text_en_splitting_tight
    3. And that's just English; Solr has many more types, and you can compose your own - just have a look at:
      1. text_fa - demonstrating Arabic- and Persian-specific handling, and CharFilters
      2. My own definition that allows searching Thai with English phonetics, using the bundled IBM ICU4J library: thai_english
      3. My own definition to search phone numbers by suffix, with the associated presentation: phone
      4. The analysis pipeline can be different for index and query time and can include CharFilters, a Tokenizer, and TokenFilters; see my Solr Start resource website
    4. So, let's tell Solr that we know text and headline are English (text_en)
      1. We can do it via the API, but - for a non-Cloud instance - the config files are local, so we can hack a bit and copy/paste
        1. Copy field type definition
          1. from server/solr/configsets/_default/conf/managed-schema
          2. to .../server/dip/conf/managed-schema (not original configset)
          3. anywhere in the file, because Solr does not care about element order in managed-schema (it does in solrconfig.xml)
        2. Copy associated resource files as well:
          1. lang/stopwords_en.txt
          2. protwords.txt
          3. synonyms.txt
        3. Create explicit field definitions for text and headline using that field type, e.g. <field name="text" type="text_en" indexed="true" stored="true"/>
        4. Reload the core in Admin UI
        5. Reindex
      2. Rerun our searches (in Splainer, or with the curl sketch after this list) and see that we now find both plurals and non-plurals
      3. Also, the Admin UI's Analysis screen lets us see how the index and/or query analysis pipelines deal with the text
  8. So, let's pick a cute dataset to play with
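
A hedged curl sketch of the eDisMax search above, assuming text and headline have been switched to text_en as described (the query term and boosts come from the examples in this section):

    # eDisMax across headline, text, and the catch-all _text_ field, with boosts;
    # with text_en in place, 'cats' should also match documents containing 'cat'
    curl "http://localhost:8983/solr/dip/select" \
      --data-urlencode "defType=edismax" \
      --data-urlencode "q=cats" \
      --data-urlencode "qf=headline^10 text^5 _text_"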

So far we learned

  1. Getting Solr running, including with its own home directory
  2. Creating new core, with custom configuration
  3. Working with a TSV file, adding automatically-generated IDs
  4. Modifying schema and configuration using API/Admin UI
  5. Pre-processing records with Update Request Processors
  6. Indexing, basic searching, and deleting records
  7. Basic and advanced facets, including statistics

Notes

  1. Solr is an index, not a database - a lot follows from that
  2. Work backwards from search requirements
  3. Be ready to reindex
  4. Be ready to de-normalize, duplicate, and double-handle