phinde - generic web search engine

A self-hosted search engine that you can use for your static blog or just about any other website that needs search functionality.

My live instance is at http://search.cweiske.de/ and indexes my website, blog and all linked URLs.

Features

Crawler and indexer with the ability to run many in parallel
Shows and highlights text that contains search words
Boolean search queries:
- foo bar searches for foo AND bar
- foo OR bar
- title:foo searches for foo only in the page title
Facets for tag, domain, language and type
Date search:
- before:2016-08-30 - modification date before that day
- after:2016-08-30 - modified after that day
- date:2016-08-30 - exact modification day match
Site search:
- Query: foo bar site:example.org/dir/
- or use the site GET parameter: /?q=foo&site=example.org/dir
OpenSearch support with HTML and Atom result lists
Instant indexing with WebSub (formerly PubSubHubbub)
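
The query syntax and the site parameter listed above can also be exercised from the command line. The following is a minimal sketch, not part of phinde: the function name phinde_search and the PHINDE_URL and CURL variables are hypothetical, and the instance address is a placeholder. It only relies on curl's --get and --data-urlencode options to build the q and site GET parameters.

```shell
# Hypothetical helper: run a phinde search from the command line.
# PHINDE_URL is an assumed instance address; point it at your own instance.
# CURL can be overridden, e.g. to inspect the arguments without a network call.
phinde_search() {
    query=$1
    site=${2:-}
    base=${PHINDE_URL:-http://phinde.example.org}

    if [ -n "$site" ]; then
        # Restrict results to one site via the "site" GET parameter
        "${CURL:-curl}" --silent --get \
            --data-urlencode "q=$query" \
            --data-urlencode "site=$site" \
            "$base/"
    else
        "${CURL:-curl}" --silent --get \
            --data-urlencode "q=$query" \
            "$base/"
    fi
}

# Usage:
#   phinde_search 'foo OR bar'
#   phinde_search 'title:foo after:2016-08-30' 'example.org/dir/'
```

Letting curl do the URL encoding avoids hand-escaping spaces and colons in the query string.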

Dependencies

PHP 8.x
Elasticsearch 2.0
MySQL or MariaDB for WebSub subscriptions
Gearman (Debian 9: gearman-job-server, not gearman-server)
gearadmin command line tool (gearman-tools package)
PHP Gearman extension
Some PHP libraries that get installed with composer

Setup

Install and run Elasticsearch and Gearman
Install php-gearman and gearman-tools

Get a local copy of the code:

$ git clone https://git.cweiske.de/phinde.git phinde

Install dependencies via composer:
```
$ composer install --no-dev
```
Point your webserver's document root to phinde's www directory
Copy data/config.php.dist to data/config.php and adjust it. Make sure you add your domain to the crawl whitelist.
Create a MySQL database and import the schema from data/schema.sql
Run bin/setup.php which sets up the Elasticsearch schema

Put your homepage into the queue:

$ ./bin/process.php http://example.org/

Start at least one worker to process the crawl+index queue:
```
$ ./bin/phinde-worker.php
```
Check phinde's status page in your browser. Both the number of open tasks and the number of workers should be greater than 0.

Re-index when your site changes

When your site changes, the search engine needs to re-crawl and re-index the affected pages.

Simply tell phinde that something changed by running:

$ ./bin/process.php http://example.org/foo.htm

phinde supports HTML pages and Atom feeds, so if your blog has a feed it's enough to let phinde reindex that one. It will find all linked pages automatically.
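
If several pages change at once, for example after a deployment, the command above can be wrapped in a loop. This is a sketch under assumptions: the function name phinde_queue_urls and the PHINDE_PROCESS variable are made up for illustration, and it assumes it is run from the phinde checkout so that ./bin/process.php resolves.

```shell
# Hypothetical helper: queue every URL read from standard input,
# one URL per line. Blank lines and lines starting with "#" are skipped.
phinde_queue_urls() {
    process=${PHINDE_PROCESS:-./bin/process.php}

    # "|| [ -n "$url" ]" also handles a last line without a trailing newline
    while IFS= read -r url || [ -n "$url" ]; do
        case $url in
            ''|'#'*) continue ;;
        esac
        # </dev/null keeps the command from consuming the URL list
        "$process" "$url" </dev/null || echo "failed to queue: $url" >&2
    done
}

# Usage:
#   phinde_queue_urls < changed-urls.txt
#   echo 'http://example.org/feed.atom' | phinde_queue_urls
```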

Website integration

Adding a simple search form to your website is easy. It needs two things:

A <form> tag with an action that points to the phinde instance
A search text field with the name q

Example:

<form method="get" action="http://phinde.example.org">
  <input type="text" name="q" placeholder="Search text"/>
  <button type="submit">Search</button>
</form>

System service

When using systemd, you can let it run multiple worker instances when the system boots up:

Copy files data/systemd/phinde*.service into /etc/systemd/system/
Adjust user and group names, and the work directories

Enable three worker processes:

$ systemctl daemon-reload
$ systemctl enable phinde@1
$ systemctl enable phinde@2
$ systemctl enable phinde@3
$ systemctl enable phinde
$ systemctl start phinde

Now three workers are running. Restarting the phinde service also restarts the workers.
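
The same steps can be scripted for a different number of workers. The sketch below is not shipped with phinde: the function name phinde_enable_workers and the SYSTEMCTL variable are hypothetical, and it assumes the unit files from data/systemd/ treat any instance number the same way as 1 to 3.

```shell
# Hypothetical helper: enable COUNT worker instances (default 3),
# then enable and start the main phinde service.
# Set SYSTEMCTL=echo for a dry run that only prints the commands.
phinde_enable_workers() {
    count=${1:-3}
    systemctl=${SYSTEMCTL:-systemctl}

    "$systemctl" daemon-reload || return 1

    i=1
    while [ "$i" -le "$count" ]; do
        "$systemctl" enable "phinde@$i" || return 1
        i=$((i + 1))
    done

    "$systemctl" enable phinde && "$systemctl" start phinde
}

# Usage (as root):
#   phinde_enable_workers 5
```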

Cron job

Run bin/renew-subscriptions.php once a day with cron. It will renew the WebSub subscriptions.
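
One possible way to set this up is sketched below. The function name phinde_cron_entry, the default installation path /var/www/phinde and the time of day (04:17) are assumptions chosen for illustration; the function only prints a crontab line and does not change anything by itself.

```shell
# Hypothetical helper: print a crontab line that renews the WebSub
# subscriptions once a day at 04:17.
# The argument is the assumed path of the phinde checkout.
phinde_cron_entry() {
    dir=${1:-/var/www/phinde}
    printf '17 4 * * * cd %s && php bin/renew-subscriptions.php\n' "$dir"
}

# Usage: append the line to the current user's crontab
#   ( crontab -l 2>/dev/null; phinde_cron_entry /var/www/phinde ) | crontab -
```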

Howto

Delete index data from one domain:

$ curl -iv -XDELETE -H 'Content-Type: application/json' -d '{"query":{"term":{"domain":"example.org"}}}' http://127.0.0.1:9200/phinde/_query

This uses the delete-by-query plugin API of Elasticsearch 2.0; see https://www.elastic.co/guide/en/elasticsearch/plugins/2.0/delete-by-query-usage.html
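
To avoid editing the JSON by hand for each domain, the curl call can be wrapped in a small function. This is a sketch only: phinde_delete_domain, ES_URL and CURL are hypothetical names, and the validation assumes that domain values consist of letters, digits, dots, hyphens and an optional port.

```shell
# Hypothetical helper: delete all index data of one domain.
# ES_URL is the assumed Elasticsearch address; CURL can be overridden.
phinde_delete_domain() {
    domain=$1
    es=${ES_URL:-http://127.0.0.1:9200}

    # Refuse anything that is not a plain host name, so that the value
    # cannot break out of the JSON string below.
    case $domain in
        ''|*[!A-Za-z0-9.:-]*)
            echo "invalid domain: $domain" >&2
            return 1
            ;;
    esac

    "${CURL:-curl}" -XDELETE \
        -H 'Content-Type: application/json' \
        -d "{\"query\":{\"term\":{\"domain\":\"$domain\"}}}" \
        "$es/phinde/_query"
}

# Usage:
#   phinde_delete_domain example.org
```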

Subscribe to a website/feed

Phinde supports WebSub to subscribe to changes of a website. When phinde gets notified by the website's hub about changes, it will immediately crawl and index the changed pages.

Subscribe to a website's feed:

$ php bin/subscribe.php http://example.org/feed.atom

Phinde will determine the website's hub and send a registration request to it.

The status page shows the number of working subscriptions and the number of open ones.

Unsubscribing also happens on command line:

$ php bin/unsubscribe.php http://example.org/feed.atom

About phinde

Source code

phinde's source code is available from http://git.cweiske.de/phinde.git or the mirror on GitHub.

License

phinde is licensed under the AGPL v3 or later.

Author

phinde was written by Christian Weiske.
