Running the Searx metasearch engine on OpenBSD

Searx is a free metasearch engine: it aggregates search results from several search engines, such as Bing, DuckDuckGo, Google or Qwant. It also queries services such as DailyMotion, DeviantArt, FramaLibre, GitHub, Reddit or Wikipedia to extract search results. For more information, have a look at the searx online documentation.

It also removes cookies and generates a random profile for each request you make. This is a step toward privacy.

This software is written in Python, which means a self-hosted instance can run on any OS that supports it. And guess what, OpenBSD does provide a Python experience. So let’s run a self-hosted instance of searx on an OpenBSD VPS.

Note that this was done on OpenBSD 7.0/amd64.

Configure OpenBSD

OpenBSD is installed and configured as usual.

Then, create a login class and a user to run the searx software:

# vi /etc/login.conf
(...)
searx:\
	:openfiles=1024:\
	:tc=daemon:
# cap_mkdb /etc/login.conf

# useradd -g =uid -c "searx metasearch engine" \
  -L searx -s /bin/ksh -d /home/searx -m _searx

The default semaphore limits seem a bit low on OpenBSD. With them, uwsgi complains:

uwsgi: uwsgi_lock_ipcsem_init()/semget(): No space left on device  \
       [core/lock.c line 519]
uwsgi: uwsgi_ipcsem_clear()/semctl(): Invalid argument             \
       [core/lock.c line 643]

Raising the values is done using sysctl.

# vi /etc/sysctl.conf
(...)
kern.seminfo.semmni=20
kern.seminfo.semmns=120
kern.seminfo.semmnu=60
kern.seminfo.semmsl=120
kern.seminfo.semopm=200

openbsd# grep -v "^#" /etc/sysctl.conf | xargs sysctl -w

Running ipcs(1) later on shows 18 semaphores in use.
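
To confirm the new limits are really in place, something like the following could be used. This is only a sketch: check_seminfo is a made-up helper name, and it assumes sysctl prints one "name=value" pair per line. The minimum values are the ones from the sysctl.conf lines above.

```sh
# Minimum values, copied from the sysctl.conf lines above.
wanted_seminfo="semmni=20 semmns=120 semmnu=60 semmsl=120 semopm=200"

# check_seminfo (hypothetical helper): read "name=value" lines on stdin
# and print every kern.seminfo limit that is below the wanted minimum.
check_seminfo() {
	awk -F= -v wanted="$wanted_seminfo" '
	BEGIN {
		n = split(wanted, pairs, " ")
		for (i = 1; i <= n; i++) {
			split(pairs[i], kv, "=")
			min["kern.seminfo." kv[1]] = kv[2] + 0
		}
	}
	($1 in min) && ($2 + 0 < min[$1]) {
		printf "%s is %d, want at least %d\n", $1, $2, min[$1]
	}'
}

# On the OpenBSD host:
#   sysctl kern.seminfo | check_seminfo
```

No output means every limit is at least as high as requested.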

Finally, install Python pre-requisites:

# pkg_add git python%3.9 libxslt

Install Searx

There is a Linux-centered step-by-step installation documentation. Installation on OpenBSD does not differ much.

Switch to the searx user and download the source:

# su - _searx
$ git clone -b v1.0.0 https://github.com/searx/searx src

Create a Python virtual environment. This keeps OpenBSD clean and provides Python packages that are not available in ports:

$ python3.9 -m venv pyenv
$ echo ". ~/pyenv/bin/activate" >> ~/.profile

Exit the user session and log back in, just to ensure the Python venv gets activated properly.

$ ^D
# su - _searx
(pyenv) openbsd$ command -v python && python --version
/home/searx/pyenv/bin/python
Python 3.9.7

Install searx’s dependencies:

(pyenv) openbsd$ pip install -U pip
(...)
Successfully installed pip-21.3

(pyenv) openbsd$ pip install -U setuptools
(...)
Successfully installed setuptools-58.2.0

(pyenv) openbsd$ pip install -U wheel
(...)
Successfully installed wheel-0.37.0

(pyenv) openbsd$ pip install -U pyyaml
(...)
Successfully built pyyaml
Installing collected packages: pyyaml
Successfully installed pyyaml-6.0

Now install searx itself:

(pyenv) openbsd$ cd ~/src
(pyenv) openbsd$ pip install -e .
(...)
Successfully installed MarkupSafe-2.0.1 PySocks-1.7.1 Werkzeug-2.0.2
babel-2.9.0 certifi-2020.12.5 chardet-4.0.0 click-8.0.3 flask-1.1.2
flask-babel-2.0.0 idna-2.10 itsdangerous-2.0.1 jinja2-2.11.3 langdetect-1.0.8
lxml-4.6.3 pygments-2.8.0 python-dateutil-2.8.1 pytz-2021.3 pyyaml-5.4.1
requests-2.25.1 searx-1.0.0 six-1.16.0 urllib3-1.26.7

The searx configuration is done via a YAML file. Let’s have it outside the source tree so that it’s not overwritten on updates:

(pyenv) openbsd$ sed -e "s/ultrasecretkey/`openssl rand -hex 16`/g"          \
                 ~/src/searx/settings.yml > ~/settings.yml
(pyenv) openbsd$ echo 'export SEARX_SETTINGS_PATH="/home/searx/settings.yml"'\
                 >> ~/.profile
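
Before starting the application, it may be worth checking that the placeholder secret is really gone from the copied file. A minimal sketch, where secret_replaced is a made-up helper name and the only assumption is that the placeholder string is "ultrasecretkey", as in the sed command above:

```sh
# secret_replaced (hypothetical helper): succeed only if the file is
# readable and no longer contains the placeholder secret.
secret_replaced() {
	[ -r "$1" ] && ! grep -q 'ultrasecretkey' "$1"
}

# As the _searx user:
#   secret_replaced ~/settings.yml && echo "secret_key looks customised"
```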

With all this done, we can now start the Python application:

(pyenv) openbsd$ . ~/.profile
(pyenv) openbsd$ python ~/src/searx/webapp.py
* Serving Flask app "webapp" (lazy loading)
* Environment: production
WARNING: This is a development server. Do not use it in a production deployment.
Use a production WSGI server instead.
* Debug mode: off
INFO:werkzeug: * Running on http://127.0.0.1:8888/ (Press CTRL+C to quit)
^C

By default, the app listens for HTTP requests on http://127.0.0.1:8888/. One could access it this way, but for security and availability reasons, it is recommended to run it behind a WSGI server and a reverse proxy. Look at the recommended architecture details in the online documentation.

Set up Python uwsgi

In the Python venv, install the uwsgi package:

(pyenv) openbsd$ pip install uwsgi
(...)
Successfully built uwsgi
Installing collected packages: uwsgi
Successfully installed uwsgi-2.0.20

Create a configuration file:

(pyenv) openbsd$ cat > ~/uwsgi.ini
[uwsgi]
# Who will run the code
uid=_searx
gid=_searx

# Number of workers (usually CPU count)
workers = 2

# set (python) default encoding UTF-8
env = LANG=C.UTF-8
env = LANGUAGE=C.UTF-8
env = LC_ALL=C.UTF-8

# chdir to specified directory before apps loading
chdir=/home/searx/src/searx

# searx configuration (settings.yml)
env = SEARX_SETTINGS_PATH=/home/searx/settings.yml

http = 127.0.0.1:8888
chmod-socket = 666

# Plugin to use and interpreter config
single-interpreter = true
master = true
plugin = python3,http
lazy-apps = true
enable-threads = true

# Module to import
module = searx.webapp

# Virtualenv and python path
pythonpath = /home/searx/src
virtualenv =/home/searx/pyenv

# Request logging is kept on here; set to true to disable it for privacy
disable-logging = false

# No keep alive
# See https://github.com/searx/searx-docker/issues/24
#add-header = Connection: close

# uwsgi serves the static files
# expires set to one day as Flask does
#static-map = /static=/home/searx/src/searx/static
#static-expires = /* 86400
static-gzip-all = True
#offload-threads = %k

# Cache
cache2 = name=searxcache,items=2000,blocks=2000,blocksize=4096,bitmap=1

# Bibliography
# ./src/dockerfiles/uwsgi.ini
# ./src/utils/templates/etc/uwsgi/apps-available/searx.ini
# https://uwsgi-docs.readthedocs.io/en/latest/HTTP.html

Check if uwsgi starts properly using:

(pyenv) openbsd$ /home/searx/pyenv/bin/uwsgi --ini /home/searx/uwsgi.ini

The daemon can be reached and tested using an SSH tunnel and a local Web browser:

# ssh -L 8888:localhost:8888 jca@openbsd

If that works, stop the daemon and quit the _searx user environment.

Write an OpenBSD starting script

An rc.d script will allow starting the searx/uwsgi service automatically:

openbsd# cat > /etc/rc.d/searx
#!/bin/ksh
#
# starting searx via uwsgi

daemon="uwsgi"
daemon_user="_searx"
daemon_group="_searx"
daemon_pid="/home/searx/searx.pid"
daemon_ini="/home/searx/uwsgi.ini"
daemon_flags="--ini ${daemon_ini} --pidfile ${daemon_pid} \
--ftok ${daemon_pid} --mime-file /usr/share/misc/mime.types \
--thunder-lock --log-syslog"

. /etc/rc.d/rc.subr

rc_reload=NO
rc_bg=YES

rc_start() {
        ${rcexec} ". ~/.profile ; ${daemon} ${daemon_flags}"
}

rc_stop() {
        ${rcexec} ". ~/.profile ; ${daemon} --stop ${daemon_pid}"
}

rc_cmd $1

#EOF

openbsd# chmod 0555 /etc/rc.d/searx
openbsd# rcctl enable searx
openbsd# rcctl start searx
searx(ok)
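
Besides rcctl check searx, the pid file written by uwsgi gives a quick way to see whether the master process is still alive. A minimal sketch: pid_alive is a made-up helper name, and the pid file path is the one set in the rc.d script above.

```sh
# pid_alive (hypothetical helper): return 0 when the PID stored in the
# given pid file belongs to a running process, non-zero otherwise.
pid_alive() {
	[ -r "$1" ] || return 1
	pid=$(cat "$1")
	[ -n "$pid" ] && kill -0 "$pid" 2>/dev/null
}

# On the OpenBSD host, as root or _searx (kill -0 needs permission
# to signal the process):
#   pid_alive /home/searx/searx.pid && echo "uwsgi master is running"
```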

Expose searx via relayd(8)

The searx documentation recommends using the filtron reverse proxy to protect your searx instance. After looking at the source code and configuration, I decided to apply the protection using relayd(8) and pf(4).

Configuration of TLS is not covered here. Read acme-client(1) and acme-client.conf(5) for more details.

# vi /etc/relayd.conf
(...)
table <searx> { 127.0.0.1 }
(...)
http protocol https_protocol {
	(...)
	block
	(...)
	include "/etc/relayd.conf.cache"
	include "/etc/relayd.conf.bad-user-agent"

	block request quick header "Accept-Language" value ""
	(...)
	pass request quick header "Host" value "searx.openbsd.local" forward to <searx>
}

relay https_relay {
	(...)
	protocol https_protocol
	(...)
	forward to <searx> port 8888
}

# rcctl enable relayd
# rcctl start relayd

Not providing an Accept-Language header or trying to connect using the wrong FQDN makes relayd drop the HTTPS connection.

The “bad-user-agent” section blocks HTTP requests whose User-Agent value is known to be used by bots. You can get the updated list from the filtron sources. As the User-Agent value can be forged, this only offers limited protection for your instance.

# cat /etc/relayd.conf.bad-user-agent
(...)
block request quick header "User-Agent" value "YandexMobileBot/*"
block request quick header "User-Agent" value "YandexBot/*"
block request quick header "User-Agent" value "Yahoo! Slurp/*"
block request quick header "User-Agent" value "MJ12bot/*"
(...)
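
Writing those lines by hand gets tedious. If the bot names are first collected in a plain text file, one pattern per line, a small filter can generate the rules. This is only a sketch: ua_to_relayd and bad-bots.txt are made-up names, and the one-pattern-per-line input format is an assumption (it is not how filtron stores its rules). Patterns containing a double quote would need escaping first.

```sh
# ua_to_relayd (hypothetical helper): read one User-Agent pattern per
# line on stdin and print the matching relayd "block" rules.
# Empty lines and lines starting with # are skipped.
ua_to_relayd() {
	while IFS= read -r ua; do
		case "$ua" in
		''|'#'*) continue ;;
		esac
		printf 'block request quick header "User-Agent" value "%s"\n' "$ua"
	done
}

# Example (bad-bots.txt is a hypothetical file):
#   ua_to_relayd < bad-bots.txt > /etc/relayd.conf.bad-user-agent
```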

The “cache” block is just a way to force a Cache-Control header on static Web assets:

# cat /etc/relayd.conf.cache
match request path "*.css" tag "caching"
match request path "*.js"  tag "caching"
match request path "*.gif" tag "caching"
match request path "*.jpg" tag "caching"
match request path "*.png" tag "caching"
match request path "*.svg" tag "caching"

match response tagged "caching" header set "Cache-Control" value "max-age=86400"

Configuring pf(4)

Another set of filtron rules limits access based on the source IP address. This can be approximated using pf(4).

(...)
block quick from <bad_hosts>
pass in on egress inet proto tcp                \
        from any to (egress) port { http https }\
        modulate state                          \
        (source-track rule,                     \
        max-src-conn-rate 100/5,                \
        overload <bad_hosts> flush global)      \
        label "Web Access"

One thing to remember is that pf(4) knows about TCP sessions, not HTTP sessions. Furthermore, I noticed that relayd(8) adds a “Connection: close” HTTP header to the response from uwsgi. This means that a well-behaved Web agent closes the TCP session after getting each object, so fetching a Web page with 2 CSS files and 3 images takes 6 TCP connections. This seems weird to me because it would mean relayd(8) does not provide proper HTTP/1.1 support. I’m pretty sure the devs have done their homework properly, so I must have misunderstood or misconfigured something.

Anyway, this is why I’m not using max-src-conn in the source-track rule and why max-src-conn-rate is so ‘high’. With max-src-conn set to 50 and max-src-conn-rate set to 20/5, I was constantly being blacklisted.
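
To put rough numbers on this, here is the arithmetic as a tiny sketch. pages_per_window is a made-up helper name, and the 6 connections per page is just the example figure from above; a real result page may well fetch more objects.

```sh
# pages_per_window LIMIT CONNS_PER_PAGE (hypothetical helper):
# how many full page loads fit in one max-src-conn-rate window.
pages_per_window() {
	echo $(( $1 / $2 ))
}

pages_per_window 20 6     # rate 20/5:  prints 3
pages_per_window 100 6    # rate 100/5: prints 16
```

With 20/5, a fourth page load within the same 5 seconds would already get the source address overloaded into the bad_hosts table.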

To identify a “usual” max-src-conn-rate, I used the following command while running a few searches:

# while true; do clear ; pfctl -s Sources; sleep 0.5; done
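
Rather than watching the screen, the peak rate per source can be extracted from the samples. This is a sketch under one loud assumption: that each source line of pfctl -s Sources looks like "203.0.113.7 -> 0.0.0.0 ( states 4, connections 4, rate 2.3/5s )". Check the output on your own system first. peak_rates is a made-up helper name.

```sh
# peak_rates (hypothetical helper): read repeated "pfctl -s Sources"
# output on stdin and print the highest rate seen for each source.
# ASSUMPTION: lines look like
#   ADDR -> ADDR ( states N, connections N, rate X.Y/Zs )
peak_rates() {
	awk '
	/ -> / {
		for (i = 1; i < NF; i++) {
			if ($i == "rate") {
				split($(i + 1), r, "/")
				if (!($1 in peak) || r[1] + 0 > peak[$1] + 0)
					peak[$1] = r[1]
			}
		}
	}
	END {
		for (src in peak)
			printf "%s %s\n", src, peak[src]
	}' | sort
}

# On the OpenBSD host, sampling for about one minute:
#   i=0; while [ $i -lt 120 ]; do pfctl -s Sources; sleep 0.5; \
#       i=$((i + 1)); done | peak_rates
```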

Using searx and various thoughts

So far, I have not found a way to use searx as a default search engine on my iOS devices. So I just set a Bookmark on the dock bar to open a searx tab in one tap.

On Firefox ESR v91, I could use the opensearch feature of the searx template. This allows using my searx instance from the search bar. This also seems to work with Chromium v93.

All in all, searx provides decent results. I even ended up getting results never shown by Google or DuckDuckGo when searching for specific OpenBSD stuff. If you don’t want to run your own instance, you can browse https://searx.space/ and pick one that you trust.

I will probably look at making a specific theme for my searx instance. I’m not entirely convinced by the mobile look of the two default themes.

I will also have to dig a bit more into that “Connection” header from relayd. Finding the correct setting should enable better limitation and control of the HTTP sessions from the pf(4) states. Or I just don’t know what I’m talking about :p

PS: Thanks to Hakan E. Duran for the feedback.