Typesense DocSearch scraper

This is a maintained fork of Algolia's awesome DocSearch Scraper, customized to index data in Typesense.

You'd typically set up this scraper to run on your documentation site, and then use typesense-docsearch.js to add a search bar to your site.

What is Typesense?

If you're new to Typesense, it is an open source search engine that is simple to use, run and scale, with clean APIs and documentation.

Think of it as an open source alternative to Algolia and an easier-to-use, batteries-included alternative to Elasticsearch. Get a quick overview from this guide.

Usage

Read detailed step-by-step instructions on how to configure and set up the scraper on Typesense's dedicated documentation site: https://typesense.org/docs/guide/docsearch.html

Changelog

We use git tags to identify every release.

So to view the changelog for a release, you can compare tags using a GitHub link like this:

https://github.com/typesense/typesense-docsearch-scraper/compare/0.8.0...0.9.0

Remember to change the version numbers in the URL as needed.
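As a small convenience, the compare link can be generated from any two tags. This is only a sketch: the `compare_url` helper is a hypothetical name and not part of this repository. It simply fills the two version numbers into the URL pattern shown above.

```shell
# Hypothetical helper (not part of this repo): print the GitHub compare URL
# for two release tags, following the URL pattern shown above.
compare_url() {
  printf 'https://github.com/typesense/typesense-docsearch-scraper/compare/%s...%s\n' "$1" "$2"
}

compare_url 0.8.0 0.9.0
```

If you have a local clone with the tags fetched, `git log --oneline 0.8.0..0.9.0` lists the same range of commits in the terminal.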

Compatibility

| typesense-docsearch-scraper | typesense-server |
|-----------------------------|------------------|
| 0.5.0                       | >= 0.22.1        |
| 0.4.x and below             | >= 0.21.0        |
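To check a server version against one of the minimums listed above, a version-aware comparison is needed, since plain string comparison would rank 0.9.0 above 0.22.1. The following is a minimal sketch, assuming a `sort` that supports `-V` (for example GNU coreutils). The `version_ge` helper and the `SERVER_VERSION` variable are hypothetical names used only for illustration.

```shell
# Hypothetical helper: succeeds when version $1 is greater than or equal to version $2.
# Assumes `sort -V` (version sort) is available, as in GNU coreutils.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example: scraper 0.5.0 needs typesense-server >= 0.22.1 (see the table above).
SERVER_VERSION="0.22.1"   # assumption: set this to the version your server reports
if version_ge "$SERVER_VERSION" "0.22.1"; then
  echo "typesense-server $SERVER_VERSION meets the minimum for scraper 0.5.0"
else
  echo "typesense-server $SERVER_VERSION is too old for scraper 0.5.0" >&2
fi
```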

Development Workflow

This section only applies if you're making changes to the scraper itself. If you only need to run the scraper, see the Usage instructions above.

Releasing a new version

Basic/abbreviated instructions:

$ pipenv shell
$ ./docsearch docker:build
$ git tag -a 0.2.1 -m "0.2.1"
$ ./docsearch deploy:scraper
$ git push --follow-tags

Detailed instructions starting from a fresh Ubuntu Server 22.04:

# Install Docker:
# https://docs.docker.com/engine/install/ubuntu/
sudo apt update
sudo apt remove docker docker-engine docker.io containerd runc --yes
sudo apt install \
    ca-certificates \
    curl \
    gnupg \
    lsb-release \
    --yes
sudo mkdir -m 0755 -p /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
  $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt update
sudo apt install \
  docker-ce \
  docker-ce-cli \
  containerd.io \
  docker-buildx-plugin \
  docker-compose-plugin \
  --yes
sudo docker run hello-world

# Run Docker as a non-root user:
# https://www.digitalocean.com/community/questions/how-to-fix-docker-got-permission-denied-while-trying-to-connect-to-the-docker-daemon-socket
sudo usermod -aG docker ${USER}
exit
# (Relogin.)
docker run hello-world

# Install dependencies for pyenv:
# https://github.com/pyenv/pyenv/wiki#suggested-build-environment
sudo apt update
sudo apt install \
  build-essential \
  curl \
  libbz2-dev \
  libffi-dev \
  liblzma-dev \
  libncursesw5-dev \
  libreadline-dev \
  libsqlite3-dev \
  libssl-dev \
  libxml2-dev \
  libxmlsec1-dev \
  llvm \
  make \
  tk-dev \
  wget \
  xz-utils \
  zlib1g-dev \
  --yes

# Install pyenv:
# https://github.com/pyenv/pyenv#automatic-installer
curl https://pyenv.run | bash

# Add pyenv to path:
echo >> ~/.bashrc
echo '# Adding pyenv' >> ~/.bashrc
echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.bashrc
echo 'command -v pyenv >/dev/null || export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.bashrc
echo 'eval "$(pyenv init -)"' >> ~/.bashrc
source ~/.bashrc

# Install Python 3.11 inside pyenv:
pyenv install 3.11

# Set the active version of Python:
pyenv local 3.11

# Upgrade pip:
pip install --upgrade pip

# Install pipenv:
pip install --user pipenv

# There will be a warning:
# "The script virtualenv-clone is installed in '/home/[username]/.local/bin' which is not on PATH."
# Fix the warning by adding it to the PATH:
echo >> ~/.bashrc
echo '# Fixing pip warning' >> ~/.bashrc
echo 'PATH=$PATH:~/.local/bin' >> ~/.bashrc
source ~/.bashrc

# Ensure that you are in the "typesense-docsearch-scraper" directory.
# Then, install the Python dependencies for this project:
pipenv --python 3.11
pipenv lock --clear
pipenv install

# Then, open a shell with the Python environment:
pipenv shell

# Enable containerd image store in Docker Engine: https://docs.docker.com/engine/storage/containerd/
# This allows building the cross-platform images below.
# Add the following to
# /etc/docker/daemon.json
# {
#  "features": {
#     "containerd-snapshotter": true
#  }
# }
# sudo systemctl restart docker

# The following should mention containerd. If it does not, follow the instructions above.
docker info -f '{{ .DriverStatus }}'

# Build a new version of the base Docker container - ONLY NEEDED WHEN WE CHANGE DEPENDENCIES
export SCRAPER_BASE_VERSION="0.9.0" # Only need to change this when we update dependencies
docker buildx use typesense-builder || docker buildx create --name typesense-builder --driver docker-container --use --bootstrap # use same buildx context for all containers to build
docker buildx build --platform linux/amd64,linux/arm64 --load -f ./scraper/dev/docker/Dockerfile.base -t typesense/docsearch-scraper-base:${SCRAPER_BASE_VERSION} .
docker push typesense/docsearch-scraper-base:${SCRAPER_BASE_VERSION}
docker tag typesense/docsearch-scraper-base:${SCRAPER_BASE_VERSION} typesense/docsearch-scraper-base:latest
docker push typesense/docsearch-scraper-base:latest

# Build a new version of the scraper Docker container
export SCRAPER_VERSION="0.11.0.rc1"
export SCRAPER_BASE_VERSION="latest"
docker buildx use typesense-builder || docker buildx create --name typesense-builder --driver docker-container --use --bootstrap # use same buildx context for all containers to build
docker buildx build --platform linux/amd64,linux/arm64 --load -f ./scraper/dev/docker/Dockerfile --build-arg SCRAPER_BASE_VERSION=${SCRAPER_BASE_VERSION} -t typesense/docsearch-scraper:${SCRAPER_VERSION} .
docker push typesense/docsearch-scraper:${SCRAPER_VERSION}
docker tag typesense/docsearch-scraper:${SCRAPER_VERSION} typesense/docsearch-scraper:latest
docker push typesense/docsearch-scraper:latest

# Add a new Git tag.
git tag -a "${SCRAPER_VERSION}" -m "${SCRAPER_VERSION}"

# Sync with GitHub.
git push --follow-tags
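The containerd check in the detailed instructions above is a manual, eyeball step. If you script the release, it can be turned into a guard that stops before the multi-platform builds. This is a sketch only: `uses_containerd_store` is a hypothetical helper, and it does nothing more than look for the word "containerd" in whatever text it is given.

```shell
# Hypothetical helper: succeeds when the given driver-status text mentions containerd.
# It only inspects its argument, so it can be tried without a running Docker daemon.
uses_containerd_store() {
  case "$1" in
    *containerd*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage (requires a running Docker daemon):
#   uses_containerd_store "$(docker info -f '{{ .DriverStatus }}')" || {
#     echo "Enable the containerd image store before building." >&2
#     exit 1
#   }
```

Keeping the check as a function of the text, rather than calling `docker info` inside it, makes the guard easy to try on a machine where Docker is not installed.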

Help

If you have any questions or run into any problems, please create a GitHub issue and we'll try our best to help.