Sosse

Selenium-based web crawler, archiver, and search engine

386 stars
21 forks
Last commit: 2mo ago
Repo age: 4y
Sosse is an open-source search engine and web crawler that indexes, archives, and monitors web pages, including JavaScript-heavy sites, by rendering them in a browser. It combines full-page archiving with flexible crawling policies and search for private or organizational use.

Key Features

  • Index and search web page content, including dynamically rendered pages via browser automation
  • Recurring and scheduled crawling with adaptive policies and queue management
  • Pixel-perfect archiving: preserve HTML and assets, rewrite links for local/offline viewing
  • Tagging and metadata support for organizing and filtering archived content
  • Batch file downloads and content deduplication for large-scale collection
  • Webhooks and RESTful API for integrations, automated processing, and AI-driven workflows
  • Atom feed generation and change detection for pages without feeds
  • Authentication and permission controls for accessing and searching private resources
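Change detection and content deduplication, both listed above, are commonly built on content fingerprints. The following is a minimal sketch of that general idea, not Sosse's actual implementation; the function names `content_fingerprint` and `has_changed` are hypothetical, and the regex-based tag stripping is a simplification where a real crawler would use an HTML parser.

```python
import hashlib
import re
from typing import Optional


def content_fingerprint(html: str) -> str:
    """Return a SHA-256 digest of the page text with whitespace normalized.

    Hypothetical helper for illustration only; not part of Sosse.
    """
    # Crudely drop tags so markup-only changes do not alter the fingerprint.
    text = re.sub(r"<[^>]+>", " ", html)
    # Collapse runs of whitespace so formatting differences are ignored.
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def has_changed(previous_fingerprint: Optional[str], html: str) -> bool:
    """Report whether the page text differs from the stored fingerprint."""
    return previous_fingerprint != content_fingerprint(html)
```

Two fetches of a page with the same fingerprint can be treated as duplicates, while a differing fingerprint on a later crawl is a signal that could drive a feed entry or a webhook notification.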

Use Cases

  • Institutional web archiving and long-term preservation of web pages and assets
  • Internal site and document indexing for enterprise search and knowledge discovery
  • Continuous monitoring and competitive analysis with automated alerts and exports

Limitations and Considerations

  • Browser-based crawling (Selenium + headless browsers) increases resource usage and operational complexity compared to pure HTTP crawlers
  • Requires browser binaries and drivers plus a production database (PostgreSQL) for scalable deployments
  • Designed as a general-purpose crawler/search stack; very large-scale deployments may require additional tuning, infrastructure, and queue scaling strategies

Sosse is well suited for teams needing accurate rendering and archival fidelity for dynamic sites, combined with search and automation capabilities. It is distributed under a strong copyleft license and is commonly deployed using containerized images for evaluation and production.

Similar Services

Meilisearch

Fast search engine API with full-text, vector, and hybrid search

55.4k stars
2.3k forks
Last commit: 2d ago

Meilisearch is a lightning-fast search engine API for apps and websites, offering typo-tolerant full-text search plus vector and AI-ready hybrid retrieval.

Alternative to: Algolia (+16 more)

ArchiveBox

Open-source self-hosted web archiving and snapshotting tool

26.4k stars
1.4k forks
Last commit: 11d ago

Self-hosted tool to collect and preserve webpages, media, and bookmarks in durable formats (HTML, PDF, WARC, MP4) with a CLI, web UI, and search.

Alternative to: Internet Archive Wayback Machine (+3 more)

Typesense

Fast, typo-tolerant search engine with keyword and vector search

25k stars
850 forks
Last commit: 2d ago

Typesense is a developer-friendly search engine for instant, typo-tolerant search-as-you-type with faceting, filtering, geo search, and vector/semantic search APIs.

Alternative to: Algolia (+19 more)

SearXNG

Privacy-focused metasearch engine for aggregating web results

24.2k stars
2.4k forks
Last commit: 22h ago

SearXNG is a privacy-respecting metasearch engine that aggregates results from many search services without tracking or profiling users.

Alternative to: Google Search (+6 more)

ZincSearch

A lightweight open-source search engine for full-text indexing.

17.7k stars
762 forks
Last commit: 1mo ago

ZincSearch is a Go-based, lightweight search engine for full-text indexing with Elasticsearch API-compatible ingestion, a Vue UI, and a schema-less document model.

Alternative to: Elastic Cloud (Elasticsearch Service) (+7 more)

Onyx Community Edition

Self-hosted AI chat and enterprise search for any LLM

17.1k stars
2.3k forks
Last commit: 16h ago

Open-source platform for AI chat, RAG, agents, and enterprise search across your team’s connected knowledge sources, compatible with hosted and local LLMs.

Alternative to: Onyx (+19 more)