ArchiveBox

Open-source self-hosted web archiving and snapshotting tool

26.9k stars
1.5k forks
Last commit: 1d ago
Repo age: 9y old

ArchiveBox is a self-hosted, open-source web archiving application that captures and preserves web pages and associated media in durable formats for long-term access. It can ingest URLs, browser history, bookmarks, RSS feeds, and other sources, and it produces redundant snapshot outputs for offline viewing and analysis.

Key Features

  • Multiple import sources: URLs, browser history, bookmarks, Pocket/Pinboard, RSS and more.
  • Saves snapshots in redundant, portable formats: original HTML+CSS+JS, SingleFile HTML, screenshot PNG, PDF, WARC, JSON, MP3/MP4, and a SQLite index.
  • Web UI + CLI + Python API: manage collections via a self-hosted web app, a command-line interface, or the Python library.
  • Search & indexing options: SQLite FTS or external search backends (e.g., Sonic) for fast full-text queries.
  • Extensible extractors: integrates with standard tools (Chromium/Chrome, yt-dlp, SingleFile, Readability), and each optional extractor can be enabled or disabled through configuration.

Use Cases

  • Journalists and researchers preserving cited pages and social media posts for reproducibility and evidence.
  • Legal and compliance teams capturing time-stamped snapshots for records and audits.
  • Individuals or organizations creating offline archives of bookmarks, blogs, or multimedia collections.

Limitations and Considerations

  • Storage and disk usage can grow quickly (especially when archiving video/audio); careful tuning of extractor settings and filesystem choice is recommended.
  • Several high-fidelity extractors rely on external system packages (Chromium/Chrome, Node, ffmpeg, yt-dlp); installing the full feature set requires additional runtime dependencies.

ArchiveBox is intended for users who need durable, self-hosted preservation of web content. It provides multiple interfaces (web UI, CLI, and Python API) and redundant, portable output formats to support long-term access and programmatic workflows.

Similar Services

Meilisearch

Fast search engine API with full-text, vector, and hybrid search

56.1k
2.4k
Last commit: 13h ago

Meilisearch is a lightning-fast search engine API for apps and websites, offering typo-tolerant full-text search plus vector and AI-ready hybrid retrieval.

Alternative to:
Algolia
+16

SearXNG

Privacy-focused metasearch engine for aggregating web results

25.3k
2.5k
Last commit: 5d ago

SearXNG is a privacy-respecting metasearch engine that aggregates results from many search services without tracking or profiling users.

Alternative to:
Google Search
+6

Typesense

Fast, typo-tolerant search engine with keyword and vector search

25.3k
861
Last commit: 1d ago

Typesense is a developer-friendly search engine for instant, typo-tolerant search-as-you-type with faceting, filtering, geo search, and vector/semantic search APIs.

Alternative to:
Algolia
+19

ZincSearch

A lightweight open-source search engine for full-text indexing.

17.7k
770
Last commit: 1mo ago

ZincSearch is a Go-based, lightweight search engine for full-text indexing with Elasticsearch API-compatible ingestion, a Vue UI, and a schema-less document model.

Alternative to:
Elastic Cloud (Elasticsearch Service)
+7

Onyx Community Edition

Self-hosted AI chat and enterprise search for any LLM

17.6k
2.4k
Last commit: 17h ago

Open-source platform for AI chat, RAG, agents, and enterprise search across your team’s connected knowledge sources, compatible with hosted and local LLMs.

Alternative to:
Onyx
+19

OpenSearch

Distributed search and analytics engine with a RESTful API

12.4k
2.4k
Last commit: 17h ago

OpenSearch is an Apache 2.0 open source distributed search and analytics engine for indexing, querying, and analyzing large-scale data with REST APIs.

Alternative to:
Amazon OpenSearch Service
+19