
ArchiveBox
Open-source self-hosted web archiving and snapshotting tool

ArchiveBox is a self-hosted, open-source web archiving application that captures and preserves web pages and associated media in durable formats for long-term access. It can ingest URLs, browser history, bookmarks, RSS feeds, and other sources, and it produces redundant snapshot outputs for offline viewing and analysis.
Key Features
- Multiple import sources: URLs, browser history, bookmarks, Pocket/Pinboard, RSS and more.
- Saves snapshots in redundant, portable formats: original HTML+CSS+JS, SingleFile HTML, screenshot PNG, PDF, WARC, JSON, MP3/MP4, and SQLite index.
- Web UI + CLI + Python API: manage collections via a self-hosted web app, a command-line interface, or the Python library.
- Search & indexing options: SQLite FTS or external search backends (e.g., Sonic) for fast full-text queries.
- Extensible extractors: integrates with standard tools (Chromium/Chrome, yt-dlp, SingleFile, Readability), and optional extractors can be enabled or disabled through configuration.
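As a rough illustration of the command-line interface mentioned above, the following Python sketch wraps the `archivebox add` command with `subprocess`. It assumes the `archivebox` executable is installed and that the target data directory has already been initialized with `archivebox init`. The helper names `build_add_command` and `archive_url` are hypothetical and are not part of ArchiveBox itself.

```python
import shutil
import subprocess
from typing import List, Optional


def build_add_command(url: str, depth: int = 0) -> List[str]:
    """Build the argument list for an `archivebox add` invocation.

    Hypothetical helper: it only assembles the command, it does not run it.
    depth=0 archives just the given URL; depth=1 also follows its links.
    """
    if depth < 0:
        raise ValueError("depth must be non-negative")
    return ["archivebox", "add", f"--depth={depth}", url]


def archive_url(url: str, data_dir: str, depth: int = 0) -> Optional[int]:
    """Run `archivebox add` inside data_dir.

    Returns the process exit code, or None when the `archivebox`
    executable cannot be found on PATH (assumption: it is installed
    separately, for example via pip or Docker).
    """
    if shutil.which("archivebox") is None:
        return None
    result = subprocess.run(build_add_command(url, depth), cwd=data_dir, check=False)
    return result.returncode


if __name__ == "__main__":
    # Show the command that would be executed, without running it.
    print(" ".join(build_add_command("https://example.com", depth=1)))
```

Keeping command construction separate from execution makes the wrapper easy to check without a working ArchiveBox installation.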
Use Cases
- Journalists and researchers preserving cited pages and social media posts for reproducibility and evidence.
- Legal and compliance teams capturing time-stamped snapshots for records and audits.
- Individuals or organizations creating offline archives of bookmarks, blogs, or multimedia collections.
Limitations and Considerations
- Storage and disk usage can grow quickly (especially when archiving video/audio); careful tuning of extractor settings and filesystem choice is recommended.
- Several high-fidelity extractors rely on external system packages (Chromium/Chrome, Node, ffmpeg, yt-dlp); installing the full feature set requires additional runtime dependencies.
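To keep an eye on the storage growth described above, a small script can report which snapshot folders take the most space. This is a minimal sketch using only the Python standard library. It assumes each snapshot lives in its own subfolder of an `archive/` directory inside the data folder; the function names and the example path are hypothetical.

```python
import os
from pathlib import Path
from typing import List, Tuple


def directory_size(path: Path) -> int:
    """Return the total size in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = Path(root) / name
            # Skip symlinks so linked files are not counted twice.
            if not file_path.is_symlink():
                total += file_path.stat().st_size
    return total


def largest_snapshots(archive_dir: Path, top: int = 5) -> List[Tuple[str, int]]:
    """List the `top` largest subfolders of archive_dir as (name, bytes)."""
    sizes = {
        child.name: directory_size(child)
        for child in archive_dir.iterdir()
        if child.is_dir()
    }
    return sorted(sizes.items(), key=lambda item: item[1], reverse=True)[:top]


if __name__ == "__main__":
    # Assumed location of the snapshot folders; adjust to the real data folder.
    archive_dir = Path("./archive")
    if archive_dir.is_dir():
        for name, size in largest_snapshots(archive_dir):
            print(f"{name}: {size / 1_000_000:.1f} MB")
```

Snapshots that include video or audio will usually dominate such a report, which helps decide which extractors to restrict.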
ArchiveBox is intended for users who need durable, self-hosted preservation of web content. It provides multiple interfaces (web UI, CLI, and Python API) and portable, redundant output formats to support long-term access and programmatic workflows.
Similar Services

Meilisearch
Fast search engine API with full-text, vector, and hybrid search
Meilisearch is a lightning-fast search engine API for apps and websites, offering typo-tolerant full-text search plus vector and AI-ready hybrid retrieval.

SearXNG
Privacy-focused metasearch engine for aggregating web results
SearXNG is a privacy-respecting metasearch engine that aggregates results from many search services without tracking or profiling users.

Typesense
Fast, typo-tolerant search engine with keyword and vector search
Typesense is a developer-friendly search engine for instant, typo-tolerant search-as-you-type with faceting, filtering, geo search, and vector/semantic search APIs.

ZincSearch
Lightweight open-source search engine for full-text indexing
ZincSearch is a Go-based, lightweight search engine for full-text indexing with Elasticsearch API-compatible ingestion, a Vue UI, and a schema-less document model.

Onyx Community Edition
Self-hosted AI chat and enterprise search for any LLM
Onyx Community Edition is an open-source platform for AI chat, RAG, agents, and enterprise search across your team’s connected knowledge sources, compatible with hosted and local LLMs.

OpenSearch
Distributed search and analytics engine with a RESTful API
OpenSearch is an Apache 2.0 open source distributed search and analytics engine for indexing, querying, and analyzing large-scale data with REST APIs.



