
ArchiveBox
Self-hosted tool to collect and preserve webpages, media, and bookmarks in durable formats (HTML, PDF, WARC, MP4) with a CLI, web UI, and search.

ArchiveBox is a self-hosted, open-source web archiving application that captures and preserves web pages and associated media in durable formats for long-term access. It can ingest URLs, browser history, bookmarks, RSS feeds, and other sources, and it saves each snapshot in several redundant output formats for offline viewing and analysis.
Key Features
- Multiple import sources: URLs, browser history, bookmarks, Pocket/Pinboard, RSS feeds, and more.
- Saves snapshots in redundant, portable formats: original HTML+CSS+JS, single-file HTML, PNG screenshot, PDF, WARC, JSON, MP3/MP4, and a SQLite index.
- Web UI + CLI + Python API: manage collections via a self-hosted web app, a command-line interface, or the Python library.
- Search & indexing options: SQLite FTS or external search backends (e.g., Sonic) for fast full-text queries.
- Extensible extractors: integrates with standard tools (Chromium/Chrome, yt-dlp, SingleFile, Readability) and lets you choose which optional extractors run.
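The CLI listed above can also be driven from a script. The following is a minimal sketch, assuming the `archivebox` executable is installed and the working directory is an already initialized collection. `build_add_command` and `add_urls` are hypothetical helper names, not part of ArchiveBox, and the accepted `--depth` values are an assumption that should be checked against `archivebox add --help` for the installed version.

```python
import shutil
import subprocess


def build_add_command(url, depth=0):
    """Return the argv list for `archivebox add` (hypothetical helper).

    depth=0 archives only the given URL; depth=1 also follows its links.
    The 0/1 restriction is an assumption about the installed version.
    """
    if depth not in (0, 1):
        raise ValueError("depth must be 0 or 1")
    return ["archivebox", "add", f"--depth={depth}", url]


def add_urls(urls, depth=0, cwd="."):
    """Run `archivebox add` once per URL and return the exit codes.

    Assumes `cwd` is an initialized ArchiveBox data directory.
    """
    if shutil.which("archivebox") is None:
        raise RuntimeError("archivebox executable not found on PATH")
    return [
        subprocess.run(build_add_command(u, depth), cwd=cwd).returncode
        for u in urls
    ]


if __name__ == "__main__":
    # Only prints the command; call add_urls() to actually archive.
    print(build_add_command("https://example.com", depth=1))
```

Keeping command construction separate from execution makes the script easy to dry-run before it writes anything to the archive.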
Use Cases
- Journalists and researchers preserving cited pages and social media posts for reproducibility and evidence.
- Legal and compliance teams capturing time-stamped snapshots for records and audits.
- Individuals or organizations creating offline archives of bookmarks, blogs, or multimedia collections.
Limitations and Considerations
- Storage and disk usage can grow quickly (especially when archiving video/audio); careful tuning of extractor settings and filesystem choice is recommended.
- Several high-fidelity extractors rely on external system packages (Chromium/Chrome, Node, ffmpeg, yt-dlp); installing the full feature set requires additional runtime dependencies.
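Before enabling the heavier extractors, it can help to check which of these external tools are actually available. The sketch below uses only the Python standard library. The executable names in `ASSUMED_BINARIES` are assumptions (they vary by platform and install method), and `find_dependencies` is a hypothetical helper rather than an ArchiveBox function.

```python
import shutil

# Assumed executable names; adjust for your platform. Chromium in
# particular is installed under several different names.
ASSUMED_BINARIES = {
    "Chromium/Chrome": ["chromium", "chromium-browser", "google-chrome"],
    "Node": ["node"],
    "ffmpeg": ["ffmpeg"],
    "yt-dlp": ["yt-dlp"],
}


def find_dependencies(candidates=ASSUMED_BINARIES, which=shutil.which):
    """Map each dependency label to the first path found on PATH, or None."""
    found = {}
    for label, names in candidates.items():
        # Try each candidate name in order and keep the first hit.
        found[label] = next((p for p in map(which, names) if p), None)
    return found


if __name__ == "__main__":
    for label, path in find_dependencies().items():
        print(f"{label}: {path or 'not found'}")
```

A missing entry here only means the name was not found on `PATH`; the tool may still be installed elsewhere and configured explicitly.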
ArchiveBox is intended for users who need durable, self-hosted preservation of web content. Its multiple interfaces (web UI, CLI, Python API) and portable output formats support long-term access and programmatic workflows.

