
ArchiveBox
Self-hosted tool to collect and preserve webpages, media, and bookmarks in durable formats (HTML, PDF, WARC, MP4) with a CLI, web UI, and search.

ArchiveBox is a self-hosted, open-source web archiving application that captures and preserves web pages and associated media in durable formats for long-term access. It can ingest URLs, browser history, bookmarks, RSS feeds, and other sources, and it produces redundant snapshot outputs for offline viewing and analysis. (archivebox.io)

Key Features
- Multiple import sources: URLs, browser history, bookmarks, Pocket/Pinboard, RSS and more. (archivebox.io)
- Saves snapshots in redundant, portable formats: original HTML+CSS+JS, SingleFile HTML, screenshot PNG, PDF, WARC, JSON, MP3/MP4, and a SQLite index. (github.com)
- Web UI + CLI + Python API: manage collections via a self-hosted web app, a command-line interface, or the Python library. (github.com)
- Search & indexing options: SQLite FTS or external search backends (e.g., Sonic) for fast full-text queries. (docs.archivebox.io)
- Extensible extractors: integrates with standard tools (Chromium/Chrome, yt-dlp, SingleFile, Readability) and lets you configure which optional extractors run. (docs.archivebox.io)
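
As a rough illustration of driving the command-line interface from a script, the Python sketch below assembles `archivebox init` and `archivebox add` commands and runs them only if the `archivebox` executable is on the PATH. The helper names (`build_init_command`, `build_add_command`, `run`) are hypothetical and not part of ArchiveBox; the subcommands and the `--depth` option reflect the ArchiveBox CLI as commonly documented, so check `archivebox --help` for your installed version.

```python
import shutil
import subprocess


def build_init_command():
    """Command that initializes a collection in the working directory."""
    return ["archivebox", "init"]


def build_add_command(url, depth=0):
    """Command that archives one URL.

    depth=0 archives only the URL itself; depth=1 also archives the
    pages it links to (assumed to be the only two accepted values).
    """
    if depth not in (0, 1):
        raise ValueError("depth must be 0 or 1")
    return ["archivebox", "add", f"--depth={depth}", url]


def run(command, data_dir):
    """Run a command inside data_dir, or print it if archivebox is missing."""
    if shutil.which("archivebox") is None:
        print("archivebox not found on PATH; would run:", " ".join(command))
        return None
    # check=True raises CalledProcessError on a non-zero exit status.
    return subprocess.run(command, cwd=data_dir, check=True)
```

For example, `run(build_init_command(), data_dir)` followed by `run(build_add_command("https://example.com"), data_dir)` would set up a collection and archive a single page, where `data_dir` is an existing directory of your choosing.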

Use Cases
- Journalists and researchers preserving cited pages and social media posts for reproducibility and evidence. (archivebox.io)
- Legal and compliance teams capturing time-stamped snapshots for records and audits. (archivebox.io)
- Individuals or organizations creating offline archives of bookmarks, blogs, or multimedia collections. (github.com)

Limitations and Considerations
- Storage and disk usage can grow quickly (especially when archiving video/audio); careful tuning of extractor settings and filesystem choice is recommended. (docs.archivebox.io)
- Several high-fidelity extractors rely on external system packages (Chromium/Chrome, Node, ffmpeg, yt-dlp); installing the full feature set requires additional runtime dependencies. (docs.archivebox.io)
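
To see which output types account for most of the disk usage, one option is to total file sizes by extension under the collection directory. This is a minimal sketch that assumes the archive is an ordinary directory tree on local disk; the function names `usage_by_extension` and `format_report` are hypothetical and not part of ArchiveBox.

```python
from collections import Counter
from pathlib import Path


def usage_by_extension(root):
    """Return total bytes per lowercase file extension under root."""
    totals = Counter()
    for path in Path(root).rglob("*"):
        # Skip directories and symlinks so each file is counted once.
        if path.is_file() and not path.is_symlink():
            totals[path.suffix.lower() or "(none)"] += path.stat().st_size
    return totals


def format_report(totals, top=10):
    """Format the largest extensions, biggest first, in MiB."""
    lines = []
    for ext, size in totals.most_common(top):
        lines.append(f"{ext:<10} {size / 2**20:10.1f} MiB")
    return "\n".join(lines)
```

Calling `print(format_report(usage_by_extension(data_dir)))` on a collection directory gives a quick view of whether video, WARC, or other outputs dominate, which can inform which extractors to disable or limit.
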
ArchiveBox is intended for users who need durable, self-hosted preservation of web content. It provides multiple interfaces (web UI, CLI, and Python API) and portable output formats that support long-term access and programmatic workflows. (archivebox.io)


