Discourse is a forum platform that supports threaded discussions. It looks nice and works smoothly. However, it is somewhat hard to archive such a forum.
There are a couple of posts showing how to archive Discourse:
- https://meta.discourse.org/t/archive-an-old-forum-in-place-to-start-a-new-discourse-forum/13433/14 . That post describes how to create a WARC archive. However, I only needed an HTML archive. Besides, that approach does not download all page requisites.
- https://github.com/kitsandkats/ArchiveDiscourse . I did not try this one, as it requires a software build step.
- https://letswp.io/download-discourse-forum-wget/ . This post is quite nice, because it explains some of the wget parameters needed to download a Discourse forum.
In the end, I wrote a new wget script to download a Discourse forum. What the other solutions lacked was that they did not fetch all page requisites, such as the pace css. To fix that, I tweaked the wget script as follows:
time wget --mirror \
--page-requisites \
--span-hosts \
--domains=PRIVATE-DISCOURSE.COM,discourse-cdn.com \
--convert-links \
--adjust-extension \
--compression=auto \
--reject-regex "/search" \
--no-if-modified-since \
--no-check-certificate \
--execute robots=off \
--random-wait \
--wait=1 \
--user-agent="Googlebot/2.1 (+http://www.google.com/bot.html)" \
--no-cookies \
--tries=3 \
https://YOUR.PRIVATE-DISCOURSE.COM
The key thing missing in the other scripts was --span-hosts, which enables downloading CSS and other static content served from other hosts, together with --domains=PRIVATE-DISCOURSE.COM,discourse-cdn.com, which limits the download to domains directly associated with your own Discourse instance.
When you use the script, replace YOUR.PRIVATE-DISCOURSE.COM and PRIVATE-DISCOURSE.COM with the host name and second-level domain of your own instance, for example discussion.example.com and example.com.
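To avoid editing the placeholders by hand, the same command can be wrapped in a small shell function. This is only a sketch: the function name archive_forum, its two arguments, and the DRY_RUN switch are my own illustrative names, not part of wget or Discourse. The wget flags are exactly the ones used above.

```shell
#!/bin/sh
# Sketch: the wget call from above with host and domain as parameters.
# archive_forum, its arguments, and DRY_RUN are hypothetical names.

archive_forum() {
    host="$1"      # full host name, e.g. discussion.example.com
    domain="$2"    # second-level domain, e.g. example.com

    if [ -z "$host" ] || [ -z "$domain" ]; then
        echo "usage: archive_forum HOST DOMAIN" >&2
        return 2
    fi

    # With DRY_RUN=1 the command is printed instead of executed.
    run=""
    if [ "${DRY_RUN:-0}" = "1" ]; then
        run="echo"
    fi

    $run wget --mirror \
        --page-requisites \
        --span-hosts \
        --domains="$domain,discourse-cdn.com" \
        --convert-links \
        --adjust-extension \
        --compression=auto \
        --reject-regex "/search" \
        --no-if-modified-since \
        --no-check-certificate \
        --execute robots=off \
        --random-wait \
        --wait=1 \
        --user-agent="Googlebot/2.1 (+http://www.google.com/bot.html)" \
        --no-cookies \
        --tries=3 \
        "https://$host"
}
```

After sourcing the file, setting DRY_RUN=1 and calling archive_forum discussion.example.com example.com prints the command so you can check the substituted values; without DRY_RUN the same call starts the download.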
The script will take its time, and you can easily end up waiting a couple of hours for the download to complete. This is by design, so as not to overload your Discourse instance.
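For a rough idea of the waiting time: with --wait=1 and --random-wait, wget pauses between 0.5 and 1.5 seconds between requests, so about one second on average. The sketch below turns that into a lower bound in minutes. The function name estimate_minutes and the request count in the usage note are made up for illustration, and the estimate ignores transfer time and retries.

```shell
#!/bin/sh
# Sketch: lower bound on the download time caused by --wait alone.
# estimate_minutes is a hypothetical helper, not part of wget.

estimate_minutes() {
    requests="$1"   # number of files wget will fetch (your own guess)
    wait_secs="$2"  # value passed to --wait, in whole seconds
    # Integer arithmetic: total pause in seconds, converted to minutes.
    echo $(( requests * wait_secs / 60 ))
}
```

For example, estimate_minutes 10000 1 prints 166, so a forum that needs 10000 requests spends close to three hours on pauses alone.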
Good luck!