Discourse is a forum platform that allows threaded discussions. It looks nice and works smoothly. However, it is somewhat hard to archive such a forum.
There are a couple of posts showing how to archive Discourse:
- https://meta.discourse.org/t/archive-an-old-forum-in-place-to-start-a-new-discourse-forum/13433/14 . That post describes how to get a WARC archive. However, I just needed an HTML archive. Besides, it does not download all page requisites.
- https://github.com/kitsandkats/ArchiveDiscourse . I did not try it, as it requires a software build step.
- https://letswp.io/download-discourse-forum-wget/ . This post is quite nice, because it explains some of the needed parameters for wget to download a Discourse forum.
In the end, I made a new wget script to download a Discourse forum. The key thing the other solutions lacked was that they did not include all page requisites, like the pace css. To fix that, I tweaked the wget script as follows:
time wget --mirror \
  --page-requisites \
  --span-hosts \
  --domains=PRIVATE-DISCOURSE.COM,discourse-cdn.com \
  --convert-links \
  --adjust-extension \
  --compression=auto \
  --reject-regex "/search" \
  --no-if-modified-since \
  --no-check-certificate \
  --execute robots=off \
  --random-wait \
  --wait=1 \
  --user-agent="Googlebot/2.1 (+http://www.google.com/bot.html)" \
  --no-cookies \
  --tries=3 \
  https://YOUR.PRIVATE-DISCOURSE.COM
The key things missing in the other scripts were --span-hosts, which enables downloading CSS and other static content from other hosts, and --domains=PRIVATE-DISCOURSE.COM,discourse-cdn.com, which limits downloading to domains directly associated with your own Discourse instance.
When you use the script, you need to replace YOUR.PRIVATE-DISCOURSE.COM with the host name of your own instance, and PRIVATE-DISCOURSE.COM in --domains with its second-level domain.
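To make the substitution concrete, here is a minimal sketch that keeps the two values in variables and builds the same wget command from them. The host forum.example.com, the domain example.com, and the variable names FORUM_HOST, BASE_DOMAIN and RUN are made up for illustration; they are not from any real instance. The script only prints the command unless RUN=1 is set, so you can check it before starting a long download.

```shell
#!/bin/sh
# Hypothetical values for illustration only; replace with your own instance.
FORUM_HOST="forum.example.com"   # full host name of the Discourse instance
BASE_DOMAIN="example.com"        # its second-level domain

# Values derived from the two settings above.
FORUM_URL="https://${FORUM_HOST}"
DOMAINS="${BASE_DOMAIN},discourse-cdn.com"

# Store the wget arguments as positional parameters so that the
# user-agent string with its spaces stays a single argument.
set -- \
  --mirror \
  --page-requisites \
  --span-hosts \
  --domains="${DOMAINS}" \
  --convert-links \
  --adjust-extension \
  --compression=auto \
  --reject-regex "/search" \
  --no-if-modified-since \
  --no-check-certificate \
  --execute robots=off \
  --random-wait \
  --wait=1 \
  --user-agent="Googlebot/2.1 (+http://www.google.com/bot.html)" \
  --no-cookies \
  --tries=3 \
  "${FORUM_URL}"

# Show the command for review (quoting is not preserved in this printout).
echo "wget $*"

# Only start the download when explicitly requested with RUN=1.
if [ "${RUN:-0}" = "1" ]; then
  time wget "$@"
fi
```

Saved under a name of your choice, for example archive.sh, it can be started with RUN=1 sh archive.sh once the printed command looks right.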
The script will take its time, and you can easily end up waiting a couple of hours for the download to complete. This is by design, so that the download does not overload your Discourse instance.
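A rough back-of-the-envelope estimate shows where the hours come from. With --wait=1 and --random-wait, wget pauses between 0.5 and 1.5 times the wait value between requests, so about one second on average. The sketch below assumes a made-up figure of 10000 URLs (URL_COUNT is hypothetical; your forum may have far more or fewer) and counts only the waiting, not the transfer time itself.

```shell
#!/bin/sh
URL_COUNT=10000   # hypothetical number of URLs wget will fetch
WAIT_SECONDS=1    # value passed to --wait

# --random-wait varies each pause between 0.5x and 1.5x of --wait.
MIN_SECONDS=$(( URL_COUNT * WAIT_SECONDS / 2 ))
AVG_SECONDS=$(( URL_COUNT * WAIT_SECONDS ))
MAX_SECONDS=$(( URL_COUNT * WAIT_SECONDS * 3 / 2 ))

# Integer division rounds down to whole minutes.
echo "Waiting alone: about $(( AVG_SECONDS / 60 )) minutes"
echo "Range: $(( MIN_SECONDS / 60 )) to $(( MAX_SECONDS / 60 )) minutes"
```

Lowering --wait would shorten the run, at the cost of putting more load on the forum.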