
Archiving a Discourse Forum

Discourse is a forum platform for threaded discussions. It looks nice and works smoothly. However, archiving a Discourse forum is somewhat hard.

A couple of existing posts show how to archive Discourse.

In the end, I wrote a new wget script to download a Discourse forum. The other solutions did not fetch all page requisites, such as the pace css. To fix that, I tweaked the wget script as follows:

time wget --mirror \
      --page-requisites \
      --span-hosts \
      --domains=PRIVATE-DISCOURSE.COM,discourse-cdn.com \
      --convert-links \
      --adjust-extension \
      --compression=auto \
      --reject-regex "/search" \
      --no-if-modified-since \
      --no-check-certificate \
      --execute robots=off \
      --random-wait \
      --wait=1 \
      --user-agent="Googlebot/2.1 (+http://www.google.com/bot.html)" \
      --no-cookies \
      --tries=3 \
      https://YOUR.PRIVATE-DISCOURSE.COM

The key options missing from the other scripts were --span-hosts, which lets wget download CSS and other static content served from other hosts, and --domains=PRIVATE-DISCOURSE.COM,discourse-cdn.com, which limits the download to domains directly associated with your own Discourse instance.
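To make the effect of the --domains list more concrete, here is a rough model of the filtering in plain shell. It is only a sketch: host_allowed is a hypothetical helper, not part of wget, and wget's exact matching rules may differ in detail. The hostnames assets.discourse-cdn.com and fonts.googleapis.com are example values.

```shell
#!/bin/sh
# Hypothetical helper (NOT part of wget): returns 0 if host $1 equals, or is a
# subdomain of, any entry in the comma-separated domain list $2.
host_allowed() {
  host=$1
  old_ifs=$IFS
  IFS=,
  for d in $2; do
    case "$host" in
      "$d"|*."$d") IFS=$old_ifs; return 0 ;;
    esac
  done
  IFS=$old_ifs
  return 1
}

# Same list as in the wget call, with the example domain filled in.
DOMAINS="example.com,discourse-cdn.com"

host_allowed "discussion.example.com" "$DOMAINS" && echo "discussion.example.com: followed"
host_allowed "assets.discourse-cdn.com" "$DOMAINS" && echo "assets.discourse-cdn.com: followed"
host_allowed "fonts.googleapis.com" "$DOMAINS" || echo "fonts.googleapis.com: skipped"
```

With --span-hosts alone, wget would follow links to any host; the domain list is what keeps the mirror from wandering off to unrelated sites.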

When you use the script, replace YOUR.PRIVATE-DISCOURSE.COM and PRIVATE-DISCOURSE.COM with the hostname and second-level domain of your own instance, for example discussion.example.com and example.com.
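If you prefer not to edit the two placeholders by hand, you could derive the second one from the first. The following is a minimal sketch, assuming the forum lives on a single-label subdomain (like discussion.example.com) under a simple top-level domain such as .com. The variable names FORUM_HOST and FORUM_DOMAIN are made up for this example.

```shell
#!/bin/sh
# Hypothetical value; replace with the hostname of your own instance.
FORUM_HOST="discussion.example.com"

# Strip everything up to and including the first dot.
# Assumption: exactly one subdomain label; this breaks for hosts like
# forum.example.co.uk or a.b.example.com.
FORUM_DOMAIN="${FORUM_HOST#*.}"

echo "start URL: https://${FORUM_HOST}"
echo "option:    --domains=${FORUM_DOMAIN},discourse-cdn.com"
```

The two printed values are what you would put in place of YOUR.PRIVATE-DISCOURSE.COM and PRIVATE-DISCOURSE.COM in the wget call.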

The script will take its time, and you can easily wait a couple of hours for the download to complete. This is by design: --wait=1 and --random-wait pause between requests so as not to overload your Discourse instance.
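For a back-of-the-envelope estimate of the waiting time alone: with --wait=1, --random-wait varies each pause between 0.5 and 1.5 seconds, so the average is about one second per request. The sketch below uses a made-up URL_COUNT of 20000 (pages plus assets); the real number depends on your forum, and actual transfer time comes on top.

```shell
#!/bin/sh
# Hypothetical number of URLs wget will request; adjust for your forum.
URL_COUNT=20000
# Matches --wait=1; --random-wait averages out to the same value.
WAIT_SECONDS=1

TOTAL_SECONDS=$((URL_COUNT * WAIT_SECONDS))
HOURS=$((TOTAL_SECONDS / 3600))
MINUTES=$((TOTAL_SECONDS % 3600 / 60))

echo "roughly ${HOURS}h ${MINUTES}m spent waiting between requests"
# prints: roughly 5h 33m spent waiting between requests
```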

Good luck!