longnow-for-markdown

Take a markdown file and feed links to the internet archive

This utility takes a markdown file and creates a new markdown file in which each link is accompanied by an archive.org link, in the format `[...](original link) ([a](archive.org link))`.

I use it to archive links in [this forecasting newsletter](https://forecasting.substack.com), which contains the following footer:

> Note to the future: All links are added automatically to the Internet Archive, using this [tool](https://github.com/NunoSempere/longNowForMd) ([a](https://web.archive.org/web/20220109144543/https://github.com/NunoSempere/longNowForMd)). "(a)" for archived links was inspired by [Milan Griffes](https://www.flightfromperfection.com/) ([a](https://web.archive.org/web/20220109144604/https://www.flightfromperfection.com/)), [Andrew Zuckerman](https://www.andzuck.com/) ([a](https://web.archive.org/web/20211202120912/https://www.andzuck.com/)), and [Alexey Guzey](https://guzey.com/) ([a](https://web.archive.org/web/20220109144733/https://guzey.com/)).
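
To make the output format concrete, here is a minimal sketch of the rewrite step for one known link. It is not taken from `longnow.sh`, which may do this differently; the function name `add_archive_link` and its argument order are assumptions made for illustration.

```shell
# Hypothetical helper (not from longnow.sh): read markdown on stdin and append
# " ([a](archive_url))" after every link whose target is exactly original_url.
# Usage: add_archive_link original_url archive_url < file.md
add_archive_link() {
  awk -v url="$1" -v archive="$2" '
    {
      needle = "](" url ")"   # literal text that closes a link to url
      out = ""
      rest = $0
      # index() matches literally, so special characters in URLs are safe.
      while ((i = index(rest, needle)) > 0) {
        out = out substr(rest, 1, i - 1) needle " ([a](" archive "))"
        rest = substr(rest, i + length(needle))
      }
      print out rest
    }'
}
```

Lines that do not contain the link pass through unchanged. Running the filter twice on the same file would add the archive link twice, so a real implementation needs some way to skip links that are already annotated.
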

## How to install
Add [this file](https://github.com/NunoSempere/longNowForMd/blob/master/longnow.sh) to your path, for instance by moving it to the `/usr/bin` folder and giving it execute permissions (with `chmod 755 longnow`):

```
curl https://raw.githubusercontent.com/NunoSempere/longNowForMd/master/longnow.sh > longnow
cat longnow ## probably a good idea to at least see what's there before giving it execute permissions
chmod 755 longnow
sudo mv longnow /usr/bin/longnow
```

In addition, this utility requires [archivenow](https://github.com/oduwsdl/archivenow) as a dependency, which itself requires a Python installation. archivenow can be installed with

```
pip install archivenow ## respectively, pip3, pipx, etc. depending on the system. I use pipx
```

It also requires [jq](https://stedolan.github.io/jq/download/), which can be installed with

```
sudo apt install jq
```

if on Debian, or using your distribution's package manager otherwise.
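
Since a run can take hours, it may be worth confirming that both dependencies are on the path before starting. A minimal sketch; `missing_deps` is a hypothetical helper and not part of `longnow.sh`:

```shell
# Hypothetical helper: print the name of every given command that is not on PATH.
missing_deps() {
  local cmd
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
  done
}

# Example: missing_deps archivenow jq
```

If the example call prints nothing, both tools were found.
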

As of the newest iteration of this program, if archive.org already has a snapshot of the page, that snapshot is used instead of requesting a new one. This saves a great deal of time, but it can mean that a less up-to-date copy is linked. If this behavior is not desired, it can be removed manually by deleting the lines around `if [ "$urlAlreadyInArchiveOnline" == "" ]; then`.
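
A lookup of this kind could work roughly as follows. This is a sketch, not the code in `longnow.sh`: it assumes the Wayback Machine availability endpoint (`https://archive.org/wayback/available`) and its `archived_snapshots.closest.url` field, and the function names are hypothetical.

```shell
# Read an availability-API JSON response on stdin and print the URL of the
# closest snapshot. Prints nothing when no snapshot is listed.
closest_snapshot() {
  jq -r '.archived_snapshots.closest.url // empty'
}

# Hypothetical wrapper: ask the availability endpoint about one URL.
# Requires network access; curl URL-encodes the query parameter.
existing_snapshot() {
  curl --silent --get --data-urlencode "url=$1" \
    "https://archive.org/wayback/available" | closest_snapshot
}

# Example: existing_snapshot "https://guzey.com/"
```

An empty result corresponds to the `"$urlAlreadyInArchiveOnline" == ""` case, in which the link would be sent for archiving.
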

## How to use

```
$ longnow file.md
```

Even for a reasonably sized file, the process takes a long time, so this is more of a "fire and forget, and then come back in a couple of hours" tool. The process can be safely stopped and restarted at any point: archive links that were already obtained are remembered, but the errors file is created again on each run.
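
One way to get this kind of restartability is to keep a plain-text cache of links that have already been archived and to consult it before each request. The sketch below assumes a hypothetical cache format (one `original_url archive_url` pair per line) and hypothetical function names; it does not reproduce the files that `longnow.sh` actually writes.

```shell
# Append one "original_url archive_url" pair to the cache file.
# Usage: remember_archive_url original_url archive_url cache_file
remember_archive_url() {
  printf '%s %s\n' "$1" "$2" >> "$3"
}

# Print the cached archive URL for original_url and return 0,
# or print nothing and return 1 if it is not in the cache yet.
# Usage: cached_archive_url original_url cache_file
cached_archive_url() {
  local url="$1" cache="$2"
  [ -f "$cache" ] || return 1
  awk -v u="$url" '$1 == u { print $2; found = 1; exit } END { exit !found }' "$cache"
}
```

On a restart, only links for which `cached_archive_url` returns 1 would need a new request.
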

## To do
- Deal elegantly with images. Right now, they are also archived, and have to be removed manually afterwards.
- Possibly: Throttle requests to the Internet Archive less. Right now, I'm sending a link roughly every 12 seconds, and then sleeping for a minute every 15 requests. This is probably too much throttling (the theoretical limit is 15 requests per minute), but I think that it does reduce the error rate.
- Do the same thing for HTML files, or other formats.
- Present to r/DataHoarders.
- Pull requests are welcome.
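
The throttling numbers in the list above can be worked through with a short sketch. It assumes one reading of that note, namely that the one-minute sleep comes on top of the usual 12-second pause after every 15th request; the function names are hypothetical and the actual script may count differently.

```shell
# Seconds to pause after the n-th request (n counts from 1):
# 12 s normally, plus an extra 60 s after every 15th request (assumed reading).
pause_after_request() {
  local n="$1"
  if [ $(( n % 15 )) -eq 0 ]; then
    echo 72
  else
    echo 12
  fi
}

# Total pause time in seconds accumulated over the first n requests.
total_pause() {
  local n="$1"
  echo $(( n * 12 + (n / 15) * 60 ))
}
```

Under that reading, 15 requests take 15 × 12 + 60 = 240 seconds of pauses, or about 3.75 requests per minute, roughly a quarter of the stated limit of 15 per minute.
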

## How to use it to back up Google Docs

You can download a .odt file from Google, and then convert it to a markdown file with

```
function odtToMd(){

  input="$1"
  root="$(echo "$input" | sed 's/\.odt$//' )" ## strip only a literal, trailing ".odt"
  output="$root.md"

  pandoc -s "$input" -t markdown-raw_html-native_divs-native_spans-fenced_divs-bracketed_spans | awk ' /^$/ { print "\n"; } /./ { printf("%s ", $0); } END { print ""; } ' | sed -r 's/([0-9]+\.)/\n\1/g' | sed -r 's/\*\*(.*)\*\*/## \1/g'  | tr -s " " | sed -r 's/\\//g' | sed -r 's/\[\*/\[/g' | sed -r 's/\*\]/\]/g' > "$output"
  ## Explanation: 
  ## markdown-raw_html-native_divs-native_spans-fenced_divs-bracketed_spans: various flags to generate some markdown I like
  ## sed -r 's/\*\*(.*)\*\*/## \1/g': transform **Header** into ## Header
  ## sed -r 's/\\//g': Delete annoying "\"s
  ## awk ' /^$/ { print "\n"; } /./ { printf("%s ", $0); } END { print ""; } ': compress paragraphs; see https://unix.stackexchange.com/questions/6910/there-must-be-a-better-way-to-replace-single-newlines-only
  ## sed -r 's/([0-9]+\.)/\n\1/g': Makes lists nicer.
  ## tr -s " ": Squeezes runs of spaces into a single space
  ## sed -r 's/\[\*/\[/g' and sed -r 's/\*\]/\]/g': remove asterisks just inside link brackets
}

## Use: odtToMd file.odt
```

Then run this tool (`longnow file.md`). Afterwards, convert the output file (`file.longnow.md`) back to HTML with

```
function mdToHTML(){
  input="$1"
  root="$(echo "$input" | sed 's/\.md$//' )" ## strip only a literal, trailing ".md"
  output="$root.html"
  pandoc -r gfm "$input" -o "$output"
  ## sed -i 's|\[ \]\(([^\)]*)\)| |g' "$input" ## This removes links around spaces, which are very annoying. See https://unix.stackexchange.com/questions/297686/non-greedy-match-with-sed-regex-emulate-perls
}

## Use: mdToHTML file.md
```

Then copy and paste the HTML into a Google doc and fix formatting mistakes.