@WardCunningham
Owner

@WardCunningham WardCunningham commented Dec 4, 2018

We add some tools for finding and correcting character code problems. Rather than work on the whole wiki database, we select the troublesome pages into a trouble directory. As we improve our json.* script chain, it writes possibly improved pages to a pages directory. From these we select the specific remaining problems, diagnose them at the line level, and collect the results in the invalid directory. We then inspect the offending characters in octal, find substitutions, and repeat the process looking for improvements.

bulk convert

ruby json.rb

This reports progress on individual files using one character per file: "." for retrieved text, "," for copy, and "x" for still trouble. The counts at the end measure progress.

x......x..,x...xx.xx....,..xxxx.xx..x..x...x..xx.x..x,x.x.xx.x,x.x...xx.,.x.x,..
.........x.....,..x,x..x.....xx..x...........x......xx..x...xxx...xx........xxx,
..x,.x.xx..x.....x...x.x.xxxxxx.xx.x,..x.x,x.x,x.xxx,....,x.xx.x.xxx.x.x,...xxx.
......,xx...x.,x.x..xx.x,.x,.x,xx..x,.x.x.xx.x.xxxx.xx...............xx..x..x.x.
x.x..x..x..xxxxx.x...x.x.x...x...xxxxx.xxxx..x.x...xx.....x.,x.x.xx...x.x.......
xxx..,..,,.x..x..,x,x..x.x..x..x.xx.x,x.x...x.,.x.xxxxxx..xx....,..,x..xx......x
,.xx..x.xx.,........xxxx...xx..,.xx..xxx.x..,x.,x,.....x.xx,xxx...x..x...xx..xxx
.x....xxxxxxx,x..xxxxxx....xx..xx.xxx......x....x..xx,xxx,xx..x...xxxxx..x....x.
x,..xx.xx,.....x.,.x.x.,x..x.x.,x,...x....x..,..xxx.xxxx.x...x.x.x.x.....x.x.xxx
..x..,.xxx..x..xx.x.x..x,.,x.,x,x.....x.xx....xxx.x..x.x,.x.xx....x...xx..x..x.x
...x..x.xx.xxx.x.xx...x...xx..x.xxx...xxxx...xxx,,x,.x,,..xx...xx...x...,x..x...
xxx,xx.xxx,..xx.x.,xx.xx,.xxxx...xxxx,.xx.x.x.xx,x..xxxxxxxx..xxxxx.,.,,x....xxx
x.x..xx.x.x...xx,..x....xx..x.x,,x..xx....x..,x..,x,x.x.x..x.........x.,...xxxx.
....x.x.x..x.xx..xxxxxx..xx.,.
557 text ok
78 copy ok
435 with trouble
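
The relation between the progress characters and the three counts can be sketched in Ruby. This is only an illustration of the tallying, not code from json.rb; the names CODE_LABELS and tally_progress are made up for the example.

```ruby
# Tally the one-character progress codes printed by a bulk run.
# The meaning of each code follows the description above; the constant
# and method names are illustrative and not taken from json.rb.
CODE_LABELS = {
  "." => "text ok",
  "," => "copy ok",
  "x" => "with trouble"
}.freeze

def tally_progress(report)
  counts = Hash.new(0)
  report.each_char do |c|
    label = CODE_LABELS[c]
    counts[label] += 1 if label # skip newlines and anything unexpected
  end
  counts
end

# The first few codes from the run above, used as sample input.
sample = "x......x..,x...xx"
tally_progress(sample).each { |label, n| puts "#{n} #{label}" }
```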

diagnosis

rm invalid/*
cat pages/* | \
  jq -r 'select(.trouble)|.page' | \
  while read -r i; do
    cat "trouble/$i" | perl check.pl | ruby check.rb > "invalid/$i"
  done

A typical invalid file shows numbered lines with invalid characters.

cat invalid/AmbientOrb
24 "Automated Continuous Integration and the Ambient Orb™"
75 "Automated Continuous Integration and the Ambient Orb™"
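
The actual check.rb is not shown in this thread. As a rough sketch of the kind of check that could produce such a report, assuming "invalid" means "not valid UTF-8", the hypothetical invalid_lines below returns the line number and text of every offending line.

```ruby
# Hypothetical stand-in for the checking step: return [line_number, text]
# pairs for lines that are not valid UTF-8. This is an assumption about
# what the report means, not the real check.rb.
def invalid_lines(bytes)
  result = []
  bytes.b.each_line.with_index(1) do |line, number|
    utf8 = line.dup.force_encoding(Encoding::UTF_8)
    result << [number, line.chomp] unless utf8.valid_encoding?
  end
  result
end

# Sample input: the second line carries a stray 0x99 byte.
page = "first line\r\nOrb\x99\r\nlast line\r\n".b
invalid_lines(page).each { |number, text| puts "#{number} #{text.inspect}" }
```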

A good technique is to flip through the reports, working from the shortest entries first.

(cd invalid; for i in `ls -Sr`; do echo $i; cat $i; echo;  read x; done)

When a particular line of a particular file is of interest, isolate that line and dump it in octal.

cat trouble/AmbientOrb | head -24 | tail -1 | od -c
0000000    "   A   u   t   o   m   a   t   e   d       C   o   n   t   i
0000020    n   u   o   u   s       I   n   t   e   g   r   a   t   i   o
0000040    n       a   n   d       t   h   e       A   m   b   i   e   n
0000060    t       O   r   b 231   "  \r  \n                            
0000071

In this case the invalid character is octal 231 (hex 99), the byte that Windows-1252 uses for the ™ sign.
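
The same inspection can be sketched in Ruby for cases where od output is awkward to scan. The helper high_bytes is a made-up name; it simply lists every non-ASCII byte in a line with its offset, so it will also list the bytes of valid multi-byte UTF-8 characters.

```ruby
# List each non-ASCII byte in a line with its offset, octal and hex value.
# Illustrative helper only; it reports all bytes >= 0x80, valid or not.
def high_bytes(line)
  report = []
  line.b.each_byte.with_index do |byte, offset|
    next if byte < 0x80 # plain ASCII, nothing to report
    report << { offset: offset, octal: format("%03o", byte), hex: format("%02x", byte) }
  end
  report
end

# Sample input shaped like the tail of the line dumped above.
high_bytes("Orb\x99\"\r\n".b).each do |entry|
  puts "offset #{entry[:offset]}: octal #{entry[:octal]}, hex #{entry[:hex]}"
end
```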

substitution

Perl is happy to change these characters to something preferable. In this case, the tm is probably a joke and won't be missed.

s/\o{231}//g;

These commands go into json.pl and check.pl, which have a similar structure. We repeat from the beginning and find less work to do. (In this case, way less work to do, having worked in a few substitutions before repeating.)

624 text ok
69 copy ok
377 with trouble
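
For comparison, here is a sketch of the same substitution done on raw bytes in Ruby. drop_byte mirrors the Perl rule above. keep_as_trademark is an alternative that assumes the byte was written as Windows-1252, which nothing in this thread confirms. Both method names are hypothetical.

```ruby
# Two illustrative ways to handle a stray byte 0x99 (octal 231).
TM_BYTE = "\x99".b

# Mirror the Perl substitution: drop the byte entirely.
def drop_byte(line)
  line.b.gsub(TM_BYTE, "").force_encoding(Encoding::UTF_8)
end

# Alternative, ASSUMING the source encoding was Windows-1252:
# replace the byte with the UTF-8 encoding of the trademark sign.
def keep_as_trademark(line)
  line.b.gsub(TM_BYTE, "\u2122".b).force_encoding(Encoding::UTF_8)
end

raw = "Ambient Orb\x99".b
puts drop_byte(raw)
puts keep_as_trademark(raw)
```

Working on a binary copy of the line keeps gsub from raising on the invalid byte; the result is only tagged as UTF-8 after the byte has been dealt with.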

This was referenced Dec 4, 2018
@WardCunningham
Owner Author

I've revised the json.rb converter to record in the output more information about decisions it has made. This lets me assess results by selecting subsets with jq. Here, for example, is a successful recovery:

{
  "date": "November 3, 2011",
  "text": "We aim to make simple things simple and complex things possible...",
  "rev": "22",
  "page": "AlanKayOnSmalltalk",
  "copy": true
}

On complete failure I still produce the page name so that I can apply more detailed diagnostics driven from jq results.

{
  "page": "AdelinoRodrigues",
  "trouble": true
}

See where I begin diagnosis with jq -r 'select(.trouble)|.page' above.
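
The same selection can be sketched in Ruby for anyone who wants to drive diagnostics without jq. The method name trouble_pages is invented for the example, and it assumes each converted page is available as one JSON document in the shape shown above.

```ruby
require "json"

# Return the page names of records flagged with "trouble", mirroring
# jq -r 'select(.trouble)|.page'. Each element of `documents` is the
# JSON text of one converted page (an assumption about the pages files).
def trouble_pages(documents)
  documents.map { |text| JSON.parse(text) }
           .select { |record| record["trouble"] }
           .map { |record| record["page"] }
end

# Sample records abbreviated from the two examples above.
documents = [
  '{"page": "AlanKayOnSmalltalk", "rev": "22", "copy": true}',
  '{"page": "AdelinoRodrigues", "trouble": true}'
]
puts trouble_pages(documents)
```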

@btrower

btrower commented Nov 26, 2020 via email

@WardCunningham
Owner Author

Thank you for your understanding.

A small percentage of pages have character set problems that never bothered the perl that implemented the original wiki. As I have chosen to work in ruby and more recently javascript, I find that I can't touch these files. In a pinch I have written c code with getchar and putchar which can eat through anything. (I'm not using standard io, which might be picky.)
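
As a point of comparison only, and not a description of what the converter does, Ruby can be made similarly tolerant by treating a page as bytes and replacing whatever is not valid UTF-8. The sketch below uses String#scrub; the method name scrub_page and the "?" placeholder are arbitrary choices for the example.

```ruby
# Replace every invalid UTF-8 byte sequence with a placeholder so that
# later string operations do not raise. Illustrative helper only.
def scrub_page(bytes, replacement = "?")
  bytes.dup.force_encoding(Encoding::UTF_8).scrub(replacement)
end

puts scrub_page("Ambient Orb\x99".b)
```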

In other news, last fall a Portland State University capstone project worked through these problems and others but the pandemic got in the way of the final tech transfer. Ongoing work should start there.

The read-write access to the original content is through federated wiki. You can edit pages and your edits will persist for your own benefit in browser local storage. If you want to share edits you can host a federated wiki instance of your own and save any edits you make there. If an interest group were willing to take ownership of some content, maybe continuous integration pages, or implementations of fizz-buzz, I could find some way to announce this work as a sister project when people read the original.

@btrower

btrower commented Nov 28, 2020 via email

@WardCunningham
Owner Author

Oops. I meant to include a link. Here I searched for wiki and picked a few pages to illustrate.

http://ward.asia.wiki.org/view/wiki-find-page/wiki.sfw.c2.com/abuse-on-wiki/wiki.sfw.c2.com/an-outsiders-review-of-wiki
