Subset Wiki

The c2.com wiki is too big to fit in a single federated wiki site. Better to break it up into topics. My first thought was to use the existing category labels, but I have chosen instead to use frequently found title words.

The original wiki insisted on page titles made from multiple alphabetic words. This forced authors to name pages carefully, if sometimes awkwardly.

✔ Study word usage in existing titles.

✔ Select subset based on subdomain for sitemap.

✔ Allow each subdomain to access all wiki content.

✔ Provide unique flag for each subdomain.
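Federated wiki identifies each site with a small flag image, conventionally kept as the site's status/favicon.png. A hedged sketch for giving each subdomain a distinct flag, assuming ImageMagick is installed and using hypothetical site directory names, derives a color from the site name:

for site in programming.c2.example language.c2.example xp.c2.example; do
  color=$(echo "$site" | md5sum | cut -c1-6)   # six hex digits serve as a color
  mkdir -p "$site/status"
  convert -size 32x32 "xc:#$color" "$site/status/favicon.png"
done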

Words

Wiki pages are stored in a flat-file database (.wdb). Our analysis starts by breaking the page names into individual words. We count the unique words and then list the most frequent first.

ls wiki.wdb | perl -pe 's/([a-z])([A-Z])/$1\n$2/g' | sort | uniq -c | sort -nr | less

Every word is a potential subset. We'd like words that are both distinguishing and meaningful. We'd also like them to label 100 to 500 pages. A few words identify subsets larger than that.

1708 Wiki
1645 The
1445 Of
902 And
843 Is
818 Programming
743 Language
692 In
689 To
559 For
547 Xp
535 Category
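A sketch of narrowing the same pipeline to words in the target range; the 100 and 500 bounds are the subset sizes mentioned above:

ls wiki.wdb | perl -pe 's/([a-z])([A-Z])/$1\n$2/g' | sort | uniq -c | sort -nr | awk '$1 >= 100 && $1 <= 500'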

The previous prototype offered only the most recently edited pages in the sitemap.

ls -t wiki.wdb | head -300

We continue to prefer more recently edited pages in the rare cases where the subset would exceed our expanded sitemap page limit of 500.

ls -t wiki.wdb | grep -i "$subdomain" | head -500

Deployed

We created six subset categories, each referencing three to seven subdomains. Mostly these aligned with interests of the 1995-2000 time period.

We pick topics that have been of lasting interest and subset them into their own federated wiki sites.
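A minimal sketch of that subsetting step, assuming one title word per subdomain; the three words are taken from the counts above, and the subsets/ output directory is hypothetical:

mkdir -p subsets
for word in Programming Language Xp; do
  ls -t wiki.wdb | grep -i "$word" | head -500 > "subsets/$word.titles"
done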

In use, one would choose a category and then start browsing from the aggregated recent changes it loads. All categories together represent 7,000 pages.
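One way to check that total is to count the distinct page titles matched by any subset word; a sketch assuming an illustrative, incomplete word list:

ls wiki.wdb | grep -ciE 'Programming|Language|Xp'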