CharsetConverterContrib

Convert entire Foswiki RCS databases from one character set to UTF-8

This module is used to convert the character set encoding used in RcsWrap and RcsLite stores.

The character set encoding determines the range of characters that can be used in for naming wiki topics and attachments, and in content stored in topics.

(To understand what this means on a technical level, read Foswiki:Development.UnderstandingEncodings)

Before Foswiki 2.0, Foswiki had to be configured with a {Site}{CharSet}, which set the encoding used for characters in topic and attachment names, and topic content.

The default encoding used by Foswiki before 2.0 was iso-8859-1, which was a reasonable choice for many western languages. However there are many other languages (for example, Arabic, Chinese, Hebrew, Hindi) that have characters that do not appear in this character set. Even some basic characters like the euro symbol are missing from iso-8859-1. For this reason, Foswiki has now moved to supporting the standard UTF-8 character encoding, which is designed to support a very wide range of characters.

Unfortunately once you chose a {Site}{CharSet} and created a bunch of topics, it became very risky to change because the charset is associated with the entire database, and not with individual topics. It was even possible to paste content in a different encoding into the text editor and have it stored in that encoding, resulting in what looked like garbled topics.

Ideally all Foswikis should use UTF-8, even those that are still using older Foswikis, but we have a legacy of existing sites that don't. So we need some way to convert an RCS-based wiki from any existing character encoding to UTF-8.

And that's what this module provides. If you have a store that is:
  1. Set up to use some {Site}{CharSet} other than UTF-8
  2. Using a mixture of encodings in content
  3. Using RcsWrap or RcsLite as it's {Store}{Implementation}
then this module can convert it to using UTF-8, including all the topic histories.

Even if you don't have an immediate need for non-western character sets this is worth doing, as Foswiki 2.0 and later work exclusively with UTF-8 content.

Note that this module converts all the histories of all your topics, as well as the latest version of the topic. It also maps all web, topic and attachment names. It does not, however, touch the content of attachments.

Installation

This extension is tested with Foswiki 1.1.0 and later. If your Foswiki installation is older than that, then upgrade your Foswiki first.

Note that the extension is not required and is not recommended on Foswiki 2.0 or later. If your requirement is part of an upgrade to Foswiki 2.0, then either:
  1. convert the 1.1.x Foswiki to UTF-8 using this extension first, or
  2. use tools/bulk_copy.pl, as recommended in the release notes.

Version 1.2 of this extension has limited support for Foswiki 2.0 systems. It can be used with caution and a backup to detect and correct characterset issues on a 2.0 installation.

You do not need to install anything in the browser to use this extension. The following instructions are for the administrator who installs the extension on the server.

Open configure, and open the "Extensions" section. Use "Find More Extensions" to get a list of available extensions. Select "Install".

If you have any problems, or if the extension isn't available in configure, then you can still install manually from the command-line. See http://foswiki.org/Support/ManuallyInstallingExtensions for more help.

Usage

ALERT! The conversion process updates data in-place, and cannot be reversed. Be sure to take a backup before running this tool.

The convertor is used from the command-line on your wiki server (if you do not have access to the command line then we are sorry, but there is currently no way for you to use the conversion).

To use the convertor,
  • first shut down your site. You don't want anyone modifying topics while it is running.
  • then cd to the tools directory in your installation and perl convert_charset.pl -i.
  • If that runs cleanly without reporting any errors, you can:
    • perl convert_charset.pl

The script will convert the Foswiki RCS database pointed at by {DataDir} and {PubDir} from the existing character set (as set by {Site}{CharSet}) to UTF8.

Options:
-i info - report what would be done only, do not convert anything
-q quiet - work silently (unless there's an error)
-a abort - on error (default is to report and continue)
-r repair - detect the encoding of each string and repair inconsistencies.
  Expert options
-web=webname Restrict conversion to a single web and it's subwebs.
-encoding=charset Override the source encoding. (Required if running the conversion on Foswiki 2.0.)

Only use -r if your site may contain content which cannot be decoded using the {Site}{CharSet} (if this is the case, -i will abort with an error).

if the -r option is given, then any number of additional repair options can follow. These are of two types:
  • detected-encoding=actual-encoding
  • topic-path=actual-encoding
The first allows you to override the encoding of all strings detected as detected-encoding, while the second allows you to select an individual topic and override the encoding of the content of just that topic. If you need to override the encoding of a web or topic name, use :N after the topic-path e.g. Sandbox/NorthKorea:N=EUC-KR

Although this exension is intended for use on Foswiki 1.1, there may be cases where an individual web requires conversion on a Foswiki 2.0 system. For example, conversion of a single web migrated at a later date from an older system. For example, convert the oops web from iso-8859-1 on a system already converted to utf-8. *Use extreme caution converting individual webs. Foswiki does not support mixed encoding. perl convert_charset.pl -web=Oops -encoding=iso-8859-1 -i

Once you have run the script without -i, all:
  • web names
  • topic names
  • attachment names
  • topic content
will be converted to UTF-8. The conversion is performed in place on the data and pub directories.

Note that no conversion is performed on
  • log files
  • working/
  • temporary files
  • password files
  • Links to attachments that were entity encoded.

Once conversion is complete you must change your {Site}{CharSet} to 'utf-8'.

Info

Change History:  
1.4 (15 Sep 2015) Foswikitask:Item13702 - Actually use the encoding detected by -r repair option.
1.3 (15 Jul 2015) Foswikitask:Item13523 - Better job of detecting Foswiki 2.0.
1.2 (1 Jun 2015 ) Foswikitask:Item13442 - Add repair option to detect exceptions to the encoding. Add limited support for Foswiki 2.0.
Add more flexible overrides for detected encoding.
1.1 (11 Jun 2014)  

Dependencies

NameVersionDescription
File::Copy>0Required
Foswiki>=1.1Required
Topic revision: r1 - 15 Jul 2015, ProjectContributor
This site is powered by FoswikiCopyright © by the contributing authors. All material on this site is the property of the contributing authors.
Ideas, requests, problems regarding SedWiki? Send feedback