Convert entire Foswiki RCS databases from one character set to UTF-8
This module is used to convert the character set encoding used in
RcsWrap and RcsLite stores.
The character set encoding determines the range of characters that can be
used for naming wiki topics and attachments, and in content
stored in topics.
(To understand what this means on a technical level, read
Foswiki:Development.UnderstandingEncodings)
Before Foswiki 2.0, Foswiki had to be configured with a {Site}{CharSet},
which set the encoding used for characters in topic and attachment names,
and topic content.
The default encoding used by Foswiki before 2.0 was iso-8859-1, which was a
reasonable choice for many western languages. However, there are many
other languages (for example, Arabic, Chinese, Hebrew, Hindi) that have
characters that do not appear in this character set. Even some basic characters
like the euro symbol are missing from iso-8859-1. For this reason, Foswiki
has now moved to supporting the standard UTF-8 character encoding, which
is designed to support a very wide range of characters.
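The gap between the two encodings can be checked directly. The following is a minimal Python sketch, not part of this extension; the helper name can_encode is hypothetical. It uses Python's built-in codecs to test whether a string can be represented in a given charset.

```python
def can_encode(text: str, charset: str) -> bool:
    """Return True if every character of text exists in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    # Escapes keep the script's output ASCII-safe on any terminal.
    samples = {
        "e-acute": "\u00e9",      # present in iso-8859-1
        "euro sign": "\u20ac",    # missing from iso-8859-1
        "Hebrew alef": "\u05d0",  # missing from iso-8859-1
        "CJK middle": "\u4e2d",   # missing from iso-8859-1
    }
    for label, ch in samples.items():
        print(label, can_encode(ch, "iso-8859-1"), can_encode(ch, "utf-8"))
```

Every sample can be encoded as UTF-8, while only the accented Latin letter fits in iso-8859-1.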
Unfortunately, once you chose a {Site}{CharSet} and created a bunch
of topics, it became very risky to change because the charset is
associated with the entire database, and not with individual topics.
It was even possible to paste content in a different encoding into the
text editor and have it stored in that encoding, resulting in what looked
like garbled topics.
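The garbling described above is easy to reproduce. This is a small illustrative Python sketch (the function name misread is made up for this example, and it says nothing about how Foswiki itself reads topics): it stores text in one encoding and reads the same bytes back as another.

```python
def misread(text: str, stored_as: str, read_as: str) -> str:
    """Encode text as stored_as, then decode the same bytes as read_as.

    Bytes that are invalid in read_as become U+FFFD (the replacement character).
    """
    return text.encode(stored_as).decode(read_as, errors="replace")


if __name__ == "__main__":
    word = "caf\u00e9"  # "cafe" with an acute accent on the e
    # UTF-8 bytes read as iso-8859-1: the two-byte accented letter
    # turns into two unrelated characters.
    print(ascii(misread(word, "utf-8", "iso-8859-1")))
    # iso-8859-1 bytes read as UTF-8: the lone 0xE9 byte is not valid UTF-8.
    print(ascii(misread(word, "iso-8859-1", "utf-8")))
```

Plain ASCII text survives either way, which is why such problems often go unnoticed until accented or non-western characters appear.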
Ideally all Foswikis should use UTF-8, even those that are still using
older Foswikis, but we have a legacy of existing sites that don't. So we
need some way to convert an RCS-based wiki from any existing character
encoding to UTF-8.
And that's what this module provides. If you have a store that is:
- Set up to use some {Site}{CharSet} other than UTF-8
- Using a mixture of encodings in content
- Using RcsWrap or RcsLite as its {Store}{Implementation}
then this module can convert it to using UTF-8, including all the topic
histories.
Even if you don't have an immediate need for non-western character sets
this is worth doing, as Foswiki 2.0 and later work exclusively with
UTF-8 content.
Note that this module converts all the histories of all your topics,
as well as the latest version of the topic. It also maps all web,
topic and attachment names. It does not, however, touch the content of
attachments.
Installation
This extension is tested with Foswiki 1.1.0 and later. If your Foswiki
installation is older than that, then upgrade your Foswiki first.
Note that the extension is not required and is not recommended on Foswiki
2.0 or later. If your requirement is part of an upgrade to Foswiki 2.0,
then either:
- convert the 1.1.x Foswiki to UTF-8 using this extension first, or
- use tools/bulk_copy.pl, as recommended in the release notes.
Version 1.2 of this extension has limited support for Foswiki 2.0
systems. It can be used with caution and a backup to detect and correct
character set issues on a 2.0 installation.
You do not need to install anything in the browser to use this extension. The following instructions are for the administrator who installs the extension on the server.
Open configure, and open the "Extensions" section. Use "Find More Extensions" to get a list of available extensions. Select "Install".
If you have any problems, or if the extension isn't available in configure,
then you can still install manually from the command-line. See
http://foswiki.org/Support/ManuallyInstallingExtensions for more help.
Usage
The conversion process updates data in-place, and cannot be reversed. Be sure to take a backup before running this tool.
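One way to take that backup is to archive the data and pub directories before starting. The sketch below is only an illustration in Python: the function name backup_dirs and the paths in the usage comment are assumptions, and any backup method you already trust (tar, rsync, filesystem snapshots) serves equally well.

```python
import tarfile
from pathlib import Path


def backup_dirs(dirs, archive_path):
    """Archive each directory in dirs into one gzip-compressed tar file."""
    archive_path = Path(archive_path)
    with tarfile.open(str(archive_path), "w:gz") as tar:
        for d in dirs:
            d = Path(d)
            if not d.is_dir():
                raise FileNotFoundError(f"not a directory: {d}")
            # Store entries under the directory's own name (e.g. data/...).
            tar.add(str(d), arcname=d.name)
    return archive_path


# Hypothetical usage; adjust the paths to match {DataDir} and {PubDir}:
# backup_dirs(["/var/www/foswiki/data", "/var/www/foswiki/pub"],
#             "/root/foswiki-before-utf8.tar.gz")
```

Keep the archive outside the directories being archived, and verify that it can be read back before running the conversion.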
The convertor is used from the command-line on your wiki server (if you do
not have access to the command line then we are sorry, but there is currently
no way for you to run the conversion).
To use the convertor,
- first shut down your site. You don't want anyone modifying topics while it is running.
- then cd to the tools directory in your installation and run perl convert_charset.pl -i.
- If that runs cleanly without reporting any errors, you can run the script again without -i to perform the conversion.
The script will convert the Foswiki RCS database pointed at by
{DataDir} and {PubDir} from the existing character set (as set
by {Site}{CharSet}) to UTF-8.
Options:
Option | Description |
---|---|
-i | info - report what would be done only, do not convert anything |
-q | quiet - work silently (unless there's an error) |
-a | abort - on error (default is to report and continue) |
-r | repair - detect the encoding of each string and repair inconsistencies. |
 | Expert options |
-web=webname | Restrict conversion to a single web and its subwebs. |
-encoding=charset | Override the source encoding. (Required if running the conversion on Foswiki 2.0.) |
Only use -r if your site may contain content which cannot be decoded
using the {Site}{CharSet} (if this is the case, -i will abort with an
error).
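To illustrate what "cannot be decoded" means, here is a minimal Python sketch. It is not the detection logic used by convert_charset.pl, and the function name first_undecodable is an assumption. It performs a strict decode and reports the byte offset where decoding first fails. Note that Python's iso-8859-1 codec maps every byte value to a character, so under that charset a strict decode in Python never fails, even when the bytes were really written in another encoding.

```python
def first_undecodable(raw: bytes, charset: str):
    """Return the offset of the first byte that fails a strict decode, else None."""
    try:
        raw.decode(charset, errors="strict")
    except UnicodeDecodeError as err:
        return err.start
    return None


if __name__ == "__main__":
    latin1_bytes = b"caf\xe9"    # accented e as a single iso-8859-1 byte
    utf8_bytes = b"caf\xc3\xa9"  # the same word encoded as UTF-8
    for charset in ("utf-8", "iso-8859-1", "ascii"):
        print(charset,
              first_undecodable(latin1_bytes, charset),
              first_undecodable(utf8_bytes, charset))
```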
If the -r option is given, then any number of additional repair options
can follow. These are of two types:
- detected-encoding=actual-encoding
- topic-path=actual-encoding
The first allows you to override the encoding of all strings detected as
detected-encoding, while the second allows you to select an individual topic
and override the encoding of the content of just that topic. If you need to
override the encoding of a web or topic name, use :N after the topic-path,
e.g. Sandbox/NorthKorea:N=EUC-KR
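The shape of these override arguments can be made concrete with a small parser. This is a hypothetical Python sketch, not the parsing done by convert_charset.pl: the names parse_override and Override are invented, and the rule used to tell the two forms apart (a "/" on the left-hand side means a topic path) is an assumption made purely for illustration.

```python
from typing import NamedTuple


class Override(NamedTuple):
    kind: str      # "encoding", "content" or "name"
    target: str    # detected encoding, or web/topic path
    encoding: str  # encoding to use instead


def parse_override(arg: str) -> Override:
    """Split one repair override of the form left=actual-encoding."""
    left, sep, encoding = arg.rpartition("=")
    if not sep or not left or not encoding:
        raise ValueError(f"expected left=actual-encoding, got {arg!r}")
    if left.endswith(":N"):
        # :N marks an override for the web or topic name itself.
        return Override("name", left[:-2], encoding)
    if "/" in left:
        # Assumption: a slash means a Web/Topic path, so override its content.
        return Override("content", left, encoding)
    # Otherwise treat the left-hand side as a detected encoding.
    return Override("encoding", left, encoding)


if __name__ == "__main__":
    for arg in ("iso-8859-1=cp1252",
                "Sandbox/NorthKorea=EUC-KR",
                "Sandbox/NorthKorea:N=EUC-KR"):
        print(parse_override(arg))
```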
Although this extension is intended for use on Foswiki 1.1, there may be cases
where an individual web requires conversion on a Foswiki 2.0 system, such as
a single web migrated at a later date from an older system. For
example, to convert the Oops web from iso-8859-1 on a system already converted
to utf-8:
perl convert_charset.pl -web=Oops -encoding=iso-8859-1 -i
Use extreme caution converting individual webs. Foswiki does
not support mixed encoding.
Once you have run the script without -i, all:
- web names
- topic names
- attachment names
- topic content
will be converted to UTF-8. The conversion is performed in place on the data
and pub directories.
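As a rough picture of what converting content in place involves, the Python sketch below re-encodes a single plain text file to UTF-8. It is an illustration under stated assumptions, not the extension's implementation: the function name transcode_file and its info_only flag (loosely modelled on -i) are made up, and it ignores topic histories and the renaming of webs, topics and attachments.

```python
from pathlib import Path


def transcode_file(path, source_charset: str, info_only: bool = True) -> bool:
    """Re-encode a text file from source_charset to UTF-8.

    Returns True if the file's bytes differ after conversion. With
    info_only=True the file is never written, only inspected.
    """
    path = Path(path)
    raw = path.read_bytes()
    # A strict decode raises UnicodeDecodeError if the bytes do not fit the charset.
    converted = raw.decode(source_charset).encode("utf-8")
    changed = converted != raw
    if changed and not info_only:
        path.write_bytes(converted)
    return changed
```

Pure ASCII files come out unchanged, since ASCII bytes are identical in iso-8859-1 and UTF-8. Running the function a second time on an already converted file would double-encode it, which is one reason the source encoding has to be known before anything is written.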
Note that no conversion is performed on:
- log files
- working/
- temporary files
- password files
- links to attachments that were entity encoded
Once conversion is complete you must change your {Site}{CharSet} to 'utf-8'.
Info
Change History:
Version | Description |
---|---|
1.4 (15 Sep 2015) | Foswikitask:Item13702 - Actually use the encoding detected by -r repair option. |
1.3 (15 Jul 2015) | Foswikitask:Item13523 - Better job of detecting Foswiki 2.0. |
1.2 (1 Jun 2015) | Foswikitask:Item13442 - Add repair option to detect exceptions to the encoding. Add limited support for Foswiki 2.0. Add more flexible overrides for detected encoding. |
1.1 (11 Jun 2014) | |
Dependencies
Name | Version | Description |
---|---|---|
File::Copy | >0 | Required |
Foswiki | >=1.1 | Required |