DcmCharString: add some support for multi-byte characters #124

mrbean-bremen · 2025-08-08T09:26:25Z

add DcmCharString::getVM() and getOFString(), which handle multi-byte charsets
DcmByteString::containsExtendedCharacters(): add check for ESCAPE characters (only allowed in code extensions)
removed obsolete DcmCharString::containsExtendedCharacters()

This is not finished, there are some more methods that probably need to be adapted (checkStringValue, verify, getOFStringArray, putOFStringAtPos, writeJson), but I wanted to get some feedback if this is the correct way to do this.

EDIT: added support for putOFStringAtPos, getOFStringArray and writeJson should work out of the box, and the static functions checkStringValue and verify are somewhat out of scope here.

As mentioned in a comment - another way to handle this would be to always convert the string value before calling these methods (and cache the converted value). This would probably make the handling easier, but I'm somewhat reluctant because of the needed caching and cache invalidation, and the possible performance impact (though I haven't done any performance tests).

dcmdata/libsrc/dcchrstr.cc

dcmdata/libsrc/dcbytstr.cc

dcmdata/libsrc/dcchrstr.cc

mrbean-bremen · 2025-08-08T09:30:22Z

dcmdata/libsrc/dcchrstr.cc

+OFCondition DcmCharString::getIndexOfPosition(const long pos, const char*& start, const char*& end, unsigned long& vm)
 {
-    OFBool result = OFFalse;
+    // the vast majority of values have VM 0 or 1, so optimize for these


I'm not too happy with this helper method, but I also didn't want to copy the relevant code into both getVM() and getOFString().
I'm also not sure about the performance impact. My assumption is that the actual character set checks are only needed for very few tags (multi-valued tags with a single-value multi-byte encodings or with code extensions).

I've also tried to cache the positions of the values in the string, but this got ugly and I reverted it. Another idea would be to always convert the string to utf-8 before calling any of the involved functions, and cache this value. That would make the handling easier. Not sure about this, either, though.

mrbean-bremen · 2025-08-08T09:41:15Z

dcmdata/tests/tchval.cc

 //  CHECK_BAD ( "LO-07", DcmLongString::checkStringValueu("OFFIS e.V., Escherweg 2, 26121 Oldenburg, Germany, http://www.offis.de/", "1") )
  CHECK_GOOD( "LO-08", DcmLongString::checkStringValue("\\ _2_ \\ _3_ \\ _4_ \\ _5_ \\", "6") )
-  CHECK_GOOD( "LO-09", DcmLongString::checkStringValue("ESC\033aping", "1") )
+  CHECK_BAD( "LO-09", DcmLongString::checkStringValue("ESC only allowed for charset extension \033", "1") )


I changed these tests, as the ESCAPE character is only allowed for the escape sequence, as per PS 3.5, 6.1.3:

The ESC character shall be used only for ISO 2022 character set control sequences, in accordance with Section 6.1.2.5.

mrbean-bremen · 2025-08-08T09:44:10Z

dcmdata/libsrc/dcbytstr.cc

    /* only check if parameter is true since derived VRs are not affected
       by the attribute SpecificCharacterSet (0008,0005) */
-    if (checkAllStrings)
+    if (checkAllStrings || isAffectedBySpecificCharacterSet())


this check makes the separate method in DcmCharString obsolete

mrbean-bremen · 2025-08-08T15:02:52Z

I keep getting unrelated changes in the commit, due to the RCS tags... have rebased it a couple of times therefore.

jriesmeier · 2025-09-05T08:48:00Z

@mrbean-bremen Thank you for your PR. As I already wrote in the DCMTK forum, I like the idea and your approach in principle. And, it seems to work, which is also nice :-)

However, what I don't like is the "bulkiness" of the DcmCharString::getIndexOfPosition() function, i.e. its length and that it serves as a multi-purpose function, although I understand that it helps to avoid code duplication. What I also do not like is that character set-specific code is "hidden" in this bulky function, e.g. the special handling of "GBK" and "GB18030". Maybe, this could be moved to DcmSpecificCharacterSet?

There are also other minor issues (e.g. naming conventions, source code formatting, use of C++ type casts, missing Autoconf support) that we could fix when the patch would be incorporated into the "testing" branch of the DCMTK.

mrbean-bremen · 2025-09-05T11:56:51Z

However, what I don't like is the "bulkiness" of the DcmCharString::getIndexOfPosition() function, i.e. its length and that it serves as a multi-purpose function, although I understand that it helps to avoid code duplication

I completely agree, I didn't like it either. I started with refactoring it, but didn't get to a point where it was much better yet (and I have been traveling meanwhile and didn't get back to it).
I guess we may have to live with some code duplication, and check for any performance impact afterwards.

What I also do not like is that character set-specific code is "hidden" in this bulky function, e.g. the special handling of "GBK" and "GB18030". Maybe, this could be moved to DcmSpecificCharacterSet?

I did this in the beginning, then moved it back as this has been only very trivial stuff used privately, but this certainly makes sense.

There are also other minor issues (e.g. naming conventions, source code formatting, use of C++ type casts, missing Autoconf support) that we could fix when the patch would be incorporated into the "testing" branch of the DCMTK.

Sure. I tend to forget that this still has to support some ancient compilers, and while I'm trying to use the same format as the rest of the code, I sometimes slip. As for naming conventions and autoconf support - I have to check that, and may need some support/suggestions at a later point.

I hope to be able to put something together over the weekend.

mrbean-bremen · 2025-09-07T18:15:51Z

@jriesmeier - I did some refactoring without any functional change, but this is still work in progress.

I removed the bulky getIndexOfPosition, and use a few other helper functions instead.
I added the virtual function findNextComponentPosition to do most of the character set specific stuff, with the intention to use it also in a few other places that currently don't support this (like checkStringValue, verify, getOFStringArray), though I haven't done this yet. Maybe this can be done in a separate PR later to keep this one manageable.
I'm not too happy about the interface yet, glad for any ideas.

I could not find a nice way to move stuff to DcmSpecificCharacterSet yet.
I also did not think about optimization yet (well, I did think about it, but decided to wait with that until the implementation is more clear).

About formatting:
I tried to use the formatting as in most in the rest of dcchrstr.cc (there are some other styles in some other files, so I just checked here). There are still different styles used in this file, so I just picked the one used more (indentation 4, not 2, curly brackets on a separate line, and a few others). I didn't see any coding guideline or style guide - if there is one, please point me to it.
A .clang-format file would also be nice, as this is supported by most IDEs.

Thank you for any feedback and ideas!

Thanks to GitHub user "mrbean-bremen" for the report (see PR #124).

jriesmeier · 2025-09-10T15:56:01Z

Thank you for the revision. We'll have a look at it (as time permits).
By the way, I already fixed the typo that you spotted ;) See commit 7d7c8c5.

About formatting:
I tried to use the formatting as in most in the rest of dcchrstr.cc (there are some other styles in some other files, so I just checked here). There are still different styles used in this file, so I just picked the one used more (indentation 4, not 2, curly brackets on a separate line, and a few others). I didn't see any coding guideline or style guide - if there is one, please point me to it.
A .clang-format file would also be nice, as this is supported by most IDEs.

Unfortunately, we don't have that, mainly for historical reasons, but also because we could not agree on a common source code formatting / coding style.

Our "golden rules" are.

When modifying an existing file, try to stick to the existing formatting/style, i.e. the one that is predominantly used.
When adding a new file to an existing module, use the formatting/style that is typically used in the existing files of this module.
When creating a new module, you are "free" to use your preferred formatting/style, but try to be consistent over time and not to change your style for each new module :)

This is far from perfect, of course, but it's not the most important thing when it comes to a toolkit that has been developed and maintained for about 30 years.

Ideally, we should have a "development howto", but an up-to-date version is also still in the pipe...

mrbean-bremen · 2025-09-10T17:10:04Z

Thank you for the revision. We'll have a look at it (as time permits).

Thank you - there's no rush, of course.

By the way, I already fixed the typo that you spotted ;)

Thanks - I forgot that I changed that :)

Unfortunately, we don't have that, mainly for historical reasons, but also because we could not agree on a common source code formatting / coding style.

Uh, formatting wars... I generally don't care much about a concrete formatting style (maybe with a the exception of Python, where an official style exists), but I try to get some styleguide agreed on, which can be enforced automatically (for example using clang-format). I just don't want to think about formatting, I'm lazy...

Our "golden rules" are. ...

That's pretty much what I'm trying to do (though I may not always succeed).

mrbean-bremen · 2025-11-09T08:06:46Z

I've got back to this PR after ignoring it for some time and made a few changes, notably added support for multi-byte strings in putOFStringAtPos(). The rest of the member functions should not need changes in this respect, with the exception of the static functions verify() and checkStringValue(), which I consider out of context for this PR.
I also rebased against master, so this should be up to date.

I'm still ignoring potential performance impacts (don't want to do premature optimization if not needed), though would appreciate any feedback.

mrbean-bremen · 2025-11-13T10:10:11Z

CC @jriesmeier

jriesmeier · 2025-11-14T10:31:19Z

Thank you for the reminder, but I already receive notification mails on this discussion :-)

It's not that I didn't want to respond more quickly, but I'm currently very busy with other things. So I ask for your patience (once again).

And thank you again for your contribution!

mrbean-bremen · 2025-11-14T12:03:03Z

Sure, no problem - I wasn't sure about the notifications, sometimes people switch them on only for mentions. Didn't want to pressure you, sorry for the spam :)

Thanks to GitHub user "mrbean-bremen" for the original report and proposed patch (see PR #124).

mrbean-bremen · 2025-12-17T16:13:56Z

dcmdata/libsrc/dcchrstr.cc

  : DcmByteString(tag, len)
 {
+    if (len > 0)
+        getSpecificCharacterSet(specificCharacterSet);


It turned out that getSpecificCharacterSet() cannot be called for example from getVM(), because it resets the current position of the item pointer, and iterating over the item elements as done in dcmdump (which also calls getVM()) caused a recursion (e.g. dcmdump started the listing from the first tag after calling this).
I'm not sure if calling it in the constructor, as I did now, is the best solution though.

I actually run into this while checking the issue in Bug #684, mentioned by @malaterre (also related to encodings), and noticed that indeed the UTF-8 converter does not handle multi-byte encodings like GBK (it handles code extensions though). Unfortunately, this is not related to the current changes, but could probably be fixed relatively easy independently.

@mrbean-bremen Where does DcmCharString::getSpecificCharacterSet() reset the current position of the item pointer? The DcmItem *item is a local variable in this method.

Anyway, calling getSpecificCharacterSet() in the constructor is certainly not a good idea, as it could slow down the construction. However, in case of the above constructor, it makes no sense at all, as the instance of the class DcmCharString has not yet been added to an DcmItem or DcmDataset instance. Therefore, a Specific Character Set (0008,0005) will never be defined (as it would be part of the surrounding item/dataset).

Btw, you forgot to set the value of specificCharacterSet in the assignment operator of this class. But as mentioned before, your current approach does not make much sense anyway ;-)

Yes, I realized that this does not make any sense - I reverted it locally, will push it shortly.

The problem with the item: item is a local variable, but it points to the parent dataset, which is not local.
Dcmdump uses DcmItem::print, which iterates over the dataset using elementList->seek(ELP_next). This sets currentPosition in its elementList.
Somewhere in this code (have to check where), getVM() is also called, which for non-trivial strings calls getSpecificCharacterSet(), which on its part iterates over the same dataset to find the tag, thereby resetting the same currentPosition variable and causing the next seek to find the element after the SpecificCharacterSet, causing the loop.
This happens in findAndGetOFStringArray, which calls findAndGetElement, which calls search and again elementList->seek, resetting currentPosition.

I did not expect this, and I haven't understood yet how to avoid this. Obviously need your help here...

OK, now I got it.

Currently, getSpecificCharacterSet() searches for the Specific Character Set (0008,0005) attribute in the surrounding dataset/item. This search is based on the internal DcmList instance, which uses an internal cursor/iterator and is a class member.

A big solution would be to improve DcmList by supporting external iterators or, which would be an even bigger task, to replace DcmList with some standard container class, without compromising performance.

A smaller solution could be something like adding a caching mechanism for the value of (0008,0005) to DcmItem. Caching means that each time when the value of this attribute changes (set, added, deleted, read or the like), the cache value must be updated.

Thank you! Caching Specific Character Set in the item sounds sensible.

A separate commit (or even a separate PR) for the caching feature would be nice, so that we could merge it into the public DCMTK separately. The caching could also be helpful for the current dcmdata implementation.

mrbean-bremen · 2025-12-18T20:07:40Z

dcmdata/tests/tspchrs.cc

+        // check whether a byte looking like a delimiter inside a multi-byte character is not handled as delimiter
+        // 0x5c is the byte for a backslash in single-byte encodings, but here part of two Kanji characters
+        OFCHECK(converter.convertString("Noriwa=\x81\x5c\x82\x5c", resultStr, delimiters).good());
+        OFCHECK_EQUAL(resultStr, "Noriwa=\xe4\xb9\x97\xe4\xbf\x93");


The string is taken from Bug #684 (from the mail from @malaterre).

Note that I have added a separate commit, and reordered the commits, so that the first (small) commit only fixes the specific problem from bug #684, independent from the rest of the changes (in case you want to use it).

Thanks a lot. I will check your commit 0d048a2 and make the necessary changes to the DCMTK "testing" branch in the next few days.

@mrbean-bremen I used major parts of your commit, modified a few details and committed the changes as commit 9bcc16c. I did not fully close DCMTK Bug #684, as I am still not 100% sure about "ISO 2022 IR 87".

Thanks a lot!
I actually had just assumed that you had already handled "ISO 2022 IR 87", as you had added handling for multi-valued character sets for this issue, I didn't actually check it.
I will have a look to check if this actually works.

Hm... I found the following in the test code:

#if DCMTK_ENABLE_CHARSET_CONVERSION == DCMTK_CHARSET_CONVERSION_ICONV // (only the ICONV library supports this character set) // the following should fail OFCHECK(converter.selectCharacterSet("\\ISO 2022 IR 87", "ISO_IR 100").good()); OFCHECK(converter.convertString("Yamada^Tarou=\033$B;3ED\033(B^\033$BB@O:\033(B=\033$B$d$^$@\033(B^\033$B$?$m$&\033(B", resultStr).bad()); #endif

This looks as if dcmtk internally does not support "ISO 2022 IR 87" for conversion. Also, it checks for the conversion to fail, not to succeed, so I'm not sure what this is actually testing if running with iconv.

The conversion code itself looks fine, it should be able to properly handle delimiter-like bytes in a multi-byte character, as far as I can see.

Thank you for pointing this out. It seems that the test cases have not been updated when "oficonv" was introduced. I will update the code of "tspchrs.cc" accordingly.

Background to this #if check: Not all character set conversion libraries that DCMTK supports (or previously supported) implement proper support for these Japanese character sets.

BTW, I also checked this test with oficonv (e.g. removed the #ifdef), and it passed - meaning the conversion failed. It looks like oficonv indeed does not support this encoding, though I haven't looked further into this.

You mean: oficonv does support this encoding?

In the meantime, the changes have been pushed to the public "master" branch (see commit e82d4e2).

No, it doesn't, that was I meant - but didn't understand the meaning of the test. With your changes, it is clearer now.
As for that:

I did not fully close DCMTK Bug #684, as I am still not 100% sure about "ISO 2022 IR 87".

I didn't test this with iconv, but I'm quite sure (looking at the code) that the problem will not appear there (multi-byte characters are properly handled).

BTW, I would have expected a positive test (that the encoding works with iconv), did I miss it? Or don't your tests run with iconv? That would also be the place to add a test to ensure that the delimiter problem does not appear with that encoding.

EDIT: I'm still confused - the conversion doesn't work with iconv either, according to the #ifdef - is that correct?

- add DcmCharString::getVM(), getOFString() and putOFStringAtPos(), which handle multi-byte charsets - DcmByteString::containsExtendedCharacters(): add check for ESCAPE characters (only allowed in code extensions) - removed obsolete DcmCharString::containsExtendedCharacters()

Enabled tests of the Japanese character sets "ISO 2022 IR 87" and "ISO 2022 IR 159" when using the "oficonv" conversion library. Thanks to GitHub user "mrbean-bremen" for pointing this out (see PR #124).

michaelonken reviewed Aug 8, 2025

View reviewed changes

dcmdata/libsrc/dcchrstr.cc Outdated Show resolved Hide resolved

dcmdata/libsrc/dcbytstr.cc Outdated Show resolved Hide resolved

dcmdata/libsrc/dcchrstr.cc Outdated Show resolved Hide resolved

mrbean-bremen force-pushed the charstr_vm branch from 7547784 to 446136b Compare August 8, 2025 09:43

mrbean-bremen commented Aug 8, 2025

View reviewed changes

mrbean-bremen force-pushed the charstr_vm branch 3 times, most recently from cbfa232 to c1479ed Compare August 8, 2025 15:00

mrbean-bremen force-pushed the charstr_vm branch from c1479ed to f00883c Compare August 14, 2025 05:34

mrbean-bremen force-pushed the charstr_vm branch from f00883c to b61e80b Compare September 7, 2025 18:30

michaelonken pushed a commit that referenced this pull request Sep 10, 2025

Fixed typo in comment.

7d7c8c5

Thanks to GitHub user "mrbean-bremen" for the report (see PR #124).

mrbean-bremen force-pushed the charstr_vm branch 2 times, most recently from 5fecb35 to 6bfea47 Compare November 9, 2025 07:56

michaelonken pushed a commit that referenced this pull request Dec 5, 2025

Fixed wrong use of ESC in "LO" and "SH" test case.

6cb93fe

Thanks to GitHub user "mrbean-bremen" for the original report and proposed patch (see PR #124).

mrbean-bremen force-pushed the charstr_vm branch 2 times, most recently from c0a74a9 to c90b25e Compare December 17, 2025 15:43

mrbean-bremen commented Dec 17, 2025

View reviewed changes

mrbean-bremen force-pushed the charstr_vm branch 2 times, most recently from 3a007ff to a8044bb Compare December 18, 2025 20:04

mrbean-bremen commented Dec 18, 2025

View reviewed changes

mrbean-bremen force-pushed the charstr_vm branch 2 times, most recently from 2369686 to 29d22c0 Compare December 19, 2025 09:02

mrbean-bremen force-pushed the charstr_vm branch from 29d22c0 to 553d373 Compare December 25, 2025 15:05

DcmCharString: add some support for multi-byte characters #124

Are you sure you want to change the base?

DcmCharString: add some support for multi-byte characters #124

Uh oh!

Conversation

mrbean-bremen commented Aug 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

mrbean-bremen Aug 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mrbean-bremen commented Aug 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jriesmeier commented Sep 5, 2025

Uh oh!

mrbean-bremen commented Sep 5, 2025

Uh oh!

mrbean-bremen commented Sep 7, 2025

Uh oh!

jriesmeier commented Sep 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mrbean-bremen commented Sep 10, 2025

Uh oh!

mrbean-bremen commented Nov 9, 2025

Uh oh!

mrbean-bremen commented Nov 13, 2025

Uh oh!

jriesmeier commented Nov 14, 2025

Uh oh!

mrbean-bremen commented Nov 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jriesmeier Dec 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mrbean-bremen Dec 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mrbean-bremen Dec 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jriesmeier Dec 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mrbean-bremen commented Aug 8, 2025 •

edited

Loading

mrbean-bremen Aug 8, 2025 •

edited

Loading

mrbean-bremen commented Aug 8, 2025 •

edited

Loading

jriesmeier commented Sep 10, 2025 •

edited

Loading

mrbean-bremen commented Nov 14, 2025 •

edited

Loading

jriesmeier Dec 17, 2025 •

edited

Loading

mrbean-bremen Dec 17, 2025 •

edited

Loading

mrbean-bremen Dec 25, 2025 •

edited

Loading

jriesmeier Dec 27, 2025 •

edited

Loading

mrbean-bremen Dec 28, 2025 •

edited

Loading