Skip to content

Conversation

@mrbean-bremen
Copy link

@mrbean-bremen mrbean-bremen commented Aug 8, 2025

  • add DcmCharString::getVM() and getOFString(), which handle multi-byte charsets
  • DcmByteString::containsExtendedCharacters(): add check for ESCAPE characters (only allowed in code extensions)
  • removed obsolete DcmCharString::containsExtendedCharacters()

This is not finished, there are some more methods that probably need to be adapted (checkStringValue, verify, getOFStringArray, putOFStringAtPos, writeJson), but I wanted to get some feedback if this is the correct way to do this.

EDIT: added support for putOFStringAtPos, getOFStringArray and writeJson should work out of the box, and the static functions checkStringValue and verify are somewhat out of scope here.

As mentioned in a comment - another way to handle this would be to always convert the string value before calling these methods (and cache the converted value). This would probably make the handling easier, but I'm somewhat reluctant because of the needed caching and cache invalidation, and the possible performance impact (though I haven't done any performance tests).

OFCondition DcmCharString::getIndexOfPosition(const long pos, const char*& start, const char*& end, unsigned long& vm)
{
OFBool result = OFFalse;
// the vast majority of values have VM 0 or 1, so optimize for these
Copy link
Author

@mrbean-bremen mrbean-bremen Aug 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not too happy with this helper method, but I also didn't want to copy the relevant code into both getVM() and getOFString().
I'm also not sure about the performance impact. My assumption is that the actual character set checks are only needed for very few tags (multi-valued tags with a single-value multi-byte encodings or with code extensions).

I've also tried to cache the positions of the values in the string, but this got ugly and I reverted it. Another idea would be to always convert the string to utf-8 before calling any of the involved functions, and cache this value. That would make the handling easier. Not sure about this, either, though.

// CHECK_BAD ( "LO-07", DcmLongString::checkStringValueu("OFFIS e.V., Escherweg 2, 26121 Oldenburg, Germany, http://www.offis.de/", "1") )
CHECK_GOOD( "LO-08", DcmLongString::checkStringValue("\\ _2_ \\ _3_ \\ _4_ \\ _5_ \\", "6") )
CHECK_GOOD( "LO-09", DcmLongString::checkStringValue("ESC\033aping", "1") )
CHECK_BAD( "LO-09", DcmLongString::checkStringValue("ESC only allowed for charset extension \033", "1") )
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed these tests, as the ESCAPE character is only allowed for the escape sequence, as per PS 3.5, 6.1.3:

The ESC character shall be used only for ISO 2022 character set control sequences, in accordance with Section 6.1.2.5.

/* only check if parameter is true since derived VRs are not affected
by the attribute SpecificCharacterSet (0008,0005) */
if (checkAllStrings)
if (checkAllStrings || isAffectedBySpecificCharacterSet())
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this check makes the separate method in DcmCharString obsolete

@mrbean-bremen mrbean-bremen force-pushed the charstr_vm branch 3 times, most recently from cbfa232 to c1479ed Compare August 8, 2025 15:00
@mrbean-bremen
Copy link
Author

mrbean-bremen commented Aug 8, 2025

I keep getting unrelated changes in the commit, due to the RCS tags... have rebased it a couple of times therefore.

@jriesmeier
Copy link
Member

@mrbean-bremen Thank you for your PR. As I already wrote in the DCMTK forum, I like the idea and your approach in principle. And, it seems to work, which is also nice :-)

However, what I don't like is the "bulkiness" of the DcmCharString::getIndexOfPosition() function, i.e. its length and that it serves as a multi-purpose function, although I understand that it helps to avoid code duplication. What I also do not like is that character set-specific code is "hidden" in this bulky function, e.g. the special handling of "GBK" and "GB18030". Maybe, this could be moved to DcmSpecificCharacterSet?

There are also other minor issues (e.g. naming conventions, source code formatting, use of C++ type casts, missing Autoconf support) that we could fix when the patch would be incorporated into the "testing" branch of the DCMTK.

@mrbean-bremen
Copy link
Author

However, what I don't like is the "bulkiness" of the DcmCharString::getIndexOfPosition() function, i.e. its length and that it serves as a multi-purpose function, although I understand that it helps to avoid code duplication

I completely agree, I didn't like it either. I started with refactoring it, but didn't get to a point where it was much better yet (and I have been traveling meanwhile and didn't get back to it).
I guess we may have to live with some code duplication, and check for any performance impact afterwards.

What I also do not like is that character set-specific code is "hidden" in this bulky function, e.g. the special handling of "GBK" and "GB18030". Maybe, this could be moved to DcmSpecificCharacterSet?

I did this in the beginning, then moved it back as this has been only very trivial stuff used privately, but this certainly makes sense.

There are also other minor issues (e.g. naming conventions, source code formatting, use of C++ type casts, missing Autoconf support) that we could fix when the patch would be incorporated into the "testing" branch of the DCMTK.

Sure. I tend to forget that this still has to support some ancient compilers, and while I'm trying to use the same format as the rest of the code, I sometimes slip. As for naming conventions and autoconf support - I have to check that, and may need some support/suggestions at a later point.

I hope to be able to put something together over the weekend.

@mrbean-bremen
Copy link
Author

@jriesmeier - I did some refactoring without any functional change, but this is still work in progress.

I removed the bulky getIndexOfPosition, and use a few other helper functions instead.
I added the virtual function findNextComponentPosition to do most of the character set specific stuff, with the intention to use it also in a few other places that currently don't support this (like checkStringValue, verify, getOFStringArray), though I haven't done this yet. Maybe this can be done in a separate PR later to keep this one manageable.
I'm not too happy about the interface yet, glad for any ideas.

I could not find a nice way to move stuff to DcmSpecificCharacterSet yet.
I also did not think about optimization yet (well, I did think about it, but decided to wait with that until the implementation is more clear).

About formatting:
I tried to use the formatting as in most in the rest of dcchrstr.cc (there are some other styles in some other files, so I just checked here). There are still different styles used in this file, so I just picked the one used more (indentation 4, not 2, curly brackets on a separate line, and a few others). I didn't see any coding guideline or style guide - if there is one, please point me to it.
A .clang-format file would also be nice, as this is supported by most IDEs.

Thank you for any feedback and ideas!

michaelonken pushed a commit that referenced this pull request Sep 10, 2025
Thanks to GitHub user "mrbean-bremen" for the report (see PR #124).
@jriesmeier
Copy link
Member

jriesmeier commented Sep 10, 2025

Thank you for the revision. We'll have a look at it (as time permits).
By the way, I already fixed the typo that you spotted ;) See commit 7d7c8c5.

About formatting:
I tried to use the formatting as in most in the rest of dcchrstr.cc (there are some other styles in some other files, so I just checked here). There are still different styles used in this file, so I just picked the one used more (indentation 4, not 2, curly brackets on a separate line, and a few others). I didn't see any coding guideline or style guide - if there is one, please point me to it.
A .clang-format file would also be nice, as this is supported by most IDEs.

Unfortunately, we don't have that, mainly for historical reasons, but also because we could not agree on a common source code formatting / coding style.

Our "golden rules" are.

  1. When modifying an existing file, try to stick to the existing formatting/style, i.e. the one that is predominantly used.
  2. When adding a new file to an existing module, use the formatting/style that is typically used in the existing files of this module.
  3. When creating a new module, you are "free" to use your preferred formatting/style, but try to be consistent over time and not to change your style for each new module :)

This is far from perfect, of course, but it's not the most important thing when it comes to a toolkit that has been developed and maintained for about 30 years.

Ideally, we should have a "development howto", but an up-to-date version is also still in the pipe...

@mrbean-bremen
Copy link
Author

Thank you for the revision. We'll have a look at it (as time permits).

Thank you - there's no rush, of course.

By the way, I already fixed the typo that you spotted ;)

Thanks - I forgot that I changed that :)

Unfortunately, we don't have that, mainly for historical reasons, but also because we could not agree on a common source code formatting / coding style.

Uh, formatting wars... I generally don't care much about a concrete formatting style (maybe with a the exception of Python, where an official style exists), but I try to get some styleguide agreed on, which can be enforced automatically (for example using clang-format). I just don't want to think about formatting, I'm lazy...

Our "golden rules" are. ...

That's pretty much what I'm trying to do (though I may not always succeed).

@mrbean-bremen mrbean-bremen force-pushed the charstr_vm branch 2 times, most recently from 5fecb35 to 6bfea47 Compare November 9, 2025 07:56
@mrbean-bremen
Copy link
Author

I've got back to this PR after ignoring it for some time and made a few changes, notably added support for multi-byte strings in putOFStringAtPos(). The rest of the member functions should not need changes in this respect, with the exception of the static functions verify() and checkStringValue(), which I consider out of context for this PR.
I also rebased against master, so this should be up to date.

I'm still ignoring potential performance impacts (don't want to do premature optimization if not needed), though would appreciate any feedback.

@mrbean-bremen
Copy link
Author

CC @jriesmeier

@jriesmeier
Copy link
Member

Thank you for the reminder, but I already receive notification mails on this discussion :-)

It's not that I didn't want to respond more quickly, but I'm currently very busy with other things. So I ask for your patience (once again).

And thank you again for your contribution!

@mrbean-bremen
Copy link
Author

mrbean-bremen commented Nov 14, 2025

Sure, no problem - I wasn't sure about the notifications, sometimes people switch them on only for mentions. Didn't want to pressure you, sorry for the spam :)

michaelonken pushed a commit that referenced this pull request Dec 5, 2025
Thanks to GitHub user "mrbean-bremen" for the original report and
proposed patch (see PR #124).
@mrbean-bremen mrbean-bremen force-pushed the charstr_vm branch 2 times, most recently from c0a74a9 to c90b25e Compare December 17, 2025 15:43
: DcmByteString(tag, len)
{
if (len > 0)
getSpecificCharacterSet(specificCharacterSet);
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It turned out that getSpecificCharacterSet() cannot be called for example from getVM(), because it resets the current position of the item pointer, and iterating over the item elements as done in dcmdump (which also calls getVM()) caused a recursion (e.g. dcmdump started the listing from the first tag after calling this).
I'm not sure if calling it in the constructor, as I did now, is the best solution though.

I actually run into this while checking the issue in Bug #684, mentioned by @malaterre (also related to encodings), and noticed that indeed the UTF-8 converter does not handle multi-byte encodings like GBK (it handles code extensions though). Unfortunately, this is not related to the current changes, but could probably be fixed relatively easy independently.

Copy link
Member

@jriesmeier jriesmeier Dec 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mrbean-bremen Where does DcmCharString::getSpecificCharacterSet() reset the current position of the item pointer? The DcmItem *item is a local variable in this method.

Anyway, calling getSpecificCharacterSet() in the constructor is certainly not a good idea, as it could slow down the construction. However, in case of the above constructor, it makes no sense at all, as the instance of the class DcmCharString has not yet been added to an DcmItem or DcmDataset instance. Therefore, a Specific Character Set (0008,0005) will never be defined (as it would be part of the surrounding item/dataset).

Btw, you forgot to set the value of specificCharacterSet in the assignment operator of this class. But as mentioned before, your current approach does not make much sense anyway ;-)

Copy link
Author

@mrbean-bremen mrbean-bremen Dec 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I realized that this does not make any sense - I reverted it locally, will push it shortly.

The problem with the item: item is a local variable, but it points to the parent dataset, which is not local.
Dcmdump uses DcmItem::print, which iterates over the dataset using elementList->seek(ELP_next). This sets currentPosition in its elementList.
Somewhere in this code (have to check where), getVM() is also called, which for non-trivial strings calls getSpecificCharacterSet(), which on its part iterates over the same dataset to find the tag, thereby resetting the same currentPosition variable and causing the next seek to find the element after the SpecificCharacterSet, causing the loop.
This happens in findAndGetOFStringArray, which calls findAndGetElement, which calls search and again elementList->seek, resetting currentPosition.

I did not expect this, and I haven't understood yet how to avoid this. Obviously need your help here...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, now I got it.

Currently, getSpecificCharacterSet() searches for the Specific Character Set (0008,0005) attribute in the surrounding dataset/item. This search is based on the internal DcmList instance, which uses an internal cursor/iterator and is a class member.

A big solution would be to improve DcmList by supporting external iterators or, which would be an even bigger task, to replace DcmList with some standard container class, without compromising performance.

A smaller solution could be something like adding a caching mechanism for the value of (0008,0005) to DcmItem. Caching means that each time when the value of this attribute changes (set, added, deleted, read or the like), the cache value must be updated.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! Caching Specific Character Set in the item sounds sensible.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A separate commit (or even a separate PR) for the caching feature would be nice, so that we could merge it into the public DCMTK separately. The caching could also be helpful for the current dcmdata implementation.

@mrbean-bremen mrbean-bremen force-pushed the charstr_vm branch 2 times, most recently from 3a007ff to a8044bb Compare December 18, 2025 20:04
// check whether a byte looking like a delimiter inside a multi-byte character is not handled as delimiter
// 0x5c is the byte for a backslash in single-byte encodings, but here part of two Kanji characters
OFCHECK(converter.convertString("Noriwa=\x81\x5c\x82\x5c", resultStr, delimiters).good());
OFCHECK_EQUAL(resultStr, "Noriwa=\xe4\xb9\x97\xe4\xbf\x93");
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The string is taken from Bug #684 (from the mail from @malaterre).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that I have added a separate commit, and reordered the commits, so that the first (small) commit only fixes the specific problem from bug #684, independent from the rest of the changes (in case you want to use it).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot. I will check your commit 0d048a2 and make the necessary changes to the DCMTK "testing" branch in the next few days.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mrbean-bremen I used major parts of your commit, modified a few details and committed the changes as commit 9bcc16c. I did not fully close DCMTK Bug #684, as I am still not 100% sure about "ISO 2022 IR 87".

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot!
I actually had just assumed that you had already handled "ISO 2022 IR 87", as you had added handling for multi-valued character sets for this issue, I didn't actually check it.
I will have a look to check if this actually works.

Copy link
Author

@mrbean-bremen mrbean-bremen Dec 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm... I found the following in the test code:

#if DCMTK_ENABLE_CHARSET_CONVERSION == DCMTK_CHARSET_CONVERSION_ICONV
        // (only the ICONV library supports this character set)
        // the following should fail
        OFCHECK(converter.selectCharacterSet("\\ISO 2022 IR 87", "ISO_IR 100").good());
        OFCHECK(converter.convertString("Yamada^Tarou=\033$B;3ED\033(B^\033$BB@O:\033(B=\033$B$d$^$@\033(B^\033$B$?$m$&\033(B", resultStr).bad());
#endif

This looks as if dcmtk internally does not support "ISO 2022 IR 87" for conversion. Also, it checks for the conversion to fail, not to succeed, so I'm not sure what this is actually testing if running with iconv.

The conversion code itself looks fine, it should be able to properly handle delimiter-like bytes in a multi-byte character, as far as I can see.

Copy link
Member

@jriesmeier jriesmeier Dec 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for pointing this out. It seems that the test cases have not been updated when "oficonv" was introduced. I will update the code of "tspchrs.cc" accordingly.

Background to this #if check: Not all character set conversion libraries that DCMTK supports (or previously supported) implement proper support for these Japanese character sets.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, I also checked this test with oficonv (e.g. removed the #ifdef), and it passed - meaning the conversion failed. It looks like oficonv indeed does not support this encoding, though I haven't looked further into this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean: oficonv does support this encoding?

In the meantime, the changes have been pushed to the public "master" branch (see commit e82d4e2).

Copy link
Author

@mrbean-bremen mrbean-bremen Dec 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it doesn't, that was I meant - but didn't understand the meaning of the test. With your changes, it is clearer now.
As for that:

I did not fully close DCMTK Bug #684, as I am still not 100% sure about "ISO 2022 IR 87".

I didn't test this with iconv, but I'm quite sure (looking at the code) that the problem will not appear there (multi-byte characters are properly handled).

BTW, I would have expected a positive test (that the encoding works with iconv), did I miss it? Or don't your tests run with iconv? That would also be the place to add a test to ensure that the delimiter problem does not appear with that encoding.

EDIT: I'm still confused - the conversion doesn't work with iconv either, according to the #ifdef - is that correct?

@mrbean-bremen mrbean-bremen force-pushed the charstr_vm branch 2 times, most recently from 2369686 to 29d22c0 Compare December 19, 2025 09:02
- add DcmCharString::getVM(), getOFString() and putOFStringAtPos(), which handle multi-byte charsets
- DcmByteString::containsExtendedCharacters():
  add check for ESCAPE characters (only allowed in code extensions)
- removed obsolete DcmCharString::containsExtendedCharacters()
michaelonken pushed a commit that referenced this pull request Dec 28, 2025
Enabled tests of the Japanese character sets "ISO 2022 IR 87" and "ISO
2022 IR 159" when using the "oficonv" conversion library.

Thanks to GitHub user "mrbean-bremen" for pointing this out (see PR #124).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants