problem with foreign letters in names apparently from crossref,
Mike Marchywka
marchywka at hotmail.com
Thu Feb 10 01:49:38 CET 2022
On Tue, Feb 08, 2022 at 08:06:01AM -0600, Don Hosek wrote:
> On 8 Feb 2022, at 02:59, texhax-request at tug.org wrote:
>
> My c++ code uses a lot of typedefed strings that I guess could be easily
> set to use wide characters and I have a char class parser that is perfectly
> general up to at least int-size chars, probably. However, I still use 8-bit char
> for characters in places, routinely test based on ASCII etc.
>
> UTF-8 uses 8-bit values exclusively, but non-ASCII characters are represented as multi-byte sequences. So, for
> example, ç (c-cedilla) is codepoint U+00E7, but in UTF-8 this will be two bytes: 0xC3 0xA7 (in some circumstances you may
> also encounter the semantically equivalent c + combining cedilla, which will be c then U+0327, which is represented in UTF-8
> as 0xCC 0xA7).
>
> Pretty much any new code that deals with generalized inputs should assume that its input is UTF-8-encoded. UTF-8 has the
> advantage that for any multi-byte sequence, the starting byte can always be identified as such (so the classic interview
> problem of reverse this string can almost* be handled by determining whether a byte is a starting byte for a sequence or a
> continuation byte—the classic interview problem’s contrived nature begins to show itself in the 21st century).
>
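The start-byte property described in the quoted text can be sketched in C++ roughly as follows. This is only a minimal illustration, assuming valid UTF-8 input; the names is_continuation and reverse_utf8 are made up for this example and do not come from utfcpp or any other library. It reverses by code point, so it does not handle the combining-sequence caveat in the footnote.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True for UTF-8 continuation bytes, which always look like 10xxxxxx.
inline bool is_continuation(unsigned char b) { return (b & 0xC0) == 0x80; }

// Reverse a string code point by code point. Assumes valid UTF-8 input;
// combining sequences such as c + U+0327 are NOT kept together.
std::string reverse_utf8(const std::string& s) {
    std::string out;
    out.reserve(s.size());
    std::size_t end = s.size();
    while (end > 0) {
        std::size_t start = end - 1;
        // Walk back over continuation bytes until the lead byte is found.
        while (start > 0 && is_continuation(static_cast<unsigned char>(s[start]))) {
            --start;
        }
        out.append(s, start, end - start);  // copy the whole sequence in order
        end = start;
    }
    return out;
}

// Small self-check: "a" followed by c-cedilla (0xC3 0xA7) reverses to
// c-cedilla followed by "a", with the two bytes left in their original order.
inline void reverse_utf8_selfcheck() {
    assert(reverse_utf8("a\xC3\xA7") == "\xC3\xA7" "a");
}
```

A plain byte-wise reverse would instead emit 0xA7 0xC3, which is not valid UTF-8.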
Thanks. I thought this was going to be another infinite time sink
but it turned out to resolve pretty easily. This may be a bit
off topic but relevant to dealing with almost-ASCII lol.
It looks like I had two problems. The first was that the html parser,
despite being invoked with the same encoding flag, produced one output
from the c file input and a different one from the c++ ifstream. Changing
the encoding parameter to indicate UTF8 helped a lot. The second problem,
though was this char-class parser I have that tries to break up
a string into groups of chars of the same class - letters, digits, whatever.
This works well for a lot of ad hoc stuff; although the logic afterward
may be a bit contorted, it's often easier than trying to make
a custom parser. I have a bunch of included and user-defined
bits for things like printable, alpha, upper case, etc.
The UTF8 sequences were singled out as breaks in the groups of letters. I
found for now it is easier just to piece them back together, although
I could modify the char class parser thing ( or just check the high bit )
to note an "atomic" group...
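The "check the high bit" idea could look something like the sketch below. It is not the actual char-class parser discussed here; CharClass, classify and split_by_class are hypothetical names, and it makes the loud assumption that every byte with the high bit set counts as a letter, which keeps multi-byte UTF-8 sequences inside a run of letters but also misclassifies non-ASCII punctuation.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

enum class CharClass { Letter, Digit, Space, Other };

// Classify one byte. ASSUMPTION: any byte >= 0x80 (UTF-8 lead or
// continuation byte) is treated as a letter so sequences stay "atomic".
inline CharClass classify(unsigned char b) {
    if (b >= 0x80) return CharClass::Letter;
    if (std::isalpha(b)) return CharClass::Letter;
    if (std::isdigit(b)) return CharClass::Digit;
    if (std::isspace(b)) return CharClass::Space;
    return CharClass::Other;
}

// Split a string into maximal runs of bytes that share a class.
std::vector<std::string> split_by_class(const std::string& s) {
    std::vector<std::string> groups;
    CharClass prev = CharClass::Other;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const CharClass cur = classify(static_cast<unsigned char>(s[i]));
        if (i == 0 || cur != prev) groups.emplace_back();  // start a new run
        groups.back().push_back(s[i]);
        prev = cur;
    }
    return groups;
}

// Small self-check: a name containing c-cedilla (0xC3 0xA7) stays one group.
inline void split_by_class_selfcheck() {
    const std::vector<std::string> g = split_by_class("Fran\xC3\xA7" "ois 2022");
    assert(g.size() == 3);
    assert(g[0] == "Fran\xC3\xA7" "ois");
}
```

With an ASCII-only alpha test, the same input would break into "Fran", the two stray bytes, and "ois", which is the splitting described above.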
After all that, it mostly seems to work now and I'm playing with
the email server. I had to add a bunch of things to remove
all the debug output ( all of that is a macro, but I just
left it in, adding a global var for gating it out at runtime ).
Once I started looking at it though I thought about creating
a simple text "ad generator" lol. Creating customized ads on the
fly in the context of a user query and current events
is kind of interesting and there are a lot of real time news
feeds to use :) However, I do have important stuff to do ...
> Depending on the level of support you need,
>
> http://utfcpp.sourceforge.net
>
> should suffice for your needs. I’m guessing though that the biggest problem you’re running into is that you’re likely
> running your code under Windows and haven’t done whatever is necessary to communicate to the OS that the program is
> outputting UTF-8 which might be all that’s necessary.
>
> -dh
>
> * The almost comes into play when you remember that Unicode allows for combining character sequences like the
> above-mentioned c+combining cedilla, not to mention oddities like some emojis which are generated by combining sequences
> like bear+ZWJ+snowflake = polar bear, flags which are (mostly) two-character sequences of regional indicator letters, etc.
--
mike marchywka
306 charles cox
canton GA 30115
USA, Earth
marchywka at hotmail.com
404-788-1216
ORCID: 0000-0001-9237-455X