problem with foreign letters in names apparently from crossref,
Mike Marchywka
marchywka at hotmail.com
Tue Feb 8 16:12:53 CET 2022
On Tue, Feb 08, 2022 at 08:06:01AM -0600, Don Hosek wrote:
> On 8 Feb 2022, at 02:59, texhax-request at tug.org wrote:
>
> My C++ code uses a lot of typedef'ed strings that I guess could easily
> be set to use wide characters, and I have a char-class parser that is probably
> general up to at least int-sized chars. However, I still use 8-bit char
> for characters in places, routinely test against ASCII, etc.
>
> UTF-8 uses 8-bit values exclusively, but non-ASCII characters are represented as multi-byte sequences. So, for
> example, ç (c-cedilla) is codepoint U+00E7, but in UTF-8 this will be two bytes: 0xC3 0xA7 (in some circumstances you may
> also encounter the semantically equivalent c + combining cedilla, which is c then U+0327, represented in UTF-8
> as 0xCC 0xA7).
lol, I'm right there now :) This is a big distraction for me; right now I just
wanted to accommodate the XML, and that looks like it is coming together
as I have all the pieces.
If you are interested, however,
the chars are jumbled. For example,
"Gonzales" ( my english lol ) began OK AFAICT using printf
( this just runs an HTML parser on the input XML and isolates the name ):
toobib -hhtml ref/jaxk.xml 2>/dev/null | grep Gon | sed -e 's/.*Go//' | od -ax
0000000 n C ' a l v e s nl
c36e 61a7 766c 7365 000a
0000011
marchywka at happy:/home/documents/cpp/proj/toobib$ toobib -hhtml ref/jaxk.xml 2>/dev/null | grep Gon | sed -e 's/.*Go//'
nçalves
But the output of "TooBib", the BibTeX generator ( or the standalone C++ class to be integrated into
TooBib ), differs:
echo load ref/jaxk.xml | ./a.out 2>&1 | grep "|Gon" | sed -e 's/.*Go//' | tail -n 1
nçalves
marchywka at happy:/home/documents/cpp/proj/toobib$ echo load ref/jaxk.xml | ./a.out 2>&1 | grep "|Gon" | sed -e 's/.*Go//' | tail -n 1 | od -ax
0000000 n C etx B ' a l v e s nl
c36e c283 61a7 766c 7365 000a
0000013
As long as BibTeX accepts it, I'm just going to leave it there for now.
Thanks...
>
> Pretty much any new code that deals with generalized inputs should assume that its input is UTF-8-encoded. UTF-8 has the
> advantage that in any multi-byte sequence, the starting byte can always be identified as such (so the classic interview
> problem of "reverse this string" can almost* be handled by determining whether a byte is the starting byte of a sequence or a
> continuation byte; the classic interview problem's contrived nature begins to show itself in the 21st century).
>
> Depending on the level of support you need,
>
> http://utfcpp.sourceforge.net/
>
> should suffice for your needs. I'm guessing, though, that the biggest problem you're running into is that you're likely
> running your code under Windows and haven't done whatever is necessary to tell the OS that the program is
> outputting UTF-8, which might be all that's needed.
>
> -dh
>
> * The almost comes into play when you remember that Unicode allows for combining character sequences like the
> above-mentioned c+combining cedilla, not to mention oddities like some emojis which are generated by combining sequences
> like bear+ZWJ+snowflake = polar bear, flags which are (mostly) two-character sequences of regional indicator letters, etc.
--
mike marchywka
306 charles cox
canton GA 30115
USA, Earth
marchywka at hotmail.com
404-788-1216
ORCID: 0000-0001-9237-455X