bugs: unicode.sub modifying strings -> buffer:read having trouble with non ascii characters #1207
Hrm, it's kind of a matter of definition as to what the actual bug is. I'd actually say the issue is with the buffer's current implementation, as it kind of should check if the current byte buffer ends in a valid unicode char. I think it's OK to have unicode.sub and family expect a valid unicode string as input. So I suppose what's needed is some way to allow the buffer to detect up to where the buffer is valid... I have some ideas, but none of them are nice. Any suggestions on how to best tackle this are welcome.
My implementation of the

As for determining whether a buffer ends in a valid sequence... UTF-8 is specifically designed to make this kind of thing easy. A one-byte sequence is always 0x00--0x7F; 0xC0--0xDF always introduces a two-byte sequence; 0xE0--0xEF always a three-byte sequence; 0xF0--0xF7 a four-byte sequence. 0x80--0xBF only occurs as the second, third, or fourth byte in a sequence, and no other byte does; if you get one, you have to look at most three bytes back to find a start-of-sequence. Five- and six-byte sequences were defined originally but are not allowed anymore, therefore 0xF8--0xFF never occur at all. A Lua regular expression could almost handle this. Only up to four final bytes need to be checked to determine if the last character is valid/complete.
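The byte ranges described above can be turned into a small check. The following is only a sketch in plain Lua, not code from OpenOS; the names `incomplete_tail` and `split_complete` are hypothetical. It looks back at most four bytes from the end of a string and reports how many trailing bytes belong to a sequence that has not been completed yet. Malformed input (for example a run of continuation bytes with no lead byte) is deliberately reported as 0, since it is broken rather than merely incomplete.

```lua
-- Hypothetical helper: number of bytes at the end of `s` that start a
-- UTF-8 sequence which is not yet complete (0 if `s` ends on a boundary).
local function incomplete_tail(s)
  local len = #s
  -- The lead byte of the last sequence is at most 4 bytes from the end.
  for back = 1, math.min(4, len) do
    local b = s:byte(len - back + 1)
    if b < 0x80 then
      return 0                      -- 0x00..0x7F: one-byte sequence, complete
    elseif b >= 0xC0 then
      local need                    -- total length announced by the lead byte
      if b < 0xE0 then need = 2
      elseif b < 0xF0 then need = 3
      elseif b < 0xF8 then need = 4
      else return 0 end             -- 0xF8..0xFF never occur; treat as malformed
      if back < need then
        return back                 -- sequence started but not finished
      end
      return 0
    end
    -- 0x80..0xBF: continuation byte, keep looking further back
  end
  return 0                          -- no lead byte found: malformed, not incomplete
end

-- Hypothetical helper: split `s` into its complete prefix and the
-- incomplete remainder that should wait for more bytes.
local function split_complete(s)
  local n = incomplete_tail(s)
  return s:sub(1, #s - n), s:sub(#s - n + 1)
end
```

A caller that reads fixed-size chunks could pass each chunk through `split_complete` and prepend the second return value to the next chunk before handing text to the unicode functions.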
I have decided that the buffer library can assume the underlying stream returns valid character sets. If user code is reading from a custom stream that has utf8 sequences, and expects to read it in text mode, then that stream must respect the sequence breaks appropriately. If you cannot make your stream obey, then you need to read the stream in binary mode.
Maybe that works on Windows, but unless Java is adding one of its own, POSIX systems do not have a separate "text mode".
@SolraBizna you make a good point. My reasoning is based on the

IF we are not respecting mode in the Java layer (and we might not be) for text mode, then we'll need to use a

But, in the end, I'm not handling this in the OpenOS layer.
the openos io buffer in utf8 mode can splice inside a utf8 sequence. this code prevents that by reading the next chunk to complete the sequence. in the case the stream actually has a bad utf8 sequence, the io buffer decides to return more data than it was asked for, rather than corrupt the stream. closes #1207
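As a rough illustration of the approach in that commit message (this is not the actual OpenOS buffer code), the sketch below reads a requested number of bytes and, if the result ends inside a UTF-8 sequence, keeps reading until the sequence is complete. Everything here is an assumption made for the example: `read_chunk(n)` stands in for a stream primitive that returns up to `n` bytes or `nil` at end of stream, and `missing_bytes`, `read_whole_chars`, and `string_stream` are hypothetical names.

```lua
-- Hypothetical helper: bytes still missing from the last UTF-8 sequence
-- in `s` (0 if `s` ends on a boundary or the tail is malformed).
local function missing_bytes(s)
  for back = 1, math.min(4, #s) do
    local b = s:byte(#s - back + 1)
    if b < 0x80 then return 0 end   -- ASCII byte: complete
    if b >= 0xC0 then               -- lead byte found
      local need = (b < 0xE0 and 2) or (b < 0xF0 and 3) or (b < 0xF8 and 4) or 0
      return math.max(need - back, 0)
    end
    -- continuation byte (0x80..0xBF): look one byte further back
  end
  return 0
end

-- Read about `n` bytes, extending the read so the result never ends in
-- the middle of a sequence. May return more than `n` bytes.
local function read_whole_chars(read_chunk, n)
  local data = read_chunk(n)
  if not data then return nil end
  local missing = missing_bytes(data)
  while missing > 0 do
    local more = read_chunk(missing)
    if not more or more == "" then break end  -- stream ended mid-sequence
    data = data .. more
    missing = missing_bytes(data)
  end
  return data
end

-- Assumed stand-in for a real stream: serves bytes from a string.
local function string_stream(s)
  local pos = 1
  return function(n)
    if pos > #s then return nil end
    local chunk = s:sub(pos, pos + n - 1)
    pos = pos + n
    return chunk
  end
end
```

With a stream over the five bytes of `"a\226\130\172b"`, asking for 2 bytes returns 4 (the `a` plus the whole three-byte character), which mirrors the trade-off named in the commit message: return more than requested rather than hand back a split character.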
it only took 5 years :) but yes, i decided to fix this
I got a bug report for my own program: http://oc.cil.li/index.php?/topic/511-crunch-break-the-4k-limit/#entry2317
I found out that the bug is not within my program:
When it loads this file: http://pastebin.com/z9w0Bwij
It crashes because the last character of the file is missing. (a "]" closing a long string)
During debugging I found out that one character is split into two erroneous ones instead:
http://i.imgur.com/LtpKkHM.png
I further boiled it down to an error within the buffer:read / readBytesOrChars function:
Here is a step-by-step example to show what's wrong:
So the current bug is caused by unicode.sub adding some garbage if there are incomplete characters. When fixing that, I'd also recommend checking whether the non-binary version of buffer:read works correctly when it receives only the first part of the last character of a chunk.