bugs: unicode.sub modifying strings -> buffer:read having trouble with non-ASCII characters #1207


Closed · mpmxyz opened this issue Jun 7, 2015 · 6 comments

mpmxyz (Contributor) commented Jun 7, 2015

I got a bug report for my own program: http://oc.cil.li/index.php?/topic/511-crunch-break-the-4k-limit/#entry2317
I found out that the bug is not within my program:
When it loads this file: http://pastebin.com/z9w0Bwij
it crashes because the last character of the file (a "]" closing a long string) is missing.
While debugging, I found that one character is instead split into two erroneous ones:
http://i.imgur.com/LtpKkHM.png

I further boiled it down to an error within the buffer:read / readBytesOrChars function:

    local left = n - len(buffer)                     -- in character mode, len/sub are unicode.len/unicode.sub
    buffer = buffer .. sub(self.bufferRead, 1, left) -- this slice can land inside a UTF-8 sequence
    self.bufferRead = sub(self.bufferRead, left + 1)

Here is a step-by-step example showing what goes wrong:

n=3
buffer=""
self.bufferRead="\226\148\128\226\148"
->
left = 3 - 0 = 3
buffer = "" .. unicode.sub("\226\148\128\226\148", 1, 3) = "\226\148\128\195\162" --Where did "\195\162" come from?
self.bufferRead = "" --> triggers a new reading operation
->
self.bufferRead="\128\226\148\128"
left = 3 - 2 = 1 --only one character missing...
buffer = buffer .. unicode.sub("\128\226\148\128", 1, 1) = "\226\148\128\195\162" .. "\194\128" -- Where did "\194" come from?

So the current bug is caused by unicode.sub adding garbage when the string ends in an incomplete sequence. When fixing this, I'd also recommend checking whether the non-binary version of buffer:read works correctly when it receives only the first part of the last character of a chunk.
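
For reference, the failure can be reproduced directly with the byte values from the trace above (unicode.sub is the character-aware substring from OpenComputers' unicode API; the printed bytes are the ones observed in the screenshot):

    -- "\226\148\128" is a complete 3-byte character (U+2500);
    -- "\226\148" is the first 2/3 of another one.
    local s = "\226\148\128\226\148"
    print((unicode.sub(s, 1, 3)):byte(1, -1))
    -- observed: 226 148 128 195 162 instead of the original bytes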

fnuecke (Member) commented Jun 8, 2015

Hrm, it's somewhat a matter of definition what the actual bug is. I'd actually say the issue is with the buffer's current implementation, since it really should check whether the current byte buffer ends in a valid unicode character. I think it's OK to have unicode.sub and family expect a valid unicode string as input. So what's needed is some way for the buffer to detect up to where its contents are valid... I have some ideas, but none of them are nice. Any suggestions on how best to tackle this are welcome.

SolraBizna commented

My implementation of the unicode library handles non-BMP code points and erroneous sequences well, but it's in C. The Lua architecture one converts from UTF-8 to UTF-16 and counts UTF-16 code units, which is the source of the questionable Astral Plane support (Java's fault) and the extra garbage from unicode.sub (which I'll wager is U+FFFD). I can't think of a reason my implementation couldn't be rewritten in Java/Scala and applied to byte arrays. That would avoid at least this particular problem.

As for determining whether a buffer ends in a valid sequence... UTF-8 is specifically designed to make this kind of thing easy. A one-byte sequence is always 0x00--0x7F; 0xC0--0xDF always introduces a two-byte sequence; 0xE0--0xEF always a three-byte sequence; 0xF0--0xF7 a four-byte sequence. 0x80--0xBF only occurs as the second, third, or fourth byte of a sequence, and no other byte does; if you see one, you have to look back at most three bytes to find the start of the sequence. Five- and six-byte sequences were defined originally but are no longer allowed, so 0xF8--0xFF never occur at all. A Lua pattern could almost handle this. Only up to four final bytes need to be checked to determine whether the last character is valid/complete.
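
A minimal sketch of that check in Lua (the helper name and its exact contract are mine, not from this thread): scan at most the last four bytes for a lead byte and report how many trailing bytes belong to an unfinished sequence.

    -- Returns how many trailing bytes of s form an incomplete UTF-8 sequence
    -- (0 if the string ends on a sequence boundary).
    local function incompleteTailLength(s)
      for back = 1, math.min(4, #s) do
        local b = s:byte(#s - back + 1)
        if b < 0x80 then
          return 0 -- one-byte character: nothing is cut off
        elseif b >= 0xC0 then
          -- lead byte: 0xC0--0xDF -> 2 bytes, 0xE0--0xEF -> 3, 0xF0--0xF7 -> 4
          local expected = b >= 0xF0 and 4 or b >= 0xE0 and 3 or 2
          if back < expected then return back end -- sequence was cut short
          return 0 -- final sequence is complete
        end
        -- 0x80--0xBF: continuation byte, keep scanning backwards
      end
      return 0 -- no lead byte within four bytes: already invalid, give up
    end

On the bytes from mpmxyz's trace, incompleteTailLength("\226\148\128\226\148") returns 2, telling the buffer to hold those two bytes back until the next chunk arrives.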

payonel (Member) commented Oct 4, 2017

I have decided that the buffer library can assume the underlying stream returns valid character sequences. When you use io.open to open a file in text mode (not binary), the underlying host filesystem returns complete UTF-8 sequences and does not slice them. The buffer library expects sequences to be complete, and I'm okay with this expectation.

If user code reads from a custom stream that carries UTF-8 sequences and treats it as a text-mode stream, then that stream must respect sequence boundaries appropriately.

If you cannot make your stream obey, then you need to read the stream in binary mode.
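
To illustrate the distinction (the path is hypothetical; the read counts follow the buffer behavior discussed in this thread, where text-mode read(n) counts characters and binary-mode read(n) counts bytes):

    local f = io.open("/tmp/example.txt", "r")   -- text mode: read(n) counts characters
    local chars = f:read(3)                      -- 3 complete UTF-8 characters
    f:close()

    local b = io.open("/tmp/example.txt", "rb")  -- binary mode: read(n) counts bytes
    local bytes = b:read(3)                      -- 3 raw bytes; may split a sequence,
    b:close()                                    -- but the caller asked for raw bytes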

payonel closed this as completed Oct 4, 2017
SolraBizna commented

> When you use io.open to open a file in text mode (not binary), the underlying host filesystem returns complete UTF-8 sequences and does not slice them.

Maybe that works on Windows, but unless Java is adding one of its own, POSIX systems do not have a separate "text mode".

payonel (Member) commented Oct 4, 2017

@SolraBizna you make a good point. My reasoning is based on the filesystem component's open method allowing both a binary and a text mode. Thus, from the perspective of OpenOS (or any other runtime and architecture), the stream handle was opened in a specified mode, and it should be safe to assume the stream behaves accordingly. It is therefore my opinion that OpenOS is not responsible for verifying that the stream behaves in a textual or binary manner.

IF we are not respecting the mode in the Java layer for text mode (and we might not be), then we'll need to use an InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8) where we can. I'll reopen this ticket until I have verified what we're currently doing.

But, in the end, I'm not handling this in the OpenOS layer.

payonel reopened this Oct 4, 2017
payonel self-assigned this Jan 28, 2020
payonel added a commit that referenced this issue May 15, 2020
the openos io buffer in utf8 mode can splice inside a utf8 sequence
this code prevents that by reading the next chunk to complete the sequence
in case the stream actually has a bad utf8 sequence, the io buffer returns
more data than was asked for, rather than corrupting the stream
closes #1207
payonel (Member) commented May 15, 2020

It only took 5 years :) but yes, I decided to fix this.
I found a fast way to check the buffered io result for a valid/expected result size. The buffer will return more chars in case a bad sequence is detected, and if the end of the stream is read, another chunk read will occur -- this resolves this use case and was shown to preserve data in testing.
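
A rough sketch of that strategy in Lua (an illustration, not the actual patch: incompleteTailLength is the hypothetical helper from the earlier comment, and readChunk stands in for the buffer's internal refill; per the commit message, the real fix also returns more data than requested when the stream contains a genuinely bad sequence, rather than corrupting it):

    local function readChars(self, n)
      -- top up the byte buffer until it ends on a complete UTF-8 sequence
      while incompleteTailLength(self.bufferRead) > 0 do
        if not self:readChunk() then
          break -- end of stream: stop trying to complete the last sequence
        end
      end
      local result = unicode.sub(self.bufferRead, 1, n)
      -- consume by byte count so the remainder stays aligned with what was taken
      self.bufferRead = string.sub(self.bufferRead, #result + 1)
      return result
    end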

payonel closed this as completed May 15, 2020