bugs: unicode.sub modifying strings -> buffer:read having trouble with non-ASCII characters #1207


Closed · mpmxyz opened this issue Jun 7, 2015 · 6 comments

mpmxyz (Contributor) commented Jun 7, 2015

I got a bug report for my own program: http://oc.cil.li/index.php?/topic/511-crunch-break-the-4k-limit/#entry2317
I found out that the bug is not within my program:
When it loads this file: http://pastebin.com/z9w0Bwij
it crashes because the last character of the file (a "]" closing a long string) is missing.
While debugging, I found that one character is instead split into two erroneous ones:
http://i.imgur.com/LtpKkHM.png

I further boiled it down to an error within the buffer:read / readBytesOrChars function:

    local left = n - len(buffer)                     -- in character mode, len/sub are unicode.len/unicode.sub
    buffer = buffer .. sub(self.bufferRead, 1, left) -- this slice can land inside a UTF-8 sequence
    self.bufferRead = sub(self.bufferRead, left + 1)

Here is a step-by-step example showing what goes wrong:

n=3
buffer=""
self.bufferRead="\226\148\128\226\148"
->
left = 3 - 0 = 3
buffer = "" .. unicode.sub("\226\148\128\226\148", 1, 3) = "\226\148\128\195\162" --Where did "\195\162" come from?
self.bufferRead = "" --> triggers a new reading operation
->
self.bufferRead="\128\226\148\128"
left = 3 - 2 = 1 --only one character missing...
buffer = buffer .. unicode.sub("\128\226\148\128", 1, 1) = "\226\148\128\195\162" .. "\194\128" -- Where did "\194" come from?

So the current bug is caused by unicode.sub adding garbage when the string ends in an incomplete sequence. When fixing this, I'd also recommend checking whether the non-binary version of buffer:read works correctly when it receives only the first part of the last character of a chunk.
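
For reference, the failure can be reproduced directly with the byte values from the trace above (unicode.sub is the character-aware substring from OpenComputers' unicode API; the printed bytes are the ones observed in the screenshot):

    -- "\226\148\128" is a complete 3-byte character (U+2500);
    -- "\226\148" is the first 2/3 of another one.
    local s = "\226\148\128\226\148"
    print((unicode.sub(s, 1, 3)):byte(1, -1))
    -- observed: 226 148 128 195 162 instead of the original bytes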

fnuecke (Member) commented Jun 8, 2015

Hrm, it's somewhat a matter of definition what the actual bug is. I'd actually say the issue is with the buffer's current implementation, since it really should check whether the current byte buffer ends in a valid unicode character. I think it's OK to have unicode.sub and family expect a valid unicode string as input. So what's needed is some way for the buffer to detect up to where its contents are valid... I have some ideas, but none of them are nice. Any suggestions on how best to tackle this are welcome.

SolraBizna commented

My implementation of the unicode library handles non-BMP code points and erroneous sequences well, but it's in C. The Lua architecture one converts from UTF-8 to UTF-16 and counts UTF-16 code units, which is the source of the questionable Astral Plane support (Java's fault) and the extra garbage from unicode.sub (which I'll wager is U+FFFD). I can't think of a reason my implementation couldn't be rewritten in Java/Scala and applied to byte arrays. That would avoid at least this particular problem.

As for determining whether a buffer ends in a valid sequence... UTF-8 is specifically designed to make this kind of thing easy. A one-byte sequence is always 0x00--0x7F; 0xC0--0xDF always introduces a two-byte sequence; 0xE0--0xEF always a three-byte sequence; 0xF0--0xF7 a four-byte sequence. 0x80--0xBF only occurs as the second, third, or fourth byte of a sequence, and no other byte does; if you see one, you have to look back at most three bytes to find the start of the sequence. Five- and six-byte sequences were defined originally but are no longer allowed, so 0xF8--0xFF never occur at all. A Lua pattern could almost handle this. Only up to four final bytes need to be checked to determine whether the last character is valid/complete.
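
A minimal sketch of that check in Lua (the helper name and its exact contract are mine, not from this thread): scan at most the last four bytes for a lead byte and report how many trailing bytes belong to an unfinished sequence.

    -- Returns how many trailing bytes of s form an incomplete UTF-8 sequence
    -- (0 if the string ends on a sequence boundary).
    local function incompleteTailLength(s)
      for back = 1, math.min(4, #s) do
        local b = s:byte(#s - back + 1)
        if b < 0x80 then
          return 0 -- one-byte character: nothing is cut off
        elseif b >= 0xC0 then
          -- lead byte: 0xC0--0xDF -> 2 bytes, 0xE0--0xEF -> 3, 0xF0--0xF7 -> 4
          local expected = b >= 0xF0 and 4 or b >= 0xE0 and 3 or 2
          if back < expected then return back end -- sequence was cut short
          return 0 -- final sequence is complete
        end
        -- 0x80--0xBF: continuation byte, keep scanning backwards
      end
      return 0 -- no lead byte within four bytes: already invalid, give up
    end

On the bytes from mpmxyz's trace, incompleteTailLength("\226\148\128\226\148") returns 2, telling the buffer to hold those two bytes back until the next chunk arrives.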

payonel (Member) commented Oct 4, 2017

I have decided that the buffer library can assume the underlying stream returns valid character sequences. When you use io.open to open a file in text mode (not binary), the underlying host filesystem returns complete UTF-8 sequences and does not slice them. The buffer library expects sequences to be complete, and I'm okay with this expectation.

If user code reads from a custom stream that carries UTF-8 sequences and treats it as a text-mode stream, then that stream must respect sequence boundaries appropriately.

If you cannot make your stream obey, then you need to read the stream in binary mode.
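
To illustrate the distinction (the path is hypothetical; the read counts follow the buffer behavior discussed in this thread, where text-mode read(n) counts characters and binary-mode read(n) counts bytes):

    local f = io.open("/tmp/example.txt", "r")   -- text mode: read(n) counts characters
    local chars = f:read(3)                      -- 3 complete UTF-8 characters
    f:close()

    local b = io.open("/tmp/example.txt", "rb")  -- binary mode: read(n) counts bytes
    local bytes = b:read(3)                      -- 3 raw bytes; may split a sequence,
    b:close()                                    -- but the caller asked for raw bytes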

payonel closed this as completed Oct 4, 2017
SolraBizna commented

> When you use io.open to open a file in text mode (not binary), the underlying host filesystem returns complete UTF-8 sequences and does not slice them.

Maybe that works on Windows, but unless Java is adding one of its own, POSIX systems do not have a separate "text mode".

payonel (Member) commented Oct 4, 2017

@SolraBizna you make a good point. My reasoning is based on the filesystem component's open method allowing both a binary and a text mode. Thus, from the perspective of OpenOS (or any other runtime and architecture), the stream handle was opened in a specified mode, and it should be safe to assume the stream behaves accordingly. It is therefore my opinion that OpenOS is not responsible for verifying that the stream behaves in a textual or binary manner.

IF we are not respecting the mode in the Java layer for text mode (and we might not be), then we'll need to use an InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8) where we can. I'll reopen this ticket until I have verified what we're currently doing.

But, in the end, I'm not handling this in the OpenOS layer.

payonel reopened this Oct 4, 2017
payonel self-assigned this Jan 28, 2020
payonel added a commit that referenced this issue May 15, 2020
the openos io buffer in utf8 mode can splice inside a utf8 sequence
this code prevents that by reading the next chunk to complete the sequence
in case the stream actually has a bad utf8 sequence, the io buffer returns
more data than was asked for, rather than corrupting the stream
closes #1207
payonel (Member) commented May 15, 2020

It only took 5 years :) but yes, I decided to fix this.
I found a fast way to check the buffered io result for a valid/expected result size. The buffer will return more chars in case a bad sequence is detected, and if the end of the stream is read, another chunk read will occur -- this resolves this use case and was shown to preserve data in testing.
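
A rough sketch of that strategy in Lua (an illustration, not the actual patch: incompleteTailLength is the hypothetical helper from the earlier comment, and readChunk stands in for the buffer's internal refill; per the commit message, the real fix also returns more data than requested when the stream contains a genuinely bad sequence, rather than corrupting it):

    local function readChars(self, n)
      -- top up the byte buffer until it ends on a complete UTF-8 sequence
      while incompleteTailLength(self.bufferRead) > 0 do
        if not self:readChunk() then
          break -- end of stream: stop trying to complete the last sequence
        end
      end
      local result = unicode.sub(self.bufferRead, 1, n)
      -- consume by byte count so the remainder stays aligned with what was taken
      self.bufferRead = string.sub(self.bufferRead, #result + 1)
      return result
    end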

payonel closed this as completed May 15, 2020