Fix wxTextInputStream for input starting with BOM-like bytes
Contrary to what a comment in wxTextInputStream::GetChar() said, it is actually possible to get more than one wide character from a call to wxMBConv::ToWChar(len+1) even if a previous call to ToWChar(len) failed to decode anything at all. This happens with wxConvAuto because it keeps returning an error while it doesn't have enough data to determine whether the input contains a BOM, but then returns all the characters examined so far at once if it turns out that there was no BOM after all.

The simplest case in which this created problems was input starting with a NUL byte, as this could be the start of a UTF-32BE BOM.

The fix consists of keeping all the bytes read but not yet decoded in the m_lastBytes buffer and retrying to decode them during the next GetChar() call. This implies keeping track of exactly how much valid data there is in m_lastBytes, as we can't discard the already decoded data immediately but need to keep it in the buffer too, in order to allow implementing UngetLast().

Incidentally, UngetLast() was totally broken for UTF-16/32 input (containing NUL bytes in the middle of the characters) before, and this change fixes this as a side effect.

Also add test cases for previously failing inputs.
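The behaviour described in the commit message can be illustrated with a small standalone sketch. It is not the wxWidgets code: ToyAutoDecoder and ToyTextReader are hypothetical names, the decoder knows only the UTF-32BE BOM and falls back to Latin-1, and the reader queues decoded characters instead of keeping their bytes in a buffer the way m_lastBytes does (so UngetLast() is not modelled). It only shows why one failed decode can be followed by a decode that yields several characters at once, and why the undecoded bytes have to be kept between attempts.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <utility>
#include <vector>

// Hypothetical stand-in for a BOM-sniffing converter (NOT wxConvAuto).
class ToyAutoDecoder
{
public:
    // Tries to decode all of `bytes`, appending the result to `out`.
    // Returns false if more input is needed before anything can be decoded.
    bool Decode(const std::vector<std::uint8_t>& bytes,
                std::vector<std::uint32_t>& out)
    {
        assert(!bytes.empty());
        static const std::uint8_t bom[4] = { 0x00, 0x00, 0xFE, 0xFF };

        if (m_mode == Mode::Unknown)
        {
            std::size_t n = 0;
            while (n < bytes.size() && n < 4 && bytes[n] == bom[n])
                ++n;
            if (n == bytes.size() && n < 4)
                return false;           // could still become a BOM: undecided
            if (n == 4)
            {
                // The reader feeds one byte at a time, so a complete BOM is
                // always seen with exactly four bytes pending.
                assert(bytes.size() == 4);
                m_mode = Mode::Utf32BE;
                return true;            // BOM swallowed, no characters yet
            }
            m_mode = Mode::Latin1;      // no BOM: everything seen so far is text
        }

        if (m_mode == Mode::Latin1)
        {
            for (std::uint8_t b : bytes)
                out.push_back(b);       // may yield several characters at once
            return true;
        }

        if (bytes.size() % 4 != 0)
            return false;               // incomplete UTF-32 code unit
        for (std::size_t i = 0; i < bytes.size(); i += 4)
            out.push_back((std::uint32_t(bytes[i]) << 24) |
                          (std::uint32_t(bytes[i + 1]) << 16) |
                          (std::uint32_t(bytes[i + 2]) << 8) |
                           std::uint32_t(bytes[i + 3]));
        return true;
    }

private:
    enum class Mode { Unknown, Latin1, Utf32BE };
    Mode m_mode = Mode::Unknown;
};

// Hypothetical reader that keeps undecoded bytes and retries with one more.
class ToyTextReader
{
public:
    explicit ToyTextReader(std::vector<std::uint8_t> input)
        : m_input(std::move(input)) {}

    // Returns the next character, or 0 at end of input.
    std::uint32_t GetChar()
    {
        while (m_ready.empty())
        {
            if (m_pos == m_input.size())
                return 0;               // undecoded bytes stay pending here
            m_pending.push_back(m_input[m_pos++]);
            std::vector<std::uint32_t> chars;
            if (m_decoder.Decode(m_pending, chars))
            {
                m_ready.insert(m_ready.end(), chars.begin(), chars.end());
                m_pending.clear();
            }
            // On failure the bytes are kept and retried with one more byte.
        }
        const std::uint32_t c = m_ready.front();
        m_ready.pop_front();
        return c;
    }

    bool Eof() const { return m_pos == m_input.size() && m_ready.empty(); }

private:
    std::vector<std::uint8_t> m_input;
    std::size_t m_pos = 0;
    std::vector<std::uint8_t> m_pending;   // read but not yet decoded
    std::deque<std::uint32_t> m_ready;     // decoded but not yet returned
    ToyAutoDecoder m_decoder;
};
```

With the two-byte input 0x00 0x01, the first decode attempt fails because a lone NUL could start the BOM, and the second attempt returns both characters together, which is the situation the old comment in GetChar() assumed could not happen.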
@@ -290,4 +290,40 @@ void TextStreamTestCase::TestInput(const wxMBConv& conv,
    CPPUNIT_ASSERT_EQUAL( 0, memcmp(txtWchar, temp.wc_str(), sizeof(txtWchar)) );
}

TEST_CASE("wxTextInputStream::GetChar", "[text][input][stream][char]")
{
    // This is the simplest possible test that used to trigger assertion in
    // wxTextInputStream::GetChar().
    SECTION("starts-with-nul")
    {
        const wxUint8 buf[] = { 0x00, 0x01, };
        wxMemoryInputStream mis(buf, sizeof(buf));
        wxTextInputStream tis(mis);

        REQUIRE( tis.GetChar() == 0x00 );
        REQUIRE( tis.GetChar() == 0x01 );
        REQUIRE( tis.GetChar() == 0x00 );
        CHECK( tis.GetInputStream().Eof() );
    }

    // This exercises a problematic path in GetChar() as the first 3 bytes of
    // this stream look like the start of UTF-32BE BOM, but this is not
    // actually a BOM because the 4th byte is 0xFE and not 0xFF, so the stream
    // should decode the buffer as Latin-1 once it gets there.
    SECTION("almost-UTF-32-BOM")
    {
        const wxUint8 buf[] = { 0x00, 0x00, 0xFE, 0xFE, 0x01 };
        wxMemoryInputStream mis(buf, sizeof(buf));
        wxTextInputStream tis(mis);

        REQUIRE( tis.GetChar() == 0x00 );
        REQUIRE( tis.GetChar() == 0x00 );
        REQUIRE( tis.GetChar() == 0xFE );
        REQUIRE( tis.GetChar() == 0xFE );
        REQUIRE( tis.GetChar() == 0x01 );
        REQUIRE( tis.GetChar() == 0x00 );
        CHECK( tis.GetInputStream().Eof() );
    }
}

#endif // wxUSE_UNICODE
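Why the two test inputs sit on the "undecided" path can be sketched as a three-way classification of a byte prefix against the standard Unicode BOMs. This is only an illustration under stated assumptions: BomCheck and ClassifyPrefix are hypothetical names, and the logic is not taken from wxConvAuto.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

enum class BomCheck { NeedMore, Found, None };

// Classifies the bytes read so far: NeedMore while they are a proper prefix
// of some BOM, Found if they start with a complete BOM, None otherwise.
BomCheck ClassifyPrefix(const std::vector<std::uint8_t>& data)
{
    static const std::vector<std::vector<std::uint8_t>> boms = {
        { 0x00, 0x00, 0xFE, 0xFF },   // UTF-32BE
        { 0xFF, 0xFE, 0x00, 0x00 },   // UTF-32LE
        { 0xEF, 0xBB, 0xBF },         // UTF-8
        { 0xFE, 0xFF },               // UTF-16BE
        { 0xFF, 0xFE },               // UTF-16LE
    };

    bool found = false;
    for (const std::vector<std::uint8_t>& bom : boms)
    {
        const std::size_t n = std::min(data.size(), bom.size());
        if (!std::equal(data.begin(), data.begin() + n, bom.begin()))
            continue;                   // diverges from this BOM
        if (data.size() < bom.size())
            return BomCheck::NeedMore;  // proper prefix: cannot decide yet
        found = true;                   // starts with this complete BOM
    }
    return found ? BomCheck::Found : BomCheck::None;
}
```

In this sketch a single 0x00 and the prefix 0x00 0x00 0xFE both classify as NeedMore, while 0x00 0x01 and 0x00 0x00 0xFE 0xFE classify as None, matching the points at which the two test sections above expect decoding to fall back to Latin-1.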