
Two different errors when reading two different files #54

Open

Khris777 opened this issue Mar 30, 2017 · 9 comments

Khris777 commented Mar 30, 2017

I'm using parquet on Windows 10, and I have two different parquet files for testing: one is snappy-compressed, the other is uncompressed.

Simple test code for reading:

with open(filename,'r') as f:
    for row in parquet.reader(f):
        print row

The uncompressed file throws this error:

  File "E:/PythonDir/Diverses/DataTest.py", line 23, in <module>
	for row in parquet.reader(f):

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\parquet\__init__.py", line 426, in reader
	dict_items)

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\parquet\__init__.py", line 275, in read_data_page
	raw_bytes = _read_page(fo, page_header, column_metadata)

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\parquet\__init__.py", line 244, in _read_page
	page_header.uncompressed_page_size)

AssertionError: found 87 raw bytes (expected 367)

Reading the compressed file like that gives:

  File "E:/PythonDir/Diverses/DataTest.py", line 23, in <module>
	for row in parquet.reader(f):

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\parquet\__init__.py", line 393, in reader
	footer = _read_footer(fo)

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\parquet\__init__.py", line 71, in _read_footer
	footer_size = _get_footer_size(fo)

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\parquet\__init__.py", line 64, in _get_footer_size
	tup = struct.unpack("<i", fo.read(4))

error: unpack requires a string argument of length 4

I can open both files with fastparquet 0.0.5 just fine, so there's nothing wrong with the files themselves.

What am I doing wrong?
Do I have to explicitly uncompress the data with snappy, or does parquet do that by itself?
Could you provide some more documentation on basic usage in general?
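
For context, a Parquet file ends with the serialized footer metadata, then a 4-byte little-endian footer length, then the magic bytes PAR1; the struct.unpack("<i", fo.read(4)) call in the traceback is reading that length field. A minimal sketch of reading the trailer, assuming a seekable file object opened in binary mode (read_footer_size is an illustrative helper, not the library's own function):

import struct

MAGIC = b"PAR1"

def read_footer_size(fo):
    """Return the footer length stored in a Parquet file trailer.

    The last 8 bytes of a Parquet file are a 4-byte little-endian
    footer length followed by the magic bytes b"PAR1"; this only
    works reliably on a file opened in binary mode.
    """
    fo.seek(-8, 2)  # 8 bytes back from the end of the file
    trailer = fo.read(8)
    assert trailer[4:] == MAGIC, "missing PAR1 magic - not a Parquet file?"
    return struct.unpack("<i", trailer[:4])[0]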

jcrobak commented Apr 1, 2017

Hi @Khris777 - what version of Python are you on?

Khris777 commented Apr 1, 2017

I'm using Python 2.7.

jcrobak commented Apr 2, 2017

@Khris777 can you try opening the files in binary mode, i.e. with open(filename,'rb')?
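
That is, the same test code with only the file mode changed:

import parquet

# 'rb' hands parquet the raw bytes; 'r' on Windows applies newline translation
with open(filename, 'rb') as f:
    for row in parquet.reader(f):
        print(row)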

Khris777 commented Apr 3, 2017

Using binary mode leads to the script not finishing at all.

It does not lock up; it just keeps running. The two files are both less than 1 MB, so this is odd.

When I kill the process after several minutes, it throws the usual KeyboardInterrupt and reports the line it was on; the output varies. Some examples:

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\parquet\__init__.py", line 426, in reader
	dict_items)

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\parquet\__init__.py", line 344, in read_data_page
	dict_values_io_obj, bit_width, len(dict_values_bytes))

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\parquet\encoding.py", line 213, in read_rle_bit_packed_hybrid
	debug_logging = logger.isEnabledFor(logging.DEBUG)

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\logging\__init__.py", line 1366, in isEnabledFor
	return level >= self.getEffectiveLevel()

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\logging\__init__.py", line 1355, in getEffectiveLevel
	if logger.level:

===============================

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\parquet\__init__.py", line 426, in reader
	dict_items)

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\parquet\__init__.py", line 343, in read_data_page
	values = encoding.read_rle_bit_packed_hybrid(
	
===============================

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\parquet\__init__.py", line 426, in reader
	dict_items)

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\parquet\__init__.py", line 344, in read_data_page
	dict_values_io_obj, bit_width, len(dict_values_bytes))

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\parquet\encoding.py", line 222, in read_rle_bit_packed_hybrid
	while io_obj.tell() < length:

jcrobak commented May 26, 2017

Any chance you could try to reproduce this on the 1.2 release that was just published?

Khris777 commented

I will once I figure out why the latest python-snappy version fails to install.

Khris777 commented May 30, 2017

Okay, here is where things stand now.

I installed parquet and snappy into my Python 3.6 environment, and there parquet works flawlessly: I can read everything just as I can with fastparquet. It was a fresh install, fetching a precompiled snappy wheel from http://www.lfd.uci.edu/~gohlke/pythonlibs/ and the latest parquet with pip.

On Python 2.7, however, it still doesn't work. I updated the parquet package normally using pip after installing the precompiled snappy wheel for 2.7.

I have the same data in three formats: uncompressed, snappy-compressed, and gzip-compressed. All three always throw the same error, so it doesn't seem to be a compression problem.

My testing code:

r1 = []
filename = "E:\\Temp\\uncompressedParquetFile.parquet"
with open(filename,'rb') as f:
    for row in parquet.reader(f):
        r1.append(row)

throws this error:

Traceback (most recent call last):

  File "<ipython-input-9-bb9230901f59>", line 1, in <module>
    runfile('E:/PythonDir/Diverses/parquetTest.py', wdir='E:/PythonDir/Diverses')

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\spyder\utils\site\sitecustomize.py", line 880, in runfile
    execfile(filename, namespace)

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\spyder\utils\site\sitecustomize.py", line 87, in execfile
    exec(compile(scripttext, filename, 'exec'), glob, loc)

  File "E:/PythonDir/Diverses/parquetTest.py", line 22, in <module>
    for row in parquet.reader(f):

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\parquet\__init__.py", line 426, in reader
    dict_items)

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\parquet\__init__.py", line 344, in read_data_page
    dict_values_io_obj, bit_width, len(dict_values_bytes))

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\parquet\encoding.py", line 227, in read_rle_bit_packed_hybrid
    res += read_bitpacked(io_obj, header, width, debug_logging)

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\parquet\encoding.py", line 146, in read_bitpacked
    b = raw_bytes[current_byte]

IndexError: list index out of range
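
For what it's worth, read_bitpacked here is decoding Parquet's RLE/bit-packed hybrid encoding. A rough sketch of that encoding, paraphrasing the format spec (decode_bitpacked is a hypothetical helper standing in for the real bit-unpacking step), shows why a short or mis-read page surfaces as an IndexError:

def _read_uleb128(fo):
    """Read an unsigned LEB128 varint from a binary file-like object."""
    result, shift = 0, 0
    while True:
        byte = ord(fo.read(1))
        result |= (byte & 0x7f) << shift
        if not byte & 0x80:
            return result
        shift += 7

def read_hybrid_run(fo, bit_width):
    """Decode one run of Parquet's RLE/bit-packed hybrid encoding."""
    header = _read_uleb128(fo)
    if header & 1:                    # bit-packed run
        count = (header >> 1) * 8     # the header stores groups of 8 values
        raw = fo.read(count * bit_width // 8)
        # If the page bytes are truncated, raw is shorter than expected,
        # and indexing past its end is exactly the IndexError above.
        return decode_bitpacked(raw, bit_width, count)  # hypothetical helper
    else:                             # RLE run
        count = header >> 1
        width = (bit_width + 7) // 8  # value stored in ceil(bit_width/8) bytes
        raw = fo.read(width)
        value = sum(ord(raw[i:i + 1]) << (8 * i) for i in range(width))
        return [value] * count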

Without binary mode, i.e. with open(filename,'r') as f:, it's this error:

Traceback (most recent call last):

  File "<ipython-input-10-bb9230901f59>", line 1, in <module>
    runfile('E:/PythonDir/Diverses/parquetTest.py', wdir='E:/PythonDir/Diverses')

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\spyder\utils\site\sitecustomize.py", line 880, in runfile
    execfile(filename, namespace)

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\spyder\utils\site\sitecustomize.py", line 87, in execfile
    exec(compile(scripttext, filename, 'exec'), glob, loc)

  File "E:/PythonDir/Diverses/parquetTest.py", line 22, in <module>
    for row in parquet.reader(f):

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\parquet\__init__.py", line 393, in reader
    footer = _read_footer(fo)

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\parquet\__init__.py", line 78, in _read_footer
    fmd.read(pin)

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\thriftpy\thrift.py", line 112, in read
    iprot.read_struct(self)

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\thriftpy\protocol\compact.py", line 267, in read_struct
    val = self.read_val(ftype, fspec)

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\thriftpy\protocol\compact.py", line 299, in read_val
    result.append(self.read_val(v_type, v_spec))

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\thriftpy\protocol\compact.py", line 335, in read_val
    self.read_struct(obj)

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\thriftpy\protocol\compact.py", line 267, in read_struct
    val = self.read_val(ftype, fspec)

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\thriftpy\protocol\compact.py", line 299, in read_val
    result.append(self.read_val(v_type, v_spec))

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\thriftpy\protocol\compact.py", line 335, in read_val
    self.read_struct(obj)

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\thriftpy\protocol\compact.py", line 267, in read_struct
    val = self.read_val(ftype, fspec)

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\thriftpy\protocol\compact.py", line 335, in read_val
    self.read_struct(obj)

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\thriftpy\protocol\compact.py", line 267, in read_struct
    val = self.read_val(ftype, fspec)

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\thriftpy\protocol\compact.py", line 335, in read_val
    self.read_struct(obj)

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\thriftpy\protocol\compact.py", line 250, in read_struct
    fname, ftype, fid = self.read_field_begin()

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\thriftpy\protocol\compact.py", line 181, in read_field_begin
    return None, self._get_ttype(type), fid

  File "C:\Users\my.name\AppData\Local\Continuum\Anaconda2\lib\site-packages\thriftpy\protocol\compact.py", line 134, in _get_ttype
    return TTYPES[byte & 0x0f]

KeyError: 14
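
The KeyError: 14 points the same way: thriftpy's compact protocol takes the low nibble of each field-header byte as a type code, and 14 is not a valid type, which suggests the bytes were already corrupted when thrift saw them. A minimal demonstration of what text mode does to binary data on Windows (demo.bin is just a throwaway file for illustration):

# On Windows, text mode translates b"\r\n" to "\n" on read (and the C
# runtime treats 0x1a as end-of-file), so binary payloads come back
# shorter and shifted - enough to break both struct.unpack and thrift.
payload = b"\x15\r\n\x2c" * 4

with open("demo.bin", "wb") as f:
    f.write(payload)

with open("demo.bin", "r") as f:   # text mode, as in the failing run
    text_data = f.read()

with open("demo.bin", "rb") as f:  # binary mode
    binary_data = f.read()

print(len(binary_data))  # 16
print(len(text_data))    # 12 on Windows: every \r\n collapses to \n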

jcrobak commented May 31, 2017

Oh interesting. I'd love to try to recreate this issue. How are you generating the parquet file that it fails on?

Khris777 commented

The files are generated on a Cloudera Hadoop Cluster version 5.4.4, in Java, by a colleague. I asked him for some code, and he gave me the parts that write the parquet file; it's part of a larger file, though:

import java.io.File;
import java.nio.ByteBuffer;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.IndexedRecord;
import org.apache.avro.reflect.ReflectData;
import org.apache.commons.io.FileUtils;
import org.apache.commons.io.IOUtils;
import org.apache.hadoop.fs.Path;

import parquet.avro.AvroSchemaConverter;
import parquet.avro.AvroWriteSupport;
import parquet.hadoop.ParquetWriter;
import parquet.hadoop.metadata.CompressionCodecName;
import parquet.schema.MessageType;

// WriterVersion, avroSchemaFile, UploadedXmlDTO, and uploadedXMLs come
// from the surrounding class, which is not shown here.
public static final WriterVersion DEFAULT_WRITER_VERSION = WriterVersion.PARQUET_1_0;

// Convert the Avro schema to a Parquet schema and set up Avro write support.
Schema avroSchema = new Schema.Parser().parse(avroSchemaFile);
MessageType parquetSchema = new AvroSchemaConverter().convert(avroSchema);
AvroWriteSupport writeSupport = new AvroWriteSupport(parquetSchema, avroSchema);

File parquetFile = new File("parquetFile.parquet");
Path parquetFilePath = new Path(parquetFile.toURI());

// Write each uploaded XML as one record in a snappy-compressed Parquet file.
try (ParquetWriter<IndexedRecord> parquetFileWriter =
		new ParquetWriter<IndexedRecord>(parquetFilePath, writeSupport, CompressionCodecName.SNAPPY,
				ParquetWriter.DEFAULT_BLOCK_SIZE, ParquetWriter.DEFAULT_PAGE_SIZE))
{
	for (UploadedXmlDTO uploadedXML : uploadedXMLs)
	{
		GenericRecord record = new GenericData.Record(avroSchema);

		record.put("date", uploadedXML.getDate());
		record.put("xml", ByteBuffer.wrap(uploadedXML.getXml()));

		parquetFileWriter.write(record);
	}
}

Maybe this helps a little. I can't provide you with the files because of company policies.
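
Since fastparquet reads these files fine, a quick cross-check along these lines (assuming fastparquet and pandas are installed; the path is the one from the test code above) can confirm the files themselves are intact:

from fastparquet import ParquetFile

pf = ParquetFile("E:\\Temp\\uncompressedParquetFile.parquet")
df = pf.to_pandas()  # snappy/gzip decompression is handled internally
print(df.head())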
