I am troubleshooting an anomaly in the text I get when I decode the chunks yielded by InMemoryUploadedFile.chunks().
I created five variants of CSV files from Microsoft Excel 2013:
sample-mac.csv
sample-dos.csv
sample-comma.csv
sample-tab-delimited.csv (actually saves as .txt but a change of extension doesn’t hurt)
sample-unicode.csv
Excel provides options for saving in each of those formats. The last is encoded as "utf-16" by default. The rest are plain text (ascii). The plain-text variants differ in their end-of-line character: "\r" for mac, and "\r\n" for the others created on an MS-Windows system. Plain-text files (including csv files) created on Linux use only "\n" for EOL.
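The three end-of-line conventions described above can be illustrated with a short Python sketch. The byte strings are made-up rows, not the contents of the actual sample files, and split_rows is a hypothetical helper introduced only for this illustration.

```python
# Illustrative rows only; these are NOT the contents of the real sample files.
mac = b"name,age\rada,36\rbob,41\r"         # bare CR, as in the "mac" variant
dos = b"name,age\r\nada,36\r\nbob,41\r\n"   # CRLF, as in the Windows variants
unix = b"name,age\nada,36\nbob,41\n"        # LF, as on Linux


def split_rows(raw, charset="ascii"):
    # str.splitlines() treats CR, CRLF, and LF all as line boundaries,
    # so the same rows come back regardless of the EOL convention.
    return raw.decode(charset).splitlines()


for label, raw in (("mac", mac), ("dos", dos), ("unix", unix)):
    print(label, split_rows(raw))
```

All three inputs should yield the same list of rows, which suggests the files differ only in their line terminators and not in their data.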
So I have a file, which is an instance of InMemoryUploadedFile, and I get a generator that yields all the chunks from that file:
chunks = file.chunks()
# where chunks is a generator
and I take the text in the first chunk for sampling:
sampler = next(chunks)
The sampler is still a byte string at this point. So I decode it and observe the result:
print(sampler.decode(charset))
# where charset is ascii (None implies utf-8)
OBSERVATION:
The characters in the sampler decode properly (as expected) for every file except sample-mac.csv (the first sample file). Somehow, somewhere, chunks() seems to mangle or truncate the string, so that the printed output is a miserable (small) version of what was intended. Some characters are lost! Why? It only happens with sample-mac.csv.
Now here is the bummer: when I open and read the same file directly in a Python shell, it reads perfectly.
file = open("sample-mac.csv", "r")
read = file.read()
print(read)
file.close()
The above code prints out everything, with the same encoding and all. So Python reads the same file properly, but something messes with the file when it passes through InMemoryUploadedFile.chunks(). Does anyone have an explanation for this?
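A minimal sketch for narrowing this down, assuming Python 3: text-mode open() translates "\r" and "\r\n" to "\n" by default, whereas bytes read in binary mode keep their original line endings. Printing a string that still contains bare "\r" to a terminal usually moves the cursor back to the start of the line, so later rows can overwrite earlier ones on screen even when no characters are missing. The temporary file and its rows below are hypothetical stand-ins for sample-mac.csv; repr() is used so the line endings are visible.

```python
import os
import tempfile

# Hypothetical stand-in for sample-mac.csv: rows terminated by a bare "\r".
raw = b"name,age\rada,36\rbob,41\r"

with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as tmp:
    tmp.write(raw)
    path = tmp.name

try:
    # Binary mode: the bytes come back untouched, then are decoded by hand,
    # which mirrors decoding a raw chunk.
    with open(path, "rb") as fh:
        from_bytes = fh.read().decode("ascii")

    # Text mode with the default newline=None: "\r" and "\r\n" become "\n".
    with open(path, "r", encoding="ascii") as fh:
        from_text = fh.read()
finally:
    os.remove(path)

# repr() shows the line endings instead of letting the terminal act on them.
print(repr(from_bytes))
print(repr(from_text))
print(len(from_bytes), len(from_text))
```

If the two lengths match for the real file as well, the decoded chunk is probably complete and only its display is affected; comparing len(sampler) with the file size would be another quick check.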