Processing of large input files - CSV or JSON?

I have a 3GB CSV file and a 4GB equivalent JSON file.

Which is better for loading and processing this data into a MySQL table?

I can load this data using MySQL's `LOAD DATA` statement (this is the relevant clause):

... REPLACE INTO TABLE `aws-pricing` ...

I need to filter them in code - I don't require all 4 million rows, and definitely not all 91 fields.

The CSV file will put less load on the system; the JSON file would generally be processed faster. Take your pick.

You can process the CSV file one line at a time, so you only need as much memory as the "current" row requires.

A JSON file (as a single document) must be loaded in its entirety: parsing it creates an in-memory object roughly the size of the complete file.
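The contrast above can be sketched in Python. The field names (`sku`, `region`, `price`) and the sample rows are invented stand-ins for the real 91-field file:

```python
import csv
import io
import json

# Tiny stand-in for the 3 GB CSV file; in practice you'd open(path, newline="").
csv_text = "sku,region,price\nA1,us-east-1,0.10\nB2,eu-west-1,0.20\nC3,us-east-1,0.30\n"

# CSV: stream one row at a time -- memory is bounded by the current row.
wanted = []
for row in csv.DictReader(io.StringIO(csv_text)):
    if row["region"] == "us-east-1":          # filter in code
        wanted.append({"sku": row["sku"], "price": row["price"]})  # keep only needed fields

# JSON as one top-level array: json.loads/json.load parses the whole document
# before you can filter anything, so the entire structure is memory-resident.
json_text = json.dumps([{"sku": "A1", "region": "us-east-1", "price": "0.10"}])
all_rows = json.loads(json_text)              # whole array in memory at once
```

On a small VM, the bounded-memory CSV loop is the safer of the two.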

I guess I'll pick CSV then - the Azure VM has only 1-2 CPUs and 4-8 GB of RAM.

I need to dump this newly generated (final) MySQL table (~1M rows, ~200MB) into a Redis cache, which is why I was thinking of JSON and json.dumps.

Are you talking about a separate serialized JSON object for each row? Or are you talking about a single JSON object for the entire table?

If the former, then that would be a file you could process one line at a time. It’s only the latter case where the entire table would become memory-resident.

  1. But what would be the key for each row? `"aws-" + SKU`?

  2. If it's a single JSON object, wouldn't it just be `"aws": [ { row1 }, { row2 }, { row3 }, ... { row 1M } ]`?

Which of the two would be smaller in size?
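The two shapes can be compared at the JSON-string level (Redis's own per-key overhead is a separate question). Rows and field names here are invented placeholders:

```python
import json

rows = [{"sku": "A1", "price": "0.10"}, {"sku": "B2", "price": "0.20"}]

# Shape 1: one key per row, key = "aws-" + SKU, value = that row serialized alone.
per_row = {"aws-" + r["sku"]: json.dumps(r) for r in rows}

# Shape 2: a single key ("aws") whose value is the whole table serialized at once.
single = json.dumps({"aws": rows})

# Rough serialized-size comparison (keys + values), before Redis overhead.
per_row_bytes = sum(len(k) + len(v) for k, v in per_row.items())
single_bytes = len("aws") + len(single)
```

Which total comes out smaller depends on how much the per-row keys repeat versus the single object's wrapper, so measuring on your real data is the only reliable answer.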

redis-cli info memory | grep 'used_memory.*human';

I wouldn’t know. I don’t know your data or how you’re planning to use it in redis.

Yes - so in order to load and parse it, it needs to load the entire aws list.

Within redis? You’d need to check that yourself. I have no idea what sort of compression redis may use internally for storing data and indexes. I suggest you try it both ways and see.

I just realized the idea of a single JSON object is a bad way to go: the entire value of the aws key would be one large string containing a JSONified object, which I'd have to parse with json.loads - meaning the client parses all 1M rows just to filter them. My bad! With one JSON object per row, I can use Redis to GET a specific key directly.
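The one-object-per-row approach can be sketched like this. The `row_key_value` helper and the `"aws-"` prefix are assumptions based on the thread, and the Redis calls (which need a running server and the redis-py package) are shown only as comments:

```python
import json

def row_key_value(row, prefix="aws-"):
    """Build the (key, value) pair for one row: key is prefix + SKU,
    value is that single row serialized with json.dumps."""
    return prefix + row["sku"], json.dumps(row)

# Usage sketch, assuming a running Redis and redis-py:
#   import redis
#   r = redis.Redis()
#   for row in rows:                    # rows: the ~1M filtered MySQL rows
#       r.set(*row_key_value(row))
#   one = json.loads(r.get("aws-A1"))   # fetch and parse one row only
```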