Processing of large input files - CSV or JSON?

I have a 3GB CSV file and a 4GB equivalent JSON file.

Which is better for loading and processing this data into a MySQL table?

I can load the CSV data using this MySQL command:

LOAD DATA LOCAL INFILE 'index.csv'
REPLACE INTO TABLE `aws-pricing`
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
ESCAPED BY '\\'
LINES TERMINATED BY '\n'
IGNORE 6 LINES;
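For reference, the same statement can be run from Python - a minimal sketch with PyMySQL, with made-up connection details, and assuming LOCAL INFILE is enabled on both the client and the server:

import pymysql

# local_infile=True is required for LOAD DATA LOCAL INFILE to work from the client side.
conn = pymysql.connect(host="localhost", user="me", password="secret",
                       database="pricing", local_infile=True)

load_sql = r"""
LOAD DATA LOCAL INFILE 'index.csv'
REPLACE INTO TABLE `aws-pricing`
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
ESCAPED BY '\\'
LINES TERMINATED BY '\n'
IGNORE 6 LINES
"""

with conn.cursor() as cur:
    cur.execute(load_sql)
conn.commit()
conn.close()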

I need to filter the rows in code - I don’t require all 4 million rows, and definitely not all 91 fields.

The CSV file will put a lower load on the system, while the JSON file would generally be processed faster - take your pick.

You can process the CSV file one line at a time. You only need as much memory as is required for the "current" row.

A JSON file must be loaded in its entirety. You’re going to create a memory object roughly the size of the complete file.
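To make that concrete, a minimal sketch in Python - the column names ("SKU", "PricePerUnit") are just placeholders, your 91 fields will differ:

import csv
import json

WANTED_FIELDS = ["SKU", "PricePerUnit"]        # placeholder column names

# CSV: stream one row at a time; memory stays roughly one-row-sized.
with open("index.csv", newline="") as f:
    for _ in range(5):                         # skip preamble lines so the next line is the header row
        next(f)                                # (adjust the count to match your file)
    reader = csv.DictReader(f)                 # reads the header, then yields one row dict at a time
    for row in reader:
        slim = {name: row[name] for name in WANTED_FIELDS if name in row}
        # ... filter on `slim` and insert the rows you keep into MySQL

# JSON: json.load() parses the whole file into one in-memory object,
# so peak memory usage is on the order of the full 4GB file.
with open("index.json") as f:
    data = json.load(f)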

I guess I’ll pick CSV then, since the Azure VM only has 1-2 vCPUs and 4-8 GB of RAM.

I need to dump this newly generated (final) MySQL table (~1M rows consuming ~200MB) into a Redis cache, and hence thought of using JSON via json.dumps.

Are you talking about a separate serialized JSON object for each row? Or are you talking about a single JSON object for the entire table?

If the former, then that would be a file you could process one line at a time. It’s only the latter case where the entire table would become memory-resident.
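For example, a sketch of the two layouts, with made-up field names:

import json

rows = [{"sku": "ABC123", "price": 0.023}, {"sku": "DEF456", "price": 0.045}]  # stand-in rows

# Option 1: one serialized JSON object per row, one per line ("JSON Lines").
# A consumer can read and json.loads() this one line at a time.
with open("aws_rows.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# Option 2: a single JSON object wrapping the whole table.
# json.load() on the consumer side has to parse everything at once.
with open("aws_table.json", "w") as f:
    json.dump({"aws": rows}, f)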

  1. But what would be the key for each row? "aws-" + SKU?

  2. If it's a single JSON object, wouldn’t it just be "aws": [ { row1 }, { row2 }, { row3 }, ... { row 1M } ]?

Which of the above two would be smaller in size?

redis-cli info memory | grep 'used_memory.*human';

I wouldn’t know. I don’t know your data or how you’re planning to use it in redis.

Yes - so in order to load and parse it, the code would need to load the entire aws list.

Within redis? You’d need to check that yourself. I have no idea what sort of compression redis may use internally for storing data and indexes. I suggest you try it both ways and see.
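For instance, something along these lines with the redis-py client would let you compare the two layouts yourself - a sketch only, assuming a scratch Redis database you can flush, with made-up rows:

import json
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

rows = [{"sku": f"SKU{i}", "price": i * 0.001} for i in range(1000)]  # stand-in rows

# Layout A: one key per row.
r.flushdb()                                    # NOTE: empties the selected DB - use a scratch DB only
for row in rows:
    r.set("aws-" + row["sku"], json.dumps(row))
print("per-row keys:  ", r.info("memory")["used_memory_human"])

# Layout B: a single key holding the whole table as one JSON string.
r.flushdb()
r.set("aws", json.dumps({"aws": rows}))
print("single object: ", r.info("memory")["used_memory_human"])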

I just realized the idea of a single JSON object is a bad way to go: the entire value of the aws key would be one large string containing a JSONified object, which I would have to parse with json.loads, so all 1M rows would be parsed just to filter out the ones I need. My bad! With one JSON object per row, I can simply use redis to get a specific key.
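Something like this is what I have in mind - a sketch with redis-py, assuming "aws-" + SKU as the key and made-up row fields:

import json
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

# Store: one key per row, value is the JSON-serialized row.
row = {"sku": "ABC123", "instanceType": "t3.micro", "pricePerUnit": 0.0104}  # hypothetical fields
r.set("aws-" + row["sku"], json.dumps(row))

# Fetch: grab exactly the row I want by key and parse just that one string.
raw = r.get("aws-ABC123")
if raw is not None:
    print(json.loads(raw))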