Dealing with a 20-million-row gpkg file in the backend

I am starting a new project and still trying to wrap my mind around what it needs to achieve. I'm not sure if this is the best place to ask a question about project structure rather than a code question:

Here is a bit more about the project

  • needs to use data from a gpkg file containing about 20 million rows

  • the user inputs a string

  • the string is searched for in the file

  • the user sees the row corresponding to their input (if found)

I have a small prototype working with sample data, but I am not sure how to make it work with 20 million rows.

All the rows are important, but only 20 of the 170 columns are of interest for the project. Here is how I am thinking of approaching the issue:

  • convert the gpkg file into a CSV

  • load it into a dataframe and keep only the columns I want

  • import it into my Django model through the admin import

What would be the flaws of that method?

What would you guys do if you were me?

A gpkg file is an extended SQLite database. Converting it to a CSV is unnecessary and potentially difficult (depending on the relationships in the database). I would look into querying the GeoPackage database from a Python script and then inserting the data into your Django models.
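
Roughly, something like this (a minimal sketch; "parcels", the column names, and the Record model are placeholders to replace with your own):

```python
# Minimal sketch: read only the columns you need straight from the .gpkg
# (which is a SQLite file) and bulk-insert them into a Django model.
# "parcels", the column names, and the Record model are placeholders.
import sqlite3

from myapp.models import Record  # assumed model holding the ~20 useful columns


def import_gpkg(path, batch_size=5000):
    conn = sqlite3.connect(path)
    conn.row_factory = sqlite3.Row
    cursor = conn.execute("SELECT col_a, col_b, col_c FROM parcels")

    batch = []
    for row in cursor:
        batch.append(Record(col_a=row["col_a"], col_b=row["col_b"], col_c=row["col_c"]))
        if len(batch) >= batch_size:
            Record.objects.bulk_create(batch)
            batch = []
    if batch:
        Record.objects.bulk_create(batch)
    conn.close()
```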


Or, as an alternative, you could attach that database as a secondary database to the system, create a set of models for it, and use it directly.
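
That could look roughly like this, assuming you only need the non-spatial columns (the stock SQLite backend can open a .gpkg; geometry columns would need the SpatiaLite/GeoDjango backend). The path, table, and field names are placeholders:

```python
# settings.py: register the GeoPackage as a second connection.
# A .gpkg file is a SQLite database, so the stock backend can read
# its non-spatial columns. The path below is a placeholder.
DATABASES = {
    "default": {
        # ... your existing primary database ...
    },
    "gpkg": {
        "ENGINE": "django.db.backends.sqlite3",
        "NAME": BASE_DIR / "data" / "source.gpkg",
    },
}
```

```python
# models.py: an unmanaged model over the existing table
# ("parcels" and the field names are placeholders).
from django.db import models


class SourceRow(models.Model):
    ref_code = models.TextField()
    name = models.TextField()

    class Meta:
        managed = False        # Django will not create or migrate this table
        db_table = "parcels"   # table name inside the GeoPackage


def search(ref_code):
    # query explicitly against the second connection, e.g. from a view
    return SourceRow.objects.using("gpkg").filter(ref_code=ref_code)
```

Running `python manage.py inspectdb --database gpkg` can generate a first draft of those unmanaged models for you.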


Or, if you really needed to move it into your PostgreSQL database, you could still create a set of models for it and copy the desired data to your primary database using dumpdata / loaddata.
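
Very roughly, and assuming a matching set of managed models (and tables) already exists in your primary database, that copy could be scripted like this; the app label "gpkgdata" and the file name are placeholders:

```python
# Rough sketch of copying rows between the two connections with
# dumpdata/loaddata. "gpkgdata" and the fixture file name are placeholders,
# and this assumes matching managed tables exist in the default database.
from django.core.management import call_command

# serialize the rows read from the attached GeoPackage...
call_command("dumpdata", "gpkgdata", database="gpkg", output="gpkg_rows.json")

# ...then load that fixture into the primary (PostgreSQL) database
call_command("loaddata", "gpkg_rows.json", database="default")
```

The same thing can be done from the shell with manage.py dumpdata and loaddata using their --database option.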


Thank you for your valuable input. I'll try it as suggested and let you know!

In terms of performance, how would the app handle searching through 20 million rows? I have never dealt with that many rows.

Superficially, it sounds like this is a “read-only” situation - you’re not looking to update this data frequently, if at all.

In that situation, I’d be looking at creating indexes on all the important columns - and possibly column combinations - depending upon the nature of the searches to be performed.
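
For example, in the model (the field names here are placeholders; use whichever columns your searches actually filter on):

```python
# Sketch of index declarations on the searched columns
# (field names are placeholders).
from django.db import models


class Record(models.Model):
    ref_code = models.CharField(max_length=64)
    name = models.CharField(max_length=255)

    class Meta:
        indexes = [
            models.Index(fields=["ref_code"]),          # exact lookups on one column
            models.Index(fields=["ref_code", "name"]),  # combined filter on both columns
        ]
```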

(20 million rows - by itself - is not necessarily particularly large. You’d want to evaluate that relative to the total size of the database and the relative distribution of common search results, along with the size of the system on which this is being run. For example, if each row only consumed 100 bytes, that’s only 2 GB storage - the entire database could end up being memory resident.)

It is indeed a read-only situation. It corresponds to a public database that is not accessible anywhere else and for which no API exists. I will have to refresh the entire database once in a while, when the public db gets updated.

It sucks that I have to keep all the columns when only a few of the 160 are useful. I feel like it adds useless volume to the DB.