Dealing with a 20-million-row gpkg file in the backend

I am starting a new project and still trying to wrap my mind around what it needs to achieve. I'm not sure if this is the best place to ask a question about project structure rather than a code question:

Here is a bit more about the project

  • needs to use data from a gpkg file containing about 20 million rows

  • the user inputs a string

  • the string is searched for in the file

  • the user sees the row corresponding to their input (if found)

I have a small prototype working with sample data, but I am not sure how to make it work with 20 million rows.

All the rows are important, but only 20 of the 170 columns are of interest for the project. Here is how I am thinking of approaching the issue:

  • convert the gpkg file into a CSV

  • load it into a dataframe and keep only the columns I want

  • import it into my Django model through the admin import

What would be the flaws of that method?

What would you guys do if you were me?

A gpkg file is an extended SQLite database. Converting it to a CSV is unnecessary and potentially difficult (depending on the relationships in the database). I would look into querying the GeoPackage database from a Python script and then inserting the data into your Django models.
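
Roughly, something like this (a minimal sketch; "parcels", the column names, and the Record model are placeholders to replace with your own):

```python
# Minimal sketch: read only the columns you need straight from the .gpkg
# (which is a SQLite file) and bulk-insert them into a Django model.
# "parcels", the column names, and the Record model are placeholders.
import sqlite3

from myapp.models import Record  # assumed model holding the ~20 useful columns


def import_gpkg(path, batch_size=5000):
    conn = sqlite3.connect(path)
    conn.row_factory = sqlite3.Row
    cursor = conn.execute("SELECT col_a, col_b, col_c FROM parcels")

    batch = []
    for row in cursor:
        batch.append(Record(col_a=row["col_a"], col_b=row["col_b"], col_c=row["col_c"]))
        if len(batch) >= batch_size:
            Record.objects.bulk_create(batch)
            batch = []
    if batch:
        Record.objects.bulk_create(batch)
    conn.close()
```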


Or, as an alternative, you could attach that database as a secondary database to the system, create a set of models for it, and use it directly.
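
That could look roughly like this, assuming you only need the non-spatial columns (the stock SQLite backend can open a .gpkg; geometry columns would need the SpatiaLite/GeoDjango backend). The path, table, and field names are placeholders:

```python
# settings.py: register the GeoPackage as a second connection.
# A .gpkg file is a SQLite database, so the stock backend can read
# its non-spatial columns. The path below is a placeholder.
DATABASES = {
    "default": {
        # ... your existing primary database ...
    },
    "gpkg": {
        "ENGINE": "django.db.backends.sqlite3",
        "NAME": BASE_DIR / "data" / "source.gpkg",
    },
}
```

```python
# models.py: an unmanaged model over the existing table
# ("parcels" and the field names are placeholders).
from django.db import models


class SourceRow(models.Model):
    ref_code = models.TextField()
    name = models.TextField()

    class Meta:
        managed = False        # Django will not create or migrate this table
        db_table = "parcels"   # table name inside the GeoPackage


def search(ref_code):
    # query explicitly against the second connection, e.g. from a view
    return SourceRow.objects.using("gpkg").filter(ref_code=ref_code)
```

Running `python manage.py inspectdb --database gpkg` can generate a first draft of those unmanaged models for you.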


Or, if you really needed to move it into your PostgreSQL database, you could still create a set of models for it and copy the desired data to your primary database using dumpdata / loaddata.
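
Very roughly, and assuming a matching set of managed models (and tables) already exists in your primary database, that copy could be scripted like this; the app label "gpkgdata" and the file name are placeholders:

```python
# Rough sketch of copying rows between the two connections with
# dumpdata/loaddata. "gpkgdata" and the fixture file name are placeholders,
# and this assumes matching managed tables exist in the default database.
from django.core.management import call_command

# serialize the rows read from the attached GeoPackage...
call_command("dumpdata", "gpkgdata", database="gpkg", output="gpkg_rows.json")

# ...then load that fixture into the primary (PostgreSQL) database
call_command("loaddata", "gpkg_rows.json", database="default")
```

The same thing can be done from the shell with manage.py dumpdata and loaddata using their --database option.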


Thank you for your valuable input. I'll try it as suggested and let you know!

In terms of performance, how would the app handle searching through 20 million rows? I have never dealt with that many rows.

Superficially, it sounds like this is a “read-only” situation - you’re not looking to update this data frequently, if at all.

In that situation, I’d be looking at creating indexes on all the important columns - and possibly column combinations - depending upon the nature of the searches to be performed.
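
For example, in the model (the field names here are placeholders; use whichever columns your searches actually filter on):

```python
# Sketch of index declarations on the searched columns
# (field names are placeholders).
from django.db import models


class Record(models.Model):
    ref_code = models.CharField(max_length=64)
    name = models.CharField(max_length=255)

    class Meta:
        indexes = [
            models.Index(fields=["ref_code"]),          # exact lookups on one column
            models.Index(fields=["ref_code", "name"]),  # combined filter on both columns
        ]
```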

(20 million rows - by itself - is not necessarily particularly large. You’d want to evaluate that relative to the total size of the database and the relative distribution of common search results, along with the size of the system on which this is being run. For example, if each row only consumed 100 bytes, that’s only 2 GB storage - the entire database could end up being memory resident.)

It is indeed a read-only situation. It corresponds to a public database that is not accessible anywhere else and for which no API exists. I will have to refresh the entire database once in a while, when the public db gets updated.

It sucks that I have to keep all the columns when only a few of the 160 are useful. I feel like it adds useless volume to the DB.