Using a CSV file as data storage and access

Is it feasible to use a CSV file generated from another data source as the primary data store for a Django project, instead of a DBMS like SQLite?

In theory, possibly; however, you would lose the ability to use the Django ORM. I don’t know your use case well enough to judge, but I’d rather import the CSV content into a supported database, then offer an interface to re-export it as CSV if needed.

You can use “CSV” table storage in (at least) these database backends: MySQL/MariaDB’s built-in CSV storage engine, and MariaDB’s CONNECT storage engine.

These will be slow, though: changes to CSV-based tables require a lot of parsing and re-saving.

It’s probably better to import the CSV into a table with an optimized storage engine, and export it again later. django-import-export can help with that: https://pypi.org/project/django-import-export/
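To illustrate the import-then-re-export idea, here is a minimal sketch using only the Python standard library (an in-memory SQLite table standing in for the “optimized storage engine”; django-import-export works at the model level rather than raw SQL like this):

```python
import csv
import io
import sqlite3

def import_csv(conn, table, csv_text):
    """Load CSV text into a freshly created table, columns taken from the header."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    cols = ", ".join(f'"{c}"' for c in header)
    conn.execute(f'CREATE TABLE "{table}" ({cols})')
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', data)

def export_csv(conn, table):
    """Re-export a table as CSV text, header line included."""
    cur = conn.execute(f'SELECT * FROM "{table}"')
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([d[0] for d in cur.description])
    writer.writerows(cur.fetchall())
    return out.getvalue()

conn = sqlite3.connect(":memory:")
import_csv(conn, "superbowl", "sb_date,winner\n1967-01-15,Packers\n")
print(export_csv(conn, "superbowl"))  # round-trips the original CSV
```

Queries against the imported table then run at normal database speed, and the CSV can be regenerated on demand.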


WOW! I had no idea a facility like this existed. I’ve got 3-4 use cases where this would be extremely useful. (They would be “read-only” queries from externally-generated sources, where we’re currently importing that data into tables just to produce some statistics from that data. “Speed/performance” in this case is most likely irrelevant.) You’ve given me something else to play with this week.

Oh cool. I’ve been meaning to investigate it myself for a while. CONNECT is really cool in that it can read Excel, JSON files, and many other formats …

Let us know how it goes!

Following up on this…

This is a little less useful than I had hoped, but it will still be useful to me.

CSV works well enough. Some limitations that I encountered my first time through:

  • Access to the files in Linux is done by the MariaDB engine under the mysql uid. This means that the files need to be in a location that the mysql account has access to. (Understandable and fair enough. It’s just one of those factors that needs to be accounted for if you’re handling CSV files as part of a larger process.)

  • It doesn’t attempt to use the header line in the CSV at all, and if you update the file, it will rewrite a header line using the column names defined for that “table”.

  • (Related to the previous point) You must provide a table definition, defining the columns individually. (Meaning there’s no provision for a “dynamic table” with columns defined by the first line of the file.)

  • The file name is specified in the CREATE TABLE statement, which means you can’t dynamically assign something like a “table template” to different files. (Yes, I’m sure there’s a way to work around this by dynamically creating your own DDL for the tables.)

  • There doesn’t appear to be any direct support for this within Django. (Admittedly, I performed a very cursory review of the available documentation, both of Django and of Django-Mysql, and if it’s in there, I haven’t found it yet.) Yes, I can specify the CONNECT storage engine, but what I’m not seeing is an inherent facility for specifying the other necessary parameters.

For example:

CREATE TABLE superbowl (sb_date VARCHAR(20), < other columns snipped >)
ENGINE=CONNECT
table_type=CSV
file_name='/mnt/ext/mysql/superbowls.csv'
header=1;
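The workaround mentioned in the bullet above, dynamically generating your own DDL so a “table template” can point at different files, could be sketched like this (the option names mirror the CONNECT example; the column list and path are just the sample values from above):

```python
def connect_csv_ddl(table, columns, file_name, header=True):
    """Build a CREATE TABLE statement for a MariaDB CONNECT CSV table.

    columns: list of (name, sql_type) pairs.
    file_name: path readable by the mysql account, per the earlier note.
    """
    col_defs = ", ".join(f"{name} {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE TABLE {table} ({col_defs}) "
        f"ENGINE=CONNECT table_type=CSV "
        f"file_name='{file_name}' header={1 if header else 0};"
    )

ddl = connect_csv_ddl(
    "superbowl",
    [("sb_date", "VARCHAR(20)")],
    "/mnt/ext/mysql/superbowls.csv",
)
print(ddl)
```

The generated statement could then be executed per-file via a raw cursor, one table per CSV file sharing the same template.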

So I know I can create the table definition as a model, but it would be nice if there were some Meta options available for the rest of the parameters. (Yeah, I know: if it’s that important to me, I should do it myself. Unfortunately, it’s probably not important enough for me to try and tackle it.)

There are two really good features that I didn’t expect to find:

  • Very flexible with defining the format of the CSV with reference to field separators and quoting options.
  • The ability to ignore a predefined number of bad lines, or to accept bad lines by using NULL for fields with values that aren’t right.

The other feature that I was interested in, based upon your prior comment, was the ability to use ODBC as a means of accessing non-native formats such as Excel or Access. Unfortunately, there don’t appear to be any Open Source or reasonably priced (for a non-corporate individual) ODBC drivers for the Microsoft formats available for Linux. To even try this, I’d need to install MariaDB on a Windows system and test it there. (While I could do that for testing / experimentation purposes, that ends up being a non-starter for any of the functions for which I would find it useful.)

I do have a use case for the MongoDB connector; that’s worth a look in the future.

Ken


Cool, thanks for the write-up @KenWhitesell ! Could be a blog post.

Interesting. It would be nice if it left the header alone, but I guess writes are done by reading the whole file in and writing it back out (?)
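If that guess is right, the write path would look roughly like this plain-Python sketch (an assumption about CONNECT’s behavior, not its actual implementation), which also shows why the header line gets rewritten from the defined column names:

```python
import csv
import io

def update_csv(text, match, changes):
    """Rewrite CSV text by re-parsing every row and re-serializing the whole file.

    match: dict of column -> value identifying rows to change.
    changes: dict of column -> new value to apply to matching rows.
    """
    rows = list(csv.DictReader(io.StringIO(text)))
    for row in rows:
        if all(row[k] == v for k, v in match.items()):
            row.update(changes)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=rows[0].keys(), lineterminator="\n")
    writer.writeheader()    # header regenerated from field names,
    writer.writerows(rows)  # much as CONNECT rewrites its header line
    return out.getvalue()

text = "sb_date,winner\n1967-01-15,Packers\n1969-01-12,Jets\n"
print(update_csv(text, {"winner": "Jets"}, {"winner": "NY Jets"}))
```

Every update costs a full parse and a full rewrite, which matches the earlier point about CSV-based tables being slow for changes.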

Maybe you can copy a table’s metadata to a different filename with a query like: CREATE TABLE new_table ENGINE=CONNECT table_type=CSV file_name='that_filename.csv' AS SELECT * FROM template_table WHERE 1=0; (in MariaDB the table options go before the AS SELECT)?

Yeah. You can’t specify the storage engine with anything built into Django, as far as I understand. Did you mean specifying CONNECT through init_command?

It’d be nice to get Model.Meta to support database table options, perhaps through a dict.

You can still use migrations: put the CREATE TABLE SQL in a RunSQL operation, with state_operations containing the CreateModel operation that makemigrations generates.
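A sketch of what such a migration might look like (this requires a Django project, so it isn’t runnable standalone; the app name, dependency, table name, and field list are all placeholders, and the SQL reuses the CONNECT example from earlier in the thread):

```python
# Hypothetical migration file, e.g. stats/migrations/0002_superbowl.py
from django.db import migrations, models

class Migration(migrations.Migration):
    dependencies = [("stats", "0001_initial")]  # placeholder app/migration

    operations = [
        migrations.RunSQL(
            sql=(
                "CREATE TABLE stats_superbowl (sb_date varchar(20)) "
                "ENGINE=CONNECT table_type=CSV "
                "file_name='/mnt/ext/mysql/superbowls.csv' header=1;"
            ),
            reverse_sql="DROP TABLE stats_superbowl;",
            # state_operations tells Django's migration state what the raw
            # SQL "means", so the ORM model stays in sync with the table:
            state_operations=[
                migrations.CreateModel(
                    name="Superbowl",
                    fields=[
                        ("sb_date",
                         models.CharField(primary_key=True, max_length=20)),
                    ],
                ),
            ],
        ),
    ]
```

With this in place, the ORM can query the model normally while the actual table lives in the CONNECT engine.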

Nice!

It’s a shame there’s nothing open source.

…another thing I’ve seen that in theory could work for mapping Excel directly to the ORM is a SQLite virtual table, backed by openpyxl. See the ‘virtual tables’ section in this post, which briefly covers a redis-based table, with links: http://charlesleifer.com/blog/extending-sqlite-with-python/ . Not sure how plausible it is. So many things to play around with on SQLite!

Thank you for the info! Wouldn’t have happened otherwise. (Blog post? Have at it. If I ever write anything here that you feel deserves wider distribution, you explicitly have my permission to use it without credit, acknowledgement, or attribution. Oh, you mean me? :wink: Naa, I probably have one good post a year in me, and this fills my quota. Besides, you have a platform, voice, and reach that far exceeds mine - and if it deserves to be shared, it should be shared in a way to make it as widely available as possible.)

Yet another good tip - seems like it would work, but doesn’t match up with what I was envisioning. I was looking in the reverse direction. I’d like to define a table where the file_name is “variable” - or at least can be modified when the connection is established. (Yes, I can easily see where the semantics of that would be horrendous, especially if multiple processes were accessing that table and trying to refer to different names. So that concept just doesn’t “fit” into what a database engine is designed and intended to do. However, just because I know that doing it that way could cause serious issues doesn’t mean that I don’t want it to be that way. :smile:)

Hmmm… am I misinterpreting https://docs.djangoproject.com/en/3.0/ref/databases/#storage-engines in this context? It would appear to me that I could create a database connection specifying the storage engine for that database. (Haven’t tried - just making a guess based upon what I’m reading.)

This, to me, would be the key item in making it really useful in Django. (Also guessing that the needs for this are so narrow that it would be better served as an extension / third party package rather than in core. Not only is it MariaDB-specific, the legal keys for that dict would be different for each CONNECT type.)

Yet another good tip, thanks. I’ll keep that in mind.

The basic feature intrigues me, but not so much where Excel is concerned. (We already export spreadsheets to CSV files for processing, and this doesn’t look like it would simplify the environment enough to make it worth the effort.)

For sure! I’ve been using it for so long as a basic storage engine, I never bothered to get past the basic functionality to look at these types of features.

Ken