Is it feasible to use CSV file generated from other data source as primary data access object for a django project instead of using the dbms system like sqlite?
In theory, possibly, however you would lose the ability to use the Django ORM. I donāt know your use case enough to judge, but Iād rather import the CSV content in a supported database, then offer an interface to re-export it as CSV if needed.
You can use āCSVā table storage in (at least) these database backends:
- SQLite: https://www.sqlite.org/csv.html
- MySQL/MariaDB via the āCSVā storage engine: https://dev.mysql.com/doc/refman/8.0/en/csv-storage-engine.html . Or MariaDB via the CONNECT storage engine: https://mariadb.com/kb/en/connect/
However these will be slow though. Changes to CSV based tables require a lot of parsing and re-saving.
Itās probably better to import the CSV into a table with an optimized storage engine, and export it again later. django-import-export can help with that: https://pypi.org/project/django-import-export/
WOW! I had no idea a facility like this existed. Iāve got 3-4 use cases where this would be extremely useful. (They would be āread-onlyā queries from externally-generated sources, where weāre currently importing that data into tables just to produce some statistics from that data. āSpeed/performanceā in this case is most likely irrelevant.) Youāve given me something else to play with this week.
Oh cool. Iāve been meaning to investigate it myself for a while. CONNECT is really cool in that it can read excel, JSON files, and many other formats ā¦
Let us know how it goes!
Following up on thisā¦
This is a little less useful than I may have hoped, but will still be useful to me.
CSV works well enough. Some limitations that I encountered my first time through:
-
Access to the files in linux is done by the MariaDB engine under the mysql uid. This means that the files need to be in a location that the mysql account has access to. (Understandable and fair enough. Itās just one of those factors that needs to be accounted for if youāre handling CSV files as part of a larger process.)
-
It doesnāt attempt to use the header line in the CSV at all, and if you update the file, it will rewrite a header line using the column names defined for that ātableā.
-
(Related to the previous point) You must provide a table definition, defining the columns individually. (Meaning thereās no provision for a ādynamic tableā with columns defined by the first line of the file.)
-
The file name is specified in the CREATE TABLE statement, which means you canāt dynamically assign something like a ātable templateā to different files. (Yes, Iām sure thereās a way to work around this by dynamically creating your own DDL for the tables.)
-
There doesnāt appear to be any direct support for this within Django. (Admittedly, I performed a very cursory review of the available documentation, both of Django and of Django-Mysql, and if itās in there, I havenāt found it yet.) Yes, I can specify the CONNECT storage engine, but what Iām not seeing is an inherent facility for specifying the other necessary parameters.
For example:
create table superbowl (sb_date varchar(20), < other columns snipped >)
engine=CONNECT
table_type=CSV
file_name='/mnt/ext/mysql/superbowls.csv'
header=1;
So I know I can create the table definition as a model, but it would be nice if there were some META options available for the rest of the parameters. (Yea, I know, if itās that important to me, I should do it. Unfortunately, itās probably not important enough for me to try and tackle it.)
There are two really good features that I didnāt expect to find:
- Very flexible with defining the format of the CSV with reference to field separators and quoting options.
- The ability to ignore a predefined number of bad lines, or to accept bad lines by using NULL for fields with values that arenāt right.
The other feature that I was interested in, based upon your prior comment, was the ability to use ODBC as a means of accessing non-native formats such as Excel or Access. Unfortunately, there donāt appear to be any Open Source or reasonably priced (for a non-corporate individual) ODBC drivers for the Microsoft formats available for Linux. To even try this, Iād need to install MariaDB on a Windows system and test it there. (While I could do that for testing / experimentation purposes, that ends up being a non-starter for any of the functions for which I would find it useful.)
I do have a use-case for the MongoDB connector, thatās worth a look in the future.
Ken
Cool, thanks for the write-up @KenWhitesell ! Could be a blog post.
Interesting, it would be nice to leave it alone. But I guess writes are done by reading the whole file in and writing it back out (?)
Maybe you can copy a tableās metadata with a different filename with a query like: CREATE TABLE AS SELECT * FROM template_table WHERE 1=0 ENGINE=CONNECT table_type=CSV file_name=that_filename.csv
?
Yeah. You canāt specify storage engine with anything built-in Django afaiu, did you mean specifying CONNECT through init_command
?
Itād be nice to get Model.Meta
to support database table options, perhaps through a dict.
You can use migrations by putting the CREATE TABLE
SQL in a RunSQL
, with state_operations
containing the CreateModel
operation that makemigrations
generates.
Nice!
Nothing open source is a shame.
ā¦another thing Iāve seen that in theory could work for mapping Excel direct to the ORM is a SQLite virtual table, backed by openpyxl
. See the āvirtual tablesā section in this post, which briefly covers a redis-based table, with links: charles leifer | Extending SQLite with Python . Not sure how plausible. So many things to play around with on SQLite!
Thank you for the info! Wouldnāt have happened otherwise. (Blog post? Have at it. If I ever write anything here that you feel deserves wider distribution, you explicitly have my permission to use it without credit, acknowledgement, or attribution. Oh, you mean me? Naa, I probably have one good post a year in me, and this fills my quota. Besides, you have a platform, voice, and reach that far exceeds mine - and if it deserves to be shared, it should be shared in a way to make it as widely available as possible.)
Yet another good tip - seems like it would work, but doesnāt match up with what I was envisioning. I was looking in the reverse direction. Iād like to define a table where the file_name is āvariableā - or at least can be modified when the connection is established. (Yes, I can easily see where the semantics of that would be horrendous, especially if multiple processes were accessing that table and trying to refer to different names. So that concept just doesnāt āfitā into what a database engine is designed and intended to do. However, just because I know that doing it that way could cause serious issues doesnāt mean that I donāt want it to be that way. )
Hmmmā¦ am I misinterpreting Databases | Django documentation | Django in this context? It would appear to me that I could create a database connection specifying the storage engine for that database. (Havenāt tried - just making a guess based upon what Iām reading.)
This, to me, would be the key item in making it really useful in Django. (Also guessing that the needs for this are so narrow that it would be better served as an extension / third party package rather than in core. Not only is it MariaDB-specific, the legal keys for that dict would be different for each CONNECT type.)
Yet another good tip, thanks. Iāll keep that in mind.
The basic feature intrigues me, but not so much as far as Excel is concerned. (We already export spreadsheets to CSV files for processing - this doesnāt appear like it would simplify the environment enough to make it worth the effort.)
For sure! Iāve been using it for so long as a basic storage engine, I never bothered to get past the basic functionality to look at these types of features.
Ken