Inserting data into a database using files on the filesystem

Hi,
I am new to Django and am running into some trouble getting started. I would really appreciate it if someone could guide me through this. Before posting here, I read numerous articles related to my question, but a few things are still not clear to me.

My question is:

  1. I have code written using Python pandas, and I have created a DataFrame that is ready to use.
  2. Now I wish to use this DataFrame in my Django code and insert its data into the database.

I would like to know where to put this pandas code so that, when I run my Django application, it creates the necessary DataFrames and inserts the data into the database for further use.

I would also like to base my models.py on the rows and columns of this DataFrame.
(I frankly do not have much code yet; as I said, I am having trouble getting started.) I am grateful for any help.

Is this a one-time (or infrequent) process that you will be running, or is it something that multiple people will be doing repeatedly?

If this is an infrequent process that only you need to do, I would look at making it a management command. You would then run this command (something like python manage.py load_panda_data if that’s what you named your command) from the command line.
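
A bare-bones sketch of such a command might look like this (the file name, model, and CSV path here are all placeholders for whatever you're actually using):

# yourapp/management/commands/load_panda_data.py
import pandas as pd
from django.core.management.base import BaseCommand

from yourapp.models import Measurement  # placeholder model


class Command(BaseCommand):
    help = "Build the DataFrame and insert its rows into the database"

    def handle(self, *args, **options):
        # build or load your DataFrame however you already do in pandas
        df = pd.read_csv("/path/to/data.csv")

        # one model instance per DataFrame row; column names must match
        # the model's field names
        Measurement.objects.bulk_create(
            Measurement(**record) for record in df.to_dict("records")
        )
        self.stdout.write(self.style.SUCCESS(f"Loaded {len(df)} rows"))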

Ken

Hi Ken,
Thanks a lot for replying to my query.
This is not a one-time task. I want to fetch data from the Unix filesystem, manipulate it with a pandas DataFrame, push it to the database, run some calculations with NumPy, and visualize the data (maybe using matplotlib or other visualization modules that Python offers).
I would like to perform near-real-time visualization, refreshing every 2 seconds at a minimum, and also give the end user options for the time frame they want to visualize (basically making it configurable, so the query is customized to the time frame the user wishes to view in the browser).
Yes, multiple people should be able to do this in parallel.
Can you please give me some guidance?

To begin with, I need a way to keep pushing data into the database every few seconds, and then build on that.

Thanks and Regards

You’ve got two distinct and separate tasks here - data collection and data visualization.

The data collection component (“fetch data from unix FS, manipulate it using pandas, push it to database”) is completely separate from the data visualization (“perform calculations using numpy and visualize the data”).

For the data collection, while you could use a Celery task (with CeleryBeat), I’m not convinced it’s necessarily the “best” tool for this job. If I were asked to architect this, my first thought would be to set this up completely outside the Django infrastructure. (I’d write a separate Python program to manage this process.)
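
For illustration, a rough sketch of that kind of standalone program (all names here are placeholders; the key part is calling django.setup() so the ORM is usable outside manage.py):

# standalone_loader.py
import os
import time

import django

# point at your settings module, then bootstrap Django's ORM
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "yourproject.settings")
django.setup()

import pandas as pd  # noqa: E402
from yourapp.models import Measurement  # noqa: E402  (placeholder model)


def load_once():
    # read whatever has landed on the filesystem and push it to the DB
    df = pd.read_csv("/path/to/incoming.csv")
    Measurement.objects.bulk_create(
        Measurement(**record) for record in df.to_dict("records")
    )


if __name__ == "__main__":
    while True:
        load_once()
        time.sleep(2)  # the 2-second cadence you mentioned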

The other half, data visualization, is pretty much standard Django. Retrieve data from the database and render web pages from it. (Although, I will admit that I’ve never tried to display “live-generated” matplotlib images through a web interface.)
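
As an illustration only (I haven't built exactly this, and Reading here is a hypothetical model with timestamp and value fields), the visualization side could be as simple as a JSON endpoint that the browser polls every couple of seconds and redraws from:

# views.py
from django.http import JsonResponse

from yourapp.models import Reading  # hypothetical model


def latest_readings(request):
    # return the most recent rows; the page re-fetches this endpoint
    # on a timer and redraws the chart client-side
    rows = Reading.objects.order_by("-timestamp").values("timestamp", "value")[:100]
    return JsonResponse({"readings": list(rows)})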

(I would also have some concerns about being able to do this reliably for multiple people at 2-second intervals, but I’m guessing you’ve already prototyped enough of this to know that it’s actually possible to do what you’re considering.)

Hi kkarthik,

I do exactly what you are describing when it comes to pandas and DataFrames. I have developed a backend with Django and DRF, and my colleagues, two veterinary professors, define diagnostic and clinical data in spreadsheets, which are then used to populate the database.

As Ken said, I recommend you start by writing a separate program to do this. In my case, I have a base class, which I call BaseImporter, that defines the common parts of importing from a spreadsheet.

Then, for each tab of the spreadsheet, I have a class which inherits from BaseImporter. Lastly, I have a separate program which puts it all together: fetch the spreadsheets (I use Google Sheets), read them, process and sanity-check the DataFrames, and pass the data to the relevant importer.

Each importer logs its work and returns any logging information to the main program. This log information is then returned to whatever calls the main importer.

Finally, I use Celery to run the importer. My colleagues visit a FormView from which they can choose a species, and then my importer runs in the background and imports all the data for the chosen species.

I’ll share some basic examples of how I put this together. I can’t promise that the code is any good, and it is constantly changing, but I hope that this helps you in some way.

Base Class

import logging

import pandas as pd

logger = logging.getLogger(__name__)


class BaseImporter:
    def __init__(self, species=None):
        self.named_tuple = None
        self.species = species
        self.sheet = None
        self.errors = []  # for returning errors back to the importer tool
        self.logging_level = logger.getEffectiveLevel()

    def parse_case_sheet(self, obj, row):
        # build an instance of `obj` (a namedtuple) from a cleaned row
        return obj(**self.clean_data(row))

    @staticmethod
    def clean_data(row):
        # strip stray whitespace and normalise empty/NaN values to None
        for key, value in row.items():
            if isinstance(value, str):
                value = row[key] = value.strip()
            if pd.isna(value) or value == "" or value == "null":
                row[key] = None
        return row

# and a whole bunch more

A sheet importer

class AssessmentImporter(BaseImporter):
    # `Columns` is a namedtuple matching this sheet's columns, and
    # `Assessment` is the Django model being populated; their imports,
    # and the self.debug helper, live in the parts omitted above
    def __init__(self, sheet, cases):
        super().__init__()
        self.sheet = sheet
        self.cases = cases

    def run_importer(self):
        for index, row in self.sheet.iterrows():
            data: Columns = self.parse_case_sheet(Columns, row)

            # skip rows that don't reference a known case or that have
            # no initial assessment
            if data.case is None or data.case not in self.cases or data.initial_assessment is None:
                continue

            case = self.cases[data.case]

            assessment, created = Assessment.objects.update_or_create(
                case=case,
                defaults={
                    "initial_assessment": data.initial_assessment,
                    "updated_assessment": data.updated_assessment,
                },
            )
            logging_message = "created" if created else "updated"
            self.debug(f"Case Assessment {assessment.initial_assessment} {logging_message}")
        return self.errors

Main program

    # get the spreadsheet in question
    # (google_sheets, extract_case_data, RANGES etc. are defined elsewhere)
    if species == "rabbit":
        data = google_sheets(RANGES, RABBITS)
    # get the tab (sheet) in question
    assessment_sheet = extract_case_data(data, RANGES["assessment"])
    # run the importer for the sheet; run_importer() returns its errors
    assessment_errors = AssessmentImporter(assessment_sheet, cases).run_importer()
    log_importer_results("assessments", assessment_errors, importer_logs)

Celery Stuff

Be sure to check out this link: https://docs.celeryproject.org/en/stable/django/first-steps-with-django.html
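
For reference, the config/celery.py that my task below imports from follows the standard pattern from that guide (assuming a project package named config, as in my project):

# config/celery.py
import os

from celery import Celery

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "config.settings")

app = Celery("config")
# read CELERY_* settings from Django's settings.py
app.config_from_object("django.conf:settings", namespace="CELERY")
# discover tasks.py modules in installed apps
app.autodiscover_tasks()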

This is my task that runs the importer

from config.celery import app
from celery.utils.log import get_task_logger

from populator.import_tool import importer

logger = get_task_logger(__name__)


@app.task(name="importer")
def run_importer(species):
    """Runs the importer"""
    logger.debug("Run importer")
    return importer(species)

And this is my view which runs the task

# This is from the POST method of my Import FormView.
# It may or may not be of use to you, but it shows how I execute the
# task and get the importer results back from Celery.
# (It assumes `import json`, `from django.http import HttpResponse`,
# `from rest_framework import status`, and the run_importer task above.)

        if form.is_valid():
            species = form.cleaned_data["species"]
            result = run_importer.delay(species)
            return HttpResponse(
                json.dumps({"task_id": result.id}),
                status=status.HTTP_200_OK,
                content_type="application/json",
            )
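
And if you want to report progress back to the user, a small companion view (just a sketch, adapt as needed) can poll that task_id with Celery's AsyncResult:

from celery.result import AsyncResult
from django.http import JsonResponse


def importer_status(request, task_id):
    # look up the background task and report its state; once it has
    # finished successfully, include the importer's returned logs
    result = AsyncResult(task_id)
    payload = {"state": result.state}
    if result.successful():
        payload["result"] = result.get()
    return JsonResponse(payload)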

Righto, that’s how I do it. I’m sure there are a thousand different ways to do it, and no doubt it can be done better, but I have found this to be highly functional and robust.

Best of luck!

Conor

Hi Ken,

Thanks for shedding some light on ways to approach this.
I will try to get a move on now.

Hi Conor,

Thanks for taking the time to detail all of this.
I appreciate your patience and time.