Data not "updating" on my deployment server ?

Hello,

I am creating a website using Django.

For context, the infrastructure is on AWS: Django is hosted on an EC2 instance, and the static assets are served by S3 and CloudFront.

My server has been running on my instance for several days now, so I can run some tests. In this context, the DEBUG value is set to False in settings.py.

I’ve just noticed something I can’t resolve: my data changes every day (maximum data date, number of jobs, etc.), but these values are not updated on the site and remain the same as they were on the day the server was launched.

I wasn’t using a cache, so I set one up for the values in question. I also created a separate script, run periodically via crontab, which invalidates the existing cache and repopulates it with fresh data.

The “funny” thing is that I put a print in this script: if I run it manually, the cached value is updated correctly, yet the update never shows up on the site.

Do you know how to manage this in Django? I can share some code, but before flooding you with my files I’d rather know whether there is a way for Django to refresh these values on the website, because I can’t find one at the moment.

I’ve thought of shutting down the server and restarting it every day via the crontab, but I don’t think that’s a great solution.

Edit: my main issue here is a date filter. The “initial” value for the end date is the MAX(date) from my database, but it isn’t updating as it should.

Why don’t you start by sharing just the view containing the filter that isn’t working for you?

Of course, here are the files:

First, the contents of my forms.py file: if the desired values (min and max date, list of regions, list of professions) are not cached (which is the case when the server is launched), it caches these elements for 24 hours.

forms.py


from django import forms
from django.core.cache import cache

from datastats_db_config import get_db_connection

# -----------------
# ------ SQL ------
# -----------------

def get_job_choices():
    job_choices = cache.get('job_choices')
    if job_choices is None:
        conn = get_db_connection()
        cursor = conn.cursor()
        cursor.execute("SELECT DISTINCT job_search FROM jobs")
        job_choices = [i[0] for i in cursor.fetchall()]
        cache.set('job_choices', job_choices, timeout=60*60*24)  # Cache for 24 hours
        cursor.close()
        conn.close()
    return job_choices

def get_date_range():
    dates = cache.get('date_range')
    if dates is None:
        conn = get_db_connection()
        cursor = conn.cursor()
        cursor.execute("SELECT MIN(date_of_search) AS min, MAX(date_of_search) AS max FROM jobs")
        dates = cursor.fetchall()[0]
        cache.set('date_range', dates, timeout=60*60*24)
        cursor.close()
        conn.close()
    return dates

# ----------------------
# ------ Functions ------
# ----------------------

def all_jobs():
    return get_job_choices()

def min_max_dates():
    return get_date_range()

def all_regions():
    return get_region_choices()

min = get_date_range()[0]
max = get_date_range()[1]

# ------------------
# ------ Forms ------
# ------------------

def get_region_choices():
    region_choices = cache.get('region_choices')
    if region_choices is None:
        conn = get_db_connection()
        cursor = conn.cursor()
        cursor.execute("SELECT DISTINCT region FROM jobs WHERE region IS NOT NULL")
        region_choices = [i[0] for i in cursor.fetchall()]
        cache.set('region_choices', region_choices, timeout=60*60*24)
        cursor.close()
        conn.close()
    return region_choices


class ApiSearchForm(forms.Form):
    date_start = forms.DateField(
        label='Start date',
        initial=min.strftime('%Y-%m-%d'),
        widget=forms.DateInput(attrs={'type': 'date', 'class': 'form-control'}),                             
        required=False)

    date_end = forms.DateField(
        label='End date',
        initial=max.strftime('%Y-%m-%d'),                       
        widget=forms.DateInput(attrs={'type': 'date', 'class': 'form-control'}),                               
        required=False)  

    job_tupled = [("All", "All")] + [(job, job) for job in get_job_choices()]

    job = forms.ChoiceField(
        label='Jobs', 
        choices=job_tupled,
        widget=forms.Select(attrs={'class': 'form-control'}),
        required=False)

    region_tupled = [("All", "All")] + [(region, region) for region in get_region_choices()]
    region = forms.ChoiceField(
        label='Région', 
        choices=region_tupled,  
        widget=forms.Select(attrs={'class': 'form-control'}),
        required=False)

At the same time, I’ve created a caching script which is run every 24 hours by cron. This script invalidates the current cache and re-caches each element.

daily_cache.py

from django.core.cache import cache
from datastats_db_config import get_db_connection
import os
import django
import json
import pandas as pd

os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'datastats.settings')
django.setup()

def update_cache():
    conn = get_db_connection()
    cursor = conn.cursor()

    cache.delete('job_choices')
    cache.delete('date_range')
    cache.delete('region_choices')
    cache.delete('first_value')
    cache.delete('number_of_results')

    cursor.execute("SELECT DISTINCT job_search FROM jobs")
    job_choices = [i[0] for i in cursor.fetchall()]
    cache.set('job_choices', job_choices, timeout=60*60*24) 

    cursor.execute("SELECT MIN(date_of_search) AS min, MAX(date_of_search) AS max FROM jobs")
    dates = cursor.fetchall()[0]
    cache.set('date_range', dates, timeout=60*60*24)

    cursor.execute("SELECT DISTINCT region FROM jobs WHERE region IS NOT NULL")
    region_choices = [i[0] for i in cursor.fetchall()]
    cache.set('region_choices', region_choices, timeout=60*60*24)

    cursor.execute("""
        SELECT
            date_of_search,
            job_search,
            region,
            city,
            company_name,
            technos,
            description
        FROM
            jobs
        ORDER BY 
            id     
    """)
    apidf = pd.DataFrame(cursor.fetchall(), columns=['date_of_search', 'job_search', 'region', 'city', 'company_name', 'technos', 'description'])
    apidf['date_of_search'] = apidf['date_of_search'].astype(str)

    first_value = json.dumps(apidf.iloc[0].to_dict(), indent=4, ensure_ascii=False)
    cache.set('first_value', first_value, timeout=60*60*24) 

    number_of_results = len(apidf)
    cache.set('number_of_results', number_of_results, timeout=60*60*24) 


    cursor.close()
    conn.close()

    print(cache.get('date_range'))

if __name__ == '__main__':
    update_cache()

Finally, the values in views.py. I import the min_max_dates() function from forms.py to display the maximum date on the front end → this value doesn’t update and stays at the date on which I launched the server.

Ditto for the ApiSearchForm() → The maximum date remains the same as when the server was launched.

If I manually rerun the caching script and check the max date value with a print, it is updated correctly. On the front end, however, the update doesn’t take place: the insight and the form remain “frozen” at a date which isn’t necessarily the right one.

views.py

def data(request):
    
    if request.method == 'POST':

        apiform = ApiSearchForm(request.POST)
        
        if apiform.is_valid():
        
            # Code omitted to show only what runs when the page is first loaded (the else branch)
            ...
                
    else:
        
        apiform = ApiSearchForm()
        
        first_value = cache.get('first_value')
        if first_value is None:
            conn = get_db_connection()
            cursor = conn.cursor()
            cursor.execute("""
                SELECT
                    date_of_search,
                    job_search,
                    region,
                    city,
                    company_name,
                    technos,
                    description
                FROM
                    jobs
                ORDER BY 
                    id     
            """)
            apidf = pd.DataFrame(cursor.fetchall(), columns=['date_of_search', 'job_search', 'region', 'city', 'company_name', 'technos', 'description'])
            apidf['date_of_search'] = apidf['date_of_search'].astype(str)

            first_value = json.dumps(apidf.iloc[0].to_dict(), indent=4, ensure_ascii=False)
            cache.set('first_value', first_value, timeout=60*60*24)  # Cache for 24 hours

            number_of_results = len(apidf)
            cache.set('number_of_results', number_of_results, timeout=60*60*24)  # Cache for 24 hours

            cursor.close()
            conn.close()
        
        number_of_results = cache.get('number_of_results')
        
        real_date_min = min_max_dates()[0]
        real_date_max = min_max_dates()[1]
        

        return render(
            request, 
            'data.html', 
            {
                'first_value': first_value,
                'apiform': apiform,
                'number_of_results': number_of_results,
                'real_date_min': real_date_min,
                'real_date_max': real_date_max,
            })

First issue:

Those statements are only ever going to get executed once - when the module is imported.
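
To sketch one way around that: Django evaluates a callable passed as initial (and as choices on a ChoiceField) each time the form is instantiated, so you can pass the helper functions themselves rather than their results. Something roughly along these lines, reusing the helpers from your forms.py (this is an outline, not a drop-in replacement):

def initial_start_date():
    # Re-evaluated every time the form is built, not once at import.
    return get_date_range()[0].strftime('%Y-%m-%d')

def initial_end_date():
    return get_date_range()[1].strftime('%Y-%m-%d')

def job_choices_with_all():
    return [("All", "All")] + [(job, job) for job in get_job_choices()]

class ApiSearchForm(forms.Form):
    date_start = forms.DateField(
        label='Start date',
        initial=initial_start_date,  # pass the callable itself, no parentheses
        widget=forms.DateInput(attrs={'type': 'date', 'class': 'form-control'}),
        required=False)

    date_end = forms.DateField(
        label='End date',
        initial=initial_end_date,
        widget=forms.DateInput(attrs={'type': 'date', 'class': 'form-control'}),
        required=False)

    job = forms.ChoiceField(
        label='Jobs',
        choices=job_choices_with_all,  # callable choices are re-evaluated when the form is initialized
        widget=forms.Select(attrs={'class': 'form-control'}),
        required=False)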

Second: It’s an extremely bad idea to use Python built-in function names as variables.
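
For example, something like this (the names are just illustrative) keeps the built-ins min() and max() usable:

# Hypothetical rename: avoids shadowing Python's min() and max()
date_min, date_max = get_date_range()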

Third: You’re wasting a lot of energy trying to manage a query cache that Django already manages. I doubt seriously that trying to cache those query results yourself is saving any detectable amount of time. Strongly suggest you get rid of that entire effort. Write your queries to use the relationships and tables that exist.

Fourth: Use the ORM. It’s there for a reason.
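
As a rough illustration - assuming a Job model mapped onto your existing jobs table, which isn't shown in this thread - the lookups in your forms.py become:

from django.db.models import Min, Max

# Hypothetical Job model over the existing "jobs" table.
job_choices = list(Job.objects.values_list('job_search', flat=True).distinct())

date_range = Job.objects.aggregate(date_min=Min('date_of_search'),
                                   date_max=Max('date_of_search'))

region_choices = list(Job.objects.filter(region__isnull=False)
                                 .values_list('region', flat=True)
                                 .distinct())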

If you haven’t done so already, I suggest you work your way through the Official Django Tutorial. Or, even if you have, maybe go through it again, this time focusing on the ORM portions of the exercises.

Thank you for your valuable advice. I don’t use the ORM by default because most of my queries are complex, analysis-oriented ones; I chose to use psycopg2 instead.

On the other hand, I hadn’t thought about the import only happening once, and that was exactly my problem.

Thanks again for your help!

The queries you are using here - or at least the ones you are showing - don’t fit into that category.

Using the ORM for those queries does not prevent you from using raw SQL in other circumstances - and in fact is likely to help facilitate that.
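
For instance, something like this (the query itself is just an illustration, not taken from your code) runs raw SQL through Django's own database connection, side by side with ORM queries:

from django.db import connection

def jobs_per_region(start, end):
    # Raw SQL where it genuinely helps, ORM everywhere else.
    with connection.cursor() as cursor:
        cursor.execute(
            """
            SELECT region, COUNT(*) AS n
            FROM jobs
            WHERE date_of_search BETWEEN %s AND %s
            GROUP BY region
            ORDER BY n DESC
            """,
            [start, end],
        )
        return cursor.fetchall()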

I’ve run into very few situations where raw SQL is preferable - and that includes some “complex” queries. (Including a handful that, when generated into SQL by the ORM, were more than 16,000 characters in length. If you’re really curious, see In Pursuit of PostgreSQL Performance - DEV Community.)