Generating a report by parsing a big file (Snort) is taking too long

Hi!
So I’m trying to build a report view from data in multiple Snort log files, which can contain 100K+ records in total. The problem is that this task takes too long. For each file, I read information from each line, assign it to a dictionary, and then to an OrderedDict. In my view I can then access the data in a DataTable and ECharts. What would be the best approach to optimize this process? I heard about asynchronous support in Django 3, but it’s a mystery for me :frowning:
Here is my view function:

    @login_required
    def generate_report_fast(request):
        files = UploadFile.objects.filter(user=request.user)
        if files:
            dict_alert = OrderedDict()
            count = 1
            for file in files:
                path = file.upload_file.path
                date = str(file.get_year())
                with open(path) as alertfile:
                    for line in alertfile:
                        dict_alert[count] = read_data(line, date)
                        count += 1

            proto_count_data = protocount(dict_alert)
            ip_count_data = ipcount(dict_alert)
            classi_count_data = classicount(dict_alert)
            priority_count_data = prioritycount(dict_alert)
            time_count_data = timecount(dict_alert)

            return render(request, 'report.html', {'pkts': dict_alert, 'proto': proto_count_data, 'ip_data': ip_count_data,
                                                    'classi_data': classi_count_data, 'priority_data': priority_count_data,
                                                    'time_data': time_count_data})
        else:
            return render(request, 'details_error.html')

There might be a couple different ways to improve this, but I’m not sure that any of them are going to make this “fast” if you’re processing hundreds of thousands of lines.

First, I’d take a look at how you’re loading the data for internal use. You don’t show the details of your read_data method, so I can’t offer any suggestions there.

But, in your processing block

    proto_count_data = protocount(dict_alert)
    ip_count_data = ipcount(dict_alert)
    classi_count_data = classicount(dict_alert)
    priority_count_data = prioritycount(dict_alert)
    time_count_data = timecount(dict_alert)

I’m guessing that you’re iterating over your dict_alert one time for each of these statistics you’re gathering. You’re going through all your collected data five times. You can reduce the time spent iterating over that data by collecting your stats in one loop. (You didn’t include the definitions for protocount, ipcount, classicount, prioritycount, or timecount, so I can’t be more specific than that.) In theory, that will reduce the amount of time spent in that section of the code by 80%.
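
As a rough illustration only, here is a minimal single-pass sketch. Since read_data and the count helpers aren’t shown, the field names ('proto', 'src_ip', 'classification', 'priority', 'timestamp') are guesses — substitute whatever keys your parsed records actually use. It builds all five tallies with collections.Counter in one loop:

    from collections import Counter

    def count_all_stats(dict_alert):
        """Build all five tallies in a single pass over the parsed alerts.

        Assumes each alert is a dict with 'proto', 'src_ip',
        'classification', 'priority' and 'timestamp' keys -- adjust
        the key names to whatever read_data() actually produces.
        """
        proto_count = Counter()
        ip_count = Counter()
        classi_count = Counter()
        priority_count = Counter()
        time_count = Counter()

        for alert in dict_alert.values():
            proto_count[alert['proto']] += 1
            ip_count[alert['src_ip']] += 1
            classi_count[alert['classification']] += 1
            priority_count[alert['priority']] += 1
            time_count[alert['timestamp']] += 1

        return proto_count, ip_count, classi_count, priority_count, time_count

The view would then call this once and pass the five results to the template, instead of making five separate passes.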

Also, if you’re going to be running multiple sets of reports over a period of time with effectively the same data, there may be value in storing this in a database so that you’re not reading and parsing the data every time you need it. However, the value of this is extremely sensitive to your eventual usage of the data being handled.
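
If you do go that route, a sketch might look something like this. The Alert model and its fields are purely hypothetical — match them to whatever read_data() returns — and bulk_create is used so the rows aren’t inserted one at a time:

    from django.db import models

    class Alert(models.Model):
        # Illustrative fields only -- adjust to your parsed data.
        upload_file = models.ForeignKey('UploadFile', on_delete=models.CASCADE)
        proto = models.CharField(max_length=16)
        src_ip = models.GenericIPAddressField()
        classification = models.CharField(max_length=128)
        priority = models.IntegerField()
        timestamp = models.DateTimeField()

    def import_alerts(upload_file, parsed_alerts):
        """Persist parsed alerts once so later reports can query them."""
        Alert.objects.bulk_create(
            [Alert(upload_file=upload_file, **alert) for alert in parsed_alerts],
            batch_size=1000,
        )

Subsequent report requests could then aggregate in the database instead of re-reading and re-parsing the log files, e.g. Alert.objects.values('proto').annotate(total=Count('proto')).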

Ken
