API Optimization

Hello developers,

I am developing a threat intelligence project and I am running into some performance problems. I am sharing the relevant code below and would appreciate your optimization suggestions. I am currently using pagination.

views.py

from django_filters.rest_framework import DjangoFilterBackend
from rest_framework import viewsets

from .filters import DataLeakFilter, PasswordEntryFilter
from .models import DataLeak, PasswordEntry
from .serializers import DataLeakSerializer, PasswordEntrySerializer

class PasswordEntryViewSet(viewsets.ModelViewSet):
    """
    ViewSet for managing PasswordEntry records.
    """
    queryset = PasswordEntry.objects.all().order_by('-id')
    serializer_class = PasswordEntrySerializer
    filter_backends = [DjangoFilterBackend]
    filterset_class = PasswordEntryFilter

class DataLeakViewSet(viewsets.ModelViewSet):
    """
    ViewSet for managing DataLeak records.
    """
    queryset = DataLeak.objects.all().order_by('-id')
    serializer_class = DataLeakSerializer
    filter_backends = [DjangoFilterBackend]
    filterset_class = DataLeakFilter
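
For reference, here is roughly what switching these viewsets to cursor (keyset) pagination could look like; a sketch assuming DRF's built-in CursorPagination, with a placeholder class name and page size. On very large tables, page-number/offset pagination gets slower the deeper the page, because the database still has to walk and discard all of the skipped rows.

from rest_framework.pagination import CursorPagination

class LargeTableCursorPagination(CursorPagination):
    # Pages via a WHERE clause on the ordering column instead of OFFSET,
    # so deep pages stay roughly as cheap as the first one.
    page_size = 100
    ordering = '-id'  # should be a unique, indexed column

class PasswordEntryViewSet(viewsets.ModelViewSet):
    queryset = PasswordEntry.objects.all().order_by('-id')
    serializer_class = PasswordEntrySerializer
    filter_backends = [DjangoFilterBackend]
    filterset_class = PasswordEntryFilter
    pagination_class = LargeTableCursorPagination

The trade-off is that clients get next/previous cursors rather than arbitrary page numbers, which is usually acceptable for feeds like this.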

serializers.py

from rest_framework import serializers

from .models import DataLeak, PasswordEntry

class PasswordEntrySerializer(serializers.ModelSerializer):
    victim_comment = serializers.CharField(
        source='victim.comment', read_only=True
    )

    class Meta:
        model = PasswordEntry
        fields = '__all__'


class DataLeakSerializer(serializers.ModelSerializer):
    class Meta:
        model = DataLeak
        fields = '__all__'
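
One thing the serializer above implies: PasswordEntrySerializer reads victim.comment, so serializing a page of PasswordEntry rows triggers one extra query per row unless the related Victim is joined up front. A small sketch of eager-loading it in the viewset (same viewset as in views.py above):

class PasswordEntryViewSet(viewsets.ModelViewSet):
    serializer_class = PasswordEntrySerializer
    filter_backends = [DjangoFilterBackend]
    filterset_class = PasswordEntryFilter
    # select_related('victim') pulls the Victim row in the same query,
    # so 'victim.comment' does not hit the database once per entry.
    queryset = (
        PasswordEntry.objects
        .select_related('victim')
        .order_by('-id')
    )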

filters.py


import django_filters

from .models import DataLeak, PasswordEntry

class PasswordEntryFilter(django_filters.FilterSet):
    url = django_filters.CharFilter(field_name='url', lookup_expr='icontains')
    username = django_filters.CharFilter(field_name='username', lookup_expr='icontains')

    class Meta:
        model = PasswordEntry
        fields = ['url', 'username']

class DataLeakFilter(django_filters.FilterSet):
    # The model field is 'email'; 'mail' remains the public query parameter.
    mail = django_filters.CharFilter(field_name='email', lookup_expr='icontains')
    domain = django_filters.CharFilter(field_name='domain', lookup_expr='icontains')

    class Meta:
        model = DataLeak
        fields = ['mail', 'domain']
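
A caveat on these filters: icontains becomes LIKE '%term%', which the plain B-tree indexes in models.py cannot serve, so every filtered request scans the table. If the project runs on PostgreSQL, one common option is a trigram GIN index; a sketch as a migration, assuming PostgreSQL with the pg_trgm extension available (the app label and dependency below are placeholders):

from django.contrib.postgres.indexes import GinIndex
from django.contrib.postgres.operations import TrigramExtension
from django.db import migrations

class Migration(migrations.Migration):
    dependencies = [
        ('intel', '0001_initial'),  # placeholder app label / migration name
    ]

    operations = [
        # Enables pg_trgm (requires sufficient database privileges).
        TrigramExtension(),
        # GIN trigram index so icontains on username can use an index.
        migrations.AddIndex(
            model_name='passwordentry',
            index=GinIndex(
                fields=['username'],
                name='passwordentry_username_trgm',
                opclasses=['gin_trgm_ops'],
            ),
        ),
    ]

The same idea applies to the url, email and domain columns if those filters are hit often.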

models.py

import uuid

from django.db import models

class Victim(models.Model):
    victim_id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
    comment = models.TextField()
    created_at = models.DateTimeField(auto_now_add=True)

    def __str__(self):
        return self.comment

class PasswordEntry(models.Model):
    url = models.URLField()
    username = models.CharField(max_length=255)  # indexed via Meta.indexes below
    password = models.CharField(max_length=100)
    file_name = models.CharField(max_length=255, null=True, blank=True)  # covered by the ('file_name', 'uploaded_at') index below
    uploaded_at = models.DateTimeField(null=True, blank=True)
    
    victim = models.ForeignKey('Victim', on_delete=models.CASCADE, related_name='combined_entries', null=True, blank=True)
    application = models.CharField(max_length=255, null=True, blank=True)

    class Meta:
        indexes = [
            models.Index(fields=['username']),
            models.Index(fields=['file_name', 'uploaded_at']),
        ]

    def __str__(self):
        return f'{self.username} - {self.url}'

class DataLeak(models.Model):
    email = models.CharField(max_length=255)
    domain = models.CharField(max_length=255)
    password = models.CharField(max_length=255)
    file_name = models.CharField(max_length=255, blank=True)
    uploaded_at = models.DateField(null=True, blank=True)  # blank without null would break saves with an empty date

    class Meta:
        unique_together = ('email', 'domain', 'password')
        indexes = [
            # unique_together above already creates a composite unique index
            # led by 'email', so a separate 'email' index is redundant here.
            models.Index(fields=['domain']),
            models.Index(fields=['uploaded_at']),
        ]

    def __str__(self):
        return f'{self.email}@{self.domain}'

What are you optimizing? What is the target? What is the current state?

I want to optimize query performance. I will be working with very large data sets, on the order of 1 billion rows.

If you’re really going to work with tables containing more than 1,000,000,000 rows, then your biggest constraints are going to come from the database itself. You’ll at least need to be ready to partition these tables, and possibly shard them across multiple servers.
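
For illustration only, declarative range partitioning of, say, the DataLeak table might look roughly like the raw SQL below, wrapped in a Django migration (PostgreSQL syntax; the table names, app label and date ranges are placeholders, and on a partitioned table every unique constraint, including the primary key, must include the partition key). This is a sketch of the concept, not something to copy as-is:

from django.db import migrations

class Migration(migrations.Migration):
    dependencies = [
        ('intel', '0002_previous_migration'),  # placeholder
    ]

    operations = [
        migrations.RunSQL(
            sql="""
                -- Parent table partitioned by upload date.
                CREATE TABLE intel_dataleak_partitioned (
                    id bigserial,
                    email varchar(255) NOT NULL,
                    domain varchar(255) NOT NULL,
                    password varchar(255) NOT NULL,
                    file_name varchar(255) NOT NULL,
                    uploaded_at date NOT NULL,
                    PRIMARY KEY (id, uploaded_at)
                ) PARTITION BY RANGE (uploaded_at);

                -- One partition per year (or per month, quarter, ...).
                CREATE TABLE intel_dataleak_2024
                    PARTITION OF intel_dataleak_partitioned
                    FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');
            """,
            reverse_sql="DROP TABLE intel_dataleak_partitioned;",
        ),
    ]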

Optimizing a database at that scale is not a “cookbook type” solution. It’s not a case of “do this” then “do that” and everything’s fine.

You’ve also got ancillary issues around backups and restores (Disaster Recovery and Business Continuity) that are complicated by data of this size. Even planning a migration raises a bunch of issues.

My suggestion would be to engage the services of a database consultant team, one that knows what they’re doing at that scale.

Thank you. I am working with competent people on the database side. I wanted to ask whether we need any revision in the code itself in terms of performance.