Opinion on ticket #35342 IntegerField clean() behaviour

So, as written on ticket #35342. During the full_clean() routine, when an IntegerField is being cleaned, the field’s to_python() method is called; if the value is a float, it is simply converted to an integer using {{{int(value)}}}. As is characteristic of Python’s native int() function, any float is converted to int by discarding the digits to the right of the decimal point, allowing examples like this to happen:

{{{
exModel = ModelWithIntegerField(value=1.2)

exModel.value           # 1.2
exModel.clean_fields()  # no validation error raised
exModel.value           # 1
}}}

I don’t know if this behaviour stems from ticket #24229, but I believe there are better options than allowing any number through IntegerField validation simply because it’s a number, regardless of how that affects the stored value.

This leads to counter-intuitive behaviour that may silently distort values being written to the database, and makes avoiding that distortion awkward.

My suggestion to improve this is:
1. First check whether the value is a float, then verify that the fractional part is zero ({{{value % 1 == 0}}}). If it is, the conversion is safe; if not, raise a validation error, as the conversion would distort the original value.

This fix only stops values that would be distorted and lose information from passing clean(); it will continue to accept strings like “1” and “1.0”, and floats like 1.0.
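As a rough illustration of the proposal (plain Python, not actual Django code; the function name is made up), the stricter conversion could look like:

```python
def clean_int_strict(value):
    """Hypothetical stricter version of IntegerField's conversion:
    floats (and float-like strings) pass only when their fractional
    part is zero, so nothing is silently discarded."""
    if isinstance(value, str):
        # "1" and "1.0" both parse; "abc" raises ValueError as before.
        value = float(value)
    if isinstance(value, float):
        if value % 1 != 0:
            raise ValueError(f"{value!r} would be truncated by int()")
        return int(value)
    return int(value)

print(clean_int_strict("1.0"))  # 1
print(clean_int_strict(2.0))    # 2
# clean_int_strict(1.2) would raise ValueError
```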

Another way to change this is to add an option to toggle this behaviour on and off.

Anyway, I would love to see what you, the community, think about this.

This is actually how most relational databases work.

PostgreSQL does not throw an error when you try to assign a float to an integer field. (Chapter 10 in the PostgreSQL docs covers the implicit type conversions that can occur.)

From what I can see in the docs (though I haven’t explicitly tested it), MySQL works the same way. (It looks like it’s addressed in Chapter 15 of those docs.)

So, I think it would be a mistake to have the ORM generate an error for an operation that is acceptable and has well-defined behavior. This wouldn’t be counter-intuitive to anyone familiar with the underlying databases; in fact, the opposite would pose issues for those who are aware of this and expect the “normal” behavior.

What you said makes sense and you are correct about the implicit type conversions on the databases, but I have a few points.

First, even if those conversions are acceptable, that doesn’t mean they are always desirable. Since we are dealing with input rather than operating directly inside the database, we could at least have a built-in option in Django to choose the desired behaviour; it is a simple feature that would be very convenient for data consistency.

Second, although this is expected behaviour when dealing with databases directly, it would still be counter-intuitive for people coming from other ORMs with stricter typing, like SQLAlchemy.

Speaking here as just a “Django User” (I am not one of the fellows or a contributor to the project itself), I have seen it expressed in multiple places (tickets, the mailing list, etc.) that additional settings or options are avoided when there are existing ways to handle something. (I’m paraphrasing, but I think you get the idea.)

You already have this option. I can think of at least three different ways it can be done.

  • Check it at “point of entry” (either in the form or the serializer). This is my preferred place of validation, because it gives you the most control over how you handle the error from the sender’s perspective. In my opinion, waiting until you’re going to try and save the model is too late in the process to be truly effective.

  • Use a validator on the model field

  • Create a custom field subclass (e.g., CheckedIntegerField) that overrides one of:

    • validate
    • to_python
    • clean

There are probably other ways as well. Your choice among them will likely depend upon how many of your integer fields you want handled in this non-standard manner.
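For instance, the validator approach from the list above can be sketched in a few lines. Everything here except the truncation check itself is hypothetical naming, and the commented Django wiring is an untested sketch:

```python
def validate_exact_integer(value):
    """Reject floats that would lose their fractional part under int().
    (Hypothetical validator name; the check itself is plain Python.)"""
    if isinstance(value, float) and value % 1 != 0:
        # In Django this would raise django.core.exceptions.ValidationError;
        # ValueError stands in so the sketch runs without Django installed.
        raise ValueError(f"{value!r} is not an exact integer")

# Attaching it to a model field would look roughly like:
#     value = models.IntegerField(validators=[validate_exact_integer])

validate_exact_integer(3)    # passes
validate_exact_integer(4.0)  # passes
# validate_exact_integer(1.2) would raise
```

One caveat worth noting: on a model field, validators run after to_python(), so by that point the float has already been converted to an int. The check only catches the problem where it still sees the raw value, e.g. at the form/serializer layer or inside a to_python override on a custom field subclass.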

I find this to be uncompelling:

  • This argument goes both ways. I could make the same argument in reverse if I had to start working with SQLAlchemy again.

  • The ORM is dealing with databases, therefore it makes sense to match the database’s behavior.

  • I’d make the argument that SQLAlchemy is doing things wrong, because it’s adding an unnecessary restriction for something that databases handle quite well. I don’t want my abstraction layers preventing me from using features that the lower layers provide.

    • And, saying this with humor, Django came first. It’s SQLAlchemy that did not follow the pre-existing art.
    • (Any argument made using other, more type-strict languages such as Java carries even less weight with me, as I moved to Python to get away from Java’s overly-strict type system as it existed 20 years ago.)

I was unaware of this behaviour and I do find it surprising, notwithstanding the behaviour of databases. I can’t imagine a scenario where I could want this implicit conversion rather than an error and I therefore think we should default to an error and let anyone who needs the conversion implement it themselves.

I can understand how people may get tripped up by this, but if you pass a float to something that expects an integer, you should expect it to be truncated. This is the same as if you passed a float to int() in Python. I think you’ll also find that PostgreSQL will implicitly accept an integer or float passed to a text column.

It would be backward incompatible to default to an error. If something is done about this, at most it should be a RuntimeWarning, akin to the existing warning about naive datetimes while time zone support is active.

It would be backward incompatible, but we could manage that with a deprecation cycle. I’d be happy with a RuntimeWarning though.

I agree with the general sentiment here. It doesn’t feel like something Field.clean should error out on, given that model-level validation is most likely going to be run behind another layer that should deal with user input anyway.

I think that’s a good take on it, given we have a precedent with naive datetimes. It’s a potential loss of data, but most of the time it’s harmless and even intended.

RuntimeWarning seems appropriate given these can be silenced or elevated to errors with warnings.filterwarnings depending on the project.
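For reference, here is how a project could control such a warning with the standard warnings machinery. This is plain Python; the clean function and its warning message are invented stand-ins for what a warning-emitting field clean() might do:

```python
import warnings

def clean_with_warning(value):
    # Illustrative stand-in for a field clean() that warns on truncation.
    if isinstance(value, float) and value % 1 != 0:
        warnings.warn(
            f"{value!r} will be truncated to {int(value)}", RuntimeWarning
        )
    return int(value)

# A project could silence the warning globally:
warnings.filterwarnings("ignore", category=RuntimeWarning)

# ...or elevate it to a hard error instead:
with warnings.catch_warnings():
    warnings.simplefilter("error", RuntimeWarning)
    try:
        clean_with_warning(1.2)
    except RuntimeWarning as exc:
        print("raised:", exc)
```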

My only concern is that doing it for float opens up a big can of worms with regard to all objects that support __int__. What makes float.__int__ special over the other inputs that int supports, such as "3.14" or unicodedata.numeric('¾')?

Are we going to have to re-implement int internals to determine whether or not we should warn, and at what point does the cost of running these checks become too high to warrant it? What about passing a float to DecimalField that overflows decimal_places?

For these reasons I’m -1 on changing IntegerField.clean.

My only concern is that doing it for float opens up a big can of worms with regard to all objects that support __int__. What makes float.__int__ special over the other inputs that int supports, such as "3.14" or unicodedata.numeric('¾')?

The string case isn’t an issue here:

{{{
Python 3.12.2 (main, Feb 25 2024, 04:38:01) [Clang 17.0.6 ] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> int("3.14")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '3.14'
}}}

I’m not sure what you’re suggesting with the other example.

I missed that int rejects floats represented as strings. That was a bad example.

I’m not sure what you’re suggesting with the other example.

My point is that int doesn’t treat float differently from any other Python object implementing __int__. The reason int(3.14) is not rejected is simply that float.__int__ is implemented.

{{{
>>> float.__int__(0.9)
0
}}}

It’s possible that CPython has a fast implementation of int(float), as it’s a common operation, but I think that would qualify as an implementation detail.

In that case, why should we special-case float over Decimal, unicodedata.numeric, fractions.Fraction, and other objects that might implement __int__ to return their integer part while discarding the remainder?

Why should we only do it for IntegerField and not for DecimalField, which also has a fixed representation of numeric data and could cause data truncation when provided overflowing input?

{{{
>>> decimal.Context(prec=2).create_decimal_from_float(0.2222)
Decimal('0.22')
}}}

In essence this behavior seems similar to inputting datetimes into fields / columns meant to store dates. Is it wrong to have those fields default to storing the date component of such objects, even if that means discarding the time component? Is it wrong to have fields meant to store integers discard the fractional component of numeric objects?
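The analogy can be made concrete in plain Python, without Django:

```python
import datetime

# A datetime stored into a date-only slot keeps the date, drops the time:
dt = datetime.datetime(2024, 5, 17, 13, 45)
print(dt.date())  # 2024-05-17

# A float stored into an integer slot keeps the integer part, drops the rest:
print(int(3.9))   # 3
print(int(-3.9))  # -3  (int() truncates toward zero; it does not floor)
```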

Wow, this discussion got a lot of comments overnight.
So, you are completely right; in fact, when I had to solve this I used the third option you gave: creating a custom field subclass.

About SQLAlchemy, also saying this with humor, I found it very overcomplicated to work with compared to Django, I agree that SQLAlchemy isn’t the best example of an ORM and I missed using the Django ORM when I had to use SQLAlchemy.

As you just discussed, I’d be very happy with a RuntimeWarning as well.

About that, the problem here isn’t with the conversion itself, as in the example you gave where you explicitly convert a float to a Decimal; for anyone doing that, truncating is expected behaviour. The problem also isn’t with __int__; the problem is with clean() for IntegerField.

The point is that clean() is an optional function we call precisely to warn us when something is wrong with our data, and it seems wrong for it not to warn about imminent data loss. Beyond that, why does a function with that purpose rely only on the native int() function?
Also, on the forms side, IntegerField does accept floats represented as strings when they have no significant decimal part, such as "2.0".
So if it accepts string numbers like "2.0" and "2" without complaint, and it also accepts floats by truncating them, why doesn’t it accept string numbers with significant decimals, like "2.4", in the same way?

Also, for the example you gave about DecimalField, there is an option specifically for that, _check_decimal_places_and_max_digits, which makes sure you didn’t input more digits than intended.

As for the Date vs DateTime problem, I’d argue that there is no use case where we would need to validate between Date and DateTime and the data loss is relevant: DateTime is rarely (in my opinion it should never be) entered through manual user writing, but through an interface we build that can limit and specify the information we want.
Whereas with numbers, it’s much more common for the input to be written manually, so there is a much bigger need to validate it; we should at least be able to warn our users about the loss of information and how their intention won’t be represented. It would be very useful to have a built-in way to do that in Django, and even more useful if it could be toggled off for cases where the truncation is intended.

Wrapping up all that I said: from what I just checked across all the fields Django currently has, IntegerField (and its subclasses) is the only field that allows loss of information in relevant use cases and implements no option to prevent it (the way DecimalField has max_digits), and that is all because of how clean() is implemented with just int().

Something else to think about: all this time we’ve been discussing the model IntegerField, but look at how the forms IntegerField has behaved since #24229. Its behaviour is exactly what I’m describing as desirable, since Django parses integers there after stripping trailing zero decimals with the regex {{{\.0*\s*$}}}. Why do forms have this behaviour while models don’t?
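The forms-side behaviour can be sketched without Django: the regex in question strips a trailing all-zero decimal tail before the int() call, so "2.0" parses while "2.4" still fails. (The helper name below is made up; only the regex itself is Django’s.)

```python
import re

# Same pattern django.forms.IntegerField has used since #24229:
re_decimal = re.compile(r"\.0*\s*$")

def form_style_to_int(value):
    """Mimic the forms behaviour: drop an all-zero decimal tail, then int()."""
    return int(re_decimal.sub("", str(value)))

print(form_style_to_int("2"))    # 2
print(form_style_to_int("2.0"))  # 2
# form_style_to_int("2.4") raises ValueError: the tail isn't all zeros
```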

About that, the problem here isn’t with the conversion itself, as in the example you gave where you explicitly convert a float to a Decimal; for anyone doing that, truncating is expected behaviour. The problem also isn’t with __int__; the problem is with clean() for IntegerField.

I think you misunderstood why __int__ is being brought up here. IntegerField.clean(value) calls int(value); when value is a float, that implicitly calls float.__int__. You are arguing that IntegerField.clean should do something when provided a float with a non-zero fractional part.

I’m pointing out that there are both multiple fields that implicitly drop components when provided mixed types (DecimalField.clean(value: float) will drop anything beyond decimal_places; DateField.clean(value: datetime) will drop the time component) and multiple types other than float for which IntegerField.clean drops components (e.g. decimal.Decimal), so special-casing IntegerField.clean(value: float) seems incoherent to me.

Also, for the example you gave about DecimalField, there is an option specifically for that, _check_decimal_places_and_max_digits, which makes sure you didn’t input more digits than intended.

I would invite you to revisit _check_decimal_places_and_max_digits, as it has nothing to do with DecimalField.clean. It’s a method that performs system checks to ensure that max_digits >= decimal_places.

As for the Date vs DateTime problem, I’d argue that there is no use case where we would need to validate between Date and DateTime and the data loss is relevant: DateTime is rarely (in my opinion it should never be) entered through manual user writing, but through an interface we build that can limit and specify the information we want.

I’m not sure I follow your user-input rationale here. What about the myriad ways data is programmatically supplied to Django endpoints nowadays, such as REST endpoints through DRF and other such frameworks?