Discussion on Configurable Content Type Parsing Project

Hi everyone, I want to contribute to this project as part of GSOC 2023, I’ve created this discussion to get some more details about the project and get going in the right direction!

  1. I went through the last commit done by @carltongibson here: Refs #21442 – Added content-type aware request.data property. · django/django@d1ba27d (github.com)
    Is the request.POST property which was widely used earlier for data extraction from a request object, completely removed?

  2. Is request.POST replaced by request.data and will it serve all the purpose?

  3. What exactly are the acceptable ‘content-types’ for the middleware parsers that can return a ‘True’ boolean value?

Hi @udbhavsomani — thanks for looking at this.

Let me respond to your points:

On 1. and 2. request.POST will become an alias for the new request.data, so that existing code will continue to work. It will be scoped for only POST requests, and only form data encoded parsing (maybe) — so that its behaviour is unchanged.

On 3. The idea is any content type a user wants to provide a parser for. Django will provide form data and JSON parsing, but others that might be used could be MessagePack, Protobuf, XML, yaml, … — it goes on. Also there are specialisations, like application/vnd.github+json and so on. All Django needs to provide is the API for a parser — likely a can_accept() and parse() :thinking:. The idea would be to have a list and use the first parser that said yes to can_accept — allowing a specialist parser a first shot, before falling back to a more generic parser (for the +json type examples).

I want to pick up my WIP PR shortly, and aim for it to be complete before the start of the GSoC season — help there could include test cases for the (single and multipart) JSON parsing, which is next on my list. A GSoC project would be based around generalising the Parser bit, which I’m not aiming for in the first phase.

How’s that?

Kind Regards,

Carlton

Hi @carltongibson - Thank you for such a fast and informative response.
I have the gist on what we are trying to achieve here.

I see that you have already made a few tests here (Refs #21442 -- Added content-type aware request.data property. · django/django@d1ba27d · GitHub)

If it is okay by you, can I try writing these tests which will also help me to get a better understanding on this?

1 Like

Super! If you’d like to join in, I’m happy to have the help. :blush:

First of is straight application/json parsing. (So a request.data gives parsed Python object from a JSON body.)

Then it’s multipart examples with JSON parts.

With those bootstrapped we can look at more examples. If you wanted to make a PR against my PR branch (if that makes sense :exploding_head:) that would be awesome. We can discuss details there.

1 Like

Hey @carltongibson!
I set up the repo locally on my machine and ran the tests.py. I tried writing a test function for JSON acceptance, but it seems like request.data returns an empty querydict when content-type is changed to application/json. The data is there in the request.body from where I can load and match it.
Here is the test function that I tried.

    def test_json_data(self):
        """
        JSON from various methods.
        """
        for method in ["GET", "POST", "PUT", "DELETE"]:
            with self.subTest(method=method):
                payload = FakePayload("{\"key\": \"value\"}")
                request = WSGIRequest(
                    {
                        "REQUEST_METHOD": method,
                        "CONTENT_LENGTH": len(payload),
                        "CONTENT_TYPE": "application/json",
                        "wsgi.input": payload,
                    }
                )
                print(request.data) # <QueryDict: {}>
                self.assertEqual(json.loads(request.body), {"key": "value"}) # OK

Any help on what is happening and what am I missing out on?

Yes, that looks right, and you’ve discovered the current behaviour.

The task then is to make that pass.

If you look in django/http/request.py for the HttpRequest._load_post_and_files() method, you’ll see a branches where different content types are handled. Current behaviour there is to drop out the bottom, in the else branch. This is the bit that we’ll eventually make pluggable, but, for now we can add a branch for application/json.

Hopefully that makes sense?

Oh, I see the code there.
I’ll try adding an elif block for handling the application/json content_type to populate the request.data object.

I was actually able to parse JSON and pass the test case successfully.
I went ahead and created a PR to your WIP branch (:see_no_evil:).
Request you to kindly review it here: [4.2] Added JSON parsing for request.data property by udbhavsomani · Pull Request #16570 · django/django (github.com)

1 Like

Good work! I’ll review properly in the week.

Next step would be the same for multipart requests.

1 Like

Thank you!
Sure! I’ll try to make progress on the multipart requests containing JSON.

I had a few blockers, your help here would be highly appreciated.

  1. I tried this:
def test_POST_multipart_form_json_data(self):
        payload = FakePayload(
            "\r\n".join(
                [
                    "--boundary",
                    'Content-Disposition: form-data; name="json_data"',
                    "Content-Type: application/json",
                    "",
                    "{\"key\": \"value\"}",
                    "--boundary--",
                ]
            )
        )
        request = WSGIRequest(
            {
                "REQUEST_METHOD": "POST",
                "CONTENT_TYPE": "multipart/form-data; boundary=boundary",
                "CONTENT_LENGTH": len(payload),
                "wsgi.input": payload,
            }
        )
        print(request.data) # <QueryDict: {'json_data': ['{"key": "value"}']}>
        self.assertEqual(request.data, {'json_data': ['{"key": "value"}']}) # OK

This test case passes with the given output. Is it correct though? As in, is this the output that should be acceptable?

  1. There is this function:
def test_body_after_POST_multipart_form_data(self):
        """
        Reading body after parsing multipart/form-data is not allowed
        """
        # Because multipart is used for large amounts of data i.e. file uploads,
        # we don't want the data held in memory twice, and we don't want to
        # silence the error by setting body = '' either.
        payload = FakePayload(
            "\r\n".join(
                [
                    "--boundary",
                    'Content-Disposition: form-data; name="name"',
                    "",
                    "value",
                    "--boundary--",
                ]
            )
        )
        request = WSGIRequest(
            {
                "REQUEST_METHOD": "POST",
                "CONTENT_TYPE": "multipart/form-data; boundary=boundary",
                "CONTENT_LENGTH": len(payload),
                "wsgi.input": payload,
            }
        )
        self.assertEqual(request.POST, {"name": ["value"]})
        with self.assertRaises(RawPostDataException):
            request.body

Please help me understand what this test really does?

It doesn’t look like the JSON is parsed here. You have a string no? '{"key": "value"}'.

I need to have a think about MultiDict’s get() vs get_list() but — I’d expect a (list of) dict(s) in the json_data key…

assert request.data.get("json_data") == {"key": "value"}

See where it’s raised in body:

    # in django/http/request.py
    def body(self):
        if not hasattr(self, "_body"):
            if self._read_started:
                raise RawPostDataException(
                    "You cannot access body after reading from request's data stream"
                )
         ...

Since multipart parser reads from request._stream (via read()) — this enforces (basically) that you either use the parsed POST method or you handle this yourself via body but not really both.

Oh, right, I’ll have to parse and return a json value.

Got it. Will do the required changes.

Ahan, I see the handling of redundant parsing here. Thank you!

Hi @carltongibson
I was working on the multipart/form-data with json parts and I am kind of stuck at the multiparser bit.
I was looking at the code in django/http/multipartparser.py for the parse function.
Under this loop (which, imo, runs for every part of the multipart request):
for item_type, meta_data, field_stream in Parser(stream, self._boundary):

I tried to add this code for json parsing, but got stuck on how to access the data from the stream:

try:
    sub_content_type = meta_data["content-type"][0]
    sub_content_type = sub_content_type.strip()
    if sub_content_type == 'application/json':
        pass # access data by using the exhaust() method maybe?
    except (KeyError, IndexError, AttributeError):
        continue

Is this the correct approach?

And for the gsoc configurable type parsing project, we would require a similar, new class with multiple parsers and use this multipartparser as the default one?

Yes, something along these lines. Once we know the part is JSON we need to get all of its data in order to parse.

For single part parsing we just use the list of parsers. Then there’s a multipart parser that uses the same list for each part (or so I imagine… but it’s not settled yet: part of the project is determining what’s needed.)

1 Like

Hi @carltongibson!
I have made changes to my PR here: Refs #21442 – Added JSON parsing for request.data property. by udbhavsomani · Pull Request #16570 · django/django (github.com)

It would be really kind of you to please review it. I have made the test cases for multipart+JSON type request objects as well as modified the existing parser bit which now returns (list of) dict(s) for the given JSON key.

Thank you :slight_smile:

OK, great. I shall take another look next week now. Good effort!

The other part of this project is bringing in the effort to modernise the request object. (We want to do both in the scope of the same release, 5.0) That should be a case of updating @adamchainz’ original PR. If you wanted to browse that and have a think about it, that would be worth your time.

Check-out the django-upgrade project. Creating updaters to help people migrate their code would be a nice touch, and interesting (from the Python ast point of view).

1 Like

Yes, will go through both the things next week. Thank you for these references!

Hi @carltongibson!
I went through a bunch of threads (github + mailing list + tracker). I am in the process of drafting my GSoC proposal for the same.
Just wanted your inputs for the same:

  • Adding the request.query_params property and refining the request object further.
  • Create a new parser class that will check for the content-type of the request object and try to return parsed python object(s).
    • This can be activated by adding it as a middleware? can be there by default as well.
  • A fallback to the default parser if this middleware class does not return any compatible parser.
  • Adding this migration to the django-upgrade project

These are the things I’m currently looking to get done during the project term. Please add on to anything that I might have missed.