Hi guys, I am having an internal debate on what server to deploy my Django app to. Originally, I was going for Heroku, but the app requires tenants to upload files of somewhat significant size. In that case, Heroku does not seem like the first choice. Would someone have a recommendation on what to use for this specific case?
Thanks!
While I don't have a specific recommendation regarding a server, I will point out that where Django stores uploaded files is a separate setting that you can customize. It's probably not trivial to store files on a separate server, but it appears as if it could be done.
See Upload Handlers and Managing Files for some ideas to get started.
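As a rough sketch of what I mean, and only as an illustration (the model, field, and path here are all made up), the storage location boils down to a setting plus an `upload_to` on the field:

```python
# settings.py -- where uploaded files land; the path is just an example,
# and could point at a mount from a separate file server
MEDIA_ROOT = "/var/uploads"
MEDIA_URL = "/media/"

# models.py -- a hypothetical model with a file upload field
from django.db import models

class TenantUpload(models.Model):
    # upload_to is relative to MEDIA_ROOT; the default backend is FileSystemStorage,
    # and that backend can be swapped for something else entirely
    data_file = models.FileField(upload_to="tenant_data/")
```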
Ken
If I were in your shoes, I would probably consider AWS S3 for the uploaded files, coupled with django-storages. You could create an S3 bucket to store the uploads and django-storages would handle a lot transparently. In this mode, you'd get the deployment ease of Heroku with some capabilities from AWS.
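For what it's worth, the settings involved are fairly small. A minimal sketch (the bucket name and region are placeholders, and this assumes `pip install django-storages boto3`):

```python
# settings.py -- point Django's default file storage at S3 via django-storages
INSTALLED_APPS = [
    # ...
    "storages",
]

DEFAULT_FILE_STORAGE = "storages.backends.s3boto3.S3Boto3Storage"
AWS_STORAGE_BUCKET_NAME = "my-tenant-uploads"   # hypothetical bucket name
AWS_S3_REGION_NAME = "us-east-1"                # placeholder region
# Credentials are typically picked up from the environment:
# AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY
```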
Note that this is just a suggestion and I don't know what your budget permits, so I would recommend running calculations on storage and transfer costs between Heroku and AWS before going down this road.
Thank you for your response. I should have been more precise: the upload automatically replaces the existing data, therefore the schemas can never get too big. The way I got the app to work is a little bit contradictory to some common practices I have seen, meaning that the data gets uploaded, but before it populates the schemas, it goes through an ETL that changes the data considerably. In this regard, I know that Heroku does not support large uploads, therefore I was wondering if you knew of any specific server that would fit this model. So far, I am leaning toward AWS EC2, but it would be nice to have confirmation, you know! Anyway, thank you for your help!
Hey, thank you for your response Matt, I appreciate your YouTube videos, they have been super instructive for my needs; it feels like I am writing to my professor! As I mentioned in my other response, my apologies for not giving enough details about my situation to be hoping for a specific answer! I thought about S3 buckets, but I needed a huge Python ETL to transform my data, and since I am not super experienced, it was simpler to handle the transformation before the data populates the database. (I know this is not common practice, and it probably makes some people cringe,) but I got it to work well and I am now concerned about what server would work well with this kind of model. In other words, a user uploads a fairly big CSV file that gets transformed and then replaces the existing data in that schema. In that regard, would you recommend any server? Thanks again!
Part of the question is, what do you consider huge? Just what size are these upload files you're trying to process?
The reason I'm asking is that if your files are too big, you may have other issues associated with the upload to deal with. (e.g. file upload size limits and timeout issues.)
Four files total: two of them are insignificant, one will be between 500 and 20,000 rows, and the last one I imagine can go up to 1 million rows.
Ok, when you're considering things like this, that's not "huge". I might consider the last one "large", depending upon the size of each row.
Generally speaking, if I can comfortably fit all the data in memory, I don't sweat too much about it. If each row is 100 bytes and you have 1 million rows, that's 100 MB for that file. Yes, you might have to make some settings changes in your web server to allow someone to upload a file of that size, and there might even be a Django setting to adjust as well. But depending upon a couple different factors, I'd be tempted not to write that out at all. Accept the data as being uploaded and then process it in memory before writing it out to your database. (Or, if you're concerned about the process being interrupted and needing to restart, I'd also think about writing it out with each line of the file being a row in a table for later processing - a lot of it depends upon exactly what you need to do with this data and just how intricate the transformations are.)
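To make that a little more concrete, here's a rough sketch of the Django knobs I'm alluding to and an in-memory read of the upload. The sizes and the form field name are purely illustrative:

```python
# settings.py -- values are illustrative, tune them to your file sizes
FILE_UPLOAD_MAX_MEMORY_SIZE = 104_857_600   # keep uploads up to ~100 MB in memory
DATA_UPLOAD_MAX_MEMORY_SIZE = 10_485_760    # limit on non-file request data

# views.py -- read the upload in memory, transform, then write to the database
import csv
import io

from django.http import HttpResponse

def handle_upload(request):
    upload = request.FILES["data_file"]      # hypothetical form field name
    rows = list(csv.reader(io.StringIO(upload.read().decode("utf-8"))))
    # ... run the ETL over `rows`, then bulk-insert into the tenant's schema ...
    return HttpResponse(f"processed {len(rows)} rows")
```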
Sorry, I am a noob. Each file has between 3 and 7 columns. So you think it's no problem? Once the files are done with the ETL and stored in the client's schemas, there is pretty much zero calculation happening, the goal being to offer automated data visualizations from the raw data input. So you think I am good with AWS EC2?
Certainly as a starting point. Some of these types of decisions may come down to budget, billing, and funding as well. (And there are, what, about a dozen different EC2 options for server configurations? You will want to make the right choice there.)
Most any server environment I'm aware of is going to be able to handle this from a capacity perspective. So I guess my point is that capacity shouldn't be driving your decision. Once you've done your development and testing, you'll have a better idea of exactly how well this will run in your testing environment, and then you can make a final decision for deployment.
(But I would encourage you to run as nearly a full-scale test as possible before making that decision. There's always the possibility that you may encounter an issue that can be better addressed before deployment rather than after.)
Thank you for your help, that goes a long way! I don't want to abuse your time, but I have one question… I have done a lot of testing and the app functions well on localhost with various upload sizes, multiple schemas, all the login stuff and all. What do you mean by "run as nearly a full-scale test as possible"?
You mentioned that you have one file that may have 1 million rows. Have you tested your app with a 1-million-row upload? That's what I mean by full scale. Sometimes, things that work well in the small to medium size range end up not working well at full scale. So if you haven't tested with the largest file that you expect to receive, I'd suggest you do so, to have a feel for how long it's going to run and what that might end up doing to the rest of the environment while it's running.
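(If you don't have a real file that size on hand, it's easy enough to fabricate one. Something along these lines would do; the column layout here is made up, so match it to your real files:

```python
# make_test_csv.py -- generate a full-scale test upload of 1 million rows
import csv
import random

with open("full_scale_test.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "category", "value"])
    for i in range(1_000_000):
        writer.writerow([i, random.choice("ABCD"), round(random.random() * 100, 2)])
```

)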
For example, since my development environment is generally larger than most of the servers I deploy to, I create local virtual machines using either VirtualBox or VMware that are typically of the scale of those servers. (If you're working with Docker images, I know you can do it there as well - I just haven't used it that way.) I'll then run my full application within that environment to try to get a feel as to how well it's going to perform in the real environment. No, it's not perfect, and there are sometimes some real surprises, but it does tend to work out well enough for a first shot.
We also try to have at least one test environment that is a true replica of our production environment, which we call our "staging" server; that's our final test platform before going live.
In the case of an AWS EC2 server, it's easy enough to spin one up of the size you're expecting to use just to see how it's going to perform.
Thank you so much! I have run tests with uploads of up to 5 million rows, and it still works super well; it just takes about 1.5 minutes to load up. I'll set the app up in a virtual machine as a final test. Thank you for all your help!
Something you might have to watch out for is how long your platform of choice will go before a request will time out. It sounds like you're planning to do all the processing during the view request. In the worst-case scenario that you presented, that's 90 seconds. I think that Heroku times out at 30 seconds.
If you want to run in this way and don't want to use a more complicated background worker setup (and I wouldn't blame you, since that brings in a lot of extra complexity), then that might dictate your choice of operating environment. You would probably need a destination where you can control the timeouts of your web server, and that starts to push you towards virtual machines like EC2 where you have full control.
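For instance, if you happened to deploy under Gunicorn (just one common choice, not the only one), the worker timeout is a single line in its config file; the default of 30 seconds would kill your 90-second view:

```python
# gunicorn.conf.py -- assuming a Gunicorn deployment; values are illustrative
bind = "0.0.0.0:8000"
workers = 2            # size to your instance
timeout = 180          # seconds before a silent worker is killed; give the ETL headroom
```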
Considering virtual machines, I've had good luck with DigitalOcean. In my experience, a DO droplet is quick to set up. The downside of this approach is that you're suddenly maintaining all your own infrastructure. That's quite an extra chunk of work to handle.
Thank you very much for your advice, I think that EC2 seems to be a good option.