Weekend project update: Open SEC Data

Here’s an early look at a project I have been working on to practice some Django and Vue.js concepts: Open SEC Data.

This project uses Django, DRF and Celery to read public SEC filings from sec.gov and build them into an API that is consumed by a Vue.js application. I’m currently focused on 13F filings, which large US investment funds managing over $100 million USD are required to file. The data is published quarterly and dates back to 1993.
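
As a rough illustration of the kind of parsing involved (a hypothetical sketch, not the project’s actual code): EDGAR publishes quarterly master.idx files whose body is pipe-delimited rows of CIK|Company Name|Form Type|Date Filed|Filename, and the 13F rows can be filtered out like this:

```python
# Hypothetical sketch: filter 13F-HR rows out of an EDGAR master.idx body.
# The sample data below is made up; only the pipe-delimited format is real.
from typing import Dict, List

def parse_13f_rows(idx_text: str) -> List[Dict[str, str]]:
    """Return 13F-HR rows from the pipe-delimited body of a master.idx file."""
    rows = []
    for line in idx_text.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) != 5:
            continue  # skip separator lines and anything malformed
        cik, company, form_type, date_filed, filename = parts
        if form_type.startswith("13F-HR"):
            rows.append({
                "cik": cik,
                "company": company,
                "form_type": form_type,
                "date_filed": date_filed,
                "filename": filename,
            })
    return rows

sample = """CIK|Company Name|Form Type|Date Filed|Filename
--------------------------------------------------------
1000097|EXAMPLE FUND LP|13F-HR|2020-11-16|edgar/data/1000097/0001.txt
320193|EXAMPLE CORP|10-K|2020-10-30|edgar/data/320193/0002.txt"""

print(parse_13f_rows(sample))
```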

Here are some of the things I’m focusing on in this project in no particular order:

  • Getting better at Django REST Framework. This project has been helping me apply some of the parts of DRF that I have found difficult. I’m currently using ViewSets, which feel like function-based views inside of class-based views. They are flexible, but I would like to add more abstraction with filtering.

  • Django admin. While this project primarily uses Django as a REST API with Django REST Framework, I have tried to take advantage of the Django admin to build out helpful views that can be used to spot check the data I’m creating. Most of my API is read-only, which makes things pretty simple.

  • Moderately complex paginated data tables with Vue. I work with lots of paginated table data, and I think there is a better way to abstract some of the repeated logic that I use (getting and setting current page, rows per page). I’m using Vuex, and I have heard of module factories, but I’m thinking that there will be a better way to do this when Vue 3 officially comes to Quasar Framework (Quasar is a Vue.js framework).

  • Session authentication with DRF. There are a lot of guides showing how to use JWT and Token Authentication for DRF with JavaScript frontends. The DRF documentation recommends using Session Authentication for use cases such as a web-based JavaScript client, so I hope I can promote some best practices around how to use Django’s built-in session authentication with the Django REST Framework using an HttpOnly session cookie. I also understand that all security decisions have trade-offs, and I’m trying to understand what trade-offs come with handling authentication in this way.
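
As a minimal sketch of what this could look like (the setting names are standard Django/DRF, but the exact values here are my assumptions, not settled best practice):

```python
# Hedged sketch of Django settings for HttpOnly session-cookie auth with DRF.
# The session cookie is HttpOnly by default; these settings make the intent explicit.
SESSION_COOKIE_HTTPONLY = True   # JavaScript cannot read the session cookie
SESSION_COOKIE_SECURE = True     # only send the cookie over HTTPS
SESSION_COOKIE_SAMESITE = "Lax"  # basic CSRF mitigation; CSRF middleware still applies
CSRF_COOKIE_SECURE = True

REST_FRAMEWORK = {
    "DEFAULT_AUTHENTICATION_CLASSES": [
        # session auth for the web UI, token auth for the public API
        "rest_framework.authentication.SessionAuthentication",
        "rest_framework.authentication.TokenAuthentication",
    ],
}
```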

  • Social authentication. I have previously set up social authentication with Google, Facebook and GitHub using Python Social Auth. I think it is a great package, and it adds a lot of flexibility with its concept of pipelines, but I haven’t done much with these yet, so I’m hoping to dig in further and better understand how I can make better use of social authentication in my app. This app uses LinkedIn OAuth2 with a custom user model. Logging in with a LinkedIn account gives you the ability to request an API Token (Django REST Framework’s Token) to access the public API.

  • Automatic API documentation with OpenAPI. Swagger/OpenAPI seems like a nice way to document an API, so I’m hoping to build best practices around how to document a DRF API automatically with OpenAPI and Swagger UI.

  • CI/CD with GitLab and docker swarm. I will admit that I am a huge GitLab fan. I love how flexible their CI/CD pipelines are. Being a docker fan as well, I chose to use docker swarm for this project to keep things simple and straightforward. I think one under-appreciated feature of docker is being able to set DOCKER_HOST to an SSH connection, such as ssh://root@123.456.789.10. This lets you control the remote docker host without needing to SSH to it first, and it is also how I’m able to deploy and run management commands “manually” through the GitLab UI.
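
A deploy job along these lines shows the idea (a hypothetical .gitlab-ci.yml fragment; the job name and stack name are placeholders, and it assumes the runner has an SSH key authorized on the droplet):

```yaml
deploy:
  stage: deploy
  variables:
    # point the docker CLI at the remote swarm manager over SSH
    DOCKER_HOST: "ssh://root@123.456.789.10"
  script:
    - docker stack deploy --with-registry-auth -c stack.yml opensecdata
  when: manual  # trigger "manually" from the GitLab UI
```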

  • Productive development environment. To start the project, you only need to run docker-compose up (after copying .env.template to .env in the root directory for storing sensitive data outside of git such as LinkedIn OAuth2 keys). The development environment is very similar to how this project runs in production, with some additional utilities for monitoring and debugging such as pgadmin4, flower (for celery), redis commander (a GUI for viewing redis databases), Django debug toolbar (a must-have for any Django project, I believe), runserver_plus with Werkzeug, and others. Also, the backend and frontend hot reload automatically with the help of webpack for Vue and watchdog for Django and Celery.

  • Automatic TLS certificate generation with Traefik. For a simple project in docker swarm, I’m really happy with how simple it is to request TLS certificates from Let’s Encrypt automatically with Traefik. There are no scripts, cron jobs or one-time setup jobs, it just seems to work out of the box if configured correctly.
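
The “configured correctly” part mostly comes down to a few Traefik flags (a hypothetical compose/stack fragment; the email address and volume name are placeholders):

```yaml
services:
  traefik:
    image: traefik:v2.4
    command:
      - "--providers.docker.swarmMode=true"
      - "--entrypoints.web.address=:80"
      - "--entrypoints.websecure.address=:443"
      # redirect all HTTP traffic to HTTPS
      - "--entrypoints.web.http.redirections.entrypoint.to=websecure"
      # Let's Encrypt certificate resolver; acme.json persists issued certs
      - "--certificatesresolvers.letsencrypt.acme.email=admin@example.com"
      - "--certificatesresolvers.letsencrypt.acme.storage=/letsencrypt/acme.json"
      - "--certificatesresolvers.letsencrypt.acme.tlschallenge=true"
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - letsencrypt:/letsencrypt
      - /var/run/docker.sock:/var/run/docker.sock
volumes:
  letsencrypt:
```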

  • Testing with pytest. So far I have mostly been testing my API views. I really like using factories with pytest, so I use them in most of my tests.

That’s all I have for now. I have a long list of questions, things I want to improve, add and experiment with, here are just a few that come to mind:

  • Frontend testing. I don’t have any component tests or e2e tests, so this would be good to add eventually. Since I’m using a component library and my app uses these components directly, I’m not exactly sure how much testing I should be doing.

  • Data verification/validation. There are a lot of sites that provide similar data; WhaleWisdom is the biggest one that I know of. Once I get more data built into the site, it would be good to spot check some of the values. There are some nuances to the filing data that I haven’t addressed, such as amendment filings and additions.

  • Calculating period changes. One of the features that I’m not sure how best to implement is the ability to sort holdings for a filer in a given period on the percent increase from the last period. One way would be to add these as additional fields to the Holding model and then calculate these values as I process the data in celery. If I process recent periods before earlier ones, I will have to update these values once the earlier period has been processed, so it would be an additional check to do. I’ll probably post this question here in more detail later. Here’s an example of what this means from WhaleWisdom.
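
A minimal sketch of the calculation itself (the names here are made up for illustration; in the real app these values would presumably live on the Holding model):

```python
# Hypothetical sketch: period-over-period percent change for holdings.
# Holdings are represented as CUSIP -> market value mappings per period.
def percent_change(current_value, previous_value):
    """Percent increase from the previous period; None for new positions."""
    if previous_value is None or previous_value == 0:
        return None
    return (current_value - previous_value) / previous_value * 100

def annotate_changes(current_period, previous_period):
    """Attach a percent change to every holding in the current period."""
    return {
        cusip: percent_change(value, previous_period.get(cusip))
        for cusip, value in current_period.items()
    }

q3 = {"037833100": 1200, "594918104": 500}
q4 = {"037833100": 1500, "88160R101": 300}  # 88160R101 is a new position
print(annotate_changes(q4, q3))
```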

  • Accessing LinkedIn profile data to populate fields on my CustomUser model.

  • Scaling? I have a lot more experience with deploying projects to AWS, which is built around the ability to scale. I don’t know how a project on DigitalOcean would be scaled automatically. A single-node docker swarm cluster will take some time to process all of the data. I would probably be better off scaling vertically with much bigger droplets and higher celery concurrency.

  • Docker swarm secrets. I’m currently using environment variables to pass secrets stored in GitLab CI when I build images and deploy to docker swarm. I would like to learn how to properly use swarm secrets and work them into my CI/CD pipeline.
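
From what I understand so far, the swarm side of this looks roughly like the following (a hypothetical stack.yml fragment; the secret and image names are placeholders, and the secret would be created once with docker secret create):

```yaml
services:
  backend:
    image: registry.gitlab.com/example/backend:latest
    secrets:
      - django_secret_key
    # the secret is mounted as a file at /run/secrets/django_secret_key,
    # so the app reads it from that path instead of an environment variable
secrets:
  django_secret_key:
    external: true  # created out-of-band, e.g. `docker secret create django_secret_key -`
```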

  • As I mentioned above, I’m also interested in updating this project to Vue 3 and applying some of its new features.

  • Use pipenv, poetry or some other way of pinning secondary Python dependencies. Does anyone have a recommendation on how best to do this with Docker? I have always thought of Docker as the virtual environment, but I realize that some versions of indirect dependencies may change when pip installing without using a lockfile similar to package-lock.json.
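
One option I’m considering is pip-tools, where a hand-written requirements.in is compiled to a fully pinned requirements.txt on the host and the image only ever installs the lockfile (a hypothetical sketch; paths and versions are placeholders):

```dockerfile
# On the host (or in CI), regenerate the lockfile whenever requirements.in changes:
#   pip install pip-tools
#   pip-compile requirements.in   # writes a fully pinned requirements.txt
FROM python:3.8-slim
WORKDIR /app
COPY requirements.txt .
# installing only from the pinned lockfile keeps image builds reproducible
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
```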

Edit: Sign in with LinkedIn isn’t working; I’ll try to fix it later.

Here’s a diagram showing the different components of the application that I have so far:

Architecture overview


Application Entrypoints

The application can be accessed either through a web UI (A) or programmatically through a public API (B).

A. Requests coming from the web UI are authenticated using Django Session authentication (HttpOnly cookies).

B. API requests can be made with an API token (from Django Token Authentication).

Main Application Architecture

  1. The Traefik container exposes ports 443 and 80 for HTTPS and HTTP traffic. HTTP requests are redirected to HTTPS requests. Traefik uses Let’s Encrypt to issue TLS certificates that allow connections to the application to use HTTPS.

  2. All traffic from Traefik is routed to NGINX which handles traffic based on the URL path. /api/* and /admin/* requests are handled by Django/gunicorn, /flower/* requests are handled by the flower service and all other requests will be served by the Vue/Quasar application.
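
The NGINX side of this routing could look something like the following (a hypothetical config fragment; the upstream names, ports and paths are placeholders):

```nginx
upstream backend { server backend:8000; }   # Django/gunicorn
upstream flower  { server flower:5555; }    # celery monitoring

server {
    listen 80;

    location /api/    { proxy_pass http://backend; }
    location /admin/  { proxy_pass http://backend; }
    location /flower/ { proxy_pass http://flower; }

    # everything else is the Vue/Quasar SPA built into the image
    location / {
        root /usr/share/nginx/html;
        try_files $uri $uri/ /index.html;  # SPA history-mode fallback
    }
}
```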

  3. Quasar Framework SPA/PWA makes API requests using Axios.

  4. The main Django application handles both requests for the main API and the Django admin. The main API is built with Django REST Framework.

  5. The API pulls data from a PostgreSQL database that runs in a container.

  6. The main celery worker has the same image as the Django web application. It handles asynchronous background tasks such as pulling filing data from SEC.gov and adding database records from the processed filing data.

  7. Redis serves both as a caching layer for large/expensive requests and as the broker for celery. New tasks are stored in Redis, and celery (6) continually monitors Redis for new tasks that it can work on.
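
This dual role can be expressed in Django settings as two logical Redis databases (a hypothetical sketch; it assumes the django-redis package for the cache backend, and that the hostname matches the Redis service name in the stack):

```python
# One Redis instance, two logical databases: celery broker (0) and cache (1).
REDIS_URL = "redis://redis:6379"

CELERY_BROKER_URL = REDIS_URL + "/0"

CACHES = {
    "default": {
        "BACKEND": "django_redis.cache.RedisCache",  # assumes django-redis is installed
        "LOCATION": REDIS_URL + "/1",
    }
}
```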

  8. Celery beat is in charge of scheduling tasks to be executed by celery. For example, once per day celery may query SEC.gov for a new filing.
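
The once-per-day example could be expressed as a beat schedule entry like this (a hypothetical sketch; the task path is made up):

```python
from datetime import timedelta

# celery beat reads this mapping and enqueues the task on each interval
CELERY_BEAT_SCHEDULE = {
    "check-for-new-filings": {
        "task": "filings.tasks.check_for_new_filings",  # hypothetical task path
        "schedule": timedelta(days=1),
    },
}
```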

  9. Flower is a celery monitoring utility that can be used to view the currently executing tasks as well as completed and failed tasks.

  10. REX-Ray is a docker volume plugin that automatically provisions DigitalOcean block storage devices to be used in our application. There are four volumes provisioned in this application to persist data in various places: Let’s Encrypt Certificates, Redis data, Django static/media files and Postgres data. The Django static/media volume is shared between the NGINX container, the Django container and the celery container.

  11. The main architecture runs on a single DigitalOcean droplet that uses a Docker machine image that runs in swarm mode.

  12. A DigitalOcean account is required for this architecture. The minimum cost of this architecture is between $5 and $6/month.

  13. The sample application that uses this architecture (linked below) accesses public data from SEC.gov.

CI/CD Pipeline with GitLab

X. A developer creates and pushes a git tag.

Y. Pushing this tag triggers a GitLab CI pipeline (defined in .gitlab-ci.yml) which builds the two main images used in our application (images for the frontend and backend) and then pushes these images to the private GitLab registry that comes with all GitLab projects.

Z. The deployment command docker stack deploy references the stack.yml file which defines all of the services, networks and volumes used in the main architecture.
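
Trimmed down, the stack.yml referenced by that command might look like this (a hypothetical sketch; the service, network and volume names are placeholders):

```yaml
version: "3.7"
services:
  traefik:
    image: traefik:v2.4
    networks: [traefik-public]
  nginx:
    image: registry.gitlab.com/example/frontend:${CI_COMMIT_TAG}
    networks: [traefik-public, main]
  backend:
    image: registry.gitlab.com/example/backend:${CI_COMMIT_TAG}
    networks: [main]
  postgres:
    image: postgres:12
    volumes:
      - postgres-data:/var/lib/postgresql/data
    networks: [main]
networks:
  traefik-public:
  main:
volumes:
  postgres-data:
```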

Great diagram, @briancaffey!

Can you explain why you have Traefik in front of Nginx? To me, that seems like two load balancers in sequence. Is it so you don’t have to do TLS termination at Nginx? Was it easier to set up Let’s Encrypt with Traefik than Nginx?

Thanks @mblayman! This is a good question, it probably is doing double load balancing as you have mentioned. I haven’t set up TLS termination with NGINX before so I can’t say if it is easier, but I have been able to consistently make things work with Traefik’s Let’s Encrypt integration for requesting and renewing certificates. I posted on the Traefik subreddit with your question about double load balancing and some other questions I had about it: https://www.reddit.com/r/Traefik/comments/k71mpv/questions_about_my_traefik_use_case/.

Here’s what I mentioned in this post:

The image here shows how I’m using Traefik in a Django application running in a docker swarm cluster in DigitalOcean. I’m wondering if this use case makes sense, or if things should be reorganized.

Traefik is the main entrypoint to my application and listens on ports 80 and 443, redirecting HTTP traffic to HTTPS and handling TLS. All Traefik traffic is forwarded to the NGINX container which does path-based routing to either:

  • my Django API or the Django admin (/api/* or /admin/*)
  • serve static content for a Vue.js Single Page Application that is built into the NGINX image I use in production
  • direct traffic to other monitoring services for my application (flower for monitoring celery tasks)

I understand that Traefik does not serve static content like NGINX does, so I think I will need NGINX or at least a simple file server to do this in production.

Is using Traefik and NGINX in this way doing “double load balancing” or making any additional “hops” within my docker networks for traffic going to the API that I should eliminate?

Could I label the containers running the Django application (API) so that they can send traffic to those containers directly without going through NGINX? I think this might be possible, but it might require more “labeling” for each of the services I’m routing to from Traefik.
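
From my reading of the Traefik docs, this would mean giving the Django service its own router labels instead of sending everything through NGINX (a hypothetical fragment; in swarm mode the labels go under deploy, and the hostname, router name and port are placeholders):

```yaml
services:
  backend:
    deploy:
      labels:
        - "traefik.enable=true"
        - "traefik.http.routers.api.rule=Host(`example.com`) && PathPrefix(`/api`)"
        - "traefik.http.routers.api.entrypoints=websecure"
        - "traefik.http.routers.api.tls.certresolver=letsencrypt"
        - "traefik.http.services.api.loadbalancer.server.port=8000"
```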

I find that using Traefik to manage certificates with Let’s Encrypt is very convenient. If I were to handle TLS certificates in NGINX, I probably wouldn’t need Traefik at all; I would simply expose ports 80 and 443 on the NGINX container and have it do TLS termination as well as the reverse proxying and static file serving that it currently does in my application.

I think one other possible advantage of using Traefik is that I could easily set up different stacks in the same cluster and have Traefik handle the routing for those as well. I’m not using Traefik in this way; it is currently defined in the same stack file as my application. I can see that it might make sense to split out the Traefik service to another stack, but I don’t have plans for anything beyond running a simple application stack on a single-node swarm cluster, so I think I’m fine to keep everything in the same file.

My last question is about networking. Traefik and NGINX run on the same traefik-public docker network, and NGINX and the rest of my application run on a main docker network. Is this type of network isolation important in this context? Would there be any vulnerability in the Django application container sharing a network with the traefik-public network?

I’ll share any big takeaways from what the Traefik experts say.