2024-08-31 22:07:34
www.davidhang.com
The material this blog post was originally developed from was a bunch of slides
used for a skill share presentation I gave at my workplace @ coreplan.io.
I have 3+ years of experience with Django, with it being the main framework that
underpins the backend of CorePlan’s main SaaS product. It is a mature, batteries
included framework that has been around for a while now. One particular powerful
yet dangerous feature of Django is the ORM. This is a Django specific ORM which
cannot be separated from the rest of the framework. The other major python ORM
is SQLAlchemy which can be used with other python web frameworks, but is an
independent tool.
Below are some of the things that I have learned about the Django ORM, how it
compares to raw SQL and gotchas that you should be aware of when using it.
What is an ORM (Object Relational Mapper)?
- Abstraction over SQL to interact with databases
Code -> SQL
Hole.objects.all()
⬇️
SELECT * FROM drilling_hole;
Why use an ORM? – Pros
- Abstraction over SQL, no need to write raw SQL (plus and minus)
- Portability – Can change out database engines easily !?
- Probably not true, often will rely on db specific features e.g. postgres jsonb, triggers, etc
- Direct mapping from db to models
- Automatic schema generation
- Migrations are automatically generated
- Security
- abstracts away enough that sql injection is less likely
Why use an ORM? – Cons
- Abstraction over SQL…
- Hides the underlying SQL
- Can be difficult to debug
- Lazy loading can cause N+1 queries without the developer realising
- Harder to onboard new developers if they haven’t used Django before
- Performance
- Generated sql be slower than crafted SQL
Fundamentals
# drilling/models.py
from django.db import models
class Hole(models.Model):
name = models.TextField()
CREATE TABLE drilling_hole (
id SERIAL PRIMARY KEY,
name VARCHAR(100)
);
Migrations
python manage.py makemigrations # generate migration files
python manage.py migrate # apply migrations
python manage.py drilling --empty # generate empty file for data migration
https://docs.djangoproject.com/en/dev/topics/migrations/
Querying
ActiveRecord
pattern – ala Ruby on Rails style- QuerySets (
Hole.objects.all()
)- lazy
- chainable
- cached when iterated over multiple times !?
- I would not recommend relying on this because it hard to comprehend when it is cached and when it is not when you are reading code
holes_qs = Hole.objects.filter(name="cheese") # not evaluated yet
holes_qs = holes_qs.filter(depth__gt=100) # still not evaluated
list(holes_qs) # evaluated
list(holes_qs) # cached
holes_qs[2] # not cached
holes_qs.first() # not cached
holes_qs.get(id=1) # not cached
WHERE
WHERE
clause ≈filter()
holes_qs = Hole.objects.filter(name="cheese")
⬇️
SELECT *
FROM drilling_hole;
WHERE drilling_hole.name = 'cheese';
WHERE across tables?
- But how do you do a left/inner join? With the ORM it isn’t done declaratively, but implicitly
class Hole(models.Model):
name = models.TextField()
pad = models.ForeignKey(Pad, on_delete=models.CASCADE)
class Pad(models.Model):
name = models.TextField()
holes_qs = Hole.objects.filter(pad__name="cheese board")
⬇️
SELECT *
FROM drilling_hole;
INNER JOIN drilling_pad ON drilling_hole.pad_id = drilling_pad.id
WHERE drilling_pad.name = 'cheese board';
WHERE other conditionals
filter(name="cheese")
->filter(name__exact="cheese")
->WHERE name="cheese"
filter(name__iexact="cheese")
->WHERE name ILIKE 'cheese'
filter(name__contains="cheese")
->WHERE name LIKE '%cheese%'
filter(name__icontains="cheese")
->WHERE name ILIKE '%cheese%'
filter(name__in=["cheese", "board"])
->WHERE name IN ('cheese', 'board')
filter(name__gt=100)
->WHERE name > 100
etcfilter(name__isnull=True)
->WHERE name IS NULL
AS
holes_qs = Hole.objects.annotate(this_thang=F("pad__name"))
hole = holes_qs.first()
print(hole.this_thang)
⬇️
SELECT
* ,
drilling_pad.name AS this_thang
FROM drilling_hole;
INNER JOIN "drilling_pad" ON ("drilling_hole"."pad_id" = "drilling_pad"."id")
Subqueries
class Project(models.Model):
name = models.TextField()
class Pad(models.Model):
name = models.TextField()
class Hole(models.Model):
name = models.TextField()
pad = models.ForeignKey(Pad, on_delete=models.CASCADE)
project = models.ForeignKey(Project, on_delete=models.CASCADE)
# find pads that are on project_id=1
hole_subquery = Hole.objects.filter(project_id=1).values("pk")
pad_qs = Pad.objects.filter(hole__in=Subquery(hole_subquery))
⬇️
SELECT "drilling_pad"."id",
"drilling_pad"."name"
FROM "drilling_pad"
INNER JOIN "drilling_hole" ON ("drilling_pad"."id" = "drilling_hole"."pad_id")
WHERE "drilling_hole"."id" IN (
SELECT U0."id"
FROM "drilling_hole" U0
WHERE U0."project_id" = 1
)
Correlated subqueries are where the inner query depends on outer query
class Pad(models.Model):
name = models.TextField()
class Hole(models.Model):
name = models.TextField()
pad = models.ForeignKey(Pad, on_delete=models.CASCADE)
# include the hole id of any hole that has a foreign key to the pad
hole_subquery = Hole.objects.filter(pad_id=OuterRef("pk")).values("pk")
pad_qs = Pad.objects.annotate(hole_id=Subquery(hole_subquery))
⬇️
SELECT "drilling_pad"."id",
"drilling_pad"."name",
(
SELECT U0."id"
FROM "drilling_hole" U0
WHERE U0."pad_id" = ("drilling_pad"."id")
) AS "hole_id"
FROM "drilling_pad"
Performance improvements
- Reduce N+1
- You typically want to reduce N+1 queries because they have communication
overhead select_related
prefetch_related
- You also might choose to use
annotate()
instead ofselect_related
because select related pulls all the data for the associated table when you
might only need one column. That associated might have a jsonb column which
contains a lot of unnecessary data that you don’t need.
- You typically want to reduce N+1 queries because they have communication
holes = Hole.objects.all()
for hole in holes:
print(hole.pad.name) # N+1 queries
holes = Hole.objects.select_related("pad")
for hole in holes:
print(hole.pad.name) # no extra query
You would use prefetch related when you are not pulling a direct foreign key
such a many-to-many relationship like below.
class Faculty(models.Model):
name = models.TextField()
class Course(models.Model):
name = models.TextField()
faculty = models.ForeignKey(Faculty, on_delete=models.CASCADE)
class Student(models.Model):
name = models.TextField()
courses = models.ManyToManyField(Course, through="Enrolment")
class Enrolment(models.Model):
course = models.ForeignKey(Course, on_delete=models.CASCADE)
student = models.ForeignKey(Student, on_delete=models.CASCADE)
grade = models.FloatField()
students = Student.objects.prefetch_related("courses")
for student in students:
for course in student.courses.all():
print(course.name) # no extra query
print(course.faculty.name) # extra query
students = Student.objects.prefetch_related(
Prefetch(
"courses",
queryset=Course.objects.select_related("faculty")
)
)
for student in students:
for course in student.courses.all():
print(course.name) # no extra query
print(course.faculty.name) # no extra query
to_attr
to_attr
can be used to make “filtered” relationships available on the instance.
class Enrolment(models.Model):
course = models.ForeignKey(Course, on_delete=models.CASCADE)
student = models.ForeignKey(Student, on_delete=models.CASCADE)
grade = models.FloatField()
students = Student.objects.prefetch_related(
Prefetch(
"course",
queryset=Course.objects.filter(grade__gt=80.0).select_related("faculty"), to_attr="hd_courses"
)
)
for student in students:
for course in student.hd_courses.all():
...
Multiple instances when filtering across many-to-many
One gotcha is selecting across a many-to-many relationship can return multiple of the same instances.
Student.objects.filter(courses__faculty__name="Science") # inner join returns duplicated rows
QuerySet [Student: Student object (1)>, Student: Student object (1)>]>
Student.objects.filter(courses__faculty__name="Science").distinct()
QuerySet [Student: Student object (1)>]>
SELECT
"testing_student"."id",
"testing_student"."name"
FROM
"testing_student"
INNER JOIN
"testing_enrolment"
ON
("testing_student"."id" = "testing_enrolment"."student_id")
INNER JOIN
"testing_course"
ON
("testing_enrolment"."course_id" = "testing_course"."id")
INNER JOIN
"testing_faculty"
ON
("testing_course"."faculty_id" = "testing_faculty"."id")
WHERE
"testing_faculty"."name" = 'Science'
Gotchas and other Funky stuff
- Model instances when retrieved will try to populate all columns, if column
removed in migration, and the worker still up exception occursget()
orfirst()
for hole in Hole.objects.all()
- This can make migrations hard, as older workers will be requesting columns that might have been removed or renamed which will cause errors
- There are ways to do down-timeless migrations but are bit funky and multi
step - Recommendation is to avoid deleting or renaming columns
class Hole(models.Model):
name = models.TextField()
pad = models.ForeignKey(Pad, on_delete=models.CASCADE)
class Pad(models.Model):
name = models.TextField()
holes_qs = Hole.objects.annotate(this_thang=F("pad__name")).get()
⬇️
SELECT
drilling_hole.name, -- pulls all columns
drilling_hole.pad_id,
drilling_pad.name AS this_thang
FROM drilling_hole;
WHERE drilling_pad.name = 'cheese board';
LIMIT 1;
Values
- So how do you to only retrieve certain columns?
class Hole(models.Model):
name = models.TextField()
pad = models.ForeignKey(Pad, on_delete=models.CASCADE)
holes_qs = Hole.objects.values("name")
for hole in holes_qs:
print(type(hole)) # dict
# not `Hole` object, hence no class functions, no lazy loading e.g. can't access `hole.pad.name`
⬇️
SELECT
drilling_hole.name, -- only pulls name and maps it to a python dictionary object
FROM drilling_hole;
- Less data sent down the wire, but no lazy loading and no class functions as
the data is a python dictionary
Other options
only()
anddefer()
- Will retrieve model instances, but won’t retrieve all fields
- Values not declared when accessed on the model are lazy loaded
- Would not recommend to be used regularly, very high chance of N+1
holes_qs = Hole.objects.only("pad_id")
for hole in holes_qs:
print(hole.pad_id) # no extra query
print(hole.name) # name will be lazy loaded, N+1 queries
How do you know what SQL is being generated?
Updating rows
There are three typical ways to update a row in the database.
class Hole(models.Model):
name = models.TextField()
instance = Hole.objects.create(name="cheese")
# save()
instance.name = "board"
instance.save()
# update()
Model.objects.filter(name="board").update(name="board2")
# bulk_update()
instance.name = "board3"
instances_to_update = [instance]
Model.objects.bulk_update(instances_to_update, ["name"])
Problems with updates
update()
andbulk_update()
do not triggersave()
method on the model- built in django signals (publish/subscribe pattern), there are post_save and pre_save signals which can be triggered when calling
save()
update()
andbulk_update()
do not trigger those signals…
updated_at
column would not normally be updated when callingupdate()
or
bulk_update()
but if queryset is a descendant ofCoreplanQuerySet
then it will.
- Pagination / order_by
- Not a Django ORM thing, but a Django ORM hides the implementation detail,
which may lead to unexpected result - Page pagination is default in DRF list views and implemented with
LIMIT
andOFFSET
in SQL
- Not a Django ORM thing, but a Django ORM hides the implementation detail,
?page_size=10&page=3
SELECT *
FROM drilling_hole
LIMIT 10
OFFSET 20;
Anything wrong with this query?
- There is no deterministic guarantee that the same 10 rows will be returned each time.
- A plain
SELECT
in postgres (may be different in different dbs) provides no
guarantee of order, unlessORDER BY
is specified - It often appears to return in insertion/
id
order, but that is not guaranteed
in postgres - Model Meta
ordering
may set a default order, but sometimes tht is ignored - For list views you should to provide a deterministic order_by
order_by(name)
is not enough if name is not uniqueorder_by(name, id)
is required, because id is unique
- This can been the the cause of some flaky tests issues where lists are
returned seemingly in insertion order and asserted to return in id order
Thanks for reading! I hope this has been useful to you. There are definitely
more particularities and gotchas to be aware of when using the Django ORM and
Django in general but I think these are the most common ones.
Support Techcratic
If you find value in Techcratic’s insights and articles, consider supporting us with Bitcoin. Your support helps me, as a solo operator, continue delivering high-quality content while managing all the technical aspects, from server maintenance to blog writing, future updates, and improvements. Support innovation! Thank you.
Bitcoin Address:
bc1qlszw7elx2qahjwvaryh0tkgg8y68enw30gpvge
Please verify this address before sending funds.
Bitcoin QR Code
Simply scan the QR code below to support Techcratic.
Please read the Privacy and Security Disclaimer on how Techcratic handles your support.
Disclaimer: As an Amazon Associate, Techcratic may earn from qualifying purchases.