Taming the beast that is the Django ORM

2024-08-31 22:07:34
www.davidhang.com

The material this blog post was originally developed from was a bunch of slides
used for a skill share presentation I gave at my workplace @ coreplan.io.

I have 3+ years of experience with Django, with it being the main framework that
underpins the backend of CorePlan’s main SaaS product. It is a mature, batteries
included framework that has been around for a while now. One particular powerful
yet dangerous feature of Django is the ORM. This is a Django specific ORM which
cannot be separated from the rest of the framework. The other major python ORM
is SQLAlchemy which can be used with other python web frameworks, but is an
independent tool.

Below are some of the things that I have learned about the Django ORM, how it
compares to raw SQL and gotchas that you should be aware of when using it.

What is an ORM (Object Relational Mapper)?

Abstraction over SQL to interact with databases

Code -> SQL

Hole.objects.all()

⬇️

SELECT * FROM drilling_hole;

Why use an ORM? – Pros

Abstraction over SQL, no need to write raw SQL (plus and minus)
Portability – Can change out database engines easily !?
- Probably not true, often will rely on db specific features e.g. postgres jsonb, triggers, etc
Direct mapping from db to models
Automatic schema generation
- Migrations are automatically generated
Security
- abstracts away enough that sql injection is less likely

Why use an ORM? – Cons

Abstraction over SQL…
- Hides the underlying SQL
- Can be difficult to debug
- Lazy loading can cause N+1 queries without the developer realising
- Harder to onboard new developers if they haven’t used Django before
Performance
- Generated sql be slower than crafted SQL

Fundamentals

# drilling/models.py

from django.db import models
class Hole(models.Model):
    name = models.TextField()

CREATE TABLE drilling_hole (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100)
);

Migrations

python manage.py makemigrations # generate migration files
python manage.py migrate # apply migrations

python manage.py drilling --empty # generate empty file for data migration

https://docs.djangoproject.com/en/dev/topics/migrations/

Querying

ActiveRecord pattern – ala Ruby on Rails style
QuerySets (Hole.objects.all())
- lazy
- chainable
- cached when iterated over multiple times !?
  - I would not recommend relying on this because it hard to comprehend when it is cached and when it is not when you are reading code

holes_qs = Hole.objects.filter(name="cheese") # not evaluated yet
holes_qs = holes_qs.filter(depth__gt=100) # still not evaluated

list(holes_qs) # evaluated
list(holes_qs) # cached

holes_qs[2] # not cached
holes_qs.first() # not cached
holes_qs.get(id=1) # not cached

WHERE

WHERE clause ≈ filter()

holes_qs = Hole.objects.filter(name="cheese")

⬇️

SELECT * 
FROM drilling_hole;
WHERE drilling_hole.name = 'cheese';

WHERE across tables?

But how do you do a left/inner join? With the ORM it isn’t done declaratively, but implicitly

class Hole(models.Model):
  name = models.TextField()
  pad = models.ForeignKey(Pad, on_delete=models.CASCADE)

class Pad(models.Model):
  name = models.TextField()

holes_qs = Hole.objects.filter(pad__name="cheese board")

⬇️

SELECT * 
FROM drilling_hole;
INNER JOIN drilling_pad ON drilling_hole.pad_id = drilling_pad.id
WHERE drilling_pad.name = 'cheese board';

WHERE other conditionals

filter(name="cheese") -> filter(name__exact="cheese") -> WHERE name="cheese"
filter(name__iexact="cheese") -> WHERE name ILIKE 'cheese'
filter(name__contains="cheese") -> WHERE name LIKE '%cheese%'
filter(name__icontains="cheese") -> WHERE name ILIKE '%cheese%'
filter(name__in=["cheese", "board"]) -> WHERE name IN ('cheese', 'board')
filter(name__gt=100) -> WHERE name > 100 etc
filter(name__isnull=True) -> WHERE name IS NULL

AS

holes_qs = Hole.objects.annotate(this_thang=F("pad__name"))
hole = holes_qs.first()
print(hole.this_thang)

⬇️

SELECT 
  * , 
  drilling_pad.name AS this_thang
FROM drilling_hole;
INNER JOIN "drilling_pad" ON ("drilling_hole"."pad_id" = "drilling_pad"."id")

Subqueries

class Project(models.Model):
  name = models.TextField()

class Pad(models.Model):
  name = models.TextField()

class Hole(models.Model):
  name = models.TextField()
  pad = models.ForeignKey(Pad, on_delete=models.CASCADE)
  project = models.ForeignKey(Project, on_delete=models.CASCADE)

# find pads that are on project_id=1
hole_subquery = Hole.objects.filter(project_id=1).values("pk")
pad_qs = Pad.objects.filter(hole__in=Subquery(hole_subquery))

⬇️

SELECT "drilling_pad"."id",
 "drilling_pad"."name"
FROM "drilling_pad"
 INNER JOIN "drilling_hole" ON ("drilling_pad"."id" = "drilling_hole"."pad_id")
WHERE "drilling_hole"."id" IN (
  SELECT U0."id"
  FROM "drilling_hole" U0
  WHERE U0."project_id" = 1
 )

Correlated subqueries are where the inner query depends on outer query

class Pad(models.Model):
  name = models.TextField()

class Hole(models.Model):
  name = models.TextField()
  pad = models.ForeignKey(Pad, on_delete=models.CASCADE)

# include the hole id of any hole that has a foreign key to the pad
hole_subquery = Hole.objects.filter(pad_id=OuterRef("pk")).values("pk")
pad_qs = Pad.objects.annotate(hole_id=Subquery(hole_subquery))

⬇️

SELECT "drilling_pad"."id",
 "drilling_pad"."name",
 (
  SELECT U0."id"
  FROM "drilling_hole" U0
  WHERE U0."pad_id" = ("drilling_pad"."id")
 ) AS "hole_id"
FROM "drilling_pad"

Performance improvements

Reduce N+1
- You typically want to reduce N+1 queries because they have communication
  overhead
- select_related
- prefetch_related
- You also might choose to use annotate() instead of select_related
  because select related pulls all the data for the associated table when you
  might only need one column. That associated might have a jsonb column which
  contains a lot of unnecessary data that you don’t need.

holes = Hole.objects.all()
for hole in holes:
  print(hole.pad.name) # N+1 queries

holes = Hole.objects.select_related("pad")

for hole in holes:
  print(hole.pad.name) # no extra query

You would use prefetch related when you are not pulling a direct foreign key
such a many-to-many relationship like below.


class Faculty(models.Model):
  name = models.TextField()

class Course(models.Model):
  name = models.TextField()
  faculty = models.ForeignKey(Faculty, on_delete=models.CASCADE)

class Student(models.Model):
  name = models.TextField()
  courses = models.ManyToManyField(Course, through="Enrolment")

class Enrolment(models.Model):
  course = models.ForeignKey(Course, on_delete=models.CASCADE)
  student = models.ForeignKey(Student, on_delete=models.CASCADE)
  grade = models.FloatField()

students = Student.objects.prefetch_related("courses")

for student in students:
  for course in student.courses.all():
    print(course.name) # no extra query
    print(course.faculty.name) # extra query

students = Student.objects.prefetch_related(
  Prefetch(
    "courses", 
    queryset=Course.objects.select_related("faculty")
  )
)

for student in students:
  for course in student.courses.all():
    print(course.name) # no extra query
    print(course.faculty.name) # no extra query

to_attr

to_attr can be used to make “filtered” relationships available on the instance.

class Enrolment(models.Model):
  course = models.ForeignKey(Course, on_delete=models.CASCADE)
  student = models.ForeignKey(Student, on_delete=models.CASCADE)
  grade = models.FloatField()

students = Student.objects.prefetch_related(
  Prefetch(
    "course", 
    queryset=Course.objects.filter(grade__gt=80.0).select_related("faculty"), to_attr="hd_courses"
  )
)

for student in students:
  for course in student.hd_courses.all():
    ...

Multiple instances when filtering across many-to-many

One gotcha is selecting across a many-to-many relationship can return multiple of the same instances.

Student.objects.filter(courses__faculty__name="Science") # inner join returns duplicated rows
QuerySet [Student: Student object (1)>, Student: Student object (1)>]>

Student.objects.filter(courses__faculty__name="Science").distinct()
QuerySet [Student: Student object (1)>]>

SELECT
  "testing_student"."id", 
  "testing_student"."name" 
FROM 
  "testing_student" 
INNER JOIN 
  "testing_enrolment" 
ON 
  ("testing_student"."id" = "testing_enrolment"."student_id") 
INNER JOIN 
  "testing_course" 
ON 
  ("testing_enrolment"."course_id" = "testing_course"."id") 
INNER JOIN 
  "testing_faculty" 
ON 
  ("testing_course"."faculty_id" = "testing_faculty"."id")
WHERE 
  "testing_faculty"."name" = 'Science'

Gotchas and other Funky stuff

Model instances when retrieved will try to populate all columns, if column
removed in migration, and the worker still up exception occurs
- get() or first()
- for hole in Hole.objects.all()
This can make migrations hard, as older workers will be requesting columns that might have been removed or renamed which will cause errors
There are ways to do down-timeless migrations but are bit funky and multi
step
Recommendation is to avoid deleting or renaming columns

class Hole(models.Model):
  name = models.TextField()
  pad = models.ForeignKey(Pad, on_delete=models.CASCADE)

class Pad(models.Model):
  name = models.TextField()

holes_qs = Hole.objects.annotate(this_thang=F("pad__name")).get()

⬇️

SELECT 
  drilling_hole.name, -- pulls all columns 
  drilling_hole.pad_id,
  drilling_pad.name AS this_thang
FROM drilling_hole;
WHERE drilling_pad.name = 'cheese board';
LIMIT 1;

Values

So how do you to only retrieve certain columns?

class Hole(models.Model):
  name = models.TextField()
  pad = models.ForeignKey(Pad, on_delete=models.CASCADE)

holes_qs = Hole.objects.values("name")

for hole in holes_qs:
  print(type(hole)) # dict
  # not `Hole` object, hence no class functions, no lazy loading e.g. can't access `hole.pad.name`

⬇️

SELECT 
  drilling_hole.name, -- only pulls name and maps it to a python dictionary object
FROM drilling_hole;

Less data sent down the wire, but no lazy loading and no class functions as
the data is a python dictionary

Other options

only() and defer()
Will retrieve model instances, but won’t retrieve all fields
Values not declared when accessed on the model are lazy loaded
Would not recommend to be used regularly, very high chance of N+1

holes_qs = Hole.objects.only("pad_id")

for hole in holes_qs:
  print(hole.pad_id) # no extra query
  print(hole.name) # name will be lazy loaded, N+1 queries

How do you know what SQL is being generated?

Updating rows

There are three typical ways to update a row in the database.

class Hole(models.Model):
  name = models.TextField()

instance = Hole.objects.create(name="cheese")

# save()
instance.name = "board"
instance.save()

# update()
Model.objects.filter(name="board").update(name="board2")

# bulk_update()
instance.name = "board3"
instances_to_update = [instance]
Model.objects.bulk_update(instances_to_update, ["name"])

Problems with updates

update() and bulk_update() do not trigger save() method on the model
built in django signals (publish/subscribe pattern), there are post_save and pre_save signals which can be triggered when calling save()
- update() and bulk_update() do not trigger those signals…
updated_at column would not normally be updated when calling update() or
bulk_update() but if queryset is a descendant of CoreplanQuerySet then it will.

Pagination / order_by
- Not a Django ORM thing, but a Django ORM hides the implementation detail,
  which may lead to unexpected result
- Page pagination is default in DRF list views and implemented with LIMIT and OFFSET in SQL

?page_size=10&page=3

SELECT * 
FROM drilling_hole 
LIMIT 10 
OFFSET 20;

Anything wrong with this query?

There is no deterministic guarantee that the same 10 rows will be returned each time.
A plain SELECT in postgres (may be different in different dbs) provides no
guarantee of order, unless ORDER BY is specified
It often appears to return in insertion/id order, but that is not guaranteed
in postgres
Model Meta ordering may set a default order, but sometimes tht is ignored
For list views you should to provide a deterministic order_by
order_by(name) is not enough if name is not unique
- order_by(name, id) is required, because id is unique
This can been the the cause of some flaky tests issues where lists are
returned seemingly in insertion order and asserted to return in id order

Thanks for reading! I hope this has been useful to you. There are definitely
more particularities and gotchas to be aware of when using the Django ORM and
Django in general but I think these are the most common ones.

Source Link

Support Techcratic

If you find value in Techcratic’s insights and articles, consider supporting us with Bitcoin. Your support helps me, as a solo operator, continue delivering high-quality content while managing all the technical aspects, from server maintenance to blog writing, future updates, and improvements. Support innovation! Thank you.

Bitcoin Address:

bc1qlszw7elx2qahjwvaryh0tkgg8y68enw30gpvge

Please verify this address before sending funds.

Bitcoin QR Code

Simply scan the QR code below to support Techcratic.

Please read the Privacy and Security Disclaimer on how Techcratic handles your support.

Disclaimer: As an Amazon Associate, Techcratic may earn from qualifying purchases.

Tags: HACKER NEWS

Cookie	Duration	Description
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.