Accelerate Python with Taichi
Python has become the most popular language in many rapidly evolving fields, such as deep learning and data science. Yet its readability comes at the cost of performance. Of course, we all complain about program performance from time to time, and Python should certainly not take all the blame. Still, it’s fair to say that Python’s nature as an interpreted language does not help, especially in computation-intensive scenarios (e.g., when there are multiple nested for loops).
This notebook is adapted from a blog post by Yuanming Hu, the creator of Taichi. One of the most notable advantages Taichi delivers is speeding up Python code.
To install Taichi, activate your environment and run the following command:
pip install taichi
To use Taichi, import the package with the following command:
import taichi as ti
Count the number of primes
Large-scale or nested for loops in Python often lead to poor runtime performance. The following demo counts the primes within a specified range and involves nested for loops. By importing Taichi and decorating the two functions, and optionally switching to one of Taichi’s GPU backends, you can see a significant boost to the overall performance.
"""Count the prime numbers in the range [2, n)
"""
import time
# Checks if a positive integer is a prime number
def is_prime(n: int) -> bool:
    result = True
    # Traverses the range between 2 and sqrt(n)
    # - Returns False if n can be divided by one of them;
    # - otherwise, returns True
    for k in range(2, int(n ** 0.5) + 1):
        if n % k == 0:
            result = False
            break
    return result

# Traverses the range between 2 and n
# Counts the primes according to the return of is_prime()
def count_primes(n: int) -> int:
    count = 0
    for k in range(2, n):
        if is_prime(k):
            count += 1
    return count

t_start = time.perf_counter()
print(count_primes(1000000))
t_end = time.perf_counter()
print(f"Execution time: {t_end - t_start:.4f} seconds")
78498
Execution time: 2.1988 seconds
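Before timing anything, it can help to confirm that the two functions return known values on a small range. The sketch below repeats the pure-Python definitions so that it runs on its own. The reference counts it is checked against (4 primes below 10, 25 below 100, 168 below 1000) are standard facts, not results from the original notebook.

```python
def is_prime(n: int) -> bool:
    """Trial division up to sqrt(n); assumes n >= 2."""
    result = True
    for k in range(2, int(n ** 0.5) + 1):
        if n % k == 0:
            result = False
            break
    return result


def count_primes(n: int) -> int:
    """Count the primes in [2, n), matching range(2, n) in the demo."""
    count = 0
    for k in range(2, n):
        if is_prime(k):
            count += 1
    return count


# Evaluate a few small ranges whose prime counts are well known.
small_counts = {n: count_primes(n) for n in (10, 100, 1000)}
print(small_counts)
```

The same check can be repeated after the functions are decorated for Taichi, since the decorated versions should return identical counts.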
Now, let’s change the code a bit: import Taichi into your Python code and initialize it using the CPU backend.
import taichi as ti
ti.init(arch=ti.cpu)
[Taichi] version 1.7.3, llvm 15.0.1, commit 5ec301be, win, python 3.12.9
[Taichi] Starting on arch=x64
Decorate is_prime() with @ti.func and count_primes() with @ti.kernel.
Note
Taichi’s compiler compiles the Python code decorated with @ti.kernel and @ti.func for different devices, such as CPUs and GPUs, for high-performance computation.
@ti.func
def is_prime(n: int):
    result = True
    for k in range(2, int(n ** 0.5) + 1):
        if n % k == 0:
            result = False
            break
    return result

@ti.kernel
def count_primes(n: int) -> int:
    count = 0
    for k in range(2, n):
        if is_prime(k):
            count += 1
    return count

t_start = time.perf_counter()
print(count_primes(1000000))
t_end = time.perf_counter()
print(f"Execution time: {t_end - t_start:.4f} seconds")
78498
Execution time: 0.1199 seconds
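The speed-up implied by the two runs reported above is simply the ratio of their execution times. A minimal sketch of that arithmetic, using the timings printed in this notebook (your own numbers will differ by machine):

```python
# Timings copied from the two runs reported above (seconds).
t_python = 2.1988   # pure-Python count_primes(1000000)
t_taichi = 0.1199   # Taichi CPU backend; this first call includes kernel compilation

speedup = t_python / t_taichi
print(f"Speed-up: {speedup:.1f}x")
```

Because Taichi compiles a kernel the first time it is called, the measured Taichi time here includes compilation, so the ratio is a conservative estimate of the steady-state speed-up.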
Exercise
Increase \(n\) tenfold to 10,000,000 and rerun the code. What is the speed-up?
Change Taichi’s backend from CPU to GPU and give it a rerun. What is the speed-up?
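For the exercise, a small timing helper makes the comparison less error-prone. The sketch below is one possible approach, not part of the original notebook: `time_calls`, `speedup`, and `workload` are hypothetical names, and `workload` is only a stand-in so that the block runs without Taichi installed.

```python
import time
from typing import Any, Callable, List, Tuple


def time_calls(fn: Callable[..., Any], *args: Any, repeats: int = 3) -> Tuple[Any, List[float]]:
    """Call fn(*args) `repeats` times; return the last result and each wall time."""
    timings = []
    result = None
    for _ in range(repeats):
        t0 = time.perf_counter()
        result = fn(*args)
        timings.append(time.perf_counter() - t0)
    return result, timings


def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


# Stand-in workload (hypothetical); replace it with count_primes
# from the pure-Python or the Taichi version of the demo.
def workload(n: int) -> int:
    return sum(k * k for k in range(n))


result, timings = time_calls(workload, 100_000)
print(f"first call: {timings[0]:.4f} s, best of {len(timings)}: {min(timings):.4f} s")
```

Reporting the first call separately from the best of several calls is useful with Taichi, because the first call of a kernel also pays for compilation.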
2D Diffusion
Import the required libraries.
# Import the required libraries
import time
import taichi as ti
ti.init(arch=ti.cpu, default_fp=ti.f64)
[Taichi] Starting on arch=x64
Isolate the code responsible for heavy computation (loops) and enclose the code in a Taichi kernel.
@ti.kernel
def fvm_iteration() -> ti.f64:
    # T, Told, nx, ny and the physical parameters are module-level names
    # defined in the main routine, before the kernel is first called
    # copy the current temperature field to the placeholder field
    for i, j in ti.ndrange(nx, ny):
        Told[i, j] = T[i, j]
    # loop over the grid points
    for i, j in ti.ndrange(nx, ny):
        # left-bottom corner
        if i == 0 and j == 0:
            T[i, j] = ((k*area/dx)*Told[i+1, j] + (k*area/dx)*Told[i, j+1] + q*area + area*Tinf/(1/h + dx/(2*k))) / (2*k*area/dx + area/(1/h + dx/(2*k)))
        # right-bottom corner
        elif i == nx-1 and j == 0:
            T[i, j] = ((k*area/dx)*Told[i-1, j] + (k*area/dx)*Told[i, j+1] + area/(1/h + dx/(2*k))*Tinf) / (2*k*area/dx + area/(1/h + dx/(2*k)))
        # left-top corner
        elif i == 0 and j == ny-1:
            T[i, j] = ((k*area/dx)*Told[i+1, j] + (k*area/dx)*Told[i, j-1] + (q*area + 2*k*area/dx*Tn)) / (4*k*area/dx)
        # right-top corner
        elif i == nx-1 and j == ny-1:
            T[i, j] = ((k*area/dx)*Told[i-1, j] + (k*area/dx)*Told[i, j-1] + (2*k*area/dx*Tn)) / (4*k*area/dx)
        # left boundary
        elif i == 0:
            T[i, j] = ((k*area/dx)*Told[i+1, j] + (k*area/dx)*Told[i, j-1] + (k*area/dx)*Told[i, j+1] + q*area) / (3*k*area/dx)
        # right boundary
        elif i == nx-1:
            T[i, j] = ((k*area/dx)*Told[i-1, j] + (k*area/dx)*Told[i, j-1] + (k*area/dx)*Told[i, j+1]) / (3*k*area/dx)
        # bottom boundary
        elif j == 0:
            T[i, j] = ((k*area/dx)*Told[i-1, j] + (k*area/dx)*Told[i+1, j] + (k*area/dx)*Told[i, j+1] + (area*Tinf/(1/h + dx/(2*k)))) / (3*k*area/dx + area/(1/h + dx/(2*k)))
        # top boundary
        elif j == ny-1:
            T[i, j] = ((k*area/dx)*Told[i-1, j] + (k*area/dx)*Told[i+1, j] + (k*area/dx)*Told[i, j-1] + (2*k*area/dx*Tn)) / (5*k*area/dx)
        # internal nodes
        else:
            T[i, j] = 0.25 * (Told[i-1, j] + Told[i+1, j] + Told[i, j-1] + Told[i, j+1])
    # calculate the temperature difference
    Tdiff = 0.0
    for i, j in ti.ndrange(nx, ny):
        Tdiff += ti.abs(T[i, j] - Told[i, j])
    return Tdiff
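The kernel follows a Jacobi-style pattern: take a snapshot of the field, recompute every node from the snapshot, then sum the absolute changes as a convergence measure. The pure-Python sketch below illustrates only that pattern and only the internal-node formula. It assumes a hypothetical 5 x 5 grid whose boundary nodes are held at a fixed value, which is simpler than the flux and convection boundaries used in this notebook.

```python
def jacobi_sweep(T):
    """One Jacobi sweep over interior nodes; returns (new field, sum of |change|)."""
    nx, ny = len(T), len(T[0])
    T_new = [row[:] for row in T]   # T itself is only read, like Told in the kernel
    diff = 0.0
    for i in range(1, nx - 1):
        for j in range(1, ny - 1):
            # same average of the four neighbours as the internal-node branch
            T_new[i][j] = 0.25 * (T[i-1][j] + T[i+1][j] + T[i][j-1] + T[i][j+1])
            diff += abs(T_new[i][j] - T[i][j])
    return T_new, diff


# Hypothetical test problem: boundary nodes held at 100, interior starts at 0.
n = 5
T = [[100.0 if i in (0, n - 1) or j in (0, n - 1) else 0.0 for j in range(n)]
     for i in range(n)]

sweeps = 0
diff = 1.0
while diff > 1e-3:   # same stopping rule as the main routine
    T, diff = jacobi_sweep(T)
    sweeps += 1

print(f"converged after {sweeps} sweeps, centre value {T[n//2][n//2]:.3f}")
```

For this simplified problem the exact steady state is a uniform field of 100, so the centre value gives a direct check on the iteration.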
Major routine of the code.
# Parameter declarations
lx = 0.3 # length of the plate
ly = 0.4 # height of the plate
nx = 3 # number of grid points in x-direction
ny = round(ly/lx*nx) # number of grid points in y-direction
dx = lx/nx # grid spacing in x-direction
dy = ly/ny # grid spacing in y-direction
thickness = 0.01 # plate thickness (kept separate from h, the convective coefficient below)
area = thickness*dx # flux area
k = 1000 # coefficient for heat conduction
q = 500000 # heat flux at the west boundary
Tinf = 200 # ambient temperature in the south
h = 253.165 # convective heat transfer coefficient at the southern edge
Tn = 100 # constant temperature at the northern edge
# Set initial condition (Note the order of nx and ny)
T = ti.field(dtype=ti.f64, shape=(nx, ny)) # a taichi field with all elements equal to zero
# Finite volume calculations
Told = ti.field(dtype=ti.f64, shape=(nx, ny)) # placeholder field to advance the solution
Tdiff = 1 # temperature difference for convergence
cnt = 0 # counter for the number of iterations
t_start = time.perf_counter() # start time for the simulation
while Tdiff > 1e-3: # loop until the difference is less than 1e-3
    cnt += 1 # increment the counter
    Tdiff = fvm_iteration() # calculate the temperature difference
    if cnt % 10 == 0: # print every 10 iterations
        print('Iteration {}: Tdiff = {:.4f}'.format(cnt, Tdiff))
# Stop the timer and print the iteration results
t_end = time.perf_counter()
print('******************************************')
print('Final temperature difference: {:.4f}'.format(Tdiff))
print('Number of iterations: {}'.format(cnt))
print('Elapsed time: {:.3f} seconds'.format(t_end - t_start))
print('The temperature at the plate center is {:.4f} degree Celsius.'.format(0.5*(T[nx//2, ny//2] + T[nx//2, ny//2-1])))
Iteration 10: Tdiff = 71.3782
Iteration 20: Tdiff = 39.7494
Iteration 30: Tdiff = 22.2378
Iteration 40: Tdiff = 12.4416
Iteration 50: Tdiff = 6.9609
Iteration 60: Tdiff = 3.8945
Iteration 70: Tdiff = 2.1789
Iteration 80: Tdiff = 1.2190
Iteration 90: Tdiff = 0.6820
Iteration 100: Tdiff = 0.3816
Iteration 110: Tdiff = 0.2135
Iteration 120: Tdiff = 0.1194
Iteration 130: Tdiff = 0.0668
Iteration 140: Tdiff = 0.0374
Iteration 150: Tdiff = 0.0209
Iteration 160: Tdiff = 0.0117
Iteration 170: Tdiff = 0.0065
Iteration 180: Tdiff = 0.0037
Iteration 190: Tdiff = 0.0020
Iteration 200: Tdiff = 0.0011
******************************************
Final temperature difference: 0.0010
Number of iterations: 203
Elapsed time: 0.107 seconds
The temperature at the plate center is 193.1574 degree Celsius.
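To see where the numbers in the boundary formulas come from, the recurring coefficients can be evaluated for the parameters of this run. This is a sketch for inspection only: `a_cond`, `a_conv`, and `source_west` are names introduced here for illustration, and the block uses plain Python floats rather than Taichi fields.

```python
lx, nx = 0.3, 3
dx = lx / nx                 # grid spacing in x-direction
thickness = 0.01             # plate thickness
area = thickness * dx        # flux area
k = 1000                     # coefficient for heat conduction
h = 253.165                  # convective heat transfer coefficient (southern edge)
q = 500000                   # heat flux at the west boundary

a_cond = k * area / dx                    # neighbour coefficient k*area/dx
a_conv = area / (1 / h + dx / (2 * k))    # convective coefficient at the southern edge
source_west = q * area                    # flux source added on the west boundary

print(f"a_cond = {a_cond:.4f}, a_conv = {a_conv:.4f}, q*area = {source_west:.1f}")
```

With these values, the denominators in the kernel are easy to read: an internal node divides by four equal neighbour coefficients (hence the factor 0.25), while the top boundary divides by `5*k*area/dx`, that is, three neighbours plus `2*k*area/dx` from the fixed temperature Tn.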
Exercise
Explore the Taichi website.