How to write Scrapy MySQL Data Pipeline

Asim Zahid
4 min read · Jan 12, 2023

Here’s an example of how you might create a Scrapy pipeline that checks if a table exists in a MySQL database, and if it doesn’t, creates it:

import mysql.connector
from mysql.connector import errorcode
from scrapy.exceptions import DropItem


class MySqlPipeline:

    def __init__(self, host, user, password, database, table):
        self.host = host
        self.user = user
        self.password = password
        self.database = database
        self.table = table

    @classmethod
    def from_crawler(cls, crawler):
        # Read the connection details from the project's settings.py
        return cls(
            host=crawler.settings.get('MYSQL_HOST'),
            user=crawler.settings.get('MYSQL_USER'),
            password=crawler.settings.get('MYSQL_PASSWORD'),
            database=crawler.settings.get('MYSQL_DATABASE'),
            table=crawler.settings.get('MYSQL_TABLE')
        )

    def open_spider(self, spider):
        # Open the connection and cursor once, when the spider starts
        self.conn = mysql.connector.connect(
            host=self.host,
            user=self.user,
            password=self.password,
            database=self.database
        )
        self.cursor = self.conn.cursor()

        # Check if the table exists; if it doesn't, create it
        self.cursor.execute(f"SHOW TABLES LIKE '{self.table}'")
        if not self.cursor.fetchone():
            self.cursor.execute(f'''
                CREATE TABLE {self.table} (
                    airport_id INT NOT NULL,
                    day BIGINT NOT NULL,
                    sumdepartures INT,
                    sumcancellations INT,
                    sumdepdelay INT,
                    avgdepdelay FLOAT,
                    recorded_datetime DATETIME,
                    PRIMARY KEY (airport_id, day)
                )
            ''')
            self.conn.commit()

    def close_spider(self, spider):
        # Release the cursor and connection when the spider finishes
        self.cursor.close()
        self.conn.close()

    def process_item(self, item, spider):
        try:
            # The table name comes from settings; the item values are passed
            # as query parameters so the connector escapes them
            self.cursor.execute(f'''
                INSERT INTO {self.table}
                (airport_id, day, sumdepartures, sumcancellations, sumdepdelay, avgdepdelay, recorded_datetime)
                VALUES
                (%s, %s, %s, %s, %s, %s, NOW())
            ''', (item['airport_id'], item['day'], item['sumdepartures'],
                  item['sumcancellations'], item['sumdepdelay'], item['avgdepdelay']))

            self.conn.commit()
            return item
        except mysql.connector.Error as e:
            raise DropItem(f'Error inserting item: {e}')

The above code defines a Scrapy pipeline called MySqlPipeline that is responsible for saving the scraped data to a MySQL database. The pipeline is initialized with the following properties:

  • host: The hostname or IP address of the MySQL server.
  • user: The username to use when connecting to the MySQL server.
  • password: The password to use when connecting to the MySQL server.
  • database: The name of the database on the MySQL server where the table should be created.
  • table: The name of the table in the MySQL database where the scraped data should be saved.

The pipeline has three main methods:

  • open_spider: This method is called when the spider is opened. It connects to the MySQL server using the host, user, password, and database properties and creates a cursor for executing queries. It then checks whether the table already exists; if it doesn't, it creates the table with the given structure.
  • process_item: This method is called for each item scraped by the spider. It uses the cursor to execute an INSERT INTO statement that writes the scraped data into the specified table in the MySQL database (an item definition matching these columns is sketched after this list). The NOW() function sets the recorded_datetime column to the current date and time.
  • close_spider: This method is called when the spider is closed. It closes the cursor and the connection to the MySQL server.
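
For process_item to work, the item passed in simply needs the six keys used in the INSERT statement; recorded_datetime is filled in by NOW() on the database side. A minimal item definition might look like the sketch below (the class name AirportDelayItem is an illustrative choice, not something from the code above; a plain dict with the same keys would work too):

# items.py -- hypothetical item matching the columns used by the pipeline
import scrapy

class AirportDelayItem(scrapy.Item):
    airport_id = scrapy.Field()
    day = scrapy.Field()
    sumdepartures = scrapy.Field()
    sumcancellations = scrapy.Field()
    sumdepdelay = scrapy.Field()
    avgdepdelay = scrapy.Field()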

The pipeline also has a from_crawler classmethod, which lets Scrapy instantiate the pipeline with values taken from the project's settings. To enable it, register the pipeline in the ITEM_PIPELINES setting in the Scrapy project's settings.py file, and set MYSQL_HOST, MYSQL_USER, MYSQL_PASSWORD, MYSQL_DATABASE, and MYSQL_TABLE to the appropriate values for your MySQL server and table.
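
For example, assuming the pipeline lives in a module such as myproject/pipelines.py (the project name, priority number, and connection values below are placeholders to adapt to your setup), the relevant part of settings.py could look like this:

# settings.py -- hypothetical project path and placeholder credentials
ITEM_PIPELINES = {
    'myproject.pipelines.MySqlPipeline': 300,
}

MYSQL_HOST = 'localhost'
MYSQL_USER = 'scrapy_user'
MYSQL_PASSWORD = 'secret'
MYSQL_DATABASE = 'flights'
MYSQL_TABLE = 'airport_daily_stats'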

It is also worth noting that if any error occurs while inserting data into the MySQL table, the pipeline raises a DropItem exception, which causes the item to be dropped rather than saved to the database.
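
One consequence of this, given the composite primary key on (airport_id, day), is that re-running the spider over the same days will produce duplicate-key errors and those items will be dropped. If you would rather overwrite existing rows, one variation (not part of the code above, just a sketch using MySQL's ON DUPLICATE KEY UPDATE clause) is to replace process_item with an upsert:

    def process_item(self, item, spider):
        try:
            # Insert a new row, or update the existing row for this
            # (airport_id, day) pair if one is already present
            self.cursor.execute(f'''
                INSERT INTO {self.table}
                (airport_id, day, sumdepartures, sumcancellations, sumdepdelay, avgdepdelay, recorded_datetime)
                VALUES (%s, %s, %s, %s, %s, %s, NOW())
                ON DUPLICATE KEY UPDATE
                    sumdepartures = VALUES(sumdepartures),
                    sumcancellations = VALUES(sumcancellations),
                    sumdepdelay = VALUES(sumdepdelay),
                    avgdepdelay = VALUES(avgdepdelay),
                    recorded_datetime = NOW()
            ''', (item['airport_id'], item['day'], item['sumdepartures'],
                  item['sumcancellations'], item['sumdepdelay'], item['avgdepdelay']))
            self.conn.commit()
            return item
        except mysql.connector.Error as e:
            raise DropItem(f'Error inserting item: {e}')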

Hire Me:

Are you seeking a proficient individual to provide website data extraction and data engineering services? I am available and eager to take on the task. I look forward to hearing from you regarding potential opportunities.

About Author:

Asim is a research scientist with a passion for developing impactful products. He possesses expertise in building data platforms and has a proven track record of success as a dual Kaggle expert. Asim has held leadership positions such as Google Developer Student Club (GDSC) Lead and AWS Educate Cloud Ambassador, which have allowed him to hone his skills in driving business success.

In addition to his technical skills, Asim is a strong communicator and team player. He enjoys connecting with like-minded professionals and is always open to networking opportunities. If you appreciate his work and would like to connect, please don’t hesitate to reach out.
