How to write Scrapy MySQL Data Pipeline

Asim Zahid
4 min read · Jan 12, 2023

Here’s an example of how you might create a Scrapy pipeline that checks if a table exists in a MySQL database, and if it doesn’t, creates it:

import mysql.connector
from mysql.connector import errorcode
from scrapy.exceptions import DropItem


class MySqlPipeline:

    def __init__(self, host, user, password, database, table):
        self.host = host
        self.user = user
        self.password = password
        self.database = database
        self.table = table

    @classmethod
    def from_crawler(cls, crawler):
        # Read the connection details from the project's settings.py
        return cls(
            host=crawler.settings.get('MYSQL_HOST'),
            user=crawler.settings.get('MYSQL_USER'),
            password=crawler.settings.get('MYSQL_PASSWORD'),
            database=crawler.settings.get('MYSQL_DATABASE'),
            table=crawler.settings.get('MYSQL_TABLE')
        )

    def open_spider(self, spider):
        # Open the connection and cursor once, when the spider starts
        self.conn = mysql.connector.connect(
            host=self.host,
            user=self.user,
            password=self.password,
            database=self.database
        )
        self.cursor = self.conn.cursor()

        # Check if the table exists; if it doesn't, create it
        self.cursor.execute(f"SHOW TABLES LIKE '{self.table}'")
        if not self.cursor.fetchone():
            self.cursor.execute(f'''
                CREATE TABLE {self.table} (
                    airport_id INT NOT NULL,
                    day BIGINT NOT NULL,
                    sumdepartures INT,
                    sumcancellations INT,
                    sumdepdelay INT,
                    avgdepdelay FLOAT,
                    recorded_datetime DATETIME,
                    PRIMARY KEY (airport_id, day)
                )
            ''')
            self.conn.commit()

    def close_spider(self, spider):
        # Release the cursor and connection when the spider finishes
        self.cursor.close()
        self.conn.close()

    def process_item(self, item, spider):
        try:
            # The table name comes from settings; the item values are passed
            # as query parameters so the connector escapes them
            self.cursor.execute(f'''
                INSERT INTO {self.table}
                (airport_id, day, sumdepartures, sumcancellations, sumdepdelay, avgdepdelay, recorded_datetime)
                VALUES
                (%s, %s, %s, %s, %s, %s, NOW())
            ''', (item['airport_id'], item['day'], item['sumdepartures'],
                  item['sumcancellations'], item['sumdepdelay'], item['avgdepdelay']))

            self.conn.commit()
            return item
        except mysql.connector.Error as e:
            raise DropItem(f'Error inserting item: {e}')

The above code defines a Scrapy pipeline called MySqlPipeline that is responsible for saving the scraped data to a MySQL database. The pipeline is initialized with the following properties:

  • host: The hostname or IP address of the MySQL server.
  • user: The username to use when connecting to the MySQL server.
  • password: The password to use when connecting to the MySQL server.
  • database: The name of the database on the MySQL server where the table should be created.
  • table: The name of the table in the MySQL database where the scraped data should be saved.

The pipeline has three main methods:

  • open_spider: This method is called when the spider is opened. It connects to the MySQL server using the host, user, password, and database properties and creates a cursor for executing queries. It then checks whether the table already exists; if it doesn't, it creates the table with the given structure.
  • process_item: This method is called for each item scraped by the spider. It uses the cursor to execute an INSERT INTO statement that writes the scraped data into the specified table in the MySQL database (an item definition matching these columns is sketched after this list). The NOW() function sets the recorded_datetime column to the current date and time.
  • close_spider: This method is called when the spider is closed. It closes the cursor and the connection to the MySQL server.
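
For process_item to work, the item passed in simply needs the six keys used in the INSERT statement; recorded_datetime is filled in by NOW() on the database side. A minimal item definition might look like the sketch below (the class name AirportDelayItem is an illustrative choice, not something from the code above; a plain dict with the same keys would work too):

# items.py -- hypothetical item matching the columns used by the pipeline
import scrapy

class AirportDelayItem(scrapy.Item):
    airport_id = scrapy.Field()
    day = scrapy.Field()
    sumdepartures = scrapy.Field()
    sumcancellations = scrapy.Field()
    sumdepdelay = scrapy.Field()
    avgdepdelay = scrapy.Field()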

The pipeline also has a from_crawler classmethod, which lets Scrapy instantiate the pipeline with values taken from the project's settings. To enable it, register the pipeline in the ITEM_PIPELINES setting in the Scrapy project's settings.py file, and set MYSQL_HOST, MYSQL_USER, MYSQL_PASSWORD, MYSQL_DATABASE, and MYSQL_TABLE to the appropriate values for your MySQL server and table.
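
For example, assuming the pipeline lives in a module such as myproject/pipelines.py (the project name, priority number, and connection values below are placeholders to adapt to your setup), the relevant part of settings.py could look like this:

# settings.py -- hypothetical project path and placeholder credentials
ITEM_PIPELINES = {
    'myproject.pipelines.MySqlPipeline': 300,
}

MYSQL_HOST = 'localhost'
MYSQL_USER = 'scrapy_user'
MYSQL_PASSWORD = 'secret'
MYSQL_DATABASE = 'flights'
MYSQL_TABLE = 'airport_daily_stats'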

It is also worth noting that if any error occurs while inserting data into the MySQL table, the pipeline raises a DropItem exception, which causes the item to be dropped rather than saved to the database.
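
One consequence of this, given the composite primary key on (airport_id, day), is that re-running the spider over the same days will produce duplicate-key errors and those items will be dropped. If you would rather overwrite existing rows, one variation (not part of the code above, just a sketch using MySQL's ON DUPLICATE KEY UPDATE clause) is to replace process_item with an upsert:

    def process_item(self, item, spider):
        try:
            # Insert a new row, or update the existing row for this
            # (airport_id, day) pair if one is already present
            self.cursor.execute(f'''
                INSERT INTO {self.table}
                (airport_id, day, sumdepartures, sumcancellations, sumdepdelay, avgdepdelay, recorded_datetime)
                VALUES (%s, %s, %s, %s, %s, %s, NOW())
                ON DUPLICATE KEY UPDATE
                    sumdepartures = VALUES(sumdepartures),
                    sumcancellations = VALUES(sumcancellations),
                    sumdepdelay = VALUES(sumdepdelay),
                    avgdepdelay = VALUES(avgdepdelay),
                    recorded_datetime = NOW()
            ''', (item['airport_id'], item['day'], item['sumdepartures'],
                  item['sumcancellations'], item['sumdepdelay'], item['avgdepdelay']))
            self.conn.commit()
            return item
        except mysql.connector.Error as e:
            raise DropItem(f'Error inserting item: {e}')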

Hire Me:

Are you seeking a proficient individual to provide website data extraction and data engineering services? I am available and eager to take on the task. I look forward to hearing from you regarding potential opportunities.

About Author:

Asim is a research scientist with a passion for developing impactful products. He possesses expertise in building data platforms and has a proven track record of success as a dual Kaggle expert. Asim has held leadership positions such as Google Developer Student Club (GDSC) Lead and AWS Educate Cloud Ambassador, which have allowed him to hone his skills in driving business success.

In addition to his technical skills, Asim is a strong communicator and team player. He enjoys connecting with like-minded professionals and is always open to networking opportunities. If you appreciate his work and would like to connect, please don’t hesitate to reach out.
