Indexing the filesystem with SQLAlchemy and Postgres

In an attempt to understand how SQLAlchemy works, I've written up a sample that indexes a large folder hierarchy.

This loops over every file in the tree (there are several million files here; it's a small subset of court cases from PACER).

First, connect to the Postgres database. Before running this, you'll want to install Anaconda, then SQLAlchemy and the pg8000 Postgres driver:

pip install sqlalchemy
easy_install pg8000

Then, you can set up the connection:

from sqlalchemy import create_engine

# pg8000 is the driver; 'pacer' is the database name.
# Note: Postgres treats READ UNCOMMITTED as READ COMMITTED.
engine = create_engine(
    "postgresql+pg8000://postgres:postgres@localhost/pacer",
    isolation_level="READ UNCOMMITTED"
)
c = engine.connect()
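At this point a trivial query is a quick way to confirm the connection is live (a minimal sketch; recent SQLAlchemy versions want the statement wrapped in text(), older ones also accept a bare string):

from sqlalchemy import text

# Should print (1,) if the connection works
print(c.execute(text("SELECT 1")).fetchone())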

Defining tables is pretty simple:

from sqlalchemy import Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class PacerFile(Base):
    __tablename__ = 'pacer_files'

    id = Column(Integer, primary_key=True)
    file_name = Column(String)
    file_path = Column(String)

    def __repr__(self):
        return "<PacerFile(file_name='%s', file_path='%s')>" % \
               (self.file_name, self.file_path)

Base.metadata.create_all(engine) 
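If you expect to query by file name later, it's worth an index on that column; at several million rows, unindexed lookups will crawl. A minimal tweak (a sketch; index=True is standard SQLAlchemy, and it would go in the column definition above, before create_all runs):

file_name = Column(String, index=True)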

Then, create an actual database session:

from sqlalchemy.orm import sessionmaker

Session = sessionmaker(bind=engine)
session = Session()
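Before turning the full walk loose, a quick smoke test confirms rows round-trip through the mapping (a sketch; the path here is made up):

# Hypothetical row, just to check that add/commit/query all work
test_file = PacerFile(file_name='test.pdf', file_path='Q:\\pacer2\\test')
session.add(test_file)
session.commit()
print(session.query(PacerFile).filter_by(file_name='test.pdf').first())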

Finally, loop over each file in the hierarchy. I commit once per directory, which lets me watch how things are going; it would likely be faster to commit just once at the end.

import os

for root, subFolders, files in os.walk("Q:\\pacer2\\"):
    for f in files:
        pacer_file = PacerFile(file_name=f,
                               file_path=root)
        session.add(pacer_file)

    # Commit once per directory so progress is visible from another session
    session.commit()
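Once the walk finishes, ordinary ORM queries run against the table; for instance, counting what made it in (a sketch, assuming the filenames keep their extensions):

# Total rows indexed
print(session.query(PacerFile).count())

# Just the PDFs
print(session.query(PacerFile)
             .filter(PacerFile.file_name.like('%.pdf'))
             .count())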