In an attempt to understand how SQLAlchemy works, I’ve written up a sample that indexes a large folder hierarchy.
The script loops over each file in the tree (there are several million files here – it’s a small subset of court cases from PACER).
First, connect to the Postgres database. Before running this you’ll want a Python distribution such as Anaconda, plus SQLAlchemy and the pg8000 Postgres driver:
pip install sqlalchemy
easy_install pg8000
Then, you can set up the connection:
from sqlalchemy import *
engine = create_engine(
                "postgresql+pg8000://postgres:postgres@localhost/pacer",
                isolation_level="READ UNCOMMITTED"
            )
c = engine.connect()
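Before building anything on top of the connection, it’s worth a quick sanity check that the engine can actually reach the database. A minimal sketch – I’m using an in-memory SQLite engine here so the snippet runs without a Postgres server; against Postgres you’d keep the URL from above:

```python
# Quick connectivity check. SQLite in memory stands in for the
# Postgres URL above so this runs anywhere.
from sqlalchemy import create_engine, text

engine = create_engine("sqlite://")
c = engine.connect()
result = c.execute(text("SELECT 1")).scalar()
print(result)  # 1 means the round trip worked
```

(`text()` is required for raw SQL strings in recent SQLAlchemy versions.)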
Defining tables is pretty simple:
from sqlalchemy.ext.declarative import declarative_base
Base = declarative_base()
from sqlalchemy import Column, Integer, String
class PacerFile(Base):
    __tablename__ = 'pacer_files'
    id = Column(Integer, primary_key=True)
    file_name = Column(String)
    file_path = Column(String)

    def __repr__(self):
        return "<PacerFile(file_name='%s', file_path='%s')>" % \
               (self.file_name, self.file_path)
Base.metadata.create_all(engine) 
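After `create_all`, you can confirm the table really exists using SQLAlchemy’s runtime inspector. A self-contained sketch, again with SQLite standing in for Postgres:

```python
# Verify that create_all() emitted the table, via the inspector API.
from sqlalchemy import create_engine, inspect, Column, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class PacerFile(Base):
    __tablename__ = "pacer_files"
    id = Column(Integer, primary_key=True)
    file_name = Column(String)
    file_path = Column(String)

engine = create_engine("sqlite://")  # stand-in for the Postgres engine
Base.metadata.create_all(engine)

tables = inspect(engine).get_table_names()
print(tables)  # ['pacer_files']
```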
Then, create an actual database session:
from sqlalchemy.orm import sessionmaker
Session = sessionmaker(bind=engine)
session = Session()
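With a session in hand, the basic workflow is add, commit, query. Here’s a full round trip in one runnable piece – SQLite in memory stands in for Postgres, and the sample row is made up:

```python
from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class PacerFile(Base):
    __tablename__ = "pacer_files"
    id = Column(Integer, primary_key=True)
    file_name = Column(String)
    file_path = Column(String)

engine = create_engine("sqlite://")  # stand-in for the Postgres engine
Base.metadata.create_all(engine)

Session = sessionmaker(bind=engine)
session = Session()

# Add one (made-up) row, commit, and read it back.
session.add(PacerFile(file_name="complaint.pdf", file_path="/cases/001"))
session.commit()

count = session.query(PacerFile).count()
print(count)  # 1
```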
Finally, loop over each file in the hierarchy. I commit after each directory, which lets me watch how things are going; it would likely be faster to commit just once at the end.
import os
for root, subFolders, files in os.walk("Q:\\pacer2\\"):
    for f in files:
        pacer_file = PacerFile(file_name=f,
                               file_path=root)
        session.add(pacer_file)
    session.commit()
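To see the whole loop work end to end without waiting on millions of PACER files, here is the same walk-and-insert pattern run against a throwaway temp directory and an in-memory SQLite database (the two file names are made up for the demo):

```python
import os
import tempfile

from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class PacerFile(Base):
    __tablename__ = "pacer_files"
    id = Column(Integer, primary_key=True)
    file_name = Column(String)
    file_path = Column(String)

engine = create_engine("sqlite://")  # stand-in for the Postgres engine
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)
session = Session()

with tempfile.TemporaryDirectory() as top:
    # Create a couple of placeholder files to index.
    for name in ("a.txt", "b.txt"):
        open(os.path.join(top, name), "w").close()

    # Same walk-and-insert loop as above, committing per directory.
    for root, subFolders, files in os.walk(top):
        for f in files:
            session.add(PacerFile(file_name=f, file_path=root))
        session.commit()

total = session.query(PacerFile).count()
print(total)  # 2
```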