In an attempt to understand how SQLAlchemy works, I’ve written up a sample that indexes a large folder hierarchy.
It loops over each file in a directory tree (there are several million files here – it’s a small subset of court cases from PACER).
First, connect to the Postgres database. Before running this you’ll want to install Anaconda, plus SQLAlchemy and the pg8000 Postgres driver:
pip install sqlalchemy
easy_install pg8000
Then, you can set up the connection:
from sqlalchemy import *

engine = create_engine(
    "postgresql+pg8000://postgres:postgres@localhost/pacer",
    # Note: Postgres treats READ UNCOMMITTED as READ COMMITTED
    isolation_level="READ UNCOMMITTED"
)
c = engine.connect()
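To sanity-check the connection before going further, you can run a trivial query. A minimal sketch, using an in-memory SQLite URL in place of the Postgres one above so it runs without a database server:

```python
from sqlalchemy import create_engine, text

# SQLite in-memory stands in for the Postgres URL so no server is needed
test_engine = create_engine("sqlite://")
with test_engine.connect() as conn:
    # A trivial SELECT proves the engine can open a connection and execute SQL
    result = conn.execute(text("SELECT 1")).scalar()
print(result)  # 1
```

Swapping the URL back to the `postgresql+pg8000://` string exercises the real database instead.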
Defining tables is pretty simple:
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import Column, Integer, String

Base = declarative_base()

class PacerFile(Base):
    __tablename__ = 'pacer_files'

    id = Column(Integer, primary_key=True)
    file_name = Column(String)
    file_path = Column(String)

    def __repr__(self):
        return "<PacerFile(file_name='%s', file_path='%s')>" % \
            (self.file_name, self.file_path)

Base.metadata.create_all(engine)
Then, create an actual database session:
from sqlalchemy.orm import sessionmaker

Session = sessionmaker(bind=engine)
session = Session()
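With the session in hand, adding and querying a record is straightforward. A self-contained sketch (assuming SQLAlchemy 1.4+, and again substituting in-memory SQLite for Postgres; the file name and path are made up):

```python
from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class PacerFile(Base):
    __tablename__ = "pacer_files"
    id = Column(Integer, primary_key=True)
    file_name = Column(String)
    file_path = Column(String)

# In-memory SQLite stands in for the Postgres engine above
engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

# Hypothetical file name and path, just to exercise the model
session.add(PacerFile(file_name="complaint.pdf", file_path="Q:\\pacer2\\123"))
session.commit()

# Query the row back out by file name
row = session.query(PacerFile).filter_by(file_name="complaint.pdf").one()
print(row.file_path)
```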
Finally, loop over each file in the hierarchy. I commit periodically, since that lets me watch how things are going; it would likely be faster to commit just once at the end.
import os

for root, subFolders, files in os.walk("Q:\\pacer2\\"):
    for f in files:
        pacer_file = PacerFile(file_name=f,
                               file_path=root)
        session.add(pacer_file)
    # Commit once per folder rather than once per file
    session.commit()
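The whole pipeline can be sketched end to end. A hedged, self-contained version (assuming SQLAlchemy 1.4+): a throwaway temp directory stands in for `Q:\pacer2\`, and in-memory SQLite stands in for Postgres, so nothing here touches the real data.

```python
import os
import tempfile
from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class PacerFile(Base):
    __tablename__ = "pacer_files"
    id = Column(Integer, primary_key=True)
    file_name = Column(String)
    file_path = Column(String)

engine = create_engine("sqlite://")  # stands in for Postgres
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

# Build a tiny throwaway tree to walk, standing in for the PACER hierarchy
tmp = tempfile.mkdtemp()
for name in ("a.pdf", "b.pdf"):
    open(os.path.join(tmp, name), "w").close()

# Same loop shape as above: one commit per folder
for root, subFolders, files in os.walk(tmp):
    for f in files:
        session.add(PacerFile(file_name=f, file_path=root))
    session.commit()

# Once the walk finishes, the index is queryable
total = session.query(PacerFile).count()
print(total)  # 2
```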