The Beautiful Soup library for Python lets you parse HTML pages.
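It does some things a little weirdly if you’re used to JavaScript. To filter the document, you can use “find”, which gives you the first tag matching some condition, or “find_all”, which gives you a list of every matching tag. However, if you loop over the single tag that “find” returns, the values you get are its children, which are often text elements rather than element nodes, so you have to do “parent” on them to get back to something that is actually useful.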
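Here is a minimal sketch of how these calls behave. The HTML snippet and the variable names are made up for illustration and are not taken from the pages I was parsing; only the class name is reused. It shows that “find” returns one tag, that looping over a tag walks its children, and that “parent” leads from a child back to its tag.

```python
from bs4 import BeautifulSoup, NavigableString

# Made-up markup for illustration only.
html = (
    '<div>'
    '<p class="transcript-text-content">Hello <b>world</b></p>'
    '<p class="transcript-text-content">Second line</p>'
    '</div>'
)
soup = BeautifulSoup(html, 'html.parser')

# find returns the first matching tag (or None), not a list.
first = soup.find(attrs={"class": "transcript-text-content"})

# Iterating over a tag walks its direct children: text nodes and nested tags.
children = list(first)
text_children = [c for c in children if isinstance(c, NavigableString)]

# parent leads from a child back to the tag that contains it.
parents_match = all(c.parent is first for c in children)

# find_all returns every match as a list-like collection of tags.
all_hits = soup.find_all(attrs={"class": "transcript-text-content"})
```

If you want every matching tag rather than the children of the first one, “find_all” is probably the call to reach for.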
You can then do a “find_all” with “string=True” (older code writes this as “findAll” with “text=True”) on the element you found, to filter to the bits of text nested inside it.
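from bs4 import BeautifulSoup
file = 'pages/talk' + str(i) + '.html'  # i is defined elsewhere
with open(file) as f:
    soup = BeautifulSoup(f, 'html.parser')

def getTexts():
    # find_all returns every matching tag, so each hit is already a tag and needs no .parent
    for hit in soup.find_all(attrs={"class": "transcript-text-content"}):
        yield "".join(hit.find_all(string=True))  # all text nodes nested inside the tag

print("".join(getTexts()))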
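In my tests, the joined output has newlines between the elements. Joining on an empty string adds nothing itself, so those newlines come from whitespace text in my pages’ HTML and may not show up with other data.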
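To see where the newlines come from, here is a small sketch on made-up markup (the snippet and the names “pieces”, “joined” and “explicit” are assumptions for illustration). It suggests that the line breaks between tags are themselves text nodes, and shows one way to make the separator explicit instead of relying on them.

```python
from bs4 import BeautifulSoup

# Made-up markup: note the line breaks between the <p> tags.
html = '<div class="transcript-text-content">\n<p>One</p>\n<p>Two</p>\n</div>'
soup = BeautifulSoup(html, 'html.parser')
hit = soup.find(attrs={"class": "transcript-text-content"})

# string=True matches every text node, including whitespace-only ones.
pieces = hit.find_all(string=True)
joined = "".join(pieces)

# Dropping whitespace-only nodes and joining on "\n" makes the separator explicit.
explicit = "\n".join(s.strip() for s in pieces if s.strip())
```

With this input, “joined” keeps whatever whitespace the page happened to contain, while “explicit” puts exactly one newline between the pieces of text.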