Abstract
Amazon and traditional bookstores curate their lists of books based on what people want to buy, which makes it difficult to discover new or interesting material. Books on history fall victim to the “history is written by the victors” principle, which makes it difficult to find a balanced perspective on some topics.
Amazon’s scale allows it to sell a comprehensive collection of books, but it also makes it nearly impossible to maintain accurate metadata on each book, which limits the ways you can search for books.
If one built a catalog of non-fiction only, one could imagine many non-scalable ways to improve the process of finding books.
Detailed Model
In a traditional bookstore, you need to know about a book in advance, or decide based on the cover and a few pages of reading. With non-fiction, you have no way of knowing whether the author knows the topic they wrote about, or whether the people they interviewed for the book told them the truth.
The fact that the book is in the store at all simply indicates that one person thinks you’ll be interested. Amazon improves on this by providing navigation that lets you go from book to book based on shared purchasers, and by adding reader reviews. Similarly, libraries cull books based on lack of circulation, and buy books based on what people in the community are interested in.
If you walk through any bookstore or library, you will quickly see how pre-existing factors limit what topics you can learn about. For instance, in “military history,” one notices that typical books on the Iraq war are written from the perspective of the U.S. military.
In sections on the history of African or South American countries, there are typically a few types of material: narrowly focused ethnographic studies, colonial-era materials written from a European perspective, and a few very recent historical accounts, often from a Western/academic perspective. A novel way to catalog books would identify whether the author has personal knowledge of the subject (e.g. deaf history books by deaf authors, or a history of Pennsylvania by someone who lives there).
It can be difficult to find autobiographies of national leaders of post-colonial countries (many of these are out of print). It might be of interest to catalog books based on regions of the world that they are “about” (recognizing that borders shift over time). By mixing historical commentary on a region with biographies, one could fill in gaps in a nation’s history.
For books published by a university press, authors have typically published papers or given talks, which could be used to gauge how well they know their topic. That a particular philosopher has influenced many of their peers, is read in classrooms at elite universities, or is favored by some politicians are all indications of influence that justify reading their work (regardless of its actual content).
Finally, one could use demographic information about the author or subject of a book to correct selection biases. All other things being equal, a listing of biographies from the period of early American history should include biographies of presidents, leaders of slave rebellions, Native Americans, and so on, mixed as evenly as possible in search results.
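One simple way to mix results “as evenly as possible” is a round-robin over categories. The sketch below is only an illustration under assumptions: the function name interleave_by_group and the (title, category) tuples are hypothetical, and a real catalog would need to decide how categories are assigned in the first place.

```python
from collections import defaultdict, deque


def interleave_by_group(results, key):
    """Round-robin across groups, keeping each group's original ranking order.

    `results` is any ranked list; `key` maps an item to its category.
    Both are illustrative assumptions, not part of any existing catalog.
    """
    groups = defaultdict(deque)
    for item in results:
        groups[key(item)].append(item)

    # Groups are visited in the order their first item appeared in the ranking.
    queues = list(groups.values())
    mixed = []
    while queues:
        for queue in queues:
            mixed.append(queue.popleft())
        # Drop categories that have run out of items.
        queues = [queue for queue in queues if queue]
    return mixed


ranked = [
    ("George Washington", "president"),
    ("Thomas Jefferson", "president"),
    ("John Adams", "president"),
    ("Nat Turner", "slave rebellion"),
    ("Tecumseh", "Native American"),
]
mixed = interleave_by_group(ranked, key=lambda book: book[1])
```

With this input, the first three results come from three different categories before the remaining presidential biographies appear. Once a category runs out, the remaining categories simply continue in turn.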
Challenges and Opportunities
Unlike a bookstore, an Internet collection of books is not limited by inventory size. It would be possible, for instance, to set an alert for when a book becomes available.
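An availability alert could be as small as a watchlist that is compared against whatever the catalog currently lists. The following is a minimal sketch, assuming books are identified by ISBN strings and subscribers by email addresses; the Watchlist class and its method names are hypothetical, and the code does not fetch availability from any real seller.

```python
from dataclasses import dataclass, field


@dataclass
class Watchlist:
    # Maps an ISBN to the set of subscribers waiting for it (assumed identifiers).
    wanted: dict = field(default_factory=dict)

    def watch(self, isbn, subscriber):
        """Register interest in a book that is not yet available."""
        self.wanted.setdefault(isbn, set()).add(subscriber)

    def check(self, available_isbns):
        """Return (subscriber, isbn) alerts for books that are now available.

        Each alert fires once: a book is removed from the watchlist
        as soon as it has been reported.
        """
        alerts = []
        for isbn in sorted(self.wanted.keys() & set(available_isbns)):
            for subscriber in sorted(self.wanted.pop(isbn)):
                alerts.append((subscriber, isbn))
        return alerts


watchlist = Watchlist()
watchlist.watch("isbn-1", "a@example.org")
watchlist.watch("isbn-2", "b@example.org")
first_alerts = watchlist.check(["isbn-2", "isbn-3"])
second_alerts = watchlist.check(["isbn-2", "isbn-3"])
```

Here the first check reports only the second book, and the second check reports nothing because that alert has already been delivered.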
Some affiliate programs may make this difficult (Amazon’s terms forbid “mobile” optimized sites and sites that feature other sellers above them).
The IBM Watson APIs allow you to send text and resolve which “entities” are mentioned: if an author’s bio mentions a specific country or university, it is a good clue that they are associated with that place. Author bios also typically identify the gender of the author (“she went to school at Harvard”), which could be used for ranking. The same could be done for associated texts, like book descriptions. One could use TF-IDF to match the author’s essays against the book excerpt, to measure how focused their research is. Talk transcripts are typically readily available as well, in the form of subtitles (these are automatically generated on YouTube videos).
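The TF-IDF comparison can be sketched with the Python standard library alone. This is a rough illustration rather than a tuned implementation: the tokenizer is a bare regular expression, the smoothed IDF formula log((1 + N) / (1 + df)) + 1 is one common variant chosen here as an assumption, and the function name focus_scores and the sample texts are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on anything that is not a letter."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse {term: weight} vector per document."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    # Smoothed IDF (an assumed variant): rarer terms get larger weights.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return [
        {term: count * idf[term] for term, count in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def focus_scores(excerpt, essays):
    """Score each essay by its similarity to the book excerpt."""
    vectors = tfidf_vectors([excerpt] + list(essays))
    return [cosine(vectors[0], essay_vec) for essay_vec in vectors[1:]]


scores = focus_scores(
    "The canal shaped trade in colonial Pennsylvania",
    [
        "Canal trade in Pennsylvania during the colonial era",
        "A recipe for sourdough bread",
    ],
)
```

An essay that shares vocabulary with the excerpt scores higher than one that shares none. Averaging the scores over all of an author’s essays would give one crude measure of how closely their other writing tracks the book’s subject.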