Search engines can use geospatial information to enhance search results. Let’s consider scenarios where an end user looks for something that requires travel: a restaurant, meetup, or job.
- a restaurant within walking distance (e.g. 0-1.5 miles)
- a restaurant within a reasonable driving distance in an urban area (e.g. 1-5 miles)
- a restaurant within driving distance in a rural area (e.g. 1-15 miles)
- a restaurant within driving distance, but near another popular location (e.g. I need gas too)
- a restaurant within driving distance that is also easy to get to (e.g. no left turns)
- a restaurant outside a reasonable driving distance, in a vacation destination (e.g. 50+ miles away)
We can resolve these queries by treating distance as a ranking signal, or by providing the user a way to filter within an area.
Distance alone may not be enough, though – we likely need additional context clues to give good results.
Search in an Area
In Solr, you can filter to items within a circle with “geofilt” (filter within a given radius of a point), or filter to items within a square with “bbox”. There is also an add-on for searching 2D polygons. Unsurprisingly, Elasticsearch offers similar options. Postgres also has a robust geospatial extension called PostGIS.
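As a rough sketch, here is what a geofilt query against Solr might look like from Python (the core name “restaurants” and the field name “location” are assumptions, and by default the d parameter is interpreted in kilometers):

```python
import requests

SOLR_URL = "http://localhost:8983/solr/restaurants/select"  # hypothetical core
USER_LAT, USER_LON = 40.7484, -73.9857  # the searcher's location

params = {
    "q": "pizza",
    # geofilt keeps only documents whose spatial field is within d kilometers
    # of the point pt; "location" is an assumed LatLonPointSpatialField name.
    "fq": f"{{!geofilt sfield=location pt={USER_LAT},{USER_LON} d=2.5}}",
    "fl": "name,location,score",
    "wt": "json",
}

# Swap geofilt for bbox to filter on the bounding square instead of the circle:
# params["fq"] = f"{{!bbox sfield=location pt={USER_LAT},{USER_LON} d=2.5}}"

response = requests.get(SOLR_URL, params=params)
for doc in response.json()["response"]["docs"]:
    print(doc)
```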
If you know more about your users, you can use the polygon search to give better results.
For instance, if you know the areas your customer normally travels between (e.g. work, school, and home), you could draw an artificial bounding box around this area and rank results within it higher (sketched below).
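One simple way to apply that idea is to re-rank results client-side with a polygon containment check, for example using the shapely library; the coordinates and boost factor here are purely illustrative, and in production you would more likely push the polygon into the engine's own geo query:

```python
from shapely.geometry import Point, Polygon

# Hypothetical "usual travel area" polygon drawn around home, work, and school
# (coordinates are (lon, lat) pairs and purely illustrative).
usual_area = Polygon([
    (-73.99, 40.73),  # home
    (-73.97, 40.76),  # work
    (-74.00, 40.75),  # school
])

def boost_if_in_usual_area(result: dict, boost: float = 1.5) -> float:
    """Return the result's score, boosted if it falls inside the polygon."""
    point = Point(result["lon"], result["lat"])
    if usual_area.contains(point):
        return result["score"] * boost
    return result["score"]
```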
Using Distance in Ranking
You can combine a Solr/Elasticsearch relevance score with distance by simply dividing the score by the distance. Experimentally this seems to work well, with some caveats: very small distances can dominate the score unless you clamp them or add a constant.
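A minimal client-side version of that blend might look like the sketch below; the clamp is one way to handle the caveat about tiny distances:

```python
def combined_score(relevance: float, distance_miles: float,
                   min_distance: float = 0.1) -> float:
    """Blend text relevance with proximity by dividing score by distance.

    Clamping the distance keeps results that are essentially on top of the
    user from blowing up the score (and avoids division by zero).
    """
    return relevance / max(distance_miles, min_distance)

# Re-rank a page of results client-side:
results = [
    {"name": "Cafe A", "score": 4.2, "distance": 0.3},
    {"name": "Cafe B", "score": 6.1, "distance": 2.8},
]
results.sort(key=lambda r: combined_score(r["score"], r["distance"]), reverse=True)
```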
The two typical formulas for measuring geographic distance are Haversine (which treats the Earth as a sphere) and Vincenty (which models it as an ellipsoid). For most search applications the results will be indistinguishable, but Haversine requires less computation.
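For reference, here is a small Haversine implementation; the units come from the Earth-radius constant (miles here), and the two sample points are just illustrative:

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8

def haversine_miles(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points, treating the Earth as a sphere."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))

# Two points in Manhattan, roughly 2.6 miles apart
print(haversine_miles(40.7484, -73.9857, 40.7829, -73.9654))
```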
An interesting alternative way to measure distance is to assign items to a grid – then distance is the number of discrete steps between one grid point and another. (See this fascinating talk: “Tiling the Earth with Hexagons“)
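The talk describes hexagonal tiling (as in libraries like Uber's H3); the sketch below uses a much simpler square grid, purely to illustrate the idea of distance as discrete steps:

```python
def to_grid_cell(lat: float, lon: float, cell_size_deg: float = 0.01) -> tuple[int, int]:
    """Snap a coordinate to a coarse square grid (~0.7 miles of latitude per cell)."""
    return (int(lat // cell_size_deg), int(lon // cell_size_deg))

def grid_steps(a: tuple[int, int], b: tuple[int, int]) -> int:
    """Distance as the number of discrete grid steps (Chebyshev distance)."""
    return max(abs(a[0] - b[0]), abs(a[1] - b[1]))

user = to_grid_cell(40.7484, -73.9857)
restaurant = to_grid_cell(40.7829, -73.9654)
print(grid_steps(user, restaurant))  # a small integer count of "steps" instead of exact miles
```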
Providing Additional Context
I’m a fan of using step functions to give ranking hints to the search engine. We can also often improve search results by incorporating secondary datasets.
For instance, let’s consider ways we could improve search results within walking distance (pulled together in the sketch after this list):
- If a result is within a mile, give the score +5
- If we can get a dataset of neighborhood walkability scores, boost results near the user by that score
- If we can get a dataset of population density, we could use this as a proxy for walkability (i.e. use the inverse of density to represent walkability)
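Combined, those rules might look something like this; the scaling constants are arbitrary and would need tuning, and the walkability score is an assumed 0-100 value from a secondary dataset (or an inverse-density proxy):

```python
def walking_boost(base_score: float, distance_miles: float,
                  walkability: float | None = None) -> float:
    """Step-function boost for walking-distance searches."""
    score = base_score
    if distance_miles <= 1.0:        # step 1: flat bonus within a mile
        score += 5
    if walkability is not None:      # step 2: fold in the secondary signal
        score += walkability / 20    # arbitrary scaling, tune experimentally
    return score
```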
In urban areas, the direction and route to the destination are also pretty important.
A dataset of population density would also help people who are driving. If we’re traveling across a county that transitions from rural to urban, density could be used to weight results toward the direction of travel (one way to do this is sketched below).
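The density-gradient idea is one option; a simpler stand-in, shown here, is to boost results whose bearing from the user roughly matches the direction of travel. The 45-degree cone and 1.5x boost are arbitrary choices:

```python
from math import atan2, cos, degrees, radians, sin

def bearing_degrees(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Initial compass bearing from point 1 to point 2 (0 = north, 90 = east)."""
    lat1, lat2, dlon = radians(lat1), radians(lat2), radians(lon2 - lon1)
    x = sin(dlon) * cos(lat2)
    y = cos(lat1) * sin(lat2) - sin(lat1) * cos(lat2) * cos(dlon)
    return (degrees(atan2(x, y)) + 360) % 360

def direction_boost(result_bearing: float, travel_bearing: float) -> float:
    """Boost results that lie roughly along the user's direction of travel."""
    diff = abs((result_bearing - travel_bearing + 180) % 360 - 180)
    return 1.5 if diff <= 45 else 1.0  # 1.5x boost inside a 90-degree cone
```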
In some cases, we might need to tune results for a locale. For example, if we were building a job search engine for NYC, we might penalize results based on the number of bridges we have to cross.
Conclusion: A Note on Ethics
As software developers, we try to solve business problems by the most direct means available.
The above methods of tuning results for walkability capture a contemporary cultural value. To tune the search engine, they implicitly encode the mental impression a group of people has of the state of a neighborhood.
If this impression is formed entirely by people outside of the neighborhood, we create a power dynamic that can cause oppressive systems to live on from generation to generation.
For instance, if we use the existence of physical barriers or the presence of Superfund sites as quality signals for a neighborhood, we implicitly re-code the redlining of the past, which we have an ethical duty to avoid.
These impressions will also change over time as a neighborhood changes, and we don’t want to lock an area into our current perceptions.