Where 2.0: Crawling the web for GeoData
Juan Gonzalez
PlanetEye started by looking at the world of information available in the form of mashups. We heard from Google earlier that > 50,000 sites are already using the Google Maps API; one can only imagine how many data points are available. It would be interesting if a crawler could go and collect the data that's published.
Google has My Maps; he hears there are > 9 million My Maps created by the public.
The data behind these mashups is highly structured.
Location information isn't always structured in this way; there are other forms, like an address. He shows an example from the NYT referring to a physical location: the address is easier to crawl but presents a number of new challenges, and it's just as valuable as the adjoining Google map.
The problem with addresses is that they come in many formats. When analysing worldwide information you come across addresses in all of them. Humans can figure them out, but machines can't.
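He doesn't show PlanetEye's parser, but a minimal sketch of why machines struggle: even a naive matcher needs one pattern per regional convention, and the patterns below (all hypothetical illustrations, not PlanetEye's rules) barely scratch the surface.

```python
import re

# Hypothetical patterns -- real-world coverage needs far more than this.
ADDRESS_PATTERNS = [
    # US style: "285 Fulton St, New York, NY 10007"
    re.compile(r"\d+\s+[\w.\s]+,\s*[\w\s]+,\s*[A-Z]{2}\s+\d{5}"),
    # UK style: "10 Downing Street, London SW1A 2AA"
    re.compile(r"\d+\s+[\w\s]+,\s*[\w\s]+\s+[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}"),
    # Continental style: "Calle Mayor 5, 28013 Madrid"
    re.compile(r"[\w\s]+\s+\d+,\s*\d{4,5}\s+[\w\s]+"),
]

def find_addresses(text: str) -> list[str]:
    """Return substrings that look like postal addresses in any known format."""
    hits = []
    for pattern in ADDRESS_PATTERNS:
        hits.extend(match.group(0).strip() for match in pattern.finditer(text))
    return hits
```

Every new country or local convention means another pattern, which is exactly why crawled addresses are so much harder than the structured data behind a mashup.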
We experimented with a number of techniques. One is low-resolution geocoding: EveryBlock mentioned going beyond the point marker, and we're going the same way. With these addresses it's probably possible to get an accurate location, but it's easier to get a general one, so we allow for that and assume we won't always know the exact location.
This isn't the same as geotagging to a centerpoint; it's a different technique, but it improves our chances of managing some geocoding.
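One way to read "low-resolution geocoding" is as a fallback chain: try the full address, and if that fails settle for the city or country, recording how coarse the result was. A minimal sketch under that assumption (the `geocode` callable and the precision labels are made up for illustration):

```python
from typing import Callable, Optional

# Hypothetical geocoder: returns (lat, lon) or None when it can't resolve the query.
Geocoder = Callable[[str], Optional[tuple[float, float]]]

def low_res_geocode(street: str, city: str, country: str,
                    geocode: Geocoder) -> Optional[dict]:
    """Try progressively coarser queries and record the precision we settled for."""
    attempts = [
        (f"{street}, {city}, {country}", "street"),  # exact point, if we're lucky
        (f"{city}, {country}", "city"),              # general location
        (country, "country"),                        # last resort
    ]
    for query, precision in attempts:
        point = geocode(query)
        if point is not None:
            return {"lat": point[0], "lon": point[1], "precision": precision}
    return None  # consistent with "we won't always know the location"
```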
The next challenge is the same place being referred to in many different ways. All the different examples refer to exactly the same point; the challenge is working out that they're the same thing. Location alone is not enough, so you have to look beyond it: the telephone number, elements of the name, and so on can help.
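The signals he names (location, telephone, elements of the name) suggest a simple matching rule. The thresholds and weights below are invented for illustration, not PlanetEye's actual model:

```python
import math
import re

def normalize_phone(phone: str) -> str:
    """Strip formatting so '+1 (212) 555-0100' and '212.555.0100' can agree."""
    return re.sub(r"\D", "", phone)[-10:]  # keep the last 10 digits

def name_overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercase name tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def same_place(p1: dict, p2: dict, max_km: float = 0.5) -> bool:
    """Heuristic: nearby, and either the phone matches or the names largely agree."""
    # Rough equirectangular distance; fine at neighbourhood scales.
    dlat = math.radians(p2["lat"] - p1["lat"])
    dlon = math.radians(p2["lon"] - p1["lon"]) * math.cos(math.radians(p1["lat"]))
    km = 6371 * math.hypot(dlat, dlon)
    if km > max_km:
        return False  # location alone can't confirm a match, but it can rule one out
    phone1, phone2 = normalize_phone(p1.get("phone", "")), normalize_phone(p2.get("phone", ""))
    if phone1 and phone1 == phone2:
        return True
    return name_overlap(p1["name"], p2["name"]) > 0.6
```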
Once we have the ability to match places, we can start defining implicit links between the multiple sites referring to the same location. If this goes well we'll end up with a very large number of data points. Current visualization techniques tend to work in one of two ways: break the very large dataset down and show a few points at a time, which is good for the end user but doesn't give the context of all the data points; or put them all on the map and hope for the best, which quickly becomes unusable.
PlanetEye can take the entire dataset and let the user appreciate it [using circles that get bigger when there are more elements around a location]. It doesn't matter how many points there are; they're returned at the same speed.
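The bracketed note describes what sounds like server-side clustering: bucket points into a grid, return one circle per occupied cell, sized by count. A sketch of that idea (the cell size and the return shape are assumptions, not PlanetEye's implementation):

```python
from collections import defaultdict

def cluster(points: list[tuple[float, float]], cell_deg: float = 0.05) -> list[dict]:
    """Bucket points into a lat/lon grid and emit one circle per occupied cell.

    The number of circles is bounded by the grid, not the dataset, and cells
    can be precomputed -- one way results could come back at the same speed
    regardless of how many points there are.
    """
    buckets: dict[tuple[int, int], list[tuple[float, float]]] = defaultdict(list)
    for lat, lon in points:
        buckets[(int(lat // cell_deg), int(lon // cell_deg))].append((lat, lon))
    return [
        {
            "lat": sum(p[0] for p in pts) / len(pts),  # circle centre
            "lon": sum(p[1] for p in pts) / len(pts),
            "count": len(pts),                         # drives the circle radius
        }
        for pts in buckets.values()
    ]
```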
At PlanetEye we're trying to create a travel guide by pulling in all the travel information on the web.
Technorati tags: crawling, geodata, planeteye, where, where2.0, where2008