
Fun with Webcrawling with Azure

I’ve been messing around with Azure (and the free $200 of account credit for signing up) to run a web crawler on Seattle craigslist automotive listings. The crawler continuously scrapes the Seattle “cars and trucks” section and logs every listing into a JSON database with title, location, price, vehicle attributes, etc. It feeds all the data into Cosmos DB, which can then be searched with Azure Search. For now, however, I’m simply dumping the data into a JSON database and running my own Python scripts to manipulate and search it.
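As a rough illustration of the "dump to JSON, then search with Python" step, here is a minimal sketch that assumes the listings are stored one JSON object per line (JSON Lines). The file name, the helper names, and the fields are hypothetical and are not the crawler's actual schema.

```python
import json
import tempfile
from pathlib import Path


def append_listing(path, listing):
    """Append one listing (a dict) to a JSON Lines file, one object per line."""
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(listing, ensure_ascii=False) + "\n")


def load_listings(path):
    """Read every listing back into a list of dicts, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


if __name__ == "__main__":
    # Hypothetical example record; real listings would come from the crawler.
    with tempfile.TemporaryDirectory() as tmp:
        db = Path(tmp) / "listings.jsonl"
        append_listing(db, {"title": "1997 sedan", "location": "Seattle", "price": 3500})
        print(load_listings(db))
```

Appending one object per line means the crawler never has to rewrite the whole file, and a script can read the data back with a single list comprehension.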

For example: show me all cars from 1995 to 1998 that are not trucks or SUVs, and that have a 6-cylinder engine if they are a European model or an 8-cylinder engine if they are an American model. This kind of filtering cannot be done on Autotrader.
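The query above could be written as a plain Python predicate along these lines. This is only a sketch: the field names (`year`, `type`, `make`, `cylinders`) and the two make lists are assumptions for illustration, and it assumes `year` and `cylinders` are stored as integers. It also assumes that makes outside both lists (Japanese cars, for instance) are left out of the results, which the original query does not specify.

```python
# Hypothetical, deliberately incomplete make lists for illustration only.
EUROPEAN_MAKES = {"bmw", "mercedes-benz", "audi", "volkswagen", "volvo", "saab"}
AMERICAN_MAKES = {"ford", "chevrolet", "dodge", "pontiac", "buick", "cadillac"}
EXCLUDED_TYPES = {"truck", "suv"}


def matches(listing):
    """Return True if a listing dict satisfies the example query."""
    year = listing.get("year")
    if year is None or not 1995 <= year <= 1998:
        return False
    if str(listing.get("type", "")).lower() in EXCLUDED_TYPES:
        return False
    make = str(listing.get("make", "")).lower()
    cylinders = listing.get("cylinders")
    if make in EUROPEAN_MAKES:
        return cylinders == 6
    if make in AMERICAN_MAKES:
        return cylinders == 8
    # Assumption: makes that are neither European nor American are dropped.
    return False


def filter_listings(listings):
    """Keep only the listings that satisfy the example query."""
    return [item for item in listings if matches(item)]
```

Calling `filter_listings` on a list of listing dicts loaded from the JSON dump would return only the matching cars.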

I’m still developing it as a side hobby, but my plan is to start including more websites in the crawler: for example, automotive forums and classifieds, car dealerships’ own certified pre-owned listings, etc. The goal is to be able to filter and search purely on location and vehicle attributes, without having to check 4-5 different listing sites.

Here is the real-time log of the crawler:

craigslist_webcrawling.png

It uses Scrapy (pip install Scrapy). More to come. I may try to hook it up to a mobile iOS/Android front-end, depending on time, costs, and legal concerns.