Introducing arXivES

I’ve been wanting to experiment with Elasticsearch for some time now, and I finally got around to it.


First, I needed to find a somewhat relevant use case for Elasticsearch. I thought about products with an effective Instant feature. Facebook Instant searches its index of people, places, and things in your social graph. YouTube Instant searches the videos on its platform. Robinhood Instant (not to be confused with the product of the same name) searches stocks that can be traded on its platform.

So I rummaged around for another product that might benefit from an Instant feature and settled on publications on arXiv.


I scraped information from arXiv with urllib2 and BeautifulSoup, then indexed it in Elasticsearch following the Elasticsearch tutorials. With the publications indexed, building out the application functionality mostly took care of itself. In terms of design, I knew the search bar needed to be central and evident, and I also knew the results needed to be consistent in terms of both content and style.
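To give a flavor of the indexing and query side, here is a minimal sketch in modern Python using only the standard library. The index name, field names (`arxiv_id`, `title`, `authors`, `abstract`), and the choice of `match_phrase_prefix` for the as-you-type behavior are my assumptions, not necessarily what arXivES actually uses; the newline-delimited body format for the `_bulk` endpoint is standard Elasticsearch, though.

```python
import json

# Hypothetical publication records; the field names are illustrative.
publications = [
    {"arxiv_id": "1706.03762", "title": "Attention Is All You Need",
     "authors": ["A. Vaswani"], "abstract": "..."},
    {"arxiv_id": "1512.03385", "title": "Deep Residual Learning",
     "authors": ["K. He"], "abstract": "..."},
]

def bulk_payload(index, docs):
    """Build the newline-delimited JSON body for Elasticsearch's _bulk
    endpoint: one action line, then one source line, per document."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index,
                                           "_id": doc["arxiv_id"]}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # _bulk requires a trailing newline

def instant_query(text_so_far):
    """An as-you-type query: match_phrase_prefix treats the final term
    in the user's input as a prefix, which suits an Instant search box."""
    return {"query": {"match_phrase_prefix": {"title": text_so_far}}}

payload = bulk_payload("arxives", publications)
```

The `payload` string would be POSTed to `/_bulk` with a `Content-Type: application/x-ndjson` header, and the dict from `instant_query` sent as the body of a `_search` request on each keystroke.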

My helpful screenshot

Building an effective scraper to save myself manual work turned out to be the most challenging part of the implementation. arXiv blacklisted my IP address after about 1,200 requests from my initial script, and since arXiv grows by roughly 8k publications per month, the scraper has to keep running. Right now, I manually connect to a new VPN every time the current IP gets blacklisted. I'd be interested to hear whether there's a better way to avoid getting blacklisted in the first place.
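I don't know arXiv's actual rate limits, but one obvious mitigation is client-side politeness: a fixed delay between requests plus exponential backoff on failures. The delay and retry counts below are made-up numbers, and `fetch` stands in for whatever actually performs the request (e.g. a thin wrapper around urllib2's `urlopen` in the original script):

```python
import time

def polite_fetch(fetch, urls, delay=3.0, max_retries=3):
    """Call fetch(url) for each url, sleeping `delay` seconds between
    requests and backing off exponentially when fetch raises IOError.
    `fetch` is any callable that returns the fetched content."""
    results = []
    for url in urls:
        for attempt in range(max_retries):
            try:
                results.append(fetch(url))
                break
            except IOError:
                # Wait delay, 2*delay, 4*delay, ... before retrying.
                time.sleep(delay * 2 ** attempt)
        time.sleep(delay)  # baseline politeness delay between requests
    return results
```

This doesn't guarantee anything about arXiv's blacklisting policy, of course; it just keeps the scraper from hammering the server in bursts, which is usually what triggers blocks.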


After building a functional app in dev, I set out to deploy it. To save on server costs, I opted to use the Amazon AWS Free Usage Tier, which affords a t2.micro instance. This decision resulted in a couple of minor issues.

Other than these issues, deployment went pretty smoothly. A slight gotcha with the logrotate wildcard: the default configuration's `/var/log/nginx/*.log` pattern does not match /var/log/nginx/arxives/{access,error}.log, since `*` does not cross the path delimiter `/`.
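The fix is an explicit entry for the subdirectory. Something like the following should work, though the rotation options here are illustrative rather than copied from my actual config (the `postrotate` signal is the standard way to make nginx reopen its log files):

```
/var/log/nginx/arxives/*.log {
    weekly
    rotate 4
    compress
    missingok
    notifempty
    sharedscripts
    postrotate
        [ -f /var/run/nginx.pid ] && kill -USR1 `cat /var/run/nginx.pid`
    endscript
}
```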


Please feel free to try it out!