I’ve been wanting to experiment with Elasticsearch for some time now, and I finally got around to it.
First, I needed to find a somewhat relevant use case for Elasticsearch. I thought about products that used an effective Instant feature. Facebook Instant searches its index of the people, places, and things in your social graph. YouTube Instant searches the videos on its platform. Robinhood Instant (not to be confused with the product of the same name) searches stocks that can be traded on its platform.
So I rummaged about for another product that might benefit from an Instant feature and settled on publications on arxiv.
I scraped publication metadata from arxiv with urllib2 and BeautifulSoup, then indexed it in Elasticsearch following the Elasticsearch tutorials. With the publications indexed, building out the application functionality took care of itself. In terms of design, I knew the search bar needed to be central and evident, and I also knew the results needed to be consistent in both content and style.
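The indexing step can be sketched roughly as follows. The index name `arxiv` and the field names are my assumptions, not necessarily what I used; building the bulk-API payload is plain JSON work, so it is shown here without the HTTP call (the payload would be POSTed to `http://localhost:9200/_bulk`):

```python
import json

def make_bulk_payload(publications, index="arxiv"):
    """Build an Elasticsearch bulk-API body: alternating action and
    document lines of newline-delimited JSON, ending with a newline."""
    lines = []
    for pub in publications:
        # Action line: index each publication under its arxiv identifier.
        lines.append(json.dumps({"index": {"_index": index,
                                           "_type": "publication",
                                           "_id": pub["id"]}}))
        # Source line: the fields the Instant search queries against.
        lines.append(json.dumps({"title": pub["title"],
                                 "authors": pub["authors"],
                                 "abstract": pub["abstract"]}))
    return "\n".join(lines) + "\n"
```

The `_type` field reflects the Elasticsearch 5.x mapping-type convention mentioned later in this post; newer versions drop it.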
Building an effective scraper to save myself work turned out to be the most challenging part of the implementation. arxiv blacklisted my IP address after about 1,200 requests from my initial script, and since arxiv grows by roughly 8k publications per month, the scraper needs to keep running. Right now, I manually connect to a new VPN every time the current IP gets blacklisted. I'm interested in whether there's a better way to avoid getting blacklisted in the first place.
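The simplest mitigation I know of is throttling: spacing requests out so the scraper looks less like a burst of automated traffic. A minimal sketch, where the `fetch` callable stands in for whatever request function the scraper uses (urllib2's `urlopen` in my case) and the 3-second delay is an illustrative guess, not a number arxiv publishes:

```python
import time

def fetch_all(urls, fetch, delay=3.0):
    """Fetch each URL in order, pausing between requests.

    `fetch` is the scraper's request function; `delay` is the number of
    seconds to sleep between requests (3.0 is an arbitrary example).
    """
    results = []
    for url in urls:
        results.append(fetch(url))
        time.sleep(delay)
    return results
```

Whether a fixed delay is enough to stay under arxiv's threshold is something I'd have to measure.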
After building a functional app in dev, I set out to deploy it. To save on server costs, I opted to use the Amazon AWS Free Usage Tier, which affords a t2.micro instance. This decision resulted in a couple of minor issues.
Since a t2.micro instance has only 1GB of RAM, the host machine ran out of memory when it tried to allocate 2GB for the heap of the JVM (Java Virtual Machine), on which Elasticsearch relies; the Elasticsearch configuration specifies 2GB as the default. For Elasticsearch v5.1.1, the settings for the initial and maximum sizes of the memory allocation pool can be modified in config/jvm.options.
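For Elasticsearch 5.x these two settings live in config/jvm.options as JVM flags; lowering both to, say, 512m (an illustrative value for a 1GB machine, not a tuned recommendation) avoids the allocation failure:

```
# config/jvm.options
-Xms512m   # initial heap size (default: 2g)
-Xmx512m   # maximum heap size (default: 2g)
```

Elasticsearch recommends keeping the initial and maximum sizes equal so the heap is never resized at runtime.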
AWS sequesters the opening and closing of ports into security groups, and HTTP port 80 starts closed by default.
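Opening the port can be done from the Security Groups page of the EC2 console, or equivalently with the AWS CLI; a sketch, where sg-xxxxxxxx is a placeholder for the instance's security group ID:

```shell
# Allow inbound HTTP traffic on port 80 from anywhere.
aws ec2 authorize-security-group-ingress \
    --group-id sg-xxxxxxxx \
    --protocol tcp --port 80 \
    --cidr 0.0.0.0/0
```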
Other than these issues, deployment went pretty smoothly. One slight gotcha with logrotate: the wildcard * does not match the path delimiter, so the default configuration does not cover logs in nested directories.
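Because * stops at /, each directory level needs its own glob pattern; a sketch of a logrotate stanza that covers one level of nesting (the paths and rotation options are illustrative, not what I deployed):

```
# /etc/logrotate.d/myapp -- paths are illustrative
/var/log/myapp/*.log /var/log/myapp/*/*.log {
    weekly
    rotate 4
    compress
    missingok
}
```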
Please feel free to try it out at http://arxives.davideng.me!