Similarity Join on Spark

Similarity Join is a widely used technique for finding similar pairs among (usually) string vectors such as phrases, sentences or whole paragraphs of text. The basic idea is to define a metric that computes a similarity score for each pair and, if the score meets a given threshold, to output the pair as similar together with its score. The choice of metric depends on the nature of the vectors being compared, as there are multiple ways to compute similarity. I have implemented Similarity Join in Apache Spark using the Jaccard similarity metric and count filtering, which reduces the runtime by reducing the number of comparisons to perform.
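
To make the idea concrete, here is a minimal single-machine sketch in plain Java. It is not the Spark implementation from the post; the class name, the whitespace tokenization and the method names are assumptions made for illustration. It relies on the fact that two sets can only reach a Jaccard score of t if they share at least t / (1 + t) * (|a| + |b|) tokens, so pairs that cannot reach that count are skipped before the score is computed.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.OptionalDouble;
import java.util.Set;

// Hypothetical helper class, not taken from the original post.
class JaccardSketch {

    // Assumed tokenization: lowercase, split on whitespace, keep distinct words.
    static Set<String> tokens(String text) {
        return new HashSet<>(Arrays.asList(text.toLowerCase().trim().split("\\s+")));
    }

    // Jaccard(a, b) >= t is equivalent to overlap >= t / (1 + t) * (|a| + |b|).
    // The small epsilon keeps floating-point noise from rounding the bound up.
    static int requiredOverlap(int sizeA, int sizeB, double threshold) {
        return (int) Math.ceil(threshold / (1.0 + threshold) * (sizeA + sizeB) - 1e-9);
    }

    // Returns the Jaccard score if it meets the threshold, otherwise empty.
    static OptionalDouble similarity(Set<String> a, Set<String> b, double threshold) {
        int required = requiredOverlap(a.size(), b.size(), threshold);

        // Cheap prune: the overlap can never exceed the size of the smaller set.
        if (Math.min(a.size(), b.size()) < required) {
            return OptionalDouble.empty();
        }

        // Count shared tokens by probing the larger set with the smaller one.
        Set<String> smaller = a.size() <= b.size() ? a : b;
        Set<String> larger = (smaller == a) ? b : a;
        int overlap = 0;
        for (String token : smaller) {
            if (larger.contains(token)) {
                overlap++;
            }
        }

        // Count filter: too few shared tokens means the threshold is unreachable.
        if (overlap < required) {
            return OptionalDouble.empty();
        }

        int union = a.size() + b.size() - overlap;
        if (union == 0) {
            return OptionalDouble.empty(); // both sets are empty
        }
        double score = (double) overlap / union;
        return score >= threshold ? OptionalDouble.of(score) : OptionalDouble.empty();
    }
}
```

For example, "the quick brown fox" and "the quick red fox" share 3 of 5 distinct words, so with a threshold of 0.5 the pair is kept with a score of 0.6, while a threshold of 0.8 would require 4 shared words and the pair is dropped. In a distributed join the shared-token counts would more likely come from grouping records by token than from comparing every pair directly.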

Upgrading MongoDB from 2.x to 3.2.x

If you have installed MongoDB from your distribution's default packages (for example on Ubuntu 14.04), then you are using v2.4, which is an archaic version; MongoDB has improved greatly since then, particularly with the WiredTiger and RocksDB storage engines. If you have read the upgrade instructions in the official documentation, you will have seen that you need to follow the version chain 2.4 -> 2.6 -> 3.0 -> 3.2 to upgrade. Wait, what? In practice, it just does not work out. In this post, I will describe my tested way of upgrading directly from 2.4 to 3.2 (or latest) and switching to the WiredTiger storage engine at the same time.


JUnit: Multiple @RunWith & Dependent Tests

There are times when you may want to run a test case with multiple @RunWith annotations, or you may have a bootstrap test class that must run before every other test, for example to bootstrap the application during integration testing. JUnit by its nature does not support test class chaining, but there is an easy solution.


Sieve Email Filtering with Dovecot and Roundcube

Sieve filtering is a very useful feature to have on your own email server, as it lets you define powerful filtering rules for incoming mail. In this guide I will explain the steps I used to enable Sieve filtering on my server.

I assume your (optionally multi-domain, Debian-based) email server is already configured with Postfix, Dovecot and Roundcube, and that you would like to enable Sieve filtering and manage the filters using Roundcube’s managesieve plugin.


Unit Testing vs Integration Testing

In the table below, I have summarized the properties of Unit Testing and Integration Testing, and the differences between them, with regard to Maven.

| | Unit Testing | Integration Testing |
|---|---|---|
| Tests | easily reproducible headless cases: utility functions, partially independent business logic, isolated functionalities with mocking | reproducible realistic cases (including, but not limited to, UI), often depending on other systems with little to no mocking but still isolated from production instances |
| Applicable Project Types | any project, particularly important for library or utility projects | shippable projects, often the last module(s) in multi-module project chains |
| Coverage Goals | increase as much as possible | complete it |
| Runtime Duration | instant to short | medium to long |
| Typical Run Interval | often: after every push to main branch (must be automated with CI) or fixed intervals like 15 mins to 1 hr (depending on the project) | not-so-often: before releases/after sprints or fixed intervals (like once a day or a week), automated or manual runs |
| Runtime Dependencies | none, must be self-sufficient | database, other services in the network (though it is best if those get automatically configured at runtime) |
| Maven Goal | test | verify |
| Maven Plugin | maven-surefire-plugin | maven-failsafe-plugin |
| Java Class Naming Convention | Ending with Test | Ending with IT |

There are many articles on testing, but they are often very long and hard to follow. I hope this summary is useful to some people.

Automated Building and Deployment of Docker Microservices

In this guide I will focus on automatic deployments using Spring Boot and Docker. My intention was to build and deploy services developed with Spring Boot seamlessly, without using a full-fledged CI tool. A simple bash script automates the process of getting the source code, building, dockerizing, deploying and running. My requirement was that these operations must be as easy as clicking a URL.


How to create a linux service

You often need to turn a command into a Linux service. Here I present a simple and efficient init.d template script and recall the commands to enable or disable your service on system startup.

You need to supply four basic parameters:

  • Service name
  • Full path to executable (with arguments)
  • Working directory
  • User name for the spawned process


Java keystore generation regarding intermediate certificates

If you’re hosting a Java application server such as Tomcat or Jetty, you’d definitely like to configure SSL for end-to-end security. Java (by default) ships its own SSL implementation, which requires you to store private keys and certificates in a keystore file, managed with the keytool command that comes installed with Java.

SSL certificates signed by a trusted authority (such as StartSSL) usually come with intermediate certificates, which need to be served by the server to pass the validity checks done by the client’s browser. Many guides on the web fail at this part.


you will be shocked to discover how it’s easy in life to part ways with people forever. that’s why when you find someone you want to keep around, you do something about it.

Pig: Reparsing Strings into Tuples in Java

Recently, I needed to read text that was stored with PigStorage. The text also had internal bag and tuple structures, so I didn’t want to reinvent the wheel. However, there is no direct documentation about this, so you have to dig into the Pig source to find out how Pig itself reads it. Luckily enough, I’ve found it.

import org.apache.pig.ResourceSchema.ResourceFieldSchema;
import org.apache.pig.builtin.Utf8StorageConverter;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.newplan.logical.relational.LogicalSchema;
import org.apache.pig.impl.util.Utils;

Let’s say your string to be parsed is this:

String tupleString = "(quick,123,{(brown,1.0),(fox,2.5)})";

First, parse your schema string. Note that you have an enclosing tuple.

LogicalSchema schema = Utils.parseSchema("a0:(a1:chararray, a2:long, a3:{(a4:chararray, a5:double)})");

Then parse your tuple with your schema.

Utf8StorageConverter converter = new Utf8StorageConverter();
ResourceFieldSchema fieldSchema = new ResourceFieldSchema(schema.getField("a0"));
Tuple tuple = converter.bytesToTuple(tupleString.getBytes("UTF-8"), fieldSchema);

Voila! Check your data.

// JUnit convention: expected value first, actual value second.
assertEquals("quick", (String) tuple.get(0));
assertEquals(2L, ((DataBag) tuple.get(2)).size());
