By this time, you should have completed Analytica in 5 steps. In this tutorial, we’ll continue on with our example of analyzing Stack Overflow user profiles. In the last part we saw we could calculate metrics for our data set of users such as their average and maximum reputation score. We calculated metrics for each individual users and then also grouped and ordered our data to see how we could determine which were the most popular locations for StackOverflow users. In this part, we’ll see how we can use Analytica’s JOIN function to combine together our users and badges collections to discover things like the average number of badges that users have, as well as whether each individual user is ahead or behind the average. Lastly, we’ll see how we can use MongoDB’s query language to filter out documents that we don’t want to analyze at the database level.
6. Joining multiple collections
Even though document oriented databases such as MongoDB reduce the need for joins by allowing “pre-joined’ data in a document (i.e. sub-collections), in more complex applications you may have multiple collections that you need to analyze together. In our data set we have a users collection and a badges collection. We’d like to do some explore the relationship between users and the badges they have earned. To do this, Analytica allows you to join collections like so:
SET so.users.badges = select(so.badges.where(UserId = My._so._users.Id))
If you are in the Analytica Shell, you can type that command in. If you are in in Excel, you use the calculated properties pane to SET the new property:
Taking this apart:
The first part of the select argument specifies the collection to select from, so.badges. The where part of the select statement is deconstructed as follows:
UserId: matches each UserId in the so.badges collection
My._so._users.Id: matches the current id of the current user document being examined.
This effectively examines each document from the so.users collection, and finds corresponding documents from so.badges. The correspondence is established by comparing the Id in the documents of so.users with the UserId of each document in so.badges. The matching badges are added as an embedded array/sub-collection ‘badges’ to each document in so.users. Now each user document has the badges the user has earned embedded within in. The resulting structure of our documents looks now like this:
[deeper explanation of 'My' keyword needs to go here]
Now that we have our badges associated with users, we can then determine the average number of badges that our users have earned using SET so.avgbadges = average(so.users.count(badges))
One thing to note here is the placement of the count function. so.users.count(badges). The placement of count after users tells Analytica to return the count of badges for each user. Consequently, the average function receives a set of numbers, each element of which corresponds to the count of badges for a single user. It can then average them and return the result we want. If we had done average(count(so.users.badges)), count would have returned a single number representing the number of badges across all users. The average function would have received this single number and consequently it would give us a different result than we were looking for.
7. Calculations within and across documents
So far, we’ve seen how we can calculate metrics based on properties of all (or a subset) of collections e.g. average reputation of all users or the count of all anonymous users. We’ve also seen how we can calculate a property for each document based on field(s) in the document – we used that technique to calculate the grumpiness score. In this step, we’ll see how we can create a property for each document in a collection based on calculations
One of the things we want to calculate is whether the number of badges a user has earned is above or below the average for all other users. We’ll call this a user’s ‘badge score’. Now that each users has the badges assigned to him in a sub-collection from our JOIN, we can find out. We do this through another SET operation as follows:
SET so.users.badgescore = count(My._so._users.badges) – average(so.users.count(badges))
Let’s break this down. We’re creating a property called badgescore for document in so.users. In order to calculate that score, we’re taking the difference of the number of badges that a user has earned and the average number of badges that all users have earned. In the portion of the expression to the left of the minus sign, we use ‘My’ to indicate that we are referring to the current user. To the right of the minus sign, we calculate the average number of badges across all users (which we saw how to do in the previous step). If a user has earned less than the average number of badges, his badgescore will be negative and if he has earned more than the average number of badges, his score will be positive.
We can then list out badge scores for users as needed.
8. Filtering documents at the source
In step 5, we saw how we can create a calculated collection based on a subset of documents. This is useful when you want to perform analysis on an entire dataset as well as a subset. Sometimes, though, we only care about a subset of our data and not the entire set. In situations like this, it can make sense to filter documents at the source i.e. in MongoDB before loading them into analytica. This can be accomplished by created a virtual collection. A virtual collection is defined by a MongoDB query and contains only the results of that query. For example, you could create a virtual collection by filtering documents by a date field (say the to retrieve the last month of data) or by matching users that came from a certain location. For this example, we’ll only retrieve users who are not anonymous. As you’ll remember from step 5, anonymous users are those who do not have a DisplayName that contains the text ‘user’. If you are using Excel, to create a virtual collection, you’ll first have to open an Analytica query sheet. A query sheet is a specialized sheet that allows you to execute Analytica commands. Click the “Create” button in the Analytica tool bar as in the screenshot:
In the box labeled XL type the following:
setvc(namedusers, so, users, “{DisplayName: /^((?!user).*)$/}”)
This tells Analytica to create a virtual collection called namedusers in the database so. The collection contains documents from the collection users that match the query {DisplayName: /^((?!user).*)$/}. This is MongoDB regular expression query that selects documents where the DisplayName field does not contain the text “user”.
After finishing this step, you’ve walked through much of Analytica’s basic functionality. Start on your own datasets or browse around more of the documentation




