This example will walk you through 5 steps to introduce you to the features of Analytica. The sample dataset we will use is from StackOverflow. StackOverflow is a question-and-answer site for programming topics. Users gain reputation points if other other users vote up their answers. Users are also incentivized by getting badges for answering questions, editing, and voting. We’ll use Analytica to analyze StackOverflow user profiles to determine things such as what are average and maximum reputation points, how often they upvote and downvote, and which countries/cities have the most users. Walking through this tutorial should take about 20-30 minutes. In each step we’ll outline the step you need to do in Excel (if you are using Analytica for Excel) and the commands you need to type in the shell (if you are using the Analytica Command Shell).
1. Calculating properties for a collection
2. Retrieving properties from documents and collections
3. Adding a property to each document in a collection
4. Analyzing subsets of collections
5. Grouping and ordering
6. Complete Workbook
Loading the sample data set
If you haven’t already, you’ll need to install Analytica. Once it’s installed, we’ll start out by loading our dataset. We’ve set up a hosted MongoDB database with our sample dataset. The database is named ‘so’ and is located at datasets.analytica.com and port 43657 with the read-only username set to ‘analytica’ and the password set to ‘analytica’.
If you are in Excel, you’ll start by checking the “Enable Workbook for Analytica”, then clicking on “Manage Connections”. You’ll be presented with a workbook to manage your database connections. Fill it in, as follow, select the ‘so’ cell, and then hit ‘Connect’. You’ll now be connected
If you are running in the Analytica Shell, you can type the following:
connect datasets.analytica.com 43657 analytica analytica so MongoDB
and you will be connected to the remote database.
Note: if you are connecting to a database without a username and password from the Analytica shell, use 3 dashes to indicate a blank username or password e.g connect myserver port — — mydb MongoDB
If you’d like to download the data sets to your own MongoDB server, you can follow these instructions.
Once you are connected, if you are in Excel, click the “Types” button to show the types pane. You should see the Types pane show up on the right side. The Types pane shows you the structure of your MongoDB databases and the the documents within them. With our sample dataset you will see the ‘so’ database (short for StackOverflow!) with the ‘users’ collection and the ‘badges’ collection. Documents in the ‘users’ collection contains one document for each user, with their name, location, reputation, votes, etc as fields. The ‘badges’ collection has one document for each badge that users have earned. In this tutorial, we’ll just be working with the users collection.
1. What are the maximum, minimum, and average reputation scores for users? [Calculating properties for a collection]
Now that you are connected to your data set, we’ll start by calculating the maximum, minimum, and average reputation for our Stack Overflow users. To do that we need tell Analytica to add a calculated properties to our so database. In Analytica, this is called a SET operation. In this guide we’ll write SET operations as follows:
SET so.maxreputation = max(users.Reputation)
If you are using the Analytica shell, you can just type that expression. If you are in Excel, you can SET expression with the Calculated Properties pane. You’ll need to click the ‘Calculated Properties’ button in the Analytica menu in Excel to show the calculated properties pane. You can then run a set operation by typing in the name and the expression. Here is how you would do it for SET so.maxreputation = max(users.Reputation) :
Taking this apart:
We’re adding a calculated property maxreputation to the database so.
If you are using Analytica for Excel, you will see this new property appear in the Type Tree under the so database, but above the users and badges collections.
The value that so.maxreputation will contain is the maximum value of the Reputation property for all users. Analytica will examine each document in the users collection, and retrieve the maximum value for the Reputation property and store it in maxreputation
Note: Because we are creating the ‘maxreputation’ property under the ‘so’ level, the expression for the property can be specified just as ‘users.Reputation’ instead of ‘so.users.Reputation’.
The resulting property will not be in the documents of the users collection. Instead, it will be attached right under the database so.
We can similarly calculate the average reputation by SET so.avgreputation = average(users.Reputation) and the minimum reputation by SET so.minreputation = min(users.Reputation)
These the max, min, and average are just a few of the functions you can apply to your MongoDB data. Analytica has over a 100 different mathematical, statistical and string manipulation functions [if you have the need for additional functions, just ask us at firstname.lastname@example.org]
Analytica accesses any fields we have in our databases, collections, and documents using what are called GET commands. For example, if we want to see the maximum reputation score we calculated for users, we need to execute GET so.maxreputation
This will look for the ‘so’ database, and then return the ‘maxreputation’ field contained in that database. If you are in the Analytica shell, you can type that command and see the result. If you are in Excel, you can enter the following expression in a cell =AGET(so.maxreputation) and hit enter. The function =AGET() is short for “AnalyticaGet” and tells Analytica to perform a GET operation. In our test data set, this will return 102.66 for the maximum reputation.
GET operations work on either calculated properties or properties from your database. For example, you can run GET so.users.DisplayName in the shell. This returns all the DisplayNames for our users. If you are in Excel, select a range of cells in a column and type =AGET(so.users.DisplayName)
Then hit contol-shift-enter. This will populate all the cells you selected with DisplayNames. If you selected 10 cells, then 10 names will be displayed (this is called an array formula in Excel).
If you look at your Type pane, you should also now be able to see the max, min, and average properties you just created under the ‘so’ data. If you are using Excel, your worksheet should look something like this now
3. Are users grumpy?! [Adding a property to each document in a collection]
A key action on the StackOverflow site is up voting and down voting. Let’s say we wanted to see if community members were down voting more than they were upvoting (probably not a good thing!). To do this, let’s create a property for each user. We’ll call this a user’s ‘grumpiness’ score. In order to calculate the grumpiness for each user, we’d have to add a property to each user profile document in the so.users collection. We can so this with a SET statement:
SET so.users.grumpiness = UpVotes – DownVotes
In Excel, you would do this by setting the name in calculated properties pane to ‘so.users.grumpiness’ and the expression to ‘UpVotes – DownVotes’ Let’s take this apart:
In so.users.gumpiness, grumpiness is the name of the property, and so.users defines the documents where the calculated property will be in. You can read this as: for each document in so.users, create a calculated property called grumpiness.
The second part defines the value so.users.grumpiness will contain. In this case, we’re caclulating a users likability as the difference between the number of UpVotes and DownVotes he has cast.
Finally, set registers the definition with the Analytica server. When the user now retrieves the so.users schema, it will include grumpiness for each of the users. If their grumpiness score is negative then that user has are downvoting more than they have upvoting!
We can see the grumpiness of each user using GET so.users.grumpiness
and we can SET so.avggrumpiness = average(users.grumpiness)
Looking at the result [=AGET(so.avggrumpiness)], we can see that the average grumpiness is 10.59 – so our Stack Overflow users are not very grumpy!
4. Where are our users located? [Grouping and ordering]
In order to figure out where are users are located, the data has to be grouped by the ‘Location’ property first and then we can count how many users are in each location. The group by is as follows and it will add a new collection to the database so.
SET so.bylocation = group(users.by(Location))
This results a new collection bylocation. The group function groups the users in users collection. The by function defines the property used for grouping. Our collection contains a property called Location (which we grouped by) and an array of documents, corresponding to the documents of each user with the same location. Exploring the type tree, you will see that calculated properties (such as grumpiness) are carried over into the new grouped documents.
Since the users are grouped by location, it is possible to count up the users for each location.
SET so.bylocation.count = count(users)
This will create a count property in each of our bylocation documents which contains the number of users for that location.
If we’d like to see the top locations for our users, the last thing to do is to order them by our count property. This is done with the another SET statement: SET so.toplocations = orderdesc(bylocation.by(count))
This command tells Analytica to create a new ordering as a collection called toplocations. The orderdesc function tells analytica to order a collection in descending order (so the one with the highest will be first), while the by function tells Analytica which property to use count property in ordering.
The final state of our types looks like after we execute the group and orderdesc functions:
You should now be able to use the =AGET(so.toplocations.location) and =AGET(so.toplocations.count) to print out a list of the top few locations and users counts for those locations. Remember to use the array form in Excel to do this (select a few cells in a column, type =AGET(….), then hit control-shift-enter). Or if you are using the shell, you can use the equivalent GET commands to print out a list. Note: you will notice in our dataset that the top count shows up as blank. This is because a lot of users don’t set their location. The United States, and Germany round out the next top 2 locations for users.
5. Analyzing subsets of data
The last thing we’re going to do is figure out The stackoverflow has a lot of anonymous users – indicated by their DisplayName being set to ‘userXXXX’ where ‘XXXX’ is a random number. We can filter these out by using a SELECT function on our collection. Running select(users.where(not(contains(DisplayName, “user”)))) will select only those documents that do not contain the string ‘user’. We can then assign those to a new calculated collection e.g. SET so.namedusers = select(users.where(not(contains(DisplayName, “user”)))) will create a new calculated collection called namedusers which only contains users that don’t have the string ‘user’ in their DisplayName.
We can also use the results of the select as input to another function such as
SET so.anonusercount = count(select(users.where(contains(DisplayName, “user”))))
This will just set a property called ”anonusercount” which counts the number of anonymous users (those that have ‘user’ in their DisplayName)
You can print out this out into a sheet with AGET, and if you’d like, you can create a pie chart that shows the proportion of named users to anonymous users. If you’ve followed along for all steps, your spreadsheet should look something like this:
If you’ve gotten this far, congratulations! You now know how to use Analytica. You can now connect up to your own MongoDB databases or continue on for a few more steps of tutorials with Analytica
The complete workbook can be downloaded from here: analytica5steps. The following steps get you to the result
- Open the workbook
- Select the “Analytica” tab
- Select the “Manage Connections” button
- Select the database “so”
- Select the “connect” button
Now the workbook is connected.
- Select the “Calculated Properties” button
- In the opening task pane, select the “Re-create Calculated Properties” button
- Wait for the system acknowledgement window
Now the calculated properties are established.
- Select the “Sheet1″ tab at the bottom
- Press CTRL-ALT-F9 (this executes the =analyticaGet() functions inside the sheet
At this point you see all the evaluation results on the sheet. It is possible that the graph is not completely finished – this is a Microsoft Excel problem. In order to fix the graph
- Select the cell B3
- Place your cursor at the end of the =aget(so.anonusercount) that appears in the formular bar near the top
- Hit return
And now the graph displays properly.