This is the second part of the tutorial on how to use NodeJS with MongoDB. Here we switch to using the regular MongoDB shell and commands to make the study of aggregate functions simpler.
To show how to use aggregate functions, we will first explain how to do basic queries. Then we will show how to do the WordCount program, which is what people start with when they are first learning, for example, Apache Spark.
Basically, there are two aggregate functions: aggregate and MapReduce. Aggregate functions are the same as the familiar SQL command:
SQL select xxx, count (xxx) from table group by xxx
The aggregate functions are count, sum, average, min, max, etc.
The data that we used in the first part was smoker survey date. It has a complex structure. Recall that the schema is this:
var schema = new mongoose.Schema({ cachedContents : { largest : String, non_nill : Number, nill : Number, tip : [{ item : String, count: Number }] , smallest : String, format : { displayStyle : String, align : String } } });
And the sample data looks like this:
"_id" : ObjectId("59e9a6144bde260fa07e68f1"), "cachedContents" : { "nill" : 0, "non_nill" : 14069, "largest" : "WY", "tip" : [ { "item" : "NC", "count" : 20, "_id" : ObjectId("59e9a6144bde260fa07e6905") }, { "item" : "ND", "count" : 19, "_id" : ObjectId("59e9a6144bde260fa07e6904") }, { "item" : "OH", "count" : 18, "_id" : ObjectId("59e9a6144bde260fa07e6903") }, …. (shortened)….. { "item" : "WY", "count" : 2, "_id" : ObjectId("59e9a6144bde260fa07e68f3") }, { "item" : "IN", "count" : 1, "_id" : ObjectId("59e9a6144bde260fa07e68f2") } ] }, "__v" : 0 }
That is complicated to work with, since it has an array of embedded documents: tip:[{item, count}]. So we will first show how to query this data, then we will make simpler data to show the aggregate and mapreduce functions.
Simple Queries
A simple find query will output all fields. Like this:
db.smokersmodels.find({ "cachedContents.largest" : "2016" }).pretty()
"_id" : ObjectId("59e9a6144bde260fa07e6fda"), "cachedContents" : { "nill" : 0, "non_nill" : 14069, "largest" : "2016", "tip" : [ { "item" : "2014-2015", "count" : 20, "_id" : ObjectId("59e9a6144bde260fa07e6fe5") }, … (and so forth) ….
Notice that we use pretty() to format the output to JSON so that it is easier to read.
Notice too that we put quote marks around the field names. If you look at the instructions from MongoDB they leave then off. But that causes a syntax error when we use the mongo shell. There seems to be some confusion about that if you look at stackoverflow. We are using MongoDB version mongodb-linux-x86_64-debian81-3.4.9.
And we use the dot notation like cachedContents.largest to pull fields nested in our document.
To output specific fields we put them after the query and put the number 1 to indicate which ones to print. Notice here how the output looks when we leave off pretty().
db.smokersmodels.find({ "cachedContents.largest" : "2016" } , { "cachedContents.largest" : 1 })
{ "_id" : ObjectId("59e9a6144bde260fa07e688c"), "cachedContents" : { "largest" : "2016" } } { "_id" : ObjectId("59e9a6144bde260fa07e68ad"), "cachedContents" : { "largest" : "2016" } } { "_id" : ObjectId("59e9a6144bde260fa07e68e5"), "cachedContents" : { "largest" : "2016" } } { "_id" : ObjectId("59e9a6144bde260fa07e693d"), "cachedContents" : { "largest" : "2016" } } { "_id" : ObjectId("59e9a6144bde260fa07e69c3"), "cachedContents" : { "largest" : "2016" } }
Simple Aggregation
Use the distinct operation to do simple aggregation:
db.smokersmodels.distinct("cachedContents.largest") [ "2016", "WY", "Wyoming", "Tobacco Use – Survey Data", "Cigarette Use (Adults)", "Smoking Status", "BRFSS", "Some Days", "%", "Percentage", "87.9", "*", "Data in these cells have been suppressed because of a small sample size", "14.9", "85.4", "90.4", "39839", "Overall", "White", "All Ages", "All Grades", "BEH", "101BEH" ]
Word Count
Now we make some simpler data to illustrate how to do the WordCount program with MongoDB.
First, create some data:
db.words.insertMany( [ { word: "the" }, { word: "rain" }, { word: "in" }, { word: "spain" }, { word: "spain" } ]);
They print it out to see what it looks like:
db.words.find().pretty() { "_id" : ObjectId("59eb051e878eb4d6aca74243"), "word" : "the" } { "_id" : ObjectId("59eb051e878eb4d6aca74244"), "word" : "rain" } { "_id" : ObjectId("59eb051e878eb4d6aca74245"), "word" : "in" } { "_id" : ObjectId("59eb051e878eb4d6aca74246"), "word" : "spain" } { "_id" : ObjectId("59eb051e878eb4d6aca74247"), "word" : "spain" }
The trick with the WordCount program has always been to make the (key,value) pairs (word, 1) and then sum the number 1:
db.words.aggregate([ {$group: { _id: "$word", cnt: {$sum: 1} } } ])
{ "_id" : "in", "cnt" : 1 } { "_id" : "rain", "cnt" : 1 } { "_id" : "spain", "cnt" : 2 } { "_id" : "the", "cnt" : 1 }
As you can see, the word spain occurs 2 times.
MapReduce
Doing this with mapReduce basically does the same thing. We first create (key,value) pairs (word, 1) using the emit function. Then we use the JavaScript Array.sum function to sum the number 1.
db.words.mapReduce( function() { emit(this.word,1); },
function(key, values) {return Array.sum(values)}, { query:{},
} out:"total_matches" ).find()
{ "_id" : "in", "value" : 1 } { "_id" : "rain", "value" : 1 } { "_id" : "spain", "value" : 2 } { "_id" : "the", "value" : 1 }
The result is the same.
Additional Resources