Twenty years ago, my development team built a natural language processing engine that scanned employment, auto, and real estate advertisements for searchable categories. I knew that we had a difficult data management challenge. The data in some ad types were relatively straightforward, like identifying car makes and models, but others required more inference, such as identifying a job category based on a list of skills.

We developed a metadata model that captured all the searchable terms, but the natural language processing engine required the model to expose significant metadata relationships. We knew designing a metadata model with arbitrary connections between data points in a relational database was complex, so we explored using object databases to manage the model.

What we were trying to accomplish back then with object databases can be done better today with graph databases. Graph databases store information as nodes and data specifying their relationships with other nodes. They are proven architectures for storing data with complex relationships.

Graph database usage has certainly grown during the past decade as companies considered other NoSQL and big data technologies. The global graph database market was estimated at $651 million in 2018 and forecasted to grow to $3.73 billion by 2026. But many other big data management technologies, including Hadoop, Spark, and others, have seen much more significant growth in popularity, skill adoption, and production use cases compared to graph databases. By comparison, the big data technology market size was estimated at $36.8 billion in 2018 and forecasted to grow to $104.3 billion by 2026.

I wanted to understand why more organizations aren’t considering graph databases. Developers think in objects and regularly use hierarchical data representations in XML and JSON. Technologists and business stakeholders intuitively understand graphs: the Internet is a graph of pages connected by hyperlinks, and social networks have made concepts like friends and friends of friends familiar. Then why haven’t more development teams used graph databases in their applications?

Learning the query languages of graph databases

Although it may be relatively easy to comprehend the modeling of nodes and relationships used in graph databases, querying them requires learning new practices and skills.

Let’s look at that example of computing a list of friends and friends of friends. Fifteen years ago, I cofounded a travel social network and decided to keep the data model simple by storing everything in MySQL. The table storing the list of users had a self join to represent friends, and extracting a user’s friends list was a relatively straightforward query. But getting to a friends of friends list required a monstrously complex query that worked but didn’t perform well when users had extended networks.
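To make the relational approach concrete, here is a minimal sketch of a friends and friends of friends query in SQL, run through Python’s built-in sqlite3 module rather than MySQL. The schema, the separate friendships table, and the sample names are all assumptions for illustration, not the travel network’s actual data model.

```python
import sqlite3

# Hypothetical schema: a users table plus a friendships table that
# joins users to users. Names and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE friendships (
        user_id   INTEGER NOT NULL REFERENCES users(id),
        friend_id INTEGER NOT NULL REFERENCES users(id),
        PRIMARY KEY (user_id, friend_id)
    );
""")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(1, "Rosa"), (2, "Karl"), (3, "Ana"), (4, "Li")],
)
# Directed friendship rows: Rosa -> Karl, Karl -> Rosa, Karl -> Ana, Ana -> Li
conn.executemany(
    "INSERT INTO friendships VALUES (?, ?)",
    [(1, 2), (2, 1), (2, 3), (3, 4)],
)

# Depth 1 needs one pass over friendships; depth 2 needs the table
# joined to itself. Every extra level of depth adds another join.
FRIENDS_OF_FRIENDS = """
    SELECT u.name
    FROM friendships AS f1
    JOIN users AS u ON u.id = f1.friend_id
    WHERE f1.user_id = :me
    UNION
    SELECT u.name
    FROM friendships AS f1
    JOIN friendships AS f2 ON f2.user_id = f1.friend_id
    JOIN users AS u ON u.id = f2.friend_id
    WHERE f1.user_id = :me AND f2.friend_id <> :me
"""

names = sorted(row[0] for row in conn.execute(FRIENDS_OF_FRIENDS, {"me": 1}))
print(names)  # ['Ana', 'Karl']
```

Even in this small form, the query has to spell out each level of depth as its own set of joins, which is where both the complexity and the performance cost come from as networks grow.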

I spoke with Jim Webber, chief scientist at Neo4j, one of the established graph databases available, about how to construct a friends of friends query. Developers can query Neo4j graph databases using RDF (Resource Description Framework) and Gremlin, but Webber told me that more than 90 percent of customers are using Cypher. Here’s how the query in Cypher for extracting friends and friends of friends looks:

MATCH (me:Person {name:'Rosa'})-[:FRIEND*1..2]->(f:Person)
WHERE me <> f
RETURN f

Here’s how to understand this query:

  • Find me the pattern where there is a node with label Person and a property name:'Rosa', and bind that to the variable “me.” The query specifies that “me” has an outgoing FRIEND relationship at depth 1 or 2 to any other node with a Person label, and binds those matches to the variable “f.”
  • Make sure “me” is not equal to “f,” because I’m a friend of my friends!
  • Return all the friends and friends of friends.
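For readers more comfortable with general-purpose code than with Cypher, the traversal described in the steps above can be sketched as a depth-limited breadth-first search. This is only an illustration of what the query asks for, not how Neo4j executes it; the sample graph and the function name friends_within are assumptions.

```python
from collections import deque

# Hypothetical sample data: outgoing FRIEND relationships keyed by name.
FRIEND = {
    "Rosa": ["Karl", "Ana"],
    "Karl": ["Rosa", "Li"],
    "Ana": ["Li"],
    "Li": ["Omar"],
    "Omar": [],
}

def friends_within(graph, start, max_depth=2):
    """Return people reachable from start over 1..max_depth outgoing
    edges, excluding start itself, in the order they are discovered."""
    seen = {start}            # plays the role of WHERE me <> f
    found = []
    queue = deque([(start, 0)])
    while queue:
        person, depth = queue.popleft()
        if depth == max_depth:
            continue          # do not expand beyond the depth limit
        for friend in graph.get(person, []):
            if friend not in seen:
                seen.add(friend)
                found.append(friend)
                queue.append((friend, depth + 1))
    return found

print(friends_within(FRIEND, "Rosa"))  # ['Karl', 'Ana', 'Li']
```

One difference worth noting: this sketch lists each person once, while the Cypher query as written returns a row for every matching path, so a person reachable through two different friends would appear twice unless the query used RETURN DISTINCT f.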

The query is elegant and efficient but has a learning curve for those used to writing SQL queries. Therein lies the first challenge for organizations moving toward graph databases: SQL is a pervasive skill set, and Cypher and other graph query languages are a new skill to learn.

Designing flexible hierarchies with graph databases

Product catalogs, content management systems, project management applications, ERPs, and CRMs all use hierarchies to categorize and tag information. The problem, of course, is that some information isn’t truly hierarchical, and subject matter experts must create a consistent approach to structuring the information architecture. That can be a painful process, especially if there’s internal debate on structuring the information, or when application end-users can’t find the information they seek because it’s in a different part of the hierarchy.

Not only do graph databases enable arbitrary hierarchies, but they also enable developers to create different views of the hierarchy for different needs. For example, a content management system might file this article on graph databases under hierarchies for data management, emerging technologies, industries that are likely to use graph databases, common graph database use cases, or technology roles. A recommendation engine then has a much richer set of data to match content with user interests.
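As a rough sketch of the idea, the same article node can sit in several hierarchies at once if each hierarchy is expressed as its own relationship type. The relationship names (IN_TOPIC, IN_USE_CASE, FOR_ROLE), the category names, and both helper functions below are hypothetical, chosen only to illustrate the modeling approach in plain Python.

```python
# Hypothetical edges: (child, relationship_type, parent).
# Each relationship type represents one view of the hierarchy.
EDGES = [
    ("Graph databases article", "IN_TOPIC", "Data management"),
    ("Data management", "IN_TOPIC", "Technology"),
    ("Graph databases article", "IN_USE_CASE", "Recommendation engines"),
    ("Recommendation engines", "IN_USE_CASE", "Use cases"),
    ("Graph databases article", "FOR_ROLE", "Data architect"),
    ("Data architect", "FOR_ROLE", "Technology roles"),
]

def breadcrumb(edges, node, view):
    """Follow edges of one relationship type upward from node and
    return the path root-first, as a breadcrumb for that view."""
    parents = {child: parent for child, rel, parent in edges if rel == view}
    path = [node]
    # Stop at the root, or if the data accidentally contains a cycle.
    while path[-1] in parents and parents[path[-1]] not in path:
        path.append(parents[path[-1]])
    return list(reversed(path))

def views_for(edges, node):
    """List the hierarchy views in which node has a parent."""
    return sorted({rel for child, rel, _ in edges if child == node})

print(views_for(EDGES, "Graph databases article"))
# ['FOR_ROLE', 'IN_TOPIC', 'IN_USE_CASE']
print(breadcrumb(EDGES, "Graph databases article", "IN_TOPIC"))
# ['Technology', 'Data management', 'Graph databases article']
```

Adding another view means adding edges of a new type; the existing nodes and hierarchies stay untouched, which is the flexibility a fixed category tree lacks.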

I spoke to Mark Klusza, co-founder of Construxiv, a company selling technologies to the construction industry, including Grit, a construction scheduling platform. If you look at a commercial construction project’s schedule, you’ll see references to multiple trades, equipment, parts, and model references. A single work package can easily have hundreds of tasks with dependencies in the project plan. These plans must integrate data from ERPs, Building Information Modeling, and other project plans and present views to schedulers, project managers, and subcontractors. Klusza explained, “By using a graph database in Grit, we create much richer relationships on who’s doing what, when, where, with what equipment, and with which materials. That enables us to personalize views and to forecast job scheduling conflicts better.”

To take advantage of flexible hierarchies, it helps to design applications from the ground up with a graph database. The entire application is then designed based on querying the graph and leveraging the nodes, relationships, labels, and properties of the graph.

Cloud deployment options reduce operational complexities

Deploying data management solutions into a data center isn’t trivial. Infrastructure and operations must consider security requirements; review performance considerations to size up servers, storage, and networks; and also operationalize replicated systems for disaster recovery.

Organizations experimenting with graph databases now have several cloud options. Engineers can deploy Neo4j to GCP, AWS, or Azure, or use Neo4j Aura, a database as a service. TigerGraph has a cloud offering and starter kits for use cases such as customer 360, fraud detection, recommendation engines, social network analysis, and supply chain analysis. The public cloud vendors also have graph database capabilities, including Amazon Neptune, the Gremlin API in Azure Cosmos DB, the open source JanusGraph on GCP, and the graph features in Oracle’s Cloud Database Services.

I return to my original question. With all the interesting use cases, mature graph database platforms available, opportunities to learn graph database development, and cloud deployment options, why aren’t more technology organizations using graph databases?
