Designing a Big Data App with Spring Boot and Cassandra

In this blog, we will design a Big Data app that is highly available, scales well, and can handle many millions of data records.

App Requirements

  • The app needs to store information about every book ever published.
  • The app needs to let users browse book information, mark a book as read, rate a book, and track their progress on the books they are currently reading.
  • The architecture needs to be highly available and stay performant while handling millions of records.
  • The user experience should be very fast.
  • Application-level security needs to be implemented.

Technology Stack

  • Spring Boot.
  • Apache Cassandra NoSQL database. We will use a hosted Cassandra instance from DataStax, through their Astra DB service.
  • Spring Data Cassandra.
  • Spring Security.
  • Thymeleaf, to render the HTML view.

System Design

  • We will not use a relational database. We will use a NoSQL database called Cassandra. Cassandra is a wide-column store: data lives in tables with rows and columns, much like a relational table, but every row is located through its partition key, so a table can also be used like a key-value store. Cassandra is a very popular choice as an application database that has to deal with large amounts of data.
  • Putting all the world’s book information into relational tables would slow the system down. Imagine joining large tables just to find out how many books a user has read. This is not going to scale. Cassandra lets us build an application that scales to this amount of data and still performs well. A Cassandra cluster scales by adding or removing nodes, and the Astra DB service scales up or down based on the number of requests.
  • To connect the Spring Boot application to Cassandra, we will use Spring Data for Apache Cassandra. It offers model-based interaction similar to JPA: we define entity classes and use the Repository pattern to talk to the hosted Cassandra instance on DataStax.
  • As the data source, we will use Open Library to get the data for all the books. Visit Open Library at https://openlibrary.org/, go to Developer Center, then Data Dumps, and click on Bulk Download. The dumps are large, even though they are compressed.
  • Analyzing the user experience also gives a better idea of the system design. Once we identify what the user should get from the app, we can work backwards and design the system and its architecture. With relational databases, we usually don’t think about the app first: we think about the data model, build a normalized database design, and then let the app figure out how to get its data with SQL queries and joins. With Cassandra, we follow a different approach and start from the application’s standpoint: we identify what the app needs and then work backwards to a data model for it.
  • We will have a Book page that renders all the details of a book saved in the DB. We want to load the book information as quickly as possible. We don’t want to keep this data in relational tables and rely on table scans to find a certain book. Our Book page should look something like the one below.
  • Clicking on the Author name should navigate to the Author page. The Author page should look something like the one below. We will show the latest books first.
  • We will have a Search page like the one below.
  • We will have a non-logged-in Home page and a logged-in Home page.

  • For app-level security we will use Spring Security. For authentication we will use OAuth login with GitHub.

Cassandra Schema Design

  • Based on the application requirements, we need to analyze which queries the app has to perform.
    Query1 -> Book by Id :- This table holds all the data required to show a book on the Book page. Below is the corresponding Chebotko diagram. Here, book Id is the partition key column. There can be redundant data, like the author name, but that’s OK in Cassandra. In Cassandra, we duplicate data so that we don’t need joins. We also store the author Id, so that a link can take us to the Books by Author Id query.

    Query2 -> Books by Author Id :- We want the returned books to be in reverse chronological order, with the most recent book first. In a relational database, we would add an ORDER BY clause and sort at query time. In Cassandra, ORDER BY only works on clustering columns within a partition, so we cannot sort by arbitrary columns at query time. With this much data, sorting cannot be left to the last minute anyway; it has to be done upfront. With Cassandra, we design the schema so that data is stored in sorted order when it is saved. You choose the ordering of the table, so every fetch returns rows in the right order. So here, with author Id as the partition key, “published date” needs to be the ordering criterion for the books of an author, which means making “published date” a clustering column.

    Query3 -> Book by Book Id and User Id :- When a logged-in user is on the Book page, they can record their reading status for that book: start date, end date and rating.

    Query4 -> Books by User Id :- We want to have all the books read by a user in one place. So User Id will be the partition key, and all of that user’s books will live within that partition. We also want the rows sorted by reading status and then by time.
  • We are building a schema for the queries, not for the ideal state of the data. We don’t care about normalization. There could be multiple tables, and we could be duplicating data all over the place, but that’s totally fine with something like Cassandra. A fully normalized model optimizes how efficiently the data is stored. We only care about how efficiently the data can be fetched, which means duplicating data so that every fetch from the app is a single lookup rather than a join, which Cassandra does not support.
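
To make the partition key and clustering column idea more concrete, here is a minimal in-memory sketch in plain Java. It is only an analogy and not Cassandra or Spring Data code: the class BooksByAuthorModel and its methods are hypothetical names made up for this illustration. The outer map plays the role of the partition key (author Id), and the inner sorted map plays the role of the clustering column (published date) kept in descending order, so the ordering work happens when data is saved and not when it is read.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory analogy of a "books by author id" table.
// Not Cassandra code: it only illustrates ordering at write time.
class BooksByAuthorModel {

    // Outer key   = partition key (author id).
    // Inner map   = rows of one partition, ordered by the clustering
    //               column (published date), newest first.
    private final Map<String, TreeMap<LocalDate, List<String>>> partitions = new HashMap<>();

    // Saving a book places it at its sorted position inside the partition.
    void save(String authorId, LocalDate publishedDate, String bookName) {
        partitions
            .computeIfAbsent(authorId,
                id -> new TreeMap<LocalDate, List<String>>(Comparator.reverseOrder()))
            .computeIfAbsent(publishedDate, date -> new ArrayList<>())
            .add(bookName);
    }

    // Reading a partition needs no sorting: rows are already in order.
    List<String> findByAuthorId(String authorId) {
        List<String> result = new ArrayList<>();
        TreeMap<LocalDate, List<String>> rows = partitions.get(authorId);
        if (rows != null) {
            for (List<String> names : rows.values()) {
                result.addAll(names);
            }
        }
        return result;
    }
}
```

Saving books dated 1999, 2020 and 2010 for the same author and then calling findByAuthorId returns them as 2020, 2010, 1999, without any sort step in the read path. The same thinking applies to Query4, where the clustering order would be reading status first and time second.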

Setting up hosted Cassandra instance

  • We will use a hosted Cassandra instance from DataStax, so set up a free account.
  • After signing in, create a new database instance.
  • Cassandra is a distributed database, so when you set up a database, you are essentially setting up multiple nodes. You can choose where the database is hosted: Google Cloud, Azure or AWS.
  • Once the database is created and its status is Active, click on the CQL console. This is where we run all the queries. Cassandra is a NoSQL database, so you don’t run SQL queries on it. Cassandra uses its own query language called CQL, the Cassandra Query Language.
  • We know which tables need to be created. Next we need to download the data dump from Open Library and insert the data into the database.

Creating the data loader Spring Boot app (Spring Boot + Cassandra)

    • Now we will do the first part of our application development, which is loading data into our database. We want to take the data from the Open Library data dump and populate the Cassandra tables with it.
    • We will build an app that parses the dump file, extracts the information from every line, and saves it to a Cassandra table. We will create a Spring Boot app that connects to the Cassandra database, with Spring Data for Apache Cassandra as the dependency.
    • Now our Spring Boot application is ready. It only has the Spring Data Cassandra dependency, so all it does is try to connect to a Cassandra instance. Since we have not configured any Cassandra connection yet, it tries to connect to a local Cassandra instance. We need to tell Spring Data Cassandra to connect to our DataStax Cassandra instance.
    • On the DataStax website, after logging in, click on the Connect tab and go to the Java option. There is a “Download Bundle” option. DataStax provides this bundle so that our Java application can connect to the Cassandra instance securely.
    • If you unzip the downloaded bundle, you will find the certificates, keys and trust store that the application needs to connect to this Cassandra instance. The application uses the zip file itself: copy it into the src/main/resources directory and add its path to application.yml.
    • Now we need to provide the connection details in application.yml. See the file in the GitHub repository below.

https://github.com/raianoop07/springboot-cassandra-betterreads/blob/main/betterreads-data-loader/src/main/resources/application.yml

    • Now we need to expose these properties to the application as a Java configuration class.


https://github.com/raianoop07/springboot-cassandra-betterreads/blob/main/betterreads-data-loader/src/main/java/connection/DataStaxAstraProperties.java

    • Now define the bean below so that the app connects to the required Cassandra instance.
    // Point the Cassandra driver at the Astra DB secure connect bundle.
    @Bean
    public CqlSessionBuilderCustomizer sessionBuilderCustomizer(DataStaxAstraProperties astraProperties) {
        Path bundle = astraProperties.getSecureConnectBundle().toPath();
        return builder -> builder.withCloudSecureConnectBundle(bundle);
    }
    • So far, we have a Spring Boot app that uses Spring Data Cassandra to connect to a Cassandra instance in the cloud.
    • Spring Data Cassandra can take care of creating the tables in the configured keyspace, provided the schema action is set accordingly (for example, CREATE_IF_NOT_EXISTS).
    • To communicate with Cassandra, we will use the Repository pattern: we will create entity classes and use the CassandraRepository interface for database operations.


https://github.com/raianoop07/springboot-cassandra-betterreads/blob/main/betterreads-data-loader/src/main/java/rai/anoop/betterreadsdataloader/book/Book.java

https://github.com/raianoop07/springboot-cassandra-betterreads/blob/main/betterreads-data-loader/src/main/java/rai/anoop/betterreadsdataloader/book/BookRepository.java

https://github.com/raianoop07/springboot-cassandra-betterreads/blob/main/betterreads-data-loader/src/main/java/rai/anoop/betterreadsdataloader/author/Author.java


https://github.com/raianoop07/springboot-cassandra-betterreads/blob/main/betterreads-data-loader/src/main/java/rai/anoop/betterreadsdataloader/author/AuthorRepository.java

    • Now we need to parse the dump file and call the repositories to save the parsed data.


https://github.com/raianoop07/springboot-cassandra-betterreads/blob/main/betterreads-data-loader/src/main/java/rai/anoop/betterreadsdataloader/BetterreadsDataLoaderApplication.java
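
The parsing step can be sketched in plain Java as below. This is not the code from the linked repository: DumpLineParser and DumpRecord are hypothetical names, and the sketch assumes each dump line has five tab-separated columns (type, key, revision, last modified, JSON), so please check the format of the file you actually downloaded. It only splits the line and keeps the key and the raw JSON. A real loader would then read fields such as the name from the JSON with a JSON library and save an entity through the repository.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;
import java.util.stream.Stream;

// Hypothetical helper, not taken from the linked repository.
class DumpLineParser {

    // One parsed record: the Open Library key and the raw JSON payload.
    static final class DumpRecord {
        final String key;
        final String json;

        DumpRecord(String key, String json) {
            this.key = key;
            this.json = json;
        }
    }

    // Assumed layout: type <TAB> key <TAB> revision <TAB> last_modified <TAB> JSON.
    // The limit of 5 keeps any tab inside the JSON column intact.
    static Optional<DumpRecord> parseLine(String line) {
        String[] columns = line.split("\t", 5);
        if (columns.length < 5 || !columns[4].startsWith("{")) {
            // Skip malformed lines instead of failing the whole load.
            return Optional.empty();
        }
        return Optional.of(new DumpRecord(columns[1], columns[4]));
    }

    // Streams the file line by line, so the dump is never fully held in memory.
    static long countValidRecords(Path dumpFile) throws IOException {
        try (Stream<String> lines = Files.lines(dumpFile)) {
            return lines.map(DumpLineParser::parseLine)
                        .filter(Optional::isPresent)
                        .count();
        }
    }
}
```

Streaming with Files.lines matters here because the dump files are far too large to read into a list first. Skipping bad lines is a design choice for a one-off load; a stricter loader could log or count them instead.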

Creating the Book tracker app (Spring Boot + Spring Web + Spring Security + Cassandra)

    • Now that we have loaded all the Book and Author information, we need to create a web application that displays the Book information.
    • We will have a URL for the web page of each Book. We also want authentication, so that people can register and sign in. We will set up Spring Security for this.
    • For security we will use GitHub login via OAuth. The configuration is in the file below.

https://github.com/raianoop07/springboot-cassandra-betterreads/blob/main/betterreads-webapp/src/main/resources/application.yml

    • Now we need to implement the Book view flow. We will build a URL endpoint that shows the Book information in a nice HTML page, using Spring MVC.
    • We will build two views: an authenticated view and an unauthenticated view.

https://github.com/raianoop07/springboot-cassandra-betterreads/blob/main/betterreads-webapp/src/main/java/rai/anoop/betterreads/book/BookController.java
https://github.com/raianoop07/springboot-cassandra-betterreads/blob/main/betterreads-webapp/src/main/resources/templates/book.html

    • Next we need to build the Book search feature. We will use the Search API that Open Library provides.
    • A convenient way to make a REST API call from a Spring application is WebClient. To get it, we add the spring-boot-starter-webflux starter dependency.

https://github.com/raianoop07/springboot-cassandra-betterreads/blob/main/betterreads-webapp/src/main/java/rai/anoop/betterreads/search/SearchController.java
https://github.com/raianoop07/springboot-cassandra-betterreads/blob/main/betterreads-webapp/src/main/resources/templates/search.html
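
Whichever HTTP client is used, the search call starts with a correctly encoded URL. The sketch below shows only that part, using JDK classes instead of WebClient so that it runs without any Spring dependency. OpenLibrarySearch is a hypothetical helper, and the endpoint https://openlibrary.org/search.json with a q parameter is an assumption you should confirm against the Open Library API documentation. The request is built but never sent.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; uses JDK classes only, not Spring's WebClient.
class OpenLibrarySearch {

    // Assumed URL of the Open Library search endpoint.
    private static final String SEARCH_URL = "https://openlibrary.org/search.json";

    // Builds the search URI, URL-encoding the user's query text
    // so that spaces and special characters are safe in the query string.
    static URI searchUri(String query) {
        String encoded = URLEncoder.encode(query.trim(), StandardCharsets.UTF_8);
        return URI.create(SEARCH_URL + "?q=" + encoded);
    }

    // Builds (but does not send) a GET request for the search.
    static HttpRequest searchRequest(String query) {
        return HttpRequest.newBuilder(searchUri(query))
                .header("Accept", "application/json")
                .GET()
                .build();
    }
}
```

In the web app, the same URI would be handed to WebClient, and the JSON response would be mapped to a small result class for the search template.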

    • We also need to track user interactions with Books. We want to show the rating a user gave to a given Book.

https://github.com/raianoop07/springboot-cassandra-betterreads/blob/main/betterreads-webapp/src/main/java/rai/anoop/betterreads/userbooks/UserBooksController.java

    • We also need a Home page, where we will show recently read Books.
    • We don’t want to show all the Books, only the first 10 or 50. To do this, we can use the Pageable support of the Cassandra repository when fetching data.
    • With Cassandra, pagination is a bit tricky: we cannot jump between pages, for example by asking for the 10th page. Cassandra does not know where the 10th page starts. As a distributed database, it would have to calculate the starting point, and at the data sizes Cassandra manages we don’t want that. With Cassandra you can only paginate sequentially: start at page 0 and then move forward one page at a time.

https://github.com/raianoop07/springboot-cassandra-betterreads/blob/main/betterreads-webapp/src/main/java/rai/anoop/betterreads/home/HomeController.java
https://github.com/raianoop07/springboot-cassandra-betterreads/blob/main/betterreads-webapp/src/main/resources/templates/home.html
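
The forward-only nature of this pagination can be illustrated with a small plain-Java sketch. It is an in-memory analogy, not the Cassandra driver or the Spring Data API: ForwardOnlyPager and Page are hypothetical names. Each fetch returns one page plus a cursor, and the only way to reach a later page is to pass the cursor from the page before it. The cursor here simply encodes a position to keep the sketch short; in Cassandra the paging state is an opaque token produced by the database.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory analogy of forward-only paging.
// Not the Cassandra driver or Spring Data API.
class ForwardOnlyPager<T> {

    // One page of results plus an opaque cursor for the next page
    // (null when there are no more rows).
    static final class Page<T> {
        final List<T> items;
        final String nextCursor;

        Page(List<T> items, String nextCursor) {
            this.items = items;
            this.nextCursor = nextCursor;
        }

        boolean hasNext() {
            return nextCursor != null;
        }
    }

    private final List<T> rows;

    ForwardOnlyPager(List<T> rows) {
        this.rows = new ArrayList<>(rows);
    }

    // Pass null for the first page, then the cursor of the previous page.
    Page<T> fetch(String cursor, int pageSize) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        int start = (cursor == null) ? 0 : Integer.parseInt(cursor);
        int end = Math.min(start + pageSize, rows.size());
        List<T> items = new ArrayList<>(rows.subList(start, end));
        String next = (end < rows.size()) ? String.valueOf(end) : null;
        return new Page<>(items, next);
    }
}
```

For the Home page this limitation does not hurt, because we only need the first page of recently read Books. A "load more" button fits this model well, while numbered page links do not.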

Both projects are available in the GitHub repository below.
https://github.com/raianoop07/springboot-cassandra-betterreads