Friday 20 February 2015

Meluha Trilogy: Big Text Data Analytics and Visualization from Kindle Ebooks



Find the Meluha Trilogy App here

I read the Meluha trilogy by Amish Tripathi last month. Although I had known about these books for the past couple of years, somehow I never thought of picking them up; now, after reading them, I regret not doing so earlier. It is one of the most fascinating and detail-oriented works I have read recently. The way it connects Hindu mythology to history is spellbinding, comparable only to Shashi Tharoor's The Great Indian Novel. The author has created an ancient world rich with maps, geographies and mythological characters that still exist today, conjuring a magical world that sometimes creates the illusion of being real. And the book's message resonates with mine: "Gods are human beings who did great deeds".
As a homage to this great trilogy, and being a student of data science, I built an app that visualizes the trilogy. I was heavily influenced by Trevor Stephens' Catch-22 visualization (http://trevorstephens.com/post/86060548369/catch-22-visualized) and the Les Miserables visualization. I focused mainly on sentiment analysis, as this trilogy is so rich in emotions and contrasting characters.

The Data Extraction and Preparation
The source of data is the ebook on my Kindle. I extracted chapter-wise text from the three Kindle ebooks of the Meluha trilogy. I explained the methodology of extracting chapter-wise text from a Kindle ebook in my last post (http://hadoopmania.blogspot.in/2015/02/extracting-text-from-kindle-ebooks.html).

After running the code mentioned in the last post, I got chapter-wise text from each ebook and converted all text to lowercase. The next step was to apply natural language processing to it to get insights from such a huge amount of data.

Theoretically, I came up with the following sequence of NLP steps:
1) Word tokenization
2) Sentence annotation
3) POS (part-of-speech) tagging
4) Lemmatisation
5) Named entity recognition (NER), to find person characters within the text
6) Deep syntactical parsing
7) Coreference resolution
8) Sentiment analysis

Implementation-wise, I was sure I would face problems at the named entity recognition step. As existing NER models are trained on standard text corpora, it was obvious they were not going to find characters with exotic names like Shiva, Nandi, Sati and Karthik with good accuracy. I started training an NER model on the book text but soon realised it was a huge waste of time, as I was indirectly tagging each character in the book. So I decided to try RegexNER, which identifies named entities on the basis of regular expressions and can give me 100% accuracy.

For the NLP implementation, I started with Apache OpenNLP (https://opennlp.apache.org/) but, after a week of experimentation, abandoned it due to the lack of RegexNER and good-quality coreference resolution. Then I came across the Stanford CoreNLP parser (http://nlp.stanford.edu/software/corenlp.shtml), which is a very rich NLP toolset and fit my requirements almost perfectly, except that its coreference resolution doesn't work on named entities extracted by RegexNER. I am still working on this feature and expecting help on Stack Overflow (http://stackoverflow.com/questions/28169139/run-stanford-corenlps-dcoref-on-the-basis-of-output-of-regexner).

Character Occurrence Plot
For each chapter's text, I ran the Stanford CoreNLP pipeline with the following NLP tools in sequence: "tokenize, ssplit, regexner". "tokenize" does word tokenization, "ssplit" does sentence annotation, and "regexner" extracts named entities on the basis of regular expressions.
The way to set the regular expressions is by passing the following property to the Stanford CoreNLP parser:
             
Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, regexner"); // the sequence above
props.setProperty("regexner.mapping", "in/characters.txt");

StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

and adding regular expressions like the following to the tab-separated file "in/characters.txt":

((\s*)shiva(\s*)|(\s*)neelkanth(\s*)|(\s*)sati(\s*))	CHARACTER

where each line has two tab-separated columns: the first is the regular expression and the second is the named entity label.

After completing the NLP parsing, for each token labelled "CHARACTER", I calculate the word percentile, which gives the location of that character mention within the chapter and the book. Plotting each character and its occurrence percentiles by book chapter then gives useful insights into the character's influence and coverage within the book.
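
To make this step concrete, here is a minimal sketch (the method name and CSV layout are mine, not the app's): for each token the pipeline labelled "CHARACTER", record its relative position among the chapter's tokens.

import java.util.List;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;

public class OccurrencePlotData {
    // Emit one CSV row per CHARACTER mention: chapter, character, percentile.
    static void printMentions(Annotation chapter, int chapterNo) {
        List<CoreLabel> tokens = chapter.get(CoreAnnotations.TokensAnnotation.class);
        for (int i = 0; i < tokens.size(); i++) {
            CoreLabel token = tokens.get(i);
            if ("CHARACTER".equals(token.ner())) {
                // relative position of this mention within the chapter
                double percentile = 100.0 * i / tokens.size();
                System.out.printf("%d,%s,%.2f%n", chapterNo, token.word(), percentile);
            }
        }
    }
}

Each row can then be offset by the chapter's position in the book to place the mention on a whole-book axis.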





On the left task bar, there is an option to select the book of the series and the chapter range we want to visualize. As is evident, the character "shiva" is almost omnipresent in the series, but it is other characters like "sati", "bhrigu" and "anandmayi" who occur in a limited number of chapters yet have a strong impact on the storyline.

Character Sentiments Plot
This plot, I think, is the most meaningful and impactful plot of my enterprise. Novels are always full of emotional swings and conflicts, so it makes for an interesting visualization to see how the characters' moods swing across the book.

For data extraction, I ran the following Stanford NLP pipeline on each chapter's data: "tokenize, ssplit, regexner, pos, lemma, parse, sentiment". The first three components were already explained. "pos" does part-of-speech tagging (nouns, adverbs, etc.), "lemma" does lemmatisation, "parse" does deep syntactical parsing and tags relationships between the phrases identified by "pos", and "sentiment" does sentiment scoring of each phrase in the text using models built with a recursive neural network and the sentiment treebank.
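
Putting that pipeline together, a minimal sketch of the configuration and of reading each sentence's sentiment class (in recent CoreNLP releases the sentence-level label is exposed as SentimentCoreAnnotations.SentimentClass; older releases call it ClassName):

import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.sentiment.SentimentCoreAnnotations;
import edu.stanford.nlp.util.CoreMap;

public class ChapterSentiment {
    // Run the full pipeline on one chapter's (lowercased) text and print
    // the sentiment class of every sentence.
    static void printSentenceSentiments(String chapterText) {
        Properties props = new Properties();
        props.setProperty("annotators",
                "tokenize, ssplit, regexner, pos, lemma, parse, sentiment");
        props.setProperty("regexner.mapping", "in/characters.txt");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation doc = new Annotation(chapterText);
        pipeline.annotate(doc);

        for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
            // one of: "Very negative", "Negative", "Neutral", "Positive", "Very positive"
            String sentiment = sentence.get(SentimentCoreAnnotations.SentimentClass.class);
            System.out.println(sentiment + "\t" + sentence);
        }
    }
}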

But we only want the sentiment of the concerned characters, as identified by "regexner". So I filtered the search space down to only those sentences that have a book character as the "SUBJECT" or "PASSIVE SUBJECT".
For example: "shiva killed the demons." or "shiva was angered by the carelessness of devotees."
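
A minimal sketch of that filter, assuming the subject roles are read off CoreNLP's dependency parse (nsubj for "SUBJECT", nsubjpass for "PASSIVE SUBJECT"); "CHARACTER" is the RegexNER label configured earlier:

import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.semgraph.SemanticGraphCoreAnnotations;
import edu.stanford.nlp.semgraph.SemanticGraphEdge;
import edu.stanford.nlp.util.CoreMap;

public class SubjectFilter {
    // True if the sentence has a token labelled CHARACTER as its subject
    // (nsubj) or passive subject (nsubjpass).
    static boolean hasCharacterSubject(CoreMap sentence) {
        SemanticGraph deps = sentence.get(
                SemanticGraphCoreAnnotations.CollapsedCCProcessedDependenciesAnnotation.class);
        for (SemanticGraphEdge edge : deps.edgeIterable()) {
            String rel = edge.getRelation().getShortName();
            if (rel.equals("nsubj") || rel.equals("nsubjpass")) {
                // the dependent of the edge is the subject word itself
                if ("CHARACTER".equals(edge.getDependent().ner())) {
                    return true;
                }
            }
        }
        return false;
    }
}

In "shiva killed the demons", "shiva" hangs off an nsubj edge of "killed"; in the passive example it hangs off an nsubjpass edge of "angered", so both sentences pass the filter.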

Then, in this filtered search space, I recorded each character's occurrence percentile and the enclosing sentence's sentiment score. Plotting this information using ggplot looks something like this:


In chapters 44 and 45, "sati" has a concentrated cluster of red, which is "very negative" emotion. This stands justified, as she was cheated by her father and engaged in battle. Similarly, "shiva" has a good amount of green lines in chapter 6 of the first book, when he falls in love with "sati".

This plot could have been much richer and more insightful if I had been able to run coreference resolution on the characters identified by RegexNER, but alas, it only runs on the output of "ner", which is not able to identify the book characters with accuracy.

Character Co-occurrence
This plot correlates the simultaneous occurrence of two characters within the book. It shows us which characters are frequent collaborators in the book's storyline.
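
A minimal sketch of how such a matrix can be counted, assuming chapter-level granularity (the actual app may bin more finely): two characters co-occur when both are mentioned in the same chapter.

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class CoOccurrence {
    // matrix.get(a).get(b) = number of chapters mentioning both a and b;
    // the diagonal holds each character's own chapter count, which is why
    // it is ignored when reading the plot.
    static Map<String, Map<String, Integer>> count(List<Set<String>> charactersPerChapter) {
        Map<String, Map<String, Integer>> matrix = new HashMap<>();
        for (Set<String> chapter : charactersPerChapter) {
            for (String a : chapter) {
                for (String b : chapter) {
                    matrix.computeIfAbsent(a, k -> new HashMap<>())
                          .merge(b, 1, Integer::sum);
                }
            }
        }
        return matrix;
    }
}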


In this matrix, we can ignore the diagonal, as it reflects only the occurrence density of a single character. Otherwise, we can see strong correlation patterns of shiva with sati, ganesh and bhagirath. Also, kali and ganesh have strong co-occurrence.

Conclusion

I am still working to enhance the sentiment analysis and to find more visualization options to enrich the overall insights.
Though I admit it has far exceeded my initial expectations and turned out to be a good learning journey, in which I worked through many technical and analytic hiccups.
Regarding its real-world value, I think it could give users a visualization of a book before they read it, helping them choose on the basis of their sentiment preferences. It could also hold high nostalgic value for fans of a book and be used for targeted marketing.

Open for Healthy Criticism !!!

Sunday 8 February 2015

Extracting text from Kindle ebooks: Chapter-wise

Amazon Kindle has digitized the book world in a revolutionary way. Being an avid book lover and reader, I had tried to avoid, in fact hated, digitized books available as PDFs and, especially, on Kindle for a long time. I loved that experience of turning crispy pages and exploring a new world unfolding on every page. It all changed two months back, when I bought an Amazon Kindle Paperwhite. Although the only pro I had in mind at that time was its amazing Vocabulary Builder app, over time I started to appreciate other things too, like the ability to explore more books in the digital bookshop, recommendations, and a well-simulated book-reading experience. Still, I think it will take me some time to move on from hard copies. It is a classic Man vs Machine case: I am most skeptical about a machine controlling my mind and thoughts, and what better than digitized books to do so.
Coming back from digression and dreadful dreams, this post is about how to extract chapter-wise text from Kindle book formats. This is all in service of my personal project, which I conceived in one of those dreadful dreams. The project is about doing text analysis on each chapter to find the influence of characters in a book and on each other. Another part is doing sentiment analysis on the characters to find their moods in different parts of the book. I will explain it in the next series of posts.

Problem Statement:
Amazon Kindle Reader has its own digital format in which it encodes books. It is the mobi/epub format, which we can easily see if we plug a Kindle device into a computer and explore its filesystem. I wanted to read text from a Kindle ebook chapter-wise.

For background information, in the Kindle ebook format, all components (html, css, xml, fonts, images) are represented as Resources and are in XHTML format. There are 3 indexes into these Resources, as per the epub specification:
1) Spine: the Resources to be shown when a user reads the book from start to finish.
2) Table of Contents: the table of contents. Table of Contents references may be in a different order and contain different Resources than the Spine, and often do.
3) Guide: references to a set of special Resources like the cover page, the glossary, the copyright page, etc.
The complication is that these 3 indexes may, and usually do, point to different pages. A chapter may be split into 2 pieces to fit into memory; then the Spine will contain both pieces, but the Table of Contents only the first. The content page may be in the Table of Contents and the Guide, but not in the Spine. And so on.


[Diagram: the Spine (Chapter 1, Chapter 1 Part 2, Chapter 2), the Table of Contents (Chapter 1, Chapter 2) and the Guide (Cover, Preface) each pointing into the book's Resources.]


Solution:
Thus, I started the recce on the internet and found two libraries fit for the job:
1) Apache Tika (http://tika.apache.org/): an open-source library for detecting and extracting metadata and content from almost any file type. Just pass a file to its simple interface and it will automatically detect the type and call the corresponding driver to extract metadata and content.

It is as simple as:

Tika tika = new Tika();
tika.setMaxStringLength(10 * 1024 * 1024); // max string buffer length: 10MB

InputStream input = new BufferedInputStream(new FileInputStream("in/book1.epub"));
Metadata metadata = new Metadata();
String content = tika.parseToString(input, metadata);

As is evident, it extracts the text of the whole file at once. We can't extract it piecewise (here, chapter-wise), and we can't skip the index, glossary, preface and other sections of the book that aren't part of the story.

2) epublib (https://github.com/psiegman/epublib): another open-source library, for creating epub files from existing html files. It also has a basic API to read metadata and content from an epub file.
Each Kindle ebook is represented as an nl.siegmann.epublib.domain.Book object, which has these methods:

  • getMetadata to get metadata about the book
  • getSpine to get a reference to the spine
  • getTableOfContents to get a reference to the table of contents
  • getGuide to get a reference to the guide
  • getResources to get references to all the images, chapters, sections, xhtml files, stylesheets, etc. that make up the book.
Coming back to our requirement, this library almost fulfills the need to read the book chapter-wise.


InputStream is = new BufferedInputStream(new FileInputStream("path_to_kindle_ebook"));

EpubReader epubReader = new EpubReader();
Book book = epubReader.readEpub(is);
Spine bookSpine = book.getSpine();
TableOfContents tocs = book.getTableOfContents();

for (TOCReference toc : tocs.getTocReferences()) {
    if (toc.getTitle().toLowerCase().contains("chapter")) {
        Resource chapterResource =
            bookSpine.getResource(bookSpine.findFirstResourceById(toc.getResourceId()));

        byte[] data = chapterResource.getData();                  // content as byte[]
        Reader reader = chapterResource.getReader();              // content as character stream
        InputStream content = chapterResource.getInputStream();   // content as byte stream
    }
}

But the issue is that chapterResource has three ways to return content: getData(), getReader() and getInputStream(). All of them return XHTML content, which needs to be parsed further to extract the text content.
(An important point to note is that I haven't used bookSpine directly to scan through chapters because, as mentioned before, if a big chapter is split into two sections to fit into memory, the spine will have both references. Scanning chapters is more reliable through the TableOfContents.)

So, in order to parse the XHTML and extract the text content, there are two ways: either write a SAX parser or use the SAX parser from the Tika library. Keeping the programming spirit in mind, I opted for the second option.



InputStream is = new BufferedInputStream(new FileInputStream("path_to_kindle_ebook"));

EpubReader epubReader = new EpubReader();
Book book = epubReader.readEpub(is);
Spine bookSpine = book.getSpine();
TableOfContents tocs = book.getTableOfContents();

for (TOCReference toc : tocs.getTocReferences()) {
    if (toc.getTitle().toLowerCase().contains("chapter")) {
        Resource chapterResource =
            bookSpine.getResource(bookSpine.findFirstResourceById(toc.getResourceId()));

        String chapterTitle = toc.getTitle();
        String chapterText = null;
        try {
            org.apache.tika.metadata.Metadata metadata = new org.apache.tika.metadata.Metadata();
            ParseContext context = new ParseContext();

            BodyContentHandler handler = new BodyContentHandler(10 * 1024 * 1024);
            XHTMLContentHandler xhtmlHandler = new XHTMLContentHandler(handler, metadata);
            xhtmlHandler.startDocument();
            ContentHandler contentHandler =
                new EmbeddedContentHandler(new BodyContentHandler(xhtmlHandler));

            Parser epubContentParser = new EpubContentParser();
            epubContentParser.parse(chapterResource.getInputStream(), contentHandler, metadata, context);
            xhtmlHandler.endDocument();

            chapterText = contentHandler.toString().toLowerCase();
        } catch (Exception e) {
            System.err.println(e);
        }
    }
}

The input to the handler object is the size of the text buffer, which I have configured as 10MB (10*1024*1024). I know it is a bit unclean, but it reuses the SAX parser in the Tika module, which is already well tested. And we do have the best excuse a programmer can have: "REUSABILITY". On a serious note, I have tested it numerous times and it works fine.

I have implemented the same functionality in my project using the Iterator design pattern. You can check org.poc.book.reader.impl.KindleBookReader.java in the github project, https://github.com/shakti-garg/bookProject.

Signing off!