Feat: add Lexrank text summarization #62

Open
wants to merge 2 commits into
base: master

@genesluna commented Mar 18, 2019

The Problem

Currently the text robot picks the first n sentences of the content returned from Wikipedia. However, these first sentences are not always the best representation (summary) of the page's content.

With that in mind, and knowing that the idea of the project is to automate as much as possible, I decided to contribute support for unsupervised text summarization using Radev's LexRank algorithm http://www.jair.org/papers/paper1523.html. Basically, it ranks each sentence of a document by lexical centrality, finds the most important sentences, and returns them.
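
To make the ranking idea concrete, here is a minimal sketch of LexRank-style scoring. It is a simplification made for illustration: it uses plain term-frequency cosine similarity (the paper uses tf-idf) and a PageRank-like power iteration with a damping factor. The function names `tokenize`, `termFrequencies`, `cosine`, `lexRankScores`, and `summarize` are hypothetical; this is not the API of the `lexrank` or `lexrank.js` packages.

```javascript
// Split a sentence into lowercase alphanumeric tokens.
function tokenize(sentence) {
  return sentence.toLowerCase().match(/[a-z0-9]+/g) || []
}

// Count how often each token occurs in one sentence.
function termFrequencies(tokens) {
  const tf = new Map()
  for (const token of tokens) tf.set(token, (tf.get(token) || 0) + 1)
  return tf
}

// Cosine similarity between two term-frequency maps.
function cosine(a, b) {
  let dot = 0
  for (const [term, count] of a) dot += count * (b.get(term) || 0)
  const norm = (m) => Math.sqrt([...m.values()].reduce((s, v) => s + v * v, 0))
  const denom = norm(a) * norm(b)
  return denom === 0 ? 0 : dot / denom
}

// Score sentences by power iteration over the row-normalised similarity graph.
function lexRankScores(sentences, { damping = 0.85, iterations = 50 } = {}) {
  const n = sentences.length
  if (n === 0) return []
  const vectors = sentences.map((s) => termFrequencies(tokenize(s)))
  const sim = vectors.map((a, i) => vectors.map((b, j) => (i === j ? 0 : cosine(a, b))))
  const rowSums = sim.map((row) => row.reduce((s, v) => s + v, 0))
  let scores = new Array(n).fill(1 / n)
  for (let it = 0; it < iterations; it++) {
    const next = new Array(n).fill((1 - damping) / n)
    for (let i = 0; i < n; i++) {
      for (let j = 0; j < n; j++) {
        // A sentence similar to nothing spreads its score evenly.
        const weight = rowSums[i] === 0 ? 1 / n : sim[i][j] / rowSums[i]
        next[j] += damping * scores[i] * weight
      }
    }
    scores = next
  }
  return scores
}

// Return the `count` highest-scoring sentences, best first.
function summarize(sentences, count) {
  const scores = lexRankScores(sentences)
  return sentences
    .map((text, index) => ({ text, index, score: scores[index] }))
    .sort((a, b) => b.score - a.score)
    .slice(0, count)
}
```

Sentences that share vocabulary with many other sentences accumulate score, while isolated ones sink, which is why the result can differ from simply taking the first n sentences.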

Example

Currently, if we search for the term "Javascript" in video-maker, we get the following sentences as a result:

  • JavaScript , often abbreviated as JS, is a high-level, interpreted programming language that conforms to the ECMAScript specification.
  • It is a programming language that is characterized as dynamic, weakly typed, prototype-based and multi-paradigm.
  • Alongside HTML and CSS, JavaScript is one of the core technologies of the World Wide Web. JavaScript enables interactive web pages and is an essential part of web applications.
  • The vast majority of websites use it, and major web browsers have a dedicated JavaScript engine to execute it.
  • As a multi-paradigm language, JavaScript supports event-driven, functional, and imperative programming styles.
  • It has APIs for working with text, arrays, dates, regular expressions, and the DOM, but the language itself does not include any I/O, such as networking, storage, or graphics facilities.
  • It relies upon the host environment in which it is embedded to provide these features.

With automated summarization we would get the following result:

  • As a multi-paradigm language, JavaScript supports event-driven, functional, and imperative programming styles.
  • JavaScript was influenced by programming languages such as Self and Scheme.
  • It is a programming language that is characterized as dynamic, weakly typed, prototype-based and multi-paradigm.
  • JavaScript , often abbreviated as JS, is a high-level, interpreted programming language that conforms to the ECMAScript specification.
  • The terms Vanilla JavaScript and Vanilla JS refer to JavaScript not extended by any frameworks or additional libraries.
  • The vast majority of websites use it, and major web browsers have a dedicated JavaScript engine to execute it.
  • It relies upon the host environment in which it is embedded to provide these features.

Note that, besides the order of some sentences changing according to their relevance, others were removed and replaced by sentences that the algorithm considers more relevant.

The full text that was analyzed follows:

JavaScript , often abbreviated as JS, is a high-level, interpreted programming language that conforms to the ECMAScript specification. It is a programming language that is characterized as dynamic, weakly typed, prototype-based and multi-paradigm. Alongside HTML and CSS, JavaScript is one of the core technologies of the World Wide Web. JavaScript enables interactive web pages and is an essential part of web applications. The vast majority of websites use it, and major web browsers have a dedicated JavaScript engine to execute it. As a multi-paradigm language, JavaScript supports event-driven, functional, and imperative programming styles. It has APIs for working with text, arrays, dates, regular expressions, and the DOM, but the language itself does not include any I/O, such as networking, storage, or graphics facilities. It relies upon the host environment in which it is embedded to provide these features. Initially only implemented client-side in web browsers, JavaScript engines are now embedded in many other types of host software, including server-side in web servers and databases, and in non-web programs such as word processors and PDF software, and in runtime environments that make JavaScript available for writing mobile and desktop applications, including desktop widgets. The terms Vanilla JavaScript and Vanilla JS refer to JavaScript not extended by any frameworks or additional libraries. Scripts written in Vanilla JS are plain JavaScript code. Although there are similarities between JavaScript and Java, including language name, syntax, and respective standard libraries, the two languages are distinct and differ greatly in design. JavaScript was influenced by programming languages such as Self and Scheme.

To see an example with the term "Michael Jackson", click here. I think it is an even better example than the previous one.

I also changed the Algorithmia search result from wikipediaContent.content to wikipediaContent.summary, since the first element returned inside 'content' is precisely the 'summary'. This speeds up the regular-expression processing and, as a bonus, helps the LexRank algorithm, since it only has to produce a 'summary of the summary' rather than a summary of the entire content.
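
A small sketch of that choice, assuming the parser response exposes `summary` and `content` as described above. The helper name `pickSourceText` and the fallback to `content` when the summary is empty are assumptions added for illustration, not part of the PR.

```javascript
// Prefer the short lead section; fall back to the full article body
// when no usable summary is present (the fallback is an assumption).
function pickSourceText(wikipediaContent) {
  const summary = (wikipediaContent.summary || '').trim()
  return summary.length > 0 ? summary : wikipediaContent.content
}
```

It could be used as `content.sourceContentOriginal = pickSourceText(wikipediaContent)` before sanitizing.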

I am aware that this PR will not be merged. The intention is only to show one more of the many automation possibilities available to us today.

@maycrodrigues

Congratulations, my friend! Awesome! I'm going to implement and test it in my project too! Sweet!!!! 👍👍👍

PS: I'm doing it in TS https://github.com/maycrodrigues/video-maker-typescript 😎

@acristh commented Mar 19, 2019

Awesome!
This way the whole text is analyzed, which avoids losing important information. 😃👏

@marceloavf

Congratulations @genesluna!

I had thought of this as soon as I saw the video, but I did not know about this lexical ranking technique.

@filipedeschamps (Owner)

Sensational!!!!

@leodutra (Collaborator) commented Apr 8, 2019

@genesluna I have a question: was the second example you gave generated by the analysis?
There is a loss of context that we may have to resolve in order to integrate the improvement:

  • The vast majority of websites use it, and major web browsers have a dedicated JavaScript engine to execute it.
  • It relies upon the host environment in which it is embedded to provide these features.

"these features" does not directly agree with the sentence before it.

This second sentence lost its context after LexRank... before, it looked like this:

  • It has APIs for working with text, arrays, dates, regular expressions, and the DOM, but the language itself does not include any I/O, such as networking, storage, or graphics facilities.
  • It relies upon the host environment in which it is embedded to provide these features.

Any ideas?
Maybe because LexRank was run on individual sentences instead of the entire original text?
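
One partial mitigation, sketched under assumptions: select the top-ranked sentences, then put them back in their original document order. The input shape (`{ text, index, score }`) and the name `topSentencesInSourceOrder` are hypothetical. This keeps the narrative order, but it does not repair a dangling reference such as "these features" when the sentence it points to was not selected.

```javascript
// Pick the `count` best-scored sentences, then restore source order.
// `ranked` is assumed to be an array of { text, index, score } objects,
// where `index` is the sentence's position in the original text.
function topSentencesInSourceOrder(ranked, count) {
  return [...ranked]
    .sort((a, b) => b.score - a.score)
    .slice(0, count)
    .sort((a, b) => a.index - b.index)
}
```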

@felipealfah

@genesluna How are you? I tried to implement it, but it returns errors such as the lexrank modules being missing. Any tips on fixing this?

C:\Users\Felipe\video-maker>node index.js
internal/modules/cjs/loader.js:584
throw err;
^

Error: Cannot find module 'lexrank.js'
at Function.Module._resolveFilename (internal/modules/cjs/loader.js:582:15)
at Function.Module._load (internal/modules/cjs/loader.js:508:25)
at Module.require (internal/modules/cjs/loader.js:637:17)
at require (internal/modules/cjs/helpers.js:22:18)
at Object.<anonymous> (C:\Users\Felipe\video-maker\robots\text.js:4:17)
at Module._compile (internal/modules/cjs/loader.js:701:30)
at Object.Module._extensions..js (internal/modules/cjs/loader.js:712:10)
at Module.load (internal/modules/cjs/loader.js:600:32)
at tryModuleLoad (internal/modules/cjs/loader.js:539:12)
at Function.Module._load (internal/modules/cjs/loader.js:531:3)
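
An error like the one above means Node could not resolve the package from the project's node_modules, typically because the dependency was never installed after checking out the branch. A minimal sketch for checking this before the robot runs follows; the helper name `isModuleInstalled` is hypothetical, and it assumes a CommonJS project like this one.

```javascript
// Return true if `name` can be resolved from the current project,
// false if Node reports it as missing; rethrow anything unexpected.
function isModuleInstalled(name) {
  try {
    require.resolve(name)
    return true
  } catch (error) {
    if (error.code === 'MODULE_NOT_FOUND') return false
    throw error
  }
}
```

For example, `isModuleInstalled('lexrank.js')` returning false would suggest running `npm install` in the project folder, assuming the package is listed in package.json under the name it is required by.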

@HelioLuna

Hi, I saw that several people were having problems with this branch and decided to help. I managed to implement the lexrank algorithm (https://www.npmjs.com/package/lexrank) by following these steps:

1 - Install it in the project: npm i lexrank
2 - Implement it in the project:
const lexrank = require('lexrank');

async function quebrarContentEmSentencasLexicasRankeadas(content){
        return new Promise((resolve, reject) => {
            content.sentences = []

            // Ask lexrank for the 5 top-ranked sentences of the sanitized text
            lexrank.summarize(content.sourceContentSanitizada, 5, (error, result) => {
                if (error) {
                    return reject(error)
                }

                result.forEach((sentence) => {
                    content.sentences.push({
                        text: sentence.text,
                        keywords: [],
                        images: []
                    })
                })
                console.log(content.sentences)
                // Settle the promise so callers awaiting this function can continue
                resolve(content.sentences)
            })
        })
    }

version used: "lexrank": "^1.0.5"

@leodutra (Collaborator)

@HelioLuna, could you please take a look at my comment #62 (comment)?

Could you reuse the implementation and maybe test it with the same text as the OP?

@HelioLuna

> @HelioLuna, could you please take a look at my comment #62 (comment)?
>
> Could you reuse the implementation and maybe test it with the same text as the OP?

Well, I ran lexrank on the whole text, and lexrank itself took care of finding and returning the best sentences based on its ranking algorithm.

@guilherme-argentino

I was quite interested, but I hit this error and got stuck on it.

(node:23892) UnhandledPromiseRejectionWarning: TypeError: Cannot read property 'compact' of undefined

This happened inside sentence-tokenizer, a dependency of lexrank.

@rodrigo-sntg

> I was quite interested, but I hit this error and got stuck on it.
>
> (node:23892) UnhandledPromiseRejectionWarning: TypeError: Cannot read property 'compact' of undefined
>
> This happened inside sentence-tokenizer, a dependency of lexrank.

@guilherme-argentino, I solved that by using lexrank.js itself.

Here is my code:

const algorithmia = require('algorithmia')
const lexrank = require('lexrank.js')
const algorithmiaApiKey = require('../credentials/algorithmia.json').apiKey
const sentenceBoundaryDetection = require(`sbd`)

const watsonApiKey = require('../credentials/watson-nlu.json').apikey
const NaturalLanguageUnderstandingV1 = require('watson-developer-cloud/natural-language-understanding/v1.js')


const nlu = new NaturalLanguageUnderstandingV1({
    iam_apikey: watsonApiKey,
    version: '2018-04-05',
    url: 'https://gateway.watsonplatform.net/natural-language-understanding/api/'
})

const state = require('./state.js')

async function robot() {
    const content = state.load()
    await fetchContentFromWiki(content)
    sanitizeContent(content)
    // breakContentIntoSentences(content)
    await breakContentIntoLexicalRankedSentences(content)
    limitMaximumSentences(content)
    await fetchKeywordsOfAllSentences(content)

    state.save(content)

    
    async function fetchContentFromWiki(content){
        const algorithmiaAuthenticated = algorithmia(algorithmiaApiKey)
        const wikipediaAlgo = algorithmiaAuthenticated.algo('web/WikipediaParser/0.1.2')
        const wikipediaResponse = await wikipediaAlgo.pipe(content.searchTerm)
        const wikipediaContent = wikipediaResponse.get()
        
        content.sourceContentOriginal = wikipediaContent
        
        // content.sourceContentOriginal = wikipediaContent.summary

    }

    function sanitizeContent(content){
        const withoutBlankLinesAndMarkdown = removeBlankLinesAndMarkdown(content.sourceContentOriginal.content)
        const withoutDatesInParenthesis = removeDatesInParenthesis(withoutBlankLinesAndMarkdown)

        content.sourceContentSanitized = withoutDatesInParenthesis

        function removeBlankLinesAndMarkdown(text){
            const allLines = text.split('\n')

            const withoutBlankLinesAndMarkdown = allLines.filter((line) => {
                if (line.trim().length === 0 || line.trim().startsWith('=')) {
                return false
                }

                return true
            })

            return withoutBlankLinesAndMarkdown.join(' ')
        }
    }

    function removeDatesInParenthesis(text) {
        return text.replace(/\((?:\([^()]*\)|[^()])*\)/gm, '').replace(/  /g,' ')
    }

    function breakContentIntoSentences(content) {
        content.sentences = []
    
        const sentences = sentenceBoundaryDetection.sentences(content.sourceContentSanitized)
        sentences.forEach((sentence) => {
          content.sentences.push({
            text: sentence,
            keywords: [],
            images: []
          })
        })
    }

    function limitMaximumSentences(content){
        content.sentences = content.sentences.slice(0, content.maximumSentences)
    }

    async function fetchKeywordsOfAllSentences(content) {
        console.log('> [text-robot] Starting to fetch keywords from Watson')
        // Query Watson once per sentence, one request at a time
        for (const sentence of content.sentences) {
            sentence.keywords = await fetchWatsonAndReturnKeywords(sentence)
        }
      }

    async function fetchWatsonAndReturnKeywords(sentence) {
        return new Promise((resolve, reject) => {
          nlu.analyze({
            text: sentence.text,
            features: {
              keywords: {}
            }
          }, (error, response) => {
            if (error) {
              reject(error)
              return
            }
    
            const keywords = response.keywords.map((keyword) => {
              return keyword.text
            })

            sentence.keywords = keywords
    
            resolve(keywords)
          })
        })
      }

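      async function breakContentIntoLexicalRankedSentences(content) {
        content.sentences = []

        // Wrap the callback API in a Promise so that `await` in robot() really waits for the ranking
        return new Promise((resolve, reject) => {
          lexrank(content.sourceContentSanitized, (err, result) => {
            if (err) {
              return reject(err)
            }

            // result[0] is used here as the list of scored sentences; sort by descending average weight
            const sentences = result[0].sort(function(a,b){return b.weight.average - a.weight.average})

            sentences.forEach((sentence) => {
              content.sentences.push({
                text: sentence.text,
                keywords: [],
                images: []
              })
            })
            resolve()
          })
        })
      }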


    
}

module.exports = robot

The error in @genesluna's code was that it called the function as summary.lexrank.
I changed it to use just lexrank.
That way there was no error.
