[EP.04] Loading a PDF File, Recognizing Terms, and Showing Their Definitions

Next time will probably be the last episode. I have now implemented every feature and submitted the extension to the Chrome Web Store!! Congratulations to me in advance~~

In this episode, we will load a PDF file, recognize the words in it, and list every word that matches one of the terms we defined in the previous episode.

Building the Document Analysis Page

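This is the page we are building today. It needs three things:

1. A button for uploading a file

2. A way to confirm that the file was uploaded correctly

3. A way to recognize the words in the file

Those are the three big pieces. A few smaller bits are needed along the way, but they are not important right now. Get the big picture first, and the details become easier to handle. Let's get started.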

Implementing File Upload

First, the HTML.

<div id="highlightArea" class="feature-area active">
  <div class="highlight-tools">
    <button id="cursorButton" class="tool-button" title="텍스트 선택">
      <label for="fileInput">
        <span class="material-icons">note_add</span>
        <span class="tool-label">파일 업로드</span>
      </label>
      <input type="file" id="fileInput" accept=".txt,.doc,.docx,.pdf" style="display: none;">
    </button>
  </div>
  
  <div class="uploaded-file-container">
    <div class="file-info">
      <!-- The file name is displayed here -->
    </div>
  </div>
  
  <div class="selected-words-container">
    <div class="selected-words-header">
      <span class="material-icons">format_list_bulleted</span>
      <h3>선택된 텍스트</h3>
    </div>
    <div class="selected-words">
      <!-- The extracted words are displayed here -->
    </div>
  </div>
</div>

Next, the CSS.

.feature-area {
  height: 100%;
  position: relative;
  display: none;
}

.feature-area.active {
  display: block;
}

.selected-word {
  background: white;
  border-radius: 8px;
  padding: 15px;
  margin-bottom: 10px;
  box-shadow: 0 1px 3px rgba(0,0,0,0.1);
}

.term-header {
  font-weight: 500;
  color: #1a73e8;
  margin-bottom: 5px;
}

.term-description {
  font-size: 14px;
  color: #5f6368;
  line-height: 1.5;
}

.search-header {
  padding: 16px;
  border-bottom: 1px solid #e0e0e0;
}

.search-container {
  display: flex;
  align-items: center;
  background: white;
  border-radius: 24px;
  padding: 8px 16px;
  box-shadow: 0 1px 2px rgba(0,0,0,0.1);
}

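Now that the file upload UI is in place, let's build the functionality. There is not much to explain about the UI, so I will skip it. If you are curious about anything, leave a comment.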

Creating fileManager.js (inside the js folder)

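This file will hold the file upload logic. Since we are uploading PDFs, we need a library that can read them (PDF.js, exposed in the code as pdfjsLib).

After some trial and error, I found that it did not work when loaded as a library, so I bundled the files myself. They are on GitHub.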

Attachment: 압축 파일.zip (0.46 MB)

Place these files in the lib folder at the top level of the project. Now let's write the code in fileManager.js.

class FileManager {
  constructor() {
    this.fileInput = document.getElementById('fileInput');
    this.fileInfo = document.querySelector('.file-info');
    this.selectedWords = document.querySelector('.selected-words');
    this.currentPage = 1;
    this.itemsPerPage = 10;
    this.words = [];
    this.terms = []; // data loaded from terms.json
    this.searchInput = null;
    this.filteredWords = [];
    
    // Configure the PDF.js worker
    pdfjsLib.GlobalWorkerOptions.workerSrc = '../lib/pdf.worker.mjs';
    
    this.loadTerms();
    this.initializeFileUpload();
    this.initializeSearchBar();
  }

  // Initialize file upload handling
  initializeFileUpload() {
    this.fileInput.addEventListener('change', async (event) => {
      const file = event.target.files[0];
      if (file) {
        this.displayFileName(file);
        await this.readFileContent(file);
      }
    });
  }

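  // Show the uploaded file's name with a matching icon
  displayFileName(file) {
    const iconMap = {
      'txt': 'description',
      'doc': 'description',
      'docx': 'description',
      'pdf': 'picture_as_pdf'
    };
    
    const extension = file.name.split('.').pop().toLowerCase();
    const icon = iconMap[extension] || 'insert_drive_file';
    
    // Build the nodes with textContent so a file name containing HTML is not parsed as markup
    const iconSpan = document.createElement('span');
    iconSpan.className = 'material-icons';
    iconSpan.textContent = icon;
    const nameSpan = document.createElement('span');
    nameSpan.textContent = file.name;
    this.fileInfo.replaceChildren(iconSpan, nameSpan);
  }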

  // Read the file contents and display the result (or an error message)
  async readFileContent(file) {
    try {
      const content = await this.getFileContent(file);
      this.displayContent(content);
    } catch (error) {
      console.error('파일 읽기 오류:', error);
      this.selectedWords.innerHTML = `
        <div class="error-message">
          <span class="material-icons">error</span>
          파일을 읽는 중 오류가 발생했습니다.
        </div>
      `;
    }
  }
}

With this code in place, the file uploads correctly and its name appears in the file-info area.
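The class above calls this.getFileContent(file), which is not shown in this post. Below is a minimal sketch of what such a dispatcher might look like. The helper name getReaderKind, the choice to read .txt through Blob.text(), and the rejection of .doc/.docx are assumptions made for illustration; the version in the repository may differ. The readPdf parameter stands in for the readPdfContent method added in the next section.

```javascript
// Hypothetical helper: classify a file name by its extension.
function getReaderKind(fileName) {
  const dot = fileName.lastIndexOf('.');
  const extension = dot === -1 ? '' : fileName.slice(dot + 1).toLowerCase();
  if (extension === 'pdf') return 'pdf';
  if (extension === 'txt') return 'text';
  return 'unsupported'; // assumption: .doc/.docx would need a dedicated parser
}

// Sketch of a getFileContent-style dispatcher.
// `readPdf` is any async function that turns a File into text.
async function getFileContent(file, readPdf) {
  switch (getReaderKind(file.name)) {
    case 'pdf':
      return readPdf(file);
    case 'text':
      return file.text(); // Blob.text() resolves with the file's contents as a string
    default:
      throw new Error(`Unsupported file type: ${file.name}`);
  }
}

console.log(getReaderKind('report.PDF')); // pdf
console.log(getReaderKind('memo.docx'));  // unsupported
```

Because the default branch throws, an unsupported file ends up in the catch block of readFileContent and the user sees the error message instead of an empty list.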

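Extracting PDF Text and Matching Terms

Now that the PDF uploads correctly, let's extract the terms from the file and list each word with its definition.

The hardest part was filtering. English words are easy, but Korean words turned out to be much more complicated. I applied my own filtering approach, but if you know a better way, please leave a comment any time. Let's add the functionality to fileManager.js.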

async readPdfContent(file) {
  try {
    // Read the PDF file into an ArrayBuffer
    const arrayBuffer = await file.arrayBuffer();

    // Load the PDF document
    const pdf = await pdfjsLib.getDocument({ data: arrayBuffer }).promise;
    let fullText = '';

    // Extract the text of each page (PDF.js page numbers start at 1)
    for (let i = 1; i <= pdf.numPages; i++) {
      const page = await pdf.getPage(i);
      const textContent = await page.getTextContent();
      
      let lastItem = null;
      let currentWord = '';
      
      // Merge adjacent text items. A new chunk starts when the y position
      // (transform[5]) differs from the previous item, or when the x position
      // (transform[4]) is more than 20 units to the right of the previous item.
      for (const item of textContent.items) {
        if (lastItem && 
            (item.transform[5] !== lastItem.transform[5] || 
             item.transform[4] - lastItem.transform[4] > 20)) {
          fullText += ' ' + currentWord;
          currentWord = '';
        }
        currentWord += item.str;
        lastItem = item;
      }
      fullText += ' ' + currentWord;
    }

    return fullText;
  } catch (error) {
    // Keep the original error as the cause so it is not lost when logged
    throw new Error('PDF 파일 읽기 실패', { cause: error });
  }
}

// Process the extracted text and match it against the terms
displayContent(content) {
  // Split the text into words on whitespace and punctuation
  const extractedWords = content.split(/[\s,.()[\]{}'"<>\/\\|;:!?~`@#$%^&*+=\n]+/)
    .filter(word => {
      // Drop empty strings and tokens made up only of standalone Hangul jamo
      const hasOnlyJamo = /^[ㄱ-ㅎㅏ-ㅣ]+$/.test(word);
      return word.length > 0 && !hasOnlyJamo;
    });

  // Keep only the words found in terms.json and remove duplicates
  this.words = [...new Set(
    extractedWords.filter(word => this.isTermMatch(word))
  )];
  
  this.filteredWords = this.words;
  this.currentPage = 1; // a newly loaded file always starts on the first page
  this.renderPage();
  this.renderPagination();
}
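To see what these two steps do without opening a PDF, the merging loop and the word splitting can be tried on plain data. The sketch below mirrors the logic of readPdfContent and displayContent as standalone functions. The function names mergeTextItems and tokenize and the sample items are made up for illustration; only the str and transform fields of a PDF.js text item are used.

```javascript
// Merge PDF.js-style text items into chunks, mirroring the loop in readPdfContent:
// a new chunk starts when the y position (transform[5]) changes or when the
// x position (transform[4]) moves right by more than maxGap units.
function mergeTextItems(items, maxGap = 20) {
  const chunks = [];
  let lastItem = null;
  let current = '';
  for (const item of items) {
    const newLine = lastItem && item.transform[5] !== lastItem.transform[5];
    const bigGap = lastItem && item.transform[4] - lastItem.transform[4] > maxGap;
    if (newLine || bigGap) {
      chunks.push(current);
      current = '';
    }
    current += item.str;
    lastItem = item;
  }
  chunks.push(current);
  return chunks.join(' ');
}

// Split text into candidate words, dropping empty and jamo-only tokens.
function tokenize(text) {
  return text
    .split(/[\s,.()[\]{}'"<>\/\\|;:!?~`@#$%^&*+=]+/)
    .filter(word => word.length > 0 && !/^[ㄱ-ㅎㅏ-ㅣ]+$/.test(word));
}

// Made-up sample items: two fragments of one Korean word, an English term
// further right on the same line, then a second line.
const sampleItems = [
  { str: '비즈', transform: [1, 0, 0, 1, 10, 700] },
  { str: '니스', transform: [1, 0, 0, 1, 22, 700] },
  { str: 'KPI,', transform: [1, 0, 0, 1, 80, 700] },
  { str: 'ㅋㅋ', transform: [1, 0, 0, 1, 10, 680] },
  { str: ' ROI', transform: [1, 0, 0, 1, 25, 680] },
];

const merged = mergeTextItems(sampleItems);
console.log(merged);           // 비즈니스 KPI, ㅋㅋ ROI
console.log(tokenize(merged)); // [ '비즈니스', 'KPI', 'ROI' ]
```

The two Korean fragments are joined because they sit on the same line only 12 units apart, which is the case the heuristic was written for. Note that the 20-unit threshold compares starting x positions only and does not take the width of an item into account, so a long item followed closely by another one will still be split.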

With the terms matched, we also need a way to search through them, right?

// Check whether a word exists in terms.json
isTermMatch(word) {
  return this.terms.some(term => {
    const termWords = term.term.split(/[\/()]/); // aliases separated by slashes or parentheses
    return termWords.some(termWord => {
      // Korean words: exact comparison
      if (/[가-힣]/.test(word)) {
        return termWord.trim() === word;
      }
      // English words: case-insensitive comparison
      return termWord.trim().toLowerCase() === word.toLowerCase();
    });
  });
}

// Initialize the search bar
initializeSearchBar() {
  const searchHeader = document.createElement('div');
  searchHeader.className = 'search-header';
  searchHeader.innerHTML = `
    <div class="search-container">
      <span class="material-icons">search</span>
      <input type="text" id="wordSearchInput" placeholder="용어 검색...">
    </div>
  `;
  
  // Swap the static "selected text" header for the search bar
  const oldHeader = document.querySelector('.selected-words-header');
  if (oldHeader) {
    oldHeader.parentNode.replaceChild(searchHeader, oldHeader);
  }

  // The input only exists if the header above was found and replaced
  this.searchInput = document.getElementById('wordSearchInput');
  if (this.searchInput) {
    this.searchInput.addEventListener('input', () => this.handleSearch());
  }
}
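initializeSearchBar wires the input to this.handleSearch(), which is not shown in this post. One plausible shape for it, assuming a case-insensitive substring filter over this.words, is sketched below. The helper filterWords and the body of handleSearch are assumptions for illustration, not necessarily the repository's implementation.

```javascript
// Hypothetical pure helper: keep the words that contain the query,
// ignoring case and surrounding whitespace. An empty query keeps everything.
function filterWords(words, query) {
  const needle = query.trim().toLowerCase();
  if (needle === '') return [...words];
  return words.filter(word => word.toLowerCase().includes(needle));
}

// How a handleSearch method might use it inside FileManager (assumed shape):
//
//   handleSearch() {
//     this.filteredWords = filterWords(this.words, this.searchInput.value);
//     this.currentPage = 1; // the result set changed, so go back to page 1
//     this.renderPage();
//     this.renderPagination();
//   }

const words = ['KPI', 'ROI', '매출', '매출원가'];
console.log(filterWords(words, 'k'));    // [ 'KPI' ]
console.log(filterWords(words, '매출')); // [ '매출', '매출원가' ]
console.log(filterWords(words, '  '));   // [ 'KPI', 'ROI', '매출', '매출원가' ]
```

Keeping the filter as a pure function means this.words stays untouched, so clearing the search box restores the full list without re-reading the file.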

Now that the terms are found, let's write the code that displays the results.

renderPage() {
  const startIndex = (this.currentPage - 1) * this.itemsPerPage;
  const endIndex = startIndex + this.itemsPerPage;
  const pageWords = this.filteredWords.slice(startIndex, endIndex);
  
  this.selectedWords.innerHTML = pageWords.map(word => {
    const termData = this.terms.find(term => {
      const termWords = term.term.split(/[\/()]/);
      return termWords.some(termWord => 
        termWord.trim().toLowerCase() === word.toLowerCase()
      );
    });

    return `
      <div class="selected-word">
        <div class="term-header">${word}</div>
        <div class="term-description">${termData ? termData.description : '설명이 없습니다.'}</div>
      </div>
    `;
  }).join('');
}

renderPagination() {
  const totalPages = Math.ceil(this.filteredWords.length / this.itemsPerPage);
  
  if (this.currentPage > totalPages) {
    this.currentPage = totalPages;
  }
  if (this.currentPage < 1) {
    this.currentPage = 1;
  }
  
  let paginationContainer = document.querySelector('.file-pagination');
  if (!paginationContainer) {
    paginationContainer = document.createElement('div');
    paginationContainer.className = 'file-pagination';
    this.selectedWords.parentNode.appendChild(paginationContainer);
  }
  
  paginationContainer.innerHTML = `
    <button id="prevPage" ${this.currentPage <= 1 ? 'disabled' : ''}>이전</button>
    <span>${this.currentPage} / ${totalPages || 1}</span>
    <button id="nextPage" ${this.currentPage >= totalPages ? 'disabled' : ''}>다음</button>
  `;

  // Attach the pagination event listeners
  this.setupPaginationListeners(paginationContainer);
}
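renderPagination ends by calling this.setupPaginationListeners(paginationContainer), which is also not shown here. The sketch below isolates the page arithmetic in a pure function and shows, as a comment, how the click handlers might call it. The name changePage and the handler body are assumptions for illustration.

```javascript
// Hypothetical pure helper: move by `delta` pages and clamp to [1, totalPages].
// With zero items there are zero pages, but the UI still shows page 1.
function changePage(currentPage, totalItems, itemsPerPage, delta) {
  const totalPages = Math.max(1, Math.ceil(totalItems / itemsPerPage));
  return Math.min(totalPages, Math.max(1, currentPage + delta));
}

// How setupPaginationListeners might use it (assumed shape):
//
//   setupPaginationListeners(container) {
//     const go = (delta) => {
//       this.currentPage = changePage(
//         this.currentPage, this.filteredWords.length, this.itemsPerPage, delta);
//       this.renderPage();
//       this.renderPagination();
//     };
//     container.querySelector('#prevPage').addEventListener('click', () => go(-1));
//     container.querySelector('#nextPage').addEventListener('click', () => go(1));
//   }

console.log(changePage(1, 25, 10, 1));  // 2
console.log(changePage(3, 25, 10, 1));  // 3 (already on the last page)
console.log(changePage(1, 25, 10, -1)); // 1 (already on the first page)
console.log(changePage(1, 0, 10, 1));   // 1 (no results)
```

Since renderPagination rewrites the container's innerHTML on every call, the buttons are new elements each time, so attaching listeners after each render does not stack duplicate handlers.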

All three of the code blocks above go in fileManager.js. It is a lot of code for a single feature, but since it is split into separate methods, I don't think it will be too hard to follow.
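One optional refinement: isTermMatch and renderPage both re-split every term for every word they check. If terms.json grows, a lookup table built once could replace both scans. This is only a sketch of that idea, not what the extension currently does. The function name buildTermIndex and the sample entries are made up, and unlike isTermMatch it compares words that mix Hangul and Latin letters case-insensitively.

```javascript
// Build a lookup from every alias of a term to its entry.
// Aliases come from splitting `term` on "/", "(" and ")", as isTermMatch does.
// Lower-casing leaves Hangul unchanged, so one key format covers both scripts.
function buildTermIndex(terms) {
  const index = new Map();
  for (const entry of terms) {
    for (const alias of entry.term.split(/[\/()]/)) {
      const key = alias.trim().toLowerCase();
      // Skip empty pieces and keep the first entry when aliases collide
      if (key !== '' && !index.has(key)) index.set(key, entry);
    }
  }
  return index;
}

// Made-up entries in the { term, description } shape that renderPage reads.
const terms = [
  { term: 'KPI(핵심성과지표)', description: 'Key performance indicator' },
  { term: 'B2B/B2C', description: 'Business-to-business / business-to-consumer' },
];
const index = buildTermIndex(terms);

console.log(index.has('kpi'));                    // true
console.log(index.get('핵심성과지표').description); // Key performance indicator
console.log(index.get('B2C'.toLowerCase()).term); // B2B/B2C
console.log(index.has('okr'));                    // false
```

With such an index, a match check becomes index.has(word.toLowerCase()) and the description lookup becomes index.get(word.toLowerCase()), each a single Map access.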

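Today we wrote the code for document analysis. Next time I will cover the last feature, the Korean dictionary, and wrap up the series.

My code is available at https://github.com/YangMun/BizCode. I would appreciate your interest.

You can find the extension by searching for "비즈코드" in the Chrome Web Store.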
