RSS ingestor #1276
Draft
eilmiv wants to merge 26 commits into ElixirTeSS:master from pan-training:rss_ingestor
5af12ee Refactor Dublin Core ingestion from OAI-PMH ingestor (eilmiv)
e36dbc2 Add RSS ingestion for materials and events (eilmiv)
be76ff7 Add tests for RSS ingestors (eilmiv)
8c2880b Add ingestors to factory (eilmiv)
54895a2 Add support for common extensions (eilmiv)
3cea73b Fix Zeitwerk inflection problem with RSS (eilmiv)
2c7c05e Add support for relative urls (eilmiv)
a515e46 Fixes from testing many RSS feeds (eilmiv)
cd91db6 Remove start and end date for events based on date published in rss (eilmiv)
b8f19c6 Add feed url discovery from youtube url (eilmiv)
b2780cf Fix error class that was too specific (eilmiv)
0f042e7 Fix link handling in atom feeds (eilmiv)
89e5f53 Use relative import for loading the custom rss media extention (eilmiv)
662c450 Add comment for dublin core to text conversion options (eilmiv)
0d9556d Improve error message when there is an unsupported feed type. (eilmiv)
031d9fa Reuse code from youtube renderer for youtube link detection in RSS in… (eilmiv)
b67ad80 Small refactors in rss ingestion (eilmiv)
02bb1d5 More specific errors in rss ingestion (eilmiv)
0bd85c7 Refactor Yahoo Media RSS namespace patch (eilmiv)
b7196b9 Separate youtube ingestor (eilmiv)
6fe1eff Remove event rss ingestor (eilmiv)
f8d181a Test youtube ingestor (eilmiv)
4a23622 Address github copilot comments in rss ingestor implementation (eilmiv)
ac7a31a Revert "Remove event rss ingestor" (eilmiv)
26008d7 Refactor rss ingestion (eilmiv)
8083387 Parsing event dates from event description (eilmiv)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,61 @@ | ||
| require 'rss' | ||
| require 'rss/atom' | ||
|
|
||
| # Extension for the Yahoo Media RSS namespace (xmlns:media="http://search.yahoo.com/mrss/"). | ||
| # Used by feeds that carry rich media metadata, e.g. YouTube channel feeds which include | ||
| # <media:group>, <media:title>, and <media:description> elements. | ||
|
|
||
| module RSS | ||
| module Media | ||
| MEDIA_PREFIX = 'media' | ||
| MEDIA_URI = 'http://search.yahoo.com/mrss/' | ||
|
|
||
| module MediaGroupDescriptionModel | ||
| extend BaseModel | ||
|
|
||
| def self.append_features(klass) | ||
| super | ||
| return if klass.instance_of?(Module) | ||
|
|
||
| klass.install_must_call_validator(MEDIA_PREFIX, MEDIA_URI) | ||
| klass.install_have_child_element('group', MEDIA_URI, '?', 'media_group') | ||
| end | ||
| end | ||
|
|
||
| BaseListener.install_class_name(MEDIA_URI, 'group', 'MediaGroup') | ||
| BaseListener.install_get_text_element(MEDIA_URI, 'title', 'media_title') | ||
| BaseListener.install_get_text_element(MEDIA_URI, 'description', 'media_description') | ||
| end | ||
|
|
||
| module Atom | ||
| Feed.install_ns(Media::MEDIA_PREFIX, Media::MEDIA_URI) | ||
|
|
||
| class Feed | ||
| include Media::MediaGroupDescriptionModel | ||
|
|
||
| class Entry | ||
| include Media::MediaGroupDescriptionModel | ||
|
|
||
| class MediaGroup < Element | ||
| include RSS09 | ||
|
|
||
| @tag_name = 'group' | ||
|
|
||
| class << self | ||
| def required_prefix | ||
| Media::MEDIA_PREFIX | ||
| end | ||
|
|
||
| def required_uri | ||
| Media::MEDIA_URI | ||
| end | ||
| end | ||
|
|
||
| install_must_call_validator(Media::MEDIA_PREFIX, Media::MEDIA_URI) | ||
| install_text_element('title', Media::MEDIA_URI, '?', 'media_title') | ||
| install_text_element('description', Media::MEDIA_URI, '?', 'media_description') | ||
| end | ||
| end | ||
| end | ||
| end | ||
| end |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,81 @@ | ||
| module Ingestors | ||
| module DublinCoreIngestion | ||
| def build_material_from_dublin_core_data(dc) | ||
| material = OpenStruct.new | ||
|
|
||
| material.title = dc[:title] | ||
| material.description = convert_description(dc[:description]) | ||
| material.authors = normalize_dublin_core_values(dc[:creators]) | ||
| material.contributors = normalize_dublin_core_values(dc[:contributors]) | ||
|
|
||
| rights = normalize_dublin_core_values(dc[:rights]) | ||
| material.licence = rights.find { |r| r.start_with?('http://', 'https://') } || rights.first || 'notspecified' | ||
|
|
||
| parsed_dates = parse_dublin_core_dates(dc[:dates]) | ||
| material.date_created = parsed_dates.first | ||
| material.date_modified = parsed_dates.last if parsed_dates.size > 1 | ||
|
|
||
| identifiers = normalize_dublin_core_values(dc[:identifiers]) | ||
| material.doi = extract_dublin_core_doi(identifiers) | ||
| material.url = identifiers.find { |id| id.start_with?('http://', 'https://') } | ||
|
|
||
| material.keywords = normalize_dublin_core_values(dc[:subjects]) | ||
| material.resource_type = normalize_dublin_core_values(dc[:types]) | ||
| material.contact = dublin_core_text(dc[:publisher]) | ||
|
|
||
| material | ||
| end | ||
|
|
||
| def build_event_from_dublin_core_data(dc) | ||
| event = OpenStruct.new | ||
|
|
||
| event.title = dc[:title] | ||
| event.description = convert_description(dc[:description]) | ||
| event.organizer = normalize_dublin_core_values(dc[:creators]).first | ||
| event.contact = dublin_core_text(dc[:publisher]) || event.organizer | ||
| event.keywords = normalize_dublin_core_values(dc[:subjects]) | ||
| event.event_types = normalize_dublin_core_values(dc[:types]) | ||
|
|
||
| dates = parse_dublin_core_dates(dc[:dates]) | ||
| event.start = dates.first | ||
| event.end = dates.last || dates.first | ||
|
|
||
| identifiers = normalize_dublin_core_values(dc[:identifiers]) | ||
| event.url = identifiers.find { |id| id.start_with?('http://', 'https://') } | ||
|
|
||
| event | ||
| end | ||
|
|
||
| def parse_dublin_core_dates(dates) | ||
| normalize_dublin_core_values(dates).map do |date_value| | ||
| Date.parse(date_value) | ||
| rescue Date::Error, ArgumentError | ||
| nil | ||
| end.compact | ||
| end | ||
|
|
||
| def extract_dublin_core_doi(identifiers) | ||
| doi = normalize_dublin_core_values(identifiers).find do |id| | ||
| id.start_with?('10.', 'https://doi.org/', 'http://doi.org/') | ||
| end | ||
| return nil unless doi | ||
|
|
||
| normalized = doi.sub(%r{\Ahttps?://doi\.org/}, '') | ||
| "https://doi.org/#{normalized}" | ||
| end | ||
|
|
||
| def normalize_dublin_core_values(values) | ||
| Array(values).map { |v| dublin_core_text(v).to_s.strip } | ||
| .reject(&:blank?).uniq | ||
| end | ||
|
|
||
| # this method is also used by RSS ingestion under an alias | ||
| def dublin_core_text(value) | ||
| return nil if value.nil? | ||
| return value.content if value.respond_to?(:content) # rss gem xml nodes | ||
| return value.text if value.respond_to?(:text) && !value.is_a?(String) # Nokogiri xml nodes | ||
|
|
||
| value.to_s | ||
| end | ||
| end | ||
| end |
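
The DOI and date handling above can be tried outside Rails with a minimal sketch. The module name `DublinCoreSketch` and its method names are hypothetical stand-ins, and `empty?` replaces the ActiveSupport `blank?` call used in the diff, so this is an approximation of the behaviour rather than the PR's implementation.

```ruby
require 'date'

# Hypothetical standalone helpers that mirror the module's logic in plain Ruby.
module DublinCoreSketch
  module_function

  # Strip whitespace, drop empty values, and de-duplicate.
  # (`empty?` is a plain-Ruby stand-in for ActiveSupport's `blank?`.)
  def normalize_values(values)
    Array(values).map { |v| v.to_s.strip }.reject(&:empty?).uniq
  end

  # Return a canonical https://doi.org/ URL for the first DOI-like identifier,
  # or nil when no identifier looks like a DOI.
  def extract_doi(identifiers)
    doi = normalize_values(identifiers).find do |id|
      id.start_with?('10.', 'https://doi.org/', 'http://doi.org/')
    end
    return nil unless doi

    "https://doi.org/#{doi.sub(%r{\Ahttps?://doi\.org/}, '')}"
  end

  # Parse each value and skip anything Date.parse rejects.
  def parse_dates(values)
    normalize_values(values).filter_map do |value|
      Date.parse(value)
    rescue ArgumentError # Date::Error is a subclass of ArgumentError
      nil
    end
  end
end

puts DublinCoreSketch.extract_doi(['http://doi.org/10.1234/abc'])
puts DublinCoreSketch.parse_dates(['2024-01-15', '']).inspect
```

Both a bare `10.` identifier and a `doi.org` URL end up in the same `https://doi.org/...` form, which is what lets the material's `doi` field be compared across sources.
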
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,93 @@ | ||
| require 'tess_rdf_extractors' | ||
|
|
||
| module Ingestors | ||
| class EventRSSIngestor < Ingestor | ||
| include RSSIngestion | ||
|
|
||
| def initialize | ||
| super | ||
|
|
||
| @bioschemas_manager = BioschemasIngestor.new | ||
| end | ||
|
|
||
| def self.config | ||
| { | ||
| key: 'event_rss', | ||
| title: 'RSS / Atom Feed', | ||
| category: :events | ||
| } | ||
| end | ||
|
|
||
| def read(url) | ||
| read_from_rss_feed(url) | ||
| end | ||
|
|
||
| private | ||
|
|
||
| def ingest_record(record) | ||
| add_event(record) | ||
| end | ||
|
|
||
| def build_record_from_rss_item(item, feed_url) | ||
| event = build_event_from_dublin_core_data(extract_dublin_core(item)) | ||
|
|
||
| event.title ||= text_value(item.title) | ||
| item_link = text_value(item.link) | ||
| event.url = Addressable::URI.join(feed_url, item_link).to_s if item_link.present? | ||
| event.description ||= convert_description(text_value(item.description) || text_value(item.content_encoded)) | ||
| event.keywords = merge_unique(event.keywords, extract_rss_keywords(item)) | ||
| organizer = text_value(item.respond_to?(:author) ? item.author : nil) | ||
| event.organizer ||= organizer | ||
| event.contact ||= organizer | ||
| event.start = parse_date_from_description(event.description) | ||
| event.end = nil | ||
|
|
||
| event | ||
| end | ||
|
|
||
| def build_record_from_atom_item(item, feed_url) | ||
| event = build_event_from_dublin_core_data(extract_dublin_core(item)) | ||
|
|
||
| event.title ||= text_value(item.title) | ||
| atom_link = text_value(extract_atom_link(item)) | ||
| event.url = Addressable::URI.join(feed_url, atom_link).to_s if atom_link.present? | ||
| event.description ||= convert_description(text_value(item.summary) || text_value(item.content)) | ||
| event.keywords = merge_unique(event.keywords, extract_atom_keywords(item)) | ||
| organizer = extract_atom_authors(item).first | ||
| event.organizer ||= organizer | ||
| event.contact ||= organizer | ||
| event.start = parse_date_from_description(event.description) | ||
| event.end = nil | ||
|
|
||
| event | ||
| end | ||
|
|
||
| def extract_rdf_bioschemas_records(content) | ||
| return [] unless content.present? | ||
|
|
||
| events = Tess::Rdf::EventExtractor.new(content, :rdfxml).extract do |params| | ||
| @bioschemas_manager.convert_params(params) | ||
| end | ||
| courses = Tess::Rdf::CourseExtractor.new(content, :rdfxml).extract do |params| | ||
| @bioschemas_manager.convert_params(params) | ||
| end | ||
| course_instances = Tess::Rdf::CourseInstanceExtractor.new(content, :rdfxml).extract do |params| | ||
| @bioschemas_manager.convert_params(params) | ||
| end | ||
|
|
||
| @bioschemas_manager.deduplicate(events + courses + course_instances) | ||
| rescue StandardError => e | ||
| Rails.logger.error("#{e.class}: #{e.message}") | ||
| Rails.logger.error(e.backtrace.join("\n")) if e.backtrace&.any? | ||
| @messages << 'An error occurred while extracting Bioschemas Events.' | ||
| [] | ||
| end | ||
|
|
||
| def parse_date_from_description(event_description) | ||
| # Heuristic: parse a date from the first 81 characters of the event description. | ||
| # This simple approach worked for several of the event RSS feeds that were tested. | ||
| # For a date range it yields the start date (or the end date if the start is only a day number). | ||
| Date.parse(event_description.to_s[0..80]) | ||
| rescue Date::Error, ArgumentError | ||
| nil | ||
| end | ||
| end | ||
| end |
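
The description-based date heuristic can be illustrated on its own. This is a minimal sketch: the method name `date_from_description` and the `window` keyword are hypothetical, and only the standard `date` library is assumed.

```ruby
require 'date'

# Hypothetical standalone version of the heuristic: only the first `window`
# characters are handed to Date.parse, so a date that appears further down
# in the description is ignored.
def date_from_description(description, window: 81)
  Date.parse(description.to_s[0, window])
rescue ArgumentError # Date::Error is a subclass of ArgumentError
  nil
end

puts date_from_description('2025-03-12: Introduction to RSS').inspect
puts date_from_description(nil).inspect
```

Limiting the window keeps the heuristic from picking up unrelated dates (a registration deadline, for instance) that are mentioned later in a long description, at the cost of missing feeds that do not lead with the event date.
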
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,94 @@ | ||
| require 'tess_rdf_extractors' | ||
|
|
||
| module Ingestors | ||
| class MaterialRSSIngestor < Ingestor | ||
| include RSSIngestion | ||
|
|
||
| def initialize | ||
| super | ||
|
|
||
| @bioschemas_manager = BioschemasIngestor.new | ||
| end | ||
|
|
||
| def self.config | ||
| { | ||
| key: 'material_rss', | ||
| title: 'RSS / Atom Feed', | ||
| category: :materials | ||
| } | ||
| end | ||
|
|
||
| def read(url) | ||
| read_from_rss_feed(url) | ||
| end | ||
|
|
||
| private | ||
|
|
||
| def ingest_record(record) | ||
| add_material(record) | ||
| end | ||
|
|
||
| def build_record_from_rss_item(item, feed_url) | ||
| material = build_material_from_dublin_core_data(extract_dublin_core(item)) | ||
|
|
||
| material.title ||= text_value(item.title) | ||
| item_link = text_value(item.link) | ||
| material.url = Addressable::URI.join(feed_url, item_link).to_s if item_link.present? | ||
| itunes_summary = text_value(item.itunes_summary) if item.respond_to?(:itunes_summary) | ||
| material.description ||= convert_description(text_value(item.description) || text_value(item.content_encoded) || itunes_summary) | ||
| material.keywords = merge_unique(material.keywords, extract_rss_keywords(item)) | ||
| author = item.author if item.respond_to?(:author) | ||
| itunes_author = item.itunes_author if item.respond_to?(:itunes_author) | ||
| material.authors = merge_unique(material.authors, [text_value(author), text_value(itunes_author)].compact) | ||
| material.contact ||= material.authors&.first | ||
| guid = item.guid if item.respond_to?(:guid) | ||
| material.doi ||= extract_dublin_core_doi([text_value(guid)]) | ||
|
|
||
| item_date = parse_time(item.pubDate) if item.respond_to?(:pubDate) | ||
| item_date ||= parse_time(item.date) if item.respond_to?(:date) | ||
| material.date_published ||= item_date | ||
| material.date_created = prefer_precise_time(material.date_created, item_date) | ||
| material.date_modified = prefer_precise_time(material.date_modified, parse_time(item.date)) if item.respond_to?(:date) | ||
|
|
||
| material | ||
| end | ||
|
|
||
| def build_record_from_atom_item(item, feed_url) | ||
| material = build_material_from_dublin_core_data(extract_dublin_core(item)) | ||
|
|
||
| media_title = text_value(item.media_group&.media_title) | ||
| material.title ||= text_value(item.title) || media_title | ||
| atom_link = text_value(extract_atom_link(item)) | ||
| material.url = Addressable::URI.join(feed_url, atom_link).to_s if atom_link.present? | ||
| media_group_description = text_value(item.media_group&.media_description) | ||
| material.description ||= convert_description(text_value(item.summary) || text_value(item.content) || media_group_description) | ||
| material.keywords = merge_unique(material.keywords, extract_atom_keywords(item)) | ||
| material.authors = merge_unique(material.authors, extract_atom_authors(item)) | ||
| material.contact ||= material.authors&.first | ||
| material.doi ||= extract_dublin_core_doi([text_value(item.id)]) | ||
|
|
||
| published = parse_time(item.published) | ||
| updated = parse_time(item.updated) | ||
| material.date_created = prefer_precise_time(material.date_created, published) | ||
| material.date_published ||= published || updated | ||
| material.date_modified = prefer_precise_time(material.date_modified, updated) | ||
|
|
||
| material | ||
| end | ||
|
|
||
| def extract_rdf_bioschemas_records(content) | ||
| return [] unless content.present? | ||
|
|
||
| materials = Tess::Rdf::LearningResourceExtractor.new(content, :rdfxml).extract do |params| | ||
| @bioschemas_manager.convert_params(params) | ||
| end | ||
|
|
||
| @bioschemas_manager.deduplicate(materials) | ||
| rescue StandardError => e | ||
| Rails.logger.error("#{e.class}: #{e.message}") | ||
| Rails.logger.error(e.backtrace.join("\n")) if e.backtrace&.any? | ||
| @messages << 'An error occurred while extracting Bioschemas LearningResources.' | ||
| [] | ||
| end | ||
| end | ||
| end | ||
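
Resolving item links against the feed URL is what makes relative links in a feed usable. The sketch below uses the standard library's `URI.join` as a stand-in for the `Addressable::URI.join` call in the diff; the helper name `resolve_item_url` and the example URLs are hypothetical.

```ruby
require 'uri'

# Hypothetical helper: resolve an item link against the feed URL.
# Returns nil when the item has no link, mirroring the `present?` guard
# in the material ingestor. Uses stdlib URI.join instead of Addressable.
def resolve_item_url(feed_url, item_link)
  link = item_link.to_s.strip
  return nil if link.empty?

  URI.join(feed_url, link).to_s
end

feed = 'https://example.org/feeds/rss.xml'
puts resolve_item_url(feed, '/posts/1')                  # root-relative link
puts resolve_item_url(feed, 'posts/1')                   # relative to the feed's directory
puts resolve_item_url(feed, 'https://other.example/p/2') # absolute link is kept
```

Returning nil for a missing link leaves any URL already taken from Dublin Core identifiers untouched, which is the reason for assigning the URL conditionally.
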